
Another exception occurred during handling sdcm.exceptions.KillNemesis in _refuse_connection_from_banned_node #11816

@andrzej-jackowski-scylladb

Description

Packages

Scylla version: 2025.4.0~dev-20250815.f689d417473c with build-id 6452d1d4d848e57c3cb48457baa0b3a614c1cffc

Kernel Version: 6.14.0-1011-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.


The problem looks like an internal SCT nemesis implementation issue.
stop_nemesis in sdcm/cluster.py stops a running nemesis by raising sdcm.exceptions.KillNemesis.

However, this exception-based nemesis stopping doesn't work well with the ExitStack() mechanism, which runs registered callbacks when a with-block is exited.

In the case of _refuse_connection_from_banned_node, which implements disrupt_refuse_connection_with_block_scylla_ports_on_banned_node and disrupt_refuse_connection_with_send_sigstop_signal_to_scylla_on_banned_node, the problem is that ExitStack() registers a callback to _remove_node_add_node that may itself raise while sdcm.exceptions.KillNemesis is being handled.
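The interaction can be reproduced outside SCT. Below is a minimal, self-contained sketch (with stand-in exception classes; the real ones live in sdcm) showing how a failing ExitStack callback replaces the in-flight KillNemesis, which then survives only as the implicit exception context:

```python
import contextlib

class KillNemesis(Exception):
    """Stand-in for sdcm.exceptions.KillNemesis."""

class ScyllaManagerError(Exception):
    """Stand-in for the repair error raised inside the cleanup callback."""

def remove_node_add_node():
    # Cleanup callback that itself fails, like _remove_node_add_node
    # failing inside run_repair_on_nodes.
    raise ScyllaManagerError("repair task failed")

def disruption():
    with contextlib.ExitStack() as stack:
        stack.callback(remove_node_add_node)
        # The disruption body is interrupted by the nemesis kill.
        raise KillNemesis()

caught = None
try:
    disruption()
except Exception as exc:
    caught = exc

# The callback's exception is what propagates; KillNemesis survives only
# as __context__ ("During handling of the above exception, another
# exception occurred").
assert isinstance(caught, ScyllaManagerError)
assert isinstance(caught.__context__, KillNemesis)
```

This is exactly the chaining pattern visible in the backtrace below: the original KillNemesis is demoted to the "During handling of the above exception" part, and the caller sees only the callback's failure.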

The code is:

...
with self.node_allocator.run_nemesis(
        nemesis_label=f"{simulate_node_unavailability.__name__}") as working_node, ExitStack() as stack:
    stack.enter_context(node_operations.block_loaders_payload_for_scylla_node(
        self.target_node, loader_nodes=self.loaders.nodes))
    stack.callback(drop_keyspace, node=working_node)
    target_host_id = self.target_node.host_id
    stack.callback(self._remove_node_add_node, verification_node=working_node, node_to_remove=self.target_node,
                   remove_node_host_id=target_host_id)
...

and if the problem occurs, the backtrace looks as follows:

Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5430, in _refuse_connection_from_banned_node
    working_node.run_nodetool(f"removenode {target_host_id}", retry=0, long_running=True)
  ...
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/session.py", line 52, in simple_select
    select(readfds, writefds, (), timeout)
    ~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sdcm.exceptions.KillNemesis

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5582, in wrapper
    result = method(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5363, in disrupt_refuse_connection_with_send_sigstop_signal_to_scylla_on_banned_node
    self._refuse_connection_from_banned_node(use_iptables=False)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5413, in _refuse_connection_from_banned_node
    nemesis_label=f"{simulate_node_unavailability.__name__}") as working_node, ExitStack() as stack:
                                                                               ~~~~~~~~~^^
  File "/usr/local/lib/python3.13/contextlib.py", line 619, in __exit__
    raise exc
  File "/usr/local/lib/python3.13/contextlib.py", line 604, in __exit__
    if cb(*exc_details):
       ~~^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/contextlib.py", line 482, in _exit_wrapper
    callback(*args, **kwds)
    ~~~~~~~~^^^^^^^^^^^^^^^
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3657, in _remove_node_add_node
    self.run_repair_on_nodes(nodes=up_normal_nodes, ignore_down_hosts=True)
    ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1868, in run_repair_on_nodes
    self._mgmt_repair_cli(ignore_down_hosts=ignore_down_hosts)
    ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 231, in wrapped
    res = func(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3095, in _mgmt_repair_cli
    raise ScyllaManagerError(
        f'Task: {mgr_task.id} final status is: {str(task_final_status)}.\nTask progress string: '
        f'{progress_full_string}')

run_repair_on_nodes indeed failed with an error, but that is most likely expected, since we are already handling sdcm.exceptions.KillNemesis.
The problem, however, is that a new exception is thrown while unwinding due to sdcm.exceptions.KillNemesis, and it replaces KillNemesis as the exception the caller sees.
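One possible direction (a hedged sketch, not the actual SCT fix — the helper name, logging, and cleanup function here are made up) is to wrap such cleanup callbacks so that their own failures are swallowed when KillNemesis is already unwinding, letting the kill exception propagate intact:

```python
import contextlib
import functools
import sys

class KillNemesis(Exception):
    """Stand-in for sdcm.exceptions.KillNemesis."""

def tolerate_during_kill(fn):
    """Hypothetical wrapper: if fn runs while KillNemesis is already
    being handled, swallow fn's own failure instead of letting it
    replace the kill exception."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # The exception currently being handled when the callback fires
        # (KillNemesis during ExitStack unwinding, None otherwise).
        in_flight = sys.exc_info()[1]
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            if isinstance(in_flight, KillNemesis):
                print(f"ignoring cleanup failure during nemesis kill: {exc!r}")
                return None
            raise
    return wrapper

def failing_cleanup():
    # Stands in for _remove_node_add_node failing mid-repair.
    raise RuntimeError("repair failed")

survived = False
try:
    with contextlib.ExitStack() as stack:
        stack.callback(tolerate_during_kill(failing_cleanup))
        raise KillNemesis()
except KillNemesis:
    survived = True  # KillNemesis propagated instead of RuntimeError

assert survived
```

With such a wrapper around the callbacks registered in _refuse_connection_from_banned_node, a cleanup failure during a nemesis kill would be logged rather than masking KillNemesis; when no kill is in flight, the cleanup failure still propagates normally.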

How frequently does it reproduce?

It does not seem to reproduce very often, but there are at least two test failures with this problem:
https://argus.scylladb.com/tests/scylla-cluster-tests/62d49bc0-4f18-476a-bd0c-94d156ae6c04/issues
https://argus.scylladb.com/tests/scylla-cluster-tests/e30e79c6-cbc0-4cc6-a14e-1d80676d67e6/events

Installation details

Cluster size: 6 nodes (i4i.8xlarge)

Scylla Nodes used in this run:

  • longevity-mv-si-4d-master-db-node-62d49bc0-9 (3.254.87.95 | 10.4.8.61) (shards: 30)
  • longevity-mv-si-4d-master-db-node-62d49bc0-8 (54.228.142.109 | 10.4.9.93) (shards: 30)
  • longevity-mv-si-4d-master-db-node-62d49bc0-7 (3.252.80.217 | 10.4.9.136) (shards: 30)
  • longevity-mv-si-4d-master-db-node-62d49bc0-6 (54.216.160.221 | 10.4.10.241) (shards: 30)
  • longevity-mv-si-4d-master-db-node-62d49bc0-5 (34.242.191.138 | 10.4.10.3) (shards: 30)
  • longevity-mv-si-4d-master-db-node-62d49bc0-4 (52.18.73.21 | 10.4.11.169) (shards: 30)
  • longevity-mv-si-4d-master-db-node-62d49bc0-3 (54.229.249.12 | 10.4.9.121) (shards: 30)
  • longevity-mv-si-4d-master-db-node-62d49bc0-2 (3.249.168.36 | 10.4.9.95) (shards: 30)
  • longevity-mv-si-4d-master-db-node-62d49bc0-1 (34.253.216.69 | 10.4.9.233) (shards: 30)

OS / Image: ami-0de01df1a3c5bdbac (aws: N/A)

Test: longevity-mv-si-4days-streaming-test
Test id: 62d49bc0-4f18-476a-bd0c-94d156ae6c04
Test name: scylla-master/tier1/longevity-mv-si-4days-streaming-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 62d49bc0-4f18-476a-bd0c-94d156ae6c04
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 62d49bc0-4f18-476a-bd0c-94d156ae6c04

Logs:

Jenkins job URL
Argus
