Packages
Scylla version: 2025.4.0~dev-20250815.f689d417473c with build-id 6452d1d4d848e57c3cb48457baa0b3a614c1cffc
Kernel Version: 6.14.0-1011-aws
Issue description
- This issue is a regression.
- It is unknown if this issue is a regression.
The problem looks like an internal SCT nemesis implementation issue.
The sdcm.exceptions.KillNemesis exception is used by stop_nemesis in sdcm/cluster.py to stop a running nemesis.
However, this exception-based nemesis stopping doesn't work well with the ExitStack() mechanism, which executes registered callbacks when a with-block is exited.
In the case of _refuse_connection_from_banned_node, which implements both disrupt_refuse_connection_with_block_scylla_ports_on_banned_node and disrupt_refuse_connection_with_send_sigstop_signal_to_scylla_on_banned_node, the problem is that ExitStack() registers a callback to _remove_node_add_node, and that callback may itself throw while we are already handling sdcm.exceptions.KillNemesis.
The code is:
...
with self.node_allocator.run_nemesis(
        nemesis_label=f"{simulate_node_unavailability.__name__}") as working_node, ExitStack() as stack:
    stack.enter_context(node_operations.block_loaders_payload_for_scylla_node(
        self.target_node, loader_nodes=self.loaders.nodes))
    stack.callback(drop_keyspace, node=working_node)
    target_host_id = self.target_node.host_id
    stack.callback(self._remove_node_add_node, verification_node=working_node, node_to_remove=self.target_node,
                   remove_node_host_id=target_host_id)
...
and when the problem occurs, the traceback looks as follows:
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5430, in _refuse_connection_from_banned_node
    working_node.run_nodetool(f"removenode {target_host_id}", retry=0, long_running=True)
  ...
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/session.py", line 52, in simple_select
    select(readfds, writefds, (), timeout)
    ~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sdcm.exceptions.KillNemesis

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5582, in wrapper
    result = method(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5363, in disrupt_refuse_connection_with_send_sigstop_signal_to_scylla_on_banned_node
    self._refuse_connection_from_banned_node(use_iptables=False)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5413, in _refuse_connection_from_banned_node
    nemesis_label=f"{simulate_node_unavailability.__name__}") as working_node, ExitStack() as stack:
    ~~~~~~~~~^^
  File "/usr/local/lib/python3.13/contextlib.py", line 619, in __exit__
    raise exc
  File "/usr/local/lib/python3.13/contextlib.py", line 604, in __exit__
    if cb(*exc_details):
       ~~^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/contextlib.py", line 482, in _exit_wrapper
    callback(*args, **kwds)
    ~~~~~~~~^^^^^^^^^^^^^^^
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3657, in _remove_node_add_node
    self.run_repair_on_nodes(nodes=up_normal_nodes, ignore_down_hosts=True)
    ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1868, in run_repair_on_nodes
    self._mgmt_repair_cli(ignore_down_hosts=ignore_down_hosts)
    ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 231, in wrapped
    res = func(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3095, in _mgmt_repair_cli
    raise ScyllaManagerError(
        f'Task: {mgr_task.id} final status is: {str(task_final_status)}.\nTask progress string: '
        f'{progress_full_string}')
run_repair_on_nodes indeed failed with an error, but that is most likely expected, since we are already handling sdcm.exceptions.KillNemesis.
The real problem is that we throw a new exception while unwinding due to sdcm.exceptions.KillNemesis, so the KillNemesis that should stop the nemesis cleanly gets masked by the cleanup failure.
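A minimal standalone sketch of this failure mode (stand-in names, not SCT code) shows how an exception raised by an ExitStack callback replaces the in-flight KillNemesis, leaving the original exception reachable only through the implicit exception chain:

from contextlib import ExitStack

class KillNemesis(Exception):
    """Stand-in for sdcm.exceptions.KillNemesis."""

class RepairError(Exception):
    """Stand-in for ScyllaManagerError."""

def remove_node_add_node():
    # Stand-in for _remove_node_add_node: the repair it runs fails
    # because the nemesis is being torn down.
    raise RepairError("repair task failed")

try:
    with ExitStack() as stack:
        stack.callback(remove_node_add_node)
        raise KillNemesis()  # the nemesis thread is being stopped
except KillNemesis:
    print("nemesis stopped cleanly")  # never reached
except RepairError as exc:
    # The callback's exception is what propagates out of the with-block;
    # KillNemesis survives only as exc.__context__.
    assert isinstance(exc.__context__, KillNemesis)
    print("KillNemesis was masked by the cleanup failure")

This is exactly the shape of the second traceback above: ScyllaManagerError propagates out of contextlib's __exit__ instead of KillNemesis. A possible mitigation direction is sketched after the reproduction links below.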
How frequently does it reproduce?
I think not very often, but there are at least two test failures with this problem:
https://argus.scylladb.com/tests/scylla-cluster-tests/62d49bc0-4f18-476a-bd0c-94d156ae6c04/issues
https://argus.scylladb.com/tests/scylla-cluster-tests/e30e79c6-cbc0-4cc6-a14e-1d80676d67e6/events
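For what it's worth, one possible mitigation direction (a sketch only; nemesis_safe and killnemesis_in_chain are hypothetical helpers, not part of SCT) would be to wrap the callbacks registered on the ExitStack so that a secondary cleanup failure is logged instead of raised when a KillNemesis is already in flight:

import logging
from functools import wraps

from sdcm.exceptions import KillNemesis

LOGGER = logging.getLogger(__name__)

def killnemesis_in_chain(exc: BaseException | None) -> bool:
    # Walk the implicit exception chain (__context__) looking for KillNemesis.
    while exc is not None:
        if isinstance(exc, KillNemesis):
            return True
        exc = exc.__context__
    return False

def nemesis_safe(func):
    """Hypothetical wrapper: keep a cleanup callback from masking KillNemesis."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as cleanup_exc:  # noqa: BLE001
            if killnemesis_in_chain(cleanup_exc):
                # We are unwinding because the nemesis was killed: log the
                # secondary failure and let the original KillNemesis propagate.
                LOGGER.exception("cleanup %s failed during nemesis kill", func.__name__)
                return None
            raise
    return wrapper

# Usage in _refuse_connection_from_banned_node would then look like:
# stack.callback(nemesis_safe(self._remove_node_add_node),
#                verification_node=working_node, node_to_remove=self.target_node,
#                remove_node_host_id=target_host_id)

Because the wrapper swallows only the secondary failure, ExitStack's __exit__ sees no new exception from the callback, so the original KillNemesis keeps propagating out of the with-block and the nemesis stops as intended.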
Installation details
Cluster size: 6 nodes (i4i.8xlarge)
Scylla Nodes used in this run:
- longevity-mv-si-4d-master-db-node-62d49bc0-9 (3.254.87.95 | 10.4.8.61) (shards: 30)
- longevity-mv-si-4d-master-db-node-62d49bc0-8 (54.228.142.109 | 10.4.9.93) (shards: 30)
- longevity-mv-si-4d-master-db-node-62d49bc0-7 (3.252.80.217 | 10.4.9.136) (shards: 30)
- longevity-mv-si-4d-master-db-node-62d49bc0-6 (54.216.160.221 | 10.4.10.241) (shards: 30)
- longevity-mv-si-4d-master-db-node-62d49bc0-5 (34.242.191.138 | 10.4.10.3) (shards: 30)
- longevity-mv-si-4d-master-db-node-62d49bc0-4 (52.18.73.21 | 10.4.11.169) (shards: 30)
- longevity-mv-si-4d-master-db-node-62d49bc0-3 (54.229.249.12 | 10.4.9.121) (shards: 30)
- longevity-mv-si-4d-master-db-node-62d49bc0-2 (3.249.168.36 | 10.4.9.95) (shards: 30)
- longevity-mv-si-4d-master-db-node-62d49bc0-1 (34.253.216.69 | 10.4.9.233) (shards: 30)
OS / Image: ami-0de01df1a3c5bdbac (aws: N/A)
Test: longevity-mv-si-4days-streaming-test
Test id: 62d49bc0-4f18-476a-bd0c-94d156ae6c04
Test name: scylla-master/tier1/longevity-mv-si-4days-streaming-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor 62d49bc0-4f18-476a-bd0c-94d156ae6c04
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 62d49bc0-4f18-476a-bd0c-94d156ae6c04
Logs:
- longevity-mv-si-4d-master-db-node-62d49bc0-5
- longevity-mv-si-4d-master-db-node-62d49bc0-3
- longevity-mv-si-4d-master-db-node-62d49bc0-1
- longevity-mv-si-4d-master-db-node-62d49bc0-9
- db-cluster-62d49bc0.tar.zst
- schema-logs-62d49bc0.tar.zst
- sct-runner-events-62d49bc0.tar.zst
- 2025_08_16__03_08_43_731.sct-62d49bc0.log.zst
- 2025_08_16__11_19_00_073.sct-62d49bc0.log.zst
- 2025_08_16__16_11_35_666.sct-62d49bc0.log.zst
- loader-set-62d49bc0.tar.zst
- monitor-set-62d49bc0.tar.zst
- builder-62d49bc0.log.tar.gz