Skip to content

Conversation

adelapena
Copy link

What is the issue

The slow query logger is one of the most useful tools when dealing with performance-related incidents. Currently it simply prints the CQL representation of the slow query and its running time. So while it can allow us to identify problematic SAI queries, we have to go to the aggregated metrics where it's difficult to see what matches the problematic query.

I think it would be useful if the slow query logger also printed the SAI query metrics and plan. That way we could examine log records at the time of the incident to see what's going on with individual slow queries, without needing to extract conclusions from aggregated metrics.

What does this PR fix and why was it fixed

This PR adds SAI-specific information to the log messages produced when slow queries are detected. That information consists on the metrics in the QueryContext and the tree view of the query Plan. There is a cassandra.cassandra.sai.slow_query_log.details_enabled property to disable this extended logging.

This is how slow queries are currently logged:

DEBUG [node1_isolatedExecutor:1] node1 2025-10-09 11:14:57,354 MonitoringTask.java:175 - 2 operations were slow in the last 131 msecs:
<SELECT s FROM distributed_test_keyspace.t WHERE (s = s_2 OR s = s_3) AND  ALLOW FILTERING>, time 107 msec - slow timeout 100 msec
<SELECT n FROM distributed_test_keyspace.t WHERE n > 1 AND  ALLOW FILTERING>, was slow 2 times: avg/min/max 104/104/105 msec - slow timeout 100 msec

And this is how they would look like with SAI details:

DEBUG [node1_isolatedExecutor:1] node1 2025-10-09 11:14:57,354 MonitoringTask.java:175 - 2 operations were slow in the last 131 msecs:
<SELECT * FROM distributed_test_keyspace.t WHERE (s = s_2 OR s = s_3) AND  ALLOW FILTERING>, time 111 msec - slow timeout 100 msec
  SAI slow query metrics:
    sstablesHit: 2
    segmentsHit: 2
    partitionsRead: 2
    rowsFiltered: 2
    rowsPreFiltered: 0
    trieSegmentsHit: 2
    bkdPostingListsHit: 0
    bkdSegmentsHit: 0
    bkdPostingsSkips: 0
    bkdPostingsDecodes: 0
    triePostingsSkips: 0
    triePostingsDecodes: 2
    annGraphSearchLatencyNanos: 0
    shadowedPrimaryKeyCount: 0
  SAI slow query plan:
    Limit 2147483647 (rows: 1.8, cost/row: 101.5, cost: 3000.0..3177.6)
     └─ Filter (s = s_2 OR s = s_3) (sel: 1.000000000) (rows: 1.8, cost/row: 101.5, cost: 3000.0..3177.6)
         └─ Fetch (rows: 1.8, cost/row: 101.5, cost: 3000.0..3177.6)
             └─ Union (keys: 1.8, cost/key: 0.1, cost: 3000.0..3000.2)
                 ├─ LiteralIndexScan of t_s_idx (sel: 0.250000000, step: 1.0) (keys: 1.0, cost/key: 0.1, cost: 1500.0..1500.1)
                 │  predicate: Expression{name: s, op: EQ, lower: (s_2, true), upper: (s_2, true), exclusions: []}
                 └─ LiteralIndexScan of t_s_idx (sel: 0.250000000, step: 1.0) (keys: 1.0, cost/key: 0.1, cost: 1500.0..1500.1)
                    predicate: Expression{name: s, op: EQ, lower: (s_3, true), upper: (s_3, true), exclusions: []}
<SELECT * FROM distributed_test_keyspace.t WHERE n > 1 AND  ALLOW FILTERING>, was slow 2 times: avg/min/max 258/103/413 msec - slow timeout 100 msec
  SAI slowest query metrics:
    sstablesHit: 3
    segmentsHit: 3
    partitionsRead: 3
    rowsFiltered: 6
    rowsPreFiltered: 0
    trieSegmentsHit: 0
    bkdPostingListsHit: 3
    bkdSegmentsHit: 3
    bkdPostingsSkips: 0
    bkdPostingsDecodes: 7
    triePostingsSkips: 0
    triePostingsDecodes: 0
    annGraphSearchLatencyNanos: 0
    shadowedPrimaryKeyCount: 0
  SAI slowest query plan:
    Limit 2147483647 (rows: 6.0, cost/row: 101.1, cost: 4500.0..5106.8)
     └─ Filter n > 1 (sel: 1.000000000) (rows: 6.0, cost/row: 101.1, cost: 4500.0..5106.8)
         └─ Fetch (rows: 6.0, cost/row: 101.1, cost: 4500.0..5106.8)
             └─ NumericIndexScan of t_n_idx (sel: 0.857142857, step: 1.0) (keys: 6.0, cost/key: 0.1, cost: 4500.0..4500.6)
                predicate: Expression{name: n, op: RANGE, lower: (1, false), upper: (null, false), exclusions: []}

Note that the queries are printed as invalid CQL and include user data. That is being dealt with in CNDB-15280, which is under review.

Adds the metrics in IndexContext and the tree view of the query's Plan
to the log messages produced for slow SAI queries.
@adelapena adelapena self-assigned this Oct 9, 2025
Copy link

github-actions bot commented Oct 9, 2025

Checklist before you submit for review

  • This PR adheres to the Definition of Done
  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit staring with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

@cassci-bot
Copy link

✔️ Build ds-cassandra-pr-gate/PR-2044 approved by Butler


Approved by Butler
See build details here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants