adelapena

The method AbstractReadQuery.toCQLString prints commands as CQL queries, including any column values. This includes the queried values in the WHERE clause of a SELECT statement and the written values of INSERT and UPDATE statements. The method is used at least by the slow query logger, which therefore prints user data into the logs.

This PR modifies AbstractReadQuery.toCQLString so it doesn't include column values. There is a boolean flag to opt out of redaction, since seeing the queried values can be useful while debugging.
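
As a rough illustration of the behaviour described above, here is a minimal sketch of value redaction with an opt-out flag. It is not the actual AbstractReadQuery code: the class `RedactedCqlSketch`, the `Restriction` type and the method signature are hypothetical and exist only for this example. The only detail taken from the PR is that column values are replaced by `?`.

```java
import java.util.List;
import java.util.StringJoiner;

// Hypothetical stand-in for a read query; NOT the real AbstractReadQuery API.
final class RedactedCqlSketch
{
    // One "column = value" restriction of the WHERE clause (assumed shape).
    static final class Restriction
    {
        final String column;
        final String value;

        Restriction(String column, String value)
        {
            this.column = column;
            this.value = value;
        }
    }

    // Builds a CQL-like string; when redact is true every value becomes '?'.
    static String toCQLString(String table, List<Restriction> restrictions, boolean redact)
    {
        StringJoiner where = new StringJoiner(" AND ");
        for (Restriction r : restrictions)
            where.add(r.column + " = " + (redact ? "?" : r.value));
        return "SELECT * FROM " + table + " WHERE " + where;
    }

    public static void main(String[] args)
    {
        List<Restriction> rs = List.of(new Restriction("k", "42"),
                                       new Restriction("c", "'secret'"));
        System.out.println(toCQLString("ks.t", rs, true));  // redacted form, e.g. for logs
        System.out.println(toCQLString("ks.t", rs, false)); // opt-out form, e.g. for debugging
    }
}
```

With `redact` set to true the sketch yields `SELECT * FROM ks.t WHERE k = ? AND c = ?`, so the shape of the query stays visible while the values do not.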

The criteria for what should be redacted are:

  • Needs redaction: Messages that go to external monitoring systems, such as JMX, diagnostic events, etc.
  • Doesn't need redaction: User-facing exceptions such as InvalidRequestException, query tracing (Tracing.trace) and generic Object#toString() methods.
  • Ideally should use redaction: Things printed in logs. We treat logs as sensitive data and there is plenty of user data that is printed there. I think we should gradually move towards logs free of user data, and this PR does that for AbstractReadQuery.toCQLString, which is used for example by the slow query logger. However, there are still plenty of other things that print user data, for example partition keys. Discussion here: https://datastax.slack.com/archives/C05LHP4HX5J/p1757687570882049?thread_ts=1757533116.788859&cid=C05LHP4HX5J

At the reviewer's request, this PR adds redaction separately, on top of the closely related changes to the toCQLString methods made by this other PR. That PR originally combined both changes in separate commits, and it already had multiple review comments on changes that are now in this PR.

@adelapena adelapena requested a review from k-rus October 7, 2025 10:18
@adelapena adelapena self-assigned this Oct 7, 2025
@github-actions

github-actions bot commented Oct 7, 2025

Checklist before you submit for review

  • This PR adheres to the Definition of Done
  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit starting with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

@sonarqubecloud

eolivelli and others added 4 commits October 16, 2025 17:50
… current() depending on a keyspace (#2041)

There is a new cassandra.sai.version.selector.class system property that allows
providing an implementation of the o.a.c.index.sai.disk.format.Version.Selector
interface to specify which version of the SAI on-disk index format should be
used for each keyspace.
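
The general pattern of loading a pluggable implementation from a system property can be sketched as follows. Only the property name comes from the commit message above. The `Selector` interface here is a hypothetical stand-in whose method and return type are assumptions, not the real `Version.Selector` signature, and the version labels "LATEST" and "OLD" are made up for the example.

```java
// Sketch of system-property-driven plugin loading; all types are hypothetical.
final class SelectorLoadingSketch
{
    // Property name as given in the commit message.
    static final String PROPERTY = "cassandra.sai.version.selector.class";

    // Hypothetical stand-in for Version.Selector (real signature not shown in the source).
    interface Selector
    {
        String select(String keyspace);
    }

    // Assumed default: every keyspace gets the same version label.
    static final class DefaultSelector implements Selector
    {
        public String select(String keyspace)
        {
            return "LATEST";
        }
    }

    // Example custom selector: an older format for keyspaces with a given prefix.
    static final class PerKeyspaceSelector implements Selector
    {
        public String select(String keyspace)
        {
            return keyspace.startsWith("legacy_") ? "OLD" : "LATEST";
        }
    }

    // Instantiates the class named by the property, or falls back to the default.
    static Selector load()
    {
        String className = System.getProperty(PROPERTY);
        if (className == null)
            return new DefaultSelector();
        try
        {
            return (Selector) Class.forName(className).getDeclaredConstructor().newInstance();
        }
        catch (ReflectiveOperationException e)
        {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args)
    {
        System.setProperty(PROPERTY, PerKeyspaceSelector.class.getName());
        Selector selector = load();
        System.out.println(selector.select("legacy_ks"));
        System.out.println(selector.select("new_ks"));
    }
}
```

In a real deployment the property would normally be passed on the command line with `-D` rather than set programmatically.
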
### What is the issue
...
We need that knowledge for CNDB
### What does this PR fix and why was it fixed
...
It exposes `containsDateRangeTypeColumn` methods

---------

Co-authored-by: Massimiliano Tomassi <[email protected]>
… CA (#2071)

Creating vector indexes if version is earlier than CA would usually fail in the asynchronous build.
This patch makes them fail synchronously at CREATE INDEX depending on the local index version.
If the local node has the right version but any of the remotes doesn't, the failure will remain
asynchronous.
…es (#2066)

When row-aware and non-row-aware indexes are mixed, we now check
the clustering index filter for all the keys that have clustering
information, i.e. keys coming from the row-aware
indexes. Earlier that check was accidentally disabled
if at least one non-row-aware index was used by the query.
That could cause retrieving rows that do not match
the clustering condition of the query.
@adelapena adelapena changed the title CNDB-15280: Remove user data from AbstractReadQuery.toCQLString (redaction) CNDB-15280: Remove user data from AbstractReadQuery.toCQLString Oct 20, 2025
michaeljmarshall and others added 5 commits October 20, 2025 11:49
### What is the issue

Fixes: riptano/cndb#15640

### What does this PR fix and why was it fixed

In order to lay the groundwork for Fused ADC, I want to refactor some
of the PQ/BQ logic. The unit length computation needs to move, so I
decided to move it out to its own PR.

The core idea is that:
* some models are documented to provide unit length vectors, and in
those cases, we should skip the computational check
* otherwise, we should check at runtime until we hit a non-unit length
vector, and then we can skip the check and configure the `writePQ`
method as needed
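
The two rules above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the class `UnitLengthTracker`, its method names and the tolerance value are all hypothetical.

```java
// Sketch of "check until the first non-unit-length vector, then stop checking".
final class UnitLengthTracker
{
    // Assumed tolerance for comparing the squared norm against 1.
    private static final float TOLERANCE = 1e-4f;

    private final boolean modelGuaranteesUnitLength;
    private boolean allUnitLengthSoFar = true;

    UnitLengthTracker(boolean modelGuaranteesUnitLength)
    {
        this.modelGuaranteesUnitLength = modelGuaranteesUnitLength;
    }

    // Called once per vector; does no arithmetic once the answer is known.
    void observe(float[] vector)
    {
        if (modelGuaranteesUnitLength || !allUnitLengthSoFar)
            return; // documented as normalized, or already known not to be

        float sumOfSquares = 0f;
        for (float v : vector)
            sumOfSquares += v * v;

        if (Math.abs(sumOfSquares - 1f) > TOLERANCE)
            allUnitLengthSoFar = false;
    }

    // True if the model is documented as normalized or no counterexample was seen.
    boolean isUnitLength()
    {
        return modelGuaranteesUnitLength || allUnitLengthSoFar;
    }

    public static void main(String[] args)
    {
        UnitLengthTracker tracker = new UnitLengthTracker(false);
        tracker.observe(new float[]{ 1f, 0f });
        System.out.println(tracker.isUnitLength());
        tracker.observe(new float[]{ 3f, 4f });
        System.out.println(tracker.isUnitLength());
    }
}
```

The flag is sticky: once a single non-unit-length vector is seen, later unit-length vectors do not flip it back, which is what allows the check to be skipped from then on.
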

### Embedding normalization notes

(I asked ChatGPT to provide proof for the config changes proposed in
this PR. Here is its generated description.)

Quick rundown of which models spit out normalized vectors (so cosine ==
dot product, etc.):

* **OpenAI (ada-002, v3-small, v3-large)** → already normalized. [OpenAI
FAQ](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings)
literally says embeddings are unit-length.
* **BERT** → depends. The SBERT “-cos-” models add a [`Normalize`
layer](https://www.sbert.net/docs/package_reference/layers.html#normalize)
so they’re fine; vanilla BERT doesn’t.
* **Google Gecko** → normalized out of the box per [Vertex AI
docs](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings).
* **NVIDIA QA-4** → nothing in the [NVIDIA NIM model
card](https://docs.api.nvidia.com/nim/reference/nvidia-embed-qa-4) about
normalization, so assume *not* normalized and handle it yourself.
* **Cohere v3** → not explicitly in their [API
docs](https://docs.cohere.com/docs/cohere-embed)

TL;DR: OpenAI + Gecko are definitely safe, Cohere/BERT/NV need manual
normalization due to lack of documentation.
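
To make the "cosine == dot product" remark concrete, here is a small self-contained sketch. It is not code from this PR, and the class and method names are made up for the example. It only shows that cosine similarity and the dot product agree once both vectors have unit length, and differ otherwise.

```java
// Sketch: for unit-length vectors, cosine similarity equals the dot product.
final class CosineVsDot
{
    static double dot(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.length; i++)
            sum += a[i] * b[i];
        return sum;
    }

    static double norm(double[] a)
    {
        return Math.sqrt(dot(a, a));
    }

    // cos(a, b) = (a . b) / (|a| * |b|)
    static double cosine(double[] a, double[] b)
    {
        return dot(a, b) / (norm(a) * norm(b));
    }

    // Returns a new vector scaled to unit length.
    static double[] normalize(double[] a)
    {
        double n = norm(a);
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++)
            out[i] = a[i] / n;
        return out;
    }

    public static void main(String[] args)
    {
        double[] a = { 3, 4 };
        double[] b = { 1, 2 };
        System.out.println("raw:        cosine=" + cosine(a, b) + " dot=" + dot(a, b));
        System.out.println("normalized: cosine=" + cosine(normalize(a), normalize(b))
                           + " dot=" + dot(normalize(a), normalize(b)));
    }
}
```

This is why the normalization check matters: the dot product can only stand in for cosine similarity when the vectors are known to be unit length.
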
Fixes for toCQLString mainly coming from CASSANDRA-16510,
and removal of code duplication.
Replace column values by '?' when converting internal read queries to CQL,
so user data don't end up in logs or any other unprotected place.

# Conflicts:
#	src/java/org/apache/cassandra/db/Clustering.java
#	src/java/org/apache/cassandra/db/Slices.java
@adelapena adelapena force-pushed the CNDB-15280-main-redaction branch from 2c07844 to c73a499 Compare October 21, 2025 16:02
@cassci-bot

✔️ Build ds-cassandra-pr-gate/PR-2038 approved by Butler



6 participants