[AI-6101] Add support for customizable cache keys for persistent cache #21316

AAraKKe · 2025-09-10T14:08:51Z

What does this PR do?

This PR adds support for customizable cache keys to be used with the agent persistent cache. This is implemented in the following way:

A CacheKey abstract class that can be extended to implement any cache invalidation logic needed.
The AgentCheck has been modified to rely on a CacheKey implementation to provide the key for the persistent cache internally. Using check_id is now the default behavior through the FullConfigCacheKey implementation but can be overriden when necessary.
The logs_persistent_cache_key can be overriden by a developer that wants to modify what is the CacheKey used to handle the logs cursor.
We are still maintaining some level of separation between checks by adding the check name to the key. It is not the full check_id but, if the developer creates a CacheKey that is to simple, the impact of key clashing is reduced to the same kind of check.
Added 2 implementations of the CacheKey interface that work as an example and to provide funcionality:
- The FullConfigCacheKey is equivalent to the previous behavior. It relies on the check_id that includes a digest of the full configuration of the check. This default behavior makes this change backward compatible with the usage of send_log and get_log_cursor, both of them using persistent cache.
- The ConfigSetCacheKey implementation invalidates the cache only when a subset of configuration options have been updated. This provides a similar level of separation between different checks but allowing some configuration options to not invalidate the cache.
The CacheKeyManager allows multiple cache keys to coexist in the AgentCheck for differnet purposes. At the moment we only use the CacheKey internally to manage the log cursor. However, if we want to include something else we can register a new type of CacheKeyType enum and maintain the new one in the manager without having to modify how the AgentCheck behaves.

Note: the manager forces to have each cache key represented by an enum to avoid relying on simple string keys per cache key. While the intention is to allow to personalize the cache invalidation mechanism, we do not want potentially arbitrary cache keys being register which could trigger cache misses if the string key is mistakenly generated.

Motivation

Until now, integrations that wanted to emit logs and be consistent between agent restarts would have logs missing or duplicated if the user changed their configuration. This would invalidate the cache and there was limited control over the last cursor.

The intention of this change is to provide a backward compatible solution that allows developers to modify the cache invalidation behavior in an easy way. The example below defines a check with a log cursor that is never invalidated after agent restart:

from datadog_checks.base import AgentCheck
from datadog_checks.base.utils.cache_key import CacheKey

class AlwaysValidCacheKey(CacheKey):
    def base_key(self) -> str:
        return "always_valid"

class MyCheck(AgentCheck):
    def logs_persistent_cache_key(self) -> CacheKey:
        return AalwaysValidCacheKey(self)

    def check(self, instance):
        pass

Warning

Using custom CacheKey can introduce clash between the keys from different checks. For example, in the example above, every MyCheck check would compete for the same key which results in undefined behavior when reading the cache. This new feature will be added to the public documentation and we will add there the drawbacks of building a custom CacheKey implementation.

Review checklist (to be filled by reviewers)

Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

github-actions · 2025-09-10T14:09:31Z

⚠️ The qa/skip-qa label has been added with shippable changes

The following files, which will be shipped with the agent, were modified in this PR and
the qa/skip-qa label has been added.

You can ignore this if you are sure the changes in this PR do not require QA. Otherwise, consider removing the label.

List of modified files that will be shipped with the agent

datadog_checks_base/datadog_checks/base/utils/cache_key/__init__.py
datadog_checks_base/datadog_checks/base/utils/cache_key/base.py
datadog_checks_base/datadog_checks/base/utils/cache_key/config_set.py
datadog_checks_base/datadog_checks/base/utils/cache_key/full_config.py
datadog_checks_base/changelog.d/21316.added
datadog_checks_base/datadog_checks/base/checks/base.py

codecov · 2025-09-10T17:43:38Z

Codecov Report

❌ Patch coverage is 97.67442% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.03%. Comparing base (045fe96) to head (8b2775a).
⚠️ Report is 15 commits behind head on master.

Additional details and impacted files

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

ofek

Great job trying to satisfy customer needs! I left a few comments. Thanks for pinging me on this one 🙏

Overall, I'm not convinced that the concept of cache key types is warranted. Rather, you could take a much simpler approach by defining a method that returns the expected persistent cache key prefix (defaulting to the check ID) which could be overridden. The return value would then be cached via a check initialization scheme or otherwise, and that static prefix would be used ubiquitously.

datadog_checks_base/datadog_checks/base/checks/base.py

datadog_checks_base/datadog_checks/base/utils/cache_key/manager.py

datadog_checks_base/datadog_checks/base/checks/base.py

datadog_checks_base/datadog_checks/base/utils/cache_key/__init__.py

ofek · 2025-09-10T20:19:28Z

datadog_checks_base/datadog_checks/base/utils/cache_key/config_set.py

+        if self.__key is not None:
+            return self.__key
+
+        merged_config = self.check.init_config | self.check.instance


The agent determines the check ID from only the instance AFAIK (double check) so it would be odd to use the full resolved config. More importantly, there are options that integrations use with the same name that do not have the exact same behavior and therefore this would not be correct.

As I mentioned in another comment, the name option from the instance config should always be included.

I will wait for the name discussion to be resolved in the comment before this one. About whether init config is used or not, I was under the impression it was based on this. But, to be honest, I am not very familiar with the agent code and I might be wrong. I can use only the instance but I can image some options in the init_config could be selected as a target subset here.

Could you clarify what this means?

More importantly, there are options that integrations use with the same name that do not have the exact same behavior and therefore this would not be correct.

How is it possible that they can have the same names? The options passed here do not affect all integrations, only the integration where an instance of this implementation is used. If I understand correctly you mean that IntegraionA and IntegrationB can have both an option named endpoint but even if the name is the same they do not mean the same, right? But this options do not affect all integrations, only those in which this cache key with a specific set of options is used. One integration cannot have multiple options with the same name, right?

I would expect that the developer of IntegrationA knows this, the same as IntegrationB and they will decide if endpoint is necessary when they override the logs_persistent_cache_key method. They need to create the instance passing which options, for that particular integration, should be considered when invalidating the cache.

What I meant was that it is possible for an instance-level option to have the same name as an equivalent option in init_config and for each to be treated differently by the integration's code. I don't have time to find an example but there are some.

Aaah alright, you mean different options in init and instance that have the same name. Maybe we can be more explicit and pass not only a set of options to respect but 2, one for init and another one for instance. That way we can crate the key instance like

ConfigSetCacheKey(check, init_config_options=["some_option"], instance_options=["other"])

This would remove the uncertainty. What do you think?

I have update the ConfigSet to explicitely require which options the developer wants to monitor.

datadog_checks_base/datadog_checks/base/utils/cache_key/config_set.py

datadog_checks_base/datadog_checks/base/utils/cache_key/manager.py

ofek

A few more things, I'll take a look again tomorrow!

ofek · 2025-09-11T17:27:41Z

datadog_checks_base/datadog_checks/base/checks/base.py

@@ -1094,7 +1107,7 @@ def read_persistent_cache(self, key):
            key (str):
                the key to retrieve
        """
-        return datadog_agent.read_persistent_cache(self._persistent_cache_id(key))
+        return datadog_agent.read_persistent_cache(key)


This is still a breaking change.

Oh man... true. I was focused on the behavior of interacting with these methods through the send_log method and get_log_cursor method but this side effect means that anyone using the write or read method to the persistent cache using their own key would have this broken.

What I am doing now is pulling the initialization of the CacheKey to these methods. The cache key initialization happens the first time we call these methods and then we apply the prefix we need to using the key provided by the caller here as the context.

This is a non-breaking change for checks using the default cache key as the prefix is the check_id, as before, and the context is the key provided by the caller.

ofek · 2025-09-11T17:27:57Z

datadog_checks_base/datadog_checks/base/checks/base.py

@@ -1110,7 +1123,7 @@ def write_persistent_cache(self, key, value):
            value (str):
                the value to store
        """
-        datadog_agent.write_persistent_cache(self._persistent_cache_id(key), value)
+        datadog_agent.write_persistent_cache(key, value)


This is still a breaking change.

ofek · 2025-09-11T17:34:50Z

datadog_checks_base/datadog_checks/base/checks/base.py

+from datadog_checks.base.utils.cache_key.base import CacheKey
+from datadog_checks.base.utils.cache_key.full_config import FullConfigCacheKey
+from datadog_checks.base.utils.cache_key.manager import CacheKeyManager, CacheKeyType


These should be imported lazily since most integrations do not use the cache. When you set up the check initialization just import in that function.

Left the CacheKey under TYPE_CHECKING and moved the FullConfigCacheKey to the persistent_cache_key method.

ofek · 2025-09-11T17:35:27Z

datadog_checks_base/datadog_checks/base/utils/cache_key/full_config.py

+        return self.base_key()
+
+    def base_key(self) -> str:
+        return str(self.check.check_id)


Suggested change

return str(self.check.check_id)

return self.check.check_id

At the time I was not sure if this was a string or not but yes, this is redundant. Thanks for spotting that, didn't realize it was still here.

ofek · 2025-09-11T18:04:11Z

datadog_checks_base/datadog_checks/base/utils/cache_key/base.py

+        if self.__prefix is not None:
+            return f"{self.__prefix}{self.base_key()}"
+
+        check_id_prefix = ":".join(self.check.check_id.split(":")[:-1])


There is no purpose to the key/base key dichotomy, it only adds unnecessary length to the filesystem path in the default case.

Given integration foo, instance name bar, instance hash 123 and context custom_key:

key + base key

Default: <run_path>/foo/barfoobar123_custom_key

Config set: <run_path>/foo/bar456_custom_key

key

Default: <run_path>/foo/bar123_custom_key

Config set: <run_path>/foo/bar456_custom_key

The reason why I have split those 2 is to enforce some check related isolation between different checks using the same custom key. My intention here is to avoid a developer setting a base_key that is always constant, with no check_id related prefix and unintentionally creating a poor customer experience for all checks they develop following the same practice.

For example, imagine a partner that decide that their base key is going to be

class PartnerKey(CheckKey): def base_key(self) -> str: return "partner_static_key"

And they start using this kind of key for all integrations they develop that require the same behavior.

I can easily see a scenario in which they would naively expect this key to only affect this check, but without the enforcement on our side of a check-scoped prefix, this would not be true. Without the separation between key and base_key there is only one single method providing a key which the developer have full control over.

With that separation they can still create this scenario but they would need to intentionally override the key method as well. Like what we are doing in the FullConfigCacheKey, where we override both but once this is in the hand of the developers I feel there needs to be some protection against it.

dkirov-dd

Great change overall! 🚀
I only have a couple of nit picks.

dkirov-dd · 2025-09-12T12:04:32Z

datadog_checks_base/tests/base/checks/test_agent_check.py

+        class ConstantCacheKey(CacheKey):
+            def base_key(self) -> str:
+                return "always_the_same"
+
+        class TestCheck(AgentCheck):
+            def persistent_cache_key(self) -> CacheKey:
+                return ConstantCacheKey(self)


You can move this out of the two tests.

dkirov-dd · 2025-09-12T12:06:10Z

datadog_checks_base/tests/base/checks/test_agent_check.py

+        new_check = TestCheck(name="another_test", init_config={}, instances=[{}])
+        new_check.check_id = 'test2:bar:456'


It's not clear to me whether it is the new_check.check_id or the TestCheck.name that invalidated the cache.
Could you only modify one of the two and keep the other one identical to the first TestCheck?

dkirov-dd

In regards to the key/base_key question, here is a suggestion to enforce the overriding of base_key instead of key.
My understanding is that the key method should never be overridden, therefore the final decorator can be used.

Another suggestion: rename base_key to key_suffix to be more explicit about the role of the method in the cache key construction.
IIUC, key() already implements the cache invalidation between different check IDs. Then the key_suffix allows for more granular control over the invalidation logic.

dkirov-dd · 2025-09-12T13:06:27Z

datadog_checks_base/datadog_checks/base/utils/cache_key/base.py

+
+    def key(self) -> str:


Suggested change

def key(self) -> str:

@final

def key(self) -> str:

dkirov-dd · 2025-09-12T13:12:53Z

datadog_checks_base/datadog_checks/base/utils/cache_key/full_config.py

+    def key(self) -> str:
+        return self.check.check_id
+
+    def base_key(self) -> str:
+        return self.check.check_id


Suggested change

def key(self) -> str:

return self.check.check_id

def base_key(self) -> str:

return self.check.check_id

self._CacheKey__cache_key = self.check.check_id

def base_key(self) -> str:

return "" # This method is not used for this CacheKey implementation

ofek

Here's my final recommendation. It will reduce the diff by 80% and incorporates @dkirov-dd's idea about the more semantically correct and intuitive key_suffix naming (although I flip it).

Let's create a new method on the base class which customers may override:

def get_persistent_cache_id(self) -> str:
    """
    This ID is used to form the final prefix of all persistent cache key lookups.
    """
    # The last component of the check ID is the hash of the instance configuration
    return self.check_id.split(':')[-1]

There will be another method that becomes a check initialization which will set a private property:

# __init__
    self.__persistent_cache_key_prefix: str | None = None

def __set_persistent_cache_key_prefix(self) -> None:
    # Remove the instance hash from the check ID
    namespace = ':'.join(self.check_id.split(':')[:-1])
    self.__persistent_cache_key_prefix = f'{namespace}:{self.get_persistent_cache_id()}_'

Finally, change the API calls:

    def read_persistent_cache(self, key: str) -> str:
        return datadog_agent.read_persistent_cache(f'{self.__persistent_cache_key_prefix}{key}')

    def write_persistent_cache(self, key: str, value: str) -> str:
        return datadog_agent.write_persistent_cache(f'{self.__persistent_cache_key_prefix}{key}', value)

The benefits of this approach are that:

There is no longer any overhead for accessing the cache with the proper key prefix. All computation happens once when the integration starts.
Customers retain complete control without having to create a dedicated class just to return a string. They now have 2 options:
1. We can create a simple utility for them to use within the overridden method.
2. They can override the method to return an empty string and rely exclusively on the name option of each instance.

Add support for customizable cache keys to store log cursor

b3487a9

AAraKKe added the qa/skip-qa Automatically skip this PR for the next QA label Sep 10, 2025

datadog-agent-integrations-bot bot added the base_package label Sep 10, 2025

AAraKKe added 5 commits September 10, 2025 16:10

Add changelog

b270356

Remove PersistentCacheKey lefotver from AgentCheck

d68445f

Fix docstring mistakes

80ab48b

Reduce cache key scope to checks with the same name

4c4ed30

Add test to cover restart behavior

f59d322

AAraKKe marked this pull request as ready for review September 10, 2025 16:54

AAraKKe requested a review from a team as a code owner September 10, 2025 16:54

datadog-agent-integrations-bot bot added the team/agent-integrations label Sep 10, 2025

AAraKKe changed the title ~~Add support for customizable cache keys to store log cursor~~ Add support for customizable cache keys for persistent cache Sep 10, 2025

ofek requested changes Sep 10, 2025

View reviewed changes

AAraKKe added 4 commits September 11, 2025 11:15

Address most of the comments on the PR

5dea5c4

Fix imports in tests

b055b39

Replace manual sorthing by mutable_hash utility

9cadb23

Update changelog message

79922bc

AAraKKe changed the title ~~Add support for customizable cache keys for persistent cache~~ [AI-6101] Add support for customizable cache keys for persistent cache Sep 11, 2025

ofek requested changes Sep 11, 2025

View reviewed changes

AAraKKe added 3 commits September 12, 2025 10:38

Remove CacheKeyManager

69c0fbf

Reword docstrings

fb3b214

Finalize update of ConfigSet

8b2775a

dkirov-dd reviewed Sep 12, 2025

View reviewed changes

ofek requested changes Sep 12, 2025

View reviewed changes

		new_check = TestCheck(name="another_test", init_config={}, instances=[{}])
		new_check.check_id = 'test2:bar:456'

[AI-6101] Add support for customizable cache keys for persistent cache #21316

Are you sure you want to change the base?

[AI-6101] Add support for customizable cache keys for persistent cache #21316

Conversation

AAraKKe commented Sep 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Motivation

Review checklist (to be filled by reviewers)

Uh oh!

github-actions bot commented Sep 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Sep 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

ofek left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

AAraKKe Sep 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

ofek left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

AAraKKe Sep 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

AAraKKe Sep 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ofek Sep 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dkirov-dd left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dkirov-dd left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

AAraKKe commented Sep 10, 2025 •

edited

Loading

github-actions bot commented Sep 10, 2025 •

edited

Loading

codecov bot commented Sep 10, 2025 •

edited

Loading

AAraKKe Sep 11, 2025 •

edited

Loading

AAraKKe Sep 12, 2025 •

edited

Loading

AAraKKe Sep 12, 2025 •

edited

Loading

ofek Sep 11, 2025 •

edited

Loading

dkirov-dd Sep 12, 2025 •

edited

Loading

ofek left a comment •

edited

Loading