Skip to content

Conversation

AAraKKe
Copy link
Contributor

@AAraKKe AAraKKe commented Sep 10, 2025

What does this PR do?

This PR adds support for customizable cache keys to be used with the agent persistent cache. This is implemented in the following way:

  • A CacheKey abstract class that can be extended to implement any cache invalidation logic needed.
  • The AgentCheck has been modified to rely on a CacheKey implementation to provide the key for the persistent cache internally. Using check_id is now the default behavior through the FullConfigCacheKey implementation but can be overriden when necessary.
  • The logs_persistent_cache_key can be overriden by a developer that wants to modify what is the CacheKey used to handle the logs cursor.
  • We are still maintaining some level of separation between checks by adding the check name to the key. It is not the full check_id but, if the developer creates a CacheKey that is to simple, the impact of key clashing is reduced to the same kind of check.
  • Added 2 implementations of the CacheKey interface that work as an example and to provide funcionality:
    • The FullConfigCacheKey is equivalent to the previous behavior. It relies on the check_id that includes a digest of the full configuration of the check. This default behavior makes this change backward compatible with the usage of send_log and get_log_cursor, both of them using persistent cache.
    • The ConfigSetCacheKey implementation invalidates the cache only when a subset of configuration options have been updated. This provides a similar level of separation between different checks but allowing some configuration options to not invalidate the cache.
  • The CacheKeyManager allows multiple cache keys to coexist in the AgentCheck for differnet purposes. At the moment we only use the CacheKey internally to manage the log cursor. However, if we want to include something else we can register a new type of CacheKeyType enum and maintain the new one in the manager without having to modify how the AgentCheck behaves.

Note: the manager forces to have each cache key represented by an enum to avoid relying on simple string keys per cache key. While the intention is to allow to personalize the cache invalidation mechanism, we do not want potentially arbitrary cache keys being register which could trigger cache misses if the string key is mistakenly generated.

Motivation

Until now, integrations that wanted to emit logs and be consistent between agent restarts would have logs missing or duplicated if the user changed their configuration. This would invalidate the cache and there was limited control over the last cursor.

The intention of this change is to provide a backward compatible solution that allows developers to modify the cache invalidation behavior in an easy way. The example below defines a check with a log cursor that is never invalidated after agent restart:

from datadog_checks.base import AgentCheck
from datadog_checks.base.utils.cache_key import CacheKey

class AlwaysValidCacheKey(CacheKey):
    def base_key(self) -> str:
        return "always_valid"

class MyCheck(AgentCheck):
    def logs_persistent_cache_key(self) -> CacheKey:
        return AalwaysValidCacheKey(self)

    def check(self, instance):
        pass

Warning

Using custom CacheKey can introduce clash between the keys from different checks. For example, in the example above, every MyCheck check would compete for the same key which results in undefined behavior when reading the cache. This new feature will be added to the public documentation and we will add there the drawbacks of building a custom CacheKey implementation.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

@AAraKKe AAraKKe added the qa/skip-qa Automatically skip this PR for the next QA label Sep 10, 2025
Copy link

github-actions bot commented Sep 10, 2025

⚠️ The qa/skip-qa label has been added with shippable changes

The following files, which will be shipped with the agent, were modified in this PR and
the qa/skip-qa label has been added.

You can ignore this if you are sure the changes in this PR do not require QA. Otherwise, consider removing the label.

List of modified files that will be shipped with the agent
datadog_checks_base/datadog_checks/base/utils/cache_key/__init__.py
datadog_checks_base/datadog_checks/base/utils/cache_key/base.py
datadog_checks_base/datadog_checks/base/utils/cache_key/config_set.py
datadog_checks_base/datadog_checks/base/utils/cache_key/full_config.py
datadog_checks_base/changelog.d/21316.added
datadog_checks_base/datadog_checks/base/checks/base.py

@AAraKKe AAraKKe marked this pull request as ready for review September 10, 2025 16:54
@AAraKKe AAraKKe requested a review from a team as a code owner September 10, 2025 16:54
@AAraKKe AAraKKe changed the title Add support for customizable cache keys to store log cursor Add support for customizable cache keys for persistent cache Sep 10, 2025
Copy link

codecov bot commented Sep 10, 2025

Codecov Report

❌ Patch coverage is 97.67442% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.03%. Comparing base (045fe96) to head (8b2775a).
⚠️ Report is 15 commits behind head on master.

Additional details and impacted files
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor

@ofek ofek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job trying to satisfy customer needs! I left a few comments. Thanks for pinging me on this one 🙏

Overall, I'm not convinced that the concept of cache key types is warranted. Rather, you could take a much simpler approach by defining a method that returns the expected persistent cache key prefix (defaulting to the check ID) which could be overridden. The return value would then be cached via a check initialization scheme or otherwise, and that static prefix would be used ubiquitously.

if self.__key is not None:
return self.__key

merged_config = self.check.init_config | self.check.instance
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. The agent determines the check ID from only the instance AFAIK (double check) so it would be odd to use the full resolved config. More importantly, there are options that integrations use with the same name that do not have the exact same behavior and therefore this would not be correct.
  2. As I mentioned in another comment, the name option from the instance config should always be included.

Copy link
Contributor Author

@AAraKKe AAraKKe Sep 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will wait for the name discussion to be resolved in the comment before this one. About whether init config is used or not, I was under the impression it was based on this. But, to be honest, I am not very familiar with the agent code and I might be wrong. I can use only the instance but I can image some options in the init_config could be selected as a target subset here.

Could you clarify what this means?

More importantly, there are options that integrations use with the same name that do not have the exact same behavior and therefore this would not be correct.

How is it possible that they can have the same names? The options passed here do not affect all integrations, only the integration where an instance of this implementation is used. If I understand correctly you mean that IntegraionA and IntegrationB can have both an option named endpoint but even if the name is the same they do not mean the same, right? But this options do not affect all integrations, only those in which this cache key with a specific set of options is used. One integration cannot have multiple options with the same name, right?

I would expect that the developer of IntegrationA knows this, the same as IntegrationB and they will decide if endpoint is necessary when they override the logs_persistent_cache_key method. They need to create the instance passing which options, for that particular integration, should be considered when invalidating the cache.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant was that it is possible for an instance-level option to have the same name as an equivalent option in init_config and for each to be treated differently by the integration's code. I don't have time to find an example but there are some.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aaah alright, you mean different options in init and instance that have the same name. Maybe we can be more explicit and pass not only a set of options to respect but 2, one for init and another one for instance. That way we can crate the key instance like

ConfigSetCacheKey(check, init_config_options=["some_option"], instance_options=["other"])

This would remove the uncertainty. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have update the ConfigSet to explicitely require which options the developer wants to monitor.

@AAraKKe AAraKKe changed the title Add support for customizable cache keys for persistent cache [AI-6101] Add support for customizable cache keys for persistent cache Sep 11, 2025
Copy link
Contributor

@ofek ofek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more things, I'll take a look again tomorrow!

@@ -1094,7 +1107,7 @@ def read_persistent_cache(self, key):
key (str):
the key to retrieve
"""
return datadog_agent.read_persistent_cache(self._persistent_cache_id(key))
return datadog_agent.read_persistent_cache(key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still a breaking change.

Copy link
Contributor Author

@AAraKKe AAraKKe Sep 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh man... true. I was focused on the behavior of interacting with these methods through the send_log method and get_log_cursor method but this side effect means that anyone using the write or read method to the persistent cache using their own key would have this broken.

What I am doing now is pulling the initialization of the CacheKey to these methods. The cache key initialization happens the first time we call these methods and then we apply the prefix we need to using the key provided by the caller here as the context.

This is a non-breaking change for checks using the default cache key as the prefix is the check_id, as before, and the context is the key provided by the caller.

@@ -1110,7 +1123,7 @@ def write_persistent_cache(self, key, value):
value (str):
the value to store
"""
datadog_agent.write_persistent_cache(self._persistent_cache_id(key), value)
datadog_agent.write_persistent_cache(key, value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still a breaking change.

Comment on lines 27 to 29
from datadog_checks.base.utils.cache_key.base import CacheKey
from datadog_checks.base.utils.cache_key.full_config import FullConfigCacheKey
from datadog_checks.base.utils.cache_key.manager import CacheKeyManager, CacheKeyType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be imported lazily since most integrations do not use the cache. When you set up the check initialization just import in that function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left the CacheKey under TYPE_CHECKING and moved the FullConfigCacheKey to the persistent_cache_key method.

return self.base_key()

def base_key(self) -> str:
return str(self.check.check_id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return str(self.check.check_id)
return self.check.check_id

Copy link
Contributor Author

@AAraKKe AAraKKe Sep 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the time I was not sure if this was a string or not but yes, this is redundant. Thanks for spotting that, didn't realize it was still here.

if self.__prefix is not None:
return f"{self.__prefix}{self.base_key()}"

check_id_prefix = ":".join(self.check.check_id.split(":")[:-1])
Copy link
Contributor

@ofek ofek Sep 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no purpose to the key/base key dichotomy, it only adds unnecessary length to the filesystem path in the default case.

Given integration foo, instance name bar, instance hash 123 and context custom_key:

  • key + base key
    • Default: <run_path>/foo/barfoobar123_custom_key
    • Config set: <run_path>/foo/bar456_custom_key
  • key
    • Default: <run_path>/foo/bar123_custom_key
    • Config set: <run_path>/foo/bar456_custom_key

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why I have split those 2 is to enforce some check related isolation between different checks using the same custom key. My intention here is to avoid a developer setting a base_key that is always constant, with no check_id related prefix and unintentionally creating a poor customer experience for all checks they develop following the same practice.

For example, imagine a partner that decide that their base key is going to be

class PartnerKey(CheckKey):
    def base_key(self) -> str:
        return "partner_static_key"

And they start using this kind of key for all integrations they develop that require the same behavior.

I can easily see a scenario in which they would naively expect this key to only affect this check, but without the enforcement on our side of a check-scoped prefix, this would not be true. Without the separation between key and base_key there is only one single method providing a key which the developer have full control over.

With that separation they can still create this scenario but they would need to intentionally override the key method as well. Like what we are doing in the FullConfigCacheKey, where we override both but once this is in the hand of the developers I feel there needs to be some protection against it.

Copy link
Contributor

@dkirov-dd dkirov-dd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great change overall! 🚀
I only have a couple of nit picks.

Comment on lines +563 to +569
class ConstantCacheKey(CacheKey):
def base_key(self) -> str:
return "always_the_same"

class TestCheck(AgentCheck):
def persistent_cache_key(self) -> CacheKey:
return ConstantCacheKey(self)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can move this out of the two tests.

Comment on lines +596 to +597
new_check = TestCheck(name="another_test", init_config={}, instances=[{}])
new_check.check_id = 'test2:bar:456'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me whether it is the new_check.check_id or the TestCheck.name that invalidated the cache.
Could you only modify one of the two and keep the other one identical to the first TestCheck?

Copy link
Contributor

@dkirov-dd dkirov-dd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In regards to the key/base_key question, here is a suggestion to enforce the overriding of base_key instead of key.
My understanding is that the key method should never be overridden, therefore the final decorator can be used.

Another suggestion: rename base_key to key_suffix to be more explicit about the role of the method in the cache key construction.
IIUC, key() already implements the cache invalidation between different check IDs. Then the key_suffix allows for more granular control over the invalidation logic.

Comment on lines +24 to +25

def key(self) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def key(self) -> str:
@final
def key(self) -> str:

Comment on lines +17 to +21
def key(self) -> str:
return self.check.check_id

def base_key(self) -> str:
return self.check.check_id
Copy link
Contributor

@dkirov-dd dkirov-dd Sep 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def key(self) -> str:
return self.check.check_id
def base_key(self) -> str:
return self.check.check_id
self._CacheKey__cache_key = self.check.check_id
def base_key(self) -> str:
return "" # This method is not used for this CacheKey implementation

Copy link
Contributor

@ofek ofek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's my final recommendation. It will reduce the diff by 80% and incorporates @dkirov-dd's idea about the more semantically correct and intuitive key_suffix naming (although I flip it).

Let's create a new method on the base class which customers may override:

def get_persistent_cache_id(self) -> str:
    """
    This ID is used to form the final prefix of all persistent cache key lookups.
    """
    # The last component of the check ID is the hash of the instance configuration
    return self.check_id.split(':')[-1]

There will be another method that becomes a check initialization which will set a private property:

# __init__
    self.__persistent_cache_key_prefix: str | None = None

def __set_persistent_cache_key_prefix(self) -> None:
    # Remove the instance hash from the check ID
    namespace = ':'.join(self.check_id.split(':')[:-1])
    self.__persistent_cache_key_prefix = f'{namespace}:{self.get_persistent_cache_id()}_'

Finally, change the API calls:

    def read_persistent_cache(self, key: str) -> str:
        return datadog_agent.read_persistent_cache(f'{self.__persistent_cache_key_prefix}{key}')

    def write_persistent_cache(self, key: str, value: str) -> str:
        return datadog_agent.write_persistent_cache(f'{self.__persistent_cache_key_prefix}{key}', value)

The benefits of this approach are that:

  1. There is no longer any overhead for accessing the cache with the proper key prefix. All computation happens once when the integration starts.
  2. Customers retain complete control without having to create a dedicated class just to return a string. They now have 2 options:
    1. We can create a simple utility for them to use within the overridden method.
    2. They can override the method to return an empty string and rely exclusively on the name option of each instance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
base_package qa/skip-qa Automatically skip this PR for the next QA team/agent-integrations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants