[AI-6101] Add support for customizable cache keys for persistent cache #21316
Conversation
The following files, which will be shipped with the agent, were modified in this PR. You can ignore this if you are sure the changes in this PR do not require QA; otherwise, consider removing the label.
Great job trying to satisfy customer needs! I left a few comments. Thanks for pinging me on this one 🙏
Overall, I'm not convinced that the concept of cache key types is warranted. Rather, you could take a much simpler approach by defining a method that returns the expected persistent cache key prefix (defaulting to the check ID) which could be overridden. The return value would then be cached via a check initialization scheme or otherwise, and that static prefix would be used ubiquitously.
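A minimal sketch of the shape being suggested, using a hypothetical method name and the check ID default described above (not the PR's actual code):

```python
# Sketch only: "persistent_cache_key_prefix" is a hypothetical name used to
# illustrate the suggestion; it is not part of AgentCheck.
class BaseCheckSketch:
    def __init__(self, check_id: str):
        self.check_id = check_id

    def persistent_cache_key_prefix(self) -> str:
        # Default: the check ID. An integration overrides this method to
        # control when its persistent cache entries are invalidated.
        return self.check_id
```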
datadog_checks_base/datadog_checks/base/utils/cache_key/manager.py
datadog_checks_base/datadog_checks/base/utils/cache_key/__init__.py
```python
if self.__key is not None:
    return self.__key

merged_config = self.check.init_config | self.check.instance
```
- The agent determines the check ID from only the instance AFAIK (double check), so it would be odd to use the full resolved config. More importantly, there are options that integrations use with the same name that do not have the exact same behavior, and therefore this would not be correct.
- As I mentioned in another comment, the `name` option from the instance config should always be included.
I will wait for the `name` discussion to be resolved in the comment before this one. Regarding whether the init config is used or not, I was under the impression it was, based on this. But, to be honest, I am not very familiar with the agent code and I might be wrong. I can use only the instance, but I can imagine some options in the `init_config` could be selected as a target subset here.

Could you clarify what this means?

> More importantly, there are options that integrations use with the same name that do not have the exact same behavior and therefore this would not be correct.

How is it possible that they can have the same names? The options passed here do not affect all integrations, only the integration where an instance of this implementation is used. If I understand correctly, you mean that IntegrationA and IntegrationB can both have an option named `endpoint`, but even if the name is the same they do not mean the same thing, right? But these options do not affect all integrations, only those in which this cache key with a specific set of options is used. One integration cannot have multiple options with the same name, right?

I would expect that the developer of IntegrationA knows this, the same as IntegrationB, and they will decide if `endpoint` is necessary when they override the `logs_persistent_cache_key` method. They need to create the instance passing which options, for that particular integration, should be considered when invalidating the cache.
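A sketch of the usage pattern being described here, where the integration developer picks the options that should invalidate the cursor; the `options` keyword is an assumption, not the actual `ConfigSetCacheKey` signature:

```python
from datadog_checks.base import AgentCheck
from datadog_checks.base.utils.cache_key.config_set import ConfigSetCacheKey


class IntegrationA(AgentCheck):
    def logs_persistent_cache_key(self):
        # Only changes to `endpoint` or `name` should invalidate the log cursor.
        return ConfigSetCacheKey(self, options={"endpoint", "name"})
```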
What I meant was that it is possible for an instance-level option to have the same name as an equivalent option in `init_config` and for each to be treated differently by the integration's code. I don't have time to find an example but there are some.
Aaah alright, you mean different options in init and instance that have the same name. Maybe we can be more explicit and pass not one set of options to respect but two: one for init and another one for instance. That way we can create the key instance like `ConfigSetCacheKey(check, init_config_options=["some_option"], instance_options=["other"])`. This would remove the uncertainty. What do you think?
I have updated the ConfigSet to explicitly require which options the developer wants to monitor.
datadog_checks_base/datadog_checks/base/utils/cache_key/config_set.py
datadog_checks_base/datadog_checks/base/utils/cache_key/manager.py
A few more things, I'll take a look again tomorrow!
```diff
@@ -1094,7 +1107,7 @@ def read_persistent_cache(self, key):
         key (str):
             the key to retrieve
         """
-        return datadog_agent.read_persistent_cache(self._persistent_cache_id(key))
+        return datadog_agent.read_persistent_cache(key)
```
This is still a breaking change.
Oh man... true. I was focused on the behavior of interacting with these methods through `send_log` and `get_log_cursor`, but this side effect means that anyone using the write or read method on the persistent cache with their own key would have this broken.

What I am doing now is pulling the initialization of the `CacheKey` into these methods. The cache key initialization happens the first time we call these methods, and then we apply the prefix we need, using the key provided by the caller here as the context.

This is a non-breaking change for checks using the default cache key: the prefix is the check_id, as before, and the context is the key provided by the caller.
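A self-contained sketch of the lazy prefixing described above, with a stubbed persistent cache; the `<prefix>_<caller key>` composition is an assumption based on this thread, not the PR's exact code:

```python
_store = {}  # stand-in for datadog_agent's persistent cache


class SketchCheck:
    def __init__(self, check_id):
        self.check_id = check_id
        self._cache_key_prefix = None  # built lazily on first cache access

    def _get_cache_key_prefix(self):
        if self._cache_key_prefix is None:
            # Default cache key: the prefix is the check ID, as before.
            self._cache_key_prefix = self.check_id
        return self._cache_key_prefix

    def write_persistent_cache(self, key, value):
        _store[f"{self._get_cache_key_prefix()}_{key}"] = value

    def read_persistent_cache(self, key):
        # The caller-provided key is the context appended to the prefix.
        return _store.get(f"{self._get_cache_key_prefix()}_{key}", "")


check = SketchCheck("mycheck:instance:123")
check.write_persistent_cache("log_cursor", "42")
assert check.read_persistent_cache("log_cursor") == "42"
```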
```diff
@@ -1110,7 +1123,7 @@ def write_persistent_cache(self, key, value):
         value (str):
             the value to store
         """
-        datadog_agent.write_persistent_cache(self._persistent_cache_id(key), value)
+        datadog_agent.write_persistent_cache(key, value)
```
This is still a breaking change.
```python
from datadog_checks.base.utils.cache_key.base import CacheKey
from datadog_checks.base.utils.cache_key.full_config import FullConfigCacheKey
from datadog_checks.base.utils.cache_key.manager import CacheKeyManager, CacheKeyType
```
These should be imported lazily since most integrations do not use the cache. When you set up the check initialization, just import them in that function.
Left the `CacheKey` under `TYPE_CHECKING` and moved the `FullConfigCacheKey` to the `persistent_cache_key` method.
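A sketch of the lazy-import arrangement being described; the class name and the `FullConfigCacheKey(self)` call are assumptions for illustration:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from datadog_checks.base.utils.cache_key.base import CacheKey


class SomeCheck:
    def persistent_cache_key(self) -> "CacheKey":
        # Imported here so checks that never touch the persistent cache do not
        # pay the import cost at module load time.
        from datadog_checks.base.utils.cache_key.full_config import FullConfigCacheKey

        return FullConfigCacheKey(self)
```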
```python
        return self.base_key()

    def base_key(self) -> str:
        return str(self.check.check_id)
```
```diff
-        return str(self.check.check_id)
+        return self.check.check_id
```
At the time I was not sure if this was a string or not but yes, this is redundant. Thanks for spotting that, didn't realize it was still here.
```python
        if self.__prefix is not None:
            return f"{self.__prefix}{self.base_key()}"

        check_id_prefix = ":".join(self.check.check_id.split(":")[:-1])
```
There is no purpose to the key/base key dichotomy; it only adds unnecessary length to the filesystem path in the default case.

Given integration `foo`, instance name `bar`, instance hash `123` and context `custom_key`:

- key + base key
  - Default: `<run_path>/foo/barfoobar123_custom_key`
  - Config set: `<run_path>/foo/bar456_custom_key`
- key
  - Default: `<run_path>/foo/bar123_custom_key`
  - Config set: `<run_path>/foo/bar456_custom_key`
The reason why I have split those two is to enforce some check-related isolation between different checks using the same custom key. My intention here is to avoid a developer setting a `base_key` that is always constant, with no `check_id`-related prefix, and unintentionally creating a poor customer experience for all checks they develop following the same practice.

For example, imagine a partner that decides that their base key is going to be

```python
class PartnerKey(CacheKey):
    def base_key(self) -> str:
        return "partner_static_key"
```

and they start using this kind of key for all integrations they develop that require the same behavior.

I can easily see a scenario in which they would naively expect this key to only affect this check, but without the enforcement on our side of a check-scoped prefix, this would not be true. Without the separation between `key` and `base_key` there is only one single method providing a key, which the developer has full control over.

With that separation they can still create this scenario, but they would need to intentionally override the `key` method as well, like what we are doing in the `FullConfigCacheKey`, where we override both. Once this is in the hands of developers, I feel there needs to be some protection against it.
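A minimal sketch of the split being defended here, assuming the prefix composition shown in the snippets above (not the PR's exact code):

```python
class CacheKeySketch:
    def __init__(self, check):
        self.check = check

    def key(self) -> str:
        # Check-scoped prefix: even a constant base_key() can only clash
        # within the same kind of check.
        check_id_prefix = ":".join(self.check.check_id.split(":")[:-1])
        return f"{check_id_prefix}:{self.base_key()}"

    def base_key(self) -> str:
        # Default suffix; subclasses override this to change invalidation.
        return self.check.check_id
```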
Great change overall! 🚀
I only have a couple of nitpicks.
```python
class ConstantCacheKey(CacheKey):
    def base_key(self) -> str:
        return "always_the_same"


class TestCheck(AgentCheck):
    def persistent_cache_key(self) -> CacheKey:
        return ConstantCacheKey(self)
```
You can move this out of the two tests.
```python
new_check = TestCheck(name="another_test", init_config={}, instances=[{}])
new_check.check_id = 'test2:bar:456'
```
It's not clear to me whether it is the `new_check.check_id` or the `TestCheck.name` that invalidated the cache.
Could you only modify one of the two and keep the other one identical to the first `TestCheck`?
In regards to the key/base_key question, here is a suggestion to enforce the overriding of `base_key` instead of `key`.
My understanding is that the `key` method should never be overridden, therefore the `final` decorator can be used.
Another suggestion: rename `base_key` to `key_suffix` to be more explicit about the role of the method in the cache key construction. IIUC, `key()` already implements the cache invalidation between different check IDs. Then the `key_suffix` allows for more granular control over the invalidation logic.
```python
    def key(self) -> str:
```
```diff
-    def key(self) -> str:
+    @final
+    def key(self) -> str:
```
```python
    def key(self) -> str:
        return self.check.check_id

    def base_key(self) -> str:
        return self.check.check_id
```
```diff
-    def key(self) -> str:
-        return self.check.check_id
-    def base_key(self) -> str:
-        return self.check.check_id
+    self._CacheKey__cache_key = self.check.check_id
+    def base_key(self) -> str:
+        return ""  # This method is not used for this CacheKey implementation
```
Here's my final recommendation. It will reduce the diff by 80% and incorporates @dkirov-dd's idea about the more semantically correct and intuitive `key_suffix` naming (although I flip it).

Let's create a new method on the base class which customers may override:
```python
def get_persistent_cache_id(self) -> str:
    """
    This ID is used to form the final prefix of all persistent cache key lookups.
    """
    # The last component of the check ID is the hash of the instance configuration
    return self.check_id.split(':')[-1]
```
There will be another method that becomes a check initialization which will set a private property:
```python
# __init__
self.__persistent_cache_key_prefix: str | None = None


def __set_persistent_cache_key_prefix(self) -> None:
    # Remove the instance hash from the check ID
    namespace = ':'.join(self.check_id.split(':')[:-1])
    self.__persistent_cache_key_prefix = f'{namespace}:{self.get_persistent_cache_id()}_'
```
Finally, change the API calls:
```python
def read_persistent_cache(self, key: str) -> str:
    return datadog_agent.read_persistent_cache(f'{self.__persistent_cache_key_prefix}{key}')


def write_persistent_cache(self, key: str, value: str) -> str:
    return datadog_agent.write_persistent_cache(f'{self.__persistent_cache_key_prefix}{key}', value)
```
The benefits of this approach are that:
- There is no longer any overhead for accessing the cache with the proper key prefix. All computation happens once when the integration starts.
- Customers retain complete control without having to create a dedicated class just to return a string. They now have 2 options:
  - We can create a simple utility for them to use within the overridden method.
  - They can override the method to return an empty string and rely exclusively on the `name` option of each instance.
What does this PR do?
This PR adds support for customizable cache keys to be used with the agent persistent cache. This is implemented in the following way:
- A `CacheKey` abstract class that can be extended to implement any cache invalidation logic needed.
- `AgentCheck` has been modified to rely on a `CacheKey` implementation to provide the key for the persistent cache internally. Using `check_id` is now the default behavior through the `FullConfigCacheKey` implementation but can be overridden when necessary.
- `logs_persistent_cache_key` can be overridden by a developer that wants to modify the `CacheKey` used to handle the logs cursor.
- Even when a developer provides a `CacheKey` that is too simple, the impact of key clashing is reduced to the same kind of check.
- Implementations of the `CacheKey` interface that work as examples and provide functionality:
  - `FullConfigCacheKey` is equivalent to the previous behavior. It relies on the `check_id`, which includes a digest of the full configuration of the check. This default behavior makes this change backward compatible with the usage of `send_log` and `get_log_cursor`, both of which use the persistent cache.
  - The `ConfigSetCacheKey` implementation invalidates the cache only when a subset of configuration options have been updated. This provides a similar level of separation between different checks while allowing some configuration options to not invalidate the cache.
- A `CacheKeyManager` allows multiple cache keys to coexist in the `AgentCheck` for different purposes. At the moment we only use the `CacheKey` internally to manage the log cursor. However, if we want to include something else we can register a new type in the `CacheKeyType` enum and maintain the new one in the manager without having to modify how the `AgentCheck` behaves.

Note: the manager forces each cache key to be represented by an enum member to avoid relying on simple string keys per cache key. While the intention is to allow personalizing the cache invalidation mechanism, we do not want potentially arbitrary cache keys being registered, which could trigger cache misses if the string key is mistakenly generated.
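A hypothetical sketch of an enum-keyed manager, only to illustrate why enum members (rather than arbitrary strings) are required; the real `CacheKeyManager` API may differ:

```python
from enum import Enum


class CacheKeyType(Enum):
    LOGS = "logs"


class CacheKeyManagerSketch:
    def __init__(self):
        self._keys = {}

    def register(self, key_type, cache_key):
        # Requiring an enum member avoids typo-prone string keys that would
        # silently produce cache misses.
        if not isinstance(key_type, CacheKeyType):
            raise TypeError("cache keys must be registered under a CacheKeyType member")
        self._keys[key_type] = cache_key

    def get(self, key_type):
        return self._keys[key_type]
```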
Motivation
Until now, integrations that wanted to emit logs and keep a consistent cursor between agent restarts would end up with missing or duplicated logs if the user changed their configuration, because the change would invalidate the cache and there was limited control over the last cursor.
The intention of this change is to provide a backward-compatible solution that allows developers to modify the cache invalidation behavior in an easy way. The example below defines a check with a log cursor that is never invalidated after an agent restart:
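(A sketch reconstructing the example, assuming a constant-key `CacheKey` like the one used in the tests; the constant key value is illustrative, consistent with the warning below.)

```python
from datadog_checks.base import AgentCheck
from datadog_checks.base.utils.cache_key.base import CacheKey


class ConstantCacheKey(CacheKey):
    def base_key(self) -> str:
        # Constant key: the log cursor survives any configuration change.
        return "my_check_logs_cursor"


class MyCheck(AgentCheck):
    def logs_persistent_cache_key(self) -> CacheKey:
        return ConstantCacheKey(self)
```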
Warning
Using a custom `CacheKey` can introduce clashes between the keys from different checks. For example, in the example above, every `MyCheck` check would compete for the same key, which results in undefined behavior when reading the cache. This new feature will be added to the public documentation, where we will also describe the drawbacks of building a custom `CacheKey` implementation.

Review checklist (to be filled by reviewers)
- Add the `qa/skip-qa` label if the PR doesn't need to be tested during QA.
- If needed, add the `backport/<branch-name>` label to the PR and it will automatically open a backport PR once this one is merged.