Contributor

@majanjua-amzn majanjua-amzn commented Aug 20, 2025

Description:
Adds support for X-Ray's new Adaptive Sampling feature. The feature lets users detect anomalies in their application across distributed services and boost the sampling rate of their root service, based on their X-Ray sampling rule configuration and, optionally, a user-provided SDK-level configuration. Users can also provide an optional error capture configuration; if it is configured, the sampler sends unsampled anomaly spans directly to be exported. Users can:

  • Configure a max boost rate and boost cooldown in their X-Ray sampling rules, which define how high the sampling rate can go for a given rule based on anomalies detected in the instrumentation, and for how long
  • Configure anomaly conditions based on error code (RegEx), latency, and span operation - these are used to check whether a given span is an anomaly. By default, spans with a statusCode > 499 (i.e. 5XX) are considered anomalies
  • Configure an anomaly/error capture rate that allows spans to be sent directly to the configured span exporter
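
The anomaly check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class names AnomalyCondition and AnomalyDetector, the field names, and the choice that all configured fields of one condition must match (while any single matching condition is enough) are assumptions made for the example. The only behaviour taken from the description is the default of treating statusCode > 499 as an anomaly when no conditions are configured.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical condition: every field that is set (non-null) must match. */
final class AnomalyCondition {
  private final Pattern errorCodeRegex; // null means "not configured"
  private final Long minLatencyMillis;  // null means "not configured"
  private final String operation;       // null means "not configured"

  AnomalyCondition(String errorCodeRegex, Long minLatencyMillis, String operation) {
    this.errorCodeRegex = errorCodeRegex == null ? null : Pattern.compile(errorCodeRegex);
    this.minLatencyMillis = minLatencyMillis;
    this.operation = operation;
  }

  boolean matches(int statusCode, long latencyMillis, String spanOperation) {
    if (errorCodeRegex != null
        && !errorCodeRegex.matcher(Integer.toString(statusCode)).matches()) {
      return false;
    }
    if (minLatencyMillis != null && latencyMillis < minLatencyMillis) {
      return false;
    }
    if (operation != null && !operation.equals(spanOperation)) {
      return false;
    }
    return true;
  }
}

/** Hypothetical detector: falls back to the 5XX default when nothing is configured. */
final class AnomalyDetector {
  private final List<AnomalyCondition> conditions;

  AnomalyDetector(List<AnomalyCondition> conditions) {
    this.conditions = conditions;
  }

  boolean isAnomaly(int statusCode, long latencyMillis, String spanOperation) {
    if (conditions.isEmpty()) {
      return statusCode > 499; // default from the PR description: 5XX is an anomaly
    }
    for (AnomalyCondition condition : conditions) {
      if (condition.matches(statusCode, latencyMillis, spanOperation)) {
        return true;
      }
    }
    return false;
  }
}

final class AnomalyDetectorDemo {
  public static void main(String[] args) {
    AnomalyDetector defaults = new AnomalyDetector(Collections.<AnomalyCondition>emptyList());
    System.out.println(defaults.isAnomaly(503, 10, "GET /orders")); // prints "true"

    AnomalyDetector custom = new AnomalyDetector(Arrays.asList(
        new AnomalyCondition("4\\d\\d", null, null),     // any 4XX status
        new AnomalyCondition(null, 2000L, "GET /slow"))); // slow calls to one operation
    System.out.println(custom.isAnomaly(200, 2500, "GET /slow")); // prints "true"
  }
}
```

In this sketch, user-provided conditions replace the 5XX default rather than extend it; whether the real sampler behaves that way is not stated in the description.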

The changes:

  • Provide APIs called setSpanExporter and setAdaptiveSamplingConfig to set up the feature - if these are not provided, any attempt to use the adaptSampling API throws an IllegalStateException
  • Provide an API called adaptSampling that accepts a span and its associated spanData:
    • This API determines whether the span is an anomaly based on the user-provided conditions (or the default 5XX check), then decides whether to count it towards the boost-related statistics and/or send it directly for export based on the error capture rate
  • Update the calls to GetSamplingRules and GetSamplingTargets according to the new API and the collected anomaly statistics
  • Propagate sampling information between instrumented services - specifically, the sampling rule in the root service is passed to all downstream services via the trace state AND baggage, such that statistics can be recorded meaningfully for boost to be triggered for the root service in a distributed system. Both trace state and baggage are used in case the user's configuration places a propagator that overrides one of these values in any service along the call chain.
  • Reimplemented the following change for the XrayRulesSampler while keeping the statistics correct, so that the adaptive sampling logic is always triggered: Ensure all XRay Sampler functionality is under ParentBased logic #1488. Added a unit test, generateStatistics(), to verify this.
  • Ensure baggage is propagated with any modifications in AwsXrayPropagator by using W3CBaggagePropagator.getInstance().inject(context.with(baggage), setter, carrier);
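
The reason for writing the sampling rule to both trace state and baggage can be illustrated with a small sketch. It deliberately uses plain maps in place of the OpenTelemetry TraceState and Baggage types, and the key name, the class name SamplingRulePropagation, and the "trace state first, then baggage" lookup order are assumptions made for the example rather than details taken from this PR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: writes the root sampling rule to two carriers, reads from either. */
final class SamplingRulePropagation {
  // Hypothetical key; the real key used by the sampler is not given in the description.
  static final String RULE_KEY = "example-sampling-rule";

  /** Root service: record the matched rule in both carriers. */
  static void inject(String ruleName, Map<String, String> traceState, Map<String, String> baggage) {
    traceState.put(RULE_KEY, ruleName);
    baggage.put(RULE_KEY, ruleName);
  }

  /** Downstream service: prefer trace state, fall back to baggage if it was dropped. */
  static Optional<String> extract(Map<String, String> traceState, Map<String, String> baggage) {
    String rule = traceState.get(RULE_KEY);
    if (rule == null) {
      rule = baggage.get(RULE_KEY);
    }
    return Optional.ofNullable(rule);
  }
}

final class SamplingRulePropagationDemo {
  public static void main(String[] args) {
    Map<String, String> traceState = new HashMap<>();
    Map<String, String> baggage = new HashMap<>();

    // Service A (root) matches a rule and propagates it.
    SamplingRulePropagation.inject("checkout-rule", traceState, baggage);

    // Service B's propagator configuration drops the trace state but keeps baggage.
    traceState.clear();

    // Service C can still attribute its statistics to service A's rule.
    System.out.println(SamplingRulePropagation.extract(traceState, baggage).orElse("<none>"));
    // prints "checkout-rule"
  }
}
```

If both carriers are lost along the call chain, extract returns an empty Optional, and in this sketch the downstream service would have no rule to record statistics against.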

Testing:

  • Manual testing against the current revision of the contrib with the following setup:
    • Service A with the contrib changes, service B without and using tracecontext,baggage,b3,xray propagators, and service C with the contrib changes as well
    • Verified when A -> B -> C, service C is able to generate boost statistics for service A's sampling rule despite the trace state being dropped by service B (since the propagator configuration with b3 and xray at the end drops the trace state)
    • Verified numerous other use cases, e.g. no sampling rule with boost, no AwsXrayAdaptiveSamplingConfiguration set, different propagator configurations, services in different programming languages in place of service B, etc.
  • Ran a performance test comparing ADOT 2.11.3 to a custom version built with these changes, which showed negligible to no change in CPU and memory usage.

Documentation:
https://aws.amazon.com/blogs/mt/dynamically-adjusting-x-ray-sampling-rules/

Outstanding items:
These changes are the first phase and iteration of AWS X-Ray's adaptive sampling feature. As we get feedback, more changes may be introduced to improve or streamline the experience.

linux-foundation-easycla bot commented Aug 20, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: majanjua-amzn / name: Mahad Janjua (414b76f)

Member

@majanjua-amzn @wangzlei please review new public API carefully since this module is already marked stable (if you are unsure you can always add it initially under .internal. package as an experimental feature), thanks

Contributor Author

We've tested and reviewed it extensively more than once as part of a contribution to https://github.com/aws-observability/aws-otel-java-instrumentation, and have also verified there is no impact on existing behaviour for those not using the new APIs.

As such, we're okay with releasing these APIs directly. Thanks for the callout!

@majanjua-amzn majanjua-amzn requested a review from trask October 20, 2025 16:21
@trask trask added this pull request to the merge queue Oct 20, 2025
Merged via the queue into open-telemetry:main with commit 67efe98 Oct 20, 2025
24 checks passed
6 participants