Implement anonymous usage metrics collection with opt-out mechanism #2449

lujunsan · 2025-11-04T12:31:58Z

Overview

Implements anonymous usage metrics (tool calls) collection for ToolHive. Users can opt out via persistent config or environment variable.

Implementation

Core Metrics Package (`pkg/usagemetrics`)

Collects tool call counts using atomic operations
Flushes metrics to API endpoint periodically (every 15 minutes)
Handles midnight UTC boundary for daily count resets
Manages background flush goroutine lifecycle
Sends anonymous instance ID with metrics (no personal data)

Middleware Integration

New usagemetrics middleware type
Intercepts tools/call requests from parsed MCP data
Non-blocking: failures don't affect request processing
Automatically disabled in CI environments

Opt-Out Mechanism

Users can opt out using either persistent config or environment variable:

# Via config (persistent)
thv config usage-metrics disable

# Via environment variable (session-only)
export TOOLHIVE_USAGE_METRICS_ENABLED=false

Metrics are enabled by default and automatically disabled in CI environments.

Integration Points

All workload execution paths read from Application Config:

CLI path (cmd/thv/app/run_flags.go) - reads appConfig.DisableUsageMetrics
API path (pkg/api/v1/workload_service.go) - reads appConfig.DisableUsageMetrics
Operator/middleware path (pkg/runner/middleware.go) - reads appConfig.DisableUsageMetrics

Documentation

Added middleware documentation in docs/middleware.md
Updated architecture docs in docs/arch/02-core-concepts.md
CLI documentation auto-generated via task docs

Privacy

Metrics are completely anonymous (no user identifiers)
Only collects tool call counts and anonymous data
Instance ID is randomly generated per process
Users can easily opt out via config or environment variable

Signed-off-by: lujunsan <[email protected]>

lujunsan · 2025-11-04T13:33:42Z

@claude review this PR

claude · 2025-11-04T13:33:58Z

Claude finished @lujunsan's task —— View job

Code Review for PR #2449

I've completed a thorough review of the anonymous usage metrics implementation. Overall, this is a well-designed feature with good attention to privacy, error handling, and integration points. Below are my findings organized by category.

Summary

This PR adds anonymous usage metrics collection that tracks tool calls to help ToolHive development. The implementation is solid with a few areas that could benefit from improvements around error handling, testing, and edge cases.

Key strengths:

Strong privacy-first design with multiple opt-out mechanisms
Non-blocking architecture - failures don't affect user requests
Clean middleware integration pattern
Good test coverage for core functionality
Well-documented in both inline comments and external docs

Areas for improvement:

Error handling in midnight boundary logic
Goroutine leak potential on rapid create/destroy cycles
Test coverage gaps (midnight transitions, concurrent access)
Anonymous ID initialization race conditions

Critical Issues

1. Potential Goroutine Leak on Rapid Create/Destroy

Location: pkg/usagemetrics/collector.go:63-65

The Start() method launches a goroutine without checking if one is already running:

func (c *Collector) Start() {
    c.started.Store(true)
    go c.flushLoop()
}

Issue: If Start() is called multiple times (e.g., during rapid workload restarts), each call creates a new goroutine. The started flag doesn't prevent this.

Recommendation: Add a check to prevent multiple goroutines:

func (c *Collector) Start() {
    if c.started.Swap(true) {
        return // Already started
    }
    go c.flushLoop()
}

2. Midnight Boundary Reset Loses Recent Counts

Location: pkg/usagemetrics/collector.go:83-114

The midnight boundary logic flushes the previous day's count but includes up to 15 minutes of calls from the current day (comment acknowledges this). While the comment says this is acceptable, there's a larger issue:

// Reset for the new day
c.currentDate = currentDate
c.counter.Store(0)
return nil

Issue: If the flush to the API fails (network error, API down), the count is permanently lost because we reset to 0 and return nil. The error is only logged as a warning.

Recommendation: Consider one of these approaches:

Don't reset the counter if the flush fails (track the last successfully flushed count)
Accumulate failed counts in a separate field for retry
Accept the data loss but document it more prominently

3. Anonymous ID Race Condition

Location: pkg/usagemetrics/client.go:49-59

anonymousID, err := updates.TryGetAnonymousID()
if err != nil {
    return fmt.Errorf("failed to get anonymous ID: %w", err)
}

// Skip sending if anonymous ID is not initialized yet
if anonymousID == "" {
    logger.Debugf("Skipping metrics send - anonymous ID not yet initialized")
    return nil
}

Issue: There's a race condition between when ToolHive starts and when the update checker creates the anonymous ID. If metrics try to flush before the update checker runs, metrics are silently dropped with just a debug log.

Recommendation:

Document this behavior more prominently (currently just a debug log)
Consider generating the anonymous ID eagerly during collector initialization
Or retry later if the ID isn't available yet

Major Issues

4. Insufficient Test Coverage for Critical Paths

Missing test cases:

Midnight boundary transitions - No test for the date rollover logic (pkg/usagemetrics/collector_test.go)
Concurrent access - No test for multiple goroutines calling IncrementToolCall() simultaneously
Flush failures - No test for API failure scenarios and retry behavior
Goroutine cleanup - No test verifying the flush loop properly stops on shutdown

Recommendation: Add tests for these scenarios:

func TestCollector_MidnightBoundaryTransition(t *testing.T) {
    // Test count flush and reset across midnight UTC
}

func TestCollector_ConcurrentIncrements(t *testing.T) {
    // Test thread-safety with multiple goroutines
}

func TestCollector_FlushFailureDoesNotBlock(t *testing.T) {
    // Test that flush failures don't break the collector
}

5. No Shutdown Timeout Handling

Location: pkg/usagemetrics/collector.go:141-167

func (c *Collector) Shutdown(ctx context.Context) {
    // ...
    select {
    case <-c.doneCh:
        logger.Debugf("Collector stopped cleanly")
    case <-ctx.Done():
        logger.Warnf("Collector shutdown timed out: %v", ctx.Err())
    }
}

Issue: If shutdown times out, the goroutine continues running. The middleware's shutdownTimeout is 10 seconds, but there's no forced cleanup if the goroutine is stuck.

Recommendation: Consider adding a way to force-stop the goroutine or document that zombie goroutines are acceptable for this non-critical feature.

Minor Issues

6. Inconsistent Error Handling in Flush

Location: pkg/usagemetrics/collector.go:131-135

if err := c.client.SendMetrics(c.instanceID, record); err != nil {
    logger.Warnf("Failed to send metrics: %v", err)
    // Don't return error - we'll retry on next flush
    return nil
}

This returns nil instead of the error, but the comment says "we'll retry on next flush." However, the count isn't preserved for retry - it's still in the counter and will be sent again on the next flush (potentially duplicating data).

Recommendation: Clarify the behavior:

If duplicate counts are acceptable, document it
If not, track the last successfully sent count

7. Fixed 15-Minute Flush Interval

Location: pkg/usagemetrics/collector.go:15

const flushInterval = 15 * time.Minute

Issue: This is hardcoded and not configurable. For testing or high-frequency scenarios, you might want to adjust this.

Recommendation: Consider making this configurable via environment variable (with 15min default) for testing purposes.

8. UserAgent BuildType Fallback Logic

Location: pkg/usagemetrics/client.go:103-107

buildType := versions.BuildType
if buildType == "" {
    buildType = "local_build"
}

Issue: The local_build fallback is fine, but it's not clear if this should be treated differently from official builds on the backend. If metrics from local builds should be filtered out, there's no way to identify them distinctly.

Recommendation: Consider using a more distinct value like dev_build or unknown_build to make filtering easier.

9. Middleware Factory Always Succeeds

Location: pkg/usagemetrics/middleware.go:62-70

collector, err := NewCollector()
if err != nil {
    logger.Warnf("Failed to initialize usage metrics: %v", err)
    // Continue - metrics are non-critical, register no-op middleware
    mw := &Middleware{}
    runner.AddMiddleware(config.Type, mw)
    return nil
}

Issue: If collector creation fails, a no-op middleware is registered. This is fine for fault tolerance, but the error is only logged as a warning. If there's a configuration issue, it might go unnoticed.

Recommendation: Consider logging at ERROR level or making it more visible that metrics are disabled due to an error.

Code Quality & Style

10. Magic Numbers in Timestamp Format

Location: pkg/usagemetrics/collector.go:95

previousDayTimestamp := c.currentDate + "T23:59:00Z"

Issue: The hardcoded "T23:59:00Z" creates a fake timestamp. While commented, this could be confusing.

Recommendation: Use a constant or helper function:

const endOfDayTime = "T23:59:00Z"
previousDayTimestamp := c.currentDate + endOfDayTime

11. Duplicate Code in Middleware Configuration

Location: pkg/runner/middleware.go:144-155 and pkg/runner/config_builder.go:571-577

The same logic for checking if metrics should be enabled exists in two places:

addUsageMetricsMiddleware() in pkg/runner/middleware.go
addCoreMiddlewares() in pkg/runner/config_builder.go

Recommendation: Consolidate this logic into a single place to avoid inconsistencies.

Documentation

12. Documentation is Excellent ✅

The documentation updates are comprehensive and well-structured:

docs/middleware.md - Great addition with clear explanation of opt-out mechanisms
docs/arch/02-core-concepts.md - Properly integrated into the middleware list
Inline code comments are thorough and explain the "why" behind decisions

One suggestion: Add a FAQ section addressing common questions:

"What data is collected?" (only counts, no user data)
"How can I verify metrics are disabled?" (check logs or config)
"Where is data sent?" (updates.stacklok.com)

Security & Privacy

13. Privacy Implementation is Strong ✅

The privacy-first design is commendable:

No user identifiable data collected
Anonymous instance ID per process (ephemeral)
Separate anonymous ID from updates.json (persistent across runs)
Multiple opt-out mechanisms
Auto-disabled in CI environments
Non-blocking - failures don't affect users

No security concerns identified.

Testing

14. Test Coverage is Good But Incomplete

What's tested well:

Basic collector operations ✅
Opt-out logic with various configurations ✅
User agent generation ✅
Middleware handler tool call detection ✅

What's missing:

Midnight boundary transitions ❌
Concurrent access patterns ❌
Network failure scenarios ❌
Goroutine lifecycle and cleanup ❌
Integration test for full middleware chain ❌

Recommendation: Add the missing test cases mentioned in issue #4.

Integration Points

15. Integration is Clean ✅

The middleware integration follows ToolHive's patterns well:

Clean factory function (CreateMiddleware)
Proper registration in GetSupportedMiddlewareFactories()
Configuration through PopulateMiddlewareConfigs() and WithMiddlewareFromFlags()
All three execution paths correctly read appConfig.DisableUsageMetrics:
- CLI: cmd/thv/app/run_flags.go:520
- API: pkg/api/v1/workload_service.go:243
- Operator: pkg/runner/middleware.go:91

No issues identified.

Performance

16. Performance Characteristics are Good ✅

Atomic operations for counter (no locks on hot path)
Non-blocking middleware (doesn't wait for flush)
Background goroutine for periodic flush (15min intervals)
Reasonable timeout (5s for API calls)

No performance concerns identified.

Recommendations Summary

Must fix before merge:

Fix goroutine leak in Start() method (Issue fix(typo): corrects readme #1)
Handle midnight boundary flush failures properly (Issue Do we want the container monitor? #2)
Document or fix anonymous ID race condition (Issue Bump golangci/golangci-lint-action from 2f856675483cb8b9378ee77ee0beb67955aca9d7 to 4696ba8babb6127d732c3c6dde519db15edab9ea #3)

Should fix before merge:
4. Add critical test cases (midnight, concurrency, failures) (Issue #4)
5. Add shutdown timeout handling (Issue #5)

Nice to have:
6. Clarify error handling and retry behavior (Issue #6)
7. Make flush interval configurable for testing (Issue #7)
8. Use distinct buildType for local builds (Issue #8)
9. Improve error visibility when collector fails (Issue #9)
10. Extract magic timestamp string to constant (Issue #10)
11. Consolidate duplicate middleware config logic (Issue #11)
12. Add FAQ to documentation (Issue #12)

Conclusion

This is a solid implementation of anonymous usage metrics with strong privacy protections. The main concerns are around edge cases (midnight transitions, goroutine management, error handling) and test coverage. With the critical issues addressed, this feature will be a valuable addition to ToolHive.

Overall assessment: Approve with requested changes for critical issues #1-3 and recommended improvements for issues #4-5.

| Branch: add-anonymous-usage-metrics

JAORMX · 2025-11-05T06:58:45Z

docs/middleware.md


 **Note**: This middleware is currently implemented but not registered in the supported middleware factories (`pkg/runner/middleware.go:15`). It can be used directly via the proxy command but is not available through the standard middleware configuration system.

-### 7. Authorization Middleware


what happened to this section?

Removed the duplication in the docs, as it's also at point 3 line 136. Should've have added a comment here tho (or a separate PR rather), thanks for checking!

JAORMX · 2025-11-05T07:00:58Z

pkg/updates/checker.go

+// Returns empty string if the file doesn't exist or doesn't contain an instance ID.
+// Use this for optional features like metrics that shouldn't trigger ID generation.
+// TODO this should probably be extracted into its own package to handle instance ID generation.
+func TryGetAnonymousID() (string, error) {


Can we cache this ID in memory for the duration of the process? this way, we'd avoid reading the file every 15 mins.

JAORMX · 2025-11-05T07:28:20Z

pkg/usagemetrics/types.go

+
+// getCurrentDateUTC returns the current date in YYYY-MM-DD format (UTC)
+func getCurrentDateUTC() string {
+	return time.Now().UTC().Format("2006-01-02")


why use a different format than RFC3339?

This date is only used to track the midnight flush, saves the day and checks on every flush if it is the next day; it was a bit easier to parse this way (no need to extract the date later, and just checks dates aren't equal), but I'm happy to change it if needed. This date is not sent or used outside of this process.

JAORMX · 2025-11-05T07:29:11Z

pkg/usagemetrics/collector.go

+	logger.Debugf("Flushing %d tool calls at %s", count, timestamp)
+
+	if err := c.client.SendMetrics(c.instanceID, record); err != nil {
+		logger.Warnf("Failed to send metrics: %v", err)


Can we tone down this warning? users often get confused by things like this. So, either we have a very descriptive warning about what happened and why this is not a problem, or we change the log level.

I agree, this probably should be debug.

dmjb · 2025-11-05T11:29:23Z

docs/arch/02-core-concepts.md


 **Middleware** is a composable layer in the request processing chain. Each middleware can inspect, modify, or reject requests.

-**Eight middleware types:**


Nitpick: We probably should get rid of this count. It is apparent from the numbered list that there are nine items.

dmjb · 2025-11-05T11:41:33Z

pkg/api/v1/workload_service.go

 		transportType = serverMetadata.GetTransport()
 	}

+	// Load application config for global settings


I wonder if we can inject the config into the parent struct instead of loading it in each function call.

Signed-off-by: lujunsan <[email protected]>

codecov · 2025-11-05T13:22:03Z

Codecov Report

❌ Patch coverage is 60.40609% with 78 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.50%. Comparing base (fd47262) to head (759a7a9).

Files with missing lines	Patch %	Lines
pkg/usagemetrics/collector.go	67.44%	25 Missing and 3 partials ⚠️
pkg/updates/checker.go	0.00%	15 Missing ⚠️
pkg/runner/middleware.go	0.00%	14 Missing ⚠️
pkg/usagemetrics/client.go	75.55%	6 Missing and 5 partials ⚠️
pkg/runner/config_builder.go	0.00%	4 Missing and 1 partial ⚠️
pkg/usagemetrics/middleware.go	80.76%	4 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2449      +/-   ##
==========================================
+ Coverage   55.46%   55.50%   +0.03%     
==========================================
  Files         284      288       +4     
  Lines       26705    26901     +196     
==========================================
+ Hits        14812    14931     +119     
- Misses      10534    10602      +68     
- Partials     1359     1368       +9

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Implement anonymous usage metrics collection with opt-out mechanism

0e7bae1

Signed-off-by: lujunsan <[email protected]>

lujunsan requested review from ChrisJBurns and dmjb November 4, 2025 12:31

JAORMX reviewed Nov 5, 2025

View reviewed changes

dmjb reviewed Nov 5, 2025

View reviewed changes

lujunsan added 3 commits November 5, 2025 14:00

Improve usage metrics performance and fix goroutine leak

00b8bbb

Signed-off-by: lujunsan <[email protected]>

Merge branch 'main' into add-anonymous-usage-metrics

1daef03

Fix test case where anonymous_id was not being found

759a7a9

Signed-off-by: lujunsan <[email protected]>

lujunsan requested review from JAORMX and dmjb November 5, 2025 14:05


		Note: This middleware is currently implemented but not registered in the supported middleware factories (`pkg/runner/middleware.go:15`). It can be used directly via the proxy command but is not available through the standard middleware configuration system.

		### 7. Authorization Middleware


		Middleware is a composable layer in the request processing chain. Each middleware can inspect, modify, or reject requests.

		Eight middleware types:

Implement anonymous usage metrics collection with opt-out mechanism #2449

Are you sure you want to change the base?

Implement anonymous usage metrics collection with opt-out mechanism #2449

Uh oh!

Conversation

lujunsan commented Nov 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Implementation

Core Metrics Package (pkg/usagemetrics)

Middleware Integration

Opt-Out Mechanism

Uh oh!

lujunsan commented Nov 4, 2025

Uh oh!

claude bot commented Nov 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review for PR #2449

Summary

Critical Issues

1. Potential Goroutine Leak on Rapid Create/Destroy

2. Midnight Boundary Reset Loses Recent Counts

3. Anonymous ID Race Condition

Major Issues

4. Insufficient Test Coverage for Critical Paths

5. No Shutdown Timeout Handling

Minor Issues

6. Inconsistent Error Handling in Flush

7. Fixed 15-Minute Flush Interval

8. UserAgent BuildType Fallback Logic

9. Middleware Factory Always Succeeds

Code Quality & Style

10. Magic Numbers in Timestamp Format

11. Duplicate Code in Middleware Configuration

Documentation

12. Documentation is Excellent ✅

Security & Privacy

13. Privacy Implementation is Strong ✅

Testing

14. Test Coverage is Good But Incomplete

Integration Points

15. Integration is Clean ✅

Performance

16. Performance Characteristics are Good ✅

Recommendations Summary

Conclusion

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Nov 5, 2025

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

lujunsan commented Nov 4, 2025 •

edited

Loading

Core Metrics Package (`pkg/usagemetrics`)

claude bot commented Nov 4, 2025 •

edited

Loading