Skip to content

Conversation

@lujunsan
Copy link
Contributor

@lujunsan lujunsan commented Nov 4, 2025

Overview

Implements anonymous usage metrics (tool calls) collection for ToolHive. Users can opt out via persistent config or environment variable.

Implementation

Core Metrics Package (pkg/usagemetrics)

  • Collects tool call counts using atomic operations
  • Flushes metrics to API endpoint periodically (every 15 minutes)
  • Handles midnight UTC boundary for daily count resets
  • Manages background flush goroutine lifecycle
  • Sends anonymous instance ID with metrics (no personal data)

Middleware Integration

  • New usagemetrics middleware type
  • Intercepts tools/call requests from parsed MCP data
  • Non-blocking: failures don't affect request processing
  • Automatically disabled in CI environments

Opt-Out Mechanism

Users can opt out using either persistent config or environment variable:

# Via config (persistent)
thv config usage-metrics disable

# Via environment variable (session-only)
export TOOLHIVE_USAGE_METRICS_ENABLED=false

Metrics are enabled by default and automatically disabled in CI environments.

Integration Points

All workload execution paths read from Application Config:

  • CLI path (cmd/thv/app/run_flags.go) - reads appConfig.DisableUsageMetrics
  • API path (pkg/api/v1/workload_service.go) - reads appConfig.DisableUsageMetrics
  • Operator/middleware path (pkg/runner/middleware.go) - reads appConfig.DisableUsageMetrics

Documentation

  • Added middleware documentation in docs/middleware.md
  • Updated architecture docs in docs/arch/02-core-concepts.md
  • CLI documentation auto-generated via task docs

Privacy

  • Metrics are completely anonymous (no user identifiers)
  • Only collects tool call counts and anonymous data
  • Instance ID is randomly generated per process
  • Users can easily opt out via config or environment variable

@lujunsan lujunsan requested review from ChrisJBurns and dmjb November 4, 2025 12:31
@lujunsan
Copy link
Contributor Author

lujunsan commented Nov 4, 2025

@claude review this PR

@claude
Copy link
Contributor

claude bot commented Nov 4, 2025

Claude finished @lujunsan's task —— View job


Code Review for PR #2449

I've completed a thorough review of the anonymous usage metrics implementation. Overall, this is a well-designed feature with good attention to privacy, error handling, and integration points. Below are my findings organized by category.


Summary

This PR adds anonymous usage metrics collection that tracks tool calls to help ToolHive development. The implementation is solid with a few areas that could benefit from improvements around error handling, testing, and edge cases.

Key strengths:

  • Strong privacy-first design with multiple opt-out mechanisms
  • Non-blocking architecture - failures don't affect user requests
  • Clean middleware integration pattern
  • Good test coverage for core functionality
  • Well-documented in both inline comments and external docs

Areas for improvement:

  • Error handling in midnight boundary logic
  • Goroutine leak potential on rapid create/destroy cycles
  • Test coverage gaps (midnight transitions, concurrent access)
  • Anonymous ID initialization race conditions

Critical Issues

1. Potential Goroutine Leak on Rapid Create/Destroy

Location: pkg/usagemetrics/collector.go:63-65

The Start() method launches a goroutine without checking if one is already running:

func (c *Collector) Start() {
    c.started.Store(true)
    go c.flushLoop()
}

Issue: If Start() is called multiple times (e.g., during rapid workload restarts), each call creates a new goroutine. The started flag doesn't prevent this.

Recommendation: Add a check to prevent multiple goroutines:

func (c *Collector) Start() {
    if c.started.Swap(true) {
        return // Already started
    }
    go c.flushLoop()
}

2. Midnight Boundary Reset Loses Recent Counts

Location: pkg/usagemetrics/collector.go:83-114

The midnight boundary logic flushes the previous day's count but includes up to 15 minutes of calls from the current day (comment acknowledges this). While the comment says this is acceptable, there's a larger issue:

// Reset for the new day
c.currentDate = currentDate
c.counter.Store(0)
return nil

Issue: If the flush to the API fails (network error, API down), the count is permanently lost because we reset to 0 and return nil. The error is only logged as a warning.

Recommendation: Consider one of these approaches:

  1. Don't reset the counter if the flush fails (track the last successfully flushed count)
  2. Accumulate failed counts in a separate field for retry
  3. Accept the data loss but document it more prominently

3. Anonymous ID Race Condition

Location: pkg/usagemetrics/client.go:49-59

anonymousID, err := updates.TryGetAnonymousID()
if err != nil {
    return fmt.Errorf("failed to get anonymous ID: %w", err)
}

// Skip sending if anonymous ID is not initialized yet
if anonymousID == "" {
    logger.Debugf("Skipping metrics send - anonymous ID not yet initialized")
    return nil
}

Issue: There's a race condition between when ToolHive starts and when the update checker creates the anonymous ID. If metrics try to flush before the update checker runs, metrics are silently dropped with just a debug log.

Recommendation:

  1. Document this behavior more prominently (currently just a debug log)
  2. Consider generating the anonymous ID eagerly during collector initialization
  3. Or retry later if the ID isn't available yet

Major Issues

4. Insufficient Test Coverage for Critical Paths

Missing test cases:

  1. Midnight boundary transitions - No test for the date rollover logic (pkg/usagemetrics/collector_test.go)
  2. Concurrent access - No test for multiple goroutines calling IncrementToolCall() simultaneously
  3. Flush failures - No test for API failure scenarios and retry behavior
  4. Goroutine cleanup - No test verifying the flush loop properly stops on shutdown

Recommendation: Add tests for these scenarios:

func TestCollector_MidnightBoundaryTransition(t *testing.T) {
    // Test count flush and reset across midnight UTC
}

func TestCollector_ConcurrentIncrements(t *testing.T) {
    // Test thread-safety with multiple goroutines
}

func TestCollector_FlushFailureDoesNotBlock(t *testing.T) {
    // Test that flush failures don't break the collector
}

5. No Shutdown Timeout Handling

Location: pkg/usagemetrics/collector.go:141-167

func (c *Collector) Shutdown(ctx context.Context) {
    // ...
    select {
    case <-c.doneCh:
        logger.Debugf("Collector stopped cleanly")
    case <-ctx.Done():
        logger.Warnf("Collector shutdown timed out: %v", ctx.Err())
    }
}

Issue: If shutdown times out, the goroutine continues running. The middleware's shutdownTimeout is 10 seconds, but there's no forced cleanup if the goroutine is stuck.

Recommendation: Consider adding a way to force-stop the goroutine or document that zombie goroutines are acceptable for this non-critical feature.


Minor Issues

6. Inconsistent Error Handling in Flush

Location: pkg/usagemetrics/collector.go:131-135

if err := c.client.SendMetrics(c.instanceID, record); err != nil {
    logger.Warnf("Failed to send metrics: %v", err)
    // Don't return error - we'll retry on next flush
    return nil
}

This returns nil instead of the error, but the comment says "we'll retry on next flush." However, the count isn't preserved for retry - it's still in the counter and will be sent again on the next flush (potentially duplicating data).

Recommendation: Clarify the behavior:

  • If duplicate counts are acceptable, document it
  • If not, track the last successfully sent count

7. Fixed 15-Minute Flush Interval

Location: pkg/usagemetrics/collector.go:15

const flushInterval = 15 * time.Minute

Issue: This is hardcoded and not configurable. For testing or high-frequency scenarios, you might want to adjust this.

Recommendation: Consider making this configurable via environment variable (with 15min default) for testing purposes.


8. UserAgent BuildType Fallback Logic

Location: pkg/usagemetrics/client.go:103-107

buildType := versions.BuildType
if buildType == "" {
    buildType = "local_build"
}

Issue: The local_build fallback is fine, but it's not clear if this should be treated differently from official builds on the backend. If metrics from local builds should be filtered out, there's no way to identify them distinctly.

Recommendation: Consider using a more distinct value like dev_build or unknown_build to make filtering easier.


9. Middleware Factory Always Succeeds

Location: pkg/usagemetrics/middleware.go:62-70

collector, err := NewCollector()
if err != nil {
    logger.Warnf("Failed to initialize usage metrics: %v", err)
    // Continue - metrics are non-critical, register no-op middleware
    mw := &Middleware{}
    runner.AddMiddleware(config.Type, mw)
    return nil
}

Issue: If collector creation fails, a no-op middleware is registered. This is fine for fault tolerance, but the error is only logged as a warning. If there's a configuration issue, it might go unnoticed.

Recommendation: Consider logging at ERROR level or making it more visible that metrics are disabled due to an error.


Code Quality & Style

10. Magic Numbers in Timestamp Format

Location: pkg/usagemetrics/collector.go:95

previousDayTimestamp := c.currentDate + "T23:59:00Z"

Issue: The hardcoded "T23:59:00Z" creates a fake timestamp. While commented, this could be confusing.

Recommendation: Use a constant or helper function:

const endOfDayTime = "T23:59:00Z"
previousDayTimestamp := c.currentDate + endOfDayTime

11. Duplicate Code in Middleware Configuration

Location: pkg/runner/middleware.go:144-155 and pkg/runner/config_builder.go:571-577

The same logic for checking if metrics should be enabled exists in two places:

  • addUsageMetricsMiddleware() in pkg/runner/middleware.go
  • addCoreMiddlewares() in pkg/runner/config_builder.go

Recommendation: Consolidate this logic into a single place to avoid inconsistencies.


Documentation

12. Documentation is Excellent ✅

The documentation updates are comprehensive and well-structured:

  • docs/middleware.md - Great addition with clear explanation of opt-out mechanisms
  • docs/arch/02-core-concepts.md - Properly integrated into the middleware list
  • Inline code comments are thorough and explain the "why" behind decisions

One suggestion: Add a FAQ section addressing common questions:

  • "What data is collected?" (only counts, no user data)
  • "How can I verify metrics are disabled?" (check logs or config)
  • "Where is data sent?" (updates.stacklok.com)

Security & Privacy

13. Privacy Implementation is Strong ✅

The privacy-first design is commendable:

  • No user identifiable data collected
  • Anonymous instance ID per process (ephemeral)
  • Separate anonymous ID from updates.json (persistent across runs)
  • Multiple opt-out mechanisms
  • Auto-disabled in CI environments
  • Non-blocking - failures don't affect users

No security concerns identified.


Testing

14. Test Coverage is Good But Incomplete

What's tested well:

  • Basic collector operations ✅
  • Opt-out logic with various configurations ✅
  • User agent generation ✅
  • Middleware handler tool call detection ✅

What's missing:

  • Midnight boundary transitions ❌
  • Concurrent access patterns ❌
  • Network failure scenarios ❌
  • Goroutine lifecycle and cleanup ❌
  • Integration test for full middleware chain ❌

Recommendation: Add the missing test cases mentioned in issue #4.


Integration Points

15. Integration is Clean ✅

The middleware integration follows ToolHive's patterns well:

  • Clean factory function (CreateMiddleware)
  • Proper registration in GetSupportedMiddlewareFactories()
  • Configuration through PopulateMiddlewareConfigs() and WithMiddlewareFromFlags()
  • All three execution paths correctly read appConfig.DisableUsageMetrics:
    • CLI: cmd/thv/app/run_flags.go:520
    • API: pkg/api/v1/workload_service.go:243
    • Operator: pkg/runner/middleware.go:91

No issues identified.


Performance

16. Performance Characteristics are Good ✅

  • Atomic operations for counter (no locks on hot path)
  • Non-blocking middleware (doesn't wait for flush)
  • Background goroutine for periodic flush (15min intervals)
  • Reasonable timeout (5s for API calls)

No performance concerns identified.


Recommendations Summary

Must fix before merge:

  1. Fix goroutine leak in Start() method (Issue fix(typo): corrects readme #1)
  2. Handle midnight boundary flush failures properly (Issue Do we want the container monitor? #2)
  3. Document or fix anonymous ID race condition (Issue Bump golangci/golangci-lint-action from 2f856675483cb8b9378ee77ee0beb67955aca9d7 to 4696ba8babb6127d732c3c6dde519db15edab9ea #3)

Should fix before merge:
4. Add critical test cases (midnight, concurrency, failures) (Issue #4)
5. Add shutdown timeout handling (Issue #5)

Nice to have:
6. Clarify error handling and retry behavior (Issue #6)
7. Make flush interval configurable for testing (Issue #7)
8. Use distinct buildType for local builds (Issue #8)
9. Improve error visibility when collector fails (Issue #9)
10. Extract magic timestamp string to constant (Issue #10)
11. Consolidate duplicate middleware config logic (Issue #11)
12. Add FAQ to documentation (Issue #12)


Conclusion

This is a solid implementation of anonymous usage metrics with strong privacy protections. The main concerns are around edge cases (midnight transitions, goroutine management, error handling) and test coverage. With the critical issues addressed, this feature will be a valuable addition to ToolHive.

Overall assessment: Approve with requested changes for critical issues #1-3 and recommended improvements for issues #4-5.


| Branch: add-anonymous-usage-metrics


**Note**: This middleware is currently implemented but not registered in the supported middleware factories (`pkg/runner/middleware.go:15`). It can be used directly via the proxy command but is not available through the standard middleware configuration system.

### 7. Authorization Middleware
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happened to this section?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the duplication in the docs, as it's also at point 3 line 136. Should've have added a comment here tho (or a separate PR rather), thanks for checking!

// Returns empty string if the file doesn't exist or doesn't contain an instance ID.
// Use this for optional features like metrics that shouldn't trigger ID generation.
// TODO this should probably be extracted into its own package to handle instance ID generation.
func TryGetAnonymousID() (string, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we cache this ID in memory for the duration of the process? this way, we'd avoid reading the file every 15 mins.


// getCurrentDateUTC returns the current date in YYYY-MM-DD format (UTC)
func getCurrentDateUTC() string {
return time.Now().UTC().Format("2006-01-02")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why use a different format than RFC3339?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This date is only used to track the midnight flush, saves the day and checks on every flush if it is the next day; it was a bit easier to parse this way (no need to extract the date later, and just checks dates aren't equal), but I'm happy to change it if needed. This date is not sent or used outside of this process.

logger.Debugf("Flushing %d tool calls at %s", count, timestamp)

if err := c.client.SendMetrics(c.instanceID, record); err != nil {
logger.Warnf("Failed to send metrics: %v", err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we tone down this warning? users often get confused by things like this. So, either we have a very descriptive warning about what happened and why this is not a problem, or we change the log level.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, this probably should be debug.


**Middleware** is a composable layer in the request processing chain. Each middleware can inspect, modify, or reject requests.

**Eight middleware types:**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: We probably should get rid of this count. It is apparent from the numbered list that there are nine items.

transportType = serverMetadata.GetTransport()
}

// Load application config for global settings
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can inject the config into the parent struct instead of loading it in each function call.

@codecov
Copy link

codecov bot commented Nov 5, 2025

Codecov Report

❌ Patch coverage is 60.40609% with 78 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.50%. Comparing base (fd47262) to head (759a7a9).

Files with missing lines Patch % Lines
pkg/usagemetrics/collector.go 67.44% 25 Missing and 3 partials ⚠️
pkg/updates/checker.go 0.00% 15 Missing ⚠️
pkg/runner/middleware.go 0.00% 14 Missing ⚠️
pkg/usagemetrics/client.go 75.55% 6 Missing and 5 partials ⚠️
pkg/runner/config_builder.go 0.00% 4 Missing and 1 partial ⚠️
pkg/usagemetrics/middleware.go 80.76% 4 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2449      +/-   ##
==========================================
+ Coverage   55.46%   55.50%   +0.03%     
==========================================
  Files         284      288       +4     
  Lines       26705    26901     +196     
==========================================
+ Hits        14812    14931     +119     
- Misses      10534    10602      +68     
- Partials     1359     1368       +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@lujunsan lujunsan requested review from JAORMX and dmjb November 5, 2025 14:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants