(feat) SRE debugger - use case example #787

tejas-goyal · 2025-10-17T20:23:51Z

Overview

Adds a new GitHub workflow example for automated debugging and error analysis.

What's New

SRE Debugger: Agent-powered workflow that automatically debugs failures and generates detailed error analysis
Integrates with CI/CD pipelines for automated incident response
Generates structured error reports with root cause analysis

Key Features

Automatic error detection and analysis
Detailed debugging reports
Suggested fixes and remediation steps
Works locally or in GitHub Actions

Usage

Local testing:
```bash
export LLM_API_KEY="your-key"
uv run python examples/03_github_workflows/04_sre_debugger/test_local.py
```

GitHub Actions:
Copy `workflow.yml` to `.github/workflows/` and add `LLM_API_KEY` secret.

Use Cases

Automated incident response
CI/CD failure analysis
Production debugging
Error monitoring and alerting

blacksmith-sh · 2025-10-29T13:01:33Z

[Automatic Post]: It has been a while since there was any activity on this PR. @tejas-goyal, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

blacksmith-sh · 2025-11-04T13:02:11Z

[Automatic Post]: It has been a while since there was any activity on this PR. @tejas-goyal, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

enyst · 2025-11-04T20:28:50Z

Hey @tejas-goyal , thank you for the PR!

There are 2 workflows in the PR, was that intended?

tejas-goyal · 2025-11-04T22:34:19Z

> Hey @tejas-goyal , thank you for the PR!
>
> There are 2 workflows in the PR, was that intended?

Hi @enyst, you're right, there are two workflows here. The SRE debugger in this PR has since been replaced by the Datadog debugger example we have live. I wanted to merge my changelog generator example here. You can ignore the SRE example, or I can close this PR since I have the other one open as well: #785

tejas-goyal · 2025-11-05T16:12:36Z

🔧 CI Fix Available

I found and fixed the issue causing the GitHub Actions workflow to fail!

Problem

The PR review workflow was failing because it tried to check out the PR branch (feature/sre-debugger) from the main repository (OpenHands/software-agent-sdk) instead of the fork (tejas-goyal/agent-sdk). The branch exists only in the fork, so the checkout step failed with:

A branch or tag with the name 'feature/sre-debugger' could not be found

Solution

The fix is simple - we need to specify the repository when checking out the PR branch. Here's the required change to .github/workflows/pr-review-by-openhands.yml:

- name: Checkout PR repository
  uses: actions/checkout@v5
  with:
      # Fetch the full history to get the diff
      fetch-depth: 0
      path: pr-repo
      # Check out the feature branch so agent can inspect the PR changes
+     repository: ${{ github.event.pull_request.head.repo.full_name }}
      ref: ${{ github.event.pull_request.head.ref }}

How to Apply

To fix this, the PR author (@tejas-goyal) needs to add one line to the workflow file in their fork:

1. Edit `.github/workflows/pr-review-by-openhands.yml` in the `feature/sre-debugger` branch
2. Add the line `repository: ${{ github.event.pull_request.head.repo.full_name }}` as shown above
3. Commit and push the change

This will allow the workflow to check out the PR branch from the fork, and the CI should pass! 🎉
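To make the failure mode concrete, here is a minimal Python sketch of the decision the checkout step has to make. It assumes the standard `pull_request` event payload fields (`head.repo.full_name`, `head.ref`, `base.repo.full_name`); the function name `checkout_target` and the trimmed sample payload are hypothetical and only for illustration.

```python
from typing import TypedDict


class CheckoutTarget(TypedDict):
    repository: str
    ref: str
    is_fork: bool


def checkout_target(event: dict) -> CheckoutTarget:
    """Return the repository and ref that actually hold the PR's head branch."""
    pr = event["pull_request"]
    head_repo = pr["head"]["repo"]["full_name"]
    base_repo = pr["base"]["repo"]["full_name"]
    return {
        # For a fork PR the branch lives in the head repo, not the base repo.
        "repository": head_repo,
        "ref": pr["head"]["ref"],
        "is_fork": head_repo != base_repo,
    }


# Hypothetical payload, trimmed to the fields used above.
sample_event = {
    "pull_request": {
        "head": {
            "ref": "feature/sre-debugger",
            "repo": {"full_name": "tejas-goyal/agent-sdk"},
        },
        "base": {
            "ref": "main",
            "repo": {"full_name": "OpenHands/software-agent-sdk"},
        },
    }
}

target = checkout_target(sample_event)
print(target)
```

When `is_fork` is true, checking out `ref` from the base repository fails because the branch does not exist there, which matches the error quoted above.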

enyst · 2025-11-08T20:08:35Z

@OpenHands Do a /codereview-roasted on this PR, the SRE debugger part.

Once you are done, answer a question:

Read the full discussion here. Then read .github/workflows and find the datadog workflow and compare what it does with this SRE debugger. Be thorough, and focus on this: datadog is a proprietary/closed tool. I'd like us to have examples with open tools as well. How does this PR compare in general and on that issue?

openhands-ai · 2025-11-08T20:08:44Z

I'm on it! enyst can track my progress at all-hands.dev

openhands-ai · 2025-11-08T20:14:17Z

Summary of work

What I reviewed

Pulled PR (feat) SRE debugger - use case example #787 metadata and diff via GitHub API and focused on the SRE Debugger example under examples/03_github_workflows/04_sre_debugger.
Inspected .github/workflows in this repo snapshot; there is no Datadog workflow there. The Datadog example lives under examples/03_github_workflows/04_datadog_debugging with its own workflow.yml intended to be copied into a repo.

Codereview-roasted (SRE Debugger)

Critical issues
- Wrong install targets/paths in workflow.yml: installs from All-Hands-AI/agent-sdk with subdirectories openhands/sdk and openhands/tools, which does not match this repository’s OpenHands/software-agent-sdk structure (openhands-sdk, openhands-tools). This will break installs.
- Not analyzing the failing commit: workflow_run handler doesn’t check out the exact SHA/PR that failed; it will analyze the wrong code. Needs to resolve the failing run’s ref from the event payload.
- No target dependency install: runs pytest without installing the repo’s own deps, likely causing false failures or noise.
- Naive failure detection: string contains checks like "error"/"FAILED" on raw stdout instead of using pytest exit code and structured outputs (JSON/JUnit XML).
- PR/issue feedback path broken: attempting “comment on PR” checks github.event_name == 'pull_request', which won’t be true for workflow_run or workflow_dispatch.
- Unbounded agent run: conversation.run() has no step/time budget; risky for CI timeouts.
- sys.path.insert hack for prompt import; repo guidelines discourage this.
Improvement opportunities
- Prefer structured parsing (pytest --json-report or --junitxml) and clip tracebacks intentionally rather than tail-only truncation.
- Tighten permissions to exactly what’s needed; request pull-requests: write if commenting on PRs.
- Restrict tools for CI runs and add step/token/time limits for determinism.
- Align README promises with actual workflow behavior (install deps, analyze failing ref).
Verdict: Needs rework before merge for reliability in CI.
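As an illustration of the structured-parsing suggestion, the sketch below reads a JUnit XML report (the format pytest writes with `--junitxml=report.xml`) instead of searching stdout for substrings. The helper name `parse_junit_failures`, the `FailedCase` dataclass, and the sample report are assumptions made for this example; they are not taken from the PR.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class FailedCase:
    test_id: str
    kind: str  # "failure" or "error"
    message: str


def parse_junit_failures(xml_text: str) -> list[FailedCase]:
    """Collect failed and errored test cases from a JUnit XML report."""
    root = ET.fromstring(xml_text)
    failed: list[FailedCase] = []
    # iter() finds <testcase> at any depth, so the root may be
    # either <testsuites> or a single <testsuite>.
    for case in root.iter("testcase"):
        for kind in ("failure", "error"):
            node = case.find(kind)
            if node is not None:
                test_id = f"{case.get('classname', '')}::{case.get('name', '')}"
                failed.append(FailedCase(test_id, kind, node.get("message", "")))
    return failed


# Hypothetical report with one passing and one failing test.
SAMPLE_REPORT = """<testsuites>
  <testsuite name="pytest" errors="0" failures="1" tests="2">
    <testcase classname="tests.test_math" name="test_add" time="0.001"/>
    <testcase classname="tests.test_math" name="test_div" time="0.002">
      <failure message="ZeroDivisionError: division by zero">traceback text</failure>
    </testcase>
  </testsuite>
</testsuites>"""

failures = parse_junit_failures(SAMPLE_REPORT)
for item in failures:
    print(item.test_id, item.kind, item.message)
```

Combined with pytest's exit code (0 when all tests pass, 1 when some tests fail), this avoids false positives from log lines that merely contain the word "error".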

Comparison: Datadog example vs SRE Debugger

Triggering and scope
- SRE Debugger: Aims to auto-run on failed “Tests” (workflow_run) and manual dispatch; analyzes local pytest failures within one repo.
- Datadog: Manual dispatch; queries Datadog Logs/Error Tracking and performs multi-repo analysis; creates/updates GitHub issues and posts comments.
Robustness and integration
- SRE Debugger: Currently brittle due to ref checkout, dependency install, permissions, and failure detection gaps.
- Datadog: More mature flow—clear inputs, artifacts, issues lifecycle, multi-repo cloning, explicit secret handling. It fetches its script/template at runtime from main (should pin to a commit to avoid drift).
Open vs proprietary tooling
- Datadog relies on a proprietary service and API keys.
- SRE Debugger is vendor-neutral in concept (pytest output, local code). If hardened, it can serve as the “open tools” counterpart.

Open-tools angle and recommendations

Keep the Datadog example (valuable for teams already on Datadog).
Add open-source counterparts with equivalent ergonomics:
- Grafana Loki: Query logs via Loki HTTP API and run the same agent analysis.
- OpenSearch/Elasticsearch: Query logs via REST; identical analysis flow.
- Sentry (self-hosted): Use Sentry API for error aggregation analogous to Datadog’s Error Tracking.
- OpenTelemetry + Tempo/Jaeger: Trace-driven debugging variant.
- CI-native variant: Download and parse JUnit XML/pytest JSON artifacts from the failing job (no external vendor).
Elevate the SRE Debugger example to be the canonical open stack CI failure analyzer by fixing the critical issues above.
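For the Grafana Loki option, a rough sketch of the log-fetching half might look like the following. It assumes Loki's `query_range` HTTP endpoint and its `streams` response shape; the host, the LogQL selector, and the helper names are hypothetical, and no network call is made here. The agent analysis step is out of scope for this sketch.

```python
import json
from urllib.parse import urlencode


def build_loki_query_url(
    base_url: str, logql: str, start_ns: int, end_ns: int, limit: int = 100
) -> str:
    """Build a query_range URL for a Loki instance (timestamps in nanoseconds)."""
    params = urlencode(
        {
            "query": logql,
            "start": start_ns,
            "end": end_ns,
            "limit": limit,
            "direction": "backward",
        }
    )
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"


def extract_log_lines(response_body: str) -> list[str]:
    """Flatten a 'streams' result into a plain list of log lines."""
    payload = json.loads(response_body)
    lines: list[str] = []
    for stream in payload["data"]["result"]:
        for _timestamp_ns, line in stream["values"]:
            lines.append(line)
    return lines


# Hypothetical host and selector, for illustration only.
url = build_loki_query_url(
    "http://localhost:3100/",
    '{app="checkout"} |= "error"',
    start_ns=1730000000000000000,
    end_ns=1730003600000000000,
)

# Hypothetical response body in the shape assumed above.
SAMPLE_RESPONSE = json.dumps(
    {
        "status": "success",
        "data": {
            "resultType": "streams",
            "result": [
                {
                    "stream": {"app": "checkout"},
                    "values": [
                        ["1730000000000000000", "ERROR payment failed"],
                        ["1730000001000000000", "ERROR retry exhausted"],
                    ],
                }
            ],
        },
    }
)

log_lines = extract_log_lines(SAMPLE_RESPONSE)
print(url)
print(log_lines)
```

The extracted lines could then be passed to the same analysis prompt the Datadog example uses, which is what would give the open-source variant equivalent ergonomics.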

Concrete next steps (for SRE Debugger)

Fix workflow install targets to this repo and correct subdirectory names; pin to a commit SHA.
From workflow_run, resolve the failing job’s SHA/PR and check out that ref; optionally download failing test artifacts.
Install the target repo’s dependencies (detect and use requirements.txt/pyproject/uv lock).
Parse structured test reports and use pytest’s exit code; remove substring-based detection.
Add time/step/token limits and restrict tools for CI runs.
Request appropriate permissions and reliably map back to the failing PR for comments.
Replace sys.path.insert with reading the prompt template from file.
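The second step above (resolving the failing run's SHA/PR) can be sketched as follows, assuming the `workflow_run` event payload exposes `head_sha`, `head_branch`, `conclusion`, and `pull_requests`, and that the runner provides the payload path in `GITHUB_EVENT_PATH`. The function names and the sample payload are hypothetical.

```python
import json
import os
from pathlib import Path


def load_event() -> dict:
    """Read the event payload that GitHub Actions writes for the current run."""
    return json.loads(Path(os.environ["GITHUB_EVENT_PATH"]).read_text())


def resolve_failing_ref(event: dict) -> dict:
    """Extract what to check out and where to report from a workflow_run event."""
    run = event["workflow_run"]
    # The list may be empty, so callers need a fallback for PR comments.
    prs = run.get("pull_requests") or []
    return {
        "sha": run["head_sha"],
        "branch": run.get("head_branch"),
        "failed": run.get("conclusion") == "failure",
        "pr_numbers": [pr["number"] for pr in prs],
    }


# Hypothetical payload, trimmed to the fields used above.
sample_event = {
    "workflow_run": {
        "head_sha": "0123456789abcdef0123456789abcdef01234567",
        "head_branch": "feature/sre-debugger",
        "conclusion": "failure",
        "pull_requests": [{"number": 787}],
    }
}

resolved = resolve_failing_ref(sample_event)
print(resolved)
```

Checking out `resolved["sha"]` rather than the default branch is what ensures the agent analyzes the code that actually failed; inside a workflow, `load_event()` would supply the real payload.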

Bottom line

The SRE Debugger is directionally the right open alternative to Datadog, but it needs hardening to be dependable in CI.
The Datadog example is more complete operationally today but is tied to a proprietary tool.
We should keep Datadog and invest in the SRE Debugger plus a Loki/OpenSearch/Sentry/OpenTelemetry set of examples to provide first-class open tooling coverage.


neubig

Thanks for this!

Could you remove the weekly changelog workflow from this PR?
Please check the datadog debugging workflow and see if you can roughly match the style: https://github.com/OpenHands/software-agent-sdk/tree/main/examples/03_github_workflows/04_datadog_debugging

Tejas Goyal added 2 commits October 17, 2025 15:53

test

6ff2b30

Add SRE debugger workflow example

83925a3

neubig self-requested a review October 19, 2025 15:06

neubig added the review-this label (triggers a PR review by OpenHands) Oct 19, 2025

Merge branch 'main' into feature/sre-debugger

29e7467

neubig added and removed the review-this label (triggers a PR review by OpenHands) Oct 19, 2025

neubig mentioned this pull request Oct 19, 2025

OpenHands PR review workflow is failing on forks #809

Closed

simonrosenberg added and removed the review-this label (triggers a PR review by OpenHands) Oct 21, 2025

Merge branch 'main' into feature/sre-debugger

1542311

simonrosenberg added and removed the review-this label (triggers a PR review by OpenHands) Oct 21, 2025

This was referenced Oct 21, 2025

Test: Demonstrate fork fix works with SRE debugger feature #842

Closed

Test: Fork fix validation - SRE debugger feature with fork workflow fix #843

Closed

Merge branch 'main' into feature/sre-debugger

58630f2

neubig reviewed Nov 10, 2025
