@tejas-goyal

Overview

Adds a new GitHub workflow example for automated debugging and error analysis.

What's New

  • SRE Debugger: Agent-powered workflow that automatically debugs failures and generates detailed error analysis
  • Integrates with CI/CD pipelines for automated incident response
  • Generates structured error reports with root cause analysis

Key Features

  • Automatic error detection and analysis
  • Detailed debugging reports
  • Suggested fixes and remediation steps
  • Works locally or in GitHub Actions

Usage

Local testing:
```bash
export LLM_API_KEY="your-key"
uv run python examples/03_github_workflows/04_sre_debugger/test_local.py
```

GitHub Actions:
Copy `workflow.yml` to `.github/workflows/` and add an `LLM_API_KEY` secret.

Use Cases

  • Automated incident response
  • CI/CD failure analysis
  • Production debugging
  • Error monitoring and alerting

@neubig neubig self-requested a review October 19, 2025 15:06
@neubig neubig added the review-this This label triggers a PR review by OpenHands label Oct 19, 2025
@neubig neubig added review-this This label triggers a PR review by OpenHands and removed review-this This label triggers a PR review by OpenHands labels Oct 19, 2025
@simonrosenberg simonrosenberg added review-this This label triggers a PR review by OpenHands and removed review-this This label triggers a PR review by OpenHands labels Oct 21, 2025
@simonrosenberg simonrosenberg added review-this This label triggers a PR review by OpenHands and removed review-this This label triggers a PR review by OpenHands labels Oct 21, 2025
@blacksmith-sh
Contributor

blacksmith-sh bot commented Oct 29, 2025

[Automatic Post]: It has been a while since there was any activity on this PR. @tejas-goyal, are you still working on it? If so, please go ahead; if not, please request review, close it, or ask someone else to follow up.

@blacksmith-sh
Contributor

blacksmith-sh bot commented Nov 4, 2025

[Automatic Post]: It has been a while since there was any activity on this PR. @tejas-goyal, are you still working on it? If so, please go ahead; if not, please request review, close it, or ask someone else to follow up.

@enyst
Collaborator

enyst commented Nov 4, 2025

Hey @tejas-goyal, thank you for the PR!

There are 2 workflows in the PR; was that intended?

@tejas-goyal
Author

> Hey @tejas-goyal, thank you for the PR!
>
> There are 2 workflows in the PR; was that intended?

Hi @enyst, you're right, there are two workflows here. The SRE debugger in this PR has since been replaced by the Datadog debugger example that is now live. I wanted to merge my changelog generator example here. You can ignore the SRE example, or I can close this PR, since I also have the other one open: #785

Author

🔧 CI Fix Available

I found and fixed the issue causing the GitHub Actions workflow to fail!

Problem

The PR review workflow was failing because it tried to check out the PR branch (feature/sre-debugger) from the main repository (OpenHands/software-agent-sdk) instead of the fork (tejas-goyal/agent-sdk). The branch exists only in the fork, so the checkout step failed with:

A branch or tag with the name 'feature/sre-debugger' could not be found

Solution

The fix is simple: specify the repository when checking out the PR branch. Here's the required change to .github/workflows/pr-review-by-openhands.yml:

- name: Checkout PR repository
  uses: actions/checkout@v5
  with:
      # Fetch the full history to get the diff
      fetch-depth: 0
      path: pr-repo
      # Check out the feature branch so agent can inspect the PR changes
+     repository: ${{ github.event.pull_request.head.repo.full_name }}
      ref: ${{ github.event.pull_request.head.ref }}

How to Apply

To fix this, the PR author (@tejas-goyal) needs to add one line to the workflow file in their fork:

  1. Edit .github/workflows/pr-review-by-openhands.yml in the feature/sre-debugger branch
  2. Add the line repository: ${{ github.event.pull_request.head.repo.full_name }} as shown above
  3. Commit and push the change

This will allow the workflow to check out the PR branch from the fork, and the CI should then pass! 🎉

@enyst
Collaborator

enyst commented Nov 8, 2025

@OpenHands Do a /codereview-roasted on this PR, the SRE debugger part.

Once you are done, answer a question:

Read the full discussion here. Then read .github/workflows, find the Datadog workflow, and compare what it does with this SRE debugger. Be thorough, and focus on this: Datadog is a proprietary/closed tool. I'd like us to have examples with open tools as well. How does this PR compare in general and on that issue?

@openhands-ai

openhands-ai bot commented Nov 8, 2025

I'm on it! enyst can track my progress at all-hands.dev

@openhands-ai

openhands-ai bot commented Nov 8, 2025

Summary of work

What I reviewed

  • Pulled the metadata and diff for PR #787, "(feat) SRE debugger - use case example", via the GitHub API, and focused on the SRE Debugger example under examples/03_github_workflows/04_sre_debugger.
  • Inspected .github/workflows in this repo snapshot; there is no Datadog workflow there. The Datadog example lives under examples/03_github_workflows/04_datadog_debugging with its own workflow.yml intended to be copied into a repo.

Codereview-roasted (SRE Debugger)

  • Critical issues
    • Wrong install targets/paths in workflow.yml: installs from All-Hands-AI/agent-sdk with subdirectories openhands/sdk and openhands/tools, which does not match this repository’s OpenHands/software-agent-sdk structure (openhands-sdk, openhands-tools). This will break installs.
    • Not analyzing the failing commit: workflow_run handler doesn’t check out the exact SHA/PR that failed; it will analyze the wrong code. Needs to resolve the failing run’s ref from the event payload.
    • No target dependency install: runs pytest without installing the repo’s own deps, likely causing false failures or noise.
    • Naive failure detection: string contains checks like "error"/"FAILED" on raw stdout instead of using pytest exit code and structured outputs (JSON/JUnit XML).
    • PR/issue feedback path broken: attempting “comment on PR” checks github.event_name == 'pull_request', which won’t be true for workflow_run or workflow_dispatch.
    • Unbounded agent run: conversation.run() has no step/time budget; risky for CI timeouts.
    • sys.path.insert hack for prompt import; repo guidelines discourage this.
  • Improvement opportunities
    • Prefer structured parsing (pytest --json-report or --junitxml) and clip tracebacks intentionally rather than tail-only truncation.
    • Tighten permissions to exactly what’s needed; request pull-requests: write if commenting on PRs.
    • Restrict tools for CI runs and add step/token/time limits for determinism.
    • Align README promises with actual workflow behavior (install deps, analyze failing ref).
  • Verdict: Needs rework before merge for reliability in CI.
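
To illustrate the "not analyzing the failing commit" point, here is a minimal sketch of resolving the failing ref from a workflow_run event payload. The function names and the sample payload are hypothetical and not taken from the PR; the field names (head_sha, head_branch, head_repository, pull_requests, conclusion) are the ones GitHub documents for the workflow_run object. It assumes the payload is read from the file that GITHUB_EVENT_PATH points to.

```python
import json
import os
from typing import Any, Optional


def load_event(path: Optional[str] = None) -> dict[str, Any]:
    """Load the webhook payload that GitHub Actions writes to disk."""
    path = path or os.environ["GITHUB_EVENT_PATH"]
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def resolve_failing_ref(event: dict[str, Any]) -> dict[str, Any]:
    """Extract the commit, repository, and PR that a workflow_run event refers to."""
    run = event.get("workflow_run") or {}
    head_repo = run.get("head_repository") or {}
    prs = run.get("pull_requests") or []
    return {
        "sha": run.get("head_sha"),
        "branch": run.get("head_branch"),
        # For PRs from forks this is the fork, not the base repository.
        "repository": head_repo.get("full_name"),
        # The list can be empty (for example for fork PRs), so do not assume a PR.
        "pr_number": prs[0]["number"] if prs else None,
        "failed": run.get("conclusion") == "failure",
    }


# Hypothetical, heavily trimmed payload for illustration only.
sample_event = {
    "workflow_run": {
        "conclusion": "failure",
        "head_sha": "0123abc",
        "head_branch": "feature/sre-debugger",
        "head_repository": {"full_name": "tejas-goyal/agent-sdk"},
        "pull_requests": [],
    }
}

target = resolve_failing_ref(sample_event)
print(target)
```

The returned sha and repository could then be passed to the checkout step so the analysis runs against the code that actually failed. When pr_number is None, the workflow would need another way to map the run back to a PR before commenting.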

Comparison: Datadog example vs SRE Debugger

  • Triggering and scope
    • SRE Debugger: Aims to auto-run on failed “Tests” (workflow_run) and manual dispatch; analyzes local pytest failures within one repo.
    • Datadog: Manual dispatch; queries Datadog Logs/Error Tracking and performs multi-repo analysis; creates/updates GitHub issues and posts comments.
  • Robustness and integration
    • SRE Debugger: Currently brittle due to ref checkout, dependency install, permissions, and failure detection gaps.
    • Datadog: More mature flow—clear inputs, artifacts, issues lifecycle, multi-repo cloning, explicit secret handling. It fetches its script/template at runtime from main (should pin to a commit to avoid drift).
  • Open vs proprietary tooling
    • Datadog relies on a proprietary service and API keys.
    • SRE Debugger is vendor-neutral in concept (pytest output, local code). If hardened, it can serve as the “open tools” counterpart.

Open-tools angle and recommendations

  • Keep the Datadog example (valuable for teams already on Datadog).
  • Add open-source counterparts with equivalent ergonomics:
    • Grafana Loki: Query logs via Loki HTTP API and run the same agent analysis.
    • OpenSearch/Elasticsearch: Query logs via REST; identical analysis flow.
    • Sentry (self-hosted): Use Sentry API for error aggregation analogous to Datadog’s Error Tracking.
    • OpenTelemetry + Tempo/Jaeger: Trace-driven debugging variant.
    • CI-native variant: Download and parse JUnit XML/pytest JSON artifacts from the failing job (no external vendor).
  • Elevate the SRE Debugger example to be the canonical open stack CI failure analyzer by fixing the critical issues above.
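
As a rough idea of what the Loki counterpart could start from, the sketch below only builds a query URL for Loki's /loki/api/v1/query_range endpoint. The host, the LogQL selector, and the function name are assumptions for illustration. Authentication, the HTTP call itself, and handing the results to the agent are left out.

```python
from urllib.parse import urlencode


def build_loki_query_url(
    base_url: str,
    logql: str,
    start_ns: int,
    end_ns: int,
    limit: int = 100,
) -> str:
    """Build a query_range URL for a Loki instance (no request is sent)."""
    params = {
        "query": logql,
        "start": str(start_ns),  # Unix epoch in nanoseconds
        "end": str(end_ns),
        "limit": str(limit),
        "direction": "backward",  # newest entries first
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


# Hypothetical host and label selector.
url = build_loki_query_url(
    "http://loki.example:3100/",
    '{app="api"} |= "error"',
    start_ns=1_700_000_000_000_000_000,
    end_ns=1_700_000_600_000_000_000,
    limit=50,
)
print(url)
```

The same shape (build a query, fetch logs, pass them to the agent) would apply to the OpenSearch and Sentry variants, with only the query layer changing.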

Concrete next steps (for SRE Debugger)

  • Fix workflow install targets to this repo and correct subdirectory names; pin to a commit SHA.
  • From workflow_run, resolve the failing job’s SHA/PR and check out that ref; optionally download failing test artifacts.
  • Install the target repo’s dependencies (detect and use requirements.txt/pyproject/uv lock).
  • Parse structured test reports and use pytest’s exit code; remove substring-based detection.
  • Add time/step/token limits and restrict tools for CI runs.
  • Request appropriate permissions and reliably map back to the failing PR for comments.
  • Replace sys.path.insert with reading the prompt template from file.
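
The structured-report step could look roughly like the sketch below, which parses a JUnit XML report (as written by pytest --junitxml) instead of searching stdout for substrings. The FailedCase class, the function name, the clipping policy, and the sample XML are all hypothetical. The clipping keeps the head and tail of long tracebacks rather than the tail only.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class FailedCase:
    test_id: str
    kind: str  # "failure" or "error"
    message: str
    details: str


def parse_junit_failures(xml_text: str, max_detail_chars: int = 2000) -> list[FailedCase]:
    """Return one FailedCase per <failure>/<error> element in a JUnit XML report."""
    root = ET.fromstring(xml_text)
    failed: list[FailedCase] = []
    for case in root.iter("testcase"):
        for kind in ("failure", "error"):
            node = case.find(kind)
            if node is None:
                continue
            details = (node.text or "").strip()
            if len(details) > max_detail_chars:
                # Keep both ends of the traceback instead of the tail only.
                half = max_detail_chars // 2
                details = details[:half] + "\n...[clipped]...\n" + details[-half:]
            failed.append(
                FailedCase(
                    test_id=f"{case.get('classname', '')}::{case.get('name', '')}",
                    kind=kind,
                    message=node.get("message", ""),
                    details=details,
                )
            )
    return failed


# Hypothetical report for illustration only.
SAMPLE_XML = """<testsuites>
  <testsuite name="pytest" tests="2" failures="1" errors="0">
    <testcase classname="tests.test_math" name="test_add" time="0.001"/>
    <testcase classname="tests.test_math" name="test_div" time="0.002">
      <failure message="ZeroDivisionError: division by zero">Traceback text here</failure>
    </testcase>
  </testsuite>
</testsuites>"""

failures = parse_junit_failures(SAMPLE_XML)
for item in failures:
    print(item.test_id, item.kind, item.message)
```

In the workflow this would be paired with pytest's exit code (1 means tests ran and some failed), so the agent is invoked only when there is a real failure to analyze.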

Bottom line

  • The SRE Debugger is directionally the right open alternative to Datadog, but it needs hardening to be dependable in CI.
  • The Datadog example is more complete operationally today but is tied to a proprietary tool.
  • We should keep Datadog and invest in the SRE Debugger plus a Loki/OpenSearch/Sentry/OpenTelemetry set of examples to provide first-class open tooling coverage.

Contributor

@neubig neubig left a comment

Thanks for this!

  1. Could you remove the weekly changelog workflow from this PR?
  2. Please check the Datadog debugging workflow and see if you can roughly match its style: https://github.com/OpenHands/software-agent-sdk/tree/main/examples/03_github_workflows/04_datadog_debugging
