Conversation

Collaborator

@nherment nherment commented Oct 13, 2025

Ensures SSE messages are sent to the UI for tool results when tools are approved or denied. Otherwise the UI does not get updated.

Contributor

coderabbitai bot commented Oct 13, 2025

Walkthrough

Processing tool-approval decisions now produces both updated messages and TOOL_RESULT stream events. call_stream accepts tool_decisions, yields the pre-processed tool-result events, then continues normal streaming. The server forwards tool_decisions only into the streaming path. Tests were adapted to the new return shape.

Changes

Cohort / File(s) — Summary

Tool-calling LLM: decisions & streaming — holmes/core/tool_calling_llm.py
  • process_tool_decisions(...) -> tuple[list[dict], list[StreamMessage]]: returns (updated_messages, events) and emits TOOL_RESULT stream events for approved/denied tool calls.
  • call_stream(..., tool_decisions: List[ToolApprovalDecision] | None = None): accepts tool decisions and yields their pre-processed tool-result events before continuing normal streaming.
Server: endpoint routing — server.py
  • Removed non-stream pre-processing of tool decisions for chat requests.
  • The streaming branch passes tool_decisions into ai.call_stream(..., tool_decisions=...); the non-stream path is unchanged.
Tests: return-shape updates — tests/test_approval_workflow.py
  • Mocks updated: process_tool_decisions now returns (updated_messages, []) (messages plus an empty StreamMessage list) instead of a single list.
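The new return shape can be illustrated with a minimal, self-contained sketch. StreamMessage here is a stand-in for the real class in holmes/utils/stream.py, and the event name and message fields are assumptions for illustration:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

@dataclass
class StreamMessage:
    # Stand-in for holmes.utils.stream.StreamMessage; fields are assumed.
    event: str
    data: Dict[str, Any] = field(default_factory=dict)

def process_tool_decisions(
    messages: List[Dict[str, Any]], tool_decisions: List[Dict[str, Any]]
) -> Tuple[List[Dict[str, Any]], List[StreamMessage]]:
    """Sketch of the new (updated_messages, events) return shape."""
    events: List[StreamMessage] = []
    for decision in tool_decisions:
        tool_message = {
            "role": "tool",
            "tool_call_id": decision["tool_call_id"],
            "content": "approved" if decision["approved"] else "denied by user",
        }
        messages = [*messages, tool_message]  # iterable unpacking (RUF005)
        events.append(StreamMessage(event="tool_result", data=tool_message))
    return messages, events

msgs, events = process_tool_decisions([], [{"tool_call_id": "t1", "approved": True}])
```

Callers that previously consumed a single list now unpack the tuple, which is exactly the change the test mocks had to follow.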

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant C as Client
  participant S as Server (/chat)
  participant L as ToolCallingLLM
  participant T as Tool(s)

  C->>S: POST /chat (messages, tool_decisions, stream=true)
  S->>L: call_stream(messages, tool_decisions)
  rect rgb(239, 239, 239)
    note right of L: process_tool_decisions -> (messages, events)
    L-->>S: yield TOOL_RESULT events (denied/approved)
  end
  alt approved
    L->>T: invoke tool (user_approved=True)
    T-->>L: ToolCallResult(success/data)
    L-->>S: yield TOOL_RESULT (approved + result)
  else denied
    L-->>S: yield TOOL_RESULT (denied)
  end
  note over L: continue normal LLM streaming
  L-->>S: assistant token/events
  S-->>C: SSE stream (TOOL_RESULT + tokens)
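The flow in the diagram can be sketched as a generator that drains the pre-processed tool-result events before normal streaming resumes. Names and event shapes below are illustrative assumptions, not the real Holmes API:

```python
from typing import Any, Dict, Iterator, List, Optional

def process_tool_decisions(messages, decisions):
    # Simplified stand-in: one tool message + one TOOL_RESULT event per decision.
    tool_messages = [{"role": "tool", **d} for d in decisions]
    events = [{"event": "tool_result", "data": m} for m in tool_messages]
    return messages + tool_messages, events

def call_stream(
    messages: List[Dict[str, Any]],
    tool_decisions: Optional[List[Dict[str, Any]]] = None,
) -> Iterator[Dict[str, Any]]:
    if tool_decisions:
        messages, events = process_tool_decisions(messages, tool_decisions)
        yield from events  # TOOL_RESULT events reach the SSE route first
    # ...normal LLM streaming would continue here over the updated messages...
    yield {"event": "ai_answer", "data": {"content": "done"}}

chunks = list(call_stream([], tool_decisions=[{"tool_call_id": "t1"}]))
```

Because the events are yielded before the LLM loop starts, the UI sees the approval/denial outcome immediately instead of waiting for the next model turn.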

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • moshemorad

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 50.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title Check — ✅ Passed: The title clearly summarizes the primary change, stating that approved or denied tool calls now trigger an SSE event, which aligns directly with the updates to process_tool_decisions and call_stream for emitting tool result messages. It is concise, specific, and reflects the developer's focus without extraneous detail.
  • Description Check — ✅ Passed: The description accurately describes the goal of sending SSE messages to the UI for tool approval and denial outcomes, which matches the changes that ensure those events are emitted.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
holmes/core/tool_calling_llm.py (1)

337-379: Avoid 500s on stale/mismatched decisions; preserve tool result order on insert

  • Raising when no pending approvals found will 500 the SSE route. Prefer emitting an ERROR event and returning unchanged messages.
  • Inserting multiple tool results at the same index reverses their order. Track per-message insert offset to preserve tool_calls order.
```diff
-        if not pending_tool_calls:
-            error_message = f"Received {len(tool_decisions)} tool decisions but no pending approvals found"
-            logging.error(error_message)
-            raise Exception(error_message)
+        if not pending_tool_calls:
+            error_message = (
+                f"Received {len(tool_decisions)} tool decisions but no pending approvals found"
+            )
+            logging.warning(error_message)
+            events.append(
+                StreamMessage(
+                    event=StreamEvents.ERROR,
+                    data={"message": error_message},
+                )
+            )
+            return messages, events
@@
-        for tool_call_with_decision in pending_tool_calls:
+        # Preserve insertion order per assistant message
+        insert_offsets: dict[int, int] = {}
+        for tool_call_with_decision in pending_tool_calls:
@@
-            messages.insert(
-                tool_call_with_decision.message_index + 1, tool_call_message
-            )
+            idx = tool_call_with_decision.message_index
+            offset = insert_offsets.get(idx, 0)
+            messages.insert(idx + 1 + offset, tool_call_message)
+            insert_offsets[idx] = offset + 1
```
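The ordering bug the reviewer describes is easy to demonstrate in isolation: repeatedly calling list.insert at the same index stacks later items in front of earlier ones, while tracking a per-message offset preserves the original order. A minimal, self-contained illustration:

```python
# Two tool results inserted at the same index end up reversed:
messages = ["assistant_msg"]
naive = messages.copy()
for result in ["result_A", "result_B"]:
    naive.insert(1, result)
# naive is now ["assistant_msg", "result_B", "result_A"] -- order reversed

# Tracking a per-message offset preserves the original tool_calls order:
fixed = messages.copy()
insert_offsets: dict = {}
for result in ["result_A", "result_B"]:
    idx = 0  # message_index of the assistant message that made the tool calls
    offset = insert_offsets.get(idx, 0)
    fixed.insert(idx + 1 + offset, result)
    insert_offsets[idx] = offset + 1
# fixed is ["assistant_msg", "result_A", "result_B"]
```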
🧹 Nitpick comments (3)
tests/test_approval_workflow.py (2)

263-274: Fix Ruff ARG005 and RUF005 in test lambda

  • Rename unused arg to _tool_decisions.
  • Prefer iterable unpacking over concatenation.
```diff
-    ai.process_tool_decisions = MagicMock(
-        side_effect=lambda messages, tool_decisions: (
-            messages
-            + [
+    ai.process_tool_decisions = MagicMock(
+        side_effect=lambda messages, _tool_decisions: (
+            [
+                *messages,
                 {
                     "tool_call_id": "tool_call_123",
                     "role": "tool",
                     "name": "kubectl_delete",
                     "content": "pod 'dangerous-pod' deleted",
                 }
-            ],
+            ],
             [],  # Empty list for StreamMessages
         )
     )
```

409-420: Fix Ruff ARG005 and RUF005 in rejection test lambda

  • Rename unused arg to _tool_decisions.
  • Use iterable unpacking.
```diff
-    ai.process_tool_decisions = MagicMock(
-        side_effect=lambda messages, tool_decisions: (
-            messages
-            + [
+    ai.process_tool_decisions = MagicMock(
+        side_effect=lambda messages, _tool_decisions: (
+            [
+                *messages,
                 {
                     "tool_call_id": "tool_call_123",
                     "role": "tool",
                     "name": "kubectl_delete",
                     "content": "Tool execution was denied by the user.",
                 }
-            ],
+            ],
             [],  # Empty list for StreamMessages
         )
     )
```
holmes/core/tool_calling_llm.py (1)

293-306: Update docstring to reflect new return type

The function now returns (messages, events) but docstring says “Updated messages list”. Align it to avoid confusion and mypy hinting issues.

```diff
-        Returns:
-            Updated messages list with tool execution results
+        Returns:
+            A tuple of:
+            - Updated messages with tool execution results
+            - Stream events (TOOL_RESULT) emitted for each decision
```
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b65307d and 7c1dba5.

📒 Files selected for processing (3)
  • holmes/core/tool_calling_llm.py (6 hunks)
  • server.py (1 hunks)
  • tests/test_approval_workflow.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Use Ruff for formatting and linting (configured in pyproject.toml) for all Python code
Type hints are required; code should pass mypy (configured in pyproject.toml)
ALWAYS place Python imports at the top of the file, not inside functions or methods

Files:

  • server.py
  • tests/test_approval_workflow.py
  • holmes/core/tool_calling_llm.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Only use pytest markers that are defined in pyproject.toml; never introduce undefined markers/tags

Files:

  • tests/test_approval_workflow.py
tests/**

📄 CodeRabbit inference engine (CLAUDE.md)

Test files should mirror the source structure under tests/

Files:

  • tests/test_approval_workflow.py
🧬 Code graph analysis (1)
holmes/core/tool_calling_llm.py (5)
holmes/utils/stream.py (2)
  • StreamMessage (25-27)
  • StreamEvents (15-22)
holmes/core/models.py (4)
  • ToolCallResult (23-60)
  • as_streaming_tool_result_response (50-60)
  • as_tool_call_message (30-36)
  • ToolApprovalDecision (177-181)
holmes/core/tracing.py (1)
  • DummySpan (101-122)
holmes/core/tools.py (2)
  • StructuredToolResult (79-103)
  • StructuredToolResultStatus (52-76)
holmes/core/safeguards.py (1)
  • prevent_overly_repeated_tool_call (78-120)
🪛 Ruff (0.13.3)
tests/test_approval_workflow.py

263-263: Unused lambda argument: tool_decisions

(ARG005)


264-272: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)


409-409: Unused lambda argument: tool_decisions

(ARG005)


410-418: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: llm_evals
  • GitHub Check: Pre-commit checks
  • GitHub Check: build
🔇 Additional comments (3)
holmes/core/tool_calling_llm.py (2)

684-691: Good: skip redundancy safeguard when user approved

Guarding prevent_overly_repeated_tool_call behind user_approved avoids blocking explicit approvals.
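The guard the reviewer is praising can be sketched in isolation. The function and parameter names below are illustrative, not the real signatures from holmes/core/safeguards.py:

```python
from typing import Any, Callable, Dict, Set

def maybe_run_tool(
    tool_name: str,
    invoke: Callable[[], Dict[str, Any]],
    previously_called: Set[str],
    user_approved: bool = False,
) -> Dict[str, Any]:
    # The repeated-call safeguard only applies when the user did NOT
    # explicitly approve this call; an explicit approval always runs.
    if not user_approved and tool_name in previously_called:
        return {"status": "blocked", "reason": "overly repeated tool call"}
    return invoke()

blocked = maybe_run_tool("kubectl_delete", lambda: {"status": "ok"}, {"kubectl_delete"})
allowed = maybe_run_tool(
    "kubectl_delete", lambda: {"status": "ok"}, {"kubectl_delete"}, user_approved=True
)
```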


899-912: Pre-processing and streaming decisions: looks correct

Processes decisions, yields TOOL_RESULT events, and carries the updated messages into the loop. Nice.

server.py (1)

367-368: No action needed: ChatRequest.tool_decisions is defined on a Pydantic BaseModel, so incoming lists of dicts are coerced into ToolApprovalDecision instances.
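The coercion behavior referenced here can be demonstrated with a tiny Pydantic sketch. The field names on ToolApprovalDecision are hypothetical; the real model lives in holmes/core/models.py:

```python
from typing import List, Optional
from pydantic import BaseModel

class ToolApprovalDecision(BaseModel):
    # Hypothetical fields for illustration only.
    tool_call_id: str
    approved: bool

class ChatRequest(BaseModel):
    tool_decisions: Optional[List[ToolApprovalDecision]] = None

# Raw dicts from the JSON request body are coerced into model instances:
req = ChatRequest(tool_decisions=[{"tool_call_id": "t1", "approved": True}])
```

This is why the server can pass req.tool_decisions straight into call_stream without any manual conversion.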

@nherment nherment requested a review from moshemorad October 13, 2025 09:58
moshemorad previously approved these changes Oct 13, 2025
```python
tool_calls=previous_tool_calls,
)
tool_response = None
if not user_approved:
```
Contributor


Why did we change this? If the user approved it but it already ran once, do we still want to run it?

Collaborator Author


I changed it because I noticed the issue. My thinking is that if a user actively approves a command, it should run regardless of any restriction, including the repeated-command check.

Contributor


Not sure I fully agree with that. Maybe worth discussing it with product to make sure?


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
holmes/core/tool_calling_llm.py (3)

292-292: Use consistent type annotation style.

The return type mixes lowercase built-in types (tuple, list) with uppercase typing module types (List, Dict). For consistency with the rest of the codebase which imports and uses List from the typing module (line 5), use Tuple[List[Dict[str, Any]], List[StreamMessage]].

Apply this diff:

```diff
-    ) -> tuple[List[Dict[str, Any]], list[StreamMessage]]:
+    ) -> Tuple[List[Dict[str, Any]], List[StreamMessage]]:
```

And add Tuple to the imports on line 5:

```diff
-from typing import Dict, List, Optional, Type, Union, Callable, Any
+from typing import Dict, List, Optional, Type, Union, Callable, Any, Tuple
```

303-303: Use List[StreamMessage] for consistency.

The variable declaration uses lowercase list[StreamMessage] while the codebase consistently uses List from the typing module.

Apply this diff:

```diff
-        events: list[StreamMessage] = []
+        events: List[StreamMessage] = []
```

899-910: Mixed union syntax.

Line 899 uses the newer union syntax List[ToolApprovalDecision] | None while the rest of the codebase uses Optional[List[ToolApprovalDecision]]. While both work in Python 3.10+, consistency with the existing codebase style is preferred.

The logic for processing tool decisions and yielding events is correct and aligns with the PR objectives.

Apply this diff for consistency:

```diff
-        tool_decisions: List[ToolApprovalDecision] | None = None,
+        tool_decisions: Optional[List[ToolApprovalDecision]] = None,
```
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7c1dba5 and e267b9a.

📒 Files selected for processing (1)
  • holmes/core/tool_calling_llm.py (6 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Use Ruff for formatting and linting (configured in pyproject.toml) for all Python code
Type hints are required; code should pass mypy (configured in pyproject.toml)
ALWAYS place Python imports at the top of the file, not inside functions or methods

Files:

  • holmes/core/tool_calling_llm.py
🧬 Code graph analysis (1)
holmes/core/tool_calling_llm.py (5)
holmes/utils/stream.py (2)
  • StreamMessage (25-27)
  • StreamEvents (15-22)
holmes/core/models.py (4)
  • ToolCallResult (23-60)
  • as_streaming_tool_result_response (50-60)
  • as_tool_call_message (30-36)
  • ToolApprovalDecision (177-181)
holmes/core/tracing.py (1)
  • DummySpan (101-122)
holmes/core/tools.py (2)
  • StructuredToolResult (79-103)
  • StructuredToolResultStatus (52-76)
holmes/core/safeguards.py (1)
  • prevent_overly_repeated_tool_call (78-120)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Pre-commit checks
  • GitHub Check: llm_evals
  • GitHub Check: build
🔇 Additional comments (2)
holmes/core/tool_calling_llm.py (2)

342-378: LGTM! Tool approval logic correctly emits SSE events.

The implementation properly handles both approved and denied tools, generating TOOL_RESULT events in both cases and deriving the tool call message from the result. This aligns perfectly with the PR objective to ensure SSE messages are sent for tool results.


684-690: LGTM! Approved tools correctly bypass the repeated-call safeguard.

When user_approved=True, the safeguard is skipped, which makes sense—if a user explicitly approves a tool, it should run regardless of previous executions.


Results of HolmesGPT evals

  • ask_holmes: 33/35 test cases were successful, 0 regressions, 1 setup failure
Test suite Test case Status
ask 01_how_many_pods
ask 02_what_is_wrong_with_pod
ask 04_related_k8s_events
ask 05_image_version
ask 09_crashpod
ask 10_image_pull_backoff
ask 110_k8s_events_image_pull
ask 11_init_containers
ask 13a_pending_node_selector_basic
ask 14_pending_resources
ask 15_failed_readiness_probe
ask 17_oom_kill
ask 19_detect_missing_app_details
ask 20_long_log_file_search
ask 24_misconfigured_pvc
ask 24a_misconfigured_pvc_basic
ask 28_permissions_error 🚧
ask 39_failed_toolset
ask 41_setup_argo
ask 42_dns_issues_steps_new_tools ⚠️
ask 43_current_datetime_from_prompt
ask 45_fetch_deployment_logs_simple
ask 51_logs_summarize_errors
ask 53_logs_find_term
ask 54_not_truncated_when_getting_pods
ask 59_label_based_counting
ask 60_count_less_than
ask 61_exact_match_counting
ask 63_fetch_error_logs_no_errors
ask 79_configmap_mount_issue
ask 83_secret_not_found
ask 86_configmap_like_but_secret
ask 93_calling_datadog[0]
ask 93_calling_datadog[1]
ask 93_calling_datadog[2]

Legend

  • ✅ the test was successful
  • ➖ the test was skipped
  • ⚠️ the test failed but is known to be flaky or known to fail
  • 🚧 the test had a setup failure (not a code regression)
  • 🔧 the test failed due to mock data issues (not a code regression)
  • 🚫 the test was throttled by API rate limits/overload
  • ❌ the test failed and should be fixed before merging the PR

@nherment nherment requested a review from moshemorad October 14, 2025 05:38
@nherment nherment merged commit 3b0eee7 into master Oct 15, 2025
8 checks passed
@nherment nherment deleted the rob-1933_holmes_tool_approval_workflow_3 branch October 15, 2025 07:34