Conversation

Dishant1804
Collaborator

Proposed change

Resolves #2144

  • create context for www- repositories
  • create chunks for the same
  • test coverage for it

Checklist

  • I've read and followed the contributing guidelines.
  • I've run make check-test locally; all checks and tests passed.

Contributor

coderabbitai bot commented Sep 1, 2025

Summary by CodeRabbit

  • New Features

    • Ingests repository content (including README and common docs) to enrich AI context.
    • Adds background jobs to update repository context and content chunks.
    • Expands repository metadata captured (status, features, counts, timestamps).
  • Bug Fixes

    • Prevents duplicate content chunks and improves error handling during bulk operations.
    • Enhances logging for easier diagnostics.
  • Tests

    • Adds comprehensive tests for repository extraction, background jobs, and chunk handling.
  • Chores

    • Adds make targets for easier job execution.
    • Updates spelling dictionary.

Walkthrough

Adds repository Markdown content extraction, new Django management commands to update repository chunks and context using that extractor, corresponding Makefile targets, and extensive tests. Also updates Chunk model with uniqueness constraints, logging, and duplicate-handling in bulk save, plus a spelling dictionary entry.

Changes

Cohort / File(s) Summary
Makefile targets
backend/apps/ai/Makefile
Adds ai-update-repository-chunks and ai-update-repository-context targets invoking new Django commands.
Repository extractors
backend/apps/ai/common/extractors/repository.py
Adds extract_repository_content and extract_repository_markdown_content to build prose/metadata and fetch common Markdown files from GitHub raw URLs.
Management commands: repository
backend/apps/ai/management/commands/ai_update_repository_chunks.py, backend/apps/ai/management/commands/ai_update_repository_context.py
New commands for generating chunks and updating context for repositories; filter base queryset to is_owasp_site_repository=True, is_archived=False, is_empty=False; use markdown extractor; set key_field_name="key", source_name="owasp_repository_markdown".
Chunk model enhancements
backend/apps/ai/models/chunk.py
Adds logging, catches DatabaseError, enforces unique_together = ("context","text"), refactors duplicate checks, and improves bulk save with per-item fallback.
Tests: extractors
backend/tests/apps/ai/common/extractors/repository_test.py
Adds comprehensive tests for repository content and markdown extraction, URL construction, fallbacks, and logging.
Tests: management commands
backend/tests/apps/ai/management/commands/ai_update_repository_chunks_test.py, backend/tests/apps/ai/management/commands/ai_update_repository_context_test.py
Verify inheritance, properties, filtering, source name, help text, and delegation to extractor.
Tests: Chunk model
backend/tests/apps/ai/models/chunk_test.py
Updates tests to match new duplicate-check and bulk save behavior using Chunk.objects.filter(...).exists().
Tests: utils
backend/tests/apps/ai/common/utils_test.py
Adjusts a test to mock Chunk.update_data raising when context is None; validates logging and empty result.
Misc test placeholder
backend/test_commands.py
Adds a placeholder module docstring.
Spelling dictionary
cspell/custom-dict.txt
Adds “repositorys”.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Assessment against linked issues

Objective Addressed Explanation
Fetch and index Markdown content from OWASP repositories prefixed with www- to improve NestBot contexts (#2144) Markdown fetching and indexing are implemented, but selection filters use is_owasp_site_repository without enforcing the www- name prefix. Scope alignment with the exact repo naming is unclear.

Assessment against linked issues: Out-of-scope changes

Code Change Explanation
Chunk model uniqueness, logging, and bulk save refactor (backend/apps/ai/models/chunk.py) Modifies storage behavior and constraints; not specifically required to fetch/index Markdown from www- repositories.

Suggested labels

backend, backend-tests, nestbot, makefile

Suggested reviewers

  • kasya
  • arkid15r


Contributor

coderabbitai bot left a comment


Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
docker-compose/local.yaml (1)

25-25: Bug: backend service sets DJANGO_REDIS_PASSWORD from DJANGO_REDIS_HOST.

This will set the password to the host value when DJANGO_REDIS_HOST is defined.

-      DJANGO_REDIS_PASSWORD: ${DJANGO_REDIS_HOST:-nest-cache-password}
+      DJANGO_REDIS_PASSWORD: ${DJANGO_REDIS_PASSWORD:-nest-cache-password}
backend/apps/ai/models/chunk.py (1)

85-93: Handle race with unique constraint in update_data().

Concurrent writers can still insert the same (context, text) between the exists() check and save(), raising IntegrityError. Catch and return None to match the method contract.

-        if save:
-            chunk.save()
+        if save:
+            from django.db import IntegrityError
+            try:
+                chunk.save()
+            except IntegrityError:
+                return None
backend/apps/slack/events/__init__.py (1)

1-20: Make event configuration idempotent and skip heavy imports when Slack app is disabled

Prevents duplicate registrations and avoids importing event modules if tokens are missing.

+_EVENTS_CONFIGURED = False
+
 def configure_slack_events():
     """Configure Slack events after Django apps are ready."""
-    from apps.slack.apps import SlackConfig
-    from apps.slack.events import (
-        app_home_opened,
-        message_posted,
-        team_join,
-        url_verification,
-    )
-    from apps.slack.events.event import EventBase
-    from apps.slack.events.member_joined_channel import (
-        catch_all,
-        contribute,
-        gsoc,
-        project_nest,
-    )
-
-    if SlackConfig.app:
-        EventBase.configure_events()
+    from apps.slack.apps import SlackConfig
+    global _EVENTS_CONFIGURED
+    if not SlackConfig.app or _EVENTS_CONFIGURED:
+        return
+    from apps.slack.events import (
+        app_home_opened,
+        message_posted,
+        team_join,
+        url_verification,
+    )
+    from apps.slack.events.event import EventBase
+    from apps.slack.events.member_joined_channel import (
+        catch_all,
+        contribute,
+        gsoc,
+        project_nest,
+    )
+    EventBase.configure_events()
+    _EVENTS_CONFIGURED = True
backend/tests/apps/ai/common/extractors/repository_test.py (1)

434-455: Log message lacks repository context expected by tests

Tests assert the debug log contains test-repo. Update extractor logging to include repo key/owner/path as shown above.

🧹 Nitpick comments (48)
backend/test_commands.py (1)

1-1: Avoid stray test_ module outside tests/*

Pytest will collect this empty module, adding noise. Either remove it or rename/move under a non-collected path (e.g., backend/commands_testutils.py) or into tests/ with actual tests.

backend/settings/base.py (1)

145-152: Harden RQ config: prefer URL form and make timeout configurable.

Using a single Redis URL avoids partial env mismatches and mirrors your cache config; exposing timeout via env helps ops tuning.

Apply within this block:

-        "default": {
-            "HOST": REDIS_HOST,
-            "PORT": 6379,
-            "PASSWORD": REDIS_PASSWORD,
-            "DEFAULT_TIMEOUT": 360,
-        }
+        "default": {
+            # Use URL form for parity with CACHES and simpler overrides.
+            "URL": f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:6379/0",
+            # Make tunable via env; 360s default preserved.
+            "DEFAULT_TIMEOUT": values.IntegerValue(
+                environ_name="RQ_DEFAULT_TIMEOUT",
+                default=360,
+            ),
+        }

Optionally consider adding:

  • RQ exception handlers that report to Sentry.
  • Separate queues (e.g., "ai", "slack") with different timeouts.
backend/tests/apps/ai/common/utils_test.py (1)

108-116: Add a minimal assertion to lock invocation count.

Verifies we don’t loop on failures.

         result = create_chunks_and_embeddings(
             chunk_texts=["some text"],
             context=None,
             openai_client=mock_openai_client,
         )
 
         mock_logger.exception.assert_called_once_with("Failed to create chunks and embeddings")
+        mock_update_data.assert_called_once()
         assert result == []
backend/apps/slack/constants.py (1)

25-80: Use frozenset for OWASP_KEYWORDS and prune overly generic entries
Convert OWASP_KEYWORDS to a frozenset to prevent accidental mutation; remove broad terms (e.g. “security”, “project”, “event”) to cut down on false positives—matching is already case-insensitive via text.lower(). Evaluate the false-positive rate on a representative Slack log before enabling auto-replies.
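A minimal sketch of the frozenset conversion (the keyword entries here are illustrative, not the real dictionary in apps.slack.constants):

```python
# Hypothetical keyword set; real entries live in apps.slack.constants.
OWASP_KEYWORDS = frozenset(
    {
        "owasp",
        "zap",
        "juice shop",
        "dependency-check",
    }
)


def mentions_owasp(text: str) -> bool:
    """Membership check on lowered text; the frozenset cannot be mutated."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in OWASP_KEYWORDS)
```

Because the set is immutable, any accidental `OWASP_KEYWORDS.add(...)` fails loudly at the call site instead of silently growing the match list.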

backend/apps/ai/models/chunk.py (1)

5-5: Also import IntegrityError.

If you adopt the changes above, add IntegrityError to imports.

-from django.db import DatabaseError, models
+from django.db import DatabaseError, IntegrityError, models
backend/tests/apps/ai/models/chunk_test.py (4)

31-42: Assert list-clearing contract after bulk_save.

If bulk_save is expected to clear the input list (as BulkSaveModel does), assert it to prevent regressions.

             with patch("apps.common.models.BulkSaveModel.bulk_save") as mock_bulk_save:
                 Chunk.bulk_save(mock_chunks)
                 mock_bulk_save.assert_called_once_with(Chunk, mock_chunks, fields=None)
+                assert mock_chunks == []

43-55: Mirror the clearing assertion for the fields path.

Keep expectations consistent across both code paths.

             with patch("apps.common.models.BulkSaveModel.bulk_save") as mock_bulk_save:
                 Chunk.bulk_save(mock_chunks, fields=fields)
                 mock_bulk_save.assert_called_once_with(Chunk, mock_chunks, fields=fields)
+                assert mock_chunks == []

65-86: Remove redundant @patch on Chunk.save.

You replace Chunk with a Mock later; the decorator patch becomes moot and can confuse readers.

-@patch("apps.ai.models.chunk.Chunk.save")
-def test_update_data_save_with_context(self, mock_save):
+def test_update_data_save_with_context(self):

31-55: Add a test ensuring updates aren’t filtered out.

Guard against regressions where chunks with ids (updates) get dropped by existence filtering.

Proposed new test:

def test_bulk_save_includes_updates():
    c1 = Mock(spec=Chunk); c1.id = 42; c1.context = Mock(); c1.text = "t1"
    c2 = Mock(spec=Chunk); c2.id = None; c2.context = Mock(); c2.text = "t2"
    with patch("apps.ai.models.chunk.Chunk.objects.filter") as mock_filter, \
         patch("apps.common.models.BulkSaveModel.bulk_save") as mock_bulk:
        mock_filter.return_value.exists.return_value = False
        chunks = [c1, c2]
        Chunk.bulk_save(chunks)
        # c1 (update) must still be passed through
        mock_bulk.assert_called_once()
        passed = mock_bulk.call_args.args[1]
        assert c1 in passed and c2 in passed

Want me to push this into the suite?

backend/apps/ai/common/extractors/repository.py (5)

130-146: Broaden markdown coverage (common variants).

Consider additional common filenames/paths to increase hit rate on OWASP www repos (case and format variants).

     markdown_files = [
         "README.md",
+        "README.rst",
+        "README.adoc",
+        "readme.md",
+        "README",
         "CONTRIBUTING.md",
         "CODE_OF_CONDUCT.md",
         "SECURITY.md",
         "CHANGELOG.md",
        "LICENSE.md",
+        "LICENSE",
        "docs/README.md",
         "docs/index.md",
         "docs/overview.md",
         "docs/getting-started.md",
         "docs/installation.md",
         "docs/usage.md",
         "docs/api.md",
         "docs/development.md",
         "docs/deployment.md",
+        "index.md",
+        "content/index.md",
+        "content/README.md",
+        "site/README.md",
     ]

156-160: Branch fallback for site repos.

If default_branch is missing, fall back to gh-pages for Pages-enabled repos, then main.

-                raw_url = (
-                    f"https://raw.githubusercontent.com/{owner}/{repository.key}/"
-                    f"{repository.default_branch or 'main'}/{file_path}"
-                )
+                branch = (
+                    repository.default_branch
+                    or ("gh-pages" if getattr(repository, "has_pages", False) else None)
+                    or "main"
+                )
+                raw_url = f"https://raw.githubusercontent.com/{owner}/{repository.key}/{branch}/{file_path}"

167-169: Improve logging with context.

Log which file/owner/repo failed and keep traceback; current message loses diagnostics.

-        except (ValueError, TypeError, OSError, ConnectionError):
-            logger.debug("Failed to fetch")
+        except (ValueError, TypeError, OSError, ConnectionError):
+            logger.debug(
+                "Failed to fetch markdown file",
+                extra={"file_path": file_path, "owner": owner, "repo": getattr(repository, "key", None)},
+                exc_info=True,
+            )
             continue

148-166: Surface fetched files in metadata (and drop unused variable otherwise).

Either remove fetched_files or expose it so downstream context can reference what was pulled.

-    fetched_files = []
+    fetched_files = []
@@
-                if content and content.strip():
-                    fetched_files.append(file_path)
+                if content and content.strip():
+                    fetched_files.append(file_path)
                     prose_parts.append(f"## {file_path}\n\n{content}")
@@
     return (
-        DELIMITER.join(filter(None, prose_parts)),
-        DELIMITER.join(filter(None, metadata_parts)),
+        DELIMITER.join(filter(None, prose_parts)),
+        DELIMITER.join(
+            filter(
+                None,
+                metadata_parts
+                + (
+                    [f"Fetched Markdown Files: {', '.join(fetched_files)}"]
+                    if fetched_files
+                    else []
+                ),
+            )
+        ),
     )

Also applies to: 171-174


148-166: Consider rate limits and 404 noise from raw URLs.

Fetching many files per repo can hit GitHub rate limits and raw 404s may return HTML. Prefer GitHub API for contents (with If-None-Match) or add status checks in get_repository_file_content to return "" on non-200.

I can provide an API-backed extractor variant and harden get_repository_file_content to ignore non-200 responses. Want a follow-up PR?
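The real get_repository_file_content signature isn't shown in this PR; a standalone sketch of the non-200 hardening, with an injectable opener so the status check is testable without network access:

```python
from urllib.request import urlopen
from urllib.error import URLError


def fetch_raw_file(url: str, opener=urlopen, timeout: int = 10) -> str:
    """Return body text only for a 200 response; raw-URL 404s serve HTML pages.

    `opener` is injectable (defaults to urllib's urlopen) purely for testing.
    """
    try:
        with opener(url, timeout=timeout) as response:
            if getattr(response, "status", None) != 200:
                return ""
            return response.read().decode("utf-8")
    except (URLError, OSError, ValueError):
        return ""
```

Returning `""` on any non-200 keeps the extractor's "empty means skip" contract intact while dropping the HTML error pages GitHub serves for missing raw files.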

backend/apps/slack/admin/conversation.py (1)

30-31: Expose the new flag in filters for quicker triage

Adding the boolean to list_filter eases admin workflows without altering data model.

Apply outside this hunk:

@@
     list_filter = (
         "sync_messages",
         "is_archived",
         "is_channel",
         "is_general",
         "is_im",
         "is_private",
+        "is_nest_bot_assistant_enabled",
     )
backend/apps/slack/commands/ai.py (2)

21-23: Avoid KeyError when Slack omits text; use a safe default.
Use .get("text") or "" to guard against missing/None and prevent an AttributeError on .strip().

-        return get_blocks(
-            query=command["text"].strip(),
-        )
+        return get_blocks(
+            query=(command.get("text") or "").strip(),
+        )

19-20: Document the local import intent.
Add a short comment so future readers know this is deliberate for lazy-loading/cycle avoidance.

-        from apps.slack.common.handlers.ai import get_blocks
+        # Local import to avoid circular import and reduce import-time overhead.
+        from apps.slack.common.handlers.ai import get_blocks
backend/apps/ai/Makefile (1)

37-44: Declare PHONY targets for the new commands.
Prevents make from treating similarly named files as up-to-date targets.

 ai-update-repository-context:
 	@echo "Updating repository context"
 	@CMD="python manage.py ai_update_repository_context" $(MAKE) exec-backend-command
 
+.PHONY: ai-update-repository-chunks ai-update-repository-context
backend/apps/ai/management/commands/ai_update_repository_context.py (1)

18-28: Eager-load FKs used by the extractor.
extract_repository_markdown_content accesses organization.login and owner.login; add select_related to avoid N+1 queries during batch runs.

         return (
             super()
             .get_base_queryset()
             .filter(
                 is_owasp_site_repository=True,
                 is_archived=False,
                 is_empty=False,
             )
-        )
+        ).select_related("organization", "owner")
backend/apps/slack/services/message_auto_reply.py (1)

33-35: Race window note (optional).
Small race exists between the replies check and posting; acceptable, but consider an idempotency key or re-check on 409s if this becomes noisy.
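If the race does become noisy, the cheapest mitigation is re-checking the thread immediately before posting; a sketch (method names per slack_sdk's WebClient, the surrounding service code is assumed):

```python
def post_reply_if_still_unanswered(client, channel: str, thread_ts: str, text: str):
    """Narrow the race window: re-check the thread right before posting."""
    replies = client.conversations_replies(channel=channel, ts=thread_ts, limit=1)
    messages = replies.get("messages") or [{}]
    if messages[0].get("reply_count", 0) > 0:
        # Someone answered between the initial check and now; stay silent.
        return None
    return client.chat_postMessage(channel=channel, thread_ts=thread_ts, text=text)
```

This doesn't eliminate the race, it just shrinks it to the gap between the second check and the post, which is usually acceptable for an auto-reply bot.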

backend/apps/slack/common/handlers/ai.py (4)

16-19: Fix docstring: remove undocumented parameter.
The presentation argument isn’t in the signature.

-        presentation (EntityPresentation | None): Configuration for entity presentation.

26-28: Guard against oversized Slack block text.
Slack mrkdwn text is limited (~3000 chars per section). Chunk long responses into multiple blocks.

-    if ai_response:
-        return [markdown(ai_response)]
+    if ai_response:
+        max_len = 2900
+        if len(ai_response) <= max_len:
+            return [markdown(ai_response)]
+        return [markdown(ai_response[i : i + max_len]) for i in range(0, len(ai_response), max_len)]
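The fixed-stride slice above can cut mid-word; if that matters, a boundary-aware splitter is a small step up (sketch; the 2900-char limit is an assumption kept below Slack's ~3000-char section cap):

```python
def split_for_slack(text: str, max_len: int = 2900) -> list[str]:
    """Split text into pieces of at most max_len, preferring newline/space breaks."""
    pieces: list[str] = []
    while len(text) > max_len:
        window = text[:max_len]
        # Prefer the last newline, then the last space, inside the window.
        cut = max(window.rfind("\n"), window.rfind(" "))
        if cut <= 0:  # a single unbroken run longer than max_len: hard split
            cut = max_len
        pieces.append(text[:cut].rstrip())
        text = text[cut:].lstrip()
    if text:
        pieces.append(text)
    return pieces
```

Each piece then maps to one `markdown(...)` block, and word boundaries survive except in pathological no-whitespace inputs.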

31-46: Make models configurable via settings.
Avoid hard-coding model IDs; read from Django settings with sane defaults.

+    from django.conf import settings
     rag_tool = RagTool(
-        chat_model="gpt-4o",
-        embedding_model="text-embedding-3-small",
+        chat_model=getattr(settings, "AI_CHAT_MODEL", "gpt-4o"),
+        embedding_model=getattr(settings, "AI_EMBEDDING_MODEL", "text-embedding-3-small"),
     )

10-10: Unused logger.
Either use logger for diagnostics or remove it.

backend/apps/slack/MANIFEST.yaml (2)

111-124: Tighten bot scopes (least privilege).

Line 123 adds channels:history, which is needed for message.channels. Line 122 (channels:manage) appears unnecessary for this feature set—drop it if unused.

       - users:read
-      - groups:write
-      - channels:manage
+      - groups:write
+      # Removed channels:manage (not required for read/AI reply workflows)
       - channels:history

131-134: Validate event coverage vs. intended surfaces.

message.channels covers public channels only. If replies are desired in private channels/DMs, add message.groups and message.im and corresponding groups:history/im:history scopes; otherwise consider app_mention to reduce noise.

backend/apps/slack/migrations/0019_conversation_is_nest_bot_assistant_enabled.py (1)

11-17: Field OK; consider index and default rollout.

  • If this flag is frequently filtered, add db_index=True or a model Index.
  • Default False will disable the assistant for all existing conversations. Confirm that’s intended or add a data migration to enable for selected workspaces/channels.
-            field=models.BooleanField(default=False, verbose_name="Is Nest Bot Assistant Enabled"),
+            field=models.BooleanField(
+                default=False,
+                verbose_name="Is Nest Bot Assistant Enabled",
+                db_index=True,
+            ),
backend/apps/ai/management/commands/ai_update_repository_chunks.py (1)

18-28: Avoid N+1 on owner/org during extraction.

The extractor touches repository.organization/owner; select_related will prevent per-row FK hits.

     def get_base_queryset(self) -> QuerySet:
         """Return the base queryset with filtering for OWASP site repositories."""
         return (
             super()
             .get_base_queryset()
+            .select_related("organization", "owner")
             .filter(
                 is_owasp_site_repository=True,
                 is_archived=False,
                 is_empty=False,
             )
         )
backend/tests/apps/ai/management/commands/ai_update_repository_chunks_test.py (3)

45-47: Fix pluralization to “repositories”.

Current expectation hardcodes a misspelling (“repositorys”) and bakes it into UX.

-        assert command.entity_name_plural == "repositorys"
+        assert command.entity_name_plural == "repositories"

Outside this file, consider deriving from model meta to avoid manual pluralization:

# In BaseAICommand (or similar)
@property
def entity_name_plural(self) -> str:
    return getattr(self.model_class._meta, "verbose_name_plural", f"{self.entity_name}s")

67-82: Patch target is brittle.

Patching command.class.bases[0].get_base_queryset relies on MRO order. Patch the concrete base symbol instead.

-        with patch.object(command.__class__.__bases__[0], "get_base_queryset") as mock_super:
+        from apps.ai.common.base.chunk_command import BaseChunkCommand
+        with patch.object(BaseChunkCommand, "get_base_queryset") as mock_super:

104-119: Remove duplicated test logic.

This repeats test_get_base_queryset. Drop to reduce noise.

-    def test_queryset_filtering_logic(self, command):
-        """Test that the queryset filtering logic works correctly."""
-        with patch.object(command.__class__.__bases__[0], "get_base_queryset") as mock_super:
-            mock_queryset = Mock()
-            mock_super.return_value = mock_queryset
-            mock_queryset.filter.return_value = "filtered_queryset"
-
-            result = command.get_base_queryset()
-
-            mock_queryset.filter.assert_called_once_with(
-                is_owasp_site_repository=True,
-                is_archived=False,
-                is_empty=False,
-            )
-            assert result == "filtered_queryset"
+    # Removed: duplicate of test_get_base_queryset
backend/tests/apps/slack/common/handlers/ai_test.py (2)

75-83: Avoid hard-coding model IDs; read from settings.

Make chat/embedding model names configurable; update test to assert against settings-derived values.

-        mock_rag_tool.assert_called_once_with(
-            chat_model="gpt-4o",
-            embedding_model="text-embedding-3-small",
-        )
+        from django.conf import settings
+        mock_rag_tool.assert_called_once_with(
+            chat_model=getattr(settings, "SLACK_AI_CHAT_MODEL", "gpt-4o"),
+            embedding_model=getattr(settings, "SLACK_AI_EMBEDDING_MODEL", "text-embedding-3-small"),
+        )

Outside this file (handler):

from django.conf import settings
rag_tool = RagTool(
    chat_model=getattr(settings, "SLACK_AI_CHAT_MODEL", "gpt-4o"),
    embedding_model=getattr(settings, "SLACK_AI_EMBEDDING_MODEL", "text-embedding-3-small"),
)

132-143: Add a guard for blank-only queries.

Avoid calling the RAG pipeline when the stripped query is empty; return error blocks directly and add a test.

+    def test_get_blocks_with_blank_query(self):
+        with patch("apps.slack.common.handlers.ai.get_error_blocks") as mock_err:
+            mock_err.return_value = [{"type": "section"}]
+            with patch("apps.slack.common.handlers.ai.process_ai_query") as mock_proc:
+                result = get_blocks("   ")
+                mock_proc.assert_not_called()
+                assert result == [{"type": "section"}]
backend/tests/apps/ai/management/commands/ai_update_repository_context_test.py (2)

67-83: Replace brittle base-class patching with import-path patching.

Using bases[0] is fragile. Patch the known symbol instead.

-        with patch.object(command.__class__.__bases__[0], "get_base_queryset") as mock_super:
+        with patch("apps.ai.common.base.context_command.BaseContextCommand.get_base_queryset") as mock_super:
             mock_queryset = Mock()
             mock_super.return_value = mock_queryset
             mock_queryset.filter.return_value = "filtered_queryset"

Repeat the same replacement in the other tests that patch the base queryset.

Also applies to: 104-119, 127-141, 142-156, 157-171


127-171: DRY up the three queryset-filter tests.

They assert the same filter arguments. Parametrize into one test to reduce noise and speed up the suite.

If helpful, I can provide a parametrized version.
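A self-contained sketch of the parametrized shape (FakeCommand is a hypothetical stand-in so the example runs without Django; the real test would patch BaseContextCommand as in the previous comment):

```python
import pytest
from unittest.mock import Mock

SITE_REPO_FILTERS = {
    "is_owasp_site_repository": True,
    "is_archived": False,
    "is_empty": False,
}


class FakeCommand:
    """Hypothetical stand-in for the repository context command."""

    def __init__(self, base_queryset):
        self._base = base_queryset

    def get_base_queryset(self):
        return self._base.filter(**SITE_REPO_FILTERS)


@pytest.mark.parametrize(("flag", "expected"), sorted(SITE_REPO_FILTERS.items()))
def test_base_queryset_filter_flags(flag, expected):
    base = Mock()
    FakeCommand(base).get_base_queryset()
    assert base.filter.call_args.kwargs[flag] is expected
```

One parametrized test replaces the three copies while still reporting each flag as a separate case.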

backend/tests/apps/slack/services/message_auto_reply_test.py (3)

48-96: Also assert the retrieval call contract.

Validate we call Message.objects.get with pk for added confidence.

         generate_ai_reply_if_unanswered(mock_message.id)
 
+        mock_message_get.assert_called_once_with(pk=mock_message.id)
         mock_client.conversations_replies.assert_called_once_with(
             channel=mock_message.conversation.slack_channel_id,
             ts=mock_message.slack_message_id,
             limit=1,
         )

97-106: Add assertion for not-found path.

Confirms correct lookup argument.

         generate_ai_reply_if_unanswered(99999)
 
+        mock_message_get.assert_called_once_with(pk=99999)

172-211: Merge duplicate “no/empty AI response” tests via parametrization.

Reduces repetition while preserving coverage.

-    @patch("apps.slack.services.message_auto_reply.Message.objects.get")
-    @patch("apps.slack.services.message_auto_reply.WebClient")
-    @patch("apps.slack.services.message_auto_reply.process_ai_query")
-    def test_generate_ai_reply_no_ai_response(
-        self,
-        mock_process_ai_query,
-        mock_webclient,
-        mock_message_get,
-        mock_message,
-    ):
-        """Test when AI doesn't return a response."""
-        mock_message_get.return_value = mock_message
-        mock_client = Mock()
-        mock_webclient.return_value = mock_client
-        mock_client.conversations_replies.return_value = {"messages": [{"reply_count": 0}]}
-        mock_process_ai_query.return_value = None
-
-        generate_ai_reply_if_unanswered(mock_message.id)
-
-        mock_client.chat_postMessage.assert_not_called()
-
-    @patch("apps.slack.services.message_auto_reply.Message.objects.get")
-    @patch("apps.slack.services.message_auto_reply.WebClient")
-    @patch("apps.slack.services.message_auto_reply.process_ai_query")
-    def test_generate_ai_reply_empty_response(
-        self,
-        mock_process_ai_query,
-        mock_webclient,
-        mock_message_get,
-        mock_message,
-    ):
-        """Test when AI returns empty response."""
-        mock_message_get.return_value = mock_message
-        mock_client = Mock()
-        mock_webclient.return_value = mock_client
-        mock_client.conversations_replies.return_value = {"messages": [{"reply_count": 0}]}
-        mock_process_ai_query.return_value = ""
-
-        generate_ai_reply_if_unanswered(mock_message.id)
-
-        mock_client.chat_postMessage.assert_not_called()
+    @pytest.mark.parametrize("ai_response", [None, ""])
+    @patch("apps.slack.services.message_auto_reply.Message.objects.get")
+    @patch("apps.slack.services.message_auto_reply.WebClient")
+    @patch("apps.slack.services.message_auto_reply.process_ai_query")
+    def test_generate_ai_reply_no_or_empty_ai_response(
+        self, mock_process_ai_query, mock_webclient, mock_message_get, mock_message, ai_response
+    ):
+        """Test when AI returns no/empty response."""
+        mock_message_get.return_value = mock_message
+        mock_client = Mock()
+        mock_webclient.return_value = mock_client
+        mock_client.conversations_replies.return_value = {"messages": [{"reply_count": 0}]}
+        mock_process_ai_query.return_value = ai_response
+
+        generate_ai_reply_if_unanswered(mock_message.id)
+
+        mock_client.chat_postMessage.assert_not_called()
backend/tests/apps/slack/common/question_detector_test.py (3)

18-23: Avoid brittle assertions on pattern counts

Asserting exact counts for question_patterns/compiled_patterns couples tests to implementation details.

-        assert len(detector.question_patterns) == 4
-        assert len(detector.compiled_patterns) == 4
+        assert len(detector.question_patterns) >= 4
+        assert len(detector.compiled_patterns) >= 4

49-64: Confirm intent: treating “advice/recommend” statements as questions

Tests expect declaratives like “I recommend this approach” to be flagged as questions. If that’s not intended, constrain these patterns (e.g., require a question mark or leading question terms).


144-156: Reduce reliance on pattern order

Index-based checks (compiled_patterns[0/1]) are fragile. Prefer matching by pattern content or behavior.

-        question_mark_pattern = detector.compiled_patterns[0]
-        assert question_mark_pattern.search("Is this right?") is not None
+        assert any(p.pattern == r"\?" and p.search("Is this right?") for p in detector.compiled_patterns)

-        question_word_pattern = detector.compiled_patterns[1]
-        assert question_word_pattern.search("What is this") is not None
-        assert question_word_pattern.search("How does it work") is not None
+        expected = r"^(what|how|why|when|where|which|who|can|could|would|should|is|are|does|do|did)"
+        assert any(p.pattern == expected and p.search("What is this") for p in detector.compiled_patterns)
+        assert any(p.pattern == expected and p.search("How does it work") for p in detector.compiled_patterns)
backend/apps/slack/common/question_detector.py (3)

16-19: Normalize keywords to lowercase once

Ensure case-insensitive matching regardless of how OWASP_KEYWORDS is defined.

-        self.owasp_keywords = OWASP_KEYWORDS
+        self.owasp_keywords = {k.lower() for k in OWASP_KEYWORDS}

20-25: Tighten regex to reduce false positives

Add word-boundaries and allow optional leading whitespace for question starters; keeps behavior but trims accidental matches like “whatsoever”.

-            r"^(what|how|why|when|where|which|who|can|could|would|should|is|are|does|do|did)",
-            r"(help|explain|tell me|show me|guide|tutorial|example)",
-            r"(recommend|suggest|advice|opinion)",
+            r"^\s*\b(what|how|why|when|where|which|who|can|could|would|should|is|are|does|do|did)\b",
+            r"\b(help|explain|tell me|show me|guide|tutorial|example)\b",
+            r"\b(recommend|suggest|advice|opinion)\b",

48-57: Boundary-aware phrase matching for multi-word keywords

Substring matching can overmatch. Consider boundary-aware search that ignores punctuation/hyphens.

# Example enhancement:
def contains_owasp_keywords(self, text: str) -> bool:
    words = re.findall(r"\b\w+\b", text)
    text_words = set(words)
    if self.owasp_keywords.intersection(text_words):
        return True

    # Multi-word phrases: collapse whitespace/punctuation and match by tokens
    normalized = re.sub(r"[\s\-_]+", " ", text)
    for keyword in self.owasp_keywords:
        if " " in keyword and re.search(rf"\b{re.escape(keyword)}\b", normalized):
            return True
    return False
backend/tests/apps/slack/commands/ai_test.py (1)

13-17: Remove unused autouse fixture or reuse it

You recreate Ai() in tests; either reuse self.ai_command or drop the autouse fixture.

-    @pytest.fixture(autouse=True)
-    def setup_method(self):
-        """Set up test data before each test method."""
-        self.ai_command = Ai()
+    @pytest.fixture
+    def ai_command(self):
+        """Ai instance per test."""
+        return Ai()

Then use ai_command in tests.

backend/tests/apps/ai/common/extractors/repository_test.py (1)

397-405: Use call.args for robust URL assertions

Stringifying call is brittle. Inspect the first positional arg.

-        assert any(
-            "https://raw.githubusercontent.com/test-org/test-repo/develop/" in str(call)
-            for call in mock_get_content.call_args_list
-        )
+        assert any(
+            "https://raw.githubusercontent.com/test-org/test-repo/develop/" in c.args[0]
+            for c in mock_get_content.call_args_list
+        )
backend/tests/apps/slack/events/message_posted_test.py (2)

95-123: Add a case for mid-message mentions (optional)

If mentions anywhere should trigger AI reply, add a test where the mention is not at the start (e.g., “Hi <@u987654> what is OWASP?”) and confirm the stripped query is correct. If only leading mentions should trigger, assert non-trigger instead.


95-123: Consider guarding against None blocks from get_blocks

If get_blocks can return None, chat_postMessage may fail. Either assert in tests that we skip posting when None, or handle it in the event handler.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 7cd6d44 and a1f1b70.

⛔ Files ignored due to path filters (1)
  • backend/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (36)
  • backend/apps/ai/Makefile (1 hunks)
  • backend/apps/ai/common/extractors/repository.py (1 hunks)
  • backend/apps/ai/management/commands/ai_update_repository_chunks.py (1 hunks)
  • backend/apps/ai/management/commands/ai_update_repository_context.py (1 hunks)
  • backend/apps/ai/models/chunk.py (3 hunks)
  • backend/apps/slack/MANIFEST.yaml (3 hunks)
  • backend/apps/slack/admin/conversation.py (1 hunks)
  • backend/apps/slack/apps.py (1 hunks)
  • backend/apps/slack/commands/__init__.py (1 hunks)
  • backend/apps/slack/commands/ai.py (1 hunks)
  • backend/apps/slack/common/handlers/ai.py (1 hunks)
  • backend/apps/slack/common/question_detector.py (1 hunks)
  • backend/apps/slack/constants.py (1 hunks)
  • backend/apps/slack/events/__init__.py (1 hunks)
  • backend/apps/slack/events/message_posted.py (1 hunks)
  • backend/apps/slack/migrations/0019_conversation_is_nest_bot_assistant_enabled.py (1 hunks)
  • backend/apps/slack/models/conversation.py (1 hunks)
  • backend/apps/slack/services/__init__.py (1 hunks)
  • backend/apps/slack/services/message_auto_reply.py (1 hunks)
  • backend/apps/slack/templates/commands/ai.jinja (1 hunks)
  • backend/pyproject.toml (1 hunks)
  • backend/settings/base.py (2 hunks)
  • backend/settings/urls.py (1 hunks)
  • backend/test_commands.py (1 hunks)
  • backend/tests/apps/ai/common/extractors/repository_test.py (1 hunks)
  • backend/tests/apps/ai/common/utils_test.py (1 hunks)
  • backend/tests/apps/ai/management/commands/ai_update_repository_chunks_test.py (1 hunks)
  • backend/tests/apps/ai/management/commands/ai_update_repository_context_test.py (1 hunks)
  • backend/tests/apps/ai/models/chunk_test.py (4 hunks)
  • backend/tests/apps/slack/commands/ai_test.py (1 hunks)
  • backend/tests/apps/slack/common/handlers/ai_test.py (1 hunks)
  • backend/tests/apps/slack/common/question_detector_test.py (1 hunks)
  • backend/tests/apps/slack/events/message_posted_test.py (1 hunks)
  • backend/tests/apps/slack/services/message_auto_reply_test.py (1 hunks)
  • cspell/custom-dict.txt (2 hunks)
  • docker-compose/local.yaml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (20)
backend/apps/slack/apps.py (2)
backend/apps/mentorship/apps.py (1)
  • ready (12-14)
backend/apps/slack/events/__init__.py (1)
  • configure_slack_events (1-19)
backend/apps/slack/common/handlers/ai.py (2)
backend/apps/ai/agent/tools/rag/rag_tool.py (2)
  • RagTool (13-63)
  • query (34-63)
backend/apps/slack/blocks.py (1)
  • markdown (21-34)
backend/tests/apps/slack/common/handlers/ai_test.py (2)
backend/apps/slack/common/handlers/ai.py (2)
  • get_error_blocks (49-61)
  • process_ai_query (31-46)
backend/apps/ai/agent/tools/rag/rag_tool.py (1)
  • query (34-63)
backend/apps/ai/management/commands/ai_update_repository_context.py (4)
backend/apps/ai/common/base/context_command.py (1)
  • BaseContextCommand (9-54)
backend/apps/ai/common/extractors/repository.py (1)
  • extract_repository_markdown_content (114-174)
backend/apps/github/models/repository.py (1)
  • Repository (25-349)
backend/apps/ai/management/commands/ai_update_repository_chunks.py (5)
  • Command (10-36)
  • extract_content (14-16)
  • get_base_queryset (18-28)
  • get_default_queryset (30-32)
  • source_name (34-36)
backend/tests/apps/ai/management/commands/ai_update_repository_context_test.py (3)
backend/apps/ai/management/commands/ai_update_repository_context.py (5)
  • Command (10-36)
  • source_name (34-36)
  • extract_content (14-16)
  • get_base_queryset (18-28)
  • get_default_queryset (30-32)
backend/apps/ai/common/base/context_command.py (1)
  • BaseContextCommand (9-54)
backend/apps/github/models/repository.py (1)
  • Repository (25-349)
backend/apps/slack/commands/ai.py (2)
backend/apps/slack/commands/command.py (1)
  • CommandBase (20-157)
backend/apps/slack/common/handlers/ai.py (1)
  • get_blocks (13-28)
backend/apps/slack/events/message_posted.py (7)
backend/apps/slack/common/handlers/ai.py (1)
  • get_blocks (13-28)
backend/apps/slack/common/question_detector.py (2)
  • QuestionDetector (13-57)
  • is_owasp_question (31-42)
backend/apps/slack/events/event.py (1)
  • EventBase (23-265)
backend/apps/slack/models/conversation.py (2)
  • Conversation (15-107)
  • update_data (85-107)
backend/apps/slack/models/member.py (1)
  • Member (10-78)
backend/apps/slack/models/message.py (2)
  • Message (15-160)
  • text (83-85)
backend/apps/slack/services/message_auto_reply.py (1)
  • generate_ai_reply_if_unanswered (16-48)
backend/apps/slack/common/question_detector.py (1)
backend/apps/slack/models/message.py (1)
  • text (83-85)
backend/tests/apps/slack/services/message_auto_reply_test.py (4)
backend/apps/slack/models/conversation.py (1)
  • Conversation (15-107)
backend/apps/slack/models/message.py (3)
  • Message (15-160)
  • text (83-85)
  • ts (88-90)
backend/apps/slack/models/workspace.py (2)
  • Workspace (11-44)
  • bot_token (37-44)
backend/apps/slack/services/message_auto_reply.py (1)
  • generate_ai_reply_if_unanswered (16-48)
backend/tests/apps/ai/management/commands/ai_update_repository_chunks_test.py (3)
backend/apps/ai/management/commands/ai_update_repository_chunks.py (5)
  • Command (10-36)
  • source_name (34-36)
  • extract_content (14-16)
  • get_base_queryset (18-28)
  • get_default_queryset (30-32)
backend/apps/ai/common/base/chunk_command.py (1)
  • BaseChunkCommand (12-91)
backend/apps/github/models/repository.py (1)
  • Repository (25-349)
backend/apps/slack/services/message_auto_reply.py (4)
backend/apps/slack/common/handlers/ai.py (1)
  • process_ai_query (31-46)
backend/apps/slack/models/message.py (3)
  • Message (15-160)
  • ts (88-90)
  • text (83-85)
backend/apps/slack/models/workspace.py (1)
  • bot_token (37-44)
backend/apps/ai/agent/tools/rag/rag_tool.py (1)
  • query (34-63)
backend/apps/ai/common/extractors/repository.py (1)
backend/apps/github/utils.py (1)
  • get_repository_file_content (60-79)
backend/apps/slack/events/__init__.py (2)
backend/apps/slack/apps.py (1)
  • SlackConfig (13-33)
backend/apps/slack/events/event.py (2)
  • EventBase (23-265)
  • configure_events (30-37)
backend/tests/apps/ai/models/chunk_test.py (3)
backend/tests/apps/ai/common/base/chunk_command_test.py (2)
  • mock_chunks (72-81)
  • mock_context (51-60)
backend/apps/ai/models/chunk.py (2)
  • Chunk (16-93)
  • bulk_save (34-53)
backend/tests/apps/ai/common/base/context_command_test.py (1)
  • mock_context (48-55)
backend/apps/ai/management/commands/ai_update_repository_chunks.py (4)
backend/apps/ai/common/base/chunk_command.py (1)
  • BaseChunkCommand (12-91)
backend/apps/ai/common/extractors/repository.py (1)
  • extract_repository_markdown_content (114-174)
backend/apps/github/models/repository.py (1)
  • Repository (25-349)
backend/apps/ai/management/commands/ai_update_repository_context.py (5)
  • Command (10-36)
  • extract_content (14-16)
  • get_base_queryset (18-28)
  • get_default_queryset (30-32)
  • source_name (34-36)
backend/tests/apps/slack/common/question_detector_test.py (2)
backend/apps/slack/common/question_detector.py (4)
  • QuestionDetector (13-57)
  • is_question (44-46)
  • contains_owasp_keywords (48-57)
  • is_owasp_question (31-42)
backend/tests/apps/slack/events/message_posted_test.py (1)
  • test_init (44-48)
backend/tests/apps/slack/events/message_posted_test.py (3)
backend/apps/slack/events/message_posted.py (1)
  • MessagePosted (17-77)
backend/apps/slack/models/conversation.py (1)
  • Conversation (15-107)
backend/apps/slack/models/member.py (1)
  • Member (10-78)
backend/apps/ai/models/chunk.py (3)
backend/apps/ai/models/context.py (1)
  • Context (15-61)
backend/apps/common/models.py (2)
  • BulkSaveModel (10-34)
  • TimestampedModel (37-46)
backend/apps/common/utils.py (1)
  • truncate (164-176)
backend/tests/apps/slack/commands/ai_test.py (2)
backend/apps/slack/commands/ai.py (1)
  • Ai (6-23)
backend/apps/slack/commands/command.py (1)
  • CommandBase (20-157)
backend/tests/apps/ai/common/extractors/repository_test.py (1)
backend/apps/ai/common/extractors/repository.py (2)
  • extract_repository_content (11-111)
  • extract_repository_markdown_content (114-174)
🔇 Additional comments (26)
backend/pyproject.toml (1)

22-22: Manually verify django-rq compatibility with Python 3.13 and Django 5.1

Poetry wasn’t available to run the compatibility checks; please confirm at runtime that django-rq 3.1 (and its rq/redis transitive dependencies) works under Python 3.13 and Django 5.1, and pin rq/redis in pyproject.toml if needed.

cspell/custom-dict.txt (1)

109-109: Dictionary additions OK

“rqworker” and “webgoat” entries make sense for this repo’s domain.

Also applies to: 125-125

backend/apps/slack/services/__init__.py (1)

1-1: Package initializer LGTM

Clear, minimal package docstring. No issues.

backend/settings/urls.py (1)

31-31: Django-RQ dashboard is staff-only by default

Unconditional path("django-rq/", include("django_rq.urls")) is safe since django-rq v3.1+ enforces staff-only access on its dashboard views.

backend/settings/base.py (1)

52-53: Django RQ app registration looks correct.

Adding "django_rq" to THIRD_PARTY_APPS is appropriate and consistent with how INSTALLED_APPS is composed.

backend/tests/apps/ai/common/utils_test.py (1)

97-107: Decorator order and side-effect wiring are correct.

Patching order matches argument order; the AttributeError simulates the None-context path as intended.

backend/apps/ai/models/chunk.py (2)

34-53: No downstream code depends on chunks being cleared. Chunk.bulk_save is only called in chunk_command.py (where the list isn’t used after the call) and in tests (which don’t assert on the list), so the updated clear-only-when-empty logic doesn’t break any callers.


19-23: Replace deprecated unique_together with a UniqueConstraint and plan a safe migration

 class Meta:
     db_table = "ai_chunks"
     verbose_name = "Chunk"
-    unique_together = ("context", "text")
+    constraints = [
+        models.UniqueConstraint(
+            fields=("context", "text"),
+            name="uq_ai_chunks_context_text",
+        )
+    ]

Add a data migration to dedupe existing rows before enforcing the constraint and use CREATE INDEX CONCURRENTLY on Postgres to avoid downtime.
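Before the constraint lands, the dedupe migration has to pick survivors. A minimal sketch of that selection logic, keeping the lowest id per (context, text) key (the function name and row shape are illustrative; an actual migration would delete the returned ids via the ORM):

```python
def duplicate_ids(rows):
    """Given (id, context_id, text) tuples, return the ids of duplicate rows,
    keeping the lowest id for each (context_id, text) key."""
    survivors = {}
    doomed = []
    for row_id, context_id, text in sorted(rows):
        key = (context_id, text)
        if key in survivors:
            doomed.append(row_id)  # later id for an already-seen key
        else:
            survivors[key] = row_id
    return doomed

rows = [(3, 1, "a"), (1, 1, "a"), (2, 2, "a"), (4, 1, "a")]
print(duplicate_ids(rows))  # [3, 4]
```

In a data migration this would feed something like `Chunk.objects.filter(id__in=doomed).delete()` before the constraint is added.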

backend/tests/apps/ai/models/chunk_test.py (1)

101-104: LGTM on direct context-based existence checks.

The assertions correctly verify the switch to context=mock_context, text=text.

Also applies to: 125-128, 148-150

backend/apps/ai/common/extractors/repository.py (1)

108-111: LGTM on structured repository metadata extraction.

The prose/metadata split and DELIMITER usage align with the RAG pipeline.

backend/apps/slack/commands/__init__.py (1)

5-5: LGTM: /ai command is now exported/loaded

Importing ai ensures command discovery via CommandBase.configure_commands().

backend/apps/slack/models/conversation.py (1)

30-32: LGTM: opt-in feature flag on Conversation

Good that from_slack() does not overwrite this local flag.

backend/apps/ai/management/commands/ai_update_repository_context.py (1)

30-32: Upstream get_base_queryset has no is_active filter

BaseAICommand.get_base_queryset returns .objects.all() with no implicit filters, so overriding get_default_queryset correctly avoids filtering on the nonexistent is_active field.

backend/apps/slack/MANIFEST.yaml (1)

98-103: /ai slash command looks correct.

URL, usage, and escaping align with the existing pattern. Ensure the handler responds within 3s to Slack’s slash command ACK.
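Slack requires slash commands to be acknowledged within 3 seconds, so the slow RAG query should run only after an immediate ACK. A framework-agnostic sketch of that pattern (handle_slash_command and run_ai_query are hypothetical names, not the project's actual handler):

```python
import threading

def run_ai_query(query: str) -> str:
    # Placeholder for the slow RAG call (e.g. RagTool.query); body is illustrative.
    return f"answer for: {query}"

def handle_slash_command(ack, respond, query: str) -> threading.Thread:
    """Acknowledge within Slack's 3-second window, then answer in the background."""
    ack()  # return an immediate 200 to Slack before doing slow work
    worker = threading.Thread(target=lambda: respond(run_ai_query(query)), daemon=True)
    worker.start()
    return worker
```

With slack_bolt the shape is the same: call ack() first, then deliver the real answer via a deferred respond/post.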

backend/apps/slack/migrations/0019_conversation_is_nest_bot_assistant_enabled.py (1)

1-1: Django version compatibility confirmed

The project’s pyproject.toml declares django = "^5.1" (i.e. ≥5.1,<6.0), which includes Django 5.2.5, so the migration header is consistent with the dependency range.

backend/apps/ai/management/commands/ai_update_repository_chunks.py (2)

14-17: Good: delegates extraction cleanly.


34-36: Consistent source_name.

Matches context command; good for downstream filters.

backend/tests/apps/slack/services/message_auto_reply_test.py (1)

17-43: Fixtures are clean and focused. LGTM.

backend/tests/apps/slack/common/question_detector_test.py (2)

170-186: Good real-world coverage

Nice parameterized cases that exercise diverse OWASP queries.


188-196: Initialization patching looks solid

Mocking OWASP_KEYWORDS verifies configurability without leaking global state.

backend/tests/apps/slack/commands/ai_test.py (2)

18-42: Patch target and whitespace trimming behavior look correct

Verifies the local import usage and exact query passed to get_blocks.


127-140: Exception propagation test is appropriate

Confirms render_blocks does not swallow errors, leaving caller to handle.

backend/tests/apps/ai/common/extractors/repository_test.py (1)

241-256: Baseline behavior verified

Description is included; metadata intentionally empty for markdown extraction. Looks good.

backend/tests/apps/slack/events/message_posted_test.py (3)

95-123: Bot-mention flow is well covered

Auth caching, mention stripping, and block rendering checks are solid.


210-292: OWASP question happy-path coverage looks good

Validates DB updates and delayed enqueue to the default queue.


348-375: Whitespace-trim behavior verified

Ensures query cleanup before rendering blocks. Nice.

Comment on lines 340 to 354
def test_extract_repository_markdown_content_file_fetch_exception(self, mock_get_content):
"""Test extraction when file fetching raises an exception."""
repository = MagicMock()
repository.description = "Test repository"
repository.organization = MagicMock(login="test-org")
repository.key = "test-repo"
repository.default_branch = "main"

mock_get_content.side_effect = Exception("Network error")

prose, metadata = extract_repository_markdown_content(repository)

assert "Repository Description: Test repository" in prose
assert metadata == ""

Copy link
Contributor


⚠️ Potential issue

Mismatch: extractor doesn’t catch generic exceptions

extract_repository_markdown_content currently catches a narrow set; these tests raise Exception, which will bubble and fail. Broaden exception handling in the extractor and keep going.

Proposed change in apps/ai/common/extractors/repository.py:

# inside the for file_path in markdown_files loop
raw_url = f"https://raw.githubusercontent.com/{owner}/{repository.key}/{repository.default_branch or 'main'}/{file_path}"
try:
    content = get_repository_file_content(raw_url)
except Exception as exc:  # broad by design; network/parsing can vary
    logger.debug(
        "Failed to fetch %s for %s/%s from branch %s: %s",
        file_path,
        owner or "(unknown owner)",
        repository.key or "(no key)",
        repository.default_branch or "main",
        exc,
    )
    continue
🤖 Prompt for AI Agents
In backend/tests/apps/ai/common/extractors/repository_test.py around lines 340
to 354, the test triggers a generic Exception when fetching file content but the
extractor only catches narrow exceptions; update
apps/ai/common/extractors/repository.py inside the markdown_files loop to wrap
the get_repository_file_content call in a try/except that catches Exception, log
a debug message including file_path, owner (or "(unknown owner)"),
repository.key (or "(no key)"), branch (or "main") and the exception, then
continue to the next file so extraction proceeds without raising.

@Dishant1804 Dishant1804 force-pushed the www-repository branch 2 times, most recently from 6439df7 to b198ecf on September 6, 2025 17:55
Copy link

sonarqubecloud bot commented Sep 6, 2025

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/apps/ai/models/chunk.py (1)

85-93: update_data has a race on duplicates; rely on the DB constraint

The pre-check can race; catch IntegrityError instead and return None.

-        if Chunk.objects.filter(context=context, text=text).exists():
-            return None
-
-        chunk = Chunk(text=text, embedding=embedding, context=context)
-
-        if save:
-            chunk.save()
-
-        return chunk
+        chunk = Chunk(text=text, embedding=embedding, context=context)
+        if save:
+            try:
+                chunk.save()
+            except Exception as exc:  # narrowed below via local import
+                from django.db import IntegrityError
+                if isinstance(exc, IntegrityError):
+                    # Duplicate per unique (context, text); treat as no-op
+                    logger.debug("Chunk already exists for context_id=%s; skipping", getattr(context, "id", None))
+                    return None
+                raise
+        return chunk

If you prefer top-level imports, add:

-from django.db import DatabaseError, models
+from django.db import DatabaseError, IntegrityError, models
♻️ Duplicate comments (2)
backend/apps/ai/models/chunk.py (1)

35-53: bulk_save drops updates, does N+1 queries, and breaks the clear-list contract

  • Updates are filtered out (ids present) if an identical (context, text) exists.
  • Per-object .exists() causes O(N) queries.
  • Passing a new list to BulkSaveModel means the caller’s list isn’t cleared unless no uniques remain.
  • Duplicate races aren’t handled; rely on the DB unique constraint and catch IntegrityError.

Replace the body with a single-probe, race-safe implementation that preserves the contract.

     @staticmethod
     def bulk_save(chunks, fields=None):
-        """Bulk save chunks with duplicate handling."""
-        unique_chunks = []
-        unique_chunks = [
-            chunk
-            for chunk in chunks
-            if not Chunk.objects.filter(context=chunk.context, text=chunk.text).exists()
-        ]
-
-        if unique_chunks:
-            try:
-                BulkSaveModel.bulk_save(Chunk, unique_chunks, fields=fields)
-            except (ValueError, TypeError, DatabaseError):
-                for chunk in unique_chunks:
-                    try:
-                        chunk.save()
-                    except (ValueError, TypeError, DatabaseError) as save_error:
-                        logger.exception("Failed to save chunk", exc_info=save_error)
-        else:
-            chunks.clear()
+        """Bulk save chunks with duplicate handling and minimal queries."""
+        if not chunks:
+            return
+
+        # Partition: updates vs. creates
+        to_update = [c for c in chunks if getattr(c, "id", None)]
+        to_create = [c for c in chunks if not getattr(c, "id", None)]
+
+        # In-memory dedupe among creates by (context_id, text)
+        seen: set[tuple[int | None, str]] = set()
+        deduped_create = []
+        for c in to_create:
+            ctx_id = getattr(c, "context_id", None) or getattr(getattr(c, "context", None), "id", None)
+            key = (ctx_id, c.text)
+            if key in seen:
+                continue
+            seen.add(key)
+            deduped_create.append(c)
+
+        # One DB probe for existing keys among incoming creates
+        ctx_ids = {getattr(c, "context_id", None) or getattr(getattr(c, "context", None), "id", None) for c in deduped_create}
+        texts = {c.text for c in deduped_create}
+        existing_keys: set[tuple[int, str]] = set()
+        if ctx_ids and texts:
+            existing_keys = set(
+                Chunk.objects.filter(context_id__in=ctx_ids, text__in=texts)
+                .values_list("context_id", "text")
+            )
+
+        final_create = []
+        for c in deduped_create:
+            ctx_id = getattr(c, "context_id", None) or getattr(getattr(c, "context", None), "id", None)
+            if (ctx_id, c.text) not in existing_keys:
+                final_create.append(c)
+
+        unique_chunks = to_update + final_create
+        if not unique_chunks:
+            chunks.clear()
+            return
+
+        try:
+            # Maintain contract: operate on caller list so BulkSaveModel clears it
+            chunks[:] = unique_chunks
+            BulkSaveModel.bulk_save(Chunk, chunks, fields=fields)
+        except (ValueError, TypeError, DatabaseError) as exc:
+            # Fallback: per-object saves with race-safe duplicate handling
+            from django.db import IntegrityError  # local import to avoid global if unused
+            for obj in unique_chunks:
+                try:
+                    if getattr(obj, "id", None) and fields:
+                        obj.save(update_fields=fields)
+                    else:
+                        obj.save()
+                except IntegrityError:
+                    # Duplicate create due to race/uniqueness; ignore
+                    continue
+                except (ValueError, TypeError, DatabaseError):
+                    logger.exception("Failed to save chunk id=%s", getattr(obj, "id", None))
backend/tests/apps/ai/common/extractors/repository_test.py (1)

340-354: Add a test for generic exceptions from the fetcher

Past feedback asked to catch generic Exception in the extractor; add a case that raises Exception to ensure we don’t break on unexpected errors.

     @patch("apps.ai.common.extractors.repository.get_repository_file_content")
     def test_extract_repository_markdown_content_file_fetch_exception(self, mock_get_content):
         """Test extraction when file fetching raises an exception."""
         repository = MagicMock()
         repository.description = "Test repository"
         repository.organization = MagicMock(login="test-org")
         repository.key = "test-repo"
         repository.default_branch = "main"

-        mock_get_content.side_effect = ConnectionError("Network error")
+        mock_get_content.side_effect = Exception("Network error")

         prose, metadata = extract_repository_markdown_content(repository)

         assert "Repository Description: Test repository" in prose
         assert metadata == ""
🧹 Nitpick comments (5)
backend/apps/ai/models/chunk.py (2)

50-51: Use logger.exception correctly

logger.exception already includes traceback; passing exc_info object is unnecessary and atypical.

-                        logger.exception("Failed to save chunk", exc_info=save_error)
+                        logger.exception("Failed to save chunk id=%s", getattr(chunk, "id", None))

19-23: Prefer UniqueConstraint over unique_together

Django now favors explicit UniqueConstraint with a name; easier to reference in migrations and logs.

-        unique_together = ("context", "text")
+        constraints = [
+            models.UniqueConstraint(
+                fields=["context", "text"],
+                name="uniq_ai_chunk_context_text",
+            )
+        ]
backend/apps/ai/management/commands/ai_update_repository_chunks.py (1)

18-28: Optimize queryset for extractor access

Select related FKs to avoid extra queries when building raw URLs (owner/org), and narrow fields if desired.

         return (
             super()
             .get_base_queryset()
             .filter(
                 is_owasp_site_repository=True,
                 is_archived=False,
                 is_empty=False,
             )
-        )
+        ).select_related("organization", "owner")
backend/tests/apps/ai/common/extractors/repository_test.py (2)

407-420: Assert no network calls when owner/org missing

When both are absent, extractor should not attempt fetches. Patch and assert not called.

-    def test_extract_repository_markdown_content_no_owner_or_org(self):
+    @patch("apps.ai.common.extractors.repository.get_repository_file_content")
+    def test_extract_repository_markdown_content_no_owner_or_org(self, mock_get_content):
         """Test extraction when repository has neither organization nor owner."""
         repository = MagicMock()
         repository.description = "Test repository"
         repository.organization = None
         repository.owner = None
         repository.key = "test-repo"
         repository.default_branch = "main"

         prose, metadata = extract_repository_markdown_content(repository)

         assert "Repository Description: Test repository" in prose
         assert metadata == ""
+        mock_get_content.assert_not_called()

421-433: Also assert no network calls when key is missing

Key is required to form the raw URL; ensure no fetch attempted.

-    def test_extract_repository_markdown_content_no_key(self):
+    @patch("apps.ai.common.extractors.repository.get_repository_file_content")
+    def test_extract_repository_markdown_content_no_key(self, mock_get_content):
         """Test extraction when repository has no key."""
         repository = MagicMock()
         repository.description = "Test repository"
         repository.organization = MagicMock(login="test-org")
         repository.key = None
         repository.default_branch = "main"

         prose, metadata = extract_repository_markdown_content(repository)

         assert "Repository Description: Test repository" in prose
         assert metadata == ""
+        mock_get_content.assert_not_called()
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a1f1b70 and 103fb24.

📒 Files selected for processing (12)
  • backend/apps/ai/Makefile (1 hunks)
  • backend/apps/ai/common/extractors/repository.py (1 hunks)
  • backend/apps/ai/management/commands/ai_update_repository_chunks.py (1 hunks)
  • backend/apps/ai/management/commands/ai_update_repository_context.py (1 hunks)
  • backend/apps/ai/models/chunk.py (3 hunks)
  • backend/test_commands.py (1 hunks)
  • backend/tests/apps/ai/common/extractors/repository_test.py (1 hunks)
  • backend/tests/apps/ai/common/utils_test.py (1 hunks)
  • backend/tests/apps/ai/management/commands/ai_update_repository_chunks_test.py (1 hunks)
  • backend/tests/apps/ai/management/commands/ai_update_repository_context_test.py (1 hunks)
  • backend/tests/apps/ai/models/chunk_test.py (4 hunks)
  • cspell/custom-dict.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
  • backend/test_commands.py
  • backend/tests/apps/ai/management/commands/ai_update_repository_context_test.py
  • backend/tests/apps/ai/models/chunk_test.py
  • backend/apps/ai/management/commands/ai_update_repository_context.py
  • backend/apps/ai/common/extractors/repository.py
  • cspell/custom-dict.txt
  • backend/apps/ai/Makefile
  • backend/tests/apps/ai/management/commands/ai_update_repository_chunks_test.py
  • backend/tests/apps/ai/common/utils_test.py
🧰 Additional context used
🧬 Code graph analysis (3)
backend/tests/apps/ai/common/extractors/repository_test.py (1)
backend/apps/ai/common/extractors/repository.py (2)
  • extract_repository_content (11-111)
  • extract_repository_markdown_content (114-174)
backend/apps/ai/management/commands/ai_update_repository_chunks.py (4)
backend/apps/ai/common/base/chunk_command.py (1)
  • BaseChunkCommand (12-91)
backend/apps/ai/common/extractors/repository.py (1)
  • extract_repository_markdown_content (114-174)
backend/apps/github/models/repository.py (1)
  • Repository (25-349)
backend/apps/ai/management/commands/ai_update_repository_context.py (5)
  • Command (10-36)
  • extract_content (14-16)
  • get_base_queryset (18-28)
  • get_default_queryset (30-32)
  • source_name (34-36)
backend/apps/ai/models/chunk.py (2)
backend/apps/ai/models/context.py (1)
  • Context (15-61)
backend/apps/common/models.py (2)
  • BulkSaveModel (10-34)
  • TimestampedModel (37-46)
🔇 Additional comments (5)
backend/apps/ai/management/commands/ai_update_repository_chunks.py (4)

10-12: LGTM: command wiring is correct

key_field_name and model_class look right.


14-16: LGTM: delegates to the markdown extractor

Extraction path aligns with the PR goals.


30-36: LGTM: default queryset override avoids is_active

Repository lacks is_active; good override.


18-28: Verified OWASP site repository filter uses www- prefixes

check_owasp_site_repository keys off key.startswith(("www-chapter-", "www-committee-", "www-event", "www-project-")), so the queryset filter aligns with the intended www- heuristic.

backend/tests/apps/ai/common/extractors/repository_test.py (1)

3-3: CI/python requirements support datetime.UTC

Project’s pyproject.toml specifies python = "^3.13" (>= 3.13), which includes the datetime.UTC alias introduced in Python 3.11; no action needed.

@Dishant1804 Dishant1804 marked this pull request as ready for review September 6, 2025 19:27
Copy link
Collaborator

@arkid15r arkid15r left a comment


@Dishant1804 you've submitted 1000+ lines of code for review.
I have a couple of questions:

  • could you explain what OWASP Nest problem this PR solves?
  • how did you verify that problem is solved with this PR?

@Dishant1804
Copy link
Collaborator Author

@Dishant1804 you've submitted 1000+ lines of code for review. I have a couple of questions:

  • could you explain what OWASP Nest problem this PR solves?
  • how did you verify that problem is solved with this PR?

This PR gets the .md files that we have in www- repositories and saves them as chunks and context so that we have more knowledge.
I checked the admin panel for chunks and context, and also ran it on some example queries.

@arkid15r
Copy link
Collaborator

This PR gets the .md files that we have in www- repositories and saves them as chunks and context so that we have more knowledge. I checked the admin panel for chunks and context, and also ran it on some example queries.

Alright, so it sounds like you checked it with at least one repository and some relevant queries. Can you provide some examples?

@Dishant1804
Collaborator Author

This PR fetches the .md files that we have in www- repositories and saves them as chunks and context so that we can have more knowledge. I checked the admin panel for chunks and context, and also ran it on some example queries.

Alright, so it sounds like you checked it with at least one repository and some relevant queries. Can you provide some examples?

This produces a lot of chunks that are not up to the mark; maybe we should change the chunking delimiter or chunking strategy for this case, because the content has a lot of delimiters.

@arkid15r
Collaborator

This produces a lot of chunks that are not up to the mark; maybe we should change the chunking delimiter or chunking strategy for this case, because the content has a lot of delimiters.

I meant your general approach for testing, e.g. did OWASP Nest repository content help to improve NestBot answers/context for any specific questions? Did it get new info based on this?

My opinion -- it didn't. I haven't tested it yet, but based on this code I can reasonably assume that no new [useful] content was added by these jobs and this PR in general. I also reasonably assume that this is AI-generated code that has nothing to do with a real-life scenario. I call it cosplay code.

The bottom line is that this code has 0 value in its current form, but you requested mentors to review it w/o testing. I don't appreciate this kind of disrespect to mentors' time. This is not acceptable.

@Dishant1804
Collaborator Author

This produces a lot of chunks that are not up to the mark; maybe we should change the chunking delimiter or chunking strategy for this case, because the content has a lot of delimiters.

I meant your general approach for testing, e.g. did OWASP Nest repository content help to improve NestBot answers/context for any specific questions? Did it get new info based on this?

My opinion -- it didn't. I haven't tested it yet, but based on this code I can reasonably assume that no new [useful] content was added by these jobs and this PR in general. I also reasonably assume that this is AI-generated code that has nothing to do with a real-life scenario. I call it cosplay code.

The bottom line is that this code has 0 value in its current form, but you requested mentors to review it w/o testing. I don't appreciate this kind of disrespect to mentors' time. This is not acceptable.

I'll update the chunking strategy for this and then request a review. We are getting the information from the README and CONTRIBUTING .md files, so the chunking needs to be proper.
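A header-based split is one markdown-aware option; a minimal sketch (the delimiter choice here is an assumption, not the project's settled strategy):

```python
import re

def split_markdown_by_headers(text: str) -> list[str]:
    """Split Markdown into chunks, starting a new chunk at each ATX header."""
    # Zero-width split on a lookahead, so each chunk keeps its own "# ..." line.
    chunks = re.split(r"(?m)^(?=#{1,6} )", text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]
```

Splitting on section boundaries instead of fixed delimiters keeps each chunk's heading attached to its body, which tends to produce more self-contained contexts.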

@Dishant1804 Dishant1804 marked this pull request as draft September 11, 2025 12:45
Successfully merging this pull request may close these issues.

Fetch .md content from OWASP www- repositories to improve NestBot contexts