Conversation

Dishant1804
Collaborator

Proposed change

Resolves #2144

  • create context for www- repositories
  • create chunks for the same
  • test coverage for it

Checklist

  • I've read and followed the contributing guidelines.
  • I've run make check-test locally; all checks and tests passed.

Contributor

coderabbitai bot commented Sep 1, 2025

Summary by CodeRabbit

  • New Features

    • Ingests repository content (including README and common docs) to enrich AI context.
    • Adds background jobs to update repository context and content chunks.
    • Expands repository metadata captured (status, features, counts, timestamps).
  • Bug Fixes

    • Prevents duplicate content chunks and improves error handling during bulk operations.
    • Enhances logging for easier diagnostics.
  • Tests

    • Adds comprehensive tests for repository extraction, background jobs, and chunk handling.
  • Chores

    • Adds make targets for easier job execution.
    • Updates spelling dictionary.

Walkthrough

Adds repository Markdown content extraction, new Django management commands to update repository chunks and context using that extractor, corresponding Makefile targets, and extensive tests. Also updates Chunk model with uniqueness constraints, logging, and duplicate-handling in bulk save, plus a spelling dictionary entry.

Changes

Cohort / File(s) Summary
Makefile targets
backend/apps/ai/Makefile
Adds ai-update-repository-chunks and ai-update-repository-context targets invoking new Django commands.
Repository extractors
backend/apps/ai/common/extractors/repository.py
Adds extract_repository_content and extract_repository_markdown_content to build prose/metadata and fetch common Markdown files from GitHub raw URLs.
Management commands: repository
backend/apps/ai/management/commands/ai_update_repository_chunks.py, backend/apps/ai/management/commands/ai_update_repository_context.py
New commands for generating chunks and updating context for repositories; filter base queryset to is_owasp_site_repository=True, is_archived=False, is_empty=False; use markdown extractor; set key_field_name="key", source_name="owasp_repository_markdown".
Chunk model enhancements
backend/apps/ai/models/chunk.py
Adds logging, catches DatabaseError, enforces unique_together = ("context","text"), refactors duplicate checks, and improves bulk save with per-item fallback.
Tests: extractors
backend/tests/apps/ai/common/extractors/repository_test.py
Adds comprehensive tests for repository content and markdown extraction, URL construction, fallbacks, and logging.
Tests: management commands
backend/tests/apps/ai/management/commands/ai_update_repository_chunks_test.py, backend/tests/apps/ai/management/commands/ai_update_repository_context_test.py
Verify inheritance, properties, filtering, source name, help text, and delegation to extractor.
Tests: Chunk model
backend/tests/apps/ai/models/chunk_test.py
Updates tests to match new duplicate-check and bulk save behavior using Chunk.objects.filter(...).exists().
Tests: utils
backend/tests/apps/ai/common/utils_test.py
Adjusts a test to mock Chunk.update_data raising when context is None; validates logging and empty result.
Misc test placeholder
backend/test_commands.py
Adds a placeholder module docstring.
Spelling dictionary
cspell/custom-dict.txt
Adds “repositorys”.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Assessment against linked issues

Objective Addressed Explanation
Fetch and index Markdown content from OWASP repositories prefixed with www- to improve NestBot contexts (#2144) Markdown fetching and indexing are implemented, but selection filters use is_owasp_site_repository without enforcing the www- name prefix. Scope alignment with the exact repo naming is unclear.

Assessment against linked issues: Out-of-scope changes

Code Change Explanation
Chunk model uniqueness, logging, and bulk save refactor (backend/apps/ai/models/chunk.py) Modifies storage behavior and constraints; not specifically required to fetch/index Markdown from www- repositories.

Suggested labels

backend, backend-tests, nestbot, makefile

Suggested reviewers

  • kasya
  • arkid15r


Contributor

coderabbitai bot left a comment


Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
docker-compose/local.yaml (1)

25-25: Bug: backend service sets DJANGO_REDIS_PASSWORD from DJANGO_REDIS_HOST.

This will set the password to the host value when DJANGO_REDIS_HOST is defined.

-      DJANGO_REDIS_PASSWORD: ${DJANGO_REDIS_HOST:-nest-cache-password}
+      DJANGO_REDIS_PASSWORD: ${DJANGO_REDIS_PASSWORD:-nest-cache-password}
backend/apps/ai/models/chunk.py (1)

85-93: Handle race with unique constraint in update_data().

Concurrent writers can still insert the same (context, text) between the exists() check and save(), raising IntegrityError. Catch and return None to match the method contract.

-        if save:
-            chunk.save()
+        if save:
+            from django.db import IntegrityError
+            try:
+                chunk.save()
+            except IntegrityError:
+                return None
backend/apps/slack/events/__init__.py (1)

1-20: Make event configuration idempotent and skip heavy imports when Slack app is disabled

Prevents duplicate registrations and avoids importing event modules if tokens are missing.

+_EVENTS_CONFIGURED = False
+
 def configure_slack_events():
     """Configure Slack events after Django apps are ready."""
-    from apps.slack.apps import SlackConfig
-    from apps.slack.events import (
-        app_home_opened,
-        message_posted,
-        team_join,
-        url_verification,
-    )
-    from apps.slack.events.event import EventBase
-    from apps.slack.events.member_joined_channel import (
-        catch_all,
-        contribute,
-        gsoc,
-        project_nest,
-    )
-
-    if SlackConfig.app:
-        EventBase.configure_events()
+    from apps.slack.apps import SlackConfig
+    global _EVENTS_CONFIGURED
+    if not SlackConfig.app or _EVENTS_CONFIGURED:
+        return
+    from apps.slack.events import (
+        app_home_opened,
+        message_posted,
+        team_join,
+        url_verification,
+    )
+    from apps.slack.events.event import EventBase
+    from apps.slack.events.member_joined_channel import (
+        catch_all,
+        contribute,
+        gsoc,
+        project_nest,
+    )
+    EventBase.configure_events()
+    _EVENTS_CONFIGURED = True
backend/tests/apps/ai/common/extractors/repository_test.py (1)

434-455: Log message lacks repository context expected by tests

Tests assert the debug log contains test-repo. Update extractor logging to include repo key/owner/path as shown above.

🧹 Nitpick comments (48)
backend/test_commands.py (1)

1-1: Avoid stray test_ module outside tests/*

Pytest will collect this empty module, adding noise. Either remove it or rename/move under a non-collected path (e.g., backend/commands_testutils.py) or into tests/ with actual tests.

backend/settings/base.py (1)

145-152: Harden RQ config: prefer URL form and make timeout configurable.

Using a single Redis URL avoids partial env mismatches and mirrors your cache config; exposing timeout via env helps ops tuning.

Apply within this block:

-        "default": {
-            "HOST": REDIS_HOST,
-            "PORT": 6379,
-            "PASSWORD": REDIS_PASSWORD,
-            "DEFAULT_TIMEOUT": 360,
-        }
+        "default": {
+            # Use URL form for parity with CACHES and simpler overrides.
+            "URL": f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:6379/0",
+            # Make tunable via env; 360s default preserved.
+            "DEFAULT_TIMEOUT": values.IntegerValue(
+                environ_name="RQ_DEFAULT_TIMEOUT",
+                default=360,
+            ),
+        }

Optionally consider adding:

  • RQ exception handlers that report to Sentry.
  • Separate queues (e.g., "ai", "slack") with different timeouts.
backend/tests/apps/ai/common/utils_test.py (1)

108-116: Add a minimal assertion to lock invocation count.

Verifies we don’t loop on failures.

         result = create_chunks_and_embeddings(
             chunk_texts=["some text"],
             context=None,
             openai_client=mock_openai_client,
         )
 
         mock_logger.exception.assert_called_once_with("Failed to create chunks and embeddings")
+        mock_update_data.assert_called_once()
         assert result == []
backend/apps/slack/constants.py (1)

25-80: Use frozenset for OWASP_KEYWORDS and prune overly generic entries
Convert OWASP_KEYWORDS to a frozenset to prevent accidental mutation; remove broad terms (e.g. “security”, “project”, “event”) to cut down on false positives—matching is already case-insensitive via text.lower(). Evaluate the false-positive rate on a representative Slack log before enabling auto-replies.
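A minimal sketch of the frozenset conversion (the keyword entries here are illustrative, not the real dictionary in apps.slack.constants):

```python
# Hypothetical keyword set; real entries live in apps.slack.constants.
OWASP_KEYWORDS = frozenset(
    {
        "owasp",
        "zap",
        "juice shop",
        "dependency-check",
    }
)


def mentions_owasp(text: str) -> bool:
    """Membership check on lowered text; the frozenset cannot be mutated."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in OWASP_KEYWORDS)
```

Because the set is immutable, any accidental `OWASP_KEYWORDS.add(...)` fails loudly at the call site instead of silently growing the match list.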

backend/apps/ai/models/chunk.py (1)

5-5: Also import IntegrityError.

If you adopt the changes above, add IntegrityError to imports.

-from django.db import DatabaseError, models
+from django.db import DatabaseError, IntegrityError, models
backend/tests/apps/ai/models/chunk_test.py (4)

31-42: Assert list-clearing contract after bulk_save.

If bulk_save is expected to clear the input list (as BulkSaveModel does), assert it to prevent regressions.

             with patch("apps.common.models.BulkSaveModel.bulk_save") as mock_bulk_save:
                 Chunk.bulk_save(mock_chunks)
                 mock_bulk_save.assert_called_once_with(Chunk, mock_chunks, fields=None)
+                assert mock_chunks == []

43-55: Mirror the clearing assertion for the fields path.

Keep expectations consistent across both code paths.

             with patch("apps.common.models.BulkSaveModel.bulk_save") as mock_bulk_save:
                 Chunk.bulk_save(mock_chunks, fields=fields)
                 mock_bulk_save.assert_called_once_with(Chunk, mock_chunks, fields=fields)
+                assert mock_chunks == []

65-86: Remove redundant @patch on Chunk.save.

You replace Chunk with a Mock later; the decorator patch becomes moot and can confuse readers.

-@patch("apps.ai.models.chunk.Chunk.save")
-def test_update_data_save_with_context(self, mock_save):
+def test_update_data_save_with_context(self):

31-55: Add a test ensuring updates aren’t filtered out.

Guard against regressions where chunks with ids (updates) get dropped by existence filtering.

Proposed new test:

def test_bulk_save_includes_updates():
    c1 = Mock(spec=Chunk); c1.id = 42; c1.context = Mock(); c1.text = "t1"
    c2 = Mock(spec=Chunk); c2.id = None; c2.context = Mock(); c2.text = "t2"
    with patch("apps.ai.models.chunk.Chunk.objects.filter") as mock_filter, \
         patch("apps.common.models.BulkSaveModel.bulk_save") as mock_bulk:
        mock_filter.return_value.exists.return_value = False
        chunks = [c1, c2]
        Chunk.bulk_save(chunks)
        # c1 (update) must still be passed through
        mock_bulk.assert_called_once()
        passed = mock_bulk.call_args.args[1]
        assert c1 in passed and c2 in passed

Want me to push this into the suite?

backend/apps/ai/common/extractors/repository.py (5)

130-146: Broaden markdown coverage (common variants).

Consider additional common filenames/paths to increase hit rate on OWASP www repos (case and format variants).

     markdown_files = [
         "README.md",
+        "README.rst",
+        "README.adoc",
+        "readme.md",
+        "README",
         "CONTRIBUTING.md",
         "CODE_OF_CONDUCT.md",
         "SECURITY.md",
         "CHANGELOG.md",
        "LICENSE.md",
+        "LICENSE",
        "docs/README.md",
         "docs/index.md",
         "docs/overview.md",
         "docs/getting-started.md",
         "docs/installation.md",
         "docs/usage.md",
         "docs/api.md",
         "docs/development.md",
         "docs/deployment.md",
+        "index.md",
+        "content/index.md",
+        "content/README.md",
+        "site/README.md",
     ]

156-160: Branch fallback for site repos.

If default_branch is missing, fall back to gh-pages for Pages-enabled repos, then main.

-                raw_url = (
-                    f"https://raw.githubusercontent.com/{owner}/{repository.key}/"
-                    f"{repository.default_branch or 'main'}/{file_path}"
-                )
+                branch = (
+                    repository.default_branch
+                    or ("gh-pages" if getattr(repository, "has_pages", False) else None)
+                    or "main"
+                )
+                raw_url = f"https://raw.githubusercontent.com/{owner}/{repository.key}/{branch}/{file_path}"

167-169: Improve logging with context.

Log which file/owner/repo failed and keep traceback; current message loses diagnostics.

-        except (ValueError, TypeError, OSError, ConnectionError):
-            logger.debug("Failed to fetch")
+        except (ValueError, TypeError, OSError, ConnectionError):
+            logger.debug(
+                "Failed to fetch markdown file",
+                extra={"file_path": file_path, "owner": owner, "repo": getattr(repository, "key", None)},
+                exc_info=True,
+            )
             continue

148-166: Surface fetched files in metadata (and drop unused variable otherwise).

Either remove fetched_files or expose it so downstream context can reference what was pulled.

-    fetched_files = []
+    fetched_files = []
@@
-                if content and content.strip():
-                    fetched_files.append(file_path)
+                if content and content.strip():
+                    fetched_files.append(file_path)
                     prose_parts.append(f"## {file_path}\n\n{content}")
@@
     return (
-        DELIMITER.join(filter(None, prose_parts)),
-        DELIMITER.join(filter(None, metadata_parts)),
+        DELIMITER.join(filter(None, prose_parts)),
+        DELIMITER.join(
+            filter(
+                None,
+                metadata_parts
+                + (
+                    [f"Fetched Markdown Files: {', '.join(fetched_files)}"]
+                    if fetched_files
+                    else []
+                ),
+            )
+        ),
     )

Also applies to: 171-174


148-166: Consider rate limits and 404 noise from raw URLs.

Fetching many files per repo can hit GitHub rate limits and raw 404s may return HTML. Prefer GitHub API for contents (with If-None-Match) or add status checks in get_repository_file_content to return "" on non-200.

I can provide an API-backed extractor variant and harden get_repository_file_content to ignore non-200 responses. Want a follow-up PR?
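The real get_repository_file_content signature isn't shown in this PR; a standalone sketch of the non-200 hardening, with an injectable opener so the status check is testable without network access:

```python
from urllib.request import urlopen
from urllib.error import URLError


def fetch_raw_file(url: str, opener=urlopen, timeout: int = 10) -> str:
    """Return body text only for a 200 response; raw-URL 404s serve HTML pages.

    `opener` is injectable (defaults to urllib's urlopen) purely for testing.
    """
    try:
        with opener(url, timeout=timeout) as response:
            if getattr(response, "status", None) != 200:
                return ""
            return response.read().decode("utf-8")
    except (URLError, OSError, ValueError):
        return ""
```

Returning `""` on any non-200 keeps the extractor's "empty means skip" contract intact while dropping the HTML error pages GitHub serves for missing raw files.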

backend/apps/slack/admin/conversation.py (1)

30-31: Expose the new flag in filters for quicker triage

Adding the boolean to list_filter eases admin workflows without altering data model.

Apply outside this hunk:

@@
     list_filter = (
         "sync_messages",
         "is_archived",
         "is_channel",
         "is_general",
         "is_im",
         "is_private",
+        "is_nest_bot_assistant_enabled",
     )
backend/apps/slack/commands/ai.py (2)

21-23: Avoid KeyError when Slack omits text; use a safe default.
Use .get("text") or "" to guard against missing/None and prevent an AttributeError on .strip().

-        return get_blocks(
-            query=command["text"].strip(),
-        )
+        return get_blocks(
+            query=(command.get("text") or "").strip(),
+        )

19-20: Document the local import intent.
Add a short comment so future readers know this is deliberate for lazy-loading/cycle avoidance.

-        from apps.slack.common.handlers.ai import get_blocks
+        # Local import to avoid circular import and reduce import-time overhead.
+        from apps.slack.common.handlers.ai import get_blocks
backend/apps/ai/Makefile (1)

37-44: Declare PHONY targets for the new commands.
Prevents make from treating similarly named files as up-to-date targets.

 ai-update-repository-context:
 	@echo "Updating repository context"
 	@CMD="python manage.py ai_update_repository_context" $(MAKE) exec-backend-command
 
+.PHONY: ai-update-repository-chunks ai-update-repository-context
backend/apps/ai/management/commands/ai_update_repository_context.py (1)

18-28: Eager-load FKs used by the extractor.
extract_repository_markdown_content accesses organization.login and owner.login; add select_related to avoid N+1 queries during batch runs.

         return (
             super()
             .get_base_queryset()
             .filter(
                 is_owasp_site_repository=True,
                 is_archived=False,
                 is_empty=False,
             )
-        )
+        ).select_related("organization", "owner")
backend/apps/slack/services/message_auto_reply.py (1)

33-35: Race window note (optional).
Small race exists between the replies check and posting; acceptable, but consider an idempotency key or re-check on 409s if this becomes noisy.
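If the race does become noisy, the cheapest mitigation is re-checking the thread immediately before posting; a sketch (method names per slack_sdk's WebClient, the surrounding service code is assumed):

```python
def post_reply_if_still_unanswered(client, channel: str, thread_ts: str, text: str):
    """Narrow the race window: re-check the thread right before posting."""
    replies = client.conversations_replies(channel=channel, ts=thread_ts, limit=1)
    messages = replies.get("messages") or [{}]
    if messages[0].get("reply_count", 0) > 0:
        # Someone answered between the initial check and now; stay silent.
        return None
    return client.chat_postMessage(channel=channel, thread_ts=thread_ts, text=text)
```

This doesn't eliminate the race, it just shrinks it to the gap between the second check and the post, which is usually acceptable for an auto-reply bot.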

backend/apps/slack/common/handlers/ai.py (4)

16-19: Fix docstring: remove undocumented parameter.
The presentation argument isn’t in the signature.

-        presentation (EntityPresentation | None): Configuration for entity presentation.

26-28: Guard against oversized Slack block text.
Slack mrkdwn text is limited (~3000 chars per section). Chunk long responses into multiple blocks.

-    if ai_response:
-        return [markdown(ai_response)]
+    if ai_response:
+        max_len = 2900
+        if len(ai_response) <= max_len:
+            return [markdown(ai_response)]
+        return [markdown(ai_response[i : i + max_len]) for i in range(0, len(ai_response), max_len)]
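The fixed-stride slice above can cut mid-word; if that matters, a boundary-aware splitter is a small step up (sketch; the 2900-char limit is an assumption kept below Slack's ~3000-char section cap):

```python
def split_for_slack(text: str, max_len: int = 2900) -> list[str]:
    """Split text into pieces of at most max_len, preferring newline/space breaks."""
    pieces: list[str] = []
    while len(text) > max_len:
        window = text[:max_len]
        # Prefer the last newline, then the last space, inside the window.
        cut = max(window.rfind("\n"), window.rfind(" "))
        if cut <= 0:  # a single unbroken run longer than max_len: hard split
            cut = max_len
        pieces.append(text[:cut].rstrip())
        text = text[cut:].lstrip()
    if text:
        pieces.append(text)
    return pieces
```

Each piece then maps to one `markdown(...)` block, and word boundaries survive except in pathological no-whitespace inputs.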

31-46: Make models configurable via settings.
Avoid hard-coding model IDs; read from Django settings with sane defaults.

+    from django.conf import settings
     rag_tool = RagTool(
-        chat_model="gpt-4o",
-        embedding_model="text-embedding-3-small",
+        chat_model=getattr(settings, "AI_CHAT_MODEL", "gpt-4o"),
+        embedding_model=getattr(settings, "AI_EMBEDDING_MODEL", "text-embedding-3-small"),
     )

10-10: Unused logger.
Either use logger for diagnostics or remove it.

backend/apps/slack/MANIFEST.yaml (2)

111-124: Tighten bot scopes (least privilege).

Line 123 adds channels:history, which is needed for message.channels. Line 122 (channels:manage) appears unnecessary for this feature set—drop it if unused.

       - users:read
-      - groups:write
-      - channels:manage
+      - groups:write
+      # Removed channels:manage (not required for read/AI reply workflows)
       - channels:history

131-134: Validate event coverage vs. intended surfaces.

message.channels covers public channels only. If replies are desired in private channels/DMs, add message.groups and message.im and corresponding groups:history/im:history scopes; otherwise consider app_mention to reduce noise.

backend/apps/slack/migrations/0019_conversation_is_nest_bot_assistant_enabled.py (1)

11-17: Field OK; consider index and default rollout.

  • If this flag is frequently filtered, add db_index=True or a model Index.
  • Default False will disable the assistant for all existing conversations. Confirm that’s intended or add a data migration to enable for selected workspaces/channels.
-            field=models.BooleanField(default=False, verbose_name="Is Nest Bot Assistant Enabled"),
+            field=models.BooleanField(
+                default=False,
+                verbose_name="Is Nest Bot Assistant Enabled",
+                db_index=True,
+            ),
backend/apps/ai/management/commands/ai_update_repository_chunks.py (1)

18-28: Avoid N+1 on owner/org during extraction.

The extractor touches repository.organization/owner; select_related will prevent per-row FK hits.

     def get_base_queryset(self) -> QuerySet:
         """Return the base queryset with filtering for OWASP site repositories."""
         return (
             super()
             .get_base_queryset()
+            .select_related("organization", "owner")
             .filter(
                 is_owasp_site_repository=True,
                 is_archived=False,
                 is_empty=False,
             )
         )
backend/tests/apps/ai/management/commands/ai_update_repository_chunks_test.py (3)

45-47: Fix pluralization to “repositories”.

Current expectation hardcodes a misspelling (“repositorys”) and bakes it into UX.

-        assert command.entity_name_plural == "repositorys"
+        assert command.entity_name_plural == "repositories"

Outside this file, consider deriving from model meta to avoid manual pluralization:

# In BaseAICommand (or similar)
@property
def entity_name_plural(self) -> str:
    return getattr(self.model_class._meta, "verbose_name_plural", f"{self.entity_name}s")

67-82: Patch target is brittle.

Patching command.class.bases[0].get_base_queryset relies on MRO order. Patch the concrete base symbol instead.

-        with patch.object(command.__class__.__bases__[0], "get_base_queryset") as mock_super:
+        from apps.ai.common.base.chunk_command import BaseChunkCommand
+        with patch.object(BaseChunkCommand, "get_base_queryset") as mock_super:

104-119: Remove duplicated test logic.

This repeats test_get_base_queryset. Drop to reduce noise.

-    def test_queryset_filtering_logic(self, command):
-        """Test that the queryset filtering logic works correctly."""
-        with patch.object(command.__class__.__bases__[0], "get_base_queryset") as mock_super:
-            mock_queryset = Mock()
-            mock_super.return_value = mock_queryset
-            mock_queryset.filter.return_value = "filtered_queryset"
-
-            result = command.get_base_queryset()
-
-            mock_queryset.filter.assert_called_once_with(
-                is_owasp_site_repository=True,
-                is_archived=False,
-                is_empty=False,
-            )
-            assert result == "filtered_queryset"
+    # Removed: duplicate of test_get_base_queryset
backend/tests/apps/slack/common/handlers/ai_test.py (2)

75-83: Avoid hard-coding model IDs; read from settings.

Make chat/embedding model names configurable; update test to assert against settings-derived values.

-        mock_rag_tool.assert_called_once_with(
-            chat_model="gpt-4o",
-            embedding_model="text-embedding-3-small",
-        )
+        from django.conf import settings
+        mock_rag_tool.assert_called_once_with(
+            chat_model=getattr(settings, "SLACK_AI_CHAT_MODEL", "gpt-4o"),
+            embedding_model=getattr(settings, "SLACK_AI_EMBEDDING_MODEL", "text-embedding-3-small"),
+        )

Outside this file (handler):

from django.conf import settings
rag_tool = RagTool(
    chat_model=getattr(settings, "SLACK_AI_CHAT_MODEL", "gpt-4o"),
    embedding_model=getattr(settings, "SLACK_AI_EMBEDDING_MODEL", "text-embedding-3-small"),
)

132-143: Add a guard for blank-only queries.

Avoid calling the RAG pipeline when the stripped query is empty; return error blocks directly and add a test.

+    def test_get_blocks_with_blank_query(self):
+        with patch("apps.slack.common.handlers.ai.get_error_blocks") as mock_err:
+            mock_err.return_value = [{"type": "section"}]
+            with patch("apps.slack.common.handlers.ai.process_ai_query") as mock_proc:
+                result = get_blocks("   ")
+                mock_proc.assert_not_called()
+                assert result == [{"type": "section"}]
backend/tests/apps/ai/management/commands/ai_update_repository_context_test.py (2)

67-83: Replace brittle base-class patching with import-path patching.

Using bases[0] is fragile. Patch the known symbol instead.

-        with patch.object(command.__class__.__bases__[0], "get_base_queryset") as mock_super:
+        with patch("apps.ai.common.base.context_command.BaseContextCommand.get_base_queryset") as mock_super:
             mock_queryset = Mock()
             mock_super.return_value = mock_queryset
             mock_queryset.filter.return_value = "filtered_queryset"

Repeat the same replacement in the other tests that patch the base queryset.

Also applies to: 104-119, 127-141, 142-156, 157-171


127-171: DRY up the three queryset-filter tests.

They assert the same filter arguments. Parametrize into one test to reduce noise and speed up the suite.

If helpful, I can provide a parametrized version.
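A self-contained sketch of the parametrized shape (FakeCommand is a hypothetical stand-in so the example runs without Django; the real test would patch BaseContextCommand as in the previous comment):

```python
import pytest
from unittest.mock import Mock

SITE_REPO_FILTERS = {
    "is_owasp_site_repository": True,
    "is_archived": False,
    "is_empty": False,
}


class FakeCommand:
    """Hypothetical stand-in for the repository context command."""

    def __init__(self, base_queryset):
        self._base = base_queryset

    def get_base_queryset(self):
        return self._base.filter(**SITE_REPO_FILTERS)


@pytest.mark.parametrize(("flag", "expected"), sorted(SITE_REPO_FILTERS.items()))
def test_base_queryset_filter_flags(flag, expected):
    base = Mock()
    FakeCommand(base).get_base_queryset()
    assert base.filter.call_args.kwargs[flag] is expected
```

One parametrized test replaces the three copies while still reporting each flag as a separate case.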

backend/tests/apps/slack/services/message_auto_reply_test.py (3)

48-96: Also assert the retrieval call contract.

Validate we call Message.objects.get with pk for added confidence.

         generate_ai_reply_if_unanswered(mock_message.id)
 
+        mock_message_get.assert_called_once_with(pk=mock_message.id)
         mock_client.conversations_replies.assert_called_once_with(
             channel=mock_message.conversation.slack_channel_id,
             ts=mock_message.slack_message_id,
             limit=1,
         )

97-106: Add assertion for not-found path.

Confirms correct lookup argument.

         generate_ai_reply_if_unanswered(99999)
 
+        mock_message_get.assert_called_once_with(pk=99999)

172-211: Merge duplicate “no/empty AI response” tests via parametrization.

Reduces repetition while preserving coverage.

-    @patch("apps.slack.services.message_auto_reply.Message.objects.get")
-    @patch("apps.slack.services.message_auto_reply.WebClient")
-    @patch("apps.slack.services.message_auto_reply.process_ai_query")
-    def test_generate_ai_reply_no_ai_response(
-        self,
-        mock_process_ai_query,
-        mock_webclient,
-        mock_message_get,
-        mock_message,
-    ):
-        """Test when AI doesn't return a response."""
-        mock_message_get.return_value = mock_message
-        mock_client = Mock()
-        mock_webclient.return_value = mock_client
-        mock_client.conversations_replies.return_value = {"messages": [{"reply_count": 0}]}
-        mock_process_ai_query.return_value = None
-
-        generate_ai_reply_if_unanswered(mock_message.id)
-
-        mock_client.chat_postMessage.assert_not_called()
-
-    @patch("apps.slack.services.message_auto_reply.Message.objects.get")
-    @patch("apps.slack.services.message_auto_reply.WebClient")
-    @patch("apps.slack.services.message_auto_reply.process_ai_query")
-    def test_generate_ai_reply_empty_response(
-        self,
-        mock_process_ai_query,
-        mock_webclient,
-        mock_message_get,
-        mock_message,
-    ):
-        """Test when AI returns empty response."""
-        mock_message_get.return_value = mock_message
-        mock_client = Mock()
-        mock_webclient.return_value = mock_client
-        mock_client.conversations_replies.return_value = {"messages": [{"reply_count": 0}]}
-        mock_process_ai_query.return_value = ""
-
-        generate_ai_reply_if_unanswered(mock_message.id)
-
-        mock_client.chat_postMessage.assert_not_called()
+    @pytest.mark.parametrize("ai_response", [None, ""])
+    @patch("apps.slack.services.message_auto_reply.Message.objects.get")
+    @patch("apps.slack.services.message_auto_reply.WebClient")
+    @patch("apps.slack.services.message_auto_reply.process_ai_query")
+    def test_generate_ai_reply_no_or_empty_ai_response(
+        self, mock_process_ai_query, mock_webclient, mock_message_get, mock_message, ai_response
+    ):
+        """Test when AI returns no/empty response."""
+        mock_message_get.return_value = mock_message
+        mock_client = Mock()
+        mock_webclient.return_value = mock_client
+        mock_client.conversations_replies.return_value = {"messages": [{"reply_count": 0}]}
+        mock_process_ai_query.return_value = ai_response
+
+        generate_ai_reply_if_unanswered(mock_message.id)
+
+        mock_client.chat_postMessage.assert_not_called()
backend/tests/apps/slack/common/question_detector_test.py (3)

18-23: Avoid brittle assertions on pattern counts

Asserting exact counts for question_patterns/compiled_patterns couples tests to implementation details.

-        assert len(detector.question_patterns) == 4
-        assert len(detector.compiled_patterns) == 4
+        assert len(detector.question_patterns) >= 4
+        assert len(detector.compiled_patterns) >= 4

49-64: Confirm intent: treating “advice/recommend” statements as questions

Tests expect declaratives like “I recommend this approach” to be flagged as questions. If that’s not intended, constrain these patterns (e.g., require a question mark or leading question terms).


144-156: Reduce reliance on pattern order

Index-based checks (compiled_patterns[0/1]) are fragile. Prefer matching by pattern content or behavior.

-        question_mark_pattern = detector.compiled_patterns[0]
-        assert question_mark_pattern.search("Is this right?") is not None
+        assert any(p.pattern == r"\?" and p.search("Is this right?") for p in detector.compiled_patterns)

-        question_word_pattern = detector.compiled_patterns[1]
-        assert question_word_pattern.search("What is this") is not None
-        assert question_word_pattern.search("How does it work") is not None
+        expected = r"^(what|how|why|when|where|which|who|can|could|would|should|is|are|does|do|did)"
+        assert any(p.pattern == expected and p.search("What is this") for p in detector.compiled_patterns)
+        assert any(p.pattern == expected and p.search("How does it work") for p in detector.compiled_patterns)
backend/apps/slack/common/question_detector.py (3)

16-19: Normalize keywords to lowercase once

Ensure case-insensitive matching regardless of how OWASP_KEYWORDS is defined.

-        self.owasp_keywords = OWASP_KEYWORDS
+        self.owasp_keywords = {k.lower() for k in OWASP_KEYWORDS}

20-25: Tighten regex to reduce false positives

Add word-boundaries and allow optional leading whitespace for question starters; keeps behavior but trims accidental matches like “whatsoever”.

-            r"^(what|how|why|when|where|which|who|can|could|would|should|is|are|does|do|did)",
-            r"(help|explain|tell me|show me|guide|tutorial|example)",
-            r"(recommend|suggest|advice|opinion)",
+            r"^\s*\b(what|how|why|when|where|which|who|can|could|would|should|is|are|does|do|did)\b",
+            r"\b(help|explain|tell me|show me|guide|tutorial|example)\b",
+            r"\b(recommend|suggest|advice|opinion)\b",

48-57: Boundary-aware phrase matching for multi-word keywords

Substring matching can overmatch. Consider boundary-aware search that ignores punctuation/hyphens.

# Example enhancement:
def contains_owasp_keywords(self, text: str) -> bool:
    words = re.findall(r"\b\w+\b", text)
    text_words = set(words)
    if self.owasp_keywords.intersection(text_words):
        return True

    # Multi-word phrases: collapse whitespace/punctuation and match by tokens
    normalized = re.sub(r"[\s\-_]+", " ", text)
    for keyword in self.owasp_keywords:
        if " " in keyword and re.search(rf"\b{re.escape(keyword)}\b", normalized):
            return True
    return False
backend/tests/apps/slack/commands/ai_test.py (1)

13-17: Remove unused autouse fixture or reuse it

You recreate Ai() in tests; either reuse self.ai_command or drop the autouse fixture.

-    @pytest.fixture(autouse=True)
-    def setup_method(self):
-        """Set up test data before each test method."""
-        self.ai_command = Ai()
+    @pytest.fixture
+    def ai_command(self):
+        """Ai instance per test."""
+        return Ai()

Then use ai_command in tests.

backend/tests/apps/ai/common/extractors/repository_test.py (1)

397-405: Use call.args for robust URL assertions

Stringifying call is brittle. Inspect the first positional arg.

-        assert any(
-            "https://raw.githubusercontent.com/test-org/test-repo/develop/" in str(call)
-            for call in mock_get_content.call_args_list
-        )
+        assert any(
+            "https://raw.githubusercontent.com/test-org/test-repo/develop/" in c.args[0]
+            for c in mock_get_content.call_args_list
+        )
backend/tests/apps/slack/events/message_posted_test.py (2)

95-123: Add a case for mid-message mentions (optional)

If mentions anywhere should trigger AI reply, add a test where the mention is not at the start (e.g., “Hi <@u987654> what is OWASP?”) and confirm the stripped query is correct. If only leading mentions should trigger, assert non-trigger instead.


95-123: Consider guarding against None blocks from get_blocks

If get_blocks can return None, chat_postMessage may fail. Either assert in tests that we skip posting when None, or handle it in the event handler.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 7cd6d44 and a1f1b70.

⛔ Files ignored due to path filters (1)
  • backend/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (36)
  • backend/apps/ai/Makefile (1 hunks)
  • backend/apps/ai/common/extractors/repository.py (1 hunks)
  • backend/apps/ai/management/commands/ai_update_repository_chunks.py (1 hunks)
  • backend/apps/ai/management/commands/ai_update_repository_context.py (1 hunks)
  • backend/apps/ai/models/chunk.py (3 hunks)
  • backend/apps/slack/MANIFEST.yaml (3 hunks)
  • backend/apps/slack/admin/conversation.py (1 hunks)
  • backend/apps/slack/apps.py (1 hunks)
  • backend/apps/slack/commands/__init__.py (1 hunks)
  • backend/apps/slack/commands/ai.py (1 hunks)
  • backend/apps/slack/common/handlers/ai.py (1 hunks)
  • backend/apps/slack/common/question_detector.py (1 hunks)
  • backend/apps/slack/constants.py (1 hunks)
  • backend/apps/slack/events/__init__.py (1 hunks)
  • backend/apps/slack/events/message_posted.py (1 hunks)
  • backend/apps/slack/migrations/0019_conversation_is_nest_bot_assistant_enabled.py (1 hunks)
  • backend/apps/slack/models/conversation.py (1 hunks)
  • backend/apps/slack/services/__init__.py (1 hunks)
  • backend/apps/slack/services/message_auto_reply.py (1 hunks)
  • backend/apps/slack/templates/commands/ai.jinja (1 hunks)
  • backend/pyproject.toml (1 hunks)
  • backend/settings/base.py (2 hunks)
  • backend/settings/urls.py (1 hunks)
  • backend/test_commands.py (1 hunks)
  • backend/tests/apps/ai/common/extractors/repository_test.py (1 hunks)
  • backend/tests/apps/ai/common/utils_test.py (1 hunks)
  • backend/tests/apps/ai/management/commands/ai_update_repository_chunks_test.py (1 hunks)
  • backend/tests/apps/ai/management/commands/ai_update_repository_context_test.py (1 hunks)
  • backend/tests/apps/ai/models/chunk_test.py (4 hunks)
  • backend/tests/apps/slack/commands/ai_test.py (1 hunks)
  • backend/tests/apps/slack/common/handlers/ai_test.py (1 hunks)
  • backend/tests/apps/slack/common/question_detector_test.py (1 hunks)
  • backend/tests/apps/slack/events/message_posted_test.py (1 hunks)
  • backend/tests/apps/slack/services/message_auto_reply_test.py (1 hunks)
  • cspell/custom-dict.txt (2 hunks)
  • docker-compose/local.yaml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (20)
backend/apps/slack/apps.py (2)
backend/apps/mentorship/apps.py (1)
  • ready (12-14)
backend/apps/slack/events/__init__.py (1)
  • configure_slack_events (1-19)
backend/apps/slack/common/handlers/ai.py (2)
backend/apps/ai/agent/tools/rag/rag_tool.py (2)
  • RagTool (13-63)
  • query (34-63)
backend/apps/slack/blocks.py (1)
  • markdown (21-34)
backend/tests/apps/slack/common/handlers/ai_test.py (2)
backend/apps/slack/common/handlers/ai.py (2)
  • get_error_blocks (49-61)
  • process_ai_query (31-46)
backend/apps/ai/agent/tools/rag/rag_tool.py (1)
  • query (34-63)
backend/apps/ai/management/commands/ai_update_repository_context.py (4)
backend/apps/ai/common/base/context_command.py (1)
  • BaseContextCommand (9-54)
backend/apps/ai/common/extractors/repository.py (1)
  • extract_repository_markdown_content (114-174)
backend/apps/github/models/repository.py (1)
  • Repository (25-349)
backend/apps/ai/management/commands/ai_update_repository_chunks.py (5)
  • Command (10-36)
  • extract_content (14-16)
  • get_base_queryset (18-28)
  • get_default_queryset (30-32)
  • source_name (34-36)
backend/tests/apps/ai/management/commands/ai_update_repository_context_test.py (3)
backend/apps/ai/management/commands/ai_update_repository_context.py (5)
  • Command (10-36)
  • source_name (34-36)
  • extract_content (14-16)
  • get_base_queryset (18-28)
  • get_default_queryset (30-32)
backend/apps/ai/common/base/context_command.py (1)
  • BaseContextCommand (9-54)
backend/apps/github/models/repository.py (1)
  • Repository (25-349)
backend/apps/slack/commands/ai.py (2)
backend/apps/slack/commands/command.py (1)
  • CommandBase (20-157)
backend/apps/slack/common/handlers/ai.py (1)
  • get_blocks (13-28)
backend/apps/slack/events/message_posted.py (7)
backend/apps/slack/common/handlers/ai.py (1)
  • get_blocks (13-28)
backend/apps/slack/common/question_detector.py (2)
  • QuestionDetector (13-57)
  • is_owasp_question (31-42)
backend/apps/slack/events/event.py (1)
  • EventBase (23-265)
backend/apps/slack/models/conversation.py (2)
  • Conversation (15-107)
  • update_data (85-107)
backend/apps/slack/models/member.py (1)
  • Member (10-78)
backend/apps/slack/models/message.py (2)
  • Message (15-160)
  • text (83-85)
backend/apps/slack/services/message_auto_reply.py (1)
  • generate_ai_reply_if_unanswered (16-48)
backend/apps/slack/common/question_detector.py (1)
backend/apps/slack/models/message.py (1)
  • text (83-85)
backend/tests/apps/slack/services/message_auto_reply_test.py (4)
backend/apps/slack/models/conversation.py (1)
  • Conversation (15-107)
backend/apps/slack/models/message.py (3)
  • Message (15-160)
  • text (83-85)
  • ts (88-90)
backend/apps/slack/models/workspace.py (2)
  • Workspace (11-44)
  • bot_token (37-44)
backend/apps/slack/services/message_auto_reply.py (1)
  • generate_ai_reply_if_unanswered (16-48)
backend/tests/apps/ai/management/commands/ai_update_repository_chunks_test.py (3)
backend/apps/ai/management/commands/ai_update_repository_chunks.py (5)
  • Command (10-36)
  • source_name (34-36)
  • extract_content (14-16)
  • get_base_queryset (18-28)
  • get_default_queryset (30-32)
backend/apps/ai/common/base/chunk_command.py (1)
  • BaseChunkCommand (12-91)
backend/apps/github/models/repository.py (1)
  • Repository (25-349)
backend/apps/slack/services/message_auto_reply.py (4)
backend/apps/slack/common/handlers/ai.py (1)
  • process_ai_query (31-46)
backend/apps/slack/models/message.py (3)
  • Message (15-160)
  • ts (88-90)
  • text (83-85)
backend/apps/slack/models/workspace.py (1)
  • bot_token (37-44)
backend/apps/ai/agent/tools/rag/rag_tool.py (1)
  • query (34-63)
backend/apps/ai/common/extractors/repository.py (1)
backend/apps/github/utils.py (1)
  • get_repository_file_content (60-79)
backend/apps/slack/events/__init__.py (2)
backend/apps/slack/apps.py (1)
  • SlackConfig (13-33)
backend/apps/slack/events/event.py (2)
  • EventBase (23-265)
  • configure_events (30-37)
backend/tests/apps/ai/models/chunk_test.py (3)
backend/tests/apps/ai/common/base/chunk_command_test.py (2)
  • mock_chunks (72-81)
  • mock_context (51-60)
backend/apps/ai/models/chunk.py (2)
  • Chunk (16-93)
  • bulk_save (34-53)
backend/tests/apps/ai/common/base/context_command_test.py (1)
  • mock_context (48-55)
backend/apps/ai/management/commands/ai_update_repository_chunks.py (4)
backend/apps/ai/common/base/chunk_command.py (1)
  • BaseChunkCommand (12-91)
backend/apps/ai/common/extractors/repository.py (1)
  • extract_repository_markdown_content (114-174)
backend/apps/github/models/repository.py (1)
  • Repository (25-349)
backend/apps/ai/management/commands/ai_update_repository_context.py (5)
  • Command (10-36)
  • extract_content (14-16)
  • get_base_queryset (18-28)
  • get_default_queryset (30-32)
  • source_name (34-36)
backend/tests/apps/slack/common/question_detector_test.py (2)
backend/apps/slack/common/question_detector.py (4)
  • QuestionDetector (13-57)
  • is_question (44-46)
  • contains_owasp_keywords (48-57)
  • is_owasp_question (31-42)
backend/tests/apps/slack/events/message_posted_test.py (1)
  • test_init (44-48)
backend/tests/apps/slack/events/message_posted_test.py (3)
backend/apps/slack/events/message_posted.py (1)
  • MessagePosted (17-77)
backend/apps/slack/models/conversation.py (1)
  • Conversation (15-107)
backend/apps/slack/models/member.py (1)
  • Member (10-78)
backend/apps/ai/models/chunk.py (3)
backend/apps/ai/models/context.py (1)
  • Context (15-61)
backend/apps/common/models.py (2)
  • BulkSaveModel (10-34)
  • TimestampedModel (37-46)
backend/apps/common/utils.py (1)
  • truncate (164-176)
backend/tests/apps/slack/commands/ai_test.py (2)
backend/apps/slack/commands/ai.py (1)
  • Ai (6-23)
backend/apps/slack/commands/command.py (1)
  • CommandBase (20-157)
backend/tests/apps/ai/common/extractors/repository_test.py (1)
backend/apps/ai/common/extractors/repository.py (2)
  • extract_repository_content (11-111)
  • extract_repository_markdown_content (114-174)
🔇 Additional comments (26)
backend/pyproject.toml (1)

22-22: Manually verify django-rq compatibility with Python 3.13 and Django 5.1

Poetry wasn’t available to run the compatibility checks; please confirm at runtime that django-rq 3.1 (and its rq/redis transitive dependencies) works under Python 3.13 and Django 5.1, and pin rq/redis in pyproject.toml if needed.

cspell/custom-dict.txt (1)

109-109: Dictionary additions OK

“rqworker” and “webgoat” entries make sense for this repo’s domain.

Also applies to: 125-125

backend/apps/slack/services/__init__.py (1)

1-1: Package initializer LGTM

Clear, minimal package docstring. No issues.

backend/settings/urls.py (1)

31-31: Django-RQ dashboard is staff-only by default

Unconditional path("django-rq/", include("django_rq.urls")) is safe since django-rq v3.1+ enforces staff-only access on its dashboard views.

backend/settings/base.py (1)

52-53: Django RQ app registration looks correct.

Adding "django_rq" to THIRD_PARTY_APPS is appropriate and consistent with how INSTALLED_APPS is composed.

backend/tests/apps/ai/common/utils_test.py (1)

97-107: Decorator order and side-effect wiring are correct.

Patching order matches argument order; the AttributeError simulates the None-context path as intended.

backend/apps/ai/models/chunk.py (2)

34-53: No downstream code depends on chunks being cleared. Chunk.bulk_save is only called in chunk_command.py (where the list isn’t used after the call) and in tests (which don’t assert on the list), so the updated clear-only-when-empty logic doesn’t break any callers.


19-23: Replace deprecated unique_together with a UniqueConstraint and plan a safe migration

 class Meta:
     db_table = "ai_chunks"
     verbose_name = "Chunk"
-    unique_together = ("context", "text")
+    constraints = [
+        models.UniqueConstraint(
+            fields=("context", "text"),
+            name="uq_ai_chunks_context_text",
+        )
+    ]

Add a data migration to dedupe existing rows before enforcing the constraint and use CREATE INDEX CONCURRENTLY on Postgres to avoid downtime.
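Before the constraint lands, the dedupe migration has to pick survivors. A minimal sketch of that selection logic, keeping the lowest id per (context, text) key (the function name and row shape are illustrative; an actual migration would delete the returned ids via the ORM):

```python
def duplicate_ids(rows):
    """Given (id, context_id, text) tuples, return the ids of duplicate rows,
    keeping the lowest id for each (context_id, text) key."""
    survivors = {}
    doomed = []
    for row_id, context_id, text in sorted(rows):
        key = (context_id, text)
        if key in survivors:
            doomed.append(row_id)  # later id for an already-seen key
        else:
            survivors[key] = row_id
    return doomed

rows = [(3, 1, "a"), (1, 1, "a"), (2, 2, "a"), (4, 1, "a")]
print(duplicate_ids(rows))  # [3, 4]
```

In a data migration this would feed something like `Chunk.objects.filter(id__in=doomed).delete()` before the constraint is added.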

backend/tests/apps/ai/models/chunk_test.py (1)

101-104: LGTM on direct context-based existence checks.

The assertions correctly verify the switch to context=mock_context, text=text.

Also applies to: 125-128, 148-150

backend/apps/ai/common/extractors/repository.py (1)

108-111: LGTM on structured repository metadata extraction.

The prose/metadata split and DELIMITER usage align with the RAG pipeline.

backend/apps/slack/commands/__init__.py (1)

5-5: LGTM: /ai command is now exported/loaded

Importing ai ensures command discovery via CommandBase.configure_commands().

backend/apps/slack/models/conversation.py (1)

30-32: LGTM: opt-in feature flag on Conversation

Good that from_slack() does not overwrite this local flag.

backend/apps/ai/management/commands/ai_update_repository_context.py (1)

30-32: Upstream get_base_queryset has no is_active filter

BaseAICommand.get_base_queryset returns .objects.all() with no implicit filters, so overriding get_default_queryset correctly avoids filtering on the nonexistent is_active field.

backend/apps/slack/MANIFEST.yaml (1)

98-103: /ai slash command looks correct.

URL, usage, and escaping align with the existing pattern. Ensure the handler responds within 3s to Slack’s slash command ACK.
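Slack requires slash commands to be acknowledged within 3 seconds, so the slow RAG query should run only after an immediate ACK. A framework-agnostic sketch of that pattern (handle_slash_command and run_ai_query are hypothetical names, not the project's actual handler):

```python
import threading

def run_ai_query(query: str) -> str:
    # Placeholder for the slow RAG call (e.g. RagTool.query); body is illustrative.
    return f"answer for: {query}"

def handle_slash_command(ack, respond, query: str) -> threading.Thread:
    """Acknowledge within Slack's 3-second window, then answer in the background."""
    ack()  # return an immediate 200 to Slack before doing slow work
    worker = threading.Thread(target=lambda: respond(run_ai_query(query)), daemon=True)
    worker.start()
    return worker
```

With slack_bolt the shape is the same: call ack() first, then deliver the real answer via a deferred respond/post.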

backend/apps/slack/migrations/0019_conversation_is_nest_bot_assistant_enabled.py (1)

1-1: Django version compatibility confirmed

The project’s pyproject.toml declares django = "^5.1" (i.e. ≥5.1,<6.0), which includes Django 5.2.5, so the migration header is consistent with the dependency range.

backend/apps/ai/management/commands/ai_update_repository_chunks.py (2)

14-17: Good: delegates extraction cleanly.


34-36: Consistent source_name.

Matches context command; good for downstream filters.

backend/tests/apps/slack/services/message_auto_reply_test.py (1)

17-43: Fixtures are clean and focused. LGTM.

backend/tests/apps/slack/common/question_detector_test.py (2)

170-186: Good real-world coverage

Nice parameterized cases that exercise diverse OWASP queries.


188-196: Initialization patching looks solid

Mocking OWASP_KEYWORDS verifies configurability without leaking global state.

backend/tests/apps/slack/commands/ai_test.py (2)

18-42: Patch target and whitespace trimming behavior look correct

Verifies the local import usage and exact query passed to get_blocks.


127-140: Exception propagation test is appropriate

Confirms render_blocks does not swallow errors, leaving caller to handle.

backend/tests/apps/ai/common/extractors/repository_test.py (1)

241-256: Baseline behavior verified

Description is included; metadata intentionally empty for markdown extraction. Looks good.

backend/tests/apps/slack/events/message_posted_test.py (3)

95-123: Bot-mention flow is well covered

Auth caching, mention stripping, and block rendering checks are solid.


210-292: OWASP question happy-path coverage looks good

Validates DB updates and delayed enqueue to the default queue.


348-375: Whitespace-trim behavior verified

Ensures query cleanup before rendering blocks. Nice.

Comment on lines 340 to 354
def test_extract_repository_markdown_content_file_fetch_exception(self, mock_get_content):
"""Test extraction when file fetching raises an exception."""
repository = MagicMock()
repository.description = "Test repository"
repository.organization = MagicMock(login="test-org")
repository.key = "test-repo"
repository.default_branch = "main"

mock_get_content.side_effect = Exception("Network error")

prose, metadata = extract_repository_markdown_content(repository)

assert "Repository Description: Test repository" in prose
assert metadata == ""

Copy link
Contributor


⚠️ Potential issue

Mismatch: extractor doesn’t catch generic exceptions

extract_repository_markdown_content currently catches a narrow set; these tests raise Exception, which will bubble and fail. Broaden exception handling in the extractor and keep going.

Proposed change in apps/ai/common/extractors/repository.py:

# inside the for file_path in markdown_files loop
raw_url = f"https://raw.githubusercontent.com/{owner}/{repository.key}/{repository.default_branch or 'main'}/{file_path}"
try:
    content = get_repository_file_content(raw_url)
except Exception as exc:  # broad by design; network/parsing can vary
    logger.debug(
        "Failed to fetch %s for %s/%s from branch %s: %s",
        file_path,
        owner or "(unknown owner)",
        repository.key or "(no key)",
        repository.default_branch or "main",
        exc,
    )
    continue
🤖 Prompt for AI Agents
In backend/tests/apps/ai/common/extractors/repository_test.py around lines 340
to 354, the test triggers a generic Exception when fetching file content but the
extractor only catches narrow exceptions; update
apps/ai/common/extractors/repository.py inside the markdown_files loop to wrap
the get_repository_file_content call in a try/except that catches Exception, log
a debug message including file_path, owner (or "(unknown owner)"),
repository.key (or "(no key)"), branch (or "main") and the exception, then
continue to the next file so extraction proceeds without raising.

@Dishant1804 Dishant1804 force-pushed the www-repository branch 2 times, most recently from 6439df7 to b198ecf on September 6, 2025 17:55
Copy link

sonarqubecloud bot commented Sep 6, 2025

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/apps/ai/models/chunk.py (1)

85-93: update_data has a race on duplicates; rely on the DB constraint

The pre-check can race; catch IntegrityError instead and return None.

-        if Chunk.objects.filter(context=context, text=text).exists():
-            return None
-
-        chunk = Chunk(text=text, embedding=embedding, context=context)
-
-        if save:
-            chunk.save()
-
-        return chunk
+        chunk = Chunk(text=text, embedding=embedding, context=context)
+        if save:
+            try:
+                chunk.save()
+            except Exception as exc:  # narrowed below via local import
+                from django.db import IntegrityError
+                if isinstance(exc, IntegrityError):
+                    # Duplicate per unique (context, text); treat as no-op
+                    logger.debug("Chunk already exists for context_id=%s; skipping", getattr(context, "id", None))
+                    return None
+                raise
+        return chunk

If you prefer top-level imports, add:

-from django.db import DatabaseError, models
+from django.db import DatabaseError, IntegrityError, models
♻️ Duplicate comments (2)
backend/apps/ai/models/chunk.py (1)

35-53: bulk_save drops updates, does N+1 queries, and breaks the clear-list contract

  • Updates are filtered out (ids present) if an identical (context, text) exists.
  • Per-object .exists() causes O(N) queries.
  • Passing a new list to BulkSaveModel means the caller’s list isn’t cleared unless no uniques remain.
  • Duplicate races aren’t handled; rely on the DB unique constraint and catch IntegrityError.

Replace the body with a single-probe, race-safe implementation that preserves the contract.

     @staticmethod
     def bulk_save(chunks, fields=None):
-        """Bulk save chunks with duplicate handling."""
-        unique_chunks = []
-        unique_chunks = [
-            chunk
-            for chunk in chunks
-            if not Chunk.objects.filter(context=chunk.context, text=chunk.text).exists()
-        ]
-
-        if unique_chunks:
-            try:
-                BulkSaveModel.bulk_save(Chunk, unique_chunks, fields=fields)
-            except (ValueError, TypeError, DatabaseError):
-                for chunk in unique_chunks:
-                    try:
-                        chunk.save()
-                    except (ValueError, TypeError, DatabaseError) as save_error:
-                        logger.exception("Failed to save chunk", exc_info=save_error)
-        else:
-            chunks.clear()
+        """Bulk save chunks with duplicate handling and minimal queries."""
+        if not chunks:
+            return
+
+        # Partition: updates vs. creates
+        to_update = [c for c in chunks if getattr(c, "id", None)]
+        to_create = [c for c in chunks if not getattr(c, "id", None)]
+
+        # In-memory dedupe among creates by (context_id, text)
+        seen: set[tuple[int | None, str]] = set()
+        deduped_create = []
+        for c in to_create:
+            ctx_id = getattr(c, "context_id", None) or getattr(getattr(c, "context", None), "id", None)
+            key = (ctx_id, c.text)
+            if key in seen:
+                continue
+            seen.add(key)
+            deduped_create.append(c)
+
+        # One DB probe for existing keys among incoming creates
+        ctx_ids = {getattr(c, "context_id", None) or getattr(getattr(c, "context", None), "id", None) for c in deduped_create}
+        texts = {c.text for c in deduped_create}
+        existing_keys: set[tuple[int, str]] = set()
+        if ctx_ids and texts:
+            existing_keys = set(
+                Chunk.objects.filter(context_id__in=ctx_ids, text__in=texts)
+                .values_list("context_id", "text")
+            )
+
+        final_create = []
+        for c in deduped_create:
+            ctx_id = getattr(c, "context_id", None) or getattr(getattr(c, "context", None), "id", None)
+            if (ctx_id, c.text) not in existing_keys:
+                final_create.append(c)
+
+        unique_chunks = to_update + final_create
+        if not unique_chunks:
+            chunks.clear()
+            return
+
+        try:
+            # Maintain contract: operate on caller list so BulkSaveModel clears it
+            chunks[:] = unique_chunks
+            BulkSaveModel.bulk_save(Chunk, chunks, fields=fields)
+        except (ValueError, TypeError, DatabaseError) as exc:
+            # Fallback: per-object saves with race-safe duplicate handling
+            from django.db import IntegrityError  # local import to avoid global if unused
+            for obj in unique_chunks:
+                try:
+                    if getattr(obj, "id", None) and fields:
+                        obj.save(update_fields=fields)
+                    else:
+                        obj.save()
+                except IntegrityError:
+                    # Duplicate create due to race/uniqueness; ignore
+                    continue
+                except (ValueError, TypeError, DatabaseError):
+                    logger.exception("Failed to save chunk id=%s", getattr(obj, "id", None))
backend/tests/apps/ai/common/extractors/repository_test.py (1)

340-354: Add a test for generic exceptions from the fetcher

Past feedback asked to catch generic Exception in the extractor; add a case that raises Exception to ensure we don’t break on unexpected errors.

     @patch("apps.ai.common.extractors.repository.get_repository_file_content")
     def test_extract_repository_markdown_content_file_fetch_exception(self, mock_get_content):
         """Test extraction when file fetching raises an exception."""
         repository = MagicMock()
         repository.description = "Test repository"
         repository.organization = MagicMock(login="test-org")
         repository.key = "test-repo"
         repository.default_branch = "main"

-        mock_get_content.side_effect = ConnectionError("Network error")
+        mock_get_content.side_effect = Exception("Network error")

         prose, metadata = extract_repository_markdown_content(repository)

         assert "Repository Description: Test repository" in prose
         assert metadata == ""
🧹 Nitpick comments (5)
backend/apps/ai/models/chunk.py (2)

50-51: Use logger.exception correctly

logger.exception already includes traceback; passing exc_info object is unnecessary and atypical.

-                        logger.exception("Failed to save chunk", exc_info=save_error)
+                        logger.exception("Failed to save chunk id=%s", getattr(chunk, "id", None))

19-23: Prefer UniqueConstraint over unique_together

Django now favors explicit UniqueConstraint with a name; easier to reference in migrations and logs.

-        unique_together = ("context", "text")
+        constraints = [
+            models.UniqueConstraint(
+                fields=["context", "text"],
+                name="uniq_ai_chunk_context_text",
+            )
+        ]
backend/apps/ai/management/commands/ai_update_repository_chunks.py (1)

18-28: Optimize queryset for extractor access

Select related FKs to avoid extra queries when building raw URLs (owner/org), and narrow fields if desired.

         return (
             super()
             .get_base_queryset()
             .filter(
                 is_owasp_site_repository=True,
                 is_archived=False,
                 is_empty=False,
             )
-        )
+        ).select_related("organization", "owner")
backend/tests/apps/ai/common/extractors/repository_test.py (2)

407-420: Assert no network calls when owner/org missing

When both are absent, extractor should not attempt fetches. Patch and assert not called.

-    def test_extract_repository_markdown_content_no_owner_or_org(self):
+    @patch("apps.ai.common.extractors.repository.get_repository_file_content")
+    def test_extract_repository_markdown_content_no_owner_or_org(self, mock_get_content):
         """Test extraction when repository has neither organization nor owner."""
         repository = MagicMock()
         repository.description = "Test repository"
         repository.organization = None
         repository.owner = None
         repository.key = "test-repo"
         repository.default_branch = "main"

         prose, metadata = extract_repository_markdown_content(repository)

         assert "Repository Description: Test repository" in prose
         assert metadata == ""
+        mock_get_content.assert_not_called()

421-433: Also assert no network calls when key is missing

Key is required to form the raw URL; ensure no fetch attempted.

-    def test_extract_repository_markdown_content_no_key(self):
+    @patch("apps.ai.common.extractors.repository.get_repository_file_content")
+    def test_extract_repository_markdown_content_no_key(self, mock_get_content):
         """Test extraction when repository has no key."""
         repository = MagicMock()
         repository.description = "Test repository"
         repository.organization = MagicMock(login="test-org")
         repository.key = None
         repository.default_branch = "main"

         prose, metadata = extract_repository_markdown_content(repository)

         assert "Repository Description: Test repository" in prose
         assert metadata == ""
+        mock_get_content.assert_not_called()
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a1f1b70 and 103fb24.

📒 Files selected for processing (12)
  • backend/apps/ai/Makefile (1 hunks)
  • backend/apps/ai/common/extractors/repository.py (1 hunks)
  • backend/apps/ai/management/commands/ai_update_repository_chunks.py (1 hunks)
  • backend/apps/ai/management/commands/ai_update_repository_context.py (1 hunks)
  • backend/apps/ai/models/chunk.py (3 hunks)
  • backend/test_commands.py (1 hunks)
  • backend/tests/apps/ai/common/extractors/repository_test.py (1 hunks)
  • backend/tests/apps/ai/common/utils_test.py (1 hunks)
  • backend/tests/apps/ai/management/commands/ai_update_repository_chunks_test.py (1 hunks)
  • backend/tests/apps/ai/management/commands/ai_update_repository_context_test.py (1 hunks)
  • backend/tests/apps/ai/models/chunk_test.py (4 hunks)
  • cspell/custom-dict.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
  • backend/test_commands.py
  • backend/tests/apps/ai/management/commands/ai_update_repository_context_test.py
  • backend/tests/apps/ai/models/chunk_test.py
  • backend/apps/ai/management/commands/ai_update_repository_context.py
  • backend/apps/ai/common/extractors/repository.py
  • cspell/custom-dict.txt
  • backend/apps/ai/Makefile
  • backend/tests/apps/ai/management/commands/ai_update_repository_chunks_test.py
  • backend/tests/apps/ai/common/utils_test.py
🧰 Additional context used
🧬 Code graph analysis (3)
backend/tests/apps/ai/common/extractors/repository_test.py (1)
backend/apps/ai/common/extractors/repository.py (2)
  • extract_repository_content (11-111)
  • extract_repository_markdown_content (114-174)
backend/apps/ai/management/commands/ai_update_repository_chunks.py (4)
backend/apps/ai/common/base/chunk_command.py (1)
  • BaseChunkCommand (12-91)
backend/apps/ai/common/extractors/repository.py (1)
  • extract_repository_markdown_content (114-174)
backend/apps/github/models/repository.py (1)
  • Repository (25-349)
backend/apps/ai/management/commands/ai_update_repository_context.py (5)
  • Command (10-36)
  • extract_content (14-16)
  • get_base_queryset (18-28)
  • get_default_queryset (30-32)
  • source_name (34-36)
backend/apps/ai/models/chunk.py (2)
backend/apps/ai/models/context.py (1)
  • Context (15-61)
backend/apps/common/models.py (2)
  • BulkSaveModel (10-34)
  • TimestampedModel (37-46)
🔇 Additional comments (5)
backend/apps/ai/management/commands/ai_update_repository_chunks.py (4)

10-12: LGTM: command wiring is correct

key_field_name and model_class look right.


14-16: LGTM: delegates to the markdown extractor

Extraction path aligns with the PR goals.


30-36: LGTM: default queryset override avoids is_active

Repository lacks is_active; good override.


18-28: Verified OWASP site repository filter uses www- prefixes

check_owasp_site_repository keys off key.startswith(("www-chapter-", "www-committee-", "www-event", "www-project-")), so the queryset filter aligns with the intended www- heuristic.

backend/tests/apps/ai/common/extractors/repository_test.py (1)

3-3: CI/python requirements support datetime.UTC

Project’s pyproject.toml specifies python = "^3.13" (>= 3.13), which includes the datetime.UTC alias introduced in Python 3.11; no action needed.

@Dishant1804 Dishant1804 marked this pull request as ready for review September 6, 2025 19:27
Copy link
Collaborator

@arkid15r arkid15r left a comment


@Dishant1804 you've submitted 1000+ lines of code for review.
I have a couple of questions:

  • could you explain what OWASP Nest problem this PR solves?
  • how did you verify that problem is solved with this PR?

@Dishant1804
Copy link
Collaborator Author

@Dishant1804 you've submitted 1000+ lines of code for review. I have a couple of questions:

  • could you explain what OWASP Nest problem this PR solves?
  • how did you verify that problem is solved with this PR?

This PR gets the .md files that we have in www- repositories and saves them as chunks and context so that we have more knowledge.
I checked the admin panel for chunks and context, and also ran it on some example queries.

@arkid15r
Copy link
Collaborator

This PR gets the .md files that we have in www- repositories and saves them as chunks and context so that we have more knowledge. I checked the admin panel for chunks and context, and also ran it on some example queries.

Alright, so it sounds like you checked it with at least one repository and some relevant queries. Can you provide some examples?

@Dishant1804
Collaborator Author

This PR fetches the .md files that we have in www- repositories and saves them as chunks and context so that we can have more knowledge. I checked the admin panel for chunks and context, and also ran it on some example queries.

Alright, so it sounds like you checked it with at least one repository and some relevant queries. Can you provide some examples?

This produces a lot of chunks that are not up to the mark; maybe we should change the chunking delimiter or chunking strategy for this case, because the content has a lot of delimiters.

@arkid15r
Collaborator

This produces a lot of chunks that are not up to the mark; maybe we should change the chunking delimiter or chunking strategy for this case, because the content has a lot of delimiters.

I meant your general approach for testing, e.g. did OWASP Nest repository content help to improve NestBot answers/context for any specific questions? Did it get new info based on this?

My opinion -- it didn't. I haven't tested it yet, but based on this code I can reasonably assume that no new [useful] content was added by these jobs and this PR in general. I also reasonably assume that this is AI-generated code that has nothing to do with a real-life scenario. I call it cosplay code.

The bottom line is that this code has 0 value in its current form, but you requested mentors to review it w/o testing. I don't appreciate this kind of disrespect to mentors' time. This is not acceptable.

@Dishant1804
Collaborator Author

This produces a lot of chunks that are not up to the mark; maybe we should change the chunking delimiter or chunking strategy for this case, because the content has a lot of delimiters.

I meant your general approach for testing, e.g. did OWASP Nest repository content help to improve NestBot answers/context for any specific questions? Did it get new info based on this?

My opinion -- it didn't. I haven't tested it yet, but based on this code I can reasonably assume that no new [useful] content was added by these jobs and this PR in general. I also reasonably assume that this is AI-generated code that has nothing to do with a real-life scenario. I call it cosplay code.

The bottom line is that this code has 0 value in its current form, but you requested mentors to review it w/o testing. I don't appreciate this kind of disrespect to mentors' time. This is not acceptable.

I'll update the chunking strategy for this and then request a review. We are getting the information from the README and CONTRIBUTING .md files, so the chunking needs to be proper.
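A header-based split is one markdown-aware option; a minimal sketch (the delimiter choice here is an assumption, not the project's settled strategy):

```python
import re

def split_markdown_by_headers(text: str) -> list[str]:
    """Split Markdown into chunks, starting a new chunk at each ATX header."""
    # Zero-width split on a lookahead, so each chunk keeps its own "# ..." line.
    chunks = re.split(r"(?m)^(?=#{1,6} )", text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]
```

Splitting on section boundaries instead of fixed delimiters keeps each chunk's heading attached to its body, which tends to produce more self-contained contexts.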

@Dishant1804 Dishant1804 marked this pull request as draft September 11, 2025 12:45
Successfully merging this pull request may close these issues.

Fetch .md content from OWASP www- repositories to improve NestBot contexts