Skip to content

Conversation

rudransh-shrivastava
Copy link
Collaborator

@rudransh-shrivastava rudransh-shrivastava commented Sep 5, 2025

Resolves #2088

Proposed change

migrate scraper logic of audience , leaders and urls to Github .md files

Checklist

  • I've read and followed the contributing guidelines.
  • I've run make check-test locally; all checks and tests passed.

Copy link
Contributor

coderabbitai bot commented Sep 5, 2025

Summary by CodeRabbit

  • New Features

    • Automatically extract audience, links (including GitHub), and leader details from project/chapter/committee repositories.
    • Synchronize leader names and emails into member records for improved accuracy.
    • Admin: Entity Members list now displays member name for easier identification.
  • Refactor

    • Consolidated data sourcing to the project/chapter/committee entities for URLs, audience, and leaders.
    • Updated member model to support name/email alongside optional linked account; adjusted uniqueness to prevent duplicates.

Walkthrough

Repository-file parsing replaces scraper-based extraction: add RepositoryBasedEntityModel helpers (info_md_url, get_audience, get_urls, get_leaders_emails, sync_leaders), update scraping commands to call entity.get_* methods, add leader persistence fields/migrations, and extend tests for parsing and syncing.

Changes

Cohort / File(s) Summary
Projects command
backend/apps/owasp/management/commands/owasp_scrape_projects.py
Switch audience/leader/url sources to project.get_audience(), project.get_leaders(), project.get_leaders_emails(), project.sync_leaders(...), and project.get_urls(domain="github.com").
Chapters & Committees commands
backend/apps/owasp/management/commands/owasp_scrape_chapters.py, backend/apps/owasp/management/commands/owasp_scrape_committees.py
Per-entity leaders_raw = ...get_leaders(), leaders_emails = ...get_leaders_emails() and conditional ...sync_leaders(...); source candidate URLs from *.get_urls() instead of scraper.
Repository-based model helpers
backend/apps/owasp/models/common.py
Add info_md_url property and methods: get_audience(), get_urls(domain=None), get_leaders_emails(), and sync_leaders(...) to parse repository .md files and sync EntityMember records.
EntityMember model & migration
backend/apps/owasp/models/entity_member.py
backend/apps/owasp/migrations/0051_entitymember_member_email_entitymember_member_name_and_more.py
Make member FK nullable (blank=True, null=True); add member_email and member_name fields; migration to add/update these fields and adjust uniqueness/indexes.
Admin: EntityMember
backend/apps/owasp/admin/entity_member.py
Remove explicit fields layout; add member_name to list_display as the first column.
Migrations: unique/index adjustments
backend/apps/owasp/migrations/0052_remove_entitymember_owasp_entit_member__6e516f_idx_and_more.py
Remove legacy index, adjust unique_together to include (entity_type, entity_id, member_name, role), and alter member_name field.
Tests — command updates
backend/tests/apps/owasp/management/commands/owasp_scrape_projects_test.py
backend/tests/apps/owasp/management/commands/owasp_scrape_chapters_test.py
backend/tests/apps/owasp/management/commands/owasp_scrape_committees_test.py
Updated mocks to return URLs/audience from Project/Chapter/Committee get_urls() / get_audience() instead of scraper; assertions/flows remain.
Tests — model parsing coverage
backend/tests/apps/owasp/models/common_test.py
Add parameterized tests for get_audience(), get_leaders_emails(), get_urls() (domain filtering/edge cases) and tests for sync_leaders behaviors (no-op, update, mixed create/update).
Misc (spellings)
cspell/custom-dict.txt
Add "GFKs" to custom dictionary.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • kasya

Pre-merge checks (4 passed, 1 warning)

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 77.78% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
Check name Status Explanation
Title Check ✅ Passed The title succinctly and accurately summarizes the primary change — migrating scraper logic to parse GitHub Markdown files — and aligns with the PR objectives and linked issue #2088; it is concise, specific, and free of noisy file lists or emojis. The "Feature/" prefix is a minor branch-convention artifact but does not obscure intent. Overall the title communicates the main change clearly to reviewers scanning history.
Linked Issues Check ✅ Passed The PR implements repository Markdown parsing by adding info_md_url, get_audience, get_urls, get_leaders_emails, and sync_leaders on the repository-based model, updates management commands to use those methods (projects/chapters/committees), extracts leader emails and synchronizes EntityMember records, and adds tests and migrations to persist leader_name/member_email; these changes satisfy the linked issue objectives to move audience, leaders (including emails), and URLs extraction from the scraper to GitHub .md files. Updated tests exercise the new parsing paths and URL/audience behaviors. Database and admin changes support storing leader metadata as required by the feature.
Out of Scope Changes Check ✅ Passed All substantive code changes are focused on migrating scraping responsibilities to GitHub Markdown parsing and the necessary support work (model fields, migrations, admin updates, and tests); no unrelated features or broad refactors were detected beyond this scope. The cspell dictionary update is a harmless documentation tooling change. Reviewers should still validate the schema/index changes and migration impact on existing data as those touch persistence beyond pure parsing logic.
Description Check ✅ Passed The description directly references issue #2088 and states that audience, leaders, and URLs parsing will be migrated to GitHub .md files, matching the changeset intent; it also notes the contributor ran tests and followed CONTRIBUTING, so it is clearly related to the PR. The level of detail required for this lenient check is satisfied. There is no indication the description is off-topic relative to the changes.
✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/apps/owasp/management/commands/owasp_scrape_projects.py (1)

59-66: Guard against None from get_related_url before normalize_url

project.get_related_url(url) can return None; passing that into normalize_url() risks unexpected errors depending on its implementation. Add a guard in the comprehension.

-                {
-                    repository_url
-                    for url in set(project.get_urls(domain="github.com"))
-                    if (repository_url := normalize_url(project.get_related_url(url)))
-                    and repository_url not in {project.github_url, project.owasp_url}
-                }
+                {
+                    repository_url
+                    for url in set(project.get_urls(domain="github.com"))
+                    if (related_url := project.get_related_url(url))
+                    and (repository_url := normalize_url(related_url))
+                    and repository_url not in {project.github_url, project.owasp_url}
+                }
🧹 Nitpick comments (5)
backend/apps/owasp/management/commands/owasp_scrape_projects.py (2)

80-84: Parse org/user owner with urlparse to avoid trailing slash and query issues

verified_url.split('/')[-1] breaks on trailing slashes and query strings. Use urlparse.

-                            gh_organization = gh.get_organization(verified_url.split("/")[-1])
+                            from urllib.parse import urlparse
+                            owner = urlparse(verified_url).path.strip("/").split("/")[0]
+                            if not owner:
+                                raise UnknownObjectException(status=404, data=None, headers=None)
+                            gh_organization = gh.get_organization(owner)

91-94: Log the skipped original URL, not None

When verified_url is falsy, logging it is unhelpful. Log scraped_url.

-                else:
-                    logger.info("Skipped related URL %s", verified_url)
+                else:
+                    logger.info("Skipped related URL %s", scraped_url)
backend/apps/owasp/models/common.py (1)

254-269: Harden URL extraction: stricter regex, dedupe, and resilient domain matching

  • Current regex can capture trailing punctuation.
  • Return stable unique ordering.
  • Accept subdomains like www.github.com via suffix match (case-insensitive).
-        urls = []
-        for line in content.split("\n"):
-            line_urls = re.findall(r"https?:\/\/[^\s\)]+", line.strip())
-            urls.extend(line_urls)
+        urls = []
+        for line in content.split("\n"):
+            line_urls = re.findall(r"https?://[^\s\)\]\}\>,\"']+", line.strip())
+            urls.extend(line_urls)
+        # de-duplicate preserving order
+        urls = list(dict.fromkeys(urls))
 
-        if domain:
-            return [url for url in urls if urlparse(url).netloc == domain]
+        if domain:
+            dom = domain.lstrip(".").lower()
+            return [u for u in urls if urlparse(u).netloc.lower().endswith(dom)]
 
         return urls
backend/tests/apps/owasp/management/commands/owasp_scrape_projects_test.py (2)

36-38: Stub get_leaders for determinism (prevents opaque Mock in leaders_raw)

The command now calls project.get_leaders(). Stubbing avoids persisting a bare Mock into leaders_raw if later assertions are added.

         mock_scraper.page_tree = True
+        mock_project.get_leaders.return_value = []
         mock_project.get_urls.return_value = []
         mock_project.get_audience.return_value = ["builder", "breaker", "defender"]

74-80: Add an org/user URL to cover the org expansion branch

Include https://github.com/org in inputs to exercise the GITHUB_USER_RE path that expands to repos. Expectations remain unchanged due to dedupe.

         mock_project.get_urls.return_value = [
             "https://github.com/org/repo1",
             "https://github.com/org/repo2",
             "https://invalid.com/repo3",
+            "https://github.com/org",
         ]
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6e29fc4 and a446674.

📒 Files selected for processing (3)
  • backend/apps/owasp/management/commands/owasp_scrape_projects.py (1 hunks)
  • backend/apps/owasp/models/common.py (4 hunks)
  • backend/tests/apps/owasp/management/commands/owasp_scrape_projects_test.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
backend/tests/apps/owasp/management/commands/owasp_scrape_projects_test.py (2)
backend/tests/apps/owasp/management/commands/owasp_create_project_metadata_file_test.py (1)
  • mock_project (14-40)
backend/apps/owasp/scraper.py (2)
  • get_urls (89-95)
  • get_audience (55-87)
backend/apps/owasp/models/common.py (3)
backend/apps/owasp/models/enums/project.py (1)
  • AudienceChoices (8-13)
backend/apps/owasp/scraper.py (2)
  • get_audience (55-87)
  • get_urls (89-95)
backend/apps/github/utils.py (1)
  • get_repository_file_content (60-79)
backend/apps/owasp/management/commands/owasp_scrape_projects.py (2)
backend/apps/owasp/models/common.py (5)
  • get_leaders (190-210)
  • owasp_url (140-142)
  • deactivate (144-147)
  • get_audience (174-188)
  • get_urls (254-268)
backend/apps/owasp/scraper.py (3)
  • OwaspScraper (22-127)
  • get_audience (55-87)
  • get_urls (89-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Run frontend e2e tests
  • GitHub Check: Run backend tests
  • GitHub Check: Run frontend unit tests
  • GitHub Check: CodeQL (javascript-typescript)
🔇 Additional comments (2)
backend/apps/owasp/management/commands/owasp_scrape_projects.py (1)

56-56: LGTM: migrated audience to GitHub info.md source

Switching to project.get_audience() aligns with the new parsing strategy.

backend/apps/owasp/models/common.py (1)

105-114: LGTM: info_md_url helper mirrors existing pattern

Consistent with index_md_url/leaders_md_url and safely returns None when repo missing.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
backend/tests/apps/owasp/models/common_test.py (1)

16-43: Audience parsing: add case-insensitive + exact-word tests; avoid substring false-positives (e.g., “Codebreaker”).

Today, parsing is case-sensitive and matches substrings; “Codebreaker” would incorrectly flag “Breaker”. Add tests to pin desired semantics, and update the parser to be case-insensitive with word boundaries.

Apply this diff to extend the tests:

@@
     @pytest.mark.parametrize(
         ("content", "expected_audience"),
         [
             (
                 """### Top Ten Card Game Information
 * [Incubator Project](#)
 * [Type of Project](#)
 * [Version 0.0.0](#)
 * [Builder](#)
 * [Breaker](#)""",
                 ["breaker", "builder"],
             ),
+            ("* builder\n* BREAKER", ["breaker", "builder"]),
+            ("* Codebreaker", []),
             ("This test contains no audience information.", []),
             ("", []),
             (None, []),
         ],
     )

And update the implementation in apps/owasp/models/common.py to use case-insensitive, whole-word matching:

@@
-        for line in content.split("\n"):
-            for lower_kw, original_kw in AudienceChoices.choices:
-                if original_kw in line:
-                    found_keywords.add(lower_kw)
+        for line in content.splitlines():
+            lower_line = line.lower()
+            for lower_kw, original_kw in AudienceChoices.choices:
+                if re.search(rf"\b{re.escape(original_kw.lower())}\b", lower_line):
+                    found_keywords.add(lower_kw)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a446674 and 2b20ccf.

📒 Files selected for processing (2)
  • backend/apps/owasp/models/common.py (4 hunks)
  • backend/tests/apps/owasp/models/common_test.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/apps/owasp/models/common.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/tests/apps/owasp/models/common_test.py (1)
backend/apps/owasp/models/common.py (2)
  • get_audience (174-187)
  • get_urls (253-267)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Run frontend e2e tests
  • GitHub Check: Run backend tests
  • GitHub Check: Run frontend unit tests

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
backend/apps/owasp/models/common.py (1)

174-188: Make audience detection case-insensitive; guard missing URL.
Case-sensitive matching will miss valid audience signals; also short-circuit if info_md_url is None to avoid noisy fetch errors.

 def get_audience(self):
     """Get audience from info.md file on GitHub."""
-    content = get_repository_file_content(self.info_md_url)
+    if not self.info_md_url:
+        return []
+    content = get_repository_file_content(self.info_md_url)
     if not content:
         return []

     found_keywords = set()

-    for line in content.split("\n"):
-        for lower_kw, original_kw in AudienceChoices.choices:
-            if original_kw in line:
-                found_keywords.add(lower_kw)
+    for line in content.split("\n"):
+        line_norm = line.casefold()
+        for lower_kw, original_kw in AudienceChoices.choices:
+            if original_kw.casefold() in line_norm:
+                found_keywords.add(lower_kw)

     return sorted(found_keywords)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 2b20ccf and a697161.

📒 Files selected for processing (3)
  • backend/apps/owasp/management/commands/owasp_scrape_projects.py (1 hunks)
  • backend/apps/owasp/models/common.py (4 hunks)
  • backend/tests/apps/owasp/models/common_test.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • backend/tests/apps/owasp/models/common_test.py
  • backend/apps/owasp/management/commands/owasp_scrape_projects.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/apps/owasp/models/common.py (3)
backend/apps/owasp/models/enums/project.py (1)
  • AudienceChoices (8-13)
backend/apps/owasp/scraper.py (2)
  • get_audience (55-87)
  • get_urls (89-95)
backend/apps/github/utils.py (1)
  • get_repository_file_content (60-79)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Run backend tests
  • GitHub Check: Run frontend e2e tests
  • GitHub Check: Run frontend unit tests
  • GitHub Check: CodeQL (javascript-typescript)
🔇 Additional comments (2)
backend/apps/owasp/models/common.py (2)

21-21: Import looks correct and scoped.
AudienceChoices import path matches the enums module location.


105-114: Info.md URL property: LGTM.
Consistent with index_md_url/leaders_md_url. No issues.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
backend/apps/owasp/models/entity_member.py (2)

57-58: Index and normalize member_email for lookups and dedupe.

Email will be a frequent join/dedupe key from MD parsing. Add an index and ensure consistent casing.

-    member_email = models.EmailField(blank=True, default="")
+    member_email = models.EmailField(blank=True, default="", db_index=True)

Optionally, normalize on save:

def save(self, *args, **kwargs):
    if self.member_email:
        self.member_email = self.member_email.strip().lower()
    super().save(*args, **kwargs)

69-73: Harden str fallback to avoid empty names.

If both member is None and member_name is empty, the display becomes blank. Fall back to email or “Unknown”.

-        display_name = self.member.login if self.member else self.member_name
-        return f"{display_name} as {self.get_role_display()} for {self.entity}"
+        display_name = (
+            (getattr(self.member, "login", "") if self.member else "")
+            or self.member_name
+            or self.member_email
+            or "Unknown"
+        )
+        return f"{display_name} as {self.get_role_display()} for {self.entity}"
backend/apps/owasp/migrations/0051_entitymember_member_email_entitymember_member_name_and_more.py (1)

13-35: Optional data backfill for nicer display.

Backfill member_name from existing relations to keep str meaningful for legacy rows lacking MD-provided names.

I can add a RunPython migration that sets member_name = member.login where member_name="" and member IS NOT NULL. Want me to draft it?

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 37362e7 and f289f92.

📒 Files selected for processing (5)
  • backend/apps/owasp/management/commands/owasp_scrape_projects.py (1 hunks)
  • backend/apps/owasp/migrations/0051_entitymember_member_email_entitymember_member_name_and_more.py (1 hunks)
  • backend/apps/owasp/models/common.py (5 hunks)
  • backend/apps/owasp/models/entity_member.py (2 hunks)
  • backend/tests/apps/owasp/models/common_test.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • backend/apps/owasp/models/common.py
  • backend/tests/apps/owasp/models/common_test.py
  • backend/apps/owasp/management/commands/owasp_scrape_projects.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/apps/owasp/models/entity_member.py (1)
backend/apps/mentorship/api/internal/nodes/mentor.py (1)
  • login (23-25)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Run frontend e2e tests
  • GitHub Check: Run backend tests
  • GitHub Check: Run frontend unit tests
🔇 Additional comments (2)
backend/apps/owasp/models/entity_member.py (1)

69-72: .login is defined on User (inherited from GenericUserModel); no changes required.

backend/apps/owasp/migrations/0051_entitymember_member_email_entitymember_member_name_and_more.py (1)

1-12: No action needed—Django is already pinned to cover 5.2.6
backend/pyproject.toml declares django = "^5.1", which resolves to ≥5.1.0 and <6.0.0, so migrations generated by 5.2.6 will be consistent.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (10)
backend/tests/apps/owasp/management/commands/owasp_scrape_committees_test.py (2)

78-93: Add assertions to lock in new behavior (leaders + no scraper.get_urls).
Strengthen the test so regressions back to scraper.get_urls get caught and the leaders call is exercised.

Apply:

@@
         mock_active_committees.count.assert_called_once()
 
         assert mock_bulk_save.called
 
         assert mock_print.call_count == (committees - offset)
 
+        # New API usage assertions
+        mock_committee.get_leaders.assert_called_once()
+        mock_committee.get_urls.assert_called()
+        # Ensure old path isn't used anymore
+        mock_scraper.get_urls.assert_not_called()
+
         for call in mock_print.call_args_list:
             args, _ = call
             assert "https://owasp.org/www-committee-test" in args[0]

49-50: Remove redundant get_related_url side_effect assignment.
You set the same side_effect in the fixture already; this second assignment is unnecessary.

-        mock_committee.get_related_url.side_effect = lambda url, **_: url
+        # Redundant with fixture setup; can be removed.

Also applies to: 23-23

backend/apps/owasp/management/commands/owasp_scrape_chapters.py (3)

44-57: Avoid calling normalize_url(None); split steps for clarity and safety.
Guard against None from get_related_url before normalizing; keeps behavior explicit.

-            scraped_urls = sorted(
-                {
-                    repository_url
-                    for url in set(chapter.get_urls())
-                    if (
-                        repository_url := normalize_url(
-                            chapter.get_related_url(
-                                url, exclude_domains=("github.com", "owasp.org")
-                            )
-                        )
-                    )
-                    and repository_url not in {chapter.github_url, chapter.owasp_url}
-                }
-            )
+            scraped_urls = sorted(
+                {
+                    repo_url
+                    for url in set(chapter.get_urls())
+                    if (rel := chapter.get_related_url(url, exclude_domains=("github.com", "owasp.org")))
+                    and (repo_url := normalize_url(rel))
+                    and repo_url not in {chapter.github_url, chapter.owasp_url}
+                }
+            )

67-72: Avoid shadowing verified_url; use a distinct name for the normalized related URL.
Improves readability and log accuracy.

-                if verified_url := chapter.get_related_url(
-                    normalize_url(verified_url, check_path=True)
-                ):
-                    related_urls.add(verified_url)
+                if related_url := chapter.get_related_url(
+                    normalize_url(verified_url, check_path=True)
+                ):
+                    related_urls.add(related_url)
                 else:
-                    logger.info("Skipped related URL %s", verified_url)
+                    logger.info("Skipped related URL %s", related_url)

42-76: Deduplicate URL-pipeline logic shared with committees.
The “collect, verify, normalize, filter” flow is duplicated across commands. Consider extracting a shared helper (e.g., on the model or a utils module) to reduce drift.

backend/apps/owasp/management/commands/owasp_scrape_committees.py (3)

44-57: Guard normalize_url against None and simplify the filter.
Same safety/readability improvement as chapters.

-            scraped_urls = sorted(
-                {
-                    repository_url
-                    for url in set(committee.get_urls())
-                    if (
-                        repository_url := normalize_url(
-                            committee.get_related_url(
-                                url, exclude_domains=("github.com", "owasp.org")
-                            )
-                        )
-                    )
-                    and repository_url not in {committee.github_url, committee.owasp_url}
-                }
-            )
+            scraped_urls = sorted(
+                {
+                    repo_url
+                    for url in set(committee.get_urls())
+                    if (rel := committee.get_related_url(url, exclude_domains=("github.com", "owasp.org")))
+                    and (repo_url := normalize_url(rel))
+                    and repo_url not in {committee.github_url, committee.owasp_url}
+                }
+            )

67-72: Rename shadowed variable for clarity.
Avoid reusing verified_url for the post-processed related URL.

-                if verified_url := committee.get_related_url(
-                    normalize_url(verified_url, check_path=True)
-                ):
-                    related_urls.add(verified_url)
+                if related_url := committee.get_related_url(
+                    normalize_url(verified_url, check_path=True)
+                ):
+                    related_urls.add(related_url)
                 else:
-                    logger.info("Skipped related URL %s", verified_url)
+                    logger.info("Skipped related URL %s", related_url)

42-76: Extract shared URL-processing helper to avoid duplication with chapters.
Centralize the “discover → normalize → verify → filter” logic to keep behavior consistent across entities.

backend/tests/apps/owasp/management/commands/owasp_scrape_chapters_test.py (2)

78-93: Harden the test with explicit API-usage assertions.
Mirror the committees test improvements to catch regressions and exercise leaders.

@@
         mock_active_chapters.count.assert_called_once()
 
         assert mock_bulk_save.called
 
         assert mock_print.call_count == (chapters - offset)
 
+        # New API usage assertions
+        mock_chapter.get_leaders.assert_called_once()
+        mock_chapter.get_urls.assert_called()
+        # Ensure old path isn't used anymore
+        mock_scraper.get_urls.assert_not_called()
+
         for call in mock_print.call_args_list:
             args, _ = call
             assert "https://owasp.org/www-chapter-test" in args[0]

49-50: Drop duplicate get_related_url side_effect setup.
Already set in the fixture; the extra assignment in the test body can be removed.

-        mock_chapter.get_related_url.side_effect = lambda url, **_: url
+        # Redundant with fixture setup; can be removed.

Also applies to: 23-23

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f289f92 and 5c33591.

📒 Files selected for processing (5)
  • backend/apps/owasp/management/commands/owasp_scrape_chapters.py (1 hunks)
  • backend/apps/owasp/management/commands/owasp_scrape_committees.py (1 hunks)
  • backend/apps/owasp/management/commands/owasp_scrape_projects.py (1 hunks)
  • backend/tests/apps/owasp/management/commands/owasp_scrape_chapters_test.py (1 hunks)
  • backend/tests/apps/owasp/management/commands/owasp_scrape_committees_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/apps/owasp/management/commands/owasp_scrape_projects.py
🧰 Additional context used
🧬 Code graph analysis (4)
backend/apps/owasp/management/commands/owasp_scrape_chapters.py (2)
backend/apps/owasp/models/common.py (2)
  • get_leaders (189-209)
  • get_urls (273-284)
backend/apps/owasp/scraper.py (1)
  • get_urls (89-95)
backend/tests/apps/owasp/management/commands/owasp_scrape_committees_test.py (1)
backend/tests/apps/owasp/management/commands/owasp_enrich_committees_test.py (1)
  • mock_committee (18-22)
backend/apps/owasp/management/commands/owasp_scrape_committees.py (2)
backend/apps/owasp/models/common.py (2)
  • get_leaders (189-209)
  • get_urls (273-284)
backend/apps/owasp/scraper.py (1)
  • get_urls (89-95)
backend/tests/apps/owasp/management/commands/owasp_scrape_chapters_test.py (1)
backend/tests/apps/owasp/management/commands/owasp_enrich_chapters_test.py (1)
  • mock_chapter (18-25)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Run frontend unit tests
  • GitHub Check: Run frontend e2e tests
  • GitHub Check: Run backend tests
  • GitHub Check: CodeQL (javascript-typescript)
🔇 Additional comments (6)
backend/tests/apps/owasp/management/commands/owasp_scrape_committees_test.py (1)

41-46: Switched URL source to committee.get_urls — matches migration intent.
Good change; the test now exercises the new model-driven path instead of the scraper.

backend/apps/owasp/management/commands/owasp_scrape_chapters.py (2)

47-47: Use chapter.get_urls — aligned with .md-based parsing.
This aligns the command with the new RepositoryBasedEntityModel APIs.


42-42: Leaders now sourced from leaders.md — confirm whether emails should be persisted too.
If leader emails are required (per PR goal), consider also calling a corresponding API (e.g., get_leaders_emails) and persisting them.

backend/apps/owasp/management/commands/owasp_scrape_committees.py (2)

47-47: Use committee.get_urls — aligned with .md-based parsing.
This matches the PR direction to move away from DOM scraping for URL discovery.


42-42: Leaders now sourced from leaders.md — confirm whether emails should be persisted too.
If leader emails are part of the requirements, add retrieval/persistence (e.g., get_leaders_emails) here as well.

backend/tests/apps/owasp/management/commands/owasp_scrape_chapters_test.py (1)

41-46: Switched URL source to chapter.get_urls — aligns with new model API.
Good update; the test now validates the .md-based URL path.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
backend/apps/owasp/management/commands/owasp_scrape_projects.py (1)

55-59: Avoid double-fetching leaders.md and keep policy flag in sync.

Reuse leaders_emails to derive leaders_raw (one network call) and set is_leaders_policy_compliant alongside. Mirrors prior feedback.

-            project.audience = project.get_audience()
-            project.leaders_raw = project.get_leaders()
-            leaders_emails = project.get_leaders_emails()
-            if leaders_emails:
-                project.sync_leaders(leaders_emails)
+            project.audience = project.get_audience()
+            leaders_emails = project.get_leaders_emails()
+            # Derive leaders list from emails to avoid a second fetch; fallback if empty.
+            project.leaders_raw = list(leaders_emails) if leaders_emails else project.get_leaders()
+            project.is_leaders_policy_compliant = len(project.leaders_raw) > 1
+            if leaders_emails:
+                project.sync_leaders(leaders_emails)
🧹 Nitpick comments (5)
backend/apps/owasp/management/commands/owasp_scrape_projects.py (2)

51-51: Remove stray debug print.

Leftover print("what") will pollute command output.

-                print("what")
+                logger.debug("OWASP page not found for %s; deactivating", project.owasp_url)

65-65: Drop redundant set() call in comprehension.

Outer set already dedupes; inner set(...) adds work without benefit.

-                    for url in set(project.get_urls(domain="github.com"))
+                    for url in project.get_urls(domain="github.com")
backend/apps/owasp/models/common.py (3)

107-115: Guard against missing default_branch.

Prevent generating an invalid URL when default_branch is None.

-        return (
+        return (
             "https://raw.githubusercontent.com/OWASP/"
-            f"{self.owasp_repository.key}/{self.owasp_repository.default_branch}/info.md"
-            if self.owasp_repository
+            f"{self.owasp_repository.key}/{self.owasp_repository.default_branch}/info.md"
+            if self.owasp_repository and self.owasp_repository.default_branch
             else None
         )

213-231: Harden leaders email extraction (support “+” bullets and MAILTO variants).

Current regex misses + bullets and upper/mixed-case mailto. Minor expansion improves robustness without changing semantics.

-            matches = re.findall(
-                r"[-*]\s*(?:\[([^\]]+)\]\(mailto:([^)]+)\)|([^[\n]+))", line.strip()
-            )
+            matches = re.findall(
+                r"[-*+]\s*(?:\[([^\]]+)\]\(\s*(?i:mailto:)\s*([^)]+)\)|([^[\n]+))",
+                line.strip(),
+            )

275-287: Strip trailing punctuation and dedupe URLs while preserving order.

Avoids false invalids like ...repo). and repeated links.

-        urls = re.findall(r"https?:\/\/[^\s\)]+", content.strip())
+        raw_urls = re.findall(r"https?://[^\s\)]+", content)
+        seen, urls = set(), []
+        for u in raw_urls:
+            u = u.rstrip(").,;:!?")
+            if u not in seen:
+                seen.add(u)
+                urls.append(u)

(Domain filtering via urlparse(url).netloc == domain is correct; keep as-is.)

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5c33591 and 9acb861.

📒 Files selected for processing (4)
  • backend/apps/owasp/management/commands/owasp_scrape_chapters.py (1 hunks)
  • backend/apps/owasp/management/commands/owasp_scrape_committees.py (1 hunks)
  • backend/apps/owasp/management/commands/owasp_scrape_projects.py (1 hunks)
  • backend/apps/owasp/models/common.py (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • backend/apps/owasp/management/commands/owasp_scrape_committees.py
  • backend/apps/owasp/management/commands/owasp_scrape_chapters.py
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:48:44.892Z
Learning: When migrating scraper logic to GitHub .md file parsing in the OWASP project, the new implementation should maintain the same case-sensitivity behavior as the original scraper logic, even if case-insensitive matching might seem more robust.
📚 Learning: 2025-09-06T12:57:57.997Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:57:57.997Z
Learning: When filtering URLs by domain in the OWASP project, use `urlparse(url).netloc == domain` instead of `domain in url` to avoid security vulnerabilities where malicious subdomains could bypass filtering (e.g., `testowasp.org` would match when filtering for `owasp.org`).

Applied to files:

  • backend/apps/owasp/models/common.py
📚 Learning: 2025-09-06T12:48:44.892Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:48:44.892Z
Learning: When migrating scraper logic to GitHub .md file parsing in the OWASP project, the new implementation should maintain the same case-sensitivity behavior as the original scraper logic, even if case-insensitive matching might seem more robust.

Applied to files:

  • backend/apps/owasp/models/common.py
🧬 Code graph analysis (2)
backend/apps/owasp/models/common.py (4)
backend/apps/common/models.py (2)
  • BulkSaveModel (10-34)
  • bulk_save (19-34)
backend/apps/owasp/models/enums/project.py (1)
  • AudienceChoices (8-13)
backend/apps/github/utils.py (1)
  • get_repository_file_content (60-79)
backend/apps/owasp/models/entity_member.py (2)
  • EntityMember (10-72)
  • Role (13-16)
backend/apps/owasp/management/commands/owasp_scrape_projects.py (2)
backend/apps/owasp/models/common.py (6)
  • deactivate (146-149)
  • get_audience (176-189)
  • get_leaders (191-211)
  • get_leaders_emails (213-231)
  • sync_leaders (299-338)
  • get_urls (275-286)
backend/apps/owasp/models/project.py (1)
  • deactivate (278-281)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Run frontend unit tests
  • GitHub Check: Run frontend e2e tests
  • GitHub Check: Run backend tests
  • GitHub Check: CodeQL (javascript-typescript)
🔇 Additional comments (2)
backend/apps/owasp/models/common.py (2)

11-11: Imports look good.

All newly added imports are used.

Also applies to: 14-14, 23-23


176-190: Audience parsing preserves original case-sensitive behavior.

Matches the legacy scraper semantics as requested. Consider adding tests that assert case-sensitive detection for “Breaker/Builder/Defender”.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
backend/apps/owasp/models/common.py (1)

299-338: Sync leaders must update order/reactivation and deactivate stale entries; also persist all changed fields.
As previously noted, existing leaders’ order/is_active aren’t updated, removed leaders remain active, and bulk_save only updates member_email.

Minimal persistence fix (doesn’t cover the full logic change already suggested earlier):

-        if members_to_save:
-            BulkSaveModel.bulk_save(EntityMember, members_to_save, ["member_email"])
+        if members_to_save:
+            BulkSaveModel.bulk_save(
+                EntityMember,
+                members_to_save,
+                ["member_email", "order", "is_active"],
+            )

Please also adopt the earlier upsert/reorder/reactivate/deactivate logic so state is consistent.

🧹 Nitpick comments (1)
backend/apps/owasp/models/common.py (1)

275-287: Harden get_urls: trim trailing punctuation, dedupe, and normalize netloc case.
Prevents false negatives on domain filter (e.g., trailing “.”) and avoids duplicates while keeping strict netloc equality per security learning.

-        urls = re.findall(r"https?:\/\/[^\s\)]+", content.strip())
+        raw_urls = re.findall(r"https?://[^\s\)]+", content)
+        # Strip common trailing punctuation from Markdown/text usage and dedupe (stable).
+        urls = [u.rstrip(").,;:!?]") for u in raw_urls]
+        urls = list(dict.fromkeys(urls))
 
         if domain:
-            return [url for url in urls if urlparse(url).netloc == domain]
+            dn = domain.lower()
+            return [u for u in urls if urlparse(u).netloc.lower() == dn]
 
         return urls
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9acb861 and 674fb18.

📒 Files selected for processing (2)
  • backend/apps/owasp/models/common.py (7 hunks)
  • backend/tests/apps/owasp/models/common_test.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/tests/apps/owasp/models/common_test.py
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:48:44.892Z
Learning: When migrating scraper logic to GitHub .md file parsing in the OWASP project, the new implementation should maintain the same case-sensitivity behavior as the original scraper logic, even if case-insensitive matching might seem more robust.
📚 Learning: 2025-09-06T12:57:57.997Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:57:57.997Z
Learning: When filtering URLs by domain in the OWASP project, use `urlparse(url).netloc == domain` instead of `domain in url` to avoid security vulnerabilities where malicious subdomains could bypass filtering (e.g., `testowasp.org` would match when filtering for `owasp.org`).

Applied to files:

  • backend/apps/owasp/models/common.py
📚 Learning: 2025-09-06T12:48:44.892Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:48:44.892Z
Learning: When migrating scraper logic to GitHub .md file parsing in the OWASP project, the new implementation should maintain the same case-sensitivity behavior as the original scraper logic, even if case-insensitive matching might seem more robust.

Applied to files:

  • backend/apps/owasp/models/common.py
🧬 Code graph analysis (1)
backend/apps/owasp/models/common.py (4)
backend/apps/common/models.py (2)
  • BulkSaveModel (10-34)
  • bulk_save (19-34)
backend/apps/owasp/models/enums/project.py (1)
  • AudienceChoices (8-13)
backend/apps/github/utils.py (1)
  • get_repository_file_content (60-79)
backend/apps/owasp/models/entity_member.py (2)
  • EntityMember (10-72)
  • Role (13-16)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Run frontend e2e tests
  • GitHub Check: Run backend tests
  • GitHub Check: Run frontend unit tests
🔇 Additional comments (3)
backend/apps/owasp/models/common.py (3)

11-11: LGTM: supporting imports added appropriately.
ContentType, BulkSaveModel, and AudienceChoices imports align with the new behaviors.

Also applies to: 14-14, 23-23


107-116: LGTM: info.md URL property.
Straightforward and consistent with index_md_url.


176-190: LGTM: audience extraction preserves original case-sensitive behavior.
Matches the original scraper semantics (per retrieved learnings) and returns normalized values.

urls = re.findall(r"https?:\/\/[^\s\)]+", content.strip())

if domain:
return [url for url in urls if urlparse(url).netloc == domain]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also match subdomains?

leaders = {}
for line in content.split("\n"):
matches = re.findall(
r"^[-*]\s*(?:\[([^\]]+)\]\(mailto:([^)]+)(\)|([^[<\n])))", line.strip()
Copy link
Collaborator Author

@rudransh-shrivastava rudransh-shrivastava Sep 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tried to fix this regex for hours, I can't figure it out.
^[-*]\s*(?:\[([^\]]+)\]\(mailto:([^)]+)(\)|([^[<\n]))) dropped leaders without emails
^[-*]\s*(?:\[([^\]]+)\]\(mailto:([^)]+)(\)|([^[<\n]))) was also promising but it too fails tests.

I added many tests for this. Please check them. I think if we can have a stricter rule for matching leaders names and emails I can figure this out.

example:
does * [John Doe] match name?
does * John Doe match name?

I also noticed that it was parsing <a> tags as names in earlier commits.

@arkid15r

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to be a one-time migration so it's not required to be ideal. Some required data can be added manually later on a higher level process.

I appreciate attention to details.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
backend/tests/apps/owasp/models/common_test.py (1)

191-226: Expand domain-filter cases to lock in security semantics (hostname equality-or-subdomain, case-insensitive, port-safe).
Current tests won’t catch regressions (e.g., lookalike domains, ports, case). Please extend as below; confirm whether subdomains should be accepted (test includes both). This also aligns with prior review feedback.

@@
     @pytest.mark.parametrize(
-        ("content", "domain", "expected_urls"),
+        ("content", "domain", "expected_urls"),
         [
@@
             ("This test contains no URLs.", None, []),
             ("", None, []),
             (None, None, []),
+            ("* [Fake](https://maliciousowasp.org)", "owasp.org", []),
+            ("* [UK](https://owasp.org.uk)", "owasp.org", []),
+            ("* [Home](https://OWASP.org)", "OWASP.ORG", ["https://OWASP.org"]),
+            ("* [With port](https://owasp.org:443/path)", "owasp.org", ["https://owasp.org:443/path"]),
+            ("* [Subdomain](https://test.owasp.org/docs)", "owasp.org", ["https://test.owasp.org/docs"]),
         ],
     )

If subdomains should NOT be accepted, change the last case’s expected_urls to [] and I’ll adjust the production code suggestion accordingly.

Production code change (outside this file) to make tests pass and harden filtering:

--- a/backend/apps/owasp/models/common.py
+++ b/backend/apps/owasp/models/common.py
@@ def get_urls(self, domain=None):
-        if domain:
-            return [url for url in urls if urlparse(url).netloc == domain]
+        if domain:
+            domain = domain.lstrip(".").lower()
+            def host_matches(u: str) -> bool:
+                host = urlparse(u).hostname
+                if not host:
+                    return False
+                host = host.lower()
+                # Accept apex or proper subdomains; case-insensitive; ports already stripped by .hostname
+                return host == domain or host.endswith("." + domain)
+            return [u for u in urls if host_matches(u)]
🧹 Nitpick comments (2)
backend/tests/apps/owasp/models/common_test.py (2)

22-49: Audience tests cover happy-path and empties.
Consider adding a duplicate keyword case to assert de-duplication remains stable, but optional.


321-349: Strengthen assertions for sync_leaders mixed path.
Verify update_fields and constructor args for the new leader.

@@ def test_sync_leaders_mixed_scenario(
-        call_args = mock_bulk_save.bulk_save.call_args
-        leaders_to_save = call_args[0][1]
+        call_args = mock_bulk_save.bulk_save.call_args
+        leaders_to_save = call_args[0][1]
         assert len(leaders_to_save) == 2  # Updated existing + new leader
         assert leaders_to_save[0] == existing_leader
+        # Ensure only member_email updates are persisted
+        assert call_args[0][2] == ["member_email"]
+        # New leader constructed with expected fields and order
+        mock_entity_member.assert_called_once()
+        _, kwargs = mock_entity_member.call_args
+        assert kwargs["member_name"] == "Jane Smith"
+        assert kwargs["member_email"] == "[email protected]"
+        assert kwargs["order"] == 1
+        assert kwargs["role"] == mock_entity_member.Role.LEADER
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 674fb18 and 68e037f.

📒 Files selected for processing (2)
  • backend/apps/owasp/models/common.py (7 hunks)
  • backend/tests/apps/owasp/models/common_test.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/apps/owasp/models/common.py
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:48:44.892Z
Learning: When migrating scraper logic to GitHub .md file parsing in the OWASP project, the new implementation should maintain the same case-sensitivity behavior as the original scraper logic, even if case-insensitive matching might seem more robust.
📚 Learning: 2025-09-06T12:57:57.997Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:57:57.997Z
Learning: When filtering URLs by domain in the OWASP project, use `urlparse(url).netloc == domain` instead of `domain in url` to avoid security vulnerabilities where malicious subdomains could bypass filtering (e.g., `testowasp.org` would match when filtering for `owasp.org`).

Applied to files:

  • backend/tests/apps/owasp/models/common_test.py
🧬 Code graph analysis (1)
backend/tests/apps/owasp/models/common_test.py (2)
backend/apps/owasp/models/common.py (4)
  • get_audience (176-189)
  • get_leaders_emails (213-231)
  • get_urls (275-286)
  • sync_leaders (299-338)
backend/tests/apps/ai/common/base/chunk_command_test.py (1)
  • mock_content_type (64-68)
🔇 Additional comments (5)
backend/tests/apps/owasp/models/common_test.py (5)

1-1: Imports look good.
No issues with MagicMock/Mock/patch usage.


16-21: Solid test setup.
Fixture keeps tests isolated; id preset helps with FK lookups.


85-146: Leaders-email parsing: good edge coverage.
Nice handling of spacing, non-mailto bullets, and HTML anchor exclusion.


286-299: No-op path correctly avoids unnecessary writes.
Mocks are wired correctly; assertion precise.


300-320: Unchanged leader: good negative assertion.
Covers idempotency.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
backend/apps/owasp/models/common.py (1)

299-338: Prevent duplicate name-identified leaders when member is NULL (DB-level guard).

The current unique_together uses member FK; duplicates can accumulate for (entity, role, member_name) when member is NULL. Add a partial unique constraint to EntityMember.Meta.

Proposed constraint (in backend/apps/owasp/models/entity_member.py):

from django.db import models

class EntityMember(models.Model):
    class Meta:
        constraints = [
            models.UniqueConstraint(
                fields=["entity_type", "entity_id", "role", "member_name"],
                condition=models.Q(member__isnull=True),
                name="uniq_entity_role_member_name_when_member_null",
            ),
        ]

Follow with a migration and de-dupe/backfill before applying.

🧹 Nitpick comments (2)
backend/apps/owasp/models/common.py (2)

176-190: Audience parsing mirrors original scraper semantics (case-sensitive).

Implementation aligns with the stated requirement to keep case-sensitive behavior. Minor nit: use splitlines() for robustness across platforms.

-        for line in content.split("\n"):
+        for line in content.splitlines():

275-287: URL extraction: trim trailing punctuation, dedupe, and (optionally) support subdomains safely.

  • Current regex may include trailing punctuation (e.g., final “.”) and returns duplicates.
  • On the “should we also match subdomains?” question: keep exact netloc by default for safety; optionally add include_subdomains flag using a dot-boundary check (no substring matching).

Apply:

-    def get_urls(self, domain=None):
+    def get_urls(self, domain=None, include_subdomains: bool = False):
         """Get URLs from info.md file on GitHub."""
-        content = get_repository_file_content(self.info_md_url)
+        # Short-circuit to avoid noisy logs/requests when URL is missing.
+        if not self.info_md_url:
+            return []
+        content = get_repository_file_content(self.info_md_url)
         if not content:
             return []
 
-        urls = re.findall(r"https?:\/\/[^\s\)]+", content.strip())
+        raw_urls = re.findall(r"https?://[^\s\)]+", content)
+        # Strip common trailing punctuation and dedupe while preserving order.
+        urls = []
+        seen = set()
+        for u in (ru.rstrip(").,;:!?") for ru in raw_urls):
+            if u not in seen:
+                seen.add(u)
+                urls.append(u)
 
         if domain:
-            return [url for url in urls if urlparse(url).netloc == domain]
+            def matches(netloc: str) -> bool:
+                return netloc == domain or (include_subdomains and netloc.endswith("." + domain))
+            return [url for url in urls if matches(urlparse(url).netloc)]
 
         return urls

Please confirm if any call sites rely on the old signature; if so, we can keep signature stable and always do exact match, adding subdomain support later behind a new param.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 68e037f and 024724a.

📒 Files selected for processing (2)
  • backend/apps/owasp/models/common.py (7 hunks)
  • backend/tests/apps/owasp/models/common_test.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/tests/apps/owasp/models/common_test.py
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:48:44.892Z
Learning: When migrating scraper logic to GitHub .md file parsing in the OWASP project, the new implementation should maintain the same case-sensitivity behavior as the original scraper logic, even if case-insensitive matching might seem more robust.
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:213-232
Timestamp: 2025-09-06T19:03:01.961Z
Learning: In the OWASP Nest project, the get_leaders_emails() method in RepositoryBasedEntityModel is designed to only capture leaders with mailto: links from leaders.md files, intentionally ignoring plain text names without email addresses. The current regex implementation works correctly for the intended behavior as validated by comprehensive test cases.
📚 Learning: 2025-09-06T12:57:57.997Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:57:57.997Z
Learning: When filtering URLs by domain in the OWASP project, use `urlparse(url).netloc == domain` instead of `domain in url` to avoid security vulnerabilities where malicious subdomains could bypass filtering (e.g., `testowasp.org` would match when filtering for `owasp.org`).

Applied to files:

  • backend/apps/owasp/models/common.py
📚 Learning: 2025-09-06T12:48:44.892Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:48:44.892Z
Learning: When migrating scraper logic to GitHub .md file parsing in the OWASP project, the new implementation should maintain the same case-sensitivity behavior as the original scraper logic, even if case-insensitive matching might seem more robust.

Applied to files:

  • backend/apps/owasp/models/common.py
📚 Learning: 2025-09-06T19:03:01.961Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:213-232
Timestamp: 2025-09-06T19:03:01.961Z
Learning: In the OWASP Nest project, the get_leaders_emails() method in RepositoryBasedEntityModel is designed to only capture leaders with mailto: links from leaders.md files, intentionally ignoring plain text names without email addresses. The current regex implementation works correctly for the intended behavior as validated by comprehensive test cases.

Applied to files:

  • backend/apps/owasp/models/common.py
📚 Learning: 2025-07-28T14:51:14.736Z
Learnt from: adithya-naik
PR: OWASP/Nest#1894
File: frontend/src/components/TopContributorsList.tsx:74-74
Timestamp: 2025-07-28T14:51:14.736Z
Learning: In the OWASP/Nest project, the maintainer adithya-naik prefers not to create separate components for code that's only used in two specific cases, following the YAGNI principle to avoid over-engineering when the duplication is limited and manageable.

Applied to files:

  • backend/apps/owasp/models/common.py
📚 Learning: 2025-08-16T13:23:34.874Z
Learnt from: ahmedxgouda
PR: OWASP/Nest#2021
File: backend/apps/slack/models/google_auth.py:25-36
Timestamp: 2025-08-16T13:23:34.874Z
Learning: In the OWASP Nest project, encryption and decryption for Google OAuth tokens stored in BinaryField are being handled in a separate PR, not in the initial Google Auth setup implementation.

Applied to files:

  • backend/apps/owasp/models/common.py
🧬 Code graph analysis (1)
backend/apps/owasp/models/common.py (4)
backend/apps/common/models.py (2)
  • BulkSaveModel (10-34)
  • bulk_save (19-34)
backend/apps/owasp/models/enums/project.py (1)
  • AudienceChoices (8-13)
backend/apps/github/utils.py (1)
  • get_repository_file_content (60-79)
backend/apps/owasp/models/entity_member.py (2)
  • EntityMember (10-72)
  • Role (13-16)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Run frontend e2e tests
  • GitHub Check: Run backend tests
  • GitHub Check: Run frontend unit tests
  • GitHub Check: CodeQL (javascript-typescript)
🔇 Additional comments (2)
backend/apps/owasp/models/common.py (2)

11-24: Imports look correct and scoped.

No concerns; these are needed for the new functionality.


107-115: info_md_url: LGTM.

Property guards None when repo is absent; consistent with existing index_md_url.

@rudransh-shrivastava
Copy link
Collaborator Author

Hi, I'm not sure if I'm saving the entity_members in the best way. I only insert new entity_members and update the existing ones.

I am also not sure what to do in the unique_together constraint for EntityMember coderabbit suggests, since member is now optional, this constraint is invalid. Can we use member_name instead?
Need your guidance in these parts. Thank you!

@arkid15r

@rudransh-shrivastava rudransh-shrivastava marked this pull request as ready for review September 6, 2025 19:26
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
backend/apps/owasp/management/commands/owasp_scrape_projects.py (1)

54-58: Leaders parsing wired in; also keep policy flag in sync and consider handling removals

Nice switch to the repository-backed API. Two follow-ups:

  • If the project model still has is_leaders_policy_compliant, set it here (e.g., require >=2 leaders with emails) so it stays consistent with leaders_raw/leaders_emails. Based on the retrieved learnings, get_leaders_emails intentionally only captures mailto links and sync_leaders matches by member_name, so basing compliance on leaders_emails is appropriate.
  • sync_leaders currently upserts email on existing names but doesn’t deactivate leaders removed from the markdown or update order for renamed/reordered entries. Consider extending sync_leaders to deactivate missing names and to update order when it changes.

Example inline tweak for the flag (only if the field exists on Project):

             project.leaders_raw = project.get_leaders()
-            leaders_emails = project.get_leaders_emails()
+            leaders_emails = project.get_leaders_emails()
             if leaders_emails:
                 project.sync_leaders(leaders_emails)
+            # keep policy flag consistent with parsed leaders (if present on model)
+            if hasattr(project, "is_leaders_policy_compliant"):
+                project.is_leaders_policy_compliant = sum(1 for e in leaders_emails.values() if e) >= 2

If you’d like, I can draft a minimal sync_leaders enhancement that deactivates stale leaders and updates ordering.

🧹 Nitpick comments (1)
backend/apps/owasp/management/commands/owasp_scrape_projects.py (1)

61-68: Switch to project.get_urls looks good; tighten GitHub handling and fix a small log nit

  • Domain filter: ensure get_urls treats www.github.com the same as github.com (normalize/lstrip www. in the model method). This avoids silently dropping valid links.
  • User/org pages: when GITHUB_USER_RE matches, try get_organization first and, on UnknownObjectException, fall back to gh.get_user(login) to include user repos. Today, user pages will be skipped silently after the exception.
  • Log message nit: when get_related_url returns falsy, logging verified_url prints None; log scraped_url instead.

Patch for the log:

-                else:
-                    logger.info("Skipped related URL %s", verified_url)
+                else:
+                    logger.info("Skipped related URL %s", scraped_url)

I can also provide a small helper to resolve GitHub principals (org or user) with caching to cut API calls across projects.

Also applies to: 94-96

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 024724a and e2cc92b.

📒 Files selected for processing (1)
  • backend/apps/owasp/management/commands/owasp_scrape_projects.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/entity_member.py:50-56
Timestamp: 2025-09-06T19:28:14.273Z
Learning: In the OWASP/Nest project, when migrating scraper logic to GitHub .md file parsing, the sync_leaders method uses member_name as the primary identifier for finding and updating existing EntityMember records, not member_email. This approach is chosen because names are more stable identifiers in markdown files, while emails might be added/updated over time.
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:213-232
Timestamp: 2025-09-06T19:03:01.961Z
Learning: In the OWASP Nest project, the get_leaders_emails() method in RepositoryBasedEntityModel is designed to only capture leaders with mailto: links from leaders.md files, intentionally ignoring plain text names without email addresses. The current regex implementation works correctly for the intended behavior as validated by comprehensive test cases.
📚 Learning: 2025-09-06T19:28:14.273Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/entity_member.py:50-56
Timestamp: 2025-09-06T19:28:14.273Z
Learning: In the OWASP/Nest project, when migrating scraper logic to GitHub .md file parsing, the sync_leaders method uses member_name as the primary identifier for finding and updating existing EntityMember records, not member_email. This approach is chosen because names are more stable identifiers in markdown files, while emails might be added/updated over time.

Applied to files:

  • backend/apps/owasp/management/commands/owasp_scrape_projects.py
🧬 Code graph analysis (1)
backend/apps/owasp/management/commands/owasp_scrape_projects.py (1)
backend/apps/owasp/models/common.py (5)
  • get_audience (176-189)
  • get_leaders (191-211)
  • get_leaders_emails (213-231)
  • sync_leaders (299-338)
  • get_urls (275-286)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Run backend tests
  • GitHub Check: Run frontend e2e tests
  • GitHub Check: Run frontend unit tests

arkid15r
arkid15r previously approved these changes Sep 11, 2025
Copy link
Collaborator

@arkid15r arkid15r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rudransh-shrivastava thanks for working on this

It's a good start towards GitHub as an entire data source for OWASP entity sync. I update code based on our model method conventions (please look into those changes).

I'm going to merge this and expect full scraper migration into GitHub entity sync code in a following PR.

Copy link

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
backend/tests/apps/owasp/models/common_test.py (1)

191-215: Harden URL domain tests: add subdomain/case/port and lookalike-host scenarios.

Covers regressions and security edge cases; mirrors prior review thread.

     @pytest.mark.parametrize(
         ("content", "domain", "expected_urls"),
         [
@@
             ("This test contains no URLs.", None, []),
             ("", None, []),
             (None, None, []),
+            ("* [Fake](https://maliciousowasp.org)", "owasp.org", []),
+            ("* [UK](https://owasp.org.uk)", "owasp.org", []),
+            ("* [Home](https://OWASP.org)", "OWASP.ORG", ["https://OWASP.org"]),
+            ("* [With port](https://owasp.org:443/path)", "owasp.org", ["https://owasp.org:443/path"]),
+            ("* [Sub](https://test.owasp.org/page)", "owasp.org", ["https://test.owasp.org/page"]),
         ],
     )
🧹 Nitpick comments (14)
backend/apps/owasp/migrations/0052_remove_entitymember_owasp_entit_member__6e516f_idx_and_more.py (2)

26-29: Consider case-insensitive uniqueness for member_name.

Names differ by case (“John Doe” vs “john doe”) will be treated as distinct. Prefer a UniqueConstraint on Lower(member_name) over unique_together for better dedup semantics.

Proposed follow-up migration:

from django.db import migrations, models
from django.db.models.functions import Lower

class Migration(migrations.Migration):
    dependencies = [("owasp", "0052_remove_entitymember_owasp_entit_member__6e516f_idx_and_more")]
    operations = [
        migrations.AlterUniqueTogether(name="entitymember", unique_together=set()),
        migrations.AddConstraint(
            model_name="entitymember",
            constraint=models.UniqueConstraint(
                expressions=["entity_type","entity_id",Lower("member_name"),"role"],
                name="uniq_entity_member_name_role_ci",
            ),
        ),
    ]

13-16: Validate dropping index owasp_entit_member__6e516f_idx won’t regress common queries.

If that index covered “member” lookups, admin searches or syncs might slow down. Recreate a targeted index if needed (e.g., on ["entity_type","entity_id","role"] or ["member"]).

backend/apps/owasp/admin/entity_member.py (4)

36-41: Add member_name to search_fields.

Enables searching by the stored name when no linked GitHub user exists.

     search_fields = (
         "member__login",
         "member__name",
+        "member_name",
         "description",
     )

35-35: Drop raw_id_fields when using autocomplete_fields for the same relation.

They’re alternatives; keeping both is redundant and can confuse admins.

-    raw_id_fields = ("member",)
+    # raw_id_fields not needed; using autocomplete_fields above.

41-41: Order by member_name as a fallback.

When member is NULL, add member_name to ordering to keep lists stable.

-    ordering = ("member__name", "order")
+    ordering = ("member__name", "member_name", "order")

15-29: Optional: reduce N+1 queries in list view.

Select related member and content type.

 class EntityMemberAdmin(admin.ModelAdmin):
@@
     ordering = ("member__name", "member_name", "order")
 
+    def get_queryset(self, request):
+        qs = super().get_queryset(request)
+        return qs.select_related("member", "entity_type")
backend/tests/apps/owasp/models/common_test.py (2)

303-326: Strengthen sync_leaders assertions; verify update vs. insert.

Patch EntityMember.update_data and assert exact payloads and save=False calls; remove unused mocks.

-    @patch("apps.owasp.models.common.ContentType")
-    @patch("apps.owasp.models.common.EntityMember")
-    @patch("apps.owasp.models.common.BulkSaveModel")
+    @patch("apps.owasp.models.common.ContentType")
+    @patch("apps.owasp.models.common.BulkSaveModel")
+    @patch("apps.owasp.models.common.EntityMember.update_data")
     def test_sync_leaders_mixed_scenario(
-        self, mock_bulk_save, mock_entity_member, mock_content_type
+        self, mock_update_data, mock_bulk_save, mock_content_type
     ):
@@
-        mock_entity_member.objects.filter.return_value = [existing_leader]
-
         leaders_emails = {
             "John Doe": "[email protected]",  # Update existing
             "Jane Smith": "[email protected]",  # New leader
         }
 
         self.model.sync_leaders(leaders_emails)
 
-        call_args = mock_bulk_save.bulk_save.call_args
-        leaders_to_save = call_args[0][1]
-
-        assert len(leaders_to_save) == 2  # Updated existing + new leader
+        # Two calls: update + insert
+        assert mock_update_data.call_count == 2
+        # Bulk save with the 2 returned instances
+        assert mock_bulk_save.bulk_save.call_count == 1

286-299: Remove unused filter mock in empty sync test.

It’s not used by the implementation.

-        mock_entity_member.objects.filter.return_value = []
-
         self.model.sync_leaders({})
backend/apps/owasp/models/entity_member.py (6)

20-25: Unique by member_name matches sync strategy; mind homonyms and case.

This will merge same-name leaders under the same entity/role (e.g., two “Alex Kim”). If that’s acceptable per product requirements, proceed; otherwise consider augmenting with email when present or a case-insensitive constraint.


26-28: Add a supporting index for common reads.

Fetching leaders typically filters by (entity_type, entity_id, role) and orders by order. Add a composite index to speed those queries.

         indexes = [
             models.Index(fields=["entity_type", "entity_id"]),
+            models.Index(fields=["entity_type", "entity_id", "role", "order"]),
         ]

73-75: Guard str against empty member_name.

If member_name is accidentally blank, this returns a trailing “as … for …”. Provide a fallback.

-        display_name = self.member.login if self.member else self.member_name
+        display_name = self.member.login if self.member else (self.member_name or "Unknown")

76-99: Use update_or_create for atomic upsert and fewer race conditions.

Prevents IntegrityError under concurrent writers and simplifies logic.

     @staticmethod
     def update_data(data, *, save: bool = True) -> "EntityMember":
-        """Update entity member data."""
-        try:
-            entity_member = EntityMember.objects.get(
-                entity_id=data["entity_id"],
-                entity_type=data["entity_type"],
-                member_name=data["member_name"],
-                role=data["role"],
-            )
-        except EntityMember.DoesNotExist:
-            entity_member = EntityMember(
-                entity_id=data["entity_id"],
-                entity_type=data["entity_type"],
-                member_name=data["member_name"],
-                role=data["role"],
-            )
-
-        entity_member.from_dict(data)
-        if save:
-            entity_member.save()
-
-        return entity_member
+        """Update entity member data."""
+        defaults = {
+            "member_email": (data.get("member_email") or "").strip(),
+            "order": data.get("order", 0),
+        }
+        entity_member, _ = EntityMember.objects.update_or_create(
+            entity_id=data["entity_id"],
+            entity_type=data["entity_type"],
+            member_name=data["member_name"].strip(),
+            role=data["role"],
+            defaults=defaults,
+        )
+        return entity_member

100-113: Normalize inputs to avoid dupes from whitespace variance.

Trim name/email before assignment.

     def from_dict(self, data) -> None:
         """Update instance based on dict data."""
         fields = {
             "entity_id": data["entity_id"],
             "entity_type": data["entity_type"],
-            "member_email": data.get("member_email", ""),
-            "member_name": data["member_name"],
+            "member_email": (data.get("member_email") or "").strip(),
+            "member_name": data["member_name"].strip(),
             "order": data.get("order", 0),
             "role": data["role"],
         }

26-28: If admin relies on member lookups, consider re-adding an index on member.

You removed the old index in the migration; add this if you filter/join by member frequently.

         indexes = [
             models.Index(fields=["entity_type", "entity_id"]),
+            models.Index(fields=["member"]),
         ]
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e2cc92b and 7ebfe11.

📒 Files selected for processing (9)
  • backend/apps/owasp/admin/entity_member.py (1 hunks)
  • backend/apps/owasp/management/commands/owasp_scrape_chapters.py (1 hunks)
  • backend/apps/owasp/management/commands/owasp_scrape_committees.py (1 hunks)
  • backend/apps/owasp/management/commands/owasp_scrape_projects.py (1 hunks)
  • backend/apps/owasp/migrations/0052_remove_entitymember_owasp_entit_member__6e516f_idx_and_more.py (1 hunks)
  • backend/apps/owasp/models/common.py (7 hunks)
  • backend/apps/owasp/models/entity_member.py (3 hunks)
  • backend/tests/apps/owasp/models/common_test.py (5 hunks)
  • cspell/custom-dict.txt (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • cspell/custom-dict.txt
🚧 Files skipped from review as they are similar to previous changes (4)
  • backend/apps/owasp/management/commands/owasp_scrape_committees.py
  • backend/apps/owasp/models/common.py
  • backend/apps/owasp/management/commands/owasp_scrape_projects.py
  • backend/apps/owasp/management/commands/owasp_scrape_chapters.py
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/entity_member.py:50-56
Timestamp: 2025-09-06T19:28:14.297Z
Learning: In the OWASP/Nest project, when migrating scraper logic to GitHub .md file parsing, the sync_leaders method uses member_name as the primary identifier for finding and updating existing EntityMember records, not member_email. This approach is chosen because names are more stable identifiers in markdown files, while emails might be added/updated over time.
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:213-232
Timestamp: 2025-09-06T19:03:01.985Z
Learning: In the OWASP Nest project, the get_leaders_emails() method in RepositoryBasedEntityModel is designed to only capture leaders with mailto: links from leaders.md files, intentionally ignoring plain text names without email addresses. The current regex implementation works correctly for the intended behavior as validated by comprehensive test cases.
📚 Learning: 2025-09-06T19:28:14.297Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/entity_member.py:50-56
Timestamp: 2025-09-06T19:28:14.297Z
Learning: In the OWASP/Nest project, when migrating scraper logic to GitHub .md file parsing, the sync_leaders method uses member_name as the primary identifier for finding and updating existing EntityMember records, not member_email. This approach is chosen because names are more stable identifiers in markdown files, while emails might be added/updated over time.

Applied to files:

  • backend/apps/owasp/models/entity_member.py
📚 Learning: 2025-07-28T14:51:14.736Z
Learnt from: adithya-naik
PR: OWASP/Nest#1894
File: frontend/src/components/TopContributorsList.tsx:74-74
Timestamp: 2025-07-28T14:51:14.736Z
Learning: In the OWASP/Nest project, the maintainer adithya-naik prefers not to create separate components for code that's only used in two specific cases, following the YAGNI principle to avoid over-engineering when the duplication is limited and manageable.

Applied to files:

  • backend/apps/owasp/models/entity_member.py
📚 Learning: 2025-08-16T13:23:34.874Z
Learnt from: ahmedxgouda
PR: OWASP/Nest#2021
File: backend/apps/slack/models/google_auth.py:25-36
Timestamp: 2025-08-16T13:23:34.874Z
Learning: In the OWASP Nest project, encryption and decryption for Google OAuth tokens stored in BinaryField are being handled in a separate PR, not in the initial Google Auth setup implementation.

Applied to files:

  • backend/apps/owasp/models/entity_member.py
📚 Learning: 2025-09-06T12:48:44.916Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:48:44.916Z
Learning: When migrating scraper logic to GitHub .md file parsing in the OWASP project, the new implementation should maintain the same case-sensitivity behavior as the original scraper logic, even if case-insensitive matching might seem more robust.

Applied to files:

  • backend/apps/owasp/models/entity_member.py
📚 Learning: 2025-09-06T12:57:58.021Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:57:58.021Z
Learning: When filtering URLs by domain in the OWASP project, use `urlparse(url).netloc == domain` instead of `domain in url` to avoid security vulnerabilities where malicious subdomains could bypass filtering (e.g., `testowasp.org` would match when filtering for `owasp.org`).

Applied to files:

  • backend/tests/apps/owasp/models/common_test.py
📚 Learning: 2025-09-06T19:03:01.985Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:213-232
Timestamp: 2025-09-06T19:03:01.985Z
Learning: In the OWASP Nest project, the get_leaders_emails() method in RepositoryBasedEntityModel is designed to only capture leaders with mailto: links from leaders.md files, intentionally ignoring plain text names without email addresses. The current regex implementation works correctly for the intended behavior as validated by comprehensive test cases.

Applied to files:

  • backend/tests/apps/owasp/models/common_test.py
🧬 Code graph analysis (2)
backend/apps/owasp/models/entity_member.py (4)
backend/apps/owasp/admin/entity_member.py (1)
  • entity (52-65)
backend/apps/owasp/models/committee.py (2)
  • update_data (78-90)
  • save (54-59)
backend/apps/owasp/models/chapter.py (2)
  • update_data (195-217)
  • save (170-178)
backend/apps/owasp/models/project.py (2)
  • update_data (351-373)
  • save (326-331)
backend/tests/apps/owasp/models/common_test.py (1)
backend/apps/owasp/models/common.py (4)
  • get_audience (176-189)
  • get_leaders_emails (213-231)
  • get_urls (275-286)
  • sync_leaders (299-327)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Run frontend e2e tests
🔇 Additional comments (2)
backend/apps/owasp/migrations/0052_remove_entitymember_owasp_entit_member__6e516f_idx_and_more.py (1)

26-29: Add a RunPython data-migration to dedupe/normalize before enforcing the unique constraint

This AlterUniqueTogether will fail at migrate-time if any duplicate rows exist for (entity_type, entity_id, member_name, role) — including case/whitespace variants. Add a data-cleanup RunPython immediately before the AlterUniqueTogether to normalize member_name and remove duplicates, or verify/fix the data beforehand.

Example (place immediately before the AlterUniqueTogether):

def _normalize_name(name: str) -> str:
    return " ".join((name or "").split())

def dedupe_entity_members(apps, schema_editor):
    EntityMember = apps.get_model("owasp", "EntityMember")
    seen = set()
    for em in EntityMember.objects.all().only("id","entity_type_id","entity_id","role","member_name"):
        key = (em.entity_type_id, em.entity_id, em.role, _normalize_name(em.member_name).lower())
        if key in seen:
            EntityMember.objects.filter(id=em.id).delete()
        else:
            seen.add(key)
            em.member_name = _normalize_name(em.member_name)
            em.save(update_fields=["member_name"])

Then add:

migrations.RunPython(dedupe_entity_members, migrations.RunPython.noop),

Location: backend/apps/owasp/migrations/0052_remove_entitymember_owasp_entit_member__6e516f_idx_and_more.py

backend/apps/owasp/admin/entity_member.py (1)

20-29: Good: showing member_name in list_display.

This aligns the admin with the new model fields.

@arkid15r arkid15r added this pull request to the merge queue Sep 11, 2025
Merged via the queue into OWASP:main with commit fc930e9 Sep 11, 2025
25 checks passed
@rudransh-shrivastava
Copy link
Collaborator Author

I will work on it, thank you!

@coderabbitai coderabbitai bot mentioned this pull request Sep 13, 2025
5 tasks
arkid15r added a commit that referenced this pull request Sep 14, 2025
* add properties and methods for github.md files

* use Github .md files files instead of scraper for urls, leaders, and audience

* Update code(add tests)

* Update code

* Update code

* add get_leaders_emails() and add name, email to entity_member model

* Update code

* add entity_member/leaders sync

* update regex

* add tests

* update code and fix sonar and cr issues

* Update code

* Update code

* Fix cspell

---------

Co-authored-by: Kate Golovanova <[email protected]>
Co-authored-by: Arkadii Yakovets <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Migrate scraper to GitHub .md files parsing
3 participants