Feature/migrate scraper to GitHub md files #2223

rudransh-shrivastava · 2025-09-05T16:12:23Z

Resolves #2088

Proposed change

migrate scraper logic of audience , leaders and urls to Github .md files

Checklist

I've read and followed the contributing guidelines.
I've run make check-test locally; all checks and tests passed.

…audience

coderabbitai · 2025-09-05T16:12:30Z

Summary by CodeRabbit

New Features
- Automatically extract audience, links (including GitHub), and leader details from project/chapter/committee repositories.
- Synchronize leader names and emails into member records for improved accuracy.
- Admin: Entity Members list now displays member name for easier identification.
Refactor
- Consolidated data sourcing to the project/chapter/committee entities for URLs, audience, and leaders.
- Updated member model to support name/email alongside optional linked account; adjusted uniqueness to prevent duplicates.

Walkthrough

Repository-file parsing replaces scraper-based extraction: add RepositoryBasedEntityModel helpers (info_md_url, get_audience, get_urls, get_leaders_emails, sync_leaders), update scraping commands to call entity.get_* methods, add leader persistence fields/migrations, and extend tests for parsing and syncing.

Changes

Cohort / File(s)	Summary
Projects command `backend/apps/owasp/management/commands/owasp_scrape_projects.py`	Switch audience/leader/url sources to `project.get_audience()`, `project.get_leaders()`, `project.get_leaders_emails()`, `project.sync_leaders(...)`, and `project.get_urls(domain="github.com")`.
Chapters & Committees commands `backend/apps/owasp/management/commands/owasp_scrape_chapters.py`, `backend/apps/owasp/management/commands/owasp_scrape_committees.py`	Per-entity `leaders_raw = ...get_leaders()`, `leaders_emails = ...get_leaders_emails()` and conditional `...sync_leaders(...)`; source candidate URLs from `*.get_urls()` instead of scraper.
Repository-based model helpers `backend/apps/owasp/models/common.py`	Add `info_md_url` property and methods: `get_audience()`, `get_urls(domain=None)`, `get_leaders_emails()`, and `sync_leaders(...)` to parse repository .md files and sync EntityMember records.
EntityMember model & migration `backend/apps/owasp/models/entity_member.py` `backend/apps/owasp/migrations/0051_entitymember_member_email_entitymember_member_name_and_more.py`	Make `member` FK nullable (`blank=True, null=True`); add `member_email` and `member_name` fields; migration to add/update these fields and adjust uniqueness/indexes.
Admin: EntityMember `backend/apps/owasp/admin/entity_member.py`	Remove explicit `fields` layout; add `member_name` to `list_display` as the first column.
Migrations: unique/index adjustments `backend/apps/owasp/migrations/0052_remove_entitymember_owasp_entit_member__6e516f_idx_and_more.py`	Remove legacy index, adjust `unique_together` to include `(entity_type, entity_id, member_name, role)`, and alter `member_name` field.
Tests — command updates `backend/tests/apps/owasp/management/commands/owasp_scrape_projects_test.py` `backend/tests/apps/owasp/management/commands/owasp_scrape_chapters_test.py` `backend/tests/apps/owasp/management/commands/owasp_scrape_committees_test.py`	Updated mocks to return URLs/audience from Project/Chapter/Committee `get_urls()` / `get_audience()` instead of scraper; assertions/flows remain.
Tests — model parsing coverage `backend/tests/apps/owasp/models/common_test.py`	Add parameterized tests for `get_audience()`, `get_leaders_emails()`, `get_urls()` (domain filtering/edge cases) and tests for `sync_leaders` behaviors (no-op, update, mixed create/update).
Misc (spellings) `cspell/custom-dict.txt`	Add "GFKs" to custom dictionary.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

feature command to generate standardized project.owasp.yaml metadata files #1985 — touches project audience metadata and project scraping/metadata commands (overlaps get_audience/get_urls usage).
implemented matching model and Added slack models #913 — modifies leaders-related fields/APIs on RepositoryBasedEntityModel; overlaps leader extraction/sync helpers.
removed leader and suggested leader, introduced entity_member model #2063 — related leader handling and EntityMember sync/persistence changes.

Suggested reviewers

kasya

Pre-merge checks (4 passed, 1 warning)

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 77.78% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The title succinctly and accurately summarizes the primary change — migrating scraper logic to parse GitHub Markdown files — and aligns with the PR objectives and linked issue #2088; it is concise, specific, and free of noisy file lists or emojis. The "Feature/" prefix is a minor branch-convention artifact but does not obscure intent. Overall the title communicates the main change clearly to reviewers scanning history.
Linked Issues Check	✅ Passed	The PR implements repository Markdown parsing by adding info_md_url, get_audience, get_urls, get_leaders_emails, and sync_leaders on the repository-based model, updates management commands to use those methods (projects/chapters/committees), extracts leader emails and synchronizes EntityMember records, and adds tests and migrations to persist leader_name/member_email; these changes satisfy the linked issue objectives to move audience, leaders (including emails), and URLs extraction from the scraper to GitHub .md files. Updated tests exercise the new parsing paths and URL/audience behaviors. Database and admin changes support storing leader metadata as required by the feature.
Out of Scope Changes Check	✅ Passed	All substantive code changes are focused on migrating scraping responsibilities to GitHub Markdown parsing and the necessary support work (model fields, migrations, admin updates, and tests); no unrelated features or broad refactors were detected beyond this scope. The cspell dictionary update is a harmless documentation tooling change. Reviewers should still validate the schema/index changes and migration impact on existing data as those touch persistence beyond pure parsing logic.
Description Check	✅ Passed	The description directly references issue #2088 and states that audience, leaders, and URLs parsing will be migrated to GitHub .md files, matching the changeset intent; it also notes the contributor ran tests and followed CONTRIBUTING, so it is clearly related to the PR. The level of detail required for this lenient check is satisfied. There is no indication the description is off-topic relative to the changes.

✨ Finishing Touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

backend/apps/owasp/management/commands/owasp_scrape_projects.py (1)

59-66: Guard against None from get_related_url before normalize_url

project.get_related_url(url) can return None; passing that into normalize_url() risks unexpected errors depending on its implementation. Add a guard in the comprehension.

-                {
-                    repository_url
-                    for url in set(project.get_urls(domain="github.com"))
-                    if (repository_url := normalize_url(project.get_related_url(url)))
-                    and repository_url not in {project.github_url, project.owasp_url}
-                }
+                {
+                    repository_url
+                    for url in set(project.get_urls(domain="github.com"))
+                    if (related_url := project.get_related_url(url))
+                    and (repository_url := normalize_url(related_url))
+                    and repository_url not in {project.github_url, project.owasp_url}
+                }

🧹 Nitpick comments (5)

backend/apps/owasp/management/commands/owasp_scrape_projects.py (2)

80-84: Parse org/user owner with urlparse to avoid trailing slash and query issues

verified_url.split('/')[-1] breaks on trailing slashes and query strings. Use urlparse.

-                            gh_organization = gh.get_organization(verified_url.split("/")[-1])
+                            from urllib.parse import urlparse
+                            owner = urlparse(verified_url).path.strip("/").split("/")[0]
+                            if not owner:
+                                raise UnknownObjectException(status=404, data=None, headers=None)
+                            gh_organization = gh.get_organization(owner)

91-94: Log the skipped original URL, not None

When verified_url is falsy, logging it is unhelpful. Log scraped_url.

-                else:
-                    logger.info("Skipped related URL %s", verified_url)
+                else:
+                    logger.info("Skipped related URL %s", scraped_url)

backend/apps/owasp/models/common.py (1)

254-269: Harden URL extraction: stricter regex, dedupe, and resilient domain matching

Current regex can capture trailing punctuation.
Return stable unique ordering.
Accept subdomains like www.github.com via suffix match (case-insensitive).

-        urls = []
-        for line in content.split("\n"):
-            line_urls = re.findall(r"https?:\/\/[^\s\)]+", line.strip())
-            urls.extend(line_urls)
+        urls = []
+        for line in content.split("\n"):
+            line_urls = re.findall(r"https?://[^\s\)\]\}\>,\"']+", line.strip())
+            urls.extend(line_urls)
+        # de-duplicate preserving order
+        urls = list(dict.fromkeys(urls))
 
-        if domain:
-            return [url for url in urls if urlparse(url).netloc == domain]
+        if domain:
+            dom = domain.lstrip(".").lower()
+            return [u for u in urls if urlparse(u).netloc.lower().endswith(dom)]
 
         return urls

backend/tests/apps/owasp/management/commands/owasp_scrape_projects_test.py (2)

36-38: Stub get_leaders for determinism (prevents opaque Mock in leaders_raw)

The command now calls project.get_leaders(). Stubbing avoids persisting a bare Mock into leaders_raw if later assertions are added.
         mock_scraper.page_tree = True
+        mock_project.get_leaders.return_value = []
         mock_project.get_urls.return_value = []
         mock_project.get_audience.return_value = ["builder", "breaker", "defender"]
74-80: Add an org/user URL to cover the org expansion branch

Include https://github.com/org in inputs to exercise the GITHUB_USER_RE path that expands to repos. Expectations remain unchanged due to dedupe.
         mock_project.get_urls.return_value = [
             "https://github.com/org/repo1",
             "https://github.com/org/repo2",
             "https://invalid.com/repo3",
+            "https://github.com/org",
         ]

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6e29fc4 and a446674.

📒 Files selected for processing (3)

backend/apps/owasp/management/commands/owasp_scrape_projects.py (1 hunks)
backend/apps/owasp/models/common.py (4 hunks)
backend/tests/apps/owasp/management/commands/owasp_scrape_projects_test.py (2 hunks)

🧰 Additional context used

🧬 Code graph analysis (3)

backend/tests/apps/owasp/management/commands/owasp_scrape_projects_test.py (2)

backend/tests/apps/owasp/management/commands/owasp_create_project_metadata_file_test.py (1)

mock_project (14-40)

backend/apps/owasp/scraper.py (2)

get_urls (89-95)

get_audience (55-87)

backend/apps/owasp/models/common.py (3)

backend/apps/owasp/models/enums/project.py (1)

AudienceChoices (8-13)

backend/apps/owasp/scraper.py (2)

get_audience (55-87)

get_urls (89-95)

backend/apps/github/utils.py (1)

get_repository_file_content (60-79)

backend/apps/owasp/management/commands/owasp_scrape_projects.py (2)

backend/apps/owasp/models/common.py (5)

get_leaders (190-210)

owasp_url (140-142)

deactivate (144-147)

get_audience (174-188)

get_urls (254-268)

backend/apps/owasp/scraper.py (3)

OwaspScraper (22-127)

get_audience (55-87)

get_urls (89-95)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: Run frontend e2e tests
GitHub Check: Run backend tests
GitHub Check: Run frontend unit tests
GitHub Check: CodeQL (javascript-typescript)

🔇 Additional comments (2)

backend/apps/owasp/management/commands/owasp_scrape_projects.py (1)

56-56: LGTM: migrated audience to GitHub info.md source

Switching to project.get_audience() aligns with the new parsing strategy.

backend/apps/owasp/models/common.py (1)

105-114: LGTM: info_md_url helper mirrors existing pattern

Consistent with index_md_url/leaders_md_url and safely returns None when repo missing.

backend/apps/owasp/management/commands/owasp_scrape_projects.py

backend/apps/owasp/models/common.py

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

backend/tests/apps/owasp/models/common_test.py (1)

16-43: Audience parsing: add case-insensitive + exact-word tests; avoid substring false-positives (e.g., “Codebreaker”).

Today, parsing is case-sensitive and matches substrings; “Codebreaker” would incorrectly flag “Breaker”. Add tests to pin desired semantics, and update the parser to be case-insensitive with word boundaries.

Apply this diff to extend the tests:

@@
     @pytest.mark.parametrize(
         ("content", "expected_audience"),
         [
             (
                 """### Top Ten Card Game Information
 * [Incubator Project](#)
 * [Type of Project](#)
 * [Version 0.0.0](#)
 * [Builder](#)
 * [Breaker](#)""",
                 ["breaker", "builder"],
             ),
+            ("* builder\n* BREAKER", ["breaker", "builder"]),
+            ("* Codebreaker", []),
             ("This test contains no audience information.", []),
             ("", []),
             (None, []),
         ],
     )

And update the implementation in apps/owasp/models/common.py to use case-insensitive, whole-word matching:

@@
-        for line in content.split("\n"):
-            for lower_kw, original_kw in AudienceChoices.choices:
-                if original_kw in line:
-                    found_keywords.add(lower_kw)
+        for line in content.splitlines():
+            lower_line = line.lower()
+            for lower_kw, original_kw in AudienceChoices.choices:
+                if re.search(rf"\b{re.escape(original_kw.lower())}\b", lower_line):
+                    found_keywords.add(lower_kw)

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a446674 and 2b20ccf.

📒 Files selected for processing (2)

backend/apps/owasp/models/common.py (4 hunks)
backend/tests/apps/owasp/models/common_test.py (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

backend/apps/owasp/models/common.py

🧰 Additional context used

🧬 Code graph analysis (1)

backend/tests/apps/owasp/models/common_test.py (1)

backend/apps/owasp/models/common.py (2)

get_audience (174-187)

get_urls (253-267)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: Run frontend e2e tests
GitHub Check: Run backend tests
GitHub Check: Run frontend unit tests

backend/tests/apps/owasp/models/common_test.py

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

backend/apps/owasp/models/common.py (1)

174-188: Make audience detection case-insensitive; guard missing URL.
Case-sensitive matching will miss valid audience signals; also short-circuit if info_md_url is None to avoid noisy fetch errors.

 def get_audience(self):
     """Get audience from info.md file on GitHub."""
-    content = get_repository_file_content(self.info_md_url)
+    if not self.info_md_url:
+        return []
+    content = get_repository_file_content(self.info_md_url)
     if not content:
         return []

     found_keywords = set()

-    for line in content.split("\n"):
-        for lower_kw, original_kw in AudienceChoices.choices:
-            if original_kw in line:
-                found_keywords.add(lower_kw)
+    for line in content.split("\n"):
+        line_norm = line.casefold()
+        for lower_kw, original_kw in AudienceChoices.choices:
+            if original_kw.casefold() in line_norm:
+                found_keywords.add(lower_kw)

     return sorted(found_keywords)

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 2b20ccf and a697161.

📒 Files selected for processing (3)

backend/apps/owasp/management/commands/owasp_scrape_projects.py (1 hunks)
backend/apps/owasp/models/common.py (4 hunks)
backend/tests/apps/owasp/models/common_test.py (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

backend/tests/apps/owasp/models/common_test.py
backend/apps/owasp/management/commands/owasp_scrape_projects.py

🧰 Additional context used

🧬 Code graph analysis (1)

backend/apps/owasp/models/common.py (3)

backend/apps/owasp/models/enums/project.py (1)

AudienceChoices (8-13)

backend/apps/owasp/scraper.py (2)

get_audience (55-87)

get_urls (89-95)

backend/apps/github/utils.py (1)

get_repository_file_content (60-79)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: Run backend tests
GitHub Check: Run frontend e2e tests
GitHub Check: Run frontend unit tests
GitHub Check: CodeQL (javascript-typescript)

🔇 Additional comments (2)

backend/apps/owasp/models/common.py (2)

21-21: Import looks correct and scoped.
AudienceChoices import path matches the enums module location.

105-114: Info.md URL property: LGTM.
Consistent with index_md_url/leaders_md_url. No issues.

backend/apps/owasp/models/common.py

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (3)

backend/apps/owasp/models/entity_member.py (2)
57-58: Index and normalize member_email for lookups and dedupe.

Email will be a frequent join/dedupe key from MD parsing. Add an index and ensure consistent casing.
-    member_email = models.EmailField(blank=True, default="")
+    member_email = models.EmailField(blank=True, default="", db_index=True)
Optionally, normalize on save:
def save(self, *args, **kwargs):
    if self.member_email:
        self.member_email = self.member_email.strip().lower()
    super().save(*args, **kwargs)
69-73: Harden str fallback to avoid empty names.

If both member is None and member_name is empty, the display becomes blank. Fall back to email or “Unknown”.
-        display_name = self.member.login if self.member else self.member_name
-        return f"{display_name} as {self.get_role_display()} for {self.entity}"
+        display_name = (
+            (getattr(self.member, "login", "") if self.member else "")
+            or self.member_name
+            or self.member_email
+            or "Unknown"
+        )
+        return f"{display_name} as {self.get_role_display()} for {self.entity}"
backend/apps/owasp/migrations/0051_entitymember_member_email_entitymember_member_name_and_more.py (1)

13-35: Optional data backfill for nicer display.

Backfill member_name from existing relations to keep str meaningful for legacy rows lacking MD-provided names.

I can add a RunPython migration that sets member_name = member.login where member_name="" and member IS NOT NULL. Want me to draft it?

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 37362e7 and f289f92.

📒 Files selected for processing (5)

backend/apps/owasp/management/commands/owasp_scrape_projects.py (1 hunks)
backend/apps/owasp/migrations/0051_entitymember_member_email_entitymember_member_name_and_more.py (1 hunks)
backend/apps/owasp/models/common.py (5 hunks)
backend/apps/owasp/models/entity_member.py (2 hunks)
backend/tests/apps/owasp/models/common_test.py (3 hunks)

🚧 Files skipped from review as they are similar to previous changes (3)

backend/apps/owasp/models/common.py
backend/tests/apps/owasp/models/common_test.py
backend/apps/owasp/management/commands/owasp_scrape_projects.py

🧰 Additional context used

🧬 Code graph analysis (1)

backend/apps/owasp/models/entity_member.py (1)

backend/apps/mentorship/api/internal/nodes/mentor.py (1)

login (23-25)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: Run frontend e2e tests
GitHub Check: Run backend tests
GitHub Check: Run frontend unit tests

🔇 Additional comments (2)

backend/apps/owasp/models/entity_member.py (1)

69-72: .login is defined on User (inherited from GenericUserModel); no changes required.

backend/apps/owasp/migrations/0051_entitymember_member_email_entitymember_member_name_and_more.py (1)

1-12: No action needed—Django is already pinned to cover 5.2.6
backend/pyproject.toml declares django = "^5.1", which resolves to ≥5.1.0 and <6.0.0, so migrations generated by 5.2.6 will be consistent.

...nd/apps/owasp/migrations/0051_entitymember_member_email_entitymember_member_name_and_more.py

backend/apps/owasp/models/entity_member.py

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (10)

backend/tests/apps/owasp/management/commands/owasp_scrape_committees_test.py (2)

78-93: Add assertions to lock in new behavior (leaders + no scraper.get_urls).
Strengthen the test so regressions back to scraper.get_urls get caught and the leaders call is exercised.

Apply:

@@
         mock_active_committees.count.assert_called_once()
 
         assert mock_bulk_save.called
 
         assert mock_print.call_count == (committees - offset)
 
+        # New API usage assertions
+        mock_committee.get_leaders.assert_called_once()
+        mock_committee.get_urls.assert_called()
+        # Ensure old path isn't used anymore
+        mock_scraper.get_urls.assert_not_called()
+
         for call in mock_print.call_args_list:
             args, _ = call
             assert "https://owasp.org/www-committee-test" in args[0]

49-50: Remove redundant get_related_url side_effect assignment.
You set the same side_effect in the fixture already; this second assignment is unnecessary.

-        mock_committee.get_related_url.side_effect = lambda url, **_: url
+        # Redundant with fixture setup; can be removed.

Also applies to: 23-23

backend/apps/owasp/management/commands/owasp_scrape_chapters.py (3)

44-57: Avoid calling normalize_url(None); split steps for clarity and safety.
Guard against None from get_related_url before normalizing; keeps behavior explicit.

-            scraped_urls = sorted(
-                {
-                    repository_url
-                    for url in set(chapter.get_urls())
-                    if (
-                        repository_url := normalize_url(
-                            chapter.get_related_url(
-                                url, exclude_domains=("github.com", "owasp.org")
-                            )
-                        )
-                    )
-                    and repository_url not in {chapter.github_url, chapter.owasp_url}
-                }
-            )
+            scraped_urls = sorted(
+                {
+                    repo_url
+                    for url in set(chapter.get_urls())
+                    if (rel := chapter.get_related_url(url, exclude_domains=("github.com", "owasp.org")))
+                    and (repo_url := normalize_url(rel))
+                    and repo_url not in {chapter.github_url, chapter.owasp_url}
+                }
+            )

67-72: Avoid shadowing verified_url; use a distinct name for the normalized related URL.
Improves readability and log accuracy.

-                if verified_url := chapter.get_related_url(
-                    normalize_url(verified_url, check_path=True)
-                ):
-                    related_urls.add(verified_url)
+                if related_url := chapter.get_related_url(
+                    normalize_url(verified_url, check_path=True)
+                ):
+                    related_urls.add(related_url)
                 else:
-                    logger.info("Skipped related URL %s", verified_url)
+                    logger.info("Skipped related URL %s", related_url)

42-76: Deduplicate URL-pipeline logic shared with committees.
The “collect, verify, normalize, filter” flow is duplicated across commands. Consider extracting a shared helper (e.g., on the model or a utils module) to reduce drift.

backend/apps/owasp/management/commands/owasp_scrape_committees.py (3)

44-57: Guard normalize_url against None and simplify the filter.
Same safety/readability improvement as chapters.

-            scraped_urls = sorted(
-                {
-                    repository_url
-                    for url in set(committee.get_urls())
-                    if (
-                        repository_url := normalize_url(
-                            committee.get_related_url(
-                                url, exclude_domains=("github.com", "owasp.org")
-                            )
-                        )
-                    )
-                    and repository_url not in {committee.github_url, committee.owasp_url}
-                }
-            )
+            scraped_urls = sorted(
+                {
+                    repo_url
+                    for url in set(committee.get_urls())
+                    if (rel := committee.get_related_url(url, exclude_domains=("github.com", "owasp.org")))
+                    and (repo_url := normalize_url(rel))
+                    and repo_url not in {committee.github_url, committee.owasp_url}
+                }
+            )

67-72: Rename shadowed variable for clarity.
Avoid reusing verified_url for the post-processed related URL.

-                if verified_url := committee.get_related_url(
-                    normalize_url(verified_url, check_path=True)
-                ):
-                    related_urls.add(verified_url)
+                if related_url := committee.get_related_url(
+                    normalize_url(verified_url, check_path=True)
+                ):
+                    related_urls.add(related_url)
                 else:
-                    logger.info("Skipped related URL %s", verified_url)
+                    logger.info("Skipped related URL %s", related_url)

42-76: Extract shared URL-processing helper to avoid duplication with chapters.
Centralize the “discover → normalize → verify → filter” logic to keep behavior consistent across entities.

backend/tests/apps/owasp/management/commands/owasp_scrape_chapters_test.py (2)

78-93: Harden the test with explicit API-usage assertions.
Mirror the committees test improvements to catch regressions and exercise leaders.

@@
         mock_active_chapters.count.assert_called_once()
 
         assert mock_bulk_save.called
 
         assert mock_print.call_count == (chapters - offset)
 
+        # New API usage assertions
+        mock_chapter.get_leaders.assert_called_once()
+        mock_chapter.get_urls.assert_called()
+        # Ensure old path isn't used anymore
+        mock_scraper.get_urls.assert_not_called()
+
         for call in mock_print.call_args_list:
             args, _ = call
             assert "https://owasp.org/www-chapter-test" in args[0]

49-50: Drop duplicate get_related_url side_effect setup.
Already set in the fixture; the extra assignment in the test body can be removed.

-        mock_chapter.get_related_url.side_effect = lambda url, **_: url
+        # Redundant with fixture setup; can be removed.

Also applies to: 23-23

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f289f92 and 5c33591.

📒 Files selected for processing (5)

backend/apps/owasp/management/commands/owasp_scrape_chapters.py (1 hunks)
backend/apps/owasp/management/commands/owasp_scrape_committees.py (1 hunks)
backend/apps/owasp/management/commands/owasp_scrape_projects.py (1 hunks)
backend/tests/apps/owasp/management/commands/owasp_scrape_chapters_test.py (1 hunks)
backend/tests/apps/owasp/management/commands/owasp_scrape_committees_test.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

backend/apps/owasp/management/commands/owasp_scrape_projects.py

🧰 Additional context used

🧬 Code graph analysis (4)

backend/apps/owasp/management/commands/owasp_scrape_chapters.py (2)

backend/apps/owasp/models/common.py (2)

get_leaders (189-209)

get_urls (273-284)

backend/apps/owasp/scraper.py (1)

get_urls (89-95)

backend/tests/apps/owasp/management/commands/owasp_scrape_committees_test.py (1)

backend/tests/apps/owasp/management/commands/owasp_enrich_committees_test.py (1)

mock_committee (18-22)

backend/apps/owasp/management/commands/owasp_scrape_committees.py (2)

backend/apps/owasp/models/common.py (2)

get_leaders (189-209)

get_urls (273-284)

backend/apps/owasp/scraper.py (1)

get_urls (89-95)

backend/tests/apps/owasp/management/commands/owasp_scrape_chapters_test.py (1)

backend/tests/apps/owasp/management/commands/owasp_enrich_chapters_test.py (1)

mock_chapter (18-25)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: Run frontend unit tests
GitHub Check: Run frontend e2e tests
GitHub Check: Run backend tests
GitHub Check: CodeQL (javascript-typescript)

🔇 Additional comments (6)

backend/tests/apps/owasp/management/commands/owasp_scrape_committees_test.py (1)

41-46: Switched URL source to committee.get_urls — matches migration intent.
Good change; the test now exercises the new model-driven path instead of the scraper.

backend/apps/owasp/management/commands/owasp_scrape_chapters.py (2)

47-47: Use chapter.get_urls — aligned with .md-based parsing.
This aligns the command with the new RepositoryBasedEntityModel APIs.

42-42: Leaders now sourced from leaders.md — confirm whether emails should be persisted too.
If leader emails are required (per PR goal), consider also calling a corresponding API (e.g., get_leaders_emails) and persisting them.

backend/apps/owasp/management/commands/owasp_scrape_committees.py (2)

47-47: Use committee.get_urls — aligned with .md-based parsing.
This matches the PR direction to move away from DOM scraping for URL discovery.

42-42: Leaders now sourced from leaders.md — confirm whether emails should be persisted too.
If leader emails are part of the requirements, add retrieval/persistence (e.g., get_leaders_emails) here as well.

backend/tests/apps/owasp/management/commands/owasp_scrape_chapters_test.py (1)

41-46: Switched URL source to chapter.get_urls — aligns with new model API.
Good update; the test now validates the .md-based URL path.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

backend/apps/owasp/management/commands/owasp_scrape_projects.py (1)

55-59: Avoid double-fetching leaders.md and keep policy flag in sync.

Reuse leaders_emails to derive leaders_raw (one network call) and set is_leaders_policy_compliant alongside. Mirrors prior feedback.

-            project.audience = project.get_audience()
-            project.leaders_raw = project.get_leaders()
-            leaders_emails = project.get_leaders_emails()
-            if leaders_emails:
-                project.sync_leaders(leaders_emails)
+            project.audience = project.get_audience()
+            leaders_emails = project.get_leaders_emails()
+            # Derive leaders list from emails to avoid a second fetch; fallback if empty.
+            project.leaders_raw = list(leaders_emails) if leaders_emails else project.get_leaders()
+            project.is_leaders_policy_compliant = len(project.leaders_raw) > 1
+            if leaders_emails:
+                project.sync_leaders(leaders_emails)

🧹 Nitpick comments (5)

backend/apps/owasp/management/commands/owasp_scrape_projects.py (2)

51-51: Remove stray debug print.

Leftover print("what") will pollute command output.
-                print("what")
+                logger.debug("OWASP page not found for %s; deactivating", project.owasp_url)
65-65: Drop redundant set() call in comprehension.

Outer set already dedupes; inner set(...) adds work without benefit.
-                    for url in set(project.get_urls(domain="github.com"))
+                    for url in project.get_urls(domain="github.com")

backend/apps/owasp/models/common.py (3)

107-115: Guard against missing default_branch.

Prevent generating an invalid URL when default_branch is None.

-        return (
+        return (
             "https://raw.githubusercontent.com/OWASP/"
-            f"{self.owasp_repository.key}/{self.owasp_repository.default_branch}/info.md"
-            if self.owasp_repository
+            f"{self.owasp_repository.key}/{self.owasp_repository.default_branch}/info.md"
+            if self.owasp_repository and self.owasp_repository.default_branch
             else None
         )

213-231: Harden leaders email extraction (support “+” bullets and MAILTO variants).

Current regex misses + bullets and upper/mixed-case mailto. Minor expansion improves robustness without changing semantics.

-            matches = re.findall(
-                r"[-*]\s*(?:\[([^\]]+)\]\(mailto:([^)]+)\)|([^[\n]+))", line.strip()
-            )
+            matches = re.findall(
+                r"[-*+]\s*(?:\[([^\]]+)\]\(\s*(?i:mailto:)\s*([^)]+)\)|([^[\n]+))",
+                line.strip(),
+            )

275-287: Strip trailing punctuation and dedupe URLs while preserving order.

Avoids false invalids like ...repo). and repeated links.

-        urls = re.findall(r"https?:\/\/[^\s\)]+", content.strip())
+        raw_urls = re.findall(r"https?://[^\s\)]+", content)
+        seen, urls = set(), []
+        for u in raw_urls:
+            u = u.rstrip(").,;:!?")
+            if u not in seen:
+                seen.add(u)
+                urls.append(u)

(Domain filtering via urlparse(url).netloc == domain is correct; keep as-is.)

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5c33591 and 9acb861.

📒 Files selected for processing (4)

backend/apps/owasp/management/commands/owasp_scrape_chapters.py (1 hunks)
backend/apps/owasp/management/commands/owasp_scrape_committees.py (1 hunks)
backend/apps/owasp/management/commands/owasp_scrape_projects.py (1 hunks)
backend/apps/owasp/models/common.py (7 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

backend/apps/owasp/management/commands/owasp_scrape_committees.py
backend/apps/owasp/management/commands/owasp_scrape_chapters.py

🧰 Additional context used

🧠 Learnings (3)

📓 Common learnings

Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:48:44.892Z
Learning: When migrating scraper logic to GitHub .md file parsing in the OWASP project, the new implementation should maintain the same case-sensitivity behavior as the original scraper logic, even if case-insensitive matching might seem more robust.

📚 Learning: 2025-09-06T12:57:57.997Z

Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:57:57.997Z
Learning: When filtering URLs by domain in the OWASP project, use `urlparse(url).netloc == domain` instead of `domain in url` to avoid security vulnerabilities where malicious subdomains could bypass filtering (e.g., `testowasp.org` would match when filtering for `owasp.org`).

Applied to files:

backend/apps/owasp/models/common.py

📚 Learning: 2025-09-06T12:48:44.892Z

Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:48:44.892Z
Learning: When migrating scraper logic to GitHub .md file parsing in the OWASP project, the new implementation should maintain the same case-sensitivity behavior as the original scraper logic, even if case-insensitive matching might seem more robust.

Applied to files:

backend/apps/owasp/models/common.py

🧬 Code graph analysis (2)

backend/apps/owasp/models/common.py (4)

backend/apps/common/models.py (2)

BulkSaveModel (10-34)

bulk_save (19-34)

backend/apps/owasp/models/enums/project.py (1)

AudienceChoices (8-13)

backend/apps/github/utils.py (1)

get_repository_file_content (60-79)

backend/apps/owasp/models/entity_member.py (2)

EntityMember (10-72)

Role (13-16)

backend/apps/owasp/management/commands/owasp_scrape_projects.py (2)

backend/apps/owasp/models/common.py (6)

deactivate (146-149)

get_audience (176-189)

get_leaders (191-211)

get_leaders_emails (213-231)

sync_leaders (299-338)

get_urls (275-286)

backend/apps/owasp/models/project.py (1)

deactivate (278-281)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: Run frontend unit tests
GitHub Check: Run frontend e2e tests
GitHub Check: Run backend tests
GitHub Check: CodeQL (javascript-typescript)

🔇 Additional comments (2)

backend/apps/owasp/models/common.py (2)

11-11: Imports look good.

All newly added imports are used.

Also applies to: 14-14, 23-23

176-190: Audience parsing preserves original case-sensitive behavior.

Matches the legacy scraper semantics as requested. Consider adding tests that assert case-sensitive detection for “Breaker/Builder/Defender”.

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (1)

backend/apps/owasp/models/common.py (1)
299-338: Sync leaders must update order/reactivation and deactivate stale entries; also persist all changed fields.
As previously noted, existing leaders’ order/is_active aren’t updated, removed leaders remain active, and bulk_save only updates member_email.

Minimal persistence fix (doesn’t cover the full logic change already suggested earlier):
-        if members_to_save:
-            BulkSaveModel.bulk_save(EntityMember, members_to_save, ["member_email"])
+        if members_to_save:
+            BulkSaveModel.bulk_save(
+                EntityMember,
+                members_to_save,
+                ["member_email", "order", "is_active"],
+            )
Please also adopt the earlier upsert/reorder/reactivate/deactivate logic so state is consistent.

🧹 Nitpick comments (1)

backend/apps/owasp/models/common.py (1)

275-287: Harden get_urls: trim trailing punctuation, dedupe, and normalize netloc case.
Prevents false negatives on domain filter (e.g., trailing “.”) and avoids duplicates while keeping strict netloc equality per security learning.

-        urls = re.findall(r"https?:\/\/[^\s\)]+", content.strip())
+        raw_urls = re.findall(r"https?://[^\s\)]+", content)
+        # Strip common trailing punctuation from Markdown/text usage and dedupe (stable).
+        urls = [u.rstrip(").,;:!?]") for u in raw_urls]
+        urls = list(dict.fromkeys(urls))
 
         if domain:
-            return [url for url in urls if urlparse(url).netloc == domain]
+            dn = domain.lower()
+            return [u for u in urls if urlparse(u).netloc.lower() == dn]
 
         return urls

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9acb861 and 674fb18.

📒 Files selected for processing (2)

backend/apps/owasp/models/common.py (7 hunks)
backend/tests/apps/owasp/models/common_test.py (3 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

backend/tests/apps/owasp/models/common_test.py

🧰 Additional context used

🧠 Learnings (3)

📓 Common learnings

Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:48:44.892Z
Learning: When migrating scraper logic to GitHub .md file parsing in the OWASP project, the new implementation should maintain the same case-sensitivity behavior as the original scraper logic, even if case-insensitive matching might seem more robust.

📚 Learning: 2025-09-06T12:57:57.997Z

Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:57:57.997Z
Learning: When filtering URLs by domain in the OWASP project, use `urlparse(url).netloc == domain` instead of `domain in url` to avoid security vulnerabilities where malicious subdomains could bypass filtering (e.g., `testowasp.org` would match when filtering for `owasp.org`).

Applied to files:

backend/apps/owasp/models/common.py

📚 Learning: 2025-09-06T12:48:44.892Z

Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:48:44.892Z
Learning: When migrating scraper logic to GitHub .md file parsing in the OWASP project, the new implementation should maintain the same case-sensitivity behavior as the original scraper logic, even if case-insensitive matching might seem more robust.

Applied to files:

backend/apps/owasp/models/common.py

🧬 Code graph analysis (1)

backend/apps/owasp/models/common.py (4)

backend/apps/common/models.py (2)

BulkSaveModel (10-34)

bulk_save (19-34)

backend/apps/owasp/models/enums/project.py (1)

AudienceChoices (8-13)

backend/apps/github/utils.py (1)

get_repository_file_content (60-79)

backend/apps/owasp/models/entity_member.py (2)

EntityMember (10-72)

Role (13-16)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: Run frontend e2e tests
GitHub Check: Run backend tests
GitHub Check: Run frontend unit tests

🔇 Additional comments (3)

backend/apps/owasp/models/common.py (3)

11-11: LGTM: supporting imports added appropriately.
ContentType, BulkSaveModel, and AudienceChoices imports align with the new behaviors.

Also applies to: 14-14, 23-23

107-116: LGTM: info.md URL property.
Straightforward and consistent with index_md_url.

176-190: LGTM: audience extraction preserves original case-sensitive behavior.
Matches the original scraper semantics (per retrieved learnings) and returns normalized values.

backend/apps/owasp/models/common.py

rudransh-shrivastava · 2025-09-06T18:26:40Z

backend/apps/owasp/models/common.py

+        urls = re.findall(r"https?:\/\/[^\s\)]+", content.strip())
+
+        if domain:
+            return [url for url in urls if urlparse(url).netloc == domain]


should we also match subdomains?

rudransh-shrivastava · 2025-09-06T18:57:00Z

backend/apps/owasp/models/common.py

+        leaders = {}
+        for line in content.split("\n"):
+            matches = re.findall(
+                r"^[-*]\s*(?:\[([^\]]+)\]\(mailto:([^)]+)(\)|([^[<\n])))", line.strip()


I have tried to fix this regex for hours, I can't figure it out.
^[-*]\s*(?:\[([^\]]+)\]\(mailto:([^)]+)(\)|([^[<\n]))) dropped leaders without emails
^[-*]\s*(?:\[([^\]]+)\]\(mailto:([^)]+)(\)|([^[<\n]))) was also promising but it too fails tests.

I added many tests for this. Please check them. I think if we can have a stricter rule for matching leaders names and emails I can figure this out.

example:
does * [John Doe] match name?
does * John Doe match name?

I also noticed that it was parsing <a> tags as names in earlier commits.

@arkid15r

This is going to be a one-time migration so it's not required to be ideal. Some required data can be added manually later on a higher level process.

I appreciate attention to details.

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

backend/tests/apps/owasp/models/common_test.py (1)

191-226: Expand domain-filter cases to lock in security semantics (hostname equality-or-subdomain, case-insensitive, port-safe).
Current tests won’t catch regressions (e.g., lookalike domains, ports, case). Please extend as below; confirm whether subdomains should be accepted (test includes both). This also aligns with prior review feedback.

@@
     @pytest.mark.parametrize(
-        ("content", "domain", "expected_urls"),
+        ("content", "domain", "expected_urls"),
         [
@@
             ("This test contains no URLs.", None, []),
             ("", None, []),
             (None, None, []),
+            ("* [Fake](https://maliciousowasp.org)", "owasp.org", []),
+            ("* [UK](https://owasp.org.uk)", "owasp.org", []),
+            ("* [Home](https://OWASP.org)", "OWASP.ORG", ["https://OWASP.org"]),
+            ("* [With port](https://owasp.org:443/path)", "owasp.org", ["https://owasp.org:443/path"]),
+            ("* [Subdomain](https://test.owasp.org/docs)", "owasp.org", ["https://test.owasp.org/docs"]),
         ],
     )

If subdomains should NOT be accepted, change the last case’s expected_urls to [] and I’ll adjust the production code suggestion accordingly.

Production code change (outside this file) to make tests pass and harden filtering:

--- a/backend/apps/owasp/models/common.py
+++ b/backend/apps/owasp/models/common.py
@@ def get_urls(self, domain=None):
-        if domain:
-            return [url for url in urls if urlparse(url).netloc == domain]
+        if domain:
+            domain = domain.lstrip(".").lower()
+            def host_matches(u: str) -> bool:
+                host = urlparse(u).hostname
+                if not host:
+                    return False
+                host = host.lower()
+                # Accept apex or proper subdomains; case-insensitive; ports already stripped by .hostname
+                return host == domain or host.endswith("." + domain)
+            return [u for u in urls if host_matches(u)]

🧹 Nitpick comments (2)

backend/tests/apps/owasp/models/common_test.py (2)

22-49: Audience tests cover happy-path and empties.
Consider adding a duplicate keyword case to assert de-duplication remains stable, but optional.

321-349: Strengthen assertions for sync_leaders mixed path.
Verify update_fields and constructor args for the new leader.

@@ def test_sync_leaders_mixed_scenario(
-        call_args = mock_bulk_save.bulk_save.call_args
-        leaders_to_save = call_args[0][1]
+        call_args = mock_bulk_save.bulk_save.call_args
+        leaders_to_save = call_args[0][1]
         assert len(leaders_to_save) == 2  # Updated existing + new leader
         assert leaders_to_save[0] == existing_leader
+        # Ensure only member_email updates are persisted
+        assert call_args[0][2] == ["member_email"]
+        # New leader constructed with expected fields and order
+        mock_entity_member.assert_called_once()
+        _, kwargs = mock_entity_member.call_args
+        assert kwargs["member_name"] == "Jane Smith"
+        assert kwargs["member_email"] == "[email protected]"
+        assert kwargs["order"] == 1
+        assert kwargs["role"] == mock_entity_member.Role.LEADER

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 674fb18 and 68e037f.

📒 Files selected for processing (2)

backend/apps/owasp/models/common.py (7 hunks)
backend/tests/apps/owasp/models/common_test.py (5 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

backend/apps/owasp/models/common.py

🧰 Additional context used

🧠 Learnings (2)

📓 Common learnings

Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:48:44.892Z
Learning: When migrating scraper logic to GitHub .md file parsing in the OWASP project, the new implementation should maintain the same case-sensitivity behavior as the original scraper logic, even if case-insensitive matching might seem more robust.

📚 Learning: 2025-09-06T12:57:57.997Z

Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:57:57.997Z
Learning: When filtering URLs by domain in the OWASP project, use `urlparse(url).netloc == domain` instead of `domain in url` to avoid security vulnerabilities where malicious subdomains could bypass filtering (e.g., `testowasp.org` would match when filtering for `owasp.org`).

Applied to files:

backend/tests/apps/owasp/models/common_test.py

🧬 Code graph analysis (1)

backend/tests/apps/owasp/models/common_test.py (2)

backend/apps/owasp/models/common.py (4)

get_audience (176-189)

get_leaders_emails (213-231)

get_urls (275-286)

sync_leaders (299-338)

backend/tests/apps/ai/common/base/chunk_command_test.py (1)

mock_content_type (64-68)

🔇 Additional comments (5)

backend/tests/apps/owasp/models/common_test.py (5)

1-1: Imports look good.
No issues with MagicMock/Mock/patch usage.

16-21: Solid test setup.
Fixture keeps tests isolated; id preset helps with FK lookups.

85-146: Leaders-email parsing: good edge coverage.
Nice handling of spacing, non-mailto bullets, and HTML anchor exclusion.

286-299: No-op path correctly avoids unnecessary writes.
Mocks are wired correctly; assertion precise.

300-320: Unchanged leader: good negative assertion.
Covers idempotency.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

backend/apps/owasp/models/common.py (1)
299-338: Prevent duplicate name-identified leaders when member is NULL (DB-level guard).

The current unique_together uses member FK; duplicates can accumulate for (entity, role, member_name) when member is NULL. Add a partial unique constraint to EntityMember.Meta.

Proposed constraint (in backend/apps/owasp/models/entity_member.py):
from django.db import models

class EntityMember(models.Model):
    class Meta:
        constraints = [
            models.UniqueConstraint(
                fields=["entity_type", "entity_id", "role", "member_name"],
                condition=models.Q(member__isnull=True),
                name="uniq_entity_role_member_name_when_member_null",
            ),
        ]
Follow with a migration and de-dupe/backfill before applying.

🧹 Nitpick comments (2)

backend/apps/owasp/models/common.py (2)

176-190: Audience parsing mirrors original scraper semantics (case-sensitive).

Implementation aligns with the stated requirement to keep case-sensitive behavior. Minor nit: use splitlines() for robustness across platforms.

-        for line in content.split("\n"):
+        for line in content.splitlines():

275-287: URL extraction: trim trailing punctuation, dedupe, and (optionally) support subdomains safely.

Current regex may include trailing punctuation (e.g., final “.”) and returns duplicates.
On the “should we also match subdomains?” question: keep exact netloc by default for safety; optionally add include_subdomains flag using a dot-boundary check (no substring matching).

Apply:

-    def get_urls(self, domain=None):
+    def get_urls(self, domain=None, include_subdomains: bool = False):
         """Get URLs from info.md file on GitHub."""
-        content = get_repository_file_content(self.info_md_url)
+        # Short-circuit to avoid noisy logs/requests when URL is missing.
+        if not self.info_md_url:
+            return []
+        content = get_repository_file_content(self.info_md_url)
         if not content:
             return []
 
-        urls = re.findall(r"https?:\/\/[^\s\)]+", content.strip())
+        raw_urls = re.findall(r"https?://[^\s\)]+", content)
+        # Strip common trailing punctuation and dedupe while preserving order.
+        urls = []
+        seen = set()
+        for u in (ru.rstrip(").,;:!?") for ru in raw_urls):
+            if u not in seen:
+                seen.add(u)
+                urls.append(u)
 
         if domain:
-            return [url for url in urls if urlparse(url).netloc == domain]
+            def matches(netloc: str) -> bool:
+                return netloc == domain or (include_subdomains and netloc.endswith("." + domain))
+            return [url for url in urls if matches(urlparse(url).netloc)]
 
         return urls

Please confirm if any call sites rely on the old signature; if so, we can keep signature stable and always do exact match, adding subdomain support later behind a new param.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 68e037f and 024724a.

📒 Files selected for processing (2)

backend/apps/owasp/models/common.py (7 hunks)
backend/tests/apps/owasp/models/common_test.py (5 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

backend/tests/apps/owasp/models/common_test.py

🧰 Additional context used

🧠 Learnings (6)

📓 Common learnings

Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:48:44.892Z
Learning: When migrating scraper logic to GitHub .md file parsing in the OWASP project, the new implementation should maintain the same case-sensitivity behavior as the original scraper logic, even if case-insensitive matching might seem more robust.

Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:213-232
Timestamp: 2025-09-06T19:03:01.961Z
Learning: In the OWASP Nest project, the get_leaders_emails() method in RepositoryBasedEntityModel is designed to only capture leaders with mailto: links from leaders.md files, intentionally ignoring plain text names without email addresses. The current regex implementation works correctly for the intended behavior as validated by comprehensive test cases.

📚 Learning: 2025-09-06T12:57:57.997Z

Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:57:57.997Z
Learning: When filtering URLs by domain in the OWASP project, use `urlparse(url).netloc == domain` instead of `domain in url` to avoid security vulnerabilities where malicious subdomains could bypass filtering (e.g., `testowasp.org` would match when filtering for `owasp.org`).

Applied to files:

backend/apps/owasp/models/common.py

📚 Learning: 2025-09-06T12:48:44.892Z

Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:48:44.892Z
Learning: When migrating scraper logic to GitHub .md file parsing in the OWASP project, the new implementation should maintain the same case-sensitivity behavior as the original scraper logic, even if case-insensitive matching might seem more robust.

Applied to files:

backend/apps/owasp/models/common.py

📚 Learning: 2025-09-06T19:03:01.961Z

Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:213-232
Timestamp: 2025-09-06T19:03:01.961Z
Learning: In the OWASP Nest project, the get_leaders_emails() method in RepositoryBasedEntityModel is designed to only capture leaders with mailto: links from leaders.md files, intentionally ignoring plain text names without email addresses. The current regex implementation works correctly for the intended behavior as validated by comprehensive test cases.

Applied to files:

backend/apps/owasp/models/common.py

📚 Learning: 2025-07-28T14:51:14.736Z

Learnt from: adithya-naik
PR: OWASP/Nest#1894
File: frontend/src/components/TopContributorsList.tsx:74-74
Timestamp: 2025-07-28T14:51:14.736Z
Learning: In the OWASP/Nest project, the maintainer adithya-naik prefers not to create separate components for code that's only used in two specific cases, following the YAGNI principle to avoid over-engineering when the duplication is limited and manageable.

Applied to files:

backend/apps/owasp/models/common.py

📚 Learning: 2025-08-16T13:23:34.874Z

Learnt from: ahmedxgouda
PR: OWASP/Nest#2021
File: backend/apps/slack/models/google_auth.py:25-36
Timestamp: 2025-08-16T13:23:34.874Z
Learning: In the OWASP Nest project, encryption and decryption for Google OAuth tokens stored in BinaryField are being handled in a separate PR, not in the initial Google Auth setup implementation.

Applied to files:

backend/apps/owasp/models/common.py

🧬 Code graph analysis (1)

backend/apps/owasp/models/common.py (4)

backend/apps/common/models.py (2)

BulkSaveModel (10-34)

bulk_save (19-34)

backend/apps/owasp/models/enums/project.py (1)

AudienceChoices (8-13)

backend/apps/github/utils.py (1)

get_repository_file_content (60-79)

backend/apps/owasp/models/entity_member.py (2)

EntityMember (10-72)

Role (13-16)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: Run frontend e2e tests
GitHub Check: Run backend tests
GitHub Check: Run frontend unit tests
GitHub Check: CodeQL (javascript-typescript)

🔇 Additional comments (2)

backend/apps/owasp/models/common.py (2)

11-24: Imports look correct and scoped.

No concerns; these are needed for the new functionality.

107-115: info_md_url: LGTM.

Property guards None when repo is absent; consistent with existing index_md_url.

backend/apps/owasp/models/common.py

rudransh-shrivastava · 2025-09-06T19:25:41Z

Hi, I'm not sure if I'm saving the entity_members in the best way. I only insert new entity_members and update the existing ones.

I am also not sure what to do in the unique_together constraint for EntityMember coderabbit suggests, since member is now optional, this constraint is invalid. Can we use member_name instead?
Need your guidance in these parts. Thank you!

@arkid15r

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

backend/apps/owasp/management/commands/owasp_scrape_projects.py (1)
54-58: Leaders parsing wired in; also keep policy flag in sync and consider handling removals

Nice switch to the repository-backed API. Two follow-ups:

If the project model still has is_leaders_policy_compliant, set it here (e.g., require >=2 leaders with emails) so it stays consistent with leaders_raw/leaders_emails. Based on the retrieved learnings, get_leaders_emails intentionally only captures mailto links and sync_leaders matches by member_name, so basing compliance on leaders_emails is appropriate.

sync_leaders currently upserts email on existing names but doesn’t deactivate leaders removed from the markdown or update order for renamed/reordered entries. Consider extending sync_leaders to deactivate missing names and to update order when it changes.

Example inline tweak for the flag (only if the field exists on Project):
             project.leaders_raw = project.get_leaders()
-            leaders_emails = project.get_leaders_emails()
+            leaders_emails = project.get_leaders_emails()
             if leaders_emails:
                 project.sync_leaders(leaders_emails)
+            # keep policy flag consistent with parsed leaders (if present on model)
+            if hasattr(project, "is_leaders_policy_compliant"):
+                project.is_leaders_policy_compliant = sum(1 for e in leaders_emails.values() if e) >= 2
If you’d like, I can draft a minimal sync_leaders enhancement that deactivates stale leaders and updates ordering.

🧹 Nitpick comments (1)

backend/apps/owasp/management/commands/owasp_scrape_projects.py (1)
61-68: Switch to project.get_urls looks good; tighten GitHub handling and fix a small log nit

Domain filter: ensure get_urls treats www.github.com the same as github.com (normalize/lstrip www. in the model method). This avoids silently dropping valid links.

User/org pages: when GITHUB_USER_RE matches, try get_organization first and, on UnknownObjectException, fall back to gh.get_user(login) to include user repos. Today, user pages will be skipped silently after the exception.

Log message nit: when get_related_url returns falsy, logging verified_url prints None; log scraped_url instead.

Patch for the log:
-                else:
-                    logger.info("Skipped related URL %s", verified_url)
+                else:
+                    logger.info("Skipped related URL %s", scraped_url)
I can also provide a small helper to resolve GitHub principals (org or user) with caching to cut API calls across projects.

Also applies to: 94-96

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 024724a and e2cc92b.

📒 Files selected for processing (1)

backend/apps/owasp/management/commands/owasp_scrape_projects.py (1 hunks)

🧰 Additional context used

🧠 Learnings (2)

📓 Common learnings

Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/entity_member.py:50-56
Timestamp: 2025-09-06T19:28:14.273Z
Learning: In the OWASP/Nest project, when migrating scraper logic to GitHub .md file parsing, the sync_leaders method uses member_name as the primary identifier for finding and updating existing EntityMember records, not member_email. This approach is chosen because names are more stable identifiers in markdown files, while emails might be added/updated over time.

Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:213-232
Timestamp: 2025-09-06T19:03:01.961Z
Learning: In the OWASP Nest project, the get_leaders_emails() method in RepositoryBasedEntityModel is designed to only capture leaders with mailto: links from leaders.md files, intentionally ignoring plain text names without email addresses. The current regex implementation works correctly for the intended behavior as validated by comprehensive test cases.

📚 Learning: 2025-09-06T19:28:14.273Z

Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/entity_member.py:50-56
Timestamp: 2025-09-06T19:28:14.273Z
Learning: In the OWASP/Nest project, when migrating scraper logic to GitHub .md file parsing, the sync_leaders method uses member_name as the primary identifier for finding and updating existing EntityMember records, not member_email. This approach is chosen because names are more stable identifiers in markdown files, while emails might be added/updated over time.

Applied to files:

backend/apps/owasp/management/commands/owasp_scrape_projects.py

🧬 Code graph analysis (1)

backend/apps/owasp/management/commands/owasp_scrape_projects.py (1)

backend/apps/owasp/models/common.py (5)

get_audience (176-189)

get_leaders (191-211)

get_leaders_emails (213-231)

sync_leaders (299-338)

get_urls (275-286)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: Run backend tests
GitHub Check: Run frontend e2e tests
GitHub Check: Run frontend unit tests

arkid15r

@rudransh-shrivastava thanks for working on this

It's a good start towards GitHub as an entire data source for OWASP entity sync. I update code based on our model method conventions (please look into those changes).

I'm going to merge this and expect full scraper migration into GitHub entity sync code in a following PR.

sonarqubecloud · 2025-09-11T03:04:29Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

backend/tests/apps/owasp/models/common_test.py (1)

191-215: Harden URL domain tests: add subdomain/case/port and lookalike-host scenarios.

Covers regressions and security edge cases; mirrors prior review thread.

     @pytest.mark.parametrize(
         ("content", "domain", "expected_urls"),
         [
@@
             ("This test contains no URLs.", None, []),
             ("", None, []),
             (None, None, []),
+            ("* [Fake](https://maliciousowasp.org)", "owasp.org", []),
+            ("* [UK](https://owasp.org.uk)", "owasp.org", []),
+            ("* [Home](https://OWASP.org)", "OWASP.ORG", ["https://OWASP.org"]),
+            ("* [With port](https://owasp.org:443/path)", "owasp.org", ["https://owasp.org:443/path"]),
+            ("* [Sub](https://test.owasp.org/page)", "owasp.org", ["https://test.owasp.org/page"]),
         ],
     )

🧹 Nitpick comments (14)

backend/apps/owasp/migrations/0052_remove_entitymember_owasp_entit_member__6e516f_idx_and_more.py (2)

26-29: Consider case-insensitive uniqueness for member_name.

Names differ by case (“John Doe” vs “john doe”) will be treated as distinct. Prefer a UniqueConstraint on Lower(member_name) over unique_together for better dedup semantics.

Proposed follow-up migration:
from django.db import migrations, models
from django.db.models.functions import Lower

class Migration(migrations.Migration):
    dependencies = [("owasp", "0052_remove_entitymember_owasp_entit_member__6e516f_idx_and_more")]
    operations = [
        migrations.AlterUniqueTogether(name="entitymember", unique_together=set()),
        migrations.AddConstraint(
            model_name="entitymember",
            constraint=models.UniqueConstraint(
                expressions=["entity_type","entity_id",Lower("member_name"),"role"],
                name="uniq_entity_member_name_role_ci",
            ),
        ),
    ]
13-16: Validate dropping index owasp_entit_member__6e516f_idx won’t regress common queries.

If that index covered “member” lookups, admin searches or syncs might slow down. Recreate a targeted index if needed (e.g., on ["entity_type","entity_id","role"] or ["member"]).

backend/apps/owasp/admin/entity_member.py (4)

36-41: Add member_name to search_fields.

Enables searching by the stored name when no linked GitHub user exists.
     search_fields = (
         "member__login",
         "member__name",
+        "member_name",
         "description",
     )
35-35: Drop raw_id_fields when using autocomplete_fields for the same relation.

They’re alternatives; keeping both is redundant and can confuse admins.
-    raw_id_fields = ("member",)
+    # raw_id_fields not needed; using autocomplete_fields above.
41-41: Order by member_name as a fallback.

When member is NULL, add member_name to ordering to keep lists stable.
-    ordering = ("member__name", "order")
+    ordering = ("member__name", "member_name", "order")
15-29: Optional: reduce N+1 queries in list view.

Select related member and content type.
 class EntityMemberAdmin(admin.ModelAdmin):
@@
     ordering = ("member__name", "member_name", "order")
 
+    def get_queryset(self, request):
+        qs = super().get_queryset(request)
+        return qs.select_related("member", "entity_type")

backend/tests/apps/owasp/models/common_test.py (2)

303-326: Strengthen sync_leaders assertions; verify update vs. insert.

Patch EntityMember.update_data and assert exact payloads and save=False calls; remove unused mocks.

-    @patch("apps.owasp.models.common.ContentType")
-    @patch("apps.owasp.models.common.EntityMember")
-    @patch("apps.owasp.models.common.BulkSaveModel")
+    @patch("apps.owasp.models.common.ContentType")
+    @patch("apps.owasp.models.common.BulkSaveModel")
+    @patch("apps.owasp.models.common.EntityMember.update_data")
     def test_sync_leaders_mixed_scenario(
-        self, mock_bulk_save, mock_entity_member, mock_content_type
+        self, mock_update_data, mock_bulk_save, mock_content_type
     ):
@@
-        mock_entity_member.objects.filter.return_value = [existing_leader]
-
         leaders_emails = {
             "John Doe": "[email protected]",  # Update existing
             "Jane Smith": "[email protected]",  # New leader
         }
 
         self.model.sync_leaders(leaders_emails)
 
-        call_args = mock_bulk_save.bulk_save.call_args
-        leaders_to_save = call_args[0][1]
-
-        assert len(leaders_to_save) == 2  # Updated existing + new leader
+        # Two calls: update + insert
+        assert mock_update_data.call_count == 2
+        # Bulk save with the 2 returned instances
+        assert mock_bulk_save.bulk_save.call_count == 1

286-299: Remove unused filter mock in empty sync test.

It’s not used by the implementation.

-        mock_entity_member.objects.filter.return_value = []
-
         self.model.sync_leaders({})

backend/apps/owasp/models/entity_member.py (6)

20-25: Unique by member_name matches sync strategy; mind homonyms and case.

This will merge same-name leaders under the same entity/role (e.g., two “Alex Kim”). If that’s acceptable per product requirements, proceed; otherwise consider augmenting with email when present or a case-insensitive constraint.

26-28: Add a supporting index for common reads.

Fetching leaders typically filters by (entity_type, entity_id, role) and orders by order. Add a composite index to speed those queries.

         indexes = [
             models.Index(fields=["entity_type", "entity_id"]),
+            models.Index(fields=["entity_type", "entity_id", "role", "order"]),
         ]

73-75: Guard str against empty member_name.

If member_name is accidentally blank, this returns a trailing “as … for …”. Provide a fallback.

-        display_name = self.member.login if self.member else self.member_name
+        display_name = self.member.login if self.member else (self.member_name or "Unknown")

76-99: Use update_or_create for atomic upsert and fewer race conditions.

Prevents IntegrityError under concurrent writers and simplifies logic.

     @staticmethod
     def update_data(data, *, save: bool = True) -> "EntityMember":
-        """Update entity member data."""
-        try:
-            entity_member = EntityMember.objects.get(
-                entity_id=data["entity_id"],
-                entity_type=data["entity_type"],
-                member_name=data["member_name"],
-                role=data["role"],
-            )
-        except EntityMember.DoesNotExist:
-            entity_member = EntityMember(
-                entity_id=data["entity_id"],
-                entity_type=data["entity_type"],
-                member_name=data["member_name"],
-                role=data["role"],
-            )
-
-        entity_member.from_dict(data)
-        if save:
-            entity_member.save()
-
-        return entity_member
+        """Update entity member data."""
+        defaults = {
+            "member_email": (data.get("member_email") or "").strip(),
+            "order": data.get("order", 0),
+        }
+        entity_member, _ = EntityMember.objects.update_or_create(
+            entity_id=data["entity_id"],
+            entity_type=data["entity_type"],
+            member_name=data["member_name"].strip(),
+            role=data["role"],
+            defaults=defaults,
+        )
+        return entity_member

100-113: Normalize inputs to avoid dupes from whitespace variance.

Trim name/email before assignment.

     def from_dict(self, data) -> None:
         """Update instance based on dict data."""
         fields = {
             "entity_id": data["entity_id"],
             "entity_type": data["entity_type"],
-            "member_email": data.get("member_email", ""),
-            "member_name": data["member_name"],
+            "member_email": (data.get("member_email") or "").strip(),
+            "member_name": data["member_name"].strip(),
             "order": data.get("order", 0),
             "role": data["role"],
         }

26-28: If admin relies on member lookups, consider re-adding an index on member.

You removed the old index in the migration; add this if you filter/join by member frequently.

         indexes = [
             models.Index(fields=["entity_type", "entity_id"]),
+            models.Index(fields=["member"]),
         ]

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e2cc92b and 7ebfe11.

📒 Files selected for processing (9)

backend/apps/owasp/admin/entity_member.py (1 hunks)
backend/apps/owasp/management/commands/owasp_scrape_chapters.py (1 hunks)
backend/apps/owasp/management/commands/owasp_scrape_committees.py (1 hunks)
backend/apps/owasp/management/commands/owasp_scrape_projects.py (1 hunks)
backend/apps/owasp/migrations/0052_remove_entitymember_owasp_entit_member__6e516f_idx_and_more.py (1 hunks)
backend/apps/owasp/models/common.py (7 hunks)
backend/apps/owasp/models/entity_member.py (3 hunks)
backend/tests/apps/owasp/models/common_test.py (5 hunks)
cspell/custom-dict.txt (1 hunks)

✅ Files skipped from review due to trivial changes (1)

cspell/custom-dict.txt

🚧 Files skipped from review as they are similar to previous changes (4)

backend/apps/owasp/management/commands/owasp_scrape_committees.py
backend/apps/owasp/models/common.py
backend/apps/owasp/management/commands/owasp_scrape_projects.py
backend/apps/owasp/management/commands/owasp_scrape_chapters.py

🧰 Additional context used

🧠 Learnings (7)

📓 Common learnings

Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/entity_member.py:50-56
Timestamp: 2025-09-06T19:28:14.297Z
Learning: In the OWASP/Nest project, when migrating scraper logic to GitHub .md file parsing, the sync_leaders method uses member_name as the primary identifier for finding and updating existing EntityMember records, not member_email. This approach is chosen because names are more stable identifiers in markdown files, while emails might be added/updated over time.

Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:213-232
Timestamp: 2025-09-06T19:03:01.985Z
Learning: In the OWASP Nest project, the get_leaders_emails() method in RepositoryBasedEntityModel is designed to only capture leaders with mailto: links from leaders.md files, intentionally ignoring plain text names without email addresses. The current regex implementation works correctly for the intended behavior as validated by comprehensive test cases.

📚 Learning: 2025-09-06T19:28:14.297Z

Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/entity_member.py:50-56
Timestamp: 2025-09-06T19:28:14.297Z
Learning: In the OWASP/Nest project, when migrating scraper logic to GitHub .md file parsing, the sync_leaders method uses member_name as the primary identifier for finding and updating existing EntityMember records, not member_email. This approach is chosen because names are more stable identifiers in markdown files, while emails might be added/updated over time.

Applied to files:

backend/apps/owasp/models/entity_member.py

📚 Learning: 2025-07-28T14:51:14.736Z

Learnt from: adithya-naik
PR: OWASP/Nest#1894
File: frontend/src/components/TopContributorsList.tsx:74-74
Timestamp: 2025-07-28T14:51:14.736Z
Learning: In the OWASP/Nest project, the maintainer adithya-naik prefers not to create separate components for code that's only used in two specific cases, following the YAGNI principle to avoid over-engineering when the duplication is limited and manageable.

Applied to files:

backend/apps/owasp/models/entity_member.py

📚 Learning: 2025-08-16T13:23:34.874Z

Learnt from: ahmedxgouda
PR: OWASP/Nest#2021
File: backend/apps/slack/models/google_auth.py:25-36
Timestamp: 2025-08-16T13:23:34.874Z
Learning: In the OWASP Nest project, encryption and decryption for Google OAuth tokens stored in BinaryField are being handled in a separate PR, not in the initial Google Auth setup implementation.

Applied to files:

backend/apps/owasp/models/entity_member.py

📚 Learning: 2025-09-06T12:48:44.916Z

Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:48:44.916Z
Learning: When migrating scraper logic to GitHub .md file parsing in the OWASP project, the new implementation should maintain the same case-sensitivity behavior as the original scraper logic, even if case-insensitive matching might seem more robust.

Applied to files:

backend/apps/owasp/models/entity_member.py

📚 Learning: 2025-09-06T12:57:58.021Z

Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:57:58.021Z
Learning: When filtering URLs by domain in the OWASP project, use `urlparse(url).netloc == domain` instead of `domain in url` to avoid security vulnerabilities where malicious subdomains could bypass filtering (e.g., `testowasp.org` would match when filtering for `owasp.org`).

Applied to files:

backend/tests/apps/owasp/models/common_test.py

📚 Learning: 2025-09-06T19:03:01.985Z

Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:213-232
Timestamp: 2025-09-06T19:03:01.985Z
Learning: In the OWASP Nest project, the get_leaders_emails() method in RepositoryBasedEntityModel is designed to only capture leaders with mailto: links from leaders.md files, intentionally ignoring plain text names without email addresses. The current regex implementation works correctly for the intended behavior as validated by comprehensive test cases.

Applied to files:

backend/tests/apps/owasp/models/common_test.py

🧬 Code graph analysis (2)

backend/apps/owasp/models/entity_member.py (4)

backend/apps/owasp/admin/entity_member.py (1)

entity (52-65)

backend/apps/owasp/models/committee.py (2)

update_data (78-90)

save (54-59)

backend/apps/owasp/models/chapter.py (2)

update_data (195-217)

save (170-178)

backend/apps/owasp/models/project.py (2)

update_data (351-373)

save (326-331)

backend/tests/apps/owasp/models/common_test.py (1)

backend/apps/owasp/models/common.py (4)

get_audience (176-189)

get_leaders_emails (213-231)

get_urls (275-286)

sync_leaders (299-327)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Run frontend e2e tests

🔇 Additional comments (2)

backend/apps/owasp/migrations/0052_remove_entitymember_owasp_entit_member__6e516f_idx_and_more.py (1)
26-29: Add a RunPython data-migration to dedupe/normalize before enforcing the unique constraint

This AlterUniqueTogether will fail at migrate-time if any duplicate rows exist for (entity_type, entity_id, member_name, role) — including case/whitespace variants. Add a data-cleanup RunPython immediately before the AlterUniqueTogether to normalize member_name and remove duplicates, or verify/fix the data beforehand.

Example (place immediately before the AlterUniqueTogether):
def _normalize_name(name: str) -> str:
    return " ".join((name or "").split())

def dedupe_entity_members(apps, schema_editor):
    EntityMember = apps.get_model("owasp", "EntityMember")
    seen = set()
    for em in EntityMember.objects.all().only("id","entity_type_id","entity_id","role","member_name"):
        key = (em.entity_type_id, em.entity_id, em.role, _normalize_name(em.member_name).lower())
        if key in seen:
            EntityMember.objects.filter(id=em.id).delete()
        else:
            seen.add(key)
            em.member_name = _normalize_name(em.member_name)
            em.save(update_fields=["member_name"])
Then add:
migrations.RunPython(dedupe_entity_members, migrations.RunPython.noop),
Location: backend/apps/owasp/migrations/0052_remove_entitymember_owasp_entit_member__6e516f_idx_and_more.py
backend/apps/owasp/admin/entity_member.py (1)

20-29: Good: showing member_name in list_display.

This aligns the admin with the new model fields.

rudransh-shrivastava · 2025-09-11T05:09:04Z

I will work on it, thank you!

* add properties and methods for github.md files * use Github .md files files instead of scraper for urls, leaders, and audience * Update code(add tests) * Update code * Update code * add get_leaders_emails() and add name, email to entity_member model * Update code * add entity_member/leaders sync * update regex * add tests * update code and fix sonar and cr issues * Update code * Update code * Fix cspell --------- Co-authored-by: Kate Golovanova <[email protected]> Co-authored-by: Arkadii Yakovets <[email protected]>

rudransh-shrivastava added 2 commits September 5, 2025 21:34

add properties and methods for github.md files

5744792

use Github .md files files instead of scraper for urls, leaders, and …

a446674

…audience

github-actions bot added backend backend-tests labels Sep 5, 2025

coderabbitai bot reviewed Sep 5, 2025

View reviewed changes

backend/apps/owasp/management/commands/owasp_scrape_projects.py Outdated Show resolved Hide resolved

backend/apps/owasp/models/common.py Outdated Show resolved Hide resolved

Update code(add tests)

2b20ccf

coderabbitai bot reviewed Sep 5, 2025

View reviewed changes

backend/tests/apps/owasp/models/common_test.py Outdated Show resolved Hide resolved

Update code

a697161

coderabbitai bot reviewed Sep 5, 2025

View reviewed changes

backend/apps/owasp/models/common.py Outdated Show resolved Hide resolved

rudransh-shrivastava added 2 commits September 5, 2025 22:57

Update code

37362e7

add get_leaders_emails() and add name, email to entity_member model

f289f92

coderabbitai bot reviewed Sep 6, 2025

View reviewed changes

...nd/apps/owasp/migrations/0051_entitymember_member_email_entitymember_member_name_and_more.py Show resolved Hide resolved

backend/apps/owasp/models/entity_member.py Outdated Show resolved Hide resolved

Update code

5c33591

coderabbitai bot reviewed Sep 6, 2025

View reviewed changes

rudransh-shrivastava and others added 2 commits September 6, 2025 18:28

Merge branch 'main' into feature/migrate-scraper-to-github-md-files

b5bb6ec

add entity_member/leaders sync

9acb861

coderabbitai bot reviewed Sep 6, 2025

View reviewed changes

update regex

674fb18

coderabbitai bot reviewed Sep 6, 2025

View reviewed changes

backend/apps/owasp/models/common.py Outdated Show resolved Hide resolved

backend/apps/owasp/models/common.py Outdated Show resolved Hide resolved

add tests

68e037f

rudransh-shrivastava commented Sep 6, 2025

View reviewed changes

coderabbitai bot reviewed Sep 6, 2025

View reviewed changes

update code and fix sonar and cr issues

024724a

coderabbitai bot reviewed Sep 6, 2025

View reviewed changes

backend/apps/owasp/models/common.py Show resolved Hide resolved

rudransh-shrivastava marked this pull request as ready for review September 6, 2025 19:26

rudransh-shrivastava requested a review from arkid15r as a code owner September 6, 2025 19:26

Merge branch 'main' into feature/migrate-scraper-to-github-md-files

1212256

rudransh-shrivastava and others added 2 commits September 9, 2025 12:24

Merge branch 'main' into feature/migrate-scraper-to-github-md-files

818f43e

Update code

e2cc92b

coderabbitai bot reviewed Sep 9, 2025

View reviewed changes

Update code

34b0090

arkid15r previously approved these changes Sep 11, 2025

View reviewed changes

arkid15r added 2 commits September 10, 2025 20:02

Merge branch 'main' into pr/rudransh-shrivastava/2223

af823c8

Fix cspell

7ebfe11

arkid15r dismissed their stale review via 7ebfe11 September 11, 2025 03:03

arkid15r requested a review from kasya as a code owner September 11, 2025 03:03

arkid15r enabled auto-merge September 11, 2025 03:03

coderabbitai bot reviewed Sep 11, 2025

View reviewed changes

arkid15r approved these changes Sep 11, 2025

View reviewed changes

arkid15r added this pull request to the merge queue Sep 11, 2025

Merged via the queue into OWASP:main with commit fc930e9 Sep 11, 2025
25 checks passed

rudransh-shrivastava mentioned this pull request Sep 12, 2025

Full Scraper Migration to GitHub .md Files #2272

Open

2 tasks

coderabbitai bot mentioned this pull request Sep 13, 2025

Sync www-repopsitories #2164

Draft

5 tasks

Uh oh!

Feature/migrate scraper to GitHub md files #2223

Feature/migrate scraper to GitHub md files #2223

Uh oh!

Conversation

rudransh-shrivastava commented Sep 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Proposed change

Checklist

Uh oh!

coderabbitai bot commented Sep 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Pre-merge checks (4 passed, 1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

rudransh-shrivastava Sep 6, 2025

Choose a reason for hiding this comment

Uh oh!

rudransh-shrivastava Sep 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

arkid15r Sep 11, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

rudransh-shrivastava commented Sep 6, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

arkid15r left a comment

Choose a reason for hiding this comment

Uh oh!

sonarqubecloud bot commented Sep 11, 2025

Quality Gate passed

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

rudransh-shrivastava commented Sep 5, 2025 •

edited

Loading

coderabbitai bot commented Sep 5, 2025 •

edited

Loading

rudransh-shrivastava Sep 6, 2025 •

edited

Loading