-
-
Notifications
You must be signed in to change notification settings - Fork 209
Feature/migrate scraper to GitHub md files #2223
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/migrate scraper to GitHub md files #2223
Conversation
Summary by CodeRabbit
WalkthroughRepository-file parsing replaces scraper-based extraction: add RepositoryBasedEntityModel helpers (info_md_url, get_audience, get_urls, get_leaders_emails, sync_leaders), update scraping commands to call entity.get_* methods, add leader persistence fields/migrations, and extend tests for parsing and syncing. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
Pre-merge checks (4 passed, 1 warning)❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/apps/owasp/management/commands/owasp_scrape_projects.py (1)
59-66
: Guard against None from get_related_url before normalize_url
project.get_related_url(url)
can return None; passing that intonormalize_url()
risks unexpected errors depending on its implementation. Add a guard in the comprehension.- { - repository_url - for url in set(project.get_urls(domain="github.com")) - if (repository_url := normalize_url(project.get_related_url(url))) - and repository_url not in {project.github_url, project.owasp_url} - } + { + repository_url + for url in set(project.get_urls(domain="github.com")) + if (related_url := project.get_related_url(url)) + and (repository_url := normalize_url(related_url)) + and repository_url not in {project.github_url, project.owasp_url} + }
🧹 Nitpick comments (5)
backend/apps/owasp/management/commands/owasp_scrape_projects.py (2)
80-84
: Parse org/user owner with urlparse to avoid trailing slash and query issues
verified_url.split('/')[-1]
breaks on trailing slashes and query strings. Useurlparse
.- gh_organization = gh.get_organization(verified_url.split("/")[-1]) + from urllib.parse import urlparse + owner = urlparse(verified_url).path.strip("/").split("/")[0] + if not owner: + raise UnknownObjectException(status=404, data=None, headers=None) + gh_organization = gh.get_organization(owner)
91-94
: Log the skipped original URL, notNone
When
verified_url
is falsy, logging it is unhelpful. Logscraped_url
.- else: - logger.info("Skipped related URL %s", verified_url) + else: + logger.info("Skipped related URL %s", scraped_url)backend/apps/owasp/models/common.py (1)
254-269
: Harden URL extraction: stricter regex, dedupe, and resilient domain matching
- Current regex can capture trailing punctuation.
- Return stable unique ordering.
- Accept subdomains like www.github.com via suffix match (case-insensitive).
- urls = [] - for line in content.split("\n"): - line_urls = re.findall(r"https?:\/\/[^\s\)]+", line.strip()) - urls.extend(line_urls) + urls = [] + for line in content.split("\n"): + line_urls = re.findall(r"https?://[^\s\)\]\}\>,\"']+", line.strip()) + urls.extend(line_urls) + # de-duplicate preserving order + urls = list(dict.fromkeys(urls)) - if domain: - return [url for url in urls if urlparse(url).netloc == domain] + if domain: + dom = domain.lstrip(".").lower() + return [u for u in urls if urlparse(u).netloc.lower().endswith(dom)] return urlsbackend/tests/apps/owasp/management/commands/owasp_scrape_projects_test.py (2)
36-38
: Stubget_leaders
for determinism (prevents opaque Mock inleaders_raw
)The command now calls
project.get_leaders()
. Stubbing avoids persisting a bare Mock intoleaders_raw
if later assertions are added.mock_scraper.page_tree = True + mock_project.get_leaders.return_value = [] mock_project.get_urls.return_value = [] mock_project.get_audience.return_value = ["builder", "breaker", "defender"]
74-80
: Add an org/user URL to cover the org expansion branchInclude
https://github.com/org
in inputs to exercise theGITHUB_USER_RE
path that expands to repos. Expectations remain unchanged due to dedupe.mock_project.get_urls.return_value = [ "https://github.com/org/repo1", "https://github.com/org/repo2", "https://invalid.com/repo3", + "https://github.com/org", ]
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
backend/apps/owasp/management/commands/owasp_scrape_projects.py
(1 hunks)backend/apps/owasp/models/common.py
(4 hunks)backend/tests/apps/owasp/management/commands/owasp_scrape_projects_test.py
(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
backend/tests/apps/owasp/management/commands/owasp_scrape_projects_test.py (2)
backend/tests/apps/owasp/management/commands/owasp_create_project_metadata_file_test.py (1)
mock_project
(14-40)backend/apps/owasp/scraper.py (2)
get_urls
(89-95)get_audience
(55-87)
backend/apps/owasp/models/common.py (3)
backend/apps/owasp/models/enums/project.py (1)
AudienceChoices
(8-13)backend/apps/owasp/scraper.py (2)
get_audience
(55-87)get_urls
(89-95)backend/apps/github/utils.py (1)
get_repository_file_content
(60-79)
backend/apps/owasp/management/commands/owasp_scrape_projects.py (2)
backend/apps/owasp/models/common.py (5)
get_leaders
(190-210)owasp_url
(140-142)deactivate
(144-147)get_audience
(174-188)get_urls
(254-268)backend/apps/owasp/scraper.py (3)
OwaspScraper
(22-127)get_audience
(55-87)get_urls
(89-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Run frontend e2e tests
- GitHub Check: Run backend tests
- GitHub Check: Run frontend unit tests
- GitHub Check: CodeQL (javascript-typescript)
🔇 Additional comments (2)
backend/apps/owasp/management/commands/owasp_scrape_projects.py (1)
56-56
: LGTM: migrated audience to GitHub info.md sourceSwitching to
project.get_audience()
aligns with the new parsing strategy.backend/apps/owasp/models/common.py (1)
105-114
: LGTM:info_md_url
helper mirrors existing patternConsistent with
index_md_url
/leaders_md_url
and safely returns None when repo missing.
backend/apps/owasp/management/commands/owasp_scrape_projects.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
backend/tests/apps/owasp/models/common_test.py (1)
16-43
: Audience parsing: add case-insensitive + exact-word tests; avoid substring false-positives (e.g., “Codebreaker”).Today, parsing is case-sensitive and matches substrings; “Codebreaker” would incorrectly flag “Breaker”. Add tests to pin desired semantics, and update the parser to be case-insensitive with word boundaries.
Apply this diff to extend the tests:
@@ @pytest.mark.parametrize( ("content", "expected_audience"), [ ( """### Top Ten Card Game Information * [Incubator Project](#) * [Type of Project](#) * [Version 0.0.0](#) * [Builder](#) * [Breaker](#)""", ["breaker", "builder"], ), + ("* builder\n* BREAKER", ["breaker", "builder"]), + ("* Codebreaker", []), ("This test contains no audience information.", []), ("", []), (None, []), ], )And update the implementation in apps/owasp/models/common.py to use case-insensitive, whole-word matching:
@@ - for line in content.split("\n"): - for lower_kw, original_kw in AudienceChoices.choices: - if original_kw in line: - found_keywords.add(lower_kw) + for line in content.splitlines(): + lower_line = line.lower() + for lower_kw, original_kw in AudienceChoices.choices: + if re.search(rf"\b{re.escape(original_kw.lower())}\b", lower_line): + found_keywords.add(lower_kw)
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
backend/apps/owasp/models/common.py
(4 hunks)backend/tests/apps/owasp/models/common_test.py
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/apps/owasp/models/common.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/tests/apps/owasp/models/common_test.py (1)
backend/apps/owasp/models/common.py (2)
get_audience
(174-187)get_urls
(253-267)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Run frontend e2e tests
- GitHub Check: Run backend tests
- GitHub Check: Run frontend unit tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
backend/apps/owasp/models/common.py (1)
174-188
: Make audience detection case-insensitive; guard missing URL.
Case-sensitive matching will miss valid audience signals; also short-circuit ifinfo_md_url
is None to avoid noisy fetch errors.def get_audience(self): """Get audience from info.md file on GitHub.""" - content = get_repository_file_content(self.info_md_url) + if not self.info_md_url: + return [] + content = get_repository_file_content(self.info_md_url) if not content: return [] found_keywords = set() - for line in content.split("\n"): - for lower_kw, original_kw in AudienceChoices.choices: - if original_kw in line: - found_keywords.add(lower_kw) + for line in content.split("\n"): + line_norm = line.casefold() + for lower_kw, original_kw in AudienceChoices.choices: + if original_kw.casefold() in line_norm: + found_keywords.add(lower_kw) return sorted(found_keywords)
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
backend/apps/owasp/management/commands/owasp_scrape_projects.py
(1 hunks)backend/apps/owasp/models/common.py
(4 hunks)backend/tests/apps/owasp/models/common_test.py
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- backend/tests/apps/owasp/models/common_test.py
- backend/apps/owasp/management/commands/owasp_scrape_projects.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/apps/owasp/models/common.py (3)
backend/apps/owasp/models/enums/project.py (1)
AudienceChoices
(8-13)backend/apps/owasp/scraper.py (2)
get_audience
(55-87)get_urls
(89-95)backend/apps/github/utils.py (1)
get_repository_file_content
(60-79)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Run backend tests
- GitHub Check: Run frontend e2e tests
- GitHub Check: Run frontend unit tests
- GitHub Check: CodeQL (javascript-typescript)
🔇 Additional comments (2)
backend/apps/owasp/models/common.py (2)
21-21
: Import looks correct and scoped.
AudienceChoices
import path matches the enums module location.
105-114
: Info.md URL property: LGTM.
Consistent withindex_md_url
/leaders_md_url
. No issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (3)
backend/apps/owasp/models/entity_member.py (2)
57-58
: Index and normalizemember_email
for lookups and dedupe.Email will be a frequent join/dedupe key from MD parsing. Add an index and ensure consistent casing.
- member_email = models.EmailField(blank=True, default="") + member_email = models.EmailField(blank=True, default="", db_index=True)Optionally, normalize on save:
def save(self, *args, **kwargs): if self.member_email: self.member_email = self.member_email.strip().lower() super().save(*args, **kwargs)
69-73
: Harden str fallback to avoid empty names.If both
member
is None andmember_name
is empty, the display becomes blank. Fall back to email or “Unknown”.- display_name = self.member.login if self.member else self.member_name - return f"{display_name} as {self.get_role_display()} for {self.entity}" + display_name = ( + (getattr(self.member, "login", "") if self.member else "") + or self.member_name + or self.member_email + or "Unknown" + ) + return f"{display_name} as {self.get_role_display()} for {self.entity}"backend/apps/owasp/migrations/0051_entitymember_member_email_entitymember_member_name_and_more.py (1)
13-35
: Optional data backfill for nicer display.Backfill
member_name
from existing relations to keep str meaningful for legacy rows lacking MD-provided names.I can add a RunPython migration that sets
member_name = member.login
wheremember_name=""
andmember IS NOT NULL
. Want me to draft it?
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
backend/apps/owasp/management/commands/owasp_scrape_projects.py
(1 hunks)backend/apps/owasp/migrations/0051_entitymember_member_email_entitymember_member_name_and_more.py
(1 hunks)backend/apps/owasp/models/common.py
(5 hunks)backend/apps/owasp/models/entity_member.py
(2 hunks)backend/tests/apps/owasp/models/common_test.py
(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- backend/apps/owasp/models/common.py
- backend/tests/apps/owasp/models/common_test.py
- backend/apps/owasp/management/commands/owasp_scrape_projects.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/apps/owasp/models/entity_member.py (1)
backend/apps/mentorship/api/internal/nodes/mentor.py (1)
login
(23-25)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Run frontend e2e tests
- GitHub Check: Run backend tests
- GitHub Check: Run frontend unit tests
🔇 Additional comments (2)
backend/apps/owasp/models/entity_member.py (1)
69-72
:.login
is defined onUser
(inherited fromGenericUserModel
); no changes required.backend/apps/owasp/migrations/0051_entitymember_member_email_entitymember_member_name_and_more.py (1)
1-12
: No action needed—Django is already pinned to cover 5.2.6
backend/pyproject.toml
declaresdjango = "^5.1"
, which resolves to ≥5.1.0 and <6.0.0, so migrations generated by 5.2.6 will be consistent.
...nd/apps/owasp/migrations/0051_entitymember_member_email_entitymember_member_name_and_more.py
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (10)
backend/tests/apps/owasp/management/commands/owasp_scrape_committees_test.py (2)
78-93
: Add assertions to lock in new behavior (leaders + no scraper.get_urls).
Strengthen the test so regressions back to scraper.get_urls get caught and the leaders call is exercised.Apply:
@@ mock_active_committees.count.assert_called_once() assert mock_bulk_save.called assert mock_print.call_count == (committees - offset) + # New API usage assertions + mock_committee.get_leaders.assert_called_once() + mock_committee.get_urls.assert_called() + # Ensure old path isn't used anymore + mock_scraper.get_urls.assert_not_called() + for call in mock_print.call_args_list: args, _ = call assert "https://owasp.org/www-committee-test" in args[0]
49-50
: Remove redundant get_related_url side_effect assignment.
You set the same side_effect in the fixture already; this second assignment is unnecessary.- mock_committee.get_related_url.side_effect = lambda url, **_: url + # Redundant with fixture setup; can be removed.Also applies to: 23-23
backend/apps/owasp/management/commands/owasp_scrape_chapters.py (3)
44-57
: Avoid calling normalize_url(None); split steps for clarity and safety.
Guard against None from get_related_url before normalizing; keeps behavior explicit.- scraped_urls = sorted( - { - repository_url - for url in set(chapter.get_urls()) - if ( - repository_url := normalize_url( - chapter.get_related_url( - url, exclude_domains=("github.com", "owasp.org") - ) - ) - ) - and repository_url not in {chapter.github_url, chapter.owasp_url} - } - ) + scraped_urls = sorted( + { + repo_url + for url in set(chapter.get_urls()) + if (rel := chapter.get_related_url(url, exclude_domains=("github.com", "owasp.org"))) + and (repo_url := normalize_url(rel)) + and repo_url not in {chapter.github_url, chapter.owasp_url} + } + )
67-72
: Avoid shadowing verified_url; use a distinct name for the normalized related URL.
Improves readability and log accuracy.- if verified_url := chapter.get_related_url( - normalize_url(verified_url, check_path=True) - ): - related_urls.add(verified_url) + if related_url := chapter.get_related_url( + normalize_url(verified_url, check_path=True) + ): + related_urls.add(related_url) else: - logger.info("Skipped related URL %s", verified_url) + logger.info("Skipped related URL %s", related_url)
42-76
: Deduplicate URL-pipeline logic shared with committees.
The “collect, verify, normalize, filter” flow is duplicated across commands. Consider extracting a shared helper (e.g., on the model or a utils module) to reduce drift.backend/apps/owasp/management/commands/owasp_scrape_committees.py (3)
44-57
: Guard normalize_url against None and simplify the filter.
Same safety/readability improvement as chapters.- scraped_urls = sorted( - { - repository_url - for url in set(committee.get_urls()) - if ( - repository_url := normalize_url( - committee.get_related_url( - url, exclude_domains=("github.com", "owasp.org") - ) - ) - ) - and repository_url not in {committee.github_url, committee.owasp_url} - } - ) + scraped_urls = sorted( + { + repo_url + for url in set(committee.get_urls()) + if (rel := committee.get_related_url(url, exclude_domains=("github.com", "owasp.org"))) + and (repo_url := normalize_url(rel)) + and repo_url not in {committee.github_url, committee.owasp_url} + } + )
67-72
: Rename shadowed variable for clarity.
Avoid reusing verified_url for the post-processed related URL.- if verified_url := committee.get_related_url( - normalize_url(verified_url, check_path=True) - ): - related_urls.add(verified_url) + if related_url := committee.get_related_url( + normalize_url(verified_url, check_path=True) + ): + related_urls.add(related_url) else: - logger.info("Skipped related URL %s", verified_url) + logger.info("Skipped related URL %s", related_url)
42-76
: Extract shared URL-processing helper to avoid duplication with chapters.
Centralize the “discover → normalize → verify → filter” logic to keep behavior consistent across entities.backend/tests/apps/owasp/management/commands/owasp_scrape_chapters_test.py (2)
78-93
: Harden the test with explicit API-usage assertions.
Mirror the committees test improvements to catch regressions and exercise leaders.@@ mock_active_chapters.count.assert_called_once() assert mock_bulk_save.called assert mock_print.call_count == (chapters - offset) + # New API usage assertions + mock_chapter.get_leaders.assert_called_once() + mock_chapter.get_urls.assert_called() + # Ensure old path isn't used anymore + mock_scraper.get_urls.assert_not_called() + for call in mock_print.call_args_list: args, _ = call assert "https://owasp.org/www-chapter-test" in args[0]
49-50
: Drop duplicate get_related_url side_effect setup.
Already set in the fixture; the extra assignment in the test body can be removed.- mock_chapter.get_related_url.side_effect = lambda url, **_: url + # Redundant with fixture setup; can be removed.Also applies to: 23-23
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
backend/apps/owasp/management/commands/owasp_scrape_chapters.py
(1 hunks)backend/apps/owasp/management/commands/owasp_scrape_committees.py
(1 hunks)backend/apps/owasp/management/commands/owasp_scrape_projects.py
(1 hunks)backend/tests/apps/owasp/management/commands/owasp_scrape_chapters_test.py
(1 hunks)backend/tests/apps/owasp/management/commands/owasp_scrape_committees_test.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/apps/owasp/management/commands/owasp_scrape_projects.py
🧰 Additional context used
🧬 Code graph analysis (4)
backend/apps/owasp/management/commands/owasp_scrape_chapters.py (2)
backend/apps/owasp/models/common.py (2)
get_leaders
(189-209)get_urls
(273-284)backend/apps/owasp/scraper.py (1)
get_urls
(89-95)
backend/tests/apps/owasp/management/commands/owasp_scrape_committees_test.py (1)
backend/tests/apps/owasp/management/commands/owasp_enrich_committees_test.py (1)
mock_committee
(18-22)
backend/apps/owasp/management/commands/owasp_scrape_committees.py (2)
backend/apps/owasp/models/common.py (2)
get_leaders
(189-209)get_urls
(273-284)backend/apps/owasp/scraper.py (1)
get_urls
(89-95)
backend/tests/apps/owasp/management/commands/owasp_scrape_chapters_test.py (1)
backend/tests/apps/owasp/management/commands/owasp_enrich_chapters_test.py (1)
mock_chapter
(18-25)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Run frontend unit tests
- GitHub Check: Run frontend e2e tests
- GitHub Check: Run backend tests
- GitHub Check: CodeQL (javascript-typescript)
🔇 Additional comments (6)
backend/tests/apps/owasp/management/commands/owasp_scrape_committees_test.py (1)
41-46
: Switched URL source to committee.get_urls — matches migration intent.
Good change; the test now exercises the new model-driven path instead of the scraper.backend/apps/owasp/management/commands/owasp_scrape_chapters.py (2)
47-47
: Use chapter.get_urls — aligned with .md-based parsing.
This aligns the command with the new RepositoryBasedEntityModel APIs.
42-42
: Leaders now sourced from leaders.md — confirm whether emails should be persisted too.
If leader emails are required (per PR goal), consider also calling a corresponding API (e.g., get_leaders_emails) and persisting them.backend/apps/owasp/management/commands/owasp_scrape_committees.py (2)
47-47
: Use committee.get_urls — aligned with .md-based parsing.
This matches the PR direction to move away from DOM scraping for URL discovery.
42-42
: Leaders now sourced from leaders.md — confirm whether emails should be persisted too.
If leader emails are part of the requirements, add retrieval/persistence (e.g., get_leaders_emails) here as well.backend/tests/apps/owasp/management/commands/owasp_scrape_chapters_test.py (1)
41-46
: Switched URL source to chapter.get_urls — aligns with new model API.
Good update; the test now validates the .md-based URL path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
backend/apps/owasp/management/commands/owasp_scrape_projects.py (1)
55-59
: Avoid double-fetching leaders.md and keep policy flag in sync.Reuse
leaders_emails
to deriveleaders_raw
(one network call) and setis_leaders_policy_compliant
alongside. Mirrors prior feedback.- project.audience = project.get_audience() - project.leaders_raw = project.get_leaders() - leaders_emails = project.get_leaders_emails() - if leaders_emails: - project.sync_leaders(leaders_emails) + project.audience = project.get_audience() + leaders_emails = project.get_leaders_emails() + # Derive leaders list from emails to avoid a second fetch; fallback if empty. + project.leaders_raw = list(leaders_emails) if leaders_emails else project.get_leaders() + project.is_leaders_policy_compliant = len(project.leaders_raw) > 1 + if leaders_emails: + project.sync_leaders(leaders_emails)
🧹 Nitpick comments (5)
backend/apps/owasp/management/commands/owasp_scrape_projects.py (2)
51-51
: Remove stray debug print.Leftover
print("what")
will pollute command output.- print("what") + logger.debug("OWASP page not found for %s; deactivating", project.owasp_url)
65-65
: Drop redundant set() call in comprehension.Outer set already dedupes; inner
set(...)
adds work without benefit.- for url in set(project.get_urls(domain="github.com")) + for url in project.get_urls(domain="github.com")backend/apps/owasp/models/common.py (3)
107-115
: Guard against missing default_branch.Prevent generating an invalid URL when
default_branch
is None.- return ( + return ( "https://raw.githubusercontent.com/OWASP/" - f"{self.owasp_repository.key}/{self.owasp_repository.default_branch}/info.md" - if self.owasp_repository + f"{self.owasp_repository.key}/{self.owasp_repository.default_branch}/info.md" + if self.owasp_repository and self.owasp_repository.default_branch else None )
213-231
: Harden leaders email extraction (support “+” bullets and MAILTO variants).Current regex misses
+
bullets and upper/mixed-casemailto
. Minor expansion improves robustness without changing semantics.- matches = re.findall( - r"[-*]\s*(?:\[([^\]]+)\]\(mailto:([^)]+)\)|([^[\n]+))", line.strip() - ) + matches = re.findall( + r"[-*+]\s*(?:\[([^\]]+)\]\(\s*(?i:mailto:)\s*([^)]+)\)|([^[\n]+))", + line.strip(), + )
275-287
: Strip trailing punctuation and dedupe URLs while preserving order.Avoids false invalids like
...repo).
and repeated links.- urls = re.findall(r"https?:\/\/[^\s\)]+", content.strip()) + raw_urls = re.findall(r"https?://[^\s\)]+", content) + seen, urls = set(), [] + for u in raw_urls: + u = u.rstrip(").,;:!?") + if u not in seen: + seen.add(u) + urls.append(u)(Domain filtering via
urlparse(url).netloc == domain
is correct; keep as-is.)
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
backend/apps/owasp/management/commands/owasp_scrape_chapters.py
(1 hunks)backend/apps/owasp/management/commands/owasp_scrape_committees.py
(1 hunks)backend/apps/owasp/management/commands/owasp_scrape_projects.py
(1 hunks)backend/apps/owasp/models/common.py
(7 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- backend/apps/owasp/management/commands/owasp_scrape_committees.py
- backend/apps/owasp/management/commands/owasp_scrape_chapters.py
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:48:44.892Z
Learning: When migrating scraper logic to GitHub .md file parsing in the OWASP project, the new implementation should maintain the same case-sensitivity behavior as the original scraper logic, even if case-insensitive matching might seem more robust.
📚 Learning: 2025-09-06T12:57:57.997Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:57:57.997Z
Learning: When filtering URLs by domain in the OWASP project, use `urlparse(url).netloc == domain` instead of `domain in url` to avoid security vulnerabilities where malicious subdomains could bypass filtering (e.g., `testowasp.org` would match when filtering for `owasp.org`).
Applied to files:
backend/apps/owasp/models/common.py
📚 Learning: 2025-09-06T12:48:44.892Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:48:44.892Z
Learning: When migrating scraper logic to GitHub .md file parsing in the OWASP project, the new implementation should maintain the same case-sensitivity behavior as the original scraper logic, even if case-insensitive matching might seem more robust.
Applied to files:
backend/apps/owasp/models/common.py
🧬 Code graph analysis (2)
backend/apps/owasp/models/common.py (4)
backend/apps/common/models.py (2)
BulkSaveModel
(10-34)bulk_save
(19-34)backend/apps/owasp/models/enums/project.py (1)
AudienceChoices
(8-13)backend/apps/github/utils.py (1)
get_repository_file_content
(60-79)backend/apps/owasp/models/entity_member.py (2)
EntityMember
(10-72)Role
(13-16)
backend/apps/owasp/management/commands/owasp_scrape_projects.py (2)
backend/apps/owasp/models/common.py (6)
deactivate
(146-149)get_audience
(176-189)get_leaders
(191-211)get_leaders_emails
(213-231)sync_leaders
(299-338)get_urls
(275-286)backend/apps/owasp/models/project.py (1)
deactivate
(278-281)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Run frontend unit tests
- GitHub Check: Run frontend e2e tests
- GitHub Check: Run backend tests
- GitHub Check: CodeQL (javascript-typescript)
🔇 Additional comments (2)
backend/apps/owasp/models/common.py (2)
11-11
: Imports look good.All newly added imports are used.
Also applies to: 14-14, 23-23
176-190
: Audience parsing preserves original case-sensitive behavior.Matches the legacy scraper semantics as requested. Consider adding tests that assert case-sensitive detection for “Breaker/Builder/Defender”.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (1)
backend/apps/owasp/models/common.py (1)
299-338
: Sync leaders must update order/reactivation and deactivate stale entries; also persist all changed fields.
As previously noted, existing leaders’ order/is_active aren’t updated, removed leaders remain active, and bulk_save only updates member_email.Minimal persistence fix (doesn’t cover the full logic change already suggested earlier):
- if members_to_save: - BulkSaveModel.bulk_save(EntityMember, members_to_save, ["member_email"]) + if members_to_save: + BulkSaveModel.bulk_save( + EntityMember, + members_to_save, + ["member_email", "order", "is_active"], + )Please also adopt the earlier upsert/reorder/reactivate/deactivate logic so state is consistent.
🧹 Nitpick comments (1)
backend/apps/owasp/models/common.py (1)
275-287
: Harden get_urls: trim trailing punctuation, dedupe, and normalize netloc case.
Prevents false negatives on domain filter (e.g., trailing “.”) and avoids duplicates while keeping strict netloc equality per security learning.- urls = re.findall(r"https?:\/\/[^\s\)]+", content.strip()) + raw_urls = re.findall(r"https?://[^\s\)]+", content) + # Strip common trailing punctuation from Markdown/text usage and dedupe (stable). + urls = [u.rstrip(").,;:!?]") for u in raw_urls] + urls = list(dict.fromkeys(urls)) if domain: - return [url for url in urls if urlparse(url).netloc == domain] + dn = domain.lower() + return [u for u in urls if urlparse(u).netloc.lower() == dn] return urls
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
backend/apps/owasp/models/common.py
(7 hunks)backend/tests/apps/owasp/models/common_test.py
(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/tests/apps/owasp/models/common_test.py
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:48:44.892Z
Learning: When migrating scraper logic to GitHub .md file parsing in the OWASP project, the new implementation should maintain the same case-sensitivity behavior as the original scraper logic, even if case-insensitive matching might seem more robust.
📚 Learning: 2025-09-06T12:57:57.997Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:57:57.997Z
Learning: When filtering URLs by domain in the OWASP project, use `urlparse(url).netloc == domain` instead of `domain in url` to avoid security vulnerabilities where malicious subdomains could bypass filtering (e.g., `testowasp.org` would match when filtering for `owasp.org`).
Applied to files:
backend/apps/owasp/models/common.py
📚 Learning: 2025-09-06T12:48:44.892Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:48:44.892Z
Learning: When migrating scraper logic to GitHub .md file parsing in the OWASP project, the new implementation should maintain the same case-sensitivity behavior as the original scraper logic, even if case-insensitive matching might seem more robust.
Applied to files:
backend/apps/owasp/models/common.py
🧬 Code graph analysis (1)
backend/apps/owasp/models/common.py (4)
backend/apps/common/models.py (2)
BulkSaveModel
(10-34)bulk_save
(19-34)backend/apps/owasp/models/enums/project.py (1)
AudienceChoices
(8-13)backend/apps/github/utils.py (1)
get_repository_file_content
(60-79)backend/apps/owasp/models/entity_member.py (2)
EntityMember
(10-72)Role
(13-16)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Run frontend e2e tests
- GitHub Check: Run backend tests
- GitHub Check: Run frontend unit tests
🔇 Additional comments (3)
backend/apps/owasp/models/common.py (3)
11-11
: LGTM: supporting imports added appropriately.
ContentType, BulkSaveModel, and AudienceChoices imports align with the new behaviors.Also applies to: 14-14, 23-23
107-116
: LGTM: info.md URL property.
Straightforward and consistent with index_md_url.
176-190
: LGTM: audience extraction preserves original case-sensitive behavior.
Matches the original scraper semantics (per retrieved learnings) and returns normalized values.
urls = re.findall(r"https?:\/\/[^\s\)]+", content.strip()) | ||
|
||
if domain: | ||
return [url for url in urls if urlparse(url).netloc == domain] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we also match subdomains?
backend/apps/owasp/models/common.py
Outdated
leaders = {} | ||
for line in content.split("\n"): | ||
matches = re.findall( | ||
r"^[-*]\s*(?:\[([^\]]+)\]\(mailto:([^)]+)(\)|([^[<\n])))", line.strip() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have tried to fix this regex for hours, I can't figure it out.
^[-*]\s*(?:\[([^\]]+)\]\(mailto:([^)]+)(\)|([^[<\n])))
dropped leaders without emails
^[-*]\s*(?:\[([^\]]+)\]\(mailto:([^)]+)(\)|([^[<\n])))
was also promising but it too fails tests.
I added many tests for this. Please check them. I think if we can have a stricter rule for matching leaders names and emails I can figure this out.
example:
does * [John Doe]
match name?
does * John Doe
match name?
I also noticed that it was parsing <a>
tags as names in earlier commits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to be a one-time migration so it's not required to be ideal. Some required data can be added manually later on a higher level process.
I appreciate attention to details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
backend/tests/apps/owasp/models/common_test.py (1)
191-226
: Expand domain-filter cases to lock in security semantics (hostname equality-or-subdomain, case-insensitive, port-safe).
Current tests won’t catch regressions (e.g., lookalike domains, ports, case). Please extend as below; confirm whether subdomains should be accepted (test includes both). This also aligns with prior review feedback.@@ @pytest.mark.parametrize( - ("content", "domain", "expected_urls"), + ("content", "domain", "expected_urls"), [ @@ ("This test contains no URLs.", None, []), ("", None, []), (None, None, []), + ("* [Fake](https://maliciousowasp.org)", "owasp.org", []), + ("* [UK](https://owasp.org.uk)", "owasp.org", []), + ("* [Home](https://OWASP.org)", "OWASP.ORG", ["https://OWASP.org"]), + ("* [With port](https://owasp.org:443/path)", "owasp.org", ["https://owasp.org:443/path"]), + ("* [Subdomain](https://test.owasp.org/docs)", "owasp.org", ["https://test.owasp.org/docs"]), ], )If subdomains should NOT be accepted, change the last case’s expected_urls to [] and I’ll adjust the production code suggestion accordingly.
Production code change (outside this file) to make tests pass and harden filtering:
--- a/backend/apps/owasp/models/common.py +++ b/backend/apps/owasp/models/common.py @@ def get_urls(self, domain=None): - if domain: - return [url for url in urls if urlparse(url).netloc == domain] + if domain: + domain = domain.lstrip(".").lower() + def host_matches(u: str) -> bool: + host = urlparse(u).hostname + if not host: + return False + host = host.lower() + # Accept apex or proper subdomains; case-insensitive; ports already stripped by .hostname + return host == domain or host.endswith("." + domain) + return [u for u in urls if host_matches(u)]
🧹 Nitpick comments (2)
backend/tests/apps/owasp/models/common_test.py (2)
22-49
: Audience tests cover happy-path and empties.
Consider adding a duplicate keyword case to assert de-duplication remains stable, but optional.
321-349
: Strengthen assertions for sync_leaders mixed path.
Verify update_fields and constructor args for the new leader.@@ def test_sync_leaders_mixed_scenario( - call_args = mock_bulk_save.bulk_save.call_args - leaders_to_save = call_args[0][1] + call_args = mock_bulk_save.bulk_save.call_args + leaders_to_save = call_args[0][1] assert len(leaders_to_save) == 2 # Updated existing + new leader assert leaders_to_save[0] == existing_leader + # Ensure only member_email updates are persisted + assert call_args[0][2] == ["member_email"] + # New leader constructed with expected fields and order + mock_entity_member.assert_called_once() + _, kwargs = mock_entity_member.call_args + assert kwargs["member_name"] == "Jane Smith" + assert kwargs["member_email"] == "[email protected]" + assert kwargs["order"] == 1 + assert kwargs["role"] == mock_entity_member.Role.LEADER
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
backend/apps/owasp/models/common.py
(7 hunks)backend/tests/apps/owasp/models/common_test.py
(5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/apps/owasp/models/common.py
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:48:44.892Z
Learning: When migrating scraper logic to GitHub .md file parsing in the OWASP project, the new implementation should maintain the same case-sensitivity behavior as the original scraper logic, even if case-insensitive matching might seem more robust.
📚 Learning: 2025-09-06T12:57:57.997Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:57:57.997Z
Learning: When filtering URLs by domain in the OWASP project, use `urlparse(url).netloc == domain` instead of `domain in url` to avoid security vulnerabilities where malicious subdomains could bypass filtering (e.g., `testowasp.org` would match when filtering for `owasp.org`).
Applied to files:
backend/tests/apps/owasp/models/common_test.py
🧬 Code graph analysis (1)
backend/tests/apps/owasp/models/common_test.py (2)
backend/apps/owasp/models/common.py (4)
get_audience
(176-189)get_leaders_emails
(213-231)get_urls
(275-286)sync_leaders
(299-338)backend/tests/apps/ai/common/base/chunk_command_test.py (1)
mock_content_type
(64-68)
🔇 Additional comments (5)
backend/tests/apps/owasp/models/common_test.py (5)
1-1
: Imports look good.
No issues with MagicMock/Mock/patch usage.
16-21
: Solid test setup.
Fixture keeps tests isolated; id preset helps with FK lookups.
85-146
: Leaders-email parsing: good edge coverage.
Nice handling of spacing, non-mailto bullets, and HTML anchor exclusion.
286-299
: No-op path correctly avoids unnecessary writes.
Mocks are wired correctly; assertion precise.
300-320
: Unchanged leader: good negative assertion.
Covers idempotency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
backend/apps/owasp/models/common.py (1)
299-338
: Prevent duplicate name-identified leaders when member is NULL (DB-level guard).The current unique_together uses member FK; duplicates can accumulate for (entity, role, member_name) when member is NULL. Add a partial unique constraint to EntityMember.Meta.
Proposed constraint (in backend/apps/owasp/models/entity_member.py):
from django.db import models class EntityMember(models.Model): class Meta: constraints = [ models.UniqueConstraint( fields=["entity_type", "entity_id", "role", "member_name"], condition=models.Q(member__isnull=True), name="uniq_entity_role_member_name_when_member_null", ), ]Follow with a migration and de-dupe/backfill before applying.
🧹 Nitpick comments (2)
backend/apps/owasp/models/common.py (2)
176-190
: Audience parsing mirrors original scraper semantics (case-sensitive).Implementation aligns with the stated requirement to keep case-sensitive behavior. Minor nit: use splitlines() for robustness across platforms.
- for line in content.split("\n"): + for line in content.splitlines():
275-287
: URL extraction: trim trailing punctuation, dedupe, and (optionally) support subdomains safely.
- Current regex may include trailing punctuation (e.g., final “.”) and returns duplicates.
- On the “should we also match subdomains?” question: keep exact netloc by default for safety; optionally add include_subdomains flag using a dot-boundary check (no substring matching).
Apply:
- def get_urls(self, domain=None): + def get_urls(self, domain=None, include_subdomains: bool = False): """Get URLs from info.md file on GitHub.""" - content = get_repository_file_content(self.info_md_url) + # Short-circuit to avoid noisy logs/requests when URL is missing. + if not self.info_md_url: + return [] + content = get_repository_file_content(self.info_md_url) if not content: return [] - urls = re.findall(r"https?:\/\/[^\s\)]+", content.strip()) + raw_urls = re.findall(r"https?://[^\s\)]+", content) + # Strip common trailing punctuation and dedupe while preserving order. + urls = [] + seen = set() + for u in (ru.rstrip(").,;:!?") for ru in raw_urls): + if u not in seen: + seen.add(u) + urls.append(u) if domain: - return [url for url in urls if urlparse(url).netloc == domain] + def matches(netloc: str) -> bool: + return netloc == domain or (include_subdomains and netloc.endswith("." + domain)) + return [url for url in urls if matches(urlparse(url).netloc)] return urlsPlease confirm if any call sites rely on the old signature; if so, we can keep signature stable and always do exact match, adding subdomain support later behind a new param.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
backend/apps/owasp/models/common.py
(7 hunks)backend/tests/apps/owasp/models/common_test.py
(5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/tests/apps/owasp/models/common_test.py
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:48:44.892Z
Learning: When migrating scraper logic to GitHub .md file parsing in the OWASP project, the new implementation should maintain the same case-sensitivity behavior as the original scraper logic, even if case-insensitive matching might seem more robust.
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:213-232
Timestamp: 2025-09-06T19:03:01.961Z
Learning: In the OWASP Nest project, the get_leaders_emails() method in RepositoryBasedEntityModel is designed to only capture leaders with mailto: links from leaders.md files, intentionally ignoring plain text names without email addresses. The current regex implementation works correctly for the intended behavior as validated by comprehensive test cases.
📚 Learning: 2025-09-06T12:57:57.997Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:57:57.997Z
Learning: When filtering URLs by domain in the OWASP project, use `urlparse(url).netloc == domain` instead of `domain in url` to avoid security vulnerabilities where malicious subdomains could bypass filtering (e.g., `testowasp.org` would match when filtering for `owasp.org`).
Applied to files:
backend/apps/owasp/models/common.py
📚 Learning: 2025-09-06T12:48:44.892Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:48:44.892Z
Learning: When migrating scraper logic to GitHub .md file parsing in the OWASP project, the new implementation should maintain the same case-sensitivity behavior as the original scraper logic, even if case-insensitive matching might seem more robust.
Applied to files:
backend/apps/owasp/models/common.py
📚 Learning: 2025-09-06T19:03:01.961Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:213-232
Timestamp: 2025-09-06T19:03:01.961Z
Learning: In the OWASP Nest project, the get_leaders_emails() method in RepositoryBasedEntityModel is designed to only capture leaders with mailto: links from leaders.md files, intentionally ignoring plain text names without email addresses. The current regex implementation works correctly for the intended behavior as validated by comprehensive test cases.
Applied to files:
backend/apps/owasp/models/common.py
📚 Learning: 2025-07-28T14:51:14.736Z
Learnt from: adithya-naik
PR: OWASP/Nest#1894
File: frontend/src/components/TopContributorsList.tsx:74-74
Timestamp: 2025-07-28T14:51:14.736Z
Learning: In the OWASP/Nest project, the maintainer adithya-naik prefers not to create separate components for code that's only used in two specific cases, following the YAGNI principle to avoid over-engineering when the duplication is limited and manageable.
Applied to files:
backend/apps/owasp/models/common.py
📚 Learning: 2025-08-16T13:23:34.874Z
Learnt from: ahmedxgouda
PR: OWASP/Nest#2021
File: backend/apps/slack/models/google_auth.py:25-36
Timestamp: 2025-08-16T13:23:34.874Z
Learning: In the OWASP Nest project, encryption and decryption for Google OAuth tokens stored in BinaryField are being handled in a separate PR, not in the initial Google Auth setup implementation.
Applied to files:
backend/apps/owasp/models/common.py
🧬 Code graph analysis (1)
backend/apps/owasp/models/common.py (4)
backend/apps/common/models.py (2)
BulkSaveModel
(10-34)bulk_save
(19-34)backend/apps/owasp/models/enums/project.py (1)
AudienceChoices
(8-13)backend/apps/github/utils.py (1)
get_repository_file_content
(60-79)backend/apps/owasp/models/entity_member.py (2)
EntityMember
(10-72)Role
(13-16)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Run frontend e2e tests
- GitHub Check: Run backend tests
- GitHub Check: Run frontend unit tests
- GitHub Check: CodeQL (javascript-typescript)
🔇 Additional comments (2)
backend/apps/owasp/models/common.py (2)
11-24
: Imports look correct and scoped.No concerns; these are needed for the new functionality.
107-115
: info_md_url: LGTM.Property guards None when repo is absent; consistent with existing index_md_url.
Hi, I'm not sure if I'm saving the I am also not sure what to do in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
backend/apps/owasp/management/commands/owasp_scrape_projects.py (1)
54-58
: Leaders parsing wired in; also keep policy flag in sync and consider handling removalsNice switch to the repository-backed API. Two follow-ups:
- If the project model still has is_leaders_policy_compliant, set it here (e.g., require >=2 leaders with emails) so it stays consistent with leaders_raw/leaders_emails. Based on the retrieved learnings, get_leaders_emails intentionally only captures mailto links and sync_leaders matches by member_name, so basing compliance on leaders_emails is appropriate.
- sync_leaders currently upserts email on existing names but doesn’t deactivate leaders removed from the markdown or update order for renamed/reordered entries. Consider extending sync_leaders to deactivate missing names and to update order when it changes.
Example inline tweak for the flag (only if the field exists on Project):
project.leaders_raw = project.get_leaders() - leaders_emails = project.get_leaders_emails() + leaders_emails = project.get_leaders_emails() if leaders_emails: project.sync_leaders(leaders_emails) + # keep policy flag consistent with parsed leaders (if present on model) + if hasattr(project, "is_leaders_policy_compliant"): + project.is_leaders_policy_compliant = sum(1 for e in leaders_emails.values() if e) >= 2If you’d like, I can draft a minimal sync_leaders enhancement that deactivates stale leaders and updates ordering.
🧹 Nitpick comments (1)
backend/apps/owasp/management/commands/owasp_scrape_projects.py (1)
61-68
: Switch to project.get_urls looks good; tighten GitHub handling and fix a small log nit
- Domain filter: ensure get_urls treats www.github.com the same as github.com (normalize/lstrip www. in the model method). This avoids silently dropping valid links.
- User/org pages: when GITHUB_USER_RE matches, try get_organization first and, on UnknownObjectException, fall back to gh.get_user(login) to include user repos. Today, user pages will be skipped silently after the exception.
- Log message nit: when get_related_url returns falsy, logging verified_url prints None; log scraped_url instead.
Patch for the log:
- else: - logger.info("Skipped related URL %s", verified_url) + else: + logger.info("Skipped related URL %s", scraped_url)I can also provide a small helper to resolve GitHub principals (org or user) with caching to cut API calls across projects.
Also applies to: 94-96
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
backend/apps/owasp/management/commands/owasp_scrape_projects.py
(1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/entity_member.py:50-56
Timestamp: 2025-09-06T19:28:14.273Z
Learning: In the OWASP/Nest project, when migrating scraper logic to GitHub .md file parsing, the sync_leaders method uses member_name as the primary identifier for finding and updating existing EntityMember records, not member_email. This approach is chosen because names are more stable identifiers in markdown files, while emails might be added/updated over time.
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:213-232
Timestamp: 2025-09-06T19:03:01.961Z
Learning: In the OWASP Nest project, the get_leaders_emails() method in RepositoryBasedEntityModel is designed to only capture leaders with mailto: links from leaders.md files, intentionally ignoring plain text names without email addresses. The current regex implementation works correctly for the intended behavior as validated by comprehensive test cases.
📚 Learning: 2025-09-06T19:28:14.273Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/entity_member.py:50-56
Timestamp: 2025-09-06T19:28:14.273Z
Learning: In the OWASP/Nest project, when migrating scraper logic to GitHub .md file parsing, the sync_leaders method uses member_name as the primary identifier for finding and updating existing EntityMember records, not member_email. This approach is chosen because names are more stable identifiers in markdown files, while emails might be added/updated over time.
Applied to files:
backend/apps/owasp/management/commands/owasp_scrape_projects.py
🧬 Code graph analysis (1)
backend/apps/owasp/management/commands/owasp_scrape_projects.py (1)
backend/apps/owasp/models/common.py (5)
get_audience
(176-189)get_leaders
(191-211)get_leaders_emails
(213-231)sync_leaders
(299-338)get_urls
(275-286)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Run backend tests
- GitHub Check: Run frontend e2e tests
- GitHub Check: Run frontend unit tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rudransh-shrivastava thanks for working on this
It's a good start towards GitHub as an entire data source for OWASP entity sync. I update code based on our model method conventions (please look into those changes).
I'm going to merge this and expect full scraper migration into GitHub entity sync code in a following PR.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
backend/tests/apps/owasp/models/common_test.py (1)
191-215
: Harden URL domain tests: add subdomain/case/port and lookalike-host scenarios.Covers regressions and security edge cases; mirrors prior review thread.
@pytest.mark.parametrize( ("content", "domain", "expected_urls"), [ @@ ("This test contains no URLs.", None, []), ("", None, []), (None, None, []), + ("* [Fake](https://maliciousowasp.org)", "owasp.org", []), + ("* [UK](https://owasp.org.uk)", "owasp.org", []), + ("* [Home](https://OWASP.org)", "OWASP.ORG", ["https://OWASP.org"]), + ("* [With port](https://owasp.org:443/path)", "owasp.org", ["https://owasp.org:443/path"]), + ("* [Sub](https://test.owasp.org/page)", "owasp.org", ["https://test.owasp.org/page"]), ], )
🧹 Nitpick comments (14)
backend/apps/owasp/migrations/0052_remove_entitymember_owasp_entit_member__6e516f_idx_and_more.py (2)
26-29
: Consider case-insensitive uniqueness for member_name.Names differ by case (“John Doe” vs “john doe”) will be treated as distinct. Prefer a UniqueConstraint on Lower(member_name) over unique_together for better dedup semantics.
Proposed follow-up migration:
from django.db import migrations, models from django.db.models.functions import Lower class Migration(migrations.Migration): dependencies = [("owasp", "0052_remove_entitymember_owasp_entit_member__6e516f_idx_and_more")] operations = [ migrations.AlterUniqueTogether(name="entitymember", unique_together=set()), migrations.AddConstraint( model_name="entitymember", constraint=models.UniqueConstraint( expressions=["entity_type","entity_id",Lower("member_name"),"role"], name="uniq_entity_member_name_role_ci", ), ), ]
13-16
: Validate dropping index owasp_entit_member__6e516f_idx won’t regress common queries.If that index covered “member” lookups, admin searches or syncs might slow down. Recreate a targeted index if needed (e.g., on ["entity_type","entity_id","role"] or ["member"]).
backend/apps/owasp/admin/entity_member.py (4)
36-41
: Add member_name to search_fields.Enables searching by the stored name when no linked GitHub user exists.
search_fields = ( "member__login", "member__name", + "member_name", "description", )
35-35
: Drop raw_id_fields when using autocomplete_fields for the same relation.They’re alternatives; keeping both is redundant and can confuse admins.
- raw_id_fields = ("member",) + # raw_id_fields not needed; using autocomplete_fields above.
41-41
: Order by member_name as a fallback.When member is NULL, add member_name to ordering to keep lists stable.
- ordering = ("member__name", "order") + ordering = ("member__name", "member_name", "order")
15-29
: Optional: reduce N+1 queries in list view.Select related member and content type.
class EntityMemberAdmin(admin.ModelAdmin): @@ ordering = ("member__name", "member_name", "order") + def get_queryset(self, request): + qs = super().get_queryset(request) + return qs.select_related("member", "entity_type")backend/tests/apps/owasp/models/common_test.py (2)
303-326
: Strengthen sync_leaders assertions; verify update vs. insert.Patch EntityMember.update_data and assert exact payloads and save=False calls; remove unused mocks.
- @patch("apps.owasp.models.common.ContentType") - @patch("apps.owasp.models.common.EntityMember") - @patch("apps.owasp.models.common.BulkSaveModel") + @patch("apps.owasp.models.common.ContentType") + @patch("apps.owasp.models.common.BulkSaveModel") + @patch("apps.owasp.models.common.EntityMember.update_data") def test_sync_leaders_mixed_scenario( - self, mock_bulk_save, mock_entity_member, mock_content_type + self, mock_update_data, mock_bulk_save, mock_content_type ): @@ - mock_entity_member.objects.filter.return_value = [existing_leader] - leaders_emails = { "John Doe": "[email protected]", # Update existing "Jane Smith": "[email protected]", # New leader } self.model.sync_leaders(leaders_emails) - call_args = mock_bulk_save.bulk_save.call_args - leaders_to_save = call_args[0][1] - - assert len(leaders_to_save) == 2 # Updated existing + new leader + # Two calls: update + insert + assert mock_update_data.call_count == 2 + # Bulk save with the 2 returned instances + assert mock_bulk_save.bulk_save.call_count == 1
286-299
: Remove unused filter mock in empty sync test.It’s not used by the implementation.
- mock_entity_member.objects.filter.return_value = [] - self.model.sync_leaders({})backend/apps/owasp/models/entity_member.py (6)
20-25
: Unique by member_name matches sync strategy; mind homonyms and case.This will merge same-name leaders under the same entity/role (e.g., two “Alex Kim”). If that’s acceptable per product requirements, proceed; otherwise consider augmenting with email when present or a case-insensitive constraint.
26-28
: Add a supporting index for common reads.Fetching leaders typically filters by (entity_type, entity_id, role) and orders by order. Add a composite index to speed those queries.
indexes = [ models.Index(fields=["entity_type", "entity_id"]), + models.Index(fields=["entity_type", "entity_id", "role", "order"]), ]
73-75
: Guard str against empty member_name.If member_name is accidentally blank, this returns a trailing “as … for …”. Provide a fallback.
- display_name = self.member.login if self.member else self.member_name + display_name = self.member.login if self.member else (self.member_name or "Unknown")
76-99
: Use update_or_create for atomic upsert and fewer race conditions.Prevents IntegrityError under concurrent writers and simplifies logic.
@staticmethod def update_data(data, *, save: bool = True) -> "EntityMember": - """Update entity member data.""" - try: - entity_member = EntityMember.objects.get( - entity_id=data["entity_id"], - entity_type=data["entity_type"], - member_name=data["member_name"], - role=data["role"], - ) - except EntityMember.DoesNotExist: - entity_member = EntityMember( - entity_id=data["entity_id"], - entity_type=data["entity_type"], - member_name=data["member_name"], - role=data["role"], - ) - - entity_member.from_dict(data) - if save: - entity_member.save() - - return entity_member + """Update entity member data.""" + defaults = { + "member_email": (data.get("member_email") or "").strip(), + "order": data.get("order", 0), + } + entity_member, _ = EntityMember.objects.update_or_create( + entity_id=data["entity_id"], + entity_type=data["entity_type"], + member_name=data["member_name"].strip(), + role=data["role"], + defaults=defaults, + ) + return entity_member
100-113
: Normalize inputs to avoid dupes from whitespace variance.Trim name/email before assignment.
def from_dict(self, data) -> None: """Update instance based on dict data.""" fields = { "entity_id": data["entity_id"], "entity_type": data["entity_type"], - "member_email": data.get("member_email", ""), - "member_name": data["member_name"], + "member_email": (data.get("member_email") or "").strip(), + "member_name": data["member_name"].strip(), "order": data.get("order", 0), "role": data["role"], }
26-28
: If admin relies on member lookups, consider re-adding an index on member.You removed the old index in the migration; add this if you filter/join by member frequently.
indexes = [ models.Index(fields=["entity_type", "entity_id"]), + models.Index(fields=["member"]), ]
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
backend/apps/owasp/admin/entity_member.py
(1 hunks)backend/apps/owasp/management/commands/owasp_scrape_chapters.py
(1 hunks)backend/apps/owasp/management/commands/owasp_scrape_committees.py
(1 hunks)backend/apps/owasp/management/commands/owasp_scrape_projects.py
(1 hunks)backend/apps/owasp/migrations/0052_remove_entitymember_owasp_entit_member__6e516f_idx_and_more.py
(1 hunks)backend/apps/owasp/models/common.py
(7 hunks)backend/apps/owasp/models/entity_member.py
(3 hunks)backend/tests/apps/owasp/models/common_test.py
(5 hunks)cspell/custom-dict.txt
(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- cspell/custom-dict.txt
🚧 Files skipped from review as they are similar to previous changes (4)
- backend/apps/owasp/management/commands/owasp_scrape_committees.py
- backend/apps/owasp/models/common.py
- backend/apps/owasp/management/commands/owasp_scrape_projects.py
- backend/apps/owasp/management/commands/owasp_scrape_chapters.py
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/entity_member.py:50-56
Timestamp: 2025-09-06T19:28:14.297Z
Learning: In the OWASP/Nest project, when migrating scraper logic to GitHub .md file parsing, the sync_leaders method uses member_name as the primary identifier for finding and updating existing EntityMember records, not member_email. This approach is chosen because names are more stable identifiers in markdown files, while emails might be added/updated over time.
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:213-232
Timestamp: 2025-09-06T19:03:01.985Z
Learning: In the OWASP Nest project, the get_leaders_emails() method in RepositoryBasedEntityModel is designed to only capture leaders with mailto: links from leaders.md files, intentionally ignoring plain text names without email addresses. The current regex implementation works correctly for the intended behavior as validated by comprehensive test cases.
📚 Learning: 2025-09-06T19:28:14.297Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/entity_member.py:50-56
Timestamp: 2025-09-06T19:28:14.297Z
Learning: In the OWASP/Nest project, when migrating scraper logic to GitHub .md file parsing, the sync_leaders method uses member_name as the primary identifier for finding and updating existing EntityMember records, not member_email. This approach is chosen because names are more stable identifiers in markdown files, while emails might be added/updated over time.
Applied to files:
backend/apps/owasp/models/entity_member.py
📚 Learning: 2025-07-28T14:51:14.736Z
Learnt from: adithya-naik
PR: OWASP/Nest#1894
File: frontend/src/components/TopContributorsList.tsx:74-74
Timestamp: 2025-07-28T14:51:14.736Z
Learning: In the OWASP/Nest project, the maintainer adithya-naik prefers not to create separate components for code that's only used in two specific cases, following the YAGNI principle to avoid over-engineering when the duplication is limited and manageable.
Applied to files:
backend/apps/owasp/models/entity_member.py
📚 Learning: 2025-08-16T13:23:34.874Z
Learnt from: ahmedxgouda
PR: OWASP/Nest#2021
File: backend/apps/slack/models/google_auth.py:25-36
Timestamp: 2025-08-16T13:23:34.874Z
Learning: In the OWASP Nest project, encryption and decryption for Google OAuth tokens stored in BinaryField are being handled in a separate PR, not in the initial Google Auth setup implementation.
Applied to files:
backend/apps/owasp/models/entity_member.py
📚 Learning: 2025-09-06T12:48:44.916Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:48:44.916Z
Learning: When migrating scraper logic to GitHub .md file parsing in the OWASP project, the new implementation should maintain the same case-sensitivity behavior as the original scraper logic, even if case-insensitive matching might seem more robust.
Applied to files:
backend/apps/owasp/models/entity_member.py
📚 Learning: 2025-09-06T12:57:58.021Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:0-0
Timestamp: 2025-09-06T12:57:58.021Z
Learning: When filtering URLs by domain in the OWASP project, use `urlparse(url).netloc == domain` instead of `domain in url` to avoid security vulnerabilities where malicious subdomains could bypass filtering (e.g., `testowasp.org` would match when filtering for `owasp.org`).
Applied to files:
backend/tests/apps/owasp/models/common_test.py
📚 Learning: 2025-09-06T19:03:01.985Z
Learnt from: rudransh-shrivastava
PR: OWASP/Nest#2223
File: backend/apps/owasp/models/common.py:213-232
Timestamp: 2025-09-06T19:03:01.985Z
Learning: In the OWASP Nest project, the get_leaders_emails() method in RepositoryBasedEntityModel is designed to only capture leaders with mailto: links from leaders.md files, intentionally ignoring plain text names without email addresses. The current regex implementation works correctly for the intended behavior as validated by comprehensive test cases.
Applied to files:
backend/tests/apps/owasp/models/common_test.py
🧬 Code graph analysis (2)
backend/apps/owasp/models/entity_member.py (4)
backend/apps/owasp/admin/entity_member.py (1)
entity
(52-65)backend/apps/owasp/models/committee.py (2)
update_data
(78-90)save
(54-59)backend/apps/owasp/models/chapter.py (2)
update_data
(195-217)save
(170-178)backend/apps/owasp/models/project.py (2)
update_data
(351-373)save
(326-331)
backend/tests/apps/owasp/models/common_test.py (1)
backend/apps/owasp/models/common.py (4)
get_audience
(176-189)get_leaders_emails
(213-231)get_urls
(275-286)sync_leaders
(299-327)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Run frontend e2e tests
🔇 Additional comments (2)
backend/apps/owasp/migrations/0052_remove_entitymember_owasp_entit_member__6e516f_idx_and_more.py (1)
26-29
: Add a RunPython data-migration to dedupe/normalize before enforcing the unique constraintThis AlterUniqueTogether will fail at migrate-time if any duplicate rows exist for (entity_type, entity_id, member_name, role) — including case/whitespace variants. Add a data-cleanup RunPython immediately before the AlterUniqueTogether to normalize member_name and remove duplicates, or verify/fix the data beforehand.
Example (place immediately before the AlterUniqueTogether):
def _normalize_name(name: str) -> str: return " ".join((name or "").split()) def dedupe_entity_members(apps, schema_editor): EntityMember = apps.get_model("owasp", "EntityMember") seen = set() for em in EntityMember.objects.all().only("id","entity_type_id","entity_id","role","member_name"): key = (em.entity_type_id, em.entity_id, em.role, _normalize_name(em.member_name).lower()) if key in seen: EntityMember.objects.filter(id=em.id).delete() else: seen.add(key) em.member_name = _normalize_name(em.member_name) em.save(update_fields=["member_name"])Then add:
migrations.RunPython(dedupe_entity_members, migrations.RunPython.noop),Location: backend/apps/owasp/migrations/0052_remove_entitymember_owasp_entit_member__6e516f_idx_and_more.py
backend/apps/owasp/admin/entity_member.py (1)
20-29
: Good: showing member_name in list_display.This aligns the admin with the new model fields.
I will work on it, thank you! |
* add properties and methods for github.md files * use Github .md files files instead of scraper for urls, leaders, and audience * Update code(add tests) * Update code * Update code * add get_leaders_emails() and add name, email to entity_member model * Update code * add entity_member/leaders sync * update regex * add tests * update code and fix sonar and cr issues * Update code * Update code * Fix cspell --------- Co-authored-by: Kate Golovanova <[email protected]> Co-authored-by: Arkadii Yakovets <[email protected]>
Resolves #2088
Proposed change
migrate scraper logic of
audience
,leaders
andurls
to Github .md filesChecklist
make check-test
locally; all checks and tests passed.