Conversation

zahidblackduck
Collaborator

@zahidblackduck zahidblackduck commented Aug 28, 2025

JIRA Ticket
IDETECT-4845

Description
This merge request implements parsing for dependencies declared via PEP 508-compliant URIs in pyproject.toml, so that versions can be extracted from wheel files, archives, and VCS refs.

Example of supported PEP 508 URI dependency:

[project]
dependencies = [
    "torch @ https://download.pytorch.org/whl/cpu/torch-2.6.0%2Bcpu-cp310-cp310-linux_x86_64.whl",
    "torchvision @ https://download.pytorch.org/whl/cpu/torchvision-0.21.0%2Bcpu-cp310-cp310-linux_x86_64.whl",
    "flask @ git+https://github.com/pallets/[email protected]",
    "pip @ https://github.com/pypa/pip/archive/1.3.1.zip#sha1=da9234ee9982d4bbb3c72346a6de940a148ea686",
]
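
For illustration, a minimal sketch of how an entry like the ones above could be split into its name and URI halves before any version extraction. This is not the PR's implementation: the class DirectReferenceSketch, its nested DirectReference holder, and the parse method are hypothetical names, and the handling of extras and environment markers is a simplification of PEP 508.

```java
import java.util.Optional;

final class DirectReferenceSketch {

    /** Hypothetical holder for the two halves of a "name @ uri" requirement. */
    static final class DirectReference {
        final String name;
        final String uri;

        DirectReference(String name, String uri) {
            this.name = name;
            this.uri = uri;
        }
    }

    /**
     * Splits on the first '@'. A package name cannot contain '@', so any later '@'
     * (for example the ref separator in a VCS URL) stays inside the URI.
     */
    static Optional<DirectReference> parse(String requirement) {
        int at = requirement.indexOf('@');
        int semi = requirement.indexOf(';');
        // No '@', or the '@' only appears inside an environment marker: not a direct reference.
        if (at < 0 || (semi >= 0 && semi < at)) {
            return Optional.empty();
        }
        String name = requirement.substring(0, at).trim();
        // Drop an extras suffix such as "[extra]" if present.
        int bracket = name.indexOf('[');
        if (bracket >= 0) {
            name = name.substring(0, bracket).trim();
        }
        String uri = requirement.substring(at + 1).trim();
        // An environment marker after a URL is introduced by whitespace followed by ';'.
        int marker = uri.indexOf(" ;");
        if (marker >= 0) {
            uri = uri.substring(0, marker).trim();
        }
        if (name.isEmpty() || uri.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(new DirectReference(name, uri));
    }

    public static void main(String[] args) {
        parse("pip @ https://github.com/pypa/pip/archive/1.3.1.zip#sha1=da9234ee9982d4bbb3c72346a6de940a148ea686")
            .ifPresent(ref -> System.out.println(ref.name + " -> " + ref.uri));
    }
}
```

Splitting on the first '@' rather than the last keeps a VCS ref attached to the URI, where a later step can look for a version in it.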

Impact Areas

  • Setuptools Pip Detector (Build) now extracts name + version for direct references installed via pip install using pyproject.toml, with graceful fallback to the next detector if the version is missing or malformed, or the URI is invalid.
  • Setuptools Detector (Buildless) parses name + version directly from pyproject.toml. Reports name-only when version is unavailable.
  • UV Lock Detector extracts name + version correctly if requirements.txt is present and contains direct references. Falls back to name-only if the version is missing. This detector doesn't parse dependency information from pyproject.toml.
  • Poetry Lock Detector is unaffected. It continues to parse names from pyproject.toml and versions from poetry.lock.
  • UV CLI Detector is also unaffected. It continues to parse and report name + version via the existing uv tree command used by the current implementation.

Notes

  • pip show and uv tree are confirmed read-only and safe (no reverse shell injection risk).

@zahidblackduck zahidblackduck self-assigned this Aug 28, 2025
@zahidblackduck zahidblackduck marked this pull request as draft August 28, 2025 13:54
@zahidblackduck zahidblackduck changed the title SPIKE: Explore Python Package Version Extraction from URIs in pyproject.toml (PEP508) SPIKE: Python Package Version Extraction from URIs in pyproject.toml (PEP508) Sep 4, 2025
@zahidblackduck zahidblackduck changed the title SPIKE: Python Package Version Extraction from URIs in pyproject.toml (PEP508) Python Package Version Extraction from URIs in pyproject.toml (PEP508) Sep 11, 2025
@zahidblackduck zahidblackduck marked this pull request as ready for review September 11, 2025 18:01
@zahidblackduck zahidblackduck requested review from Copilot, dterrybd, andrian-sevastyanov and devmehtabd and removed request for dterrybd September 11, 2025 18:02
@zahidblackduck zahidblackduck assigned shantyk and unassigned shantyk Sep 11, 2025
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR implements parsing for PEP 508 compliant URI dependencies in pyproject.toml files to extract both name and version information from direct references including wheel files, archives, and VCS repositories.

  • Adds regex-based version extraction from URLs in PythonDependencyTransformer
  • Enhances pyproject.toml parsing to handle direct URI dependencies alongside traditional version constraints
  • Includes comprehensive test coverage for various dependency formats

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File | Description
PythonDependencyTransformer.java | Adds PEP 508 URI parsing logic with regex patterns for extracting versions from URLs
PythonDependencyTransformerTest.java | New test file covering normal dependencies and PEP 508 URI formats
PyprojectTomlParserTest.java | New test file validating complex pyproject.toml parsing with mixed dependency types

Comment on lines 19 to 21
private static final List<String> TOKEN_IGNORE_AFTER_CHARS = Arrays.asList(",", "[", "==", ">=", "~=", "<=", ">", "<");
private static final Pattern URI_VERSION_PATTERN = Pattern.compile(".*/([A-Za-z0-9_.-]+)-([0-9]+(?:\\.[0-9A-Za-z_-]+)*).*\\.(whl|zip|tar\\.gz|tar\\.bz2|tar)$");
private static final Pattern VCS_VERSION_PATTERN = Pattern.compile(".*@([0-9]+(?:\\.[0-9]+)*(?:[A-Za-z0-9._-]*)?).*");

Copilot AI Sep 11, 2025

These regex patterns are complex and lack documentation. Consider adding inline comments explaining what each pattern matches and providing examples of URLs they're designed to parse.

Suggested change
private static final List<String> TOKEN_IGNORE_AFTER_CHARS = Arrays.asList(",", "[", "==", ">=", "~=", "<=", ">", "<");
// Matches package filenames in URIs, extracting the package name and version.
// Example: https://files.pythonhosted.org/packages/.../requests-2.25.1-py2.py3-none-any.whl
// Captures: "requests" as name, "2.25.1" as version
private static final Pattern URI_VERSION_PATTERN = Pattern.compile(".*/([A-Za-z0-9_.-]+)-([0-9]+(?:\\.[0-9A-Za-z_-]+)*).*\\.(whl|zip|tar\\.gz|tar\\.bz2|tar)$");
// Matches VCS (Version Control System) URIs with an @version suffix.
// Example: git+https://github.com/psf/[email protected]
// Captures: "v2.25.1" as version
private static final Pattern VCS_VERSION_PATTERN = Pattern.compile(".*@([0-9]+(?:\\.[0-9]+)*(?:[A-Za-z0-9._-]*)?).*");
// Matches archive or release URLs, extracting the version from the path.
// Example: https://github.com/psf/requests/archive/2.25.1.zip
// Captures: "2.25.1" as version
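
To make the behaviour of the two quoted patterns concrete, here is a small sketch that copies URI_VERSION_PATTERN and VCS_VERSION_PATTERN as they appear in this thread and tries them in order. The class UriVersionSketch, the extractVersion helper, and the example.com URLs are hypothetical; the merged code may be structured differently and also uses an ARCHIVE_VERSION_PATTERN that is not shown here.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UriVersionSketch {
    // Both patterns are copied from the lines quoted in this review thread.
    private static final Pattern URI_VERSION_PATTERN = Pattern.compile(
        ".*/([A-Za-z0-9_.-]+)-([0-9]+(?:\\.[0-9A-Za-z_-]+)*).*\\.(whl|zip|tar\\.gz|tar\\.bz2|tar)$");
    private static final Pattern VCS_VERSION_PATTERN = Pattern.compile(
        ".*@([0-9]+(?:\\.[0-9]+)*(?:[A-Za-z0-9._-]*)?).*");

    /**
     * Hypothetical helper: the first pattern that matches wins.
     * An empty result stands for "report the name only".
     */
    static Optional<String> extractVersion(String uri) {
        Matcher fileMatcher = URI_VERSION_PATTERN.matcher(uri);
        if (fileMatcher.find()) {
            return Optional.of(fileMatcher.group(2)); // group 2 is the version part of name-version
        }
        Matcher vcsMatcher = VCS_VERSION_PATTERN.matcher(uri);
        if (vcsMatcher.find()) {
            return Optional.of(vcsMatcher.group(1)); // group 1 is the text after '@'
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Wheel URL taken from the PR description.
        System.out.println(extractVersion(
            "https://download.pytorch.org/whl/cpu/torch-2.6.0%2Bcpu-cp310-cp310-linux_x86_64.whl"));
        // Hypothetical VCS URLs, with and without a leading "v" in the ref.
        System.out.println(extractVersion("git+https://example.com/org/project.git@1.2.3"));
        System.out.println(extractVersion("git+https://example.com/org/project.git@v1.2.3"));
    }
}
```

With the patterns as quoted, the torch wheel yields 2.6.0 (the URL-encoded +cpu local segment is not part of the capture), the numeric ref yields 1.2.3, and the ref with a leading v does not match because the capture must begin with a digit.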

Comment on lines +96 to +99
Matcher matcher = URI_VERSION_PATTERN.matcher(uri);
if (matcher.find()) {
    return matcher.group(2);
}

Copilot AI Sep 11, 2025

The magic number 2 refers to the second capture group. Consider using a named constant like VERSION_GROUP_INDEX = 2 to make the code more self-documenting.
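
A minimal sketch of what this suggestion could look like, next to one alternative. VERSION_GROUP_INDEX is the constant name proposed in the comment; the class GroupIndexSketch, the methods byIndex and byName, and the named-group variant of the pattern are assumptions added for illustration and are not part of the PR.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupIndexSketch {
    // Pattern copied from the diff under review.
    private static final Pattern URI_VERSION_PATTERN = Pattern.compile(
        ".*/([A-Za-z0-9_.-]+)-([0-9]+(?:\\.[0-9A-Za-z_-]+)*).*\\.(whl|zip|tar\\.gz|tar\\.bz2|tar)$");

    // Option 1: the named constant proposed in the review comment.
    private static final int VERSION_GROUP_INDEX = 2;

    // Option 2 (assumption, not in the PR): the same pattern with named capture groups,
    // so the call site does not depend on group positions at all.
    private static final Pattern NAMED_URI_VERSION_PATTERN = Pattern.compile(
        ".*/(?<name>[A-Za-z0-9_.-]+)-(?<version>[0-9]+(?:\\.[0-9A-Za-z_-]+)*).*\\.(whl|zip|tar\\.gz|tar\\.bz2|tar)$");

    /** Returns the version captured by index, or null when the URI does not match. */
    static String byIndex(String uri) {
        Matcher matcher = URI_VERSION_PATTERN.matcher(uri);
        return matcher.find() ? matcher.group(VERSION_GROUP_INDEX) : null;
    }

    /** Returns the version captured by group name, or null when the URI does not match. */
    static String byName(String uri) {
        Matcher matcher = NAMED_URI_VERSION_PATTERN.matcher(uri);
        return matcher.find() ? matcher.group("version") : null;
    }

    public static void main(String[] args) {
        String uri = "https://download.pytorch.org/whl/cpu/torch-2.6.0%2Bcpu-cp310-cp310-linux_x86_64.whl";
        System.out.println(byIndex(uri));
        System.out.println(byName(uri));
    }
}
```

The constant keeps the pattern untouched, while named groups survive later edits that add or reorder groups; either addresses the readability concern raised here.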

Comment on lines +102 to +105
Matcher vcsMatcher = VCS_VERSION_PATTERN.matcher(uri);
if (vcsMatcher.find()) {
    return vcsMatcher.group(1);
}

Copilot AI Sep 11, 2025

The magic number 1 refers to the first capture group. Consider using a named constant like VCS_VERSION_GROUP_INDEX = 1 to make the code more self-documenting.

Comment on lines +108 to +111
Matcher archiveMatcher = ARCHIVE_VERSION_PATTERN.matcher(uri);
if (archiveMatcher.find()) {
    return archiveMatcher.group(1);
}

Copilot AI Sep 11, 2025

The magic number 1 refers to the first capture group. Consider using a named constant like ARCHIVE_VERSION_GROUP_INDEX = 1 to make the code more self-documenting.

class PythonDependencyTransformerTest {

    @Test
    void testTransformLine() {
Contributor

Great to see unit tests for this complex parsing.

This single test covers many cases. It would be nicer to implement it with @ParameterizedTest. Otherwise, if the test ever starts to fail, only the first failing assertion is reported. With a parameterized test, each case is reported separately, so it is clearer which other kinds of input are broken.

Collaborator Author

@zahidblackduck zahidblackduck Sep 16, 2025

That's a great suggestion. I've updated the test with @ParameterizedTest.
