
@Anexen Anexen commented Jun 26, 2020

Pip can check downloaded package archives against local hashes to protect against remote tampering.

The problem: pip install --require-hashes ... doesn't work with a private repository.
Solution: this PR includes the sha256 hash in the package URL, as allowed by PEP 503.
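For reference, PEP 503 allows the file URL in a simple-index page to carry a `#sha256=<hex digest>` fragment, which pip checks under `--require-hashes`. A minimal sketch of building such an anchor (`package_anchor` is a hypothetical helper for illustration, not code from this PR):

```python
import hashlib


def package_anchor(filename: str, data: bytes) -> str:
    """Build a PEP 503 simple-index anchor whose URL carries a
    "#sha256=<hex digest>" fragment for pip's hash checking."""
    digest = hashlib.sha256(data).hexdigest()
    return f'<a href="{filename}#sha256={digest}">{filename}</a>'
```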

@naiquevin (Contributor)

@Anexen : Thanks for the PR. I've finally found some time to work on this project, but I'll look at this PR a bit later because:

  1. I'm not quite familiar with the --require-hashes functionality. I will read up a bit before reviewing.
  2. I am currently merging the older PRs, and one of them adds support for Azure. I can see that the Storage.read_contents method you've introduced raises a NotImplementedError, which will break the Azure implementation. I'm also wondering if the hashes functionality can be provided optionally instead of by default.

Anexen (Author) commented Jun 29, 2020

Hi @naiquevin

Thanks for the feedback. My plan is to implement Storage.read_contents for Azure and make hashes functionality configurable via .pypi-private.cfg.

naiquevin (Contributor) commented Jun 29, 2020

@Anexen Thanks. I have a concern about downloading the content for the checksum while updating the package index: it's not scalable, i.e. as the number of artifacts for a package increases, the time taken by update_pkg_index will keep increasing.

An alternative is to store the hash in the object metadata while uploading the file (PUT operation). As per the S3 docs, once an object is uploaded, its metadata cannot be modified. In fact, one of the system-generated metadata fields on every S3 object is Content-MD5 (the base64-encoded 128-bit MD5 digest of the object), which we might be able to use as well.
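For illustration, the Content-MD5 value S3 records can be reproduced locally, and a custom sha256 entry could be prepared as user-defined metadata at upload time (e.g. via boto3's put_object Metadata parameter). A sketch, assuming the metadata key name "sha256" (hypothetical, not from this PR):

```python
import base64
import hashlib

data = b"example wheel contents"

# Content-MD5 as S3 stores it: the base64-encoded 128-bit MD5 digest.
content_md5 = base64.b64encode(hashlib.md5(data).digest()).decode("ascii")

# A user-defined sha256 entry that could be attached at upload time;
# S3 would expose it back with an "x-amz-meta-" prefix.
metadata = {"sha256": hashlib.sha256(data).hexdigest()}
```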

Regarding other storage backends:

  1. DigitalOcean Spaces: DO Spaces has an S3-compatible API, so it would have something similar (not verified).
  2. Azure Blob Storage also seems to have a similar concept of metadata.

I think it'd be worthwhile to evaluate this option. There'd also be no need to add the read_content method to the Storage classes. What do you think?

Anexen (Author) commented Jun 30, 2020

@naiquevin, using metadata is a good idea. I can replace the read_content method with something like get_content_hash.

Regarding LocalStorage, there are extended file attributes which can be used as metadata storage on supported filesystems, with a fallback to .sha256 files (this is better than reading the file and recalculating the hash from scratch every time the index is updated).
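A sketch of that fallback idea, assuming Linux-style os.setxattr/os.getxattr (store_hash and load_hash are hypothetical names, not from this PR):

```python
import os

XATTR_KEY = b"user.sha256"  # user-namespace extended attribute


def store_hash(path: str, digest: str) -> None:
    """Persist the digest in an extended attribute when the platform
    and filesystem support it, else in a sidecar ".sha256" file."""
    try:
        os.setxattr(path, XATTR_KEY, digest.encode("ascii"))
    except (OSError, AttributeError):  # unsupported fs or non-Linux
        with open(path + ".sha256", "w") as f:
            f.write(digest)


def load_hash(path: str) -> str:
    """Read the digest back from the xattr or the sidecar file."""
    try:
        return os.getxattr(path, XATTR_KEY).decode("ascii")
    except (OSError, AttributeError):
        with open(path + ".sha256") as f:
            return f.read()
```

Either way the index update only reads a small stored value instead of re-hashing the whole archive.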

@naiquevin (Contributor)

Good point about LocalStorage, I missed it completely. I am wondering whether hash generation can be made part of the Storage interface itself. Then each Storage can decide how it wants to manage hashes, so it will be flexible for future storage backends too. What do you think?

Anexen (Author) commented Jul 1, 2020

This is exactly what I proposed in my last comment: a get_content_hash (or similar) method should be defined on the Storage interface so that each storage can decide how it wants to manage hashes.
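The interface being discussed might look roughly like this (InMemoryStorage is a made-up backend for illustration; the project's actual class names and signatures may differ):

```python
import hashlib
from abc import ABC, abstractmethod


class Storage(ABC):
    @abstractmethod
    def get_content_hash(self, path: str) -> str:
        """Return the hex sha256 digest of a stored artifact.

        Each backend decides how: S3 object metadata, extended
        file attributes, sidecar files, etc.
        """


class InMemoryStorage(Storage):
    """Toy backend that hashes on demand; real backends would
    read a precomputed hash from metadata instead."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, path: str, data: bytes) -> None:
        self._blobs[path] = data

    def get_content_hash(self, path: str) -> str:
        return hashlib.sha256(self._blobs[path]).hexdigest()
```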

@naiquevin (Contributor)

Oh right, I misread it, my bad! get_content_hash sounds good.

@jarojasm95

Hi @Anexen , any update here?
