
share tempfiles when downloading from remotes #6584

@philfry

Description

Is your feature request related to a problem? Please describe.
As already mentioned on discourse.pulpproject.org, I'm using pulp/pulp_rpm to provide a couple of repositories to hundreds of hosts (headless servers). To save disk space (I really don't want to mirror all the desktop packages for nothing), the repositories are set to on_demand.

Now, when these hosts perform their daily package updates, they request an RPM, say "linux-firmware", which is around 500 MiB. Pulp retrieves this file from the remote repository, streams it to the client and to a temporary file, and eventually saves the tempfile as an artifact in the repository.

Unfortunately, 10 to 20 hosts ask for that file at roughly the same time, causing 10 to 20 downloads from the remote repository (which is bad for the remote repo) and 10 to 20 tempfiles, wasting something like 5 GiB to 10 GiB of space (which is bad for my Pulp server).

Describe the solution you'd like
As Pulp knows which file is requested from which remote, I'd suggest something like this:
When a client asks for a file that is not yet saved as an artifact, and thus needs to be retrieved, then

  • check Redis whether the requested file is already queued
  • if not: create a tempfile, save the tuple (repo, requested file, tempfile) to Redis, retrieve the remote file into the tempfile, stream the tempfile content to the client, save the tempfile as an artifact, and remove the tuple from Redis
  • if it is already queued (i.e. repo + requested file are found in Redis): just hook onto the existing tempfile and send it to the client, then forget about it and let the first process that's retrieving the file handle the Redis and move-tempfile-to-repo steps

This would significantly reduce both the disk space needed in the working directory and the network traffic (see the sketch below).
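To make the idea concrete, here is a minimal sketch of the coordination I have in mind, not a proposal for Pulp's actual internals. It assumes workers share a filesystem for the working directory and uses the redis-py client; the helpers `fetch_from_remote`, `stream_to_client` and `save_as_artifact` are placeholders for whatever the content app really calls.

```python
import os
import time
import tempfile

import redis

r = redis.Redis()

CHUNK = 1024 * 1024


def deliver(remote_url, relative_path, fetch_from_remote, stream_to_client, save_as_artifact):
    """Serve one on_demand file, sharing a single download across concurrent requests."""
    key = f"inflight:{remote_url}:{relative_path}"
    tmp = tempfile.NamedTemporaryFile(delete=False)

    # SET NX is atomic: exactly one worker "wins" and becomes the downloader.
    if r.set(key, tmp.name, nx=True, ex=3600):
        try:
            # Leader: fetch from the remote once, feeding both the client and the tempfile.
            for chunk in fetch_from_remote(remote_url, relative_path):
                tmp.write(chunk)
                tmp.flush()
                stream_to_client(chunk)
            tmp.close()
            save_as_artifact(tmp.name, relative_path)
        finally:
            r.delete(key)  # tell followers the download is complete
        return

    # Follower: discard our unused tempfile and read from the leader's instead.
    os.unlink(tmp.name)
    shared_path = r.get(key)
    if shared_path is None:
        # Leader already finished and cleaned up; fall back to the normal path (omitted here).
        return
    with open(shared_path.decode(), "rb") as fh:
        while True:
            chunk = fh.read(CHUNK)
            if chunk:
                stream_to_client(chunk)
                continue
            if r.get(key) is None:
                # Leader is done; drain anything written after our last read, then stop.
                tail = fh.read()
                if tail:
                    stream_to_client(tail)
                break
            time.sleep(0.1)  # wait for the leader to write more data
```

The main open question is cleanup when the leader dies mid-download; the TTL on the key (as above) would at least let followers time out and fall back to their own download.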

Describe alternatives you've considered

  1. Pre-seeding the repos using dummy machines that start their package updates before all others. This is fine for package updates, but it happens that I install dozens of hosts at once, and they fill up the Pulp server with dozens of copies of all the packages required for an installation.
  2. Prefetching all packages that are required for an installation right after syncing the repos. But that would mean fetching fresh package lists from all hosts, including packages that are seldom updated and maybe never used on new installations. Also: unnecessary traffic for the remote repos.
