A simple Pachyderm pipeline template that downloads a Hugging Face dataset or model into a repo.
- `name` - Pipeline name
- `secretName` - Kubernetes secret with the Hugging Face token
- `type` - The type of download: `model` or `dataset`
- `hf_name` - Name of the Hugging Face dataset or model to download
- `revision` - (Optional) a specific revision to download
- `disable_progress` - (Optional) set to `true` to disable progress logging
- `allow_patterns` - (Optional) comma-separated allow patterns for the download, e.g. `"data/*,*.json"`
- `ignore_patterns` - (Optional) comma-separated ignore patterns for the download, e.g. `"data/*,*.json"`
- Create a secret in your Pachyderm Kubernetes namespace containing a read-only Hugging Face token:

      kubectl create secret generic hugging-face-token --from-literal HF_HOME=<token>

- Create the pipeline from the template in this repo using pachctl:

      pachctl create pipeline \
        --jsonnet https://raw.githubusercontent.com/tybritten/hf-dataset-downloader/main/dataset-downloader.jsonnet \
        --arg name="hf-downloader-CodeAlpaca_20k" --arg hf_name="HuggingFaceH4/CodeAlpaca_20K" --arg secretName=hugging-face-token

- This creates a cron pipeline with a spec of never, so it only runs when triggered manually. To run it:

      pachctl cron run hf-downloader-CodeAlpaca_20k
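
The optional parameters should be passable the same way as the required ones, as extra `--arg` flags. The following is a dry-run sketch of a model download rather than a tested invocation. The pipeline name, the model name `bert-base-uncased`, the revision `main`, and the pattern list are illustrative values not taken from this repo. It also assumes the template accepts `type`, `revision`, `allow_patterns`, and `disable_progress` exactly as named in the parameter list.

```shell
# Dry-run sketch: assemble the pachctl arguments and print the command.
# All values below except the template URL and secret name are illustrative.
template_url="https://raw.githubusercontent.com/tybritten/hf-dataset-downloader/main/dataset-downloader.jsonnet"

# Store the arguments as positional parameters so quoting is preserved.
set -- \
  --jsonnet "$template_url" \
  --arg name="hf-downloader-bert-base-uncased" \
  --arg secretName="hugging-face-token" \
  --arg type="model" \
  --arg hf_name="bert-base-uncased" \
  --arg revision="main" \
  --arg allow_patterns="*.json,*.safetensors" \
  --arg disable_progress="true"

# Print the command for review; remove `echo` to actually create the pipeline.
echo pachctl create pipeline "$@"
```

The patterns are quoted so the shell passes them through literally instead of expanding the globs against local files.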