Implement tar optimization and dynamic exclusions (Issues #3, #4, #5, #6, #8) #11
…zeros

- Set `--blocking-factor=1` to prevent extra zero-padding blocks
- Add `--sparse` flag to consolidate runs of zeros in input files
- Add `--label` parameter with run basename for identification
- Pipe tar output through `head --bytes -1024` to trim EOF blocks before gzip
- Update comments and documentation to reflect that `--ignore-zeros` is no longer needed
- Replace `--gzip` with manual gzip after EOF trimming for proper concatenation

These changes implement the solution described in issue #8 to create efficiently concatenatable tarballs that can be extracted with standard tar commands without requiring `--ignore-zeros`.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-authored-by: Chris Tomkins-Tinch <[email protected]>
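The flag combination above can be sketched end to end. GNU tar, GNU coreutils `head`, and `gzip` are assumed, and the run directory path here is illustrative, not the script's actual layout:

```shell
set -euo pipefail

# Build a tiny throwaway "run" directory (path is illustrative)
RUN_DIR="$(mktemp -d)/240101_M00001_0001_000000000-ABCDE"
mkdir -p "${RUN_DIR}"
echo "demo" > "${RUN_DIR}/RunInfo.xml"
RUN_BASENAME="$(basename "${RUN_DIR}")"

# --blocking-factor=1 : no zero-padding beyond the two 512-byte EOF blocks
# --sparse            : consolidate runs of zeros in input files
# --label             : embed the run name as an in-band volume label
# head --bytes=-1024  : strip the two EOF blocks so archives can be concatenated
tar --create \
    --blocking-factor=1 \
    --sparse \
    --label="${RUN_BASENAME}" \
    --directory="$(dirname "${RUN_DIR}")" \
    "${RUN_BASENAME}" \
  | head --bytes=-1024 \
  | gzip > "/tmp/${RUN_BASENAME}.partial.tar.gz"
```

Per the follow-up commit below, the final incremental tarball in a series must keep its EOF blocks (i.e. skip the `head` step) so the concatenated whole ends with a valid end-of-archive marker.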
do not trim EOF blocks from final incremental tarball that mops up the remaining changed files
…ia in-band tar labels, plus a verbose json file alongside the README.txt in the destination bucket
…her than just truncating it
migrate Google Cloud CLI calls from `gsutil` -> `gcloud storage`. Mostly a drop-in replacement, except for `gsutil compose` which is now `gcloud storage objects compose`.
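As a rough illustration of that mapping, the helper below only builds command strings (no `gcloud` invocation is made); the verb table is a simplified assumption, not an exhaustive list:

```shell
# Map a gsutil verb to its gcloud storage equivalent (illustrative subset).
# Most verbs are a drop-in rename; `compose` moves under the `objects` group.
migrate() {
  case "$1" in
    compose) echo "gcloud storage objects compose" ;;  # not a drop-in rename
    *)       echo "gcloud storage $1" ;;               # cp, rsync, rm, ls, ...
  esac
}

migrate cp
migrate compose
```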
…tination (store in `FINAL_TARBALL_PATH`)
…e more involved and hopefully cross-platform (GNU/Linux vs macOS/BSD)
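The commit message is truncated, so the exact portability fix is unknown; one common GNU/Linux vs macOS/BSD divergence is `stat`, which a hypothetical wrapper like this can paper over:

```shell
# Return a file's modification time as a Unix epoch, on either stat flavor.
# GNU stat accepts --version (exit 0); BSD/macOS stat does not.
file_mtime() {
  if stat --version >/dev/null 2>&1; then
    stat -c %Y "$1"        # GNU coreutils
  else
    stat -f %m "$1"        # BSD / macOS
  fi
}

touch /tmp/mtime_probe
file_mtime /tmp/mtime_probe
```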
…c data from run folder tarballs
use dynamic exclusions to avoid tarball bloat by discouraging tar from including multiple snapshots of large in-progress files; this addresses #4
Claude finished @tomkinsc's task

Pull Request Review

I've completed a comprehensive review of this PR that addresses multiple issues (#3–#8) in the sequence upload system.
Summary

This is an excellent PR that significantly improves the robustness and functionality of the sequence upload system. The changes are well-implemented with proper error handling, cross-platform compatibility, and comprehensive documentation.

Code Quality & Best Practices ⭐

Strengths:
Best Practices Followed:
Performance Considerations ⚡

Positive Changes:
Areas to Monitor:
Security Assessment 🔒

Security Strengths:
Minor Security Considerations:
Potential Bugs & Issues 🐛

Minor Issues:
Potential Race Conditions:
Configuration & Environment ⚙️

Excellent additions:
Test Coverage Considerations 🧪

Missing Test Coverage:
Recommendations:
Documentation Quality 📚

Excellent documentation:
Migration & Breaking Changes 🔄

Well handled:
Overall Assessment

This is a high-quality PR that significantly improves the system's robustness, performance, and maintainability. The implementation demonstrates strong bash scripting skills and attention to production concerns like error handling, cross-platform compatibility, and performance optimization.

Recommendation: APPROVE ✅

The benefits far outweigh the minor issues identified, and the improvements address real production pain points around tarball bloat and upload reliability.
NB: This PR is a trial run of using Claude Code for GitHub-interactive changes. While the code changes here have been reviewed (quickly) by human eyes, and output from new functions has been checked (even more quickly) following local execution by Claude, the revised scripts should be executed, in full, on real sequencing data before this PR is merged. The revised scripts should be tested by uploading to a bucket a full sequencing run's worth of real data that have been "written" (copied) over time to a test directory by …
Summary
This PR implements significant improvements to the sequence upload system to address tar optimization (Issue #8), enhanced tar labeling (Issue #6), tarball bloat prevention (Issue #4), gsutil to gcloud storage migration (Issue #3), and Terra TSV creation (Issue #5). The changes include comprehensive dependency checking, dynamic exclusion systems, and parameterized configuration options.
Key Changes
🎯 Issue #4: Tarball Bloat Prevention
- Increased `DELAY_BETWEEN_INCREMENTS_SEC` from 30 to 600 seconds to reduce snapshot frequency and minimize tarball bloat from partial `*.cbcl` files
- Added a `generate_dynamic_exclusions()` function that automatically excludes `Data/Intensities/BaseCalls/L*/C###.#` to prevent capturing partial `*.cbcl` files
- `sync` + 10-second wait before final tarball to ensure all writes are committed

🏷️ Issue #6: Enhanced Tar Labeling
🔧 Issue #8: Tar Optimization
- `--blocking-factor=1`, `--sparse`, and proper labeling for efficient concatenation

☁️ Issue #3: gsutil to gcloud storage Migration
📊 Issue #5: Terra TSV Creation
- Table name configurable via `TERRA_RUN_TABLE_NAME`

🔧 Infrastructure Improvements
- Exclusions configurable via the `TAR_EXCLUSIONS` environment variable

Configuration Changes
New Environment Variables
TAR_EXCLUSIONS
- Space-separated list of directories to exclude from tar archivesTERRA_RUN_TABLE_NAME
- Table name for Terra TSV file generationUpdated Defaults
DELAY_BETWEEN_INCREMENTS_SEC
- Changed from 30 to 600 secondsTechnical Details
Dynamic Exclusion Logic
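A minimal sketch of what a dynamic-exclusion generator can look like. The function name matches the PR, but the body and the newest-cycle-per-lane heuristic are illustrative assumptions, not the PR's exact code:

```shell
# Write tar exclusion patterns for the most recently modified Illumina
# cycle directory (C<cycle>.<read>) in each lane; these directories may
# still be mid-write and would otherwise snapshot partial *.cbcl files
# into every incremental tarball. Output is consumed via --exclude-from.
generate_dynamic_exclusions() {
  local run_dir="$1" exclusion_file="$2"
  : > "${exclusion_file}"
  for lane in "${run_dir}"/Data/Intensities/BaseCalls/L*/; do
    [ -d "${lane}" ] || continue
    # newest cycle directory first; keep only that one
    ls -td "${lane}"C*.*/ 2>/dev/null | head -n 1 >> "${exclusion_file}"
  done
}

# Demo on a throwaway directory (layout is illustrative)
demo="$(mktemp -d)"
mkdir -p "${demo}/Data/Intensities/BaseCalls/L001/C1.1"
generate_dynamic_exclusions "${demo}" "${demo}/tar_exclusions.txt"
```

The resulting file would then be passed as `tar --create --exclude-from="${demo}/tar_exclusions.txt" …`.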
Tar Command Enhancement
- Replaced static `--exclude` flags with dynamic exclusions (`--exclude-from`)

Metadata Generation
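A hedged sketch of a metadata sidecar writer like the one the PR describes (a verbose JSON file alongside README.txt in the destination bucket). The helper and its field names are illustrative assumptions; the actual schema lives in the revised upload script:

```shell
# Write a small JSON metadata sidecar for a run (field names illustrative)
write_metadata_json() {
  local run_name="$1" out_file="$2"
  cat > "${out_file}" <<EOF
{
  "run_name": "${run_name}",
  "created_utc": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "tar_label": "${run_name}",
  "uploader_host": "${HOSTNAME:-unknown}"
}
EOF
}

write_metadata_json "240101_M00001_0001_000000000-ABCDE" /tmp/run_metadata.json
```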
Testing Considerations
Files Modified
incremental_illumina_upload_to_gs.sh
- Main upload script with all enhancementsmonitor_runs.sh
- Updated with dependency checking and consistent cron detectionCLAUDE.md
- Updated documentation reflecting all changes.claude/
- Claude Code integration configurationBreaking Changes
None. All changes are backward compatible with existing deployments.
🤖 Generated with Claude Code
Co-Authored-By: Claude [email protected]