Skip to content

Conversation

tomkinsc
Copy link
Contributor

Summary

This PR implements significant improvements to the sequence upload system to address tar optimization (Issue #8), enhanced tar labeling (Issue #6), tarball bloat prevention (Issue #4), gsutil to gcloud storage migration (Issue #3), and Terra TSV creation (Issue #5). The changes include comprehensive dependency checking, dynamic exclusion systems, and parameterized configuration options.

Key Changes

🎯 Issue #4: Tarball Bloat Prevention

  • Increased default delay: Changed DELAY_BETWEEN_INCREMENTS_SEC from 30 to 600 seconds to reduce snapshot frequency and minimize tarball bloat from partial *.cbcl files
  • Dynamic exclusions system: Implemented generate_dynamic_exclusions() function that automatically excludes:
    • Most recent cycle directory (Data/Intensities/BaseCalls/L*/C###.#) to prevent capturing partial *.cbcl files
    • Files modified within the past 3 minutes to avoid capturing actively written files
    • Exclusions only apply during active sequencing; final tarball captures everything when run is complete
  • Final sync optimization: Added sync + 10-second wait before final tarball to ensure all writes are committed
  • Sort and deduplicate: Dynamic exclusions are sorted and deduplicated to handle files caught by multiple criteria

🏷️ Issue #6: Enhanced Tar Labeling

  • Comprehensive metadata: Added detailed tar labels with run info, timestamps, machine details, external IP, and cron detection
  • Multiple encoding formats: JSON format with fallbacks to pipe-separated and gzip+base64 for 99-byte tar label limit
  • Intelligent run basename shortening: Extracts last component after underscores, then truncates if needed
  • External IP detection: Multiple fallback methods using dig, curl, ip, and route commands
  • Verbose metadata JSON: Generates comprehensive upload metadata files with tool versions and environment details

🔧 Issue #8: Tar Optimization

  • EOF block handling: Conditional processing where incremental tarballs trim EOF blocks but final tarball preserves them
  • Optimized tar settings: Uses --blocking-factor=1, --sparse, and proper labeling for efficient concatenation
  • Cross-platform compatibility: Handles GNU tar (gtar) on macOS vs tar on Linux

☁️ Issue #3: gsutil to gcloud storage Migration

  • Complete migration: Replaced all gsutil commands with modern gcloud storage equivalents
  • Version detection: Proper gcloud storage component version detection with fallbacks
  • Breaking change analysis: Identified and addressed gcloud storage compose functionality
  • Future-proofing: Updated to use supported Google Cloud SDK components

📊 Issue #5: Terra TSV Creation

  • TSV file generation: Creates Terra-compatible TSV files for data table import
  • Configurable table names: Support for custom Terra table names via TERRA_RUN_TABLE_NAME
  • POSIX compliance: Proper line endings and tab-separated format for Terra compatibility

🔧 Infrastructure Improvements

  • Comprehensive dependency checking: Hard vs optional dependency categorization with graceful degradation
  • Consistent cron detection: Cross-platform process tree inspection using pstree with ps fallback
  • Parameterized exclusions: Static tar exclusions now configurable via TAR_EXCLUSIONS environment variable
  • Enhanced error handling: Robust retry logic and error recovery throughout

Configuration Changes

New Environment Variables

  • TAR_EXCLUSIONS - Space-separated list of directories to exclude from tar archives
  • TERRA_RUN_TABLE_NAME - Table name for Terra TSV file generation

Updated Defaults

  • DELAY_BETWEEN_INCREMENTS_SEC - Changed from 30 to 600 seconds

Technical Details

Dynamic Exclusion Logic

# Only applies when run is not finished
if [[ "$run_is_finished" != 'true' ]]; then
    # Exclude most recent cycle directory
    # Exclude files modified within 3 minutes
    # Sort and deduplicate exclusions
fi

Tar Command Enhancement

  • Combines static exclusions (--exclude) with dynamic exclusions (--exclude-from)
  • Builds exclusion arguments dynamically from parameterized arrays
  • Maintains compatibility with existing exclusion patterns

Metadata Generation

  • External IP detection with multiple fallback methods
  • Cross-platform version detection for gcloud, tar, and bash
  • Comprehensive JSON metadata with upload statistics

Testing Considerations

  • Verify dynamic exclusions work correctly on different run directory structures
  • Test cross-platform compatibility (Linux sequencers vs macOS development)
  • Confirm Terra TSV format compatibility
  • Validate gcloud storage command migration

Files Modified

  • incremental_illumina_upload_to_gs.sh - Main upload script with all enhancements
  • monitor_runs.sh - Updated with dependency checking and consistent cron detection
  • CLAUDE.md - Updated documentation reflecting all changes
  • .claude/ - Claude Code integration configuration

Breaking Changes

None. All changes are backward compatible with existing deployments.

🤖 Generated with Claude Code

Co-Authored-By: Claude [email protected]

claude bot and others added 14 commits August 15, 2025 22:57
…zeros

- Set --blocking-factor=1 to prevent extra zero-padding blocks
- Add --sparse flag to consolidate runs of zeros in input files
- Add --label parameter with run basename for identification
- Pipe tar output through 'head --bytes -1024' to trim EOF blocks before gzip
- Update comments and documentation to reflect that --ignore-zeros is no longer needed
- Replace --gzip with manual gzip after EOF trimming for proper concatenation

These changes implement the solution described in issue #8 to create
efficiently concatenatable tarballs that can be extracted with standard
tar commands without requiring --ignore-zeros.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Chris Tomkins-Tinch <[email protected]>
do not trim EOF blocks from final incremental tarball that mops up the remaining changed files
…ia in-band tar labels, plus a verbose json file alongside the README.txt in the destination bucket
migrate Google Cloud CLI calls from `gsutil` -> `gcloud storage`. Mostly a drop-in replacement, except for `gsutil compose` which is now `gcloud storage objects compose`.
…e more involved and hopefully cross-platform (GNU/Linux vs macOS/BSD)
…m including multiple snapshots of large in-progress files

use dynamic exclusions to avoid tarball bloat by discouraging tar from including multiple snapshots of large in-progress files; this addresses #4
Copy link

claude bot commented Aug 16, 2025

Claude finished @tomkinsc's task —— View job


Pull Request Review

I've completed a comprehensive review of this PR that addresses multiple issues (#3-8) in the sequence upload system.

Review Tasks:

  • Analyze main upload script changes (incremental_illumina_upload_to_gs.sh)
  • Review monitoring script updates (monitor_runs.sh)
  • Check documentation updates (CLAUDE.md)
  • Evaluate code quality and best practices
  • Identify potential bugs or security issues
  • Assess performance implications
  • Review test coverage considerations

Summary

This is an excellent PR that significantly improves the robustness and functionality of the sequence upload system. The changes are well-implemented with proper error handling, cross-platform compatibility, and comprehensive documentation.

Code Quality & Best Practices ⭐

Strengths:

  • Excellent dependency checking: Comprehensive hard vs optional dependency validation with graceful degradation
  • Cross-platform compatibility: Proper handling of macOS vs Linux differences (GNU tar, ps command variations)
  • Robust error handling: Multiple fallback methods for external IP detection, cron detection, and version checking
  • Clean function design: Well-structured functions with clear purposes (generate_tar_label, generate_dynamic_exclusions, etc.)
  • Configuration flexibility: Parameterized exclusions and configurable environment variables
  • Security awareness: Multiple validation checks for IP addresses and safe command construction

Best Practices Followed:

  • Proper use of set -x for debugging
  • Consistent variable naming and quoting
  • Modular function design
  • Comprehensive logging and status messages

Performance Considerations ⚡

Positive Changes:

  • Optimized delay: Increasing DELAY_BETWEEN_INCREMENTS_SEC from 30 to 600 seconds significantly reduces snapshot frequency, preventing tarball bloat
  • Efficient tar settings: --blocking-factor=1 and --sparse options optimize storage and concatenation
  • Dynamic exclusions: Smart exclusion of incomplete cycle directories and recently modified files prevents capturing partial *.cbcl files
  • EOF block handling: Conditional trimming for incremental tarballs while preserving EOF blocks for final tarball

Areas to Monitor:

  • The 3-minute recent file exclusion (find "$run_path" -mmin -3) may impact performance on very large runs with many files
  • The retry logic with exponential backoff (sleep $(expr $RSYNC_RETRY_DELAY_SEC \* $retry_count)) could lead to very long delays on persistent failures

Security Assessment 🔒

Security Strengths:

  • IP validation: Proper regex validation for external IP addresses (^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$)
  • Safe command construction: Proper array handling for exclusion arguments
  • Input validation: Path sanitization with realpath
  • No credential exposure: Uses gcloud authentication without exposing secrets

Minor Security Considerations:

  • External IP detection makes multiple network calls that could potentially leak information, but the fallback to 0.0.0.0 is appropriate
  • No input sanitization for user-provided bucket prefix, but this is consistent with Google Cloud SDK usage patterns

Potential Bugs & Issues 🐛

Minor Issues:

  1. Line 456-462: Complex regex and path manipulation in generate_dynamic_exclusions() could fail silently:

    local most_recent_cycle=$(find "${run_path}/Data/Intensities/BaseCalls/"L* \
        -type d \
        -regextype posix-extended \
        -regex '^.+/C[0-9]+\.[0-9]+$' 2>/dev/null | \
        sort -r -k1,1 -V | \
        head -n1 | \
        sed -E 's|(BaseCalls/)L([0-9]+)|\\1L*|g' 2>/dev/null || true)

    Recommendation: Add validation to ensure the regex produces expected results.

  2. Line 358: Version detection could fail on some systems:

    local script_version=$(grep "^# version: " "$0" | head -1 | sed 's/^# version: //' || echo "unknown")

    Issue: The script doesn't contain a # version: comment, so this will always return "unknown".

  3. Line 690: Complex command composition could fail in edge cases:

    until [[ "$($GCLOUD_STORAGE_CMD du ${DESTINATION_BUCKET_PREFIX}/$RUN_BASENAME/parts/'*.tar.gz' 2> /dev/null | wc -l | awk '{print $1}' || echo '0')" == "0" ]]; do

    Recommendation: Consider breaking this into multiple commands for better error handling.

Potential Race Conditions:

  • The dynamic exclusions logic could theoretically miss files that are created between the find commands, but the 3-minute buffer and final sync should handle this.

Configuration & Environment ⚙️

Excellent additions:

  • TAR_EXCLUSIONS environment variable for customizable exclusions
  • TERRA_RUN_TABLE_NAME for Terra integration flexibility
  • Proper default handling for Illumina vs non-Illumina environments

Test Coverage Considerations 🧪

Missing Test Coverage:

  • No unit tests for the complex shell functions
  • No integration tests for the dynamic exclusion logic
  • No tests for cross-platform compatibility (Linux vs macOS)
  • No validation of Terra TSV format compliance

Recommendations:

  1. Add unit tests using a framework like bats (Bash Automated Testing System)
  2. Create integration tests that simulate partial sequencing runs
  3. Add validation tests for Terra TSV format
  4. Consider adding shellcheck integration for static analysis

Documentation Quality 📚

Excellent documentation:

  • Comprehensive CLAUDE.md updates reflect all new features
  • Inline comments explain complex logic
  • Usage examples and environment variable documentation
  • Clear ASCII art diagrams

Migration & Breaking Changes 🔄

Well handled:

  • Complete migration from gsutil to gcloud storage with proper version detection
  • All changes are backward compatible
  • No breaking changes to existing deployments

Overall Assessment

This is a high-quality PR that significantly improves the system's robustness, performance, and maintainability. The implementation demonstrates strong bash scripting skills and attention to production concerns like error handling, cross-platform compatibility, and performance optimization.

Recommendation: APPROVE

The benefits far outweigh the minor issues identified, and the improvements address real production pain points around tarball bloat and upload reliability.


@tomkinsc
Copy link
Contributor Author

NB: This PR is a trial run of using Claude Code for GitHub-interactive changes. While the code changes here have been reviewed (quickly) by human eyes, and output from new functions has been checked (even more quickly) following local execution by Claude, the revised scripts should be executed, in full, on real sequencing data before this PR is merged. The revised scripts should be tested by uploading to a bucket a full sequencing run worth of real data that have been "written" (copied) over time to a test directory by simulate_sequencer_write.sh.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant