7 changes: 7 additions & 0 deletions .claude/agents/expert-developer.md
@@ -0,0 +1,7 @@
---
name: expert-developer
description: use this agent when uncommitted changes exist for files in this repo, or when open issues are outstanding for this repo on GitHub
model: sonnet
---

Expert software engineer who can act on my behalf to address open GitHub issues, incorporate changes according to guidance I have issued, and review new code I have written.
14 changes: 14 additions & 0 deletions .claude/settings.local.json
@@ -0,0 +1,14 @@
{
"permissions": {
"allow": [
"Bash(git checkout:*)",
"Bash(git fetch:*)",
"WebFetch(domain:www.gnu.org)",
"WebSearch",
"WebFetch(domain:man7.org)",
"Bash(gh issue view:*)"
],
"deny": [],
"ask": []
}
}
82 changes: 82 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,82 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This repository contains bash scripts for incremental upload of Illumina sequencing runs to Google Cloud Storage. The system creates incremental gzipped tarballs and composes them into a single tarball in a GS bucket, and is designed to run on Linux-based Illumina sequencers (such as the NextSeq 2000) or on companion computers.

## Architecture

**Core Scripts:**
- `incremental_illumina_upload_to_gs.sh` - Main upload script that creates incremental tar archives and uploads them to GS
- `monitor_runs.sh` - Monitoring script that watches for new run directories and launches upload processes
- `simulate_sequencer_write.sh` - Testing utility that simulates a sequencer writing data incrementally

**Key Components:**
- **Incremental Archiving**: Uses GNU tar with `--listed-incremental` to create incremental backups (see the sketch after this list)
- **Chunked Uploads**: Splits large runs into manageable chunks (default 100MB) with retry logic
- **GS Composition**: Uses `gcloud storage objects compose` to merge incremental tarballs into single archives
- **Cross-platform Support**: Handles differences between Linux (Illumina sequencers) and macOS
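
A minimal sketch of how incremental archiving and GS composition fit together (paths, filenames, and the single-increment flow are illustrative; the real `incremental_illumina_upload_to_gs.sh` adds chunk splitting, retries, and exclusion handling):

```bash
# Illustrative only -- not the actual script logic.
RUN_DIR=/usr/local/illumina/runs/240101_EXAMPLE_RUN   # hypothetical run directory
STAGING=/usr/local/illumina/seq-run-uploads
SNAPSHOT="$STAGING/example.snar"                       # tar's incremental state file
DEST=gs://bucket/flowcells/240101_EXAMPLE_RUN.tar.gz

# Each pass archives only files changed since the previous snapshot.
tar --listed-incremental="$SNAPSHOT" --no-check-device \
    -czf "$STAGING/increment.tar.gz" -C "$RUN_DIR" .

# Upload the increment, then append it to the single destination tarball.
# (The very first increment would be copied straight to $DEST instead.)
gcloud storage cp "$STAGING/increment.tar.gz" "${DEST}.part"
gcloud storage objects compose "$DEST" "${DEST}.part" "$DEST"
```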

## Dependencies

Required tools that must be available:
- `gcloud storage` (Google Cloud SDK)
- `tar` (GNU tar, installed as `gtar` on macOS)
- `pstree` (for monitoring script, installed via `brew install pstree` on macOS)
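
A quick preflight check along these lines can confirm the tools are present (a sketch; the scripts may perform their own checks):

```bash
# Sketch of a dependency check; tool names are taken from the list above.
TAR_CMD=tar
[ "$(uname -s)" = "Darwin" ] && TAR_CMD=gtar   # GNU tar is installed as gtar on macOS
for cmd in gcloud pstree "$TAR_CMD"; do
  command -v "$cmd" >/dev/null 2>&1 || { echo "missing required tool: $cmd" >&2; exit 1; }
done
```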

## Environment Variables

Key configuration variables (with defaults):
- `CHUNK_SIZE_MB=100` - Size of incremental tar chunks
- `DELAY_BETWEEN_INCREMENTS_SEC=600` - Wait time between upload attempts (optimized to reduce tarball bloat from partial *.cbcl files)
- `RUN_COMPLETION_TIMEOUT_DAYS=16` - Max time to wait for run completion
- `STAGING_AREA_PATH` - Location for temporary files (defaults to `/usr/local/illumina/seq-run-uploads` on Illumina machines, `/tmp/seq-run-uploads` elsewhere)
- `RSYNC_RETRY_MAX_ATTEMPTS=12` - Maximum retry attempts for uploads
- `INCLUSION_TIME_INTERVAL_DAYS=7` - Age limit for runs to be considered for upload
- `TERRA_RUN_TABLE_NAME=flowcell` - Table name for Terra TSV file generation (creates `entity:{table_name}_id` column)
- `TAR_EXCLUSIONS` - Space-separated list of directories to exclude from tar archives (defaults to: "Thumbnail_Images Images FocusModelGeneration Autocenter InstrumentAnalyticsLogs Logs")
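
Because these are ordinary environment variables, defaults can be overridden per invocation (the values below are illustrative):

```bash
# Illustrative one-off override of a few defaults.
CHUNK_SIZE_MB=250 \
DELAY_BETWEEN_INCREMENTS_SEC=300 \
TAR_EXCLUSIONS="Thumbnail_Images Images Logs" \
./incremental_illumina_upload_to_gs.sh /path/to/run gs://bucket-prefix
```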

## Usage Patterns

**Main upload script:**
```bash
./incremental_illumina_upload_to_gs.sh /path/to/run gs://bucket-prefix
```

**Monitoring script:**
```bash
./monitor_runs.sh /path/to/monitored-directory gs://bucket-prefix
```

**Simulation script (for testing):**
```bash
./simulate_sequencer_write.sh /path/to/actual_run /path/to/simulated_run
```

## Important Implementation Details

- **Excluded Directories**: The upload excludes large non-essential directories (configurable via `TAR_EXCLUSIONS` environment variable, defaults to: `Thumbnail_Images`, `Images`, `FocusModelGeneration`, `Autocenter`, `InstrumentAnalyticsLogs`, `Logs`)
- **Dynamic Exclusions**: During active sequencing, automatically excludes the most recent cycle directory and recently modified files (within 3 minutes) to prevent tarball bloat from partial `*.cbcl` files. Exclusions are disabled for the final tarball when `RTAComplete.txt`/`RTAComplete.xml` is detected.
- **Individual Files**: `SampleSheet.csv` and `RunInfo.xml` are uploaded separately before tarball creation
- **Run Completion Detection**: Looks for `RTAComplete.txt` or `RTAComplete.xml` files
- **Tarball Extraction**: Resulting tarballs must be extracted with GNU tar using `--ignore-zeros` (see the example after this list)
- **NFS Support**: Uses `--no-check-device` flag for NFS mounted storage
- **Platform Detection**: Automatically detects Illumina machines vs other environments
- **Cleanup**: Removes local incremental tarballs after successful upload
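
For the extraction requirement above, a minimal example (GNU tar, i.e. `gtar` on macOS; the filename is illustrative):

```bash
# --ignore-zeros lets GNU tar read past the end-of-archive markers that
# separate the concatenated incremental archives inside the composed tarball.
tar --ignore-zeros -xzf 240101_EXAMPLE_RUN.tar.gz -C /path/to/restore
```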

## Cron Integration

The monitoring script is designed to work with cron scheduling. Example crontab entry:
```
@hourly ~/monitor_runs.sh /usr/local/illumina/runs gs://bucket/flowcells >> ~/upload_monitor.log
```

## File Paths

Staging areas:
- Illumina machines: `/usr/local/illumina/seq-run-uploads`
- Other systems: `/tmp/seq-run-uploads`

Run detection is based on the presence of `RunInfo.xml` files in monitored directories.
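
An illustrative way to enumerate run directories by that criterion (the actual `monitor_runs.sh` logic may differ):

```bash
# List directories directly under the monitored path that contain RunInfo.xml.
find /usr/local/illumina/runs -mindepth 2 -maxdepth 2 -name RunInfo.xml \
  | xargs -n1 dirname
```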