@kahnertk kahnertk commented Oct 6, 2025

This PR introduces a comprehensive refactor of the inference pipeline, focused on substantially improving runtime performance, scalability, and code structure, while preserving the original CLI workflow via process.py.

Key Technical Changes

  • Expanded CLI & Configuration Functionality

    • The existing process.py entry point was extended with additional flags and configuration options.
    • New parameters include batch size control, GPU/CPU selection, output format (e.g., H5AD), async saving, attention map toggling, and more.
    • Improved help messages and sensible defaults to better support large-scale runs.
  • Modularization of Processing Logic

    • The previously monolithic processing script was split into multiple modules (e.g., dataset handling, model inference, I/O, utilities), improving readability and maintainability.
    • Core inference logic was reorganized into clearer, well-scoped functions with explicit data flow.
  • Data Loading & Preprocessing Enhancements

    • Introduced a SubCellDataset class using pandas CSV parsing for scalable image list handling.
    • Added prefetching, optional min–max normalization, and multiprocessing hooks.
    • Cleaned up legacy CSV formats while maintaining backward compatibility.
  • Batched Inference Pipeline

    • Replaced per-image loops with efficient batched inference.
    • Added configurable batch size and immediate GPU cleanup between batches to prevent memory fragmentation.
    • Introduced embeddings-only mode to skip classification when not required.
  • I/O and Output Refactor

    • Added support for H5AD output, enabling compact storage of large runs.
    • Implemented optional asynchronous saving of .npy files using ThreadPoolExecutor.
    • Standardized output structure and file naming.
  • Core Improvements & Bug Fixes

    • Removed redundant preprocessing code paths and clarified image channel handling.
    • Fixed issues in image_utils.py (e.g., convert_bitdepth parameter handling).
    • General code cleanups and variable name improvements.
  • Documentation Updates

    • Rewrote README with updated installation, CLI usage, config parameters, and examples.
    • Added parameter tables and clearer guidance for large-scale use.

Impact

  • Significant runtime reduction for large datasets through batching and async I/O.
  • Improved memory efficiency on GPU and CPU.
  • A cleaner, modular codebase without altering the established CLI workflow.
  • Updated documentation for smoother user onboarding.

kahnertk and others added 6 commits August 26, 2025 11:12
Key changes:
- Create new SubCellDataset class with pandas-based CSV reading
- Refactor inference.py to handle batched data processing instead of single images
- Add progress tracking with tqdm progress bars
- Add tqdm dependency to requirements
- Fix bug in image_utils.py (convert_bitdepth parameter error)
- Adapt variable naming in min_max_standardize function
- Add command-line arguments for batch processing control
- Add quiet mode option to reduce verbose logging
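
The switch from per-image loops to batched processing can be illustrated with a minimal sketch. It assumes the image list is already in memory as a sequence of paths. `iter_batches` is a hypothetical helper, not a function from this PR, and the actual implementation presumably relies on the new SubCellDataset class rather than plain slicing.

```python
from typing import Iterator, Sequence


def iter_batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        # The last slice may be shorter than batch_size.
        yield items[start:start + batch_size]


# Hypothetical image paths, used only for illustration.
image_paths = ["img_0.tif", "img_1.tif", "img_2.tif", "img_3.tif", "img_4.tif"]
batches = list(iter_batches(image_paths, batch_size=2))
```

With five paths and a batch size of two, the model would be called three times instead of five, and the final batch holds the single leftover image.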
…o 15.5h for 2.56M images

Major optimizations implemented:
- Async file saving with ThreadPoolExecutor to reduce I/O bottlenecks
- H5AD output format option for efficient batch storage
- Embeddings-only mode to skip classification when not needed
- Optimized image loading with tifffile for TIFF images and PIL for others
- Min-max normalization option for better image preprocessing
- Immediate GPU memory cleanup after each batch
- Configurable attention map saving (disabled by default in async mode)
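
As a rough illustration of the async saving idea listed above, the sketch below submits file writes to a single ThreadPoolExecutor while the main loop moves on to the next batch. `fake_inference`, `save_payload`, and the `_embedding.bin` file naming are hypothetical stand-ins. The PR itself saves .npy files, which would need NumPy in place of the raw byte writes used here.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def save_payload(path: Path, payload: bytes) -> Path:
    # Stand-in for writing an array to disk (e.g. an .npy file).
    path.write_bytes(payload)
    return path


def fake_inference(batch):
    # Hypothetical placeholder for the model forward pass:
    # returns one (filename, payload) pair per input item.
    return [(f"{name}_embedding.bin", name.encode()) for name in batch]


def process_batches(batches, out_dir, max_workers=4):
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    pending = []
    # One executor for the whole run; writes overlap with later batches.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in batches:
            for filename, payload in fake_inference(batch):
                pending.append(pool.submit(save_payload, out_dir / filename, payload))
    # Leaving the `with` block waits for all writes; result() re-raises any I/O error.
    return [future.result() for future in pending]
```

Calling `result()` on every future matters: without it, a failed write would go unnoticed.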

Performance improvements:
- Batch processing with async I/O operations
- Memory-efficient data handling
- Optimized image loading pipeline
- Reduced disk I/O overhead

Note: This is work in progress and needs cleanup and testing, but is working and has achieved significant performance gains.
  Major architectural improvements, bug fixes, and performance enhancements:

  ## Architecture & Refactoring
  - Split monolithic process.py into modular components (config.py, cli.py,
    model_loader.py, output_handlers.py)
  - Added comprehensive type hints throughout
  - Enhanced error handling with specific exceptions and actionable messages

  ## Critical Bug Fixes
  - Fixed attention map saving (was passing None after GPU cleanup)
  - Fixed duplicate normalization (images normalized twice - dataset + inference)
  - Fixed output format naming consistency ("h5ad" → "combined")
  - Removed unused utils.py and scikit-learn dependency

  ## Performance Improvements
  - Eliminated duplicate min-max normalization in inference pipeline
  - Optimized async_saving pattern (removed nested ThreadPoolExecutors)
  - Extracted duplicate save logic into shared helper function
  - Updated default batch_size to 128 for better stability

  ## API Improvements
  - Renamed output formats: "individual" and "combined" (more intuitive)
  - Changed -t to -m for model_type flag
  - Removed redundant negative boolean flags (--no-create_csv, etc.)
  - Simplified parameter organization (Basic vs Additional)

  ## Documentation
  - Modernized README.md with professional formatting, badges, and tables
  - Updated parameter documentation with clear examples
  - Added publication link and repository references

  ## Dependencies
  - Removed scikit-learn (unused)
  - Added tifffile (required by image_utils.py)
  - Updated config.yaml with new defaults
Updated project description in README.md.
Critical fixes:
- Fix probability normalization bug - apply softmax to ensure probabilities sum to 1.0 (inference.py)
- Previously the reported values were raw logits, which summed to ~2-3 instead of 1.0
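
To make the softmax fix concrete, here is a small pure-Python sketch. The logit values are made up for illustration, and the code in inference.py presumably applies the equivalent tensor operation rather than this list-based version.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up logits for three classes.
logits = [2.0, 1.0, 0.1]
probabilities = softmax(logits)

raw_sum = sum(logits)                  # not 1.0: raw logits are unnormalized
normalized_sum = sum(probabilities)    # 1.0 up to floating-point rounding
```

Softmax preserves the ranking of the classes, so the top prediction is the same before and after the fix. Only the scale of the reported values changes.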

PyTorch compatibility:
- Suppress torch.load FutureWarning by explicitly setting weights_only=False (vit_model.py)
- Fix SDPA attention warning by using eager attention implementation (vit_model.py)

Error handling improvements:
- Add comprehensive CUDA OOM error handling with actionable suggestions (process.py)
- Suggests reducing batch_size, num_workers, or switching to CPU
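
A hedged sketch of what such an actionable out-of-memory message could look like is shown below. It avoids importing PyTorch and instead assumes that a CUDA out-of-memory failure surfaces as a RuntimeError whose message contains "out of memory". `run_with_oom_hint` and `format_oom_hint` are hypothetical names, not the functions added to process.py.

```python
def format_oom_hint(batch_size: int, num_workers: int) -> str:
    # Mirrors the three suggestions named in the commit message.
    return (
        "CUDA ran out of memory. Try one of the following:\n"
        f"  - reduce batch_size (currently {batch_size})\n"
        f"  - reduce num_workers (currently {num_workers})\n"
        "  - switch to CPU"
    )


def run_with_oom_hint(step, batch_size: int, num_workers: int):
    """Run `step()` and re-raise out-of-memory errors with an actionable hint."""
    try:
        return step()
    except RuntimeError as err:
        # Assumption: OOM failures are RuntimeErrors mentioning "out of memory".
        if "out of memory" in str(err).lower():
            raise RuntimeError(format_oom_hint(batch_size, num_workers)) from err
        raise  # Unrelated errors pass through unchanged.
```

Chaining with `from err` keeps the original traceback available for debugging while the user sees the suggestions first.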

Logging improvements:
- Fix incomplete output file logging - explicitly list embedding, probability, and attention map files (process.py)
- Previously only showed generic "individual files" message

User experience:
- Add CLI examples and common mistakes to help output (cli.py)
- Shows correct usage patterns when users run --help
…lts overriding

config

This commit addresses two major design issues:

1. Hardcoded config and input file paths
- Added --config parameter (default: config.yaml)
- Added --path-list parameter (default: path_list.csv)
- Standardized PATH_LIST_CSV constant (removed unnecessary ./ prefix)

2. Critical bug: CLI defaults were overwriting config file values
- Changed all CLI argument defaults to argparse.SUPPRESS
- Modified parse_args() to filter out unprovided arguments
- Only explicitly provided CLI args now override config file values

Configuration priority, from lowest to highest, is now:
Default values → Config file → Explicit CLI arguments

Updated documentation (README.md, CLAUDE.md) with new parameters and examples.
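
The argparse.SUPPRESS approach described in this commit can be sketched as follows. The option names other than --config, the values in `DEFAULTS`, and the `resolve_settings` helper are assumptions for illustration. The config file is represented by an already-loaded dict so the sketch stays free of a YAML dependency.

```python
import argparse

# Hypothetical built-in defaults; the project's real defaults may differ.
DEFAULTS = {"batch_size": 128, "gpu": 0, "output_format": "combined"}


def build_parser() -> argparse.ArgumentParser:
    # argument_default=SUPPRESS leaves unprovided options out of the namespace,
    # so they cannot silently overwrite values from the config file.
    parser = argparse.ArgumentParser(argument_default=argparse.SUPPRESS)
    parser.add_argument("--config", default="config.yaml")
    parser.add_argument("--batch_size", type=int)
    parser.add_argument("--gpu", type=int)
    parser.add_argument("--output_format", choices=["individual", "combined"])
    return parser


def resolve_settings(file_config: dict, argv: list) -> dict:
    """Merge settings: defaults, then config file, then explicit CLI arguments."""
    cli_args = vars(build_parser().parse_args(argv))
    cli_args.pop("config", None)  # the config path itself is not a setting
    # Later dicts win, which yields the priority order described above.
    return {**DEFAULTS, **file_config, **cli_args}


from_file = resolve_settings({"batch_size": 64}, [])
from_cli = resolve_settings({"batch_size": 64}, ["--batch_size", "32"])
```

Here `from_file["batch_size"]` takes the config file value because the flag was not passed, while `from_cli["batch_size"]` takes the explicit CLI value. With ordinary argparse defaults, the first case would have been overwritten by the parser's default.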

kahnertk commented Oct 7, 2025

Added --config and --path-list parameters, fix CLI defaults overriding config

This addresses two major design issues:

  1. Hardcoded config and input file paths
  • Added --config parameter (default: config.yaml)
  • Added --path-list parameter (default: path_list.csv)
  • Standardized PATH_LIST_CSV constant (removed unnecessary ./ prefix)
  2. Critical bug: CLI defaults were overwriting config file values
  • Changed all CLI argument defaults to argparse.SUPPRESS
  • Modified parse_args() to filter out unprovided arguments
  • Only explicitly provided CLI args now override config file values

Configuration priority, from lowest to highest, is now:
Default values → Config file → Explicit CLI arguments

Updated documentation (README.md, CLAUDE.md) with new parameters and examples.

Changed parameter name from --path-list to --path_list to match the
naming convention of other parameters (using underscores instead of hyphens).

Updated documentation in README.md and CLAUDE.md to reflect the change.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>