[WIP] LLAMA3 Integration with BioNeMo Recipes with ASCII Tokenization/EVO2-Style Dataloader #1205

savitha-eng · 2025-09-29T21:31:16Z

Description

This PR implements a complete end-to-end genomic foundation model training pipeline using LLAMA3 architecture with EVO2-compatible data processing and distributed training capabilities.

Key Components:

Genomic Tokenization System:
ASCII character-level tokenization with 256-token vocabulary covering complete ASCII character set
Direct nucleotide encoding supporting standard nucleotides (A, T, C, G, N) and IUPAC ambiguity codes
EVO2-compatible byte-level tokenization methodology
Integrated BOS, EOS, and PAD tokens for proper sequence boundary management
EVO2-Style Genomic Dataloader (following the Sharded Eden Dataloader):
Efficient sequence windowing: 8192-token sequences with 7992-token stride (200bp overlap) - following the EVO2 approach
Randomization: fixed window positions with shuffled access order for training stability (
SQLite backend integration for high-performance sequence retrieval
Full FSDP and DDP distributed training compatibility
Memory-efficient streaming with configurable parameters
Currently processes small subsample dataset (1,024 sequences into 254,043 training windows) with future scaling planned for production datasets
Model Pipeline Architecture:
Dynamic RoPE implementation supporting arbitrary sequence lengths with ESM-2 style computation
Random weight initialization using from-config approach to avoid text-domain bias
Memory-optimized LLAMA3: 8-layer, 403M parameter configuration for development/testing
FSDP integration with TransformerEngine optimization
Complete Weights & Biases integration for experiment tracking
Quick Validation Results: - Much more testing needed
L0 Sanity: 4-layer model, 25 steps, 3-minute validation
L1 Pilot: 8-layer model, 200 steps, 50-minute training
Stable training convergence demonstrated (loss: 6.1 → 1.38 over 200 steps)
Consistent 23.5GB GPU memory usage with 4.1 iterations/second sustained performance
Note: Much more extensive validation is needed with full-scale models and production datasets

Usage

Quick validation run:
cd bionemo-recipes/recipes/llama_native_te_nvfsdp torchrun --nproc_per_node=1 train.py --config-name=L0_sanity
More Realistic training run (still very small scale):
cd bionemo-recipes/recipes/llama_native_te_nvfsdp torchrun --nproc_per_node=1 train.py --config-name=L1_pilot
Configuration customization:
Type of changes
[x] New feature (non-breaking change which adds functionality)
[ ] Bug fix (non-breaking change which fixes an issue)
[ ] Refactor
[ ] Documentation update
[ ] Other (please describe):
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.
[ ] ciflow:skip - Skip all CI tests for this PR
[ ] ciflow:notebooks - Run Jupyter notebooks execution tests for bionemo2
[ ] ciflow:slow - Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2
[ ] ciflow:all - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. This label can be used to enforce running tests for all bionemo2.
[x] ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.
Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.
For more details, see CONTRIBUTING

Note

By default, only basic unit tests are run. Add appropriate labels to enable an additional test coverage.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.
If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
/ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.
Pre-submit Checklist
[x] I have tested these changes locally
[ ] I have updated the documentation accordingly
[ ] I have added/updated tests as needed
[] All existing tests pass successfully

Next Steps:

Dataset Scaling: Scale from current subsample (1K sequences) to larger datasets (10M+ sequences)
Model Scaling: Increase to 16+ layers and full parameter counts for training
Multi-Node Training: Test and validate existing FSDP implementation across multiple GPU nodes
THD Sequence Packing
Extended Context Training: Support for longer context lengths (32K+ tokens) with iterative training strategies and memory optimization
Extended Validation: Comprehensive evaluation with downstream genomic tasks and comparative benchmarking against EVO2

…ration into bionemo recipes Signed-off-by: savitha-eng <[email protected]>

copy-pr-bot · 2025-09-29T21:31:20Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2025-09-29T21:31:25Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

✨ Finishing touches

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch savitha-llama3integration-with-dataloader-and-ascii-tokenizer

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🧪 Early access (Sonnet 4.5): enabled

We are currently testing the Sonnet 4.5 model, which is expected to improve code review quality. However, this model may lead to increased noise levels in the review comments. Please disable the early access features if the noise level causes any inconvenience.

Note:

Public repositories are always opted into early access features.
You can enable or disable early access features from the CodeRabbit UI or by updating the CodeRabbit configuration file.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

pstjohn

leaving a first set of comments, will keep reviewing later

pstjohn · 2025-09-29T22:31:27Z

bionemo-recipes/recipes/llama_native_te_nvfsdp/__init__.py

this folder isn't a package, so this file won't do anything 🤷

pstjohn · 2025-09-29T22:38:50Z