Commit 3ed14c7

Curt Tigges authored and committed
added more tests
1 parent 6fa8a62 commit 3ed14c7

5 files changed: +474 -0 lines changed


.gitignore

Lines changed: 2 additions & 0 deletions
@@ -209,6 +209,8 @@ clt_smoke_output_remote_wandb/
 wandb/
 scripts/debug
 scripts/optimization
+sparsify/
+clt-training/
 
 # models
 *.pt

tests/unit/data/README.md

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
# Data Integrity Tests

This directory contains tests that check data integrity across the activation generation and retrieval pipeline.

## Background

We've experienced data mixup issues in the past:
1. **Lexicographic vs Numerical Ordering**: Layers were sorted as strings (layer_10, layer_2, layer_20) instead of numerically (layer_2, layer_10, layer_20)
2. **Layer Data Corruption**: Similar layer-ordering issues caused data from one layer to be associated with another
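
The ordering problem is easy to reproduce outside the pipeline. The snippet below is a minimal sketch, assuming layer keys follow a `layer_<N>` naming scheme; the helper `layer_index` is hypothetical and not part of the codebase.

```python
import re


def layer_index(name: str) -> int:
    """Parse the integer suffix from a key such as 'layer_10' (assumed naming scheme)."""
    match = re.fullmatch(r"layer_(\d+)", name)
    if match is None:
        raise ValueError(f"unexpected layer key: {name!r}")
    return int(match.group(1))


keys = ["layer_2", "layer_10", "layer_20", "layer_1", "layer_100"]

# Plain sorted() compares strings character by character, so "layer_10" < "layer_2".
lexicographic = sorted(keys)

# Sorting on the parsed integer gives the order the layers actually have.
numerical = sorted(keys, key=layer_index)

print(lexicographic)
print(numerical)
```

Any code that pairs the i-th sorted key with the i-th layer of a model will silently mislabel data when the first ordering is used.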

## Test Coverage

### `test_data_integrity.py`

Test suite covering:

1. **Layer Ordering** (`test_layer_ordering_numerical_not_lexicographic`)
   - Verifies layers are ordered numerically (1, 2, 10, 20, 100), not lexicographically (1, 10, 100, 2, 20)
   - Creates layers that would be misordered if sorted as strings
   - Validates both HDF5 structure and actual data values

2. **Normalization Application** (`test_normalization_application_correctness`)
   - Tests that normalization statistics are correctly applied during retrieval
   - Creates data with known mean/std, then verifies normalized output
   - Ensures each layer's statistics are applied to the correct layer

3. **Cross-Chunk Token Ordering** (`test_cross_chunk_token_ordering`)
   - Verifies token ordering is preserved across chunk boundaries
   - Uses deterministic patterns to track tokens across multiple chunks
   - Ensures no tokens are duplicated or lost

4. **Manifest Format Compatibility** (`test_manifest_format_compatibility`)
   - Tests both legacy 2-field and new 3-field manifest formats
   - Ensures backward compatibility with existing datasets
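
The idea behind the normalization check can be sketched without the real store. The example below is an illustration under stated assumptions, not the code in `test_data_integrity.py`: the layer data, the `normalize` helper, and the `is_standardized` helper are all hypothetical. It gives each layer a very different scale so that applying another layer's statistics is immediately visible.

```python
import statistics

# Hypothetical per-layer activations with deliberately different scales.
raw = {
    0: [1.0, 2.0, 3.0, 4.0],
    1: [100.0, 200.0, 300.0, 400.0],
}

# Per-layer (mean, std), as a generation step might record them.
stats = {
    layer: (statistics.fmean(values), statistics.pstdev(values))
    for layer, values in raw.items()
}


def normalize(values, mean, std):
    """Standardize values with the given statistics."""
    return [(v - mean) / std for v in values]


def is_standardized(values, tol=1e-6):
    """True if values have mean ~0 and population std ~1."""
    return (
        abs(statistics.fmean(values)) < tol
        and abs(statistics.pstdev(values) - 1.0) < tol
    )


correct = normalize(raw[1], *stats[1])  # layer 1 data with layer 1 statistics
swapped = normalize(raw[1], *stats[0])  # simulated bug: layer 0 statistics

print(is_standardized(correct), is_standardized(swapped))
```

Because the two layers differ by two orders of magnitude, the swapped case cannot pass by accident, which is the property a known-mean/std fixture is meant to provide.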

### `test_local_activation_store.py`

Includes an additional test:
- **Layer Data Integrity** (`test_layer_data_integrity`)
  - Verifies each layer contains distinct, non-mixed data
  - Checks value ranges are layer-specific
  - Ensures the `targets = inputs + 1` relationship is preserved
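
A fixture with these properties might look like the following sketch. The 1000-wide value band per layer and the helper names `make_layer` and `check_layer` are assumptions made for illustration; the actual test may build its data differently.

```python
BAND = 1000  # assumed width of the value range reserved for each layer


def make_layer(layer: int, n_tokens: int, d_model: int):
    """Build inputs inside the layer's band and targets equal to inputs + 1."""
    assert n_tokens * d_model < BAND, "pattern would spill into the next layer's band"
    base = layer * BAND
    inputs = [
        [float(base + t * d_model + j) for j in range(d_model)]
        for t in range(n_tokens)
    ]
    targets = [[x + 1.0 for x in row] for row in inputs]
    return inputs, targets


def check_layer(layer: int, inputs, targets) -> bool:
    """True if every input lies in the layer's band and targets == inputs + 1."""
    base = layer * BAND
    in_band = all(base <= x < base + BAND for row in inputs for x in row)
    offset_ok = all(
        t == x + 1.0
        for in_row, tgt_row in zip(inputs, targets)
        for x, t in zip(in_row, tgt_row)
    )
    return in_band and offset_ok


layers = {layer: make_layer(layer, n_tokens=8, d_model=4) for layer in range(3)}
print(all(check_layer(layer, *data) for layer, data in layers.items()))
```

Handing layer 0's data to `check_layer(1, ...)` fails the band check, which is how a mixup between layers would surface.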

## Running the Tests

Run all data integrity tests:
```bash
pytest tests/unit/data/test_data_integrity.py -v
```

Run a specific test:
```bash
pytest tests/unit/data/test_data_integrity.py::TestDataIntegrity::test_layer_ordering_numerical_not_lexicographic -v
```

Run with coverage:
```bash
pytest tests/unit/data/test_data_integrity.py --cov=clt.activation_generation --cov=clt.training.data -v
```

## What These Tests Prevent

1. **Silent Data Corruption**: Detects if layers get mixed up during generation or retrieval
2. **Normalization Errors**: Ensures statistics from one layer aren't applied to another
3. **Token Loss**: Verifies all tokens are accessible and in correct order
4. **Format Regressions**: Maintains compatibility with existing activation datasets
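
To make the token-loss case concrete, here is a rough sketch of the kind of check involved. It assumes each token carries a unique, increasing ID; the functions `split_into_chunks` and `verify_token_stream` are hypothetical helpers, not part of the test suite.

```python
def split_into_chunks(token_ids, chunk_size):
    """Cut a token list into consecutive chunks of at most chunk_size."""
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]


def verify_token_stream(chunks, expected_total):
    """Return a list of problems found when reading the chunks back in order."""
    seen = [tok for chunk in chunks for tok in chunk]
    problems = []
    if len(set(seen)) != len(seen):
        problems.append("duplicated tokens")
    if set(seen) != set(range(expected_total)):
        problems.append("missing or unexpected tokens")
    if seen != sorted(seen):
        problems.append("tokens out of order")
    return problems


tokens = list(range(10))
chunks = split_into_chunks(tokens, chunk_size=4)  # sizes 4, 4, 2

print(verify_token_stream(chunks, expected_total=10))
print(verify_token_stream([chunks[1], chunks[0], chunks[2]], expected_total=10))
print(verify_token_stream(chunks[:2], expected_total=10))
```

Reporting the three failure modes separately makes it easier to tell a reordering bug from a dropped or re-read chunk.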

## Adding New Tests

When adding features that touch activation generation or retrieval:
1. Add tests that use deterministic, verifiable patterns
2. Test edge cases (empty chunks, single token, many layers)
3. Verify both structure (metadata, manifests) and actual data values
4. Consider cross-component interactions
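
One way to build a deterministic, verifiable pattern is to encode each value's position into the value itself, so any value read back can be traced to the layer and token it came from. The sketch below assumes that approach; `encode`, `decode`, `build_store`, `check_store`, and the `STRIDE` constant are hypothetical, and in a real test the store would come from the pipeline rather than a dict.

```python
STRIDE = 1_000_000  # assumed upper bound on tokens per layer


def encode(layer: int, token: int) -> int:
    """Pack (layer, token) into one integer."""
    return layer * STRIDE + token


def decode(value: int):
    """Recover (layer, token) from an encoded value."""
    return divmod(value, STRIDE)


def build_store(num_layers: int, num_tokens: int):
    """Stand-in for generated activations: {layer: [encoded values]}."""
    return {
        layer: [encode(layer, t) for t in range(num_tokens)]
        for layer in range(num_layers)
    }


def check_store(store) -> bool:
    """True if every value sits at the (layer, token) position it encodes."""
    return all(
        decode(value) == (layer, t)
        for layer, values in store.items()
        for t, value in enumerate(values)
    )


# Edge cases from the list above: empty chunks, single token, many layers.
edge_cases = {"empty": (4, 0), "single token": (4, 1), "many layers": (128, 16)}
results = {name: check_store(build_store(*shape)) for name, shape in edge_cases.items()}
print(results)
```

Swapping two layers' lists in the store makes `check_store` return `False`, so the same helper covers both the structural and the value-level checks.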
