Commit 2dfda3e

refactor v1
1 parent cfd77e3 commit 2dfda3e

7 files changed: +779 -746 lines changed

.github/workflows/integration_test_8gpu.yaml renamed to .github/workflows/integration_test_8gpu_core.yaml

Lines changed: 1 addition & 2 deletions
@@ -46,5 +46,4 @@ jobs:
         USE_CPP=0 python -m pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
 
         mkdir artifacts-to-be-uploaded
-        python ./tests/integration_tests.py --config_dir ./torchtitan/models/llama3/train_configs artifacts-to-be-uploaded/llama3 --ngpu 8
-        python ./tests/integration_tests.py --config_dir ./torchtitan/models/deepseek_v3/train_configs artifacts-to-be-uploaded/deepseek --ngpu 4
+        python ./tests/integration_tests/integration_tests.py artifacts-to-be-uploaded --test_suite core --ngpu 8

.github/workflows/integration_test_8gpu_h100.yaml

Lines changed: 1 addition & 1 deletion
@@ -47,4 +47,4 @@ jobs:
         USE_CPP=0 python -m pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
 
         mkdir artifacts-to-be-uploaded
-        python ./tests/integration_tests_h100.py artifacts-to-be-uploaded --ngpu 8
+        python ./tests/integration_tests/integration_tests_h100.py artifacts-to-be-uploaded --ngpu 8
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+name: 8 GPU Integration Test
+
+on:
+  push:
+    branches: [ main ]
+    paths-ignore:
+      - 'torchtitan/experiments/**'
+  pull_request:
+    paths-ignore:
+      - 'torchtitan/experiments/**'
+  schedule:
+    # Runs every 6 hours
+    - cron: '0 */6 * * *'
+
+concurrency:
+  group: unit-test${{ github.workflow }}-${{ github.ref == 'refs/heads/main' && github.run_number || github.ref }}
+  cancel-in-progress: true
+
+defaults:
+  run:
+    shell: bash -l -eo pipefail {0}
+
+jobs:
+  build-test:
+    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
+    with:
+      runner: linux.g5.48xlarge.nvidia.gpu
+      gpu-arch-type: cuda
+      gpu-arch-version: "12.6"
+      # This image is faster to clone than the default, but it lacks CC needed by triton
+      # (1m25s vs 2m37s).
+      docker-image: torchtitan-ubuntu-20.04-clang12
+      repository: pytorch/torchtitan
+      upload-artifact: outputs
+      script: |
+        set -eux
+
+        # The generic Linux job chooses to use base env, not the one setup by the image
+        CONDA_ENV=$(conda env list --json | jq -r ".envs | .[-1]")
+        conda activate "${CONDA_ENV}"
+
+        pip config --user set global.progress_bar off
+
+        python -m pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126
+
+        USE_CPP=0 python -m pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
+
+        mkdir artifacts-to-be-uploaded
+        python ./tests/integration_tests/integration_tests.py artifacts-to-be-uploaded --test_suite parallelism --ngpu 8
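
The `concurrency.group` expression in this workflow uses the `cond && a || b` idiom that GitHub Actions expressions offer in place of a ternary. A minimal Python sketch of how the key appears to resolve is shown here. The function name `concurrency_group` is hypothetical and exists only for illustration; it is not part of the commit.

```python
def concurrency_group(workflow: str, ref: str, run_number: int) -> str:
    """Approximate the workflow's concurrency group key (illustrative only).

    On main, the run number makes every key unique, so runs never cancel
    each other. On any other ref, the key is shared per ref, so with
    cancel-in-progress enabled a newer run replaces an older one.
    """
    suffix = str(run_number) if ref == "refs/heads/main" else ref
    return f"unit-test{workflow}-{suffix}"


# Two pushes to main produce different groups.
main_a = concurrency_group("8 GPU Integration Test", "refs/heads/main", 101)
main_b = concurrency_group("8 GPU Integration Test", "refs/heads/main", 102)

# Two runs on the same pull request ref share one group.
pr_a = concurrency_group("8 GPU Integration Test", "refs/pull/7/merge", 103)
pr_b = concurrency_group("8 GPU Integration Test", "refs/pull/7/merge", 104)

print(main_a != main_b, pr_a == pr_b)
```

Under this reading, scheduled and push runs on main all complete, while repeated pushes to a pull request cancel the superseded run.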

tests/README.md

Lines changed: 13 additions & 9 deletions
@@ -5,8 +5,10 @@ This directory contains tests for the TorchTitan project, including unit tests a
 ## Test Structure
 
 - `unit_tests/`: Contains unit tests for individual components
-- `integration_tests.py`: Contains integration tests that test multiple components together
-- `integration_tests_h100.py`: Contains integration tests specifically designed for H100 GPUs, which utilize symmetric memory and float8.
+- `integration_tests/`: Contains integration tests that test multiple components together
+  - `integration_tests.py`: Main integration tests for various model configurations
+  - `integration_tests_h100.py`: Tests specifically designed for H100 GPUs, utilizing symmetric memory and float8
+  - `base_config.toml`: Base configuration file for integration tests
 - `assets/`: Contains test assets and fixtures used by the tests
 
 ## Running Tests
@@ -25,25 +27,27 @@ pip install -r requirements.txt
 To run the integration tests:
 
 ```bash
-python ./tests/integration_tests.py <output_dir> [--config_dir CONFIG_DIR] [--test TEST] [--ngpu NGPU]
+python -m tests.integration_tests.integration_tests <output_dir> [--config_path CONFIG_PATH] [--test_name TEST_NAME] [--test_suite TEST_SUITE] [--model MODEL] [--ngpu NGPU]
 ```
 
 Arguments:
 - `output_dir`: (Required) Directory where test outputs will be stored
-- `--config_dir`: (Optional) Directory containing configuration files (default: "./torchtitan/models/llama3/train_configs")
-- `--test`: (Optional) Specific test to run, use test names from the `build_test_list()` function (default: "all")
+- `--config_path`: (Optional) Path to the base config file (default: "./tests/integration_tests/base_config.toml")
+- `--test_name`: (Optional) Specific test to run by name (default: "all")
+- `--test_suite`: (Optional) Test suite to run: 'core', 'parallelism', or 'all' (default: "all")
+- `--model`: (Optional) Specify the model to run tests on (default: "all")
 - `--ngpu`: (Optional) Number of GPUs to use for testing (default: 8)
 
 Examples:
 ```bash
 # Run all integration tests with 8 GPUs
-python ./tests/integration_tests.py ./test_output
+python -m tests.integration_tests.integration_tests ./test_output
 
 # Run a specific test with 4 GPUs
-python ./tests/integration_tests.py ./test_output --test default --ngpu 4
+python -m tests.integration_tests.integration_tests ./test_output --test_name tp_only --ngpu 4
 
-# Run all tests with a custom config directory
-python ./tests/integration_tests.py ./test_output --config_dir ./my_configs
+# Run only core functionality tests
+python -m tests.integration_tests.integration_tests ./test_output --test_suite core
 ```
 
 ### Running Unit Tests
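
The argument list documented in the updated README could be expressed with `argparse` roughly as follows. This is a sketch built only from the flags and defaults the README states; the function name `build_parser`, the help strings, and the use of `choices` for `--test_suite` are assumptions, and the actual parser in `integration_tests.py` may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the README's documented flags (illustrative)."""
    parser = argparse.ArgumentParser(description="Integration test runner sketch")
    parser.add_argument("output_dir", help="Directory where test outputs are stored")
    parser.add_argument(
        "--config_path",
        default="./tests/integration_tests/base_config.toml",
        help="Path to the base config file",
    )
    parser.add_argument("--test_name", default="all", help="Specific test to run")
    # Assumption: restricting choices rejects a misspelled suite at parse time.
    parser.add_argument(
        "--test_suite",
        default="all",
        choices=["core", "parallelism", "all"],
        help="Test suite to run",
    )
    parser.add_argument("--model", default="all", help="Model to run tests on")
    parser.add_argument("--ngpu", type=int, default=8, help="Number of GPUs")
    return parser


# Equivalent to the README's "run only core functionality tests" example.
args = build_parser().parse_args(["./test_output", "--test_suite", "core"])
print(args.output_dir, args.test_suite, args.ngpu)
```

With `choices` in place, an unknown suite name causes `argparse` to exit with a usage error instead of silently selecting no tests.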

tests/integration_tests/base_config.toml

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ save_tb_folder = "tb"
 enable_wandb = false
 
 [model]
-name = "deepseek_v3"
+name = "llama3" # option: llama3, deepseek_v3
 flavor = "debugmodel"
 # test tokenizer, for debug purpose only
 tokenizer_path = "./tests/assets/tokenizer"
