Changes from all commits (192 commits)
94b127b
Add Dynamic Quant Method for DSv3-R1
hlin99 Feb 17, 2025
701f77b
Add Support for 2-node vLLM Serving
hlin99 Feb 17, 2025
8a60b23
Recover Graph Warm Up
Wei-Lin-Intel Feb 17, 2025
4cf51d0
Remove Quant Scale Padding Weights
Wei-Lin-Intel Feb 18, 2025
91d31ea
add mark step in deepseek v3 to break graph into small pieces
Wei-Lin-Intel Feb 18, 2025
10d999d
update the ip and gloo
yiliu30 Feb 18, 2025
99f9e4a
fix if name
Feb 18, 2025
608ceb2
fix unquantized method
yiliu30 Feb 18, 2025
931cbfb
patch for inc
Yi4Liu Feb 18, 2025
23ca46c
Merge branch 'yang/deepseek_r1_g2_dynamic_quant' into yi-2nodes
Yi4Liu Feb 18, 2025
e882bd4
remove hpu_fused_moe
yiliu30 Feb 18, 2025
4b70608
disable log
yiliu30 Feb 18, 2025
b89268c
add debug
yiliu30 Feb 18, 2025
3946218
revert debug info
yiliu30 Feb 18, 2025
ca62b4c
uncomment debug
yiliu30 Feb 18, 2025
3b8edd8
add debug info
yiliu30 Feb 18, 2025
6358d2b
disbale init dynamic moe
yiliu30 Feb 18, 2025
c2ed14c
remove clone
yiliu30 Feb 18, 2025
fd9fa96
refine log
yiliu30 Feb 18, 2025
ebc8660
add inc quant config
Yi4Liu Feb 19, 2025
f33414e
add more quant method
Yi4Liu Feb 19, 2025
a4693bb
add more inc to hpu
Yi4Liu Feb 19, 2025
6332d93
add qconfig
Yi4Liu Feb 19, 2025
a4cdd48
add more log
Yi4Liu Feb 19, 2025
4542c1a
fix inc check
Yi4Liu Feb 19, 2025
e77173c
add rank
Yi4Liu Feb 19, 2025
12e1db3
add barr
Yi4Liu Feb 19, 2025
ea0e86f
sleep for dump
Yi4Liu Feb 19, 2025
2110a81
print model
Yi4Liu Feb 19, 2025
f32f724
add rank debug
Yi4Liu Feb 19, 2025
c3bc8ea
print inc model with rank
Yi4Liu Feb 19, 2025
294d0b2
debug more
Yi4Liu Feb 19, 2025
445c833
fixed rank debug
Yi4Liu Feb 19, 2025
c1c226f
fix quant check
Yi4Liu Feb 19, 2025
de72bc8
fix inc check
Yi4Liu Feb 19, 2025
e7f0968
fix the num_expert_group
Yi4Liu Feb 19, 2025
95ececd
fix hidden shape
Yi4Liu Feb 19, 2025
d66d0cb
fix num_expert_group
Yi4Liu Feb 19, 2025
98c1970
fix hidden shape
Yi4Liu Feb 19, 2025
f2c4964
update the num_expert_group
Yi4Liu Feb 19, 2025
ca1d161
revert weight set
Yi4Liu Feb 19, 2025
2df9dc8
disable fused moe init
Yi4Liu Feb 19, 2025
9172b49
add ep rank back
Yi4Liu Feb 19, 2025
62f9abf
fix ep_rank and ep_shift
Yi4Liu Feb 19, 2025
aee1628
clean debug info
Yi4Liu Feb 19, 2025
6d146eb
add envs info
yiliu30 Feb 19, 2025
e10e30f
add example
yiliu30 Feb 19, 2025
583f576
add g5 envs info
yiliu30 Feb 19, 2025
eb40336
add real datasets
Yi4Liu Feb 19, 2025
5f1310e
512 samples
Yi4Liu Feb 19, 2025
1ad1669
not get weight from layer
Yi4Liu Feb 20, 2025
c94fbdf
revert RAY_DEDUP_LOGS
Yi4Liu Feb 20, 2025
0571a4e
add cpu weight for n2_quant
Yi4Liu Feb 20, 2025
671b49e
add long p
Yi4Liu Feb 20, 2025
eded950
add dataset
Yi4Liu Feb 20, 2025
4bd82d5
add prompts for prep and quant
Yi4Liu Feb 20, 2025
9560531
use token directly
Yi4Liu Feb 20, 2025
7bc5e67
fix args
Yi4Liu Feb 20, 2025
0ae617a
update all examples
Yi4Liu Feb 20, 2025
d4337de
add utils
Yi4Liu Feb 20, 2025
a534805
fix
Yi4Liu Feb 20, 2025
2bcdb66
update the prompt to prompt_token_ids
Yi4Liu Feb 20, 2025
b9707ec
upadte the gen
Yi4Liu Feb 20, 2025
f9d9bff
gen pile
Yi4Liu Feb 20, 2025
4763d8e
Correct Accuracy Issue for grouped_topk and Merge pull/13474
Wei-Lin-Intel Feb 20, 2025
c05f401
use pile
Yi4Liu Feb 20, 2025
f27f3b1
refine print
Yi4Liu Feb 20, 2025
285426e
fix print
Yi4Liu Feb 20, 2025
62d7e3c
add smoke test
Yi4Liu Feb 20, 2025
ee252b9
add check nan
Yi4Liu Feb 20, 2025
2b32260
fix check nan
Yi4Liu Feb 20, 2025
bc3a26c
use p for smoke
Yi4Liu Feb 20, 2025
6a2f693
add measurement results on g4
Feb 21, 2025
1f26a84
add measurement results on g5
Feb 21, 2025
1d22c0f
add preapre smoke
Yi4Liu Feb 21, 2025
d91e1f2
use same prompt
Yi4Liu Feb 21, 2025
0a48ade
refine log
Yi4Liu Feb 21, 2025
8c24a4a
correct prepare
Yi4Liu Feb 21, 2025
b5fe4ac
remove low cpu mem in prepare
Yi4Liu Feb 21, 2025
298d4ed
get pile only
Yi4Liu Feb 21, 2025
bd9dc3b
add 4layer ep16 tp16
Feb 21, 2025
289921f
add 4 layers preapre g5
wenchao987 Feb 21, 2025
672b327
move 4 layers json info one folder
Yi4Liu Feb 21, 2025
99ae83c
move 4 layers json info one folder
Yi4Liu Feb 21, 2025
034f5cc
add ep8 tp8 example
Yi4Liu Feb 21, 2025
e92173d
add calibaration result on ep 8 tp 8
Feb 21, 2025
5811e50
add unified results
yiliu30 Feb 21, 2025
1622dd2
add npz
yiliu30 Feb 21, 2025
1dc0f42
add results
yiliu30 Feb 21, 2025
fb6eabf
add res
yiliu30 Feb 21, 2025
138f3f1
replace
yiliu30 Feb 21, 2025
ad8e2f7
update
yiliu30 Feb 21, 2025
688b332
update
yiliu30 Feb 21, 2025
743d56f
update
yiliu30 Feb 21, 2025
f9d49c1
512
yiliu30 Feb 21, 2025
1c8ac47
low cpu mem
yiliu30 Feb 21, 2025
4328b87
use 2024 for quick test
Yi4Liu Feb 22, 2025
00085d6
add docs
Yi4Liu Feb 22, 2025
1ce4308
remove measurments results
Yi4Liu Feb 22, 2025
e3e6abc
update docs and remove measure results
Yi4Liu Feb 22, 2025
82c8d66
eval bf16 model
Yi4Liu Feb 22, 2025
827bc2c
use bs 1
Yi4Liu Feb 22, 2025
a254af8
Merge branch 'yang/deepseek_r1_g2' into p22
Yi4Liu Feb 22, 2025
e8ec061
use g2 model
Yi4Liu Feb 22, 2025
1342352
disbale profile_run
Yi4Liu Feb 22, 2025
af14601
eval qmodel
Yi4Liu Feb 22, 2025
6c3ca68
run lm-eval bf16
Yi4Liu Feb 22, 2025
599ab79
print result as table
Yi4Liu Feb 22, 2025
d15a560
change max len of bf16 to 2048
Yi4Liu Feb 22, 2025
ef2e454
test 128 samples
Yi4Liu Feb 22, 2025
ded5e65
enable mla
Yi4Liu Feb 22, 2025
d67fc7a
Merge branch 'p22' into p22-rebase
Yi4Liu Feb 22, 2025
322b75d
decrease the max_model_len to 2048
Yi4Liu Feb 22, 2025
ce28d86
revert max_model_len
Yi4Liu Feb 22, 2025
4b4e196
update params
Yi4Liu Feb 23, 2025
13830cb
show mem
Yi4Liu Feb 23, 2025
f7c5324
add debug info
Yi4Liu Feb 23, 2025
574fea4
fix
Yi4Liu Feb 23, 2025
9f27adc
add more debug info
Yi4Liu Feb 23, 2025
ed45a38
fix
Yi4Liu Feb 23, 2025
31acb66
fetch one prompt once
Yi4Liu Feb 23, 2025
0aea6f8
fix print
Yi4Liu Feb 23, 2025
52777f0
use bs 1
Yi4Liu Feb 23, 2025
4440ef0
refine log
Yi4Liu Feb 23, 2025
2792f9c
revert gen
Yi4Liu Feb 23, 2025
8b4da84
use bs 1 for eval
Yi4Liu Feb 23, 2025
9dcc21b
fix lm eval
Yi4Liu Feb 23, 2025
77355ff
format code
Yi4Liu Feb 23, 2025
ed40cbd
refine eval
Yi4Liu Feb 23, 2025
110ea6f
test ray
Yi4Liu Feb 23, 2025
d2ae76f
add drop
Yi4Liu Feb 23, 2025
1cfad40
disbale TOKENIZERS_PARALLELISM
Yi4Liu Feb 23, 2025
527744f
test all
Yi4Liu Feb 23, 2025
f4f7b82
add more docs
Yi4Liu Feb 24, 2025
36fe420
run lm-eval one node
Yi4Liu Feb 24, 2025
8481ea6
cp mengni fix
Feb 24, 2025
e76c504
add inc quant smoke demo
Yi4Liu Feb 25, 2025
0f6c44d
del some attrs from self_attn
Yi4Liu Feb 26, 2025
cf7c90e
use fp kv cache
Yi4Liu Feb 26, 2025
ca37f77
update quant example
Yi4Liu Feb 26, 2025
4cc5c75
update example
Yi4Liu Feb 26, 2025
8b24cd8
use custom dataset
Yi4Liu Feb 27, 2025
3b987ea
show mem after layer
Yi4Liu Feb 27, 2025
9d0ad52
add bkc
Yi4Liu Feb 27, 2025
d66ef77
update
Yi4Liu Feb 27, 2025
f3017c7
update
Yi4Liu Feb 27, 2025
3bd8934
test ray
Yi4Liu Feb 28, 2025
96cd65c
test ray
Yi4Liu Feb 28, 2025
5dfceb3
test ray
Yi4Liu Feb 28, 2025
b838fc7
use ds
Yi4Liu Feb 28, 2025
952b107
fix preprocess
Yi4Liu Feb 28, 2025
2323d32
correct test
Yi4Liu Feb 28, 2025
1f0d0dc
revert test envs
Yi4Liu Feb 28, 2025
6087674
test
Yi4Liu Feb 28, 2025
2694ef3
use inc
Yi4Liu Feb 28, 2025
2157a3f
add head worker source
Yi4Liu Feb 28, 2025
87b3dc4
Merge branch 'p22-rebase-kvcache' into p22-rebase-kvcache-tc
Yi4Liu Feb 28, 2025
03fa564
use fp8 kv
Yi4Liu Feb 28, 2025
e045827
update docs
Yi4Liu Feb 28, 2025
b10b769
add one node example
Yi4Liu Feb 28, 2025
763a0a6
udate config
Yi4Liu Feb 28, 2025
2c19766
fix tp
Yi4Liu Feb 28, 2025
3a57502
rename
Yi4Liu Feb 28, 2025
f72c0ca
fix
Yi4Liu Feb 28, 2025
3e6d237
fix
Yi4Liu Feb 28, 2025
dfae309
refine remove duplicate submodules
Yi4Liu Feb 28, 2025
6ffbf2b
fix _inc_preprocess
Yi4Liu Feb 28, 2025
93c74d9
fix kvcache
Feb 28, 2025
ec20d53
fix remove attr
Feb 28, 2025
9c35535
update docs
Yi4Liu Feb 28, 2025
632a34d
refine code
Yi4Liu Feb 28, 2025
9f018a3
add two nodes example
Yi4Liu Feb 28, 2025
d97df56
update example name
Yi4Liu Feb 28, 2025
3edfbfe
update docs
Yi4Liu Feb 28, 2025
e034b0f
udpate mode
Yi4Liu Feb 28, 2025
1c053db
update source file name
Yi4Liu Feb 28, 2025
2e57568
update docs
Yi4Liu Feb 28, 2025
f09a5e7
update docs
Yi4Liu Feb 28, 2025
feb243f
update
Yi4Liu Feb 28, 2025
24a6980
unsert QUANT_CONFIG
Yi4Liu Feb 28, 2025
0037d97
fix smoke
Yi4Liu Feb 28, 2025
0201815
update docs
Yi4Liu Feb 28, 2025
1cf6094
update doc
Yi4Liu Feb 28, 2025
44cd7b8
add toc
Yi4Liu Feb 28, 2025
bf13cc7
update
Yi4Liu Feb 28, 2025
e047f50
update docs
Yi4Liu Feb 28, 2025
657d14a
fix docs
Yi4Liu Feb 28, 2025
e491e88
update
Yi4Liu Feb 28, 2025
ffc3543
update docs
Yi4Liu Feb 28, 2025
9f7f549
clean code
Yi4Liu Feb 28, 2025
d51beca
add convert script
Yi4Liu Feb 28, 2025
074c268
update
Yi4Liu Feb 28, 2025
182 changes: 182 additions & 0 deletions scripts/QuantizeDeepSeek.md
@@ -0,0 +1,182 @@
# BKC for Quantizing DeepSeek V3/R1 with vLLM and INC

<!-- TOC -->

- [BKC for Quantizing DeepSeek V3/R1 with vLLM and INC](#bkc-for-quantizing-deepseek-v3r1-with-vllm-and-inc)
- [Support Matrix](#support-matrix)
- [Setting Up the Two-Node Environment](#setting-up-the-two-node-environment)
- [Prerequisites](#prerequisites)
- [Install Dependencies](#install-dependencies)
- [Exporting Environment Variables](#exporting-environment-variables)
- [Calibration](#calibration)
- [Inference with FP8 Models on Two Nodes](#inference-with-fp8-models-on-two-nodes)
- [Inference with FP8 Models on a Single Node WIP](#inference-with-fp8-models-on-a-single-node-wip)
- [Prerequisites](#prerequisites-1)
- [Running the Example](#running-the-example)
- [Accuracy Evaluation WIP](#accuracy-evaluation-wip)
- [Calibration with a Custom Dataset WIP](#calibration-with-a-custom-dataset-wip)

<!-- /TOC -->

This document outlines the steps for using vLLM and INC to calibrate DeepSeek R1 on two nodes, and to perform quantization and inference on either two nodes or a single node.

## Support Matrix

- Calibration Stage (Two Nodes)

| KVCache Precision | Configs |
|---|---|
| BF16 | `inc_measure_config.json` |
| FP8 | `inc_measure_with_fp8kv_config.json`|

- Quantize/Inference Stage

| KVCache Precision | Two Nodes Configs | One Node Configs |
|---|---|---|
| BF16 | `inc_quant_config.json` | `inc_quant_one_node_config.json`|
| FP8 | `inc_quant_with_fp8kv_config.json`| `inc_quant_with_fp8kv_one_node_config.json`|
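
For reference, an INC measurement config typically looks like the sketch below. This is a minimal illustration assuming the standard INC FP8 config schema; the field values (notably `dump_stats_path`) are placeholders, not the contents of the config files in this PR. The quantization configs switch `"mode"` to `QUANTIZE` and additionally set a `scale_method`.

```json
{
    "method": "HOOKS",
    "mode": "MEASURE",
    "observer": "maxabs",
    "allowlist": {"types": [], "names": []},
    "blocklist": {"types": [], "names": []},
    "dump_stats_path": "./nc_workspace/measure"
}
```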


## Setting Up the Two-Node Environment

> [!NOTE]
> If you want to quantize the model using an existing calibration result, you can skip this step and proceed directly to the `Inference with FP8 Models on a Single Node` section.

We use Ray to set up a two-node cluster so that vLLM can treat the 16 cards across both machines as a single system. It is crucial that both nodes run the same software stack, so Docker containers are used to guarantee a consistent environment. The high-level steps are as follows, with a minimal Ray sketch after the list:

- Build and run Docker on each node.
- Export the necessary environment variables within each Docker container.
- Start the Ray cluster on the head node and connect the worker node to it.
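
Conceptually, the cluster is brought up like this. This is a minimal sketch: the port number and the `ray status` check are illustrative, not commands taken from this PR.

```bash
# Head node (inside its container): start the Ray head process.
ray start --head --port=6379

# Worker node (inside its container): join the head node's cluster.
ray start --address=<head_node_ip>:6379

# On either node: confirm that both nodes (16 HPUs total) are visible.
ray status
```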

For more details, please refer to <https://github.com/yangulei/vllm-fork/blob/deepseek_r1_g2/scripts/multi_nodes_README.md>.

### Prerequisites

- Hardware: 2x8G2 or 2x8G3
- Docker: 1.20.0-521

### Install Dependencies

- INC TBD

```bash
git clone TBD inc
cd inc
git checkout dev/yi/quant_ds
python setup.py pt develop
```

- vLLM TBD

```bash
git clone TBD vllm
cd vllm
git checkout TBD
pip install -r requirements-hpu.txt
VLLM_TARGET_DEVICE=hpu pip install -e . --no-build-isolation
```

- Model
- DeepSeek R1 (BF16)
- Script for converting the original FP8 checkpoint to BF16: `convert_fp8_to_bf16_cpu.py` (its core dequantization step is sketched below)
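
The conversion itself is not shown in this document. The sketch below illustrates only the core per-block dequantization step, assuming DeepSeek-style FP8 block quantization with 128x128 block scales stored as `weight_scale_inv`; it is not the actual `convert_fp8_to_bf16_cpu.py`.

```python
import torch

def dequant_block_fp8_to_bf16(weight: torch.Tensor,
                              scale_inv: torch.Tensor,
                              block_size: int = 128) -> torch.Tensor:
    """Upcast one FP8 weight tensor to BF16 using its per-block scales."""
    w = weight.to(torch.float32)
    rows, cols = w.shape
    for i in range(scale_inv.shape[0]):
        for j in range(scale_inv.shape[1]):
            r0, r1 = i * block_size, min((i + 1) * block_size, rows)
            c0, c1 = j * block_size, min((j + 1) * block_size, cols)
            # Each scale entry covers one block_size x block_size tile.
            w[r0:r1, c0:c1] *= scale_inv[i, j].item()
    return w.to(torch.bfloat16)
```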

### Exporting Environment Variables

> [!NOTE]
> Please update the `HCCL_SOCKET_IFNAME` and `GLOO_SOCKET_IFNAME` variables in the `head_node_source.sh` and `worker_node_source.sh` scripts with the name of the network interface on each machine.

- Head Node

```bash
source head_node_source.sh
```

- Worker Node

```bash
source worker_node_source.sh
```
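
For orientation, the networking portion of those scripts looks roughly like the hypothetical excerpt below; `eth0` is an assumption, so substitute the interface name reported by `ip addr` on your machines.

```bash
# Hypothetical excerpt of head_node_source.sh / worker_node_source.sh.
export HCCL_SOCKET_IFNAME=eth0   # NIC used for HCCL collectives
export GLOO_SOCKET_IFNAME=eth0   # NIC used by the Gloo backend
```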

> [!TIP]
> - Please start Ray in the SAME directory within both Docker containers.
> - If you modify the environment variables, please RESTART Ray.

## Calibration

From the vLLM root directory, navigate to the `scripts` folder and run the calibration script. This step runs the BF16 model over a calibration dataset so that INC can record the dynamic ranges of the model's weights and activations.

- BF16 KVCache

```bash
# vllm root
export QUANT_CONFIG=inc_measure_config.json
# restart ray
cd vllm/scripts
python inc_example_two_nodes.py --mode prepare
```

- FP8 KVCache
```bash
# vllm root
export QUANT_CONFIG=inc_measure_with_fp8kv_config.json
# restart ray
cd vllm/scripts
python inc_example_two_nodes.py --mode prepare
```


## Inference with FP8 Models on Two Nodes

The quantization script loads the BF16 model into DRAM, moves it to the HPUs, and quantizes it layer by layer using the statistics gathered during calibration.

- BF16 KVCache
```bash
# vllm root
export QUANT_CONFIG=inc_quant_config.json
# restart ray
cd vllm/scripts
python inc_example_two_nodes.py --mode quant
```

- FP8 KVCache
```bash
# vllm root
export QUANT_CONFIG=inc_quant_with_fp8kv_config.json
# restart ray
cd vllm/scripts
python inc_example_two_nodes.py --mode quant --fp8_kvcache
```

## Inference with FP8 Models on a Single Node (WIP)

In this section, we load the BF16 model into DRAM and quantize it to an FP8 model using the unified measurement results obtained from the two-node calibration.
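
Here "unified" refers to folding the per-rank measurement files from the 16-card run into statistics usable on 8 cards. As a conceptual sketch only: maxabs-style statistics can be merged by taking elementwise maxima across files. The nested-JSON layout and the merge rule below are assumptions about the measurement format, not the actual INC unification tooling.

```python
import glob
import json

def merge_max(a, b):
    """Elementwise maximum over two nested measurement structures."""
    if isinstance(a, dict):
        return {k: merge_max(a[k], b[k]) for k in a}
    if isinstance(a, list):
        return [merge_max(x, y) for x, y in zip(a, b)]
    return max(a, b)

def unify(pattern: str, out_path: str) -> None:
    """Merge all measurement JSONs matching `pattern` into one file."""
    merged = None
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            stats = json.load(f)
        merged = stats if merged is None else merge_max(merged, stats)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(merged, f)
```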

### Prerequisites

- Hardware: 1x8G3 or 1x8G2 (WIP), 2 TB DRAM
- Docker: 1.20.0-521

### Running the Example

- Quantizing the model weights to FP8 while keeping a BF16 KVCache is still WIP.


- BF16 KVCache
```bash
# vllm root
cd vllm/scripts
# Download the unified calibration results
huggingface-cli download TBD --local-dir nc_workspace_measure_one_node
QUANT_CONFIG=inc_quant_one_node_config.json python inc_example_one_node.py
```

- FP8 KVCache
```bash
# vllm root
cd vllm/scripts
# Download the unified calibration results
huggingface-cli download Yi30/inc-tp8-ep8-full-kvcache-from-tp16-ep16 --local-dir nc_workspace_measure_kvache_one_node
QUANT_CONFIG=inc_quant_with_fp8kv_one_node_config.json python inc_example_one_node.py --fp8_kvcache
```

## Accuracy Evaluation (WIP)

## Calibration with a Custom Dataset (WIP)
53 changes: 53 additions & 0 deletions scripts/Quantize_BF16_R1_on_Single_Note.md
@@ -0,0 +1,53 @@
# Notes on Quantizing DeepSeek V3/R1 with vLLM and INC

## Prerequisites

- Hardware: ~~2xG3~~ ~~2x8XG3 or 2x8XG2~~ 8XG2 or 8XG3
- Docker: 1.20.0-521

- INC https://github.com/intel/neural-compressor/tree/dev/yi/quant_vllm-patch-19

```bash
git clone https://github.com/intel/neural-compressor.git inc
cd inc
git checkout dev/yi/quant_vllm-patch-19
pip install -r requirements.txt
pip install -r requirements_pt.txt
python setup.py pt develop
```
- vLLM https://github.com/yiliu30/vllm-fork/pull/13

```bash
cd vllm
pip install -r requirements-hpu.txt
VLLM_TARGET_DEVICE=hpu pip install -e . --no-build-isolation
```
- Model
- ~~Reduced DeepSeek V3 model (4 layers with random weights)~~
- ~~Reduced DeepSeek V3 model (4 layers with real weights)~~
- DeepSeek R1 (BF16)

## Example
- Quantize the BF16 model using the unified measurement results on 2x8XG2.


```bash
# vllm root
cd vllm
cd scripts
# Download the unified measurement results
# Make sure that the `nc_workspace_tmp` is under the `scripts` folder.
git clone https://huggingface.co/Yi30/nc_workspace_tmp
# Run example
python n2_ep8_tp8.py --mode q
```

> [!CAUTION]
> - The `QUANT_CONFIG` path is hard-coded in [1](https://github.com/yiliu30/vllm-fork/blob/bc3a26c3d6143b6405ef9af7e06f6eddcbcbdad0/scripts/g4_multi_nodes_source.sh#L34C8-L34C20) and [2](https://github.com/yiliu30/vllm-fork/blob/bc3a26c3d6143b6405ef9af7e06f6eddcbcbdad0/scripts/g5_multi_nodes_source.sh#L38).
> - `VLLMKVCache`, `KVCache`, and `lm-head` are currently skipped during quantization; they will be added back.
> - ~~FAKE `EP` was hard-coded as 16. Please check `TEMP_EP` in vllm and `DEEPSEEK_EP` in INC.~~


## Others
1. Measured on 2x8G2 w/ 513 samples: https://huggingface.co/Yi30/nc_workspace_tmp_pile_512_backup
2. 4-layer smoke test on 8G2: https://huggingface.co/Yi30/nc_workspace_tmp_4l_ep8_tp8
3. Merged result of 1): https://huggingface.co/Yi30/nc_workspace_tmp
4. 4 layers on 2x8G2: https://huggingface.co/Yi30/nc_workspace_tmp_4l_smoke
41 changes: 41 additions & 0 deletions scripts/check_nan.py
@@ -0,0 +1,41 @@
import os
import json
import math


def check_values(obj, key_path="", filename=""):
"""Recursively checks if innermost values are valid numbers, prints issues."""
if isinstance(obj, dict):
for key, value in obj.items():
new_key_path = f"{key_path}.{key}" if key_path else key
check_values(value, new_key_path, filename)
elif isinstance(obj, list):
for idx, item in enumerate(obj):
check_values(item, f"{key_path}[{idx}]", filename)
else:
if (
not isinstance(obj, (int, float))
or math.isnan(obj)
or math.isinf(obj)
):
print(f"Invalid number in {filename} at '{key_path}': {obj}")


def check_json_files(directory):
"""Iterates through all JSON files in a directory and checks their values."""
for filename in os.listdir(directory):
if "mod_list" in filename:
continue
if filename.endswith(".json"):
filepath = os.path.join(directory, filename)
try:
with open(filepath, "r", encoding="utf-8") as file:
data = json.load(file)
check_values(data, filename=filename)
except (json.JSONDecodeError, IOError) as e:
print(f"Error reading {filename}: {e}")


if __name__ == "__main__":
    # Directory containing the INC measurement JSON files to validate.
    json_directory = "./nc_workspace_tmp/"  # Change this to your actual directory
    check_json_files(json_directory)