Issue with dataset of Llama-3.1-8b #2377

@mahmoodn

Hi,
I used the following commands to download the llama-3.1-8b dataset, as described in the README.

bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
  https://inference.mlcommons-storage.org/metadata/llama3-1-8b-cnn-eval.uri
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
  https://inference.mlcommons-storage.org/metadata/llama3-1-8b-sample-cnn-eval-5000.uri
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
  https://inference.mlcommons-storage.org/metadata/llama3-1-8b-cnn-dailymail-calibration.uri

The following files were downloaded into the dataset folder:

llama3.1-8b]$ ls dataset
cnn_dailymail_calibration.json  llama3-1-8b-cnn-dailymail-calibration.md5  llama3-1-8b-sample-cnn-eval-5000.md5
cnn_eval.json                   llama3-1-8b-cnn-eval.md5                   sample_cnn_eval_5000.json

However, the inference command fails. It first warns that the processed pickle file was not found and then raises an IsADirectoryError:

llama3.1-8b]$ export DATASET_PATH=$LLAMA_FOLDER/dataset
llama3.1-8b]$ export CHECKPOINT_PATH=$LLAMA_FOLDER/Llama-3.1-8B-Instruct

llama3.1-8b]$ python -u main.py --scenario Offline \
                --model-path ${CHECKPOINT_PATH} \
                --batch-size 16 \
                --dtype bfloat16 \
                --user-conf user.conf \
                --total-sample-count 13368 \
                --dataset-path ${DATASET_PATH} \
                --output-log-dir output \
                --tensor-parallel-size ${GPU_COUNT} \
                --vllm
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
INFO:datasets:PyTorch version 2.4.0 available.
WARNING:Llama-8B-Dataset:Processed pickle file /scratch/mn/inference/language/llama3.1-8b/dataset not found. Please check that the path is correct
INFO:Llama-8B-Dataset:Loading dataset...
Traceback (most recent call last):
  File "/scratch/mn/inference/language/llama3.1-8b/main.py", line 216, in <module>
    main()
  File "/scratch/mn/inference/language/llama3.1-8b/main.py", line 173, in main
    sut = sut_cls(
  File "/scratch/mn/inference/language/llama3.1-8b/SUT_VLLM.py", line 56, in __init__
    self.data_object = Dataset(
  File "/scratch/mn/inference/language/llama3.1-8b/dataset.py", line 36, in __init__
    self.load_processed_dataset()
  File "/scratch/mn/inference/language/llama3.1-8b/dataset.py", line 52, in load_processed_dataset
    self.processed_data = pd.read_json(self.dataset_path)
  File "/home/mn/.local/lib/python3.10/site-packages/pandas/io/json/_json.py", line 791, in read_json
    json_reader = JsonReader(
  File "/home/mn/.local/lib/python3.10/site-packages/pandas/io/json/_json.py", line 904, in __init__
    data = self._get_data_from_filepath(filepath_or_buffer)
  File "/home/mn/.local/lib/python3.10/site-packages/pandas/io/json/_json.py", line 944, in _get_data_from_filepath
    self.handles = get_handle(
  File "/home/mn/.local/lib/python3.10/site-packages/pandas/io/common.py", line 873, in get_handle
    handle = open(
IsADirectoryError: [Errno 21] Is a directory: '/scratch/mn/inference/language/llama3.1-8b/dataset'

I also don't understand why it raises an error saying that the dataset path is a directory. What else should it be?
Any ideas?
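
For reference, the traceback shows dataset.py passing self.dataset_path straight to pd.read_json, which opens it as a single file, so a directory cannot work there. Below is a minimal sketch of how a directory could be resolved to one of the downloaded JSON files before loading. The helper name resolve_dataset_file and the default file name are assumptions made for illustration; they are not part of the benchmark code.

```python
from pathlib import Path


def resolve_dataset_file(dataset_path, default_name="cnn_eval.json"):
    """Return the path of a JSON file, given either a file or a directory.

    Hypothetical helper for illustration only; not part of main.py or dataset.py.
    """
    path = Path(dataset_path)
    if path.is_dir():
        # open() on a directory raises IsADirectoryError,
        # so look for the expected file inside the directory instead.
        candidate = path / default_name
        if not candidate.is_file():
            raise FileNotFoundError(f"{default_name} not found in {path}")
        return candidate
    if not path.is_file():
        raise FileNotFoundError(f"{path} does not exist")
    return path
```

If that reading of the traceback is right, pointing DATASET_PATH at a file rather than the folder (for example $LLAMA_FOLDER/dataset/cnn_eval.json) would likely avoid the IsADirectoryError. Which of the three JSON files the Offline scenario expects is an assumption here and should be checked against the README.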
