Chunk Size parameter not working in CLI #50

@AbhijithMallya

Description

Version

0.0.3

Operating System

Windows

Python Version

3.11

What happened?

I installed synthetic-data-kit using pip, following the official guide. Ingesting the PDF worked fine, and my vLLM server was also running successfully.
But when I tried to generate QA pairs from a large document, it failed because the tool sent too many tokens in a single request. I found that the tool is supposed to support chunking via the --chunk-size option, which should fix this. But when I tried it, the CLI returned an error saying "No such option: --chunk-size."

It seems the installed version doesn’t have chunking support yet, even though it’s mentioned in the docs. So right now, I’m stuck because I can’t split large files into smaller parts through the CLI.

Relevant log output

Output of ingestion

$ synthetic-data-kit ingest constitution.pdf 
 Text successfully extracted to data/output\constitution.txt   


vLLM System Check 
$ synthetic-data-kit -c config.yaml system-check
 VLLM server is running at http://<ip_address>:8081/v1
Available models: {'object': 'list', 'data': [{'id': 'Qwen/Qwen2.5-Coder-7B-Instruct', 'object': 'model', 'created': 1752137272, 'owned_by': 'vllm', 'root':
'Qwen/Qwen2.5-Coder-7B-Instruct', 'parent': None, 'max_model_len': 16000, 'permission': [{'id': 'modelperm-1042654a9de14e8a83503cad498fc0a0', 'object':
'model_permission', 'created': 1752137272, 'allow_create_engine': False, 'allow_sampling': True, 'allow_logprobs': True, 'allow_search_indices': False, 'allow_view':   
True, 'allow_fine_tuning': False, 'organization': '*', 'group': None, 'is_blocking': False}]}]}

Steps to reproduce

  1. Installed synthetic-data-kit through pip using the command in the documentation:

pip install synthetic-data-kit

  2. Then performed ingestion of a long PDF (around 500 pages) using:

synthetic-data-kit ingest <long_document>.pdf

  3. Tested the vLLM server using the system-check command and got a successful response.

  4. Next I started generating QA pairs with:

$ synthetic-data-kit -c config.yaml create data/parsed/constitution.txt --type qa
Error: Failed to get completion after 5 attempts: 400 Client Error: Bad Request for url: http://<ip_address>:8081/v1/chat/completions

I checked the vLLM logs and found that the tool was sending around 270000 tokens in one request, which is why it was failing. So I realised I needed to chunk the document before generating the QA pairs.
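As a rough sanity check (my own back-of-the-envelope estimate, not something synthetic-data-kit computes), assuming roughly 4 characters per token, the parsed text's token count can be estimated and compared against the server's max_model_len of 16000 shown in the system-check output:

```python
# Rough back-of-the-envelope token estimate, assuming ~4 characters
# per token (a common heuristic for English text; the real tokenizer
# count will differ).
from pathlib import Path

def estimate_tokens(path, chars_per_token=4):
    """Estimate the token count of a text file from its character count."""
    text = Path(path).read_text(encoding="utf-8")
    return len(text) // chars_per_token

MAX_MODEL_LEN = 16000  # from the system-check output above
# estimate_tokens("data/parsed/constitution.txt") comes out far above
# MAX_MODEL_LEN for a 500-page document, which matches the 400 error.
```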

  5. So I referred to the documentation and tried the --chunk-size option:
$ synthetic-data-kit -c config.yaml create data/parsed/constitution.txt --type qa --chunk-size 2000
Usage: synthetic-data-kit create [OPTIONS] INPUT
Try 'synthetic-data-kit create --help' for help.
╭─ Error ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ No such option: --chunk-size    
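As a stopgap until --chunk-size works, I split the parsed text into overlapping pieces myself and run create on each piece. This is a hand-rolled workaround sketch; the chunk size, overlap, and output directory are my own choices, not synthetic-data-kit settings:

```python
# Hand-rolled workaround: split the ingested text into smaller files
# and run `create` on each one separately. chunk_chars and overlap are
# arbitrary choices, not values taken from synthetic-data-kit.
from pathlib import Path

def split_text(path, chunk_chars=8000, overlap=200):
    """Yield overlapping character-based chunks of the file at `path`."""
    text = Path(path).read_text(encoding="utf-8")
    step = chunk_chars - overlap
    for start in range(0, len(text), step):
        yield text[start:start + chunk_chars]

def write_chunks(path, out_dir="data/parsed/chunks"):
    """Write each chunk to its own .txt file and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(split_text(path)):
        p = out / f"chunk_{i:03d}.txt"
        p.write_text(chunk, encoding="utf-8")
        paths.append(p)
    return paths
```

Each resulting chunk_NNN.txt can then be passed to `synthetic-data-kit create` individually, keeping every request well under the server's 16000-token limit.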

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working)
