Description
Version
0.0.3
Operating System
Windows
Python Version
3.11
What happened?
I installed synthetic-data-kit using pip, following the official guide. Ingesting the PDF worked fine, and my vLLM server was also running successfully.
But when I tried to generate QA pairs from a large document, it failed because the tool was sending too many tokens at once. I found that the tool supports chunking using the --chunk-size option, which should fix this. But when I tried using it, the CLI gave an error saying "No such option: --chunk-size."
It seems the installed version doesn’t have chunking support yet, even though it’s mentioned in the docs. So right now, I’m stuck because I can’t split large files into smaller parts through the CLI.
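A possible stopgap until the option is available is to split the ingested text file by hand and run `create` on each piece separately. The sketch below is only an illustration of that idea, not how synthetic-data-kit chunks internally: the helper names `split_text` and `write_chunks`, the `_partNNN.txt` file naming, and measuring `chunk_size` in characters rather than tokens are all assumptions of mine.

```python
from pathlib import Path


def split_text(text: str, chunk_size: int = 2000) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Paragraphs (separated by blank lines) are kept together where possible;
    a paragraph longer than chunk_size is cut at fixed character offsets.
    """
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        # Hard-split oversized paragraphs so every piece fits in one chunk.
        pieces = [para[i:i + chunk_size] for i in range(0, len(para), chunk_size)] or [""]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= chunk_size:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


def write_chunks(src: Path, out_dir: Path, chunk_size: int = 2000) -> list[Path]:
    """Write each chunk of src to out_dir as <stem>_partNNN.txt (hypothetical naming)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(split_text(src.read_text(encoding="utf-8"), chunk_size)):
        path = out_dir / f"{src.stem}_part{i:03d}.txt"
        path.write_text(chunk, encoding="utf-8")
        paths.append(path)
    return paths


# Example (paths are placeholders):
# write_chunks(Path("data/parsed/constitution.txt"), Path("data/parsed/chunks"))
```

Each resulting file could then be passed to `synthetic-data-kit -c config.yaml create <chunk_file> --type qa` one at a time. Because the size here is counted in characters, it would need to be chosen conservatively so that each request stays under the model's context limit.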
Relevant log output
Output of ingestion
$ synthetic-data-kit ingest constitution.pdf
Text successfully extracted to data/output\constitution.txt
vLLM System Check
$ synthetic-data-kit -c config.yaml system-check
VLLM server is running at http://<ip_address>:8081/v1
Available models: {'object': 'list', 'data': [{'id': 'Qwen/Qwen2.5-Coder-7B-Instruct', 'object': 'model', 'created': 1752137272, 'owned_by': 'vllm', 'root':
'Qwen/Qwen2.5-Coder-7B-Instruct', 'parent': None, 'max_model_len': 16000, 'permission': [{'id': 'modelperm-1042654a9de14e8a83503cad498fc0a0', 'object':
'model_permission', 'created': 1752137272, 'allow_create_engine': False, 'allow_sampling': True, 'allow_logprobs': True, 'allow_search_indices': False, 'allow_view':
True, 'allow_fine_tuning': False, 'organization': '*', 'group': None, 'is_blocking': False}]}]}
Steps to reproduce
- Installed synthetic-data-kit through pip using the command in the documentation
pip install synthetic-data-kit
- Then ingested a long, 500-page PDF using the following command
synthetic-data-kit ingest <long_document>.pdf
- Tested the vLLM server using the system-check command and got a successful response.
- Next, I started generating QA pairs using the following command:
$ synthetic-data-kit -c config.yaml create data/parsed/constitution.txt --type qa
L Error: Failed to get completion after 5 attempts: 400 Client Error: Bad Request for url: http://<ip_address>:8081/v1/chat/completions
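I checked the vLLM logs and found that the tool was sending around 270,000 tokens, far more than the model's max_model_len of 16000 reported by system-check, which is why the request fails.
So I realised I need to chunk the document before generating the QA pairs.
- So I referred to the documentation and tried the --chunk-size option: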
$ synthetic-data-kit -c config.yaml create data/parsed/constitution.txt --type qa --chunk-size 2000
Usage: synthetic-data-kit create [OPTIONS] INPUT
Try 'synthetic-data-kit create --help' for help.
╭─ Error ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ No such option: --chunk-size