Chunk Size parameter not working in CLI #50

@AbhijithMallya

Description

Version

0.0.3

Operating System

Windows

Python Version

3.11

What happened?

I installed synthetic-data-kit using pip, following the official guide. Ingesting the PDF worked fine, and my vLLM server was also running successfully.
But when I tried to generate QA pairs from a large document, it failed because the tool sent too many tokens in a single request. I found that the tool is supposed to support chunking via the --chunk-size option, which should fix this. But when I tried it, the CLI returned an error saying "No such option: --chunk-size."

It seems the installed version doesn’t have chunking support yet, even though it’s mentioned in the docs. So right now, I’m stuck because I can’t split large files into smaller parts through the CLI.

Relevant log output

Output of ingestion

$ synthetic-data-kit ingest constitution.pdf 
 Text successfully extracted to data/output\constitution.txt   


vLLM System Check 
$ synthetic-data-kit -c config.yaml system-check
 VLLM server is running at http://<ip_address>:8081/v1
Available models: {'object': 'list', 'data': [{'id': 'Qwen/Qwen2.5-Coder-7B-Instruct', 'object': 'model', 'created': 1752137272, 'owned_by': 'vllm', 'root':
'Qwen/Qwen2.5-Coder-7B-Instruct', 'parent': None, 'max_model_len': 16000, 'permission': [{'id': 'modelperm-1042654a9de14e8a83503cad498fc0a0', 'object':
'model_permission', 'created': 1752137272, 'allow_create_engine': False, 'allow_sampling': True, 'allow_logprobs': True, 'allow_search_indices': False, 'allow_view':   
True, 'allow_fine_tuning': False, 'organization': '*', 'group': None, 'is_blocking': False}]}]}

Steps to reproduce

  1. Installed synthetic-data-kit through pip using the command in the documentation:

pip install synthetic-data-kit

  2. Then performed ingestion of a long PDF (around 500 pages) using:

synthetic-data-kit ingest <long_document>.pdf

  3. Tested the vLLM server using the system-check command and got a successful response.

  4. Next I started generating QA pairs with:

$ synthetic-data-kit -c config.yaml create data/parsed/constitution.txt --type qa
Error: Failed to get completion after 5 attempts: 400 Client Error: Bad Request for url: http://<ip_address>:8081/v1/chat/completions

I checked the vLLM logs and found that the tool was sending around 270000 tokens in one request, which is why it was failing. So I realised I needed to chunk the document before generating the QA pairs.
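As a rough sanity check (my own back-of-the-envelope estimate, not something synthetic-data-kit computes), assuming roughly 4 characters per token, the parsed text's token count can be estimated and compared against the server's max_model_len of 16000 shown in the system-check output:

```python
# Rough back-of-the-envelope token estimate, assuming ~4 characters
# per token (a common heuristic for English text; the real tokenizer
# count will differ).
from pathlib import Path

def estimate_tokens(path, chars_per_token=4):
    """Estimate the token count of a text file from its character count."""
    text = Path(path).read_text(encoding="utf-8")
    return len(text) // chars_per_token

MAX_MODEL_LEN = 16000  # from the system-check output above
# estimate_tokens("data/parsed/constitution.txt") comes out far above
# MAX_MODEL_LEN for a 500-page document, which matches the 400 error.
```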

  5. So I referred to the documentation and tried the --chunk-size option:
$ synthetic-data-kit -c config.yaml create data/parsed/constitution.txt --type qa --chunk-size 2000
Usage: synthetic-data-kit create [OPTIONS] INPUT
Try 'synthetic-data-kit create --help' for help.
╭─ Error ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ No such option: --chunk-size    
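As a stopgap until --chunk-size works, I split the parsed text into overlapping pieces myself and run create on each piece. This is a hand-rolled workaround sketch; the chunk size, overlap, and output directory are my own choices, not synthetic-data-kit settings:

```python
# Hand-rolled workaround: split the ingested text into smaller files
# and run `create` on each one separately. chunk_chars and overlap are
# arbitrary choices, not values taken from synthetic-data-kit.
from pathlib import Path

def split_text(path, chunk_chars=8000, overlap=200):
    """Yield overlapping character-based chunks of the file at `path`."""
    text = Path(path).read_text(encoding="utf-8")
    step = chunk_chars - overlap
    for start in range(0, len(text), step):
        yield text[start:start + chunk_chars]

def write_chunks(path, out_dir="data/parsed/chunks"):
    """Write each chunk to its own .txt file and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(split_text(path)):
        p = out / f"chunk_{i:03d}.txt"
        p.write_text(chunk, encoding="utf-8")
        paths.append(p)
    return paths
```

Each resulting chunk_NNN.txt can then be passed to `synthetic-data-kit create` individually, keeping every request well under the server's 16000-token limit.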

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working)
