Support tekken tokenizer conversion #1498
base: master
Conversation
Walkthrough

The changes introduce support for a new tokenizer format, "tekken.json", across the codebase. This includes updating macros to prioritize "tekken.json" over "tokenizer.json", implementing logic to load and convert "tekken.json" files, and making the relevant modules and functions public.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant PipelineMacros
    participant UtilsTokenizer
    User->>PipelineMacros: Request model/tokenizer paths
    PipelineMacros->>PipelineMacros: Check for "tekken.json" in directory
    alt "tekken.json" found
        PipelineMacros->>UtilsTokenizer: Load "tekken.json"
        UtilsTokenizer->>UtilsTokenizer: Parse Tekken format, convert to HuggingFace JSON
        UtilsTokenizer-->>PipelineMacros: Return Tokenizer
    else "tokenizer.json" found
        PipelineMacros->>UtilsTokenizer: Load "tokenizer.json"
        UtilsTokenizer-->>PipelineMacros: Return Tokenizer
    else none found
        PipelineMacros-->>User: Return empty path or error
    end
```
Code Metrics Report

```
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                3           62           53            0            9
 CSS                     1          473          408           14           51
 Dockerfile              1           44           24           10           10
 HTML                    1           78           64            5            9
 JavaScript              7         1397         1068          180          149
 JSON                   18          282          279            0            3
 Makefile                1            6            5            0            1
 Python                 92         4530         3813          185          532
 Shell                   1           63           26           18           19
 Plain Text              3         3723            0         2413         1310
 TOML                   23          846          778           11           57
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       3            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               69         6171            0         4657         1514
 |- BASH                14          203          190            7            6
 |- JSON                 9          436          436            0            0
 |- Python              12          391          326           24           41
 |- Rust                29         1123          945           30          148
 |- TOML                 2           75           63            0           12
 (Total)                           8399         1960         4718         1721
-------------------------------------------------------------------------------
 Rust                  397       140359       124634         3209        12516
 |- Markdown           191         4084          284         3259          541
 (Total)                         144443       124918         6468        13057
===============================================================================
 Total                 622       158055       131171        10704        16180
===============================================================================
```
Actionable comments posted: 3
🧹 Nitpick comments (1)
mistralrs-core/src/pipeline/macros.rs (1)
388-397: Ensure consistent error handling between macros.

The error handling differs from the `get_paths!` macro. Here it returns an empty `PathBuf` when no tokenizer is found, while `get_paths!` would panic. Consider making the behavior consistent.

For consistency with the main `get_paths!` macro, consider whether this should also handle missing tokenizers consistently:

```diff
 } else if dir_list.contains(&"tokenizer.json".to_string()) {
     info!("Loading `tokenizer.json` at `{}`", this_model_id);
     $crate::api_get_file!(api, "tokenizer.json", model_id)
 } else {
-    PathBuf::from_str("")?
+    // Consider: should this warn or handle missing tokenizer differently?
+    info!("No tokenizer file found at `{}`", this_model_id);
+    PathBuf::from_str("")?
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (6)
- `Cargo.toml` (1 hunks)
- `mistralrs-core/Cargo.toml` (1 hunks)
- `mistralrs-core/src/lib.rs` (1 hunks)
- `mistralrs-core/src/pipeline/macros.rs` (2 hunks)
- `mistralrs-core/src/utils/mod.rs` (1 hunks)
- `mistralrs-core/src/utils/tokenizer.rs` (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: comment
🔇 Additional comments (9)
Cargo.toml (1)
172-172: Verify tiktoken-rs version currency and security status.

Ensure the specified version is current and free from known vulnerabilities.

```bash
#!/bin/bash
# Check the latest version of tiktoken-rs and any security advisories
echo "Checking latest version of tiktoken-rs..."
curl -s https://crates.io/api/v1/crates/tiktoken-rs | jq -r '.crate.newest_version'
echo "Checking for security advisories..."
curl -s https://forge.rust-lang.org/infra/cargo-layout.html 2>/dev/null || echo "Manual check required for security advisories"
```

mistralrs-core/Cargo.toml (1)
105-105: LGTM!

Correctly references the workspace dependency for tiktoken-rs.
mistralrs-core/src/utils/tokenizer.rs (7)
4-4: LGTM: Base64 dependency addition is appropriate.

The `base64::Engine` import is necessary for decoding base64-encoded token bytes in the tekken format.
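As a rough illustration (not the PR's exact code), decoding one entry's bytes with the `base64` crate's engine API could look like this; `token_bytes_b64` is a hypothetical stand-in for the entry's encoded field:

```rust
use base64::Engine;

// Hypothetical sketch: decode one base64-encoded token payload from a
// tekken.json vocab entry into raw bytes.
fn decode_token(token_bytes_b64: &str) -> anyhow::Result<Vec<u8>> {
    let bytes = base64::engine::general_purpose::STANDARD.decode(token_bytes_b64)?;
    Ok(bytes)
}
```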
15-27: Well-structured data models for tekken format.

The `TekkenVocabEntry` and `TekkenTokenizer` structs appropriately represent the tekken.json structure with proper field types and deserialization support.
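A minimal sketch of what such a data model could look like; the field names here (`rank`, `token_bytes`, `token_str`) are assumptions based on the review's description, not the PR's actual definitions:

```rust
use serde::Deserialize;

// Assumed shape of the tekken.json data model; field names are guesses.
#[derive(Deserialize)]
struct TekkenVocabEntry {
    rank: u32,                 // doubles as the token ID
    token_bytes: String,       // base64-encoded raw bytes
    token_str: Option<String>, // optional human-readable form
}

#[derive(Deserialize)]
struct TekkenTokenizer {
    vocab: Vec<TekkenVocabEntry>,
}
```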
29-45: Robust vocabulary building with proper error handling.

The function correctly (see the sketch after this list):
- Deserializes the tekken format
- Decodes base64 token bytes with proper error propagation
- Handles optional token strings with fallback to lossy UTF-8 conversion
- Uses rank as the token ID mapping
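Under the assumed structs above, those steps might combine roughly as follows; this is a sketch, not the PR's implementation:

```rust
use std::collections::HashMap;
use base64::Engine;

// Sketch of the steps listed above: decode each entry's bytes, prefer the
// provided string form, fall back to lossy UTF-8, and key tokens by rank.
fn build_vocab(tok: &TekkenTokenizer) -> anyhow::Result<HashMap<String, u32>> {
    let mut vocab = HashMap::new();
    for entry in &tok.vocab {
        let bytes = base64::engine::general_purpose::STANDARD.decode(&entry.token_bytes)?;
        let token = entry
            .token_str
            .clone()
            .unwrap_or_else(|| String::from_utf8_lossy(&bytes).into_owned());
        vocab.insert(token, entry.rank); // rank serves as the token ID
    }
    Ok(vocab)
}
```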
47-89: Comprehensive HuggingFace tokenizer JSON construction.

The implementation creates a well-structured HF-compatible tokenizer configuration. The empty `merges` array in the BPE model is appropriate for tekken tokenizers, which don't use traditional BPE merge rules.
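For orientation, an HF-compatible BPE model with an empty `merges` list can be assembled with `serde_json`; the exact fields the PR emits are not reproduced here:

```rust
use serde_json::json;
use std::collections::HashMap;

// Sketch: the general shape of a HuggingFace tokenizer JSON with a BPE
// model and no merge rules, as described above.
fn to_hf_tokenizer_json(vocab: &HashMap<String, u32>) -> serde_json::Value {
    json!({
        "version": "1.0",
        "model": {
            "type": "BPE",
            "vocab": vocab,
            "merges": [] // tekken tokenizers carry no BPE merge rules
        }
    })
}
```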
90-93: Clean tokenizer instantiation with proper error handling.

The conversion to bytes and tokenizer creation follows the standard pattern with appropriate error mapping.
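The standard pattern referred to is presumably along these lines, serializing the JSON and handing the bytes to the `tokenizers` crate:

```rust
use tokenizers::Tokenizer;

// Sketch: serialize the constructed JSON to bytes and build a Tokenizer,
// mapping the library's boxed error into anyhow.
fn tokenizer_from_json(json: &serde_json::Value) -> anyhow::Result<Tokenizer> {
    let bytes = serde_json::to_vec(json)?;
    Tokenizer::from_bytes(&bytes).map_err(anyhow::Error::msg)
}
```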
96-96: Visibility change aligns with module exposure.

Changing from `pub(crate)` to `pub` is consistent with making the tokenizer utilities publicly accessible as mentioned in the AI summary.
100-116: Well-implemented tekken.json detection and handling.

The filename-based detection is simple and reliable. The early return pattern maintains clean code flow, and proper special token handling is preserved.
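Tying the sketches above together, the detection-plus-early-return flow could look like this; `load_tokenizer` and the helpers it calls are hypothetical names, not the PR's API:

```rust
use std::path::Path;
use tokenizers::Tokenizer;

// Sketch: filename-based detection with an early return for tekken.json,
// reusing the hypothetical helpers sketched above.
fn load_tokenizer(path: &Path) -> anyhow::Result<Tokenizer> {
    if path.file_name().and_then(|f| f.to_str()) == Some("tekken.json") {
        let raw = std::fs::read_to_string(path)?;
        let tekken: TekkenTokenizer = serde_json::from_str(&raw)?;
        let vocab = build_vocab(&tekken)?;
        return tokenizer_from_json(&to_hf_tokenizer_json(&vocab));
    }
    Tokenizer::from_file(path).map_err(anyhow::Error::msg)
}
```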
```diff
 pub(crate) mod normal;
 pub(crate) mod progress;
-pub(crate) mod tokenizer;
+pub mod tokenizer;
```
🛠️ Refactor suggestion
❓ Verification inconclusive
Consider API design implications of exposing tokenizer module.
Changing from `pub(crate)` to `pub` makes the tokenizer module part of the public API. Ensure this is the intended design and consider if specific functions should be exposed instead of the entire module.
Consider whether exposing the entire tokenizer module is necessary or if specific public functions would be more appropriate for the intended use case.
🤖 Prompt for AI Agents
In mistralrs-core/src/utils/mod.rs at line 7, the tokenizer module is changed
from pub(crate) to pub, making it part of the public API. Review whether
exposing the entire tokenizer module is necessary or if only specific functions
should be made public. If only certain functions need to be accessible, revert
the module to pub(crate) and selectively expose those functions with pub
visibility to maintain a cleaner API.
```diff
 mod tools;
 mod topology;
-mod utils;
+pub mod utils;
```
🛠️ Refactor suggestion
Reconsider exposing entire utils module publicly.
This is a significant breaking API change that exposes many internal utilities. Consider exposing only the specific tokenizer functionality needed rather than the entire utils module.
For more controlled API exposure, consider:

```diff
-pub mod utils;
+mod utils;
+pub use utils::tokenizer;
```

This would expose only the tokenizer functionality while keeping other utilities internal.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-pub mod utils;
+mod utils;
+pub use utils::tokenizer;
```
🤖 Prompt for AI Agents
In mistralrs-core/src/lib.rs at line 75, instead of publicly exposing the entire
utils module, refactor the code to expose only the specific tokenizer
functionality needed. Change the pub mod utils declaration to keep the module
internal and add a public re-export or a dedicated public module that exposes
only the tokenizer components. This limits the public API surface and prevents
exposing unrelated internal utilities.
```rust
// Try to find tekken.json first, then fallback to tokenizer.json
let mut dir_list = $crate::api_dir_list!(api, &$this.model_id, true);
let has_tekken = dir_list.any(|f| f == "tekken.json");
if has_tekken {
    info!("Loading `tekken.json` at `{}`", $this.model_id);
    $crate::api_get_file!(api, "tekken.json", model_id)
} else {
    info!("Loading `tokenizer.json` at `{}`", $this.model_id);
    $crate::api_get_file!(api, "tokenizer.json", model_id)
}
```
Fix iterator consumption bug and improve logic.
The `dir_list` iterator is consumed by `.any()` on line 143, but the same `dir_list` is used again on line 171; since `.any()` advances the iterator, the later use silently sees only the remaining entries rather than the full directory listing.
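A minimal self-contained demonstration of the hazard, using a hypothetical file list:

```rust
fn main() {
    // `.any()` advances the iterator up to and including the first match,
    // so a later pass over the same binding only sees what is left.
    let mut files = vec!["config.json", "tekken.json", "model.safetensors"].into_iter();
    let has_tekken = files.any(|f| f == "tekken.json");
    assert!(has_tekken);
    let remaining: Vec<_> = files.collect();
    assert_eq!(remaining, ["model.safetensors"]); // earlier entries were consumed
}
```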
Apply this fix:

```diff
-// Try to find tekken.json first, then fallback to tokenizer.json
-let mut dir_list = $crate::api_dir_list!(api, &$this.model_id, true);
-let has_tekken = dir_list.any(|f| f == "tekken.json");
-if has_tekken {
+// Try to find tekken.json first, then fallback to tokenizer.json
+let dir_list: Vec<String> = $crate::api_dir_list!(api, &$this.model_id, true).collect();
+if dir_list.contains(&"tekken.json".to_string()) {
     info!("Loading `tekken.json` at `{}`", $this.model_id);
     $crate::api_get_file!(api, "tekken.json", model_id)
 } else {
     info!("Loading `tokenizer.json` at `{}`", $this.model_id);
     $crate::api_get_file!(api, "tokenizer.json", model_id)
 }
```

Also update line 171 to use the collected vector:

```diff
-let dir_list = $crate::api_dir_list!(api, model_id, false).collect::<Vec<_>>();
+let dir_list = dir_list;
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In mistralrs-core/src/pipeline/macros.rs around lines 141 to 150, the iterator
`dir_list` is consumed by the `.any()` method, making it unusable later in the
code. To fix this, collect the iterator into a vector before using `.any()`, so
it can be reused safely. Update the code to first collect `dir_list` into a
vector, then check for "tekken.json" using `.any()`, and later use the collected
vector instead of the original iterator on line 171.
Summary by CodeRabbit

New Features

- Support for loading "tekken.json" tokenizer files and converting them to a HuggingFace-compatible tokenizer.

Improvements

- Tokenizer resolution now checks for "tekken.json" before falling back to "tokenizer.json".