Support tekken tokenizer conversion #1498
base: master
Conversation
Walkthrough

The changes introduce support for a new tokenizer format, "tekken.json", across the codebase. This includes updating macros to prioritize "tekken.json" over "tokenizer.json", implementing logic to load and convert "tekken.json" files, and making the relevant modules and functions public.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant PipelineMacros
    participant UtilsTokenizer
    User->>PipelineMacros: Request model/tokenizer paths
    PipelineMacros->>PipelineMacros: Check for "tekken.json" in directory
    alt "tekken.json" found
        PipelineMacros->>UtilsTokenizer: Load "tekken.json"
        UtilsTokenizer->>UtilsTokenizer: Parse Tekken format, convert to HuggingFace JSON
        UtilsTokenizer-->>PipelineMacros: Return Tokenizer
    else "tokenizer.json" found
        PipelineMacros->>UtilsTokenizer: Load "tokenizer.json"
        UtilsTokenizer-->>PipelineMacros: Return Tokenizer
    else none found
        PipelineMacros-->>User: Return empty path or error
    end
```
Code Metrics Report

```
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                3           62           53            0            9
 CSS                     1          473          408           14           51
 Dockerfile              1           44           24           10           10
 HTML                    1           78           64            5            9
 JavaScript              7         1397         1068          180          149
 JSON                   18          282          279            0            3
 Makefile                1            6            5            0            1
 Python                 92         4530         3813          185          532
 Shell                   1           63           26           18           19
 Plain Text              3         3723            0         2413         1310
 TOML                   23          846          778           11           57
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       3            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               69         6171            0         4657         1514
 |- BASH                14          203          190            7            6
 |- JSON                 9          436          436            0            0
 |- Python              12          391          326           24           41
 |- Rust                29         1123          945           30          148
 |- TOML                 2           75           63            0           12
 (Total)                           8399         1960         4718         1721
-------------------------------------------------------------------------------
 Rust                  397       140359       124634         3209        12516
 |- Markdown           191         4084          284         3259          541
 (Total)                         144443       124918         6468        13057
===============================================================================
 Total                 622       158055       131171        10704        16180
===============================================================================
```
Actionable comments posted: 3
🧹 Nitpick comments (1)
mistralrs-core/src/pipeline/macros.rs (1)
388-397: Ensure consistent error handling between macros.

The error handling differs from the `get_paths!` macro. Here it returns an empty `PathBuf` when no tokenizer is found, while `get_paths!` would panic. Consider making the behavior consistent.

For consistency with the main `get_paths!` macro, consider whether this should also handle missing tokenizers consistently:

```diff
 } else if dir_list.contains(&"tokenizer.json".to_string()) {
     info!("Loading `tokenizer.json` at `{}`", this_model_id);
     $crate::api_get_file!(api, "tokenizer.json", model_id)
 } else {
-    PathBuf::from_str("")?
+    // Consider: should this warn or handle missing tokenizer differently?
+    info!("No tokenizer file found at `{}`", this_model_id);
+    PathBuf::from_str("")?
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (6)
- `Cargo.toml` (1 hunks)
- `mistralrs-core/Cargo.toml` (1 hunks)
- `mistralrs-core/src/lib.rs` (1 hunks)
- `mistralrs-core/src/pipeline/macros.rs` (2 hunks)
- `mistralrs-core/src/utils/mod.rs` (1 hunks)
- `mistralrs-core/src/utils/tokenizer.rs` (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: comment
🔇 Additional comments (9)
Cargo.toml (1)
172-172: Verify tiktoken-rs version currency and security status.

Ensure the specified version is current and free from known vulnerabilities.

```bash
#!/bin/bash
# Check the latest version of tiktoken-rs and any security advisories
echo "Checking latest version of tiktoken-rs..."
curl -s https://crates.io/api/v1/crates/tiktoken-rs | jq -r '.crate.newest_version'
echo "Checking for security advisories..."
curl -s https://forge.rust-lang.org/infra/cargo-layout.html 2>/dev/null || echo "Manual check required for security advisories"
```

mistralrs-core/Cargo.toml (1)
105-105: LGTM!

Correctly references the workspace dependency for tiktoken-rs.
mistralrs-core/src/utils/tokenizer.rs (7)
4-4: LGTM: Base64 dependency addition is appropriate.

The `base64::Engine` import is necessary for decoding base64-encoded token bytes in the tekken format.
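As a rough illustration (not the PR's exact code), decoding one entry's bytes with the `base64` crate's engine API could look like this; `token_bytes_b64` is a hypothetical stand-in for the entry's encoded field:

```rust
use base64::Engine;

// Hypothetical sketch: decode one base64-encoded token payload from a
// tekken.json vocab entry into raw bytes.
fn decode_token(token_bytes_b64: &str) -> anyhow::Result<Vec<u8>> {
    let bytes = base64::engine::general_purpose::STANDARD.decode(token_bytes_b64)?;
    Ok(bytes)
}
```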
15-27: Well-structured data models for tekken format.

The `TekkenVocabEntry` and `TekkenTokenizer` structs appropriately represent the tekken.json structure with proper field types and deserialization support.
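A minimal sketch of what such a data model could look like; the field names here (`rank`, `token_bytes`, `token_str`) are assumptions based on the review's description, not the PR's actual definitions:

```rust
use serde::Deserialize;

// Assumed shape of the tekken.json data model; field names are guesses.
#[derive(Deserialize)]
struct TekkenVocabEntry {
    rank: u32,                 // doubles as the token ID
    token_bytes: String,       // base64-encoded raw bytes
    token_str: Option<String>, // optional human-readable form
}

#[derive(Deserialize)]
struct TekkenTokenizer {
    vocab: Vec<TekkenVocabEntry>,
}
```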
29-45: Robust vocabulary building with proper error handling.

The function correctly (see the sketch after this list):
- Deserializes the tekken format
- Decodes base64 token bytes with proper error propagation
- Handles optional token strings with fallback to lossy UTF-8 conversion
- Uses rank as the token ID mapping
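Under the assumed structs above, those steps might combine roughly as follows; this is a sketch, not the PR's implementation:

```rust
use std::collections::HashMap;
use base64::Engine;

// Sketch of the steps listed above: decode each entry's bytes, prefer the
// provided string form, fall back to lossy UTF-8, and key tokens by rank.
fn build_vocab(tok: &TekkenTokenizer) -> anyhow::Result<HashMap<String, u32>> {
    let mut vocab = HashMap::new();
    for entry in &tok.vocab {
        let bytes = base64::engine::general_purpose::STANDARD.decode(&entry.token_bytes)?;
        let token = entry
            .token_str
            .clone()
            .unwrap_or_else(|| String::from_utf8_lossy(&bytes).into_owned());
        vocab.insert(token, entry.rank); // rank serves as the token ID
    }
    Ok(vocab)
}
```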
47-89: Comprehensive HuggingFace tokenizer JSON construction.

The implementation creates a well-structured HF-compatible tokenizer configuration. The empty `merges` array in the BPE model is appropriate for tekken tokenizers, which don't use traditional BPE merge rules.
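For orientation, an HF-compatible BPE model with an empty `merges` list can be assembled with `serde_json`; the exact fields the PR emits are not reproduced here:

```rust
use serde_json::json;
use std::collections::HashMap;

// Sketch: the general shape of a HuggingFace tokenizer JSON with a BPE
// model and no merge rules, as described above.
fn to_hf_tokenizer_json(vocab: &HashMap<String, u32>) -> serde_json::Value {
    json!({
        "version": "1.0",
        "model": {
            "type": "BPE",
            "vocab": vocab,
            "merges": [] // tekken tokenizers carry no BPE merge rules
        }
    })
}
```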
90-93: Clean tokenizer instantiation with proper error handling.

The conversion to bytes and tokenizer creation follows the standard pattern with appropriate error mapping.
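The standard pattern referred to is presumably along these lines, serializing the JSON and handing the bytes to the `tokenizers` crate:

```rust
use tokenizers::Tokenizer;

// Sketch: serialize the constructed JSON to bytes and build a Tokenizer,
// mapping the library's boxed error into anyhow.
fn tokenizer_from_json(json: &serde_json::Value) -> anyhow::Result<Tokenizer> {
    let bytes = serde_json::to_vec(json)?;
    Tokenizer::from_bytes(&bytes).map_err(anyhow::Error::msg)
}
```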
96-96: Visibility change aligns with module exposure.

Changing from `pub(crate)` to `pub` is consistent with making the tokenizer utilities publicly accessible as mentioned in the AI summary.
100-116: Well-implemented tekken.json detection and handling.

The filename-based detection is simple and reliable. The early return pattern maintains clean code flow, and proper special token handling is preserved.
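Tying the sketches above together, the detection-plus-early-return flow could look like this; `load_tokenizer` and the helpers it calls are hypothetical names, not the PR's API:

```rust
use std::path::Path;
use tokenizers::Tokenizer;

// Sketch: filename-based detection with an early return for tekken.json,
// reusing the hypothetical helpers sketched above.
fn load_tokenizer(path: &Path) -> anyhow::Result<Tokenizer> {
    if path.file_name().and_then(|f| f.to_str()) == Some("tekken.json") {
        let raw = std::fs::read_to_string(path)?;
        let tekken: TekkenTokenizer = serde_json::from_str(&raw)?;
        let vocab = build_vocab(&tekken)?;
        return tokenizer_from_json(&to_hf_tokenizer_json(&vocab));
    }
    Tokenizer::from_file(path).map_err(anyhow::Error::msg)
}
```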
```diff
 pub(crate) mod normal;
 pub(crate) mod progress;
-pub(crate) mod tokenizer;
+pub mod tokenizer;
```
🛠️ Refactor suggestion
❓ Verification inconclusive
Consider API design implications of exposing tokenizer module.
Changing from `pub(crate)` to `pub` makes the tokenizer module part of the public API. Ensure this is the intended design and consider if specific functions should be exposed instead of the entire module.
Consider whether exposing the entire tokenizer module is necessary or if specific public functions would be more appropriate for the intended use case.
🤖 Prompt for AI Agents
In mistralrs-core/src/utils/mod.rs at line 7, the tokenizer module is changed
from pub(crate) to pub, making it part of the public API. Review whether
exposing the entire tokenizer module is necessary or if only specific functions
should be made public. If only certain functions need to be accessible, revert
the module to pub(crate) and selectively expose those functions with pub
visibility to maintain a cleaner API.
```diff
 mod tools;
 mod topology;
-mod utils;
+pub mod utils;
```
🛠️ Refactor suggestion
Reconsider exposing entire utils module publicly.
This is a significant breaking API change that exposes many internal utilities. Consider exposing only the specific tokenizer functionality needed rather than the entire utils module.
For more controlled API exposure, consider:

```diff
-pub mod utils;
+mod utils;
+pub use utils::tokenizer;
```

This would expose only the tokenizer functionality while keeping other utilities internal.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-pub mod utils;
+mod utils;
+pub use utils::tokenizer;
```
🤖 Prompt for AI Agents
In mistralrs-core/src/lib.rs at line 75, instead of publicly exposing the entire
utils module, refactor the code to expose only the specific tokenizer
functionality needed. Change the pub mod utils declaration to keep the module
internal and add a public re-export or a dedicated public module that exposes
only the tokenizer components. This limits the public API surface and prevents
exposing unrelated internal utilities.
```rust
// Try to find tekken.json first, then fallback to tokenizer.json
let mut dir_list = $crate::api_dir_list!(api, &$this.model_id, true);
let has_tekken = dir_list.any(|f| f == "tekken.json");
if has_tekken {
    info!("Loading `tekken.json` at `{}`", $this.model_id);
    $crate::api_get_file!(api, "tekken.json", model_id)
} else {
    info!("Loading `tokenizer.json` at `{}`", $this.model_id);
    $crate::api_get_file!(api, "tokenizer.json", model_id)
}
```
Fix iterator consumption bug and improve logic.
The `dir_list` iterator is consumed by `.any()` on line 143, but the same `dir_list` is used again on line 171; since `.any()` advances the iterator, the later use silently sees only the remaining entries rather than the full directory listing.
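A minimal self-contained demonstration of the hazard, using a hypothetical file list:

```rust
fn main() {
    // `.any()` advances the iterator up to and including the first match,
    // so a later pass over the same binding only sees what is left.
    let mut files = vec!["config.json", "tekken.json", "model.safetensors"].into_iter();
    let has_tekken = files.any(|f| f == "tekken.json");
    assert!(has_tekken);
    let remaining: Vec<_> = files.collect();
    assert_eq!(remaining, ["model.safetensors"]); // earlier entries were consumed
}
```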
Apply this fix:

```diff
-// Try to find tekken.json first, then fallback to tokenizer.json
-let mut dir_list = $crate::api_dir_list!(api, &$this.model_id, true);
-let has_tekken = dir_list.any(|f| f == "tekken.json");
-if has_tekken {
+// Try to find tekken.json first, then fallback to tokenizer.json
+let dir_list: Vec<String> = $crate::api_dir_list!(api, &$this.model_id, true).collect();
+if dir_list.contains(&"tekken.json".to_string()) {
     info!("Loading `tekken.json` at `{}`", $this.model_id);
     $crate::api_get_file!(api, "tekken.json", model_id)
 } else {
     info!("Loading `tokenizer.json` at `{}`", $this.model_id);
     $crate::api_get_file!(api, "tokenizer.json", model_id)
 }
```

Also update line 171 to use the collected vector:

```diff
-let dir_list = $crate::api_dir_list!(api, model_id, false).collect::<Vec<_>>();
+let dir_list = dir_list;
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In mistralrs-core/src/pipeline/macros.rs around lines 141 to 150, the iterator
`dir_list` is consumed by the `.any()` method, making it unusable later in the
code. To fix this, collect the iterator into a vector before using `.any()`, so
it can be reused safely. Update the code to first collect `dir_list` into a
vector, then check for "tekken.json" using `.any()`, and later use the collected
vector instead of the original iterator on line 171.
Summary by CodeRabbit

New Features

- Support for loading "tekken.json" tokenizer files and converting them to a HuggingFace-compatible tokenizer.

Improvements

- Tokenizer resolution now checks for "tekken.json" before falling back to "tokenizer.json".