Skip to content

Conversation

@N-E-W-T-O-N
Copy link

@N-E-W-T-O-N N-E-W-T-O-N commented Aug 9, 2025

Modern Tokenizer store merge's value as list of lists of strings. But models like GPT-2 or gemma-2 store as a list of strings
Ex Mistral

Added a check for this fix

@N-E-W-T-O-N N-E-W-T-O-N requested a review from a team as a code owner August 9, 2025 18:33
@N-E-W-T-O-N
Copy link
Author

Mistral

merges": [
      [
        "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n",
        "\n"
      ],
      [
        "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁",
        "▁"
      ],
      [

@N-E-W-T-O-N
Copy link
Author

Gpt-2

"merges":["Ġ t","Ġ a","h " ...

@N-E-W-T-O-N
Copy link
Author

N-E-W-T-O-N commented Aug 11, 2025

@shaaga @sayanshaw24

@sayanshaw24
Copy link
Collaborator

Modern Tokenizer store merge's value as list of lists of strings. But models like GPT-2 or gemma-2 store as a list of strings Ex Mistral

Added a check for this fix

awesome, thanks for this! could you add a unit test here as well to ensure expected functionality: https://github.com/microsoft/onnxruntime-extensions/tree/main/test?

@sayanshaw24
Copy link
Collaborator

/azp run onnxruntime-extensions.CI

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@N-E-W-T-O-N
Copy link
Author

Using phi-4 as an example

@N-E-W-T-O-N
Copy link
Author

N-E-W-T-O-N commented Aug 13, 2025

The following Functions is only called when use_fast=True used in passed to tokenizer

Copy link
Collaborator

@sayanshaw24 sayanshaw24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use test/data/phi-4-base or test/data/phi-4-mini-reasoning, we'd like to minimize tokenizer files checked into the repo; (also, FYI onnxruntime-extensions only needs the tokenizer.json and tokenizer_config.json files to load tokenizers).

@azure-pipelines
Copy link

Commenter does not have sufficient privileges for PR 987 in repo microsoft/onnxruntime-extensions

@N-E-W-T-O-N
Copy link
Author

N-E-W-T-O-N commented Aug 19, 2025

Hi @sayanshaw24
Revert back the change as requested and updated the branch .
Please merge my pr

@sayanshaw24
Copy link
Collaborator

/azp run onnxruntime-extensions.CI

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Author

@N-E-W-T-O-N N-E-W-T-O-N left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge the latest code .Please rerun the pipeline

@snnn
Copy link
Member

snnn commented Sep 10, 2025

/azp where

@azure-pipelines
Copy link

Azure DevOps orgs getting events for this repository:

Copy link
Author

@N-E-W-T-O-N N-E-W-T-O-N left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sayanshaw24 pr please merge my PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants