stephantul

In the original model, the stemming algorithm was applied after the original terms had already been indexed. This caused indexing errors whenever two different terms shared the same stem.

Before:

_, y = d.process_document("hello hellos")
print(y)
# {"hello": 2}

Now:

_, y = d.process_document("hello hellos")
print(y)
# {"hello": 1}

This raises scores on NanoBEIR slightly.
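
For readers who want to see why the ordering matters, here is a minimal, self-contained sketch. It is not the code from this PR: `toy_stem`, `index_then_stem`, and `stem_then_index` are hypothetical names, the "stemmer" just strips a trailing `s`, and the dictionaries map a term to a vocabulary id, which is only an assumption about what an index looks like. It does not try to reproduce the exact values printed above.

```python
def toy_stem(term: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips one trailing 's'."""
    if len(term) > 1 and term.endswith("s"):
        return term[:-1]
    return term


def index_then_stem(terms: list[str]) -> dict[str, int]:
    """Assign ids to the raw terms first, then stem the keys (problematic order)."""
    raw_ids: dict[str, int] = {}
    for term in terms:
        raw_ids.setdefault(term, len(raw_ids))

    stemmed_ids: dict[str, int] = {}
    for term, idx in raw_ids.items():
        # Raw terms that share a stem collide here: the later id silently
        # overwrites the earlier one, and the earlier id is left unused.
        stemmed_ids[toy_stem(term)] = idx
    return stemmed_ids


def stem_then_index(terms: list[str]) -> dict[str, int]:
    """Stem first, then assign ids, so each stem gets exactly one id."""
    ids: dict[str, int] = {}
    for term in terms:
        ids.setdefault(toy_stem(term), len(ids))
    return ids


tokens = "hello hellos world".split()
buggy = index_then_stem(tokens)
fixed = stem_then_index(tokens)
print(buggy)
print(fixed)
```

In this toy setup the first ordering yields ids with a gap (id 0 is never used), while stemming first yields contiguous ids, one per stem.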

@stephantul stephantul changed the base branch from master to wandb July 15, 2025 07:58