Tatar file (tat.traineddata) does not appear to be for the standard Cyrillic script

The tessdata collection includes a file ```tat.traineddata```. Although several Turkic languages spoken across Eurasia are referred to as “Tatar”, the ISO 639-2 code ```tat``` generally refers to Kazan Tatar, the language spoken in and around the Republic of Tatarstan in Russia. The Kazan Tatar standard language has used a Cyrillic alphabet since 1939. This is the script in which the many books written in Tatar since then have been printed.

However, ```tat.traineddata``` does not actually work on Tatar in this standard Cyrillic script. Running Tesseract with the argument ```-l tat``` on a post-1939 book from the Soviet Union or Russia fails to recognize the script as Cyrillic and instead outputs gibberish in the Latin alphabet. Attached as an example is a page from a collection of Tatar texts (Galieva, _Татар теленнән текстлар_, Kazan, 2010).

[tatar-sample.pdf](https://github.com/user-attachments/files/20840648/tatar-sample.pdf)
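For anyone who wants to check the symptom on their own scans, here is a minimal sketch that measures how much of the OCR output is Cyrillic. It assumes the `tesseract` binary is on the PATH and that the page has already been rasterized to an image; the function names and the image path are hypothetical, not part of Tesseract.

```python
import subprocess
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count letters per script, using the first word of each Unicode character name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            counts[name.split(" ")[0] if name else "UNKNOWN"] += 1
    return counts


def cyrillic_share(text: str) -> float:
    """Return the fraction of letters in `text` that are Cyrillic (0.0 if there are no letters)."""
    counts = script_counts(text)
    total = sum(counts.values())
    return counts["CYRILLIC"] / total if total else 0.0


def ocr_with_tat(image_path: str) -> str:
    """Run Tesseract on an image with the `tat` model and return the recognized text.

    Assumption: `image_path` is a hypothetical raster image (e.g. a PNG
    rendered from the sample page), since Tesseract reads images, not PDFs.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-l", "tat"],
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout
```

Calling `cyrillic_share(ocr_with_tat("page.png"))` on a page of Cyrillic Tatar should give a value close to 1.0 if the model handles the script; a value near 0.0 would match the Latin-only output described above.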

The provenance of this trained data file, and the exact language and script it was trained for, should be clarified, and the file should be renamed to something more specific.
