Description
The tessdata collection includes a file tat.traineddata. Although several Turkic languages spoken across Eurasia are referred to as “Tatar”, the ISO 639-2 code tat is generally used to refer to the Kazan Tatar language, spoken in and around the Republic of Tatarstan in Russia. The Kazan Tatar standard language has, since 1939, used a Cyrillic alphabet. This is the script in which all the myriad books written in Tatar since 1939 have been printed.
However, tat.traineddata does not actually work on Tatar in this standard Cyrillic script. Running Tesseract with the argument -l tat on a post-1939 book from the Soviet Union or Russia fails to recognize the script as Cyrillic, and instead outputs gibberish in the Latin alphabet. Attached is a page from a collection of Tatar texts (Galieva, Татар теленнән текстлар, Kazan, 2010) as an example.
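For anyone who wants to confirm the symptom on their own scans, here is a rough sketch of a script check. It is not part of Tesseract: the helper names script_counts and dominant_script are made up for illustration, and it assumes the OCR output has already been saved as UTF-8 text. It simply tallies letters by the script named in their Unicode character names.

```python
import unicodedata


def script_counts(text):
    """Count alphabetic characters by script, using Unicode character names."""
    counts = {"CYRILLIC": 0, "LATIN": 0, "OTHER": 0}
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["CYRILLIC"] += 1
        elif name.startswith("LATIN"):
            counts["LATIN"] += 1
        else:
            counts["OTHER"] += 1
    return counts


def dominant_script(text):
    """Return the most frequent script label, or None if there are no letters."""
    counts = script_counts(text)
    if sum(counts.values()) == 0:
        return None
    return max(counts, key=counts.get)


if __name__ == "__main__":
    # Title of the sample book, standing in for correctly recognized output.
    print(dominant_script("Татар теленнән текстлар"))
```

If the text produced with tat.traineddata from a Cyrillic page comes back as mostly LATIN, that matches the behavior described above.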
The provenance of this trained data file, and the exact language and script it was trained for, should be clarified and the file should be renamed to something more specific.