Commit 57f5897

Added a scorer based on the OpenSubtitles corpus
1 parent 3493093 commit 57f5897

10 files changed

+1096649
-0
lines changed

lm/opensubtitles/README.md

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+## Extended scorer built from several sources, keeping all diacritics
+
+- The test, dev, and train files of the Common Voice dataset (10/12/2019)
+- Sentences from the Crowdsourced high-quality Catalan speech data set (https://www.openslr.org/69/)
+- Sentences from the AnCora dataset, taken from the Universal Dependencies collection (https://github.com/UniversalDependencies/UD_Catalan-AnCora)
+- Sentences from the Catalan corpus of OpenSubtitles (http://opus.nlpl.eu/OpenSubtitles2018.php)
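
The README names the sentence sources but the commit does not show how they were combined into `frases.txt`. As a rough sketch only, the step could look like the following. The function names `normalize` and `merge_sources` are hypothetical, and the filtering rule (lowercase, keep only sentences whose characters all appear in `alphabet.txt`, with a space as word separator) is an assumption, not something this commit confirms.

```python
import unicodedata

# Assumption: the characters listed in lm/opensubtitles/alphabet.txt, plus a
# space used as the word separator.
ALLOWED = set("aàbcçdeèéfghiíïjklmnoòópqrstuúüvwxyz'-· ")


def normalize(sentence):
    """Lowercase, NFC-compose, and collapse whitespace.

    Returns None when the sentence is empty or contains any character
    outside ALLOWED (digits, punctuation, foreign letters, ...).
    """
    text = unicodedata.normalize("NFC", sentence).lower()
    text = " ".join(text.split())
    if not text or any(ch not in ALLOWED for ch in text):
        return None
    return text


def merge_sources(sources):
    """Concatenate several iterables of raw sentences, dropping rejected ones."""
    merged = []
    for lines in sources:
        for line in lines:
            clean = normalize(line)
            if clean is not None:
                merged.append(clean)
    return merged
```

Each element of `sources` can be an open text file (for example one of the files under `raw/`), since iterating a file yields its lines. Rejecting a sentence outright rather than stripping the offending characters is a design choice of this sketch; it avoids feeding the language model sentences with silently removed words.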

lm/opensubtitles/alphabet.txt

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+# Each line in this file represents the Unicode codepoint (UTF-8 encoded)
+# associated with a numeric label.
+# A line that starts with # is a comment. You can escape it with \# if you wish
+# to use '#' as a label.
+
+a
+à
+b
+c
+ç
+d
+e
+è
+é
+f
+g
+h
+i
+í
+ï
+j
+k
+l
+m
+n
+o
+ò
+ó
+p
+q
+r
+s
+t
+u
+ú
+ü
+v
+w
+x
+y
+z
+'
+-
+·
+# The last (non-comment) line needs to end with a newline.
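
The header comments of `alphabet.txt` describe a small file format: one label per line, `#` for comments, and `\#` to use a literal `#` as a label. A minimal sketch of a reader for that format is shown below. The function name `load_alphabet` is hypothetical, and numbering labels from 0 in file order is an assumption about how the numeric labels are assigned.

```python
def load_alphabet(lines):
    """Map each label character to a numeric index, in file order."""
    labels = {}
    for raw in lines:
        # Strip only the newline: a line holding a single space may be a label.
        entry = raw.rstrip("\n")
        if entry.startswith("\\#"):
            # Escaped hash: drop the backslash and keep '#' as a label.
            entry = entry[1:]
        elif entry.startswith("#"):
            # Comment line.
            continue
        if entry == "":
            continue
        labels[entry] = len(labels)
    return labels
```

It can be called on an open file, for example `load_alphabet(open("alphabet.txt", encoding="utf-8"))`. Whether the apparently blank fifth line of this file holds a space label cannot be told from the rendered diff, so the sketch keeps a lone space and skips only truly empty lines.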

lm/opensubtitles/frases.txt

Lines changed: 482993 additions & 0 deletions
Large diffs are not rendered by default.

lm/opensubtitles/kenlm.scorer

16.9 MB
Binary file not shown.

lm/opensubtitles/lm.binary

13.4 MB
Binary file not shown.

lm/opensubtitles/raw/ancora.txt

Lines changed: 16678 additions & 0 deletions
Large diffs are not rendered by default.

lm/opensubtitles/raw/commonvoice.txt

Lines changed: 79633 additions & 0 deletions
Large diffs are not rendered by default.

lm/opensubtitles/raw/crowdsourced.txt

Lines changed: 4240 additions & 0 deletions
Large diffs are not rendered by default.
