UM-NEWTRANX-CORPUS Parallel Corpus

Language   Tokens        Average Sentence Length (tokens)   Vocabulary Size
English    55,734,538    18.54                              610,290
Chinese    46,059,263    15.32                              294,938

The sentence length distribution of the corpus is shown below.

UM-NEWTRANX-CORPUS Sentence Length Distribution

Language   1<=Len<=10   10<Len<=30   30<Len<=50   50<Len<=80   Len>80
English    965,922      1,531,597    414,887      80,590       13,676
Chinese    1,259,642    1,411,288    299,422      33,635       2,685
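
The two tables can be cross-checked against each other. The sketch below assumes that the length buckets cover every sentence in the corpus and that the average length is measured in tokens per sentence; neither is stated explicitly on this page. The variable and function names are illustrative only.

```python
# Sentence counts per length bucket, copied from the distribution table.
# Bucket order: 1<=Len<=10, 10<Len<=30, 30<Len<=50, 50<Len<=80, Len>80
sentence_counts = {
    "English": [965_922, 1_531_597, 414_887, 80_590, 13_676],
    "Chinese": [1_259_642, 1_411_288, 299_422, 33_635, 2_685],
}

# Token totals, copied from the corpus statistics table.
token_counts = {"English": 55_734_538, "Chinese": 46_059_263}


def summarize(language):
    """Return (total sentences, average tokens per sentence) for one language."""
    total = sum(sentence_counts[language])
    # Assumption: average length = total tokens / total sentences.
    average = token_counts[language] / total
    return total, round(average, 2)


for language in sentence_counts:
    total, average = summarize(language)
    print(f"{language}: {total:,} sentences, {average} tokens per sentence")
```

Under these assumptions both sides sum to 3,006,672 sentences, as expected for a parallel corpus, and the computed averages (18.54 and 15.32) match the reported values.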

Corpus acquisition

You should acknowledge NEWTRANX Company (www.newtranx.com) and include the appropriate citation in any publication or presentation containing research results obtained in whole or in part through the use of the UM-Corpus or UM-NEWTRANX-CORPUS. The following reference should be cited: [1].