We present a large Spanish-Catalan parallel corpus extracted from ten years of the paper edition of a bilingual Catalan newspaper. The resulting corpus of 7.5 M parallel sentences (around 180 M words per language) is useful for many natural language processing applications. We report excellent results when building a statistical machine translation system trained on this parallel corpus. The Spanish-Catalan corpus is partially available via ELDA (Evaluations and Language Resources Distribution Agency) under catalog number ELRA-W0053.
Published: 2015-02-10
Classification: Catalan-Spanish parallel corpus, machine translation
@article{cai2807,
     author = {Marta R. Costa-Juss\`a and Jos\'e A. R. Fonollosa and Jos\'e B. Mari\~no and Marc Poch and Mireia Farr\'us},
     title = {A Large Spanish-Catalan Parallel Corpus Release for Machine Translation},
     journal = {Computing and Informatics},
     volume = {33},
     number = {3},
     year = {2015},
     language = {en},
     url = {http://dml.mathdoc.fr/item/cai2807}
}
Marta R. Costa-Jussà (TALP Research Center, Universitat Politècnica de Catalunya); José A. R. Fonollosa (TALP Research Center, Universitat Politècnica de Catalunya); José B. Mariño (TALP Research Center, Universitat Politècnica de Catalunya); Marc Poch (Institut Universitari de Lingüística Aplicada (IULA), Universitat Pompeu Fabra); Mireia Farrús (N-RAS Research Center, Universitat Pompeu Fabra). A Large Spanish-Catalan Parallel Corpus Release for Machine Translation. Computing and Informatics, Vol. 33 (2015) no. 3. http://gdmltest.u-ga.fr/item/cai2807/