OLAC Record oai:lindat.mff.cuni.cz:11372/LRT-2610 |
Metadata | ||
Title: | ParaCrawl Corpus version 1.0 | |
Bibliographic Citation: | http://hdl.handle.net/11372/LRT-2610 | |
Creator: | Koehn, Philipp | |
Heafield, Kenneth | ||
Forcada, Mikel L. | ||
Esplà-Gomis, Miquel | ||
Ortiz-Rojas, Sergio | ||
Sánchez, Gema Ramírez | ||
Cartagena, Víctor M. Sánchez | ||
Haddow, Barry | ||
Bañón, Marta | ||
Střelec, Marek | ||
Samiotou, Anna | ||
Kamran, Amir | ||
Date (W3CDTF): | 2018-02-12T07:41:46Z | |
Date Available: | 2018-02-12T07:41:46Z | |
Description: | The January 2018 release of ParaCrawl is the first version of the corpus. It contains parallel corpora for 11 languages paired with English, crawled from a large number of websites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a new crawl that has much higher coverage of the selected websites than CommonCrawl. Since the data is fairly raw, it is released with two quality metrics that can be used for corpus filtering. An official "clean" version of each corpus uses one of the metrics. For more details and to download the raw data, please visit: http://paracrawl.eu/releases.html | |
Identifier (URI): | http://hdl.handle.net/11372/LRT-2610 | |
Language: | English | |
German | ||
French | ||
Spanish | ||
Italian | ||
Portuguese | ||
Dutch | ||
Polish | ||
Czech | ||
Romanian | ||
Finnish | ||
Latvian | ||
Russian | ||
Estonian | ||
Language (ISO639): | eng | |
deu | ||
fra | ||
spa | ||
ita | ||
por | ||
nld | ||
pol | ||
ces | ||
ron | ||
fin | ||
lav | ||
rus | ||
est | ||
Publisher: | ParaCrawl | |
Rights: | Public Domain Dedication (CC Zero) | |
http://creativecommons.org/publicdomain/zero/1.0/ | ||
Subject: | ParaCrawl | |
parallel corpus | ||
CommonCrawl | ||
machine translation | ||
text corpora | ||
Type: | corpus | |
Type (DCMI): | Text | |
Type (OLAC): | primary_text | |
OLAC Info | ||
Archive: | LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University | |
Description: | http://www.language-archives.org/archive/lindat.mff.cuni.cz | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info | ||
OaiIdentifier: | oai:lindat.mff.cuni.cz:11372/LRT-2610 | |
DateStamp: | 2021-06-29 | |
GetRecord: | OAI-PMH request for simple DC format | |
Search Info | ||
Citation: | Koehn, Philipp; Heafield, Kenneth; Forcada, Mikel L.; Esplà-Gomis, Miquel; Ortiz-Rojas, Sergio; Sánchez, Gema Ramírez; Cartagena, Víctor M. Sánchez; Haddow, Barry; Bañón, Marta; Střelec, Marek; Samiotou, Anna; Kamran, Amir. 2018. ParaCrawl. | |
Terms: | area_Europe country_CZ country_DE country_ES country_FI country_FR country_GB country_IT country_NL country_PL country_PT country_RO country_RU dcmi_Text iso639_ces iso639_deu iso639_eng iso639_est iso639_fin iso639_fra iso639_ita iso639_lav iso639_nld iso639_pol iso639_por iso639_ron iso639_rus iso639_spa olac_primary_text |