OLAC Record oai:www.ldc.upenn.edu:LDC94T5 |
Metadata | ||
Title: | ECI Multilingual Text | |
Access Rights: | Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining | |
Bibliographic Citation: | Linguistic Data Consortium. ECI Multilingual Text LDC94T5. Web Download. Philadelphia: Linguistic Data Consortium, 1994 | |
Contributor: | Linguistic Data Consortium | |
Date (W3CDTF): | 1994 | |
Description: | The first release of the European Corpus Initiative, the Multilingual Corpus 1 (ECI/MCI), has 46 subcorpora in 27 (mainly European) languages. The total size of these is roughly 92 million (lexical) words. The corpora are marked up using TEI P2 conformant SGML (to varying levels of detail), with easy access to the source text without markup. Twelve of the component corpora are multilingual parallel corpora, each with two to nine sub-corpora. All the alphabetic corpora (there is some Japanese and Chinese) are encoded in the ISO LATIN family of 8-bit character sets (ISO 8859-1, -5 and -7). The CD-ROM is in High Sierra format (ISO 9660), readable on at least UNIX, MS-DOS and Apple systems. The amount of material per language varies from about 36 million words (German) to about 5 thousand words (Bulgarian). The majority of sources are journalistic in nature (newspapers, magazines, broadcasts); additional sources include dictionaries (Albanian, Gaelic, Turkish, Japanese/English), literature, technical reports, and proceedings or publications of international organizations. The following table lists the languages included, the subcorpus numbers for each language (in parentheses) and the amount of data per language in thousands of lexical words.
Language: (Subcorpus #) Kwords per subcorpus; Total
German: (70) 34291, (09) 191, (65) 20, (28) 187, (29) 59, (30) 76, (47) 24, (59) 50, (71) 21, (70A) 999; 35918
French: (31) 4775, (04) 4121, (28) 187, (29) 59, (30) 76, (47) 24, (51) 6, (59) 50, (71) 21, (32) 1667; 10986
Spanish: (31) 4500, (13) 830, (14) 1041, (15) 447, (47) 24, (59) 50, (71) 21, (32) 1667; 8580
English: (31) 4222, (36) 1141, (74) 95, (28) 187, (47) 24, (51) 6, (56) 97, (59) 50, (71) 21, (32) 1667; 7510
Dutch: (03) 5500, (02) 600, (47) 24, (71) 21; 6145
Czech: (44) 4726; 4726
Italian: (11) 3518, (42) 303, (58) 13, (29) 59, (30) 76, (47) 24, (71) 21; 4014
Chinese: (78) 2895; 2895
Greek: (10) 2515, (47) 24, (59) 50, (71) 21; 2610
Norwegian: (41) 2226; 2226
Swedish: (37) 1718; 1718
Serb/Croat/Slov: (24) 700, (56) 289; 989
Tibetan: (76) 834; 834
Portuguese: (60) 675, (47) 24, (71) 21; 720
Malay: (80) 563; 563
Russian: (73) 364; 364
Japanese: (57) 203; 203
Turkish: (20) 173, (20A) 110; 283
Albanian: (82) 205; 205
Gaelic: (55) 141; 141
Estonian: (39) 100; 100
Uzbek: (81) 88; 88
Latin: (74) 75; 75
Danish: (47) 24, (71) 21; 45
Lithuanian: (89) 20; 20
Bulgarian: (84) 5; 5
Total: 91969 | |
Extent: | Corpus size: 373760 KB | |
Identifier: | LDC94T5 | |
https://catalog.ldc.upenn.edu/LDC94T5 | ||
ISBN: 1-58563-033-0 | ||
ISLRN: 511-168-567-582-5 | ||
DOI: 10.35111/h2vd-p896 | ||
Language: | Swedish | |
Slovenian | ||
Russian | ||
Portuguese | ||
Norwegian Bokmål | ||
Norwegian Nynorsk | ||
Lithuanian | ||
Latin | ||
Japanese | ||
Scottish Gaelic | ||
French | ||
Estonian | ||
English | ||
Modern Greek (1453-) | ||
German | ||
Danish | ||
Bulgarian | ||
Tosk Albanian | ||
Spanish | ||
Serbian | ||
Mandarin Chinese | ||
Italian | ||
Dutch | ||
Czech | ||
Croatian | ||
Albanian | ||
Uzbek | ||
Malay (macrolanguage); Malay | ||
Language (ISO639): | swe | |
slv | ||
rus | ||
por | ||
nob | ||
nno | ||
lit | ||
lat | ||
jpn | ||
gla | ||
fra | ||
est | ||
eng | ||
ell | ||
deu | ||
dan | ||
bul | ||
als | ||
spa | ||
srp | ||
cmn | ||
ita | ||
nld | ||
ces | ||
hrv | ||
sqi | ||
uzb | ||
msa | ||
License: | ECI/MCI Agreement: https://catalog.ldc.upenn.edu/license/eci-slash-mci-user-agreement.pdf | |
Le Monde Material User Agreement: https://catalog.ldc.upenn.edu/license/le-monde-material-user-agreement.pdf | ||
Medium: | Distribution: Web Download | |
Publisher: | Linguistic Data Consortium | |
Publisher (URI): | https://www.ldc.upenn.edu | |
Relation (URI): | https://catalog.ldc.upenn.edu/docs/LDC94T5 | |
Rights Holder: | Portions © 1994 Trustees of the University of Pennsylvania | |
Type (DCMI): | Text | |
Type (OLAC): | primary_text | |
OLAC Info |
||
Archive: | The LDC Corpus Catalog | |
Description: | http://www.language-archives.org/archive/www.ldc.upenn.edu | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info |
||
OaiIdentifier: | oai:www.ldc.upenn.edu:LDC94T5 | |
DateStamp: | 2020-11-30 | |
GetRecord: | OAI-PMH request for simple DC format | |
Search Info | ||
Citation: | Linguistic Data Consortium. 1994. Linguistic Data Consortium. | |
Terms: | area_Asia area_Europe country_AL country_BG country_CN country_CZ country_DE country_DK country_ES country_FR country_GB country_GR country_HR country_IT country_JP country_LT country_NL country_PT country_RS country_RU country_SE country_SI country_VA dcmi_Text iso639_als iso639_bul iso639_ces iso639_cmn iso639_dan iso639_deu iso639_ell iso639_eng iso639_est iso639_fra iso639_gla iso639_hrv iso639_ita iso639_jpn iso639_lat iso639_lit iso639_msa iso639_nld iso639_nno iso639_nob iso639_por iso639_rus iso639_slv iso639_spa iso639_sqi iso639_srp iso639_swe iso639_uzb olac_primary_text |