OLAC Record: Spanish Gigaword Second Edition

OLAC Record
oai:www.ldc.upenn.edu:LDC2009T21

Metadata

Title: Spanish Gigaword Second Edition

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Mendonça, Ângelo, David Graff, and Denise DiPersio. Spanish Gigaword Second Edition LDC2009T21. Web Download. Philadelphia: Linguistic Data Consortium, 2009

Contributor: Mendonça, Ângelo

Contributor: Graff, David

Contributor: DiPersio, Denise

Date (W3CDTF): 2009

Date Issued (W3CDTF): 2009-07-17

Description: *Introduction*

Spanish Gigaword Second Edition is a comprehensive archive of newswire text data acquired over several years by LDC. This second edition updates Spanish Gigaword First Edition (LDC2006T12) and adds data collected from January 1, 2006 through December 31, 2008.

The three distinct international sources of Spanish newswire in this edition, and the time spans of collection covered for each, are as follows:

* Agence France-Presse, Spanish Service (afp_spa) May 1994 - Dec 2008
* Associated Press Worldstream, Spanish (apw_spa) Nov 1993 - Dec 2008
* Xinhua News Agency, Spanish Service (xin_spa) Sep 2001 - Dec 2008

The seven-character codes in parentheses above consist of a three-character source name abbreviation and the three-character language code (spa), separated by an underscore (_) character. The three-letter language code conforms to LDC's internal convention based on the ISO 639-3 standard. These codes are used in the directory names where the data files are found and in the prefix that appears at the beginning of every data file name. They are also used (in all UPPER CASE) as the initial portion of the DOC id strings that uniquely identify each news story.

*Data*

The overall totals for each source are summarized below. Note that the Totl-MB numbers show the amount of data obtained when the files are uncompressed (i.e. approximately 7 gigabytes, total); the Gzip-MB column shows totals for compressed file sizes; and the K-wrds numbers are simply the number of whitespace-separated tokens (of all types) after all SGML tags are eliminated.

Source    #Files  Gzip-MB  Totl-MB   K-wrds    #DOCs
AFP_SPA      175     1182     3512   506562  1748787
APW_SPA      180      886     2721   402718  1244811
XIN_SPA       88      405     1238   182543   734356
TOTAL        443     2453     7471  1091823  3727954

The following tables present Text-MB, K-wrds and #DOCs broken down by source and DOC type. Text-MB represents the total number of characters (including whitespace) after SGML tags are eliminated.

             Text-MB  K-wrds    #DOCs
type=advis:
AFP_SPA          144   20520    45446
APW_SPA           41    6173    11112
XIN_SPA            0       0        0
TOTAL            185   26693    56558
type=multi:
AFP_SPA           84   12711    15346
APW_SPA          351   55758   107224
XIN_SPA          189   29970    56372
TOTAL            624   98439   178942
type=other:
AFP_SPA          275   38665   160815
APW_SPA          296   40517   162448
XIN_SPA           44    6376    50168
TOTAL            615   85558   373431
type=story:
AFP_SPA         2771  434677  1527180
APW_SPA         1875  300274   964027
XIN_SPA          911  146199   627816
TOTAL           5557  881150  3119023

*Samples*

Please view this sample.
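
The description above defines the word counts as whitespace-separated tokens left after SGML tags are eliminated, and says the source codes appear in upper case at the start of each DOC id. The Python sketch below illustrates both conventions; it is not LDC's own tooling. It assumes that any `<...>` span is a tag, which a simple regular expression can strip. The function names and the sample document, including its id value, are hypothetical and used only for illustration.

```python
import re

# Assumption: every "<...>" span is an SGML tag. Real corpus files may need
# a proper SGML parser; this regex is only an approximation.
TAG_RE = re.compile(r"<[^>]+>")


def count_tokens(sgml_text: str) -> int:
    """Count whitespace-separated tokens after removing SGML tags."""
    # Replace each tag with a space so adjacent words are not glued together.
    return len(TAG_RE.sub(" ", sgml_text).split())


def doc_id_prefix(source_code: str) -> str:
    """Upper-case a source code such as 'afp_spa' to get the DOC id prefix."""
    return source_code.upper()


# Hypothetical sample; the id value is invented, not a real DOC id.
sample = '<DOC id="AFP_SPA_HYPOTHETICAL" type="story">Hola mundo desde Madrid.</DOC>'

print(count_tokens(sample))
print(doc_id_prefix("afp_spa"))
```

Assuming the "K" in K-wrds denotes thousands, a count produced this way would need to be divided by 1000 before comparing it with the K-wrds columns.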

Extent: Corpus size: 2516582 KB

Identifier: LDC2009T21

Identifier: https://catalog.ldc.upenn.edu/LDC2009T21

ISBN: 1-58563-518-9

ISLRN: 202-219-770-615-1

DOI: 10.35111/hwap-pf44

Language: Spanish

Language (ISO639): spa

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2009T21

Rights Holder: Portions © 1994-2008 Agence France Presse, © 1993-2008 The Associated Press, © 2001-2008 Xinhua News Agency, © 2006, 2009 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2009T21

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Mendonça, Ângelo; Graff, David; DiPersio, Denise. 2009. Linguistic Data Consortium.
Terms: area_Europe country_ES dcmi_Text iso639_spa olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2009T21
Up-to-date as of: Wed Oct 29 7:01:08 EDT 2025
