OLAC Record: Spanish Gigaword First Edition

OLAC Record
oai:www.ldc.upenn.edu:LDC2006T12

Metadata

Title: Spanish Gigaword First Edition

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Graff, David. Spanish Gigaword First Edition LDC2006T12. Web Download. Philadelphia: Linguistic Data Consortium, 2006

Contributor: Graff, David

Date (W3CDTF): 2006

Date Issued (W3CDTF): 2006-06-15

Description:

*Introduction*

Spanish Gigaword First Edition is a comprehensive archive of newswire text data acquired over several years by the Linguistic Data Consortium (LDC). It contains over 750 million tokens spanning approximately 2.7 million documents. Although this is the first edition of the Spanish Gigaword Corpus, some of the data included here has been released previously in other LDC corpora.

The three distinct international sources of Spanish newswire in this edition, and the time spans of collection covered for each, are as follows:

* Agence France-Presse, Spanish Service (afp_spa) May 1994 - Dec 2005
* Associated Press Worldstream, Spanish (apw_spa) Nov 1993 - Dec 2005
* Xinhua News Agency, Spanish Service (xin_spa) Sep 2001 - Dec 2005

*Data*

The overall totals for each source are summarized below. The "K-wrds" figures are the number, in thousands, of whitespace-separated tokens of all types after all SGML tags are eliminated.

Source    K-wrds    #DOCs
AFP_SPA   393354  1382679
APW_SPA   263225   886998
XIN_SPA    94459   388561
TOTAL     751038  2658238

Most of the text data (all of AFP_SPA, most of APW_SPA) were received at LDC via dedicated, 24-hour/day electronic feeds (leased phone lines in the case of APW_SPA, a local satellite dish for AFP_SPA). These 24-hour transmission services were all susceptible to "line noise" (occasional corruption of text content), as well as service outages both at the data source and at our receiving computers. Usually, the various disruptions of a newswire data stream would leave tell-tale evidence in the form of byte values falling outside the range of printable ASCII characters, or recognizable patterns of anomalous ASCII strings.

All the XIN_SPA data, and the portion of APW_SPA data beginning with 200406 (June 2004), were received as bulk electronic text archives via internet retrieval. As such, they were not susceptible to modem line-noise or related disruptions, though this does not guarantee that the source data are free of mishaps.

More detailed information can be found in the included documentation.

*Samples*

For an example of the data in this publication, please examine this sample file.

*Updates*

None at this time.
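As a rough illustration of the "K-wrds" definition given in the description, the sketch below counts whitespace-separated tokens after stripping SGML tags and checks that the per-source figures add up to the stated totals. It is not LDC's own counting script: the tag pattern, the function name count_tokens, and the sample document (including its tag names and id) are assumptions made for the example only.

```python
import re

# Assumption: any "<...>" span counts as an SGML tag. The real corpus
# markup may need a more careful pattern.
TAG_RE = re.compile(r"<[^>]*>")


def count_tokens(sgml_text: str) -> int:
    """Count whitespace-separated tokens of all types once tags are removed."""
    without_tags = TAG_RE.sub(" ", sgml_text)
    return len(without_tags.split())


# Hypothetical sample document; the tag names and id are illustrative only.
sample = (
    '<DOC id="EXAMPLE_0001" type="story">\n'
    "<TEXT>\n<P>\nHola mundo , esto es una prueba .\n</P>\n</TEXT>\n"
    "</DOC>\n"
)
sample_tokens = count_tokens(sample)

# Per-source figures copied from the record: (K-wrds, #DOCs).
per_source = {
    "AFP_SPA": (393354, 1382679),
    "APW_SPA": (263225, 886998),
    "XIN_SPA": (94459, 388561),
}
total_kwords = sum(k for k, _ in per_source.values())
total_docs = sum(d for _, d in per_source.values())

print(sample_tokens, total_kwords, total_docs)
```

Punctuation marks separated by whitespace count as tokens here, which matches the "tokens of all types" wording. How the published figures were rounded to thousands is not stated in the record, so the sketch does not attempt it.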

Extent: Corpus size: 1782579 KB

Identifier: LDC2006T12

https://catalog.ldc.upenn.edu/LDC2006T12

ISBN: 1-58563-393-3

ISLRN: 683-827-849-463-2

DOI: 10.35111/4kh9-er55

Language: Spanish

Language (ISO639): spa

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2006T12

Rights Holder: Portions © 1994-2005 Agence France Presse, © 1993-2005 The Associated Press, © 2001-2005 Xinhua News Agency, © 2006 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2006T12

DateStamp: 2021-03-05

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Graff, David. 2006. Linguistic Data Consortium.
Terms: area_Europe country_ES dcmi_Text iso639_spa olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2006T12
Up-to-date as of: Wed Oct 29 7:00:53 EDT 2025
