OLAC Record
oai:www.ldc.upenn.edu:LDC2000T45

Metadata
Title:Korean Newswire
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Cole, Andy, and Kevin Walker. Korean Newswire LDC2000T45. Web Download. Philadelphia: Linguistic Data Consortium, 2000.
Contributor:Cole, Andy
Walker, Kevin
Date (W3CDTF):2000
Description:*Introduction* This corpus is a collection of Korean Press Agency news articles from June 2, 1994 to March 20, 2000. The collection includes articles from the date ranges listed below. Not all dates in each interval are represented by files or articles:
Year    Date range            Files    Size
1994    Jun. 2 to Dec. 31     87       8.6 MB
1995    Jan. 1 to Dec. 31     179      16.9 MB
1996    Jan. 1 to Mar. 29     83       10.6 MB
1997    Jul. 28 to Dec. 31    245      48.9 MB
1998    Jan. 2 to Dec. 31     285      64.2 MB
1999    Jan. 3 to Dec. 31     216      56.7 MB
2000    Jan. 3 to Mar. 20     56       13.6 MB
Total                         1,151    219.5 MB
*Data* The articles provided here were collected by means of a continuous feed from the news provider over a modem connection. Incoming data from the modem was spooled directly to a "raw collection" file on a daily basis, and the raw files were then processed to produce the format for release by the LDC. There are approximately 143,137 articles in this corpus, and it probably contains duplicate articles.
We have taken steps to remove articles that were corrupted by failures or noise in modem transmission. The kinds of corruption that we were able to eliminate include truncated articles (a valid end-of-article sequence is not observed before the next valid start-of-article sequence) and invalid character codes within the text segment of articles. Some corruption may have occurred that did not produce these symptoms (e.g. service interruptions that might cause partial loss of data within or across articles, or corruptions that garble the content but happen not to produce any invalid character codes). At present we have no means of detecting these more subtle problems in the data, but we expect that they are relatively infrequent.
The format chosen for release consists of SGML tagging (since this gives a fairly simple and self-explanatory presentation of the data) and the KSC-5601 Korean character encoding.
*Updates* There are no updates at this time.
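The two corruption checks described under *Data* (truncated articles and invalid character codes) can be illustrated with a minimal Python sketch. This is not the LDC's actual processing code. The `<DOC>` and `</DOC>` markers are hypothetical stand-ins for the start-of-article and end-of-article sequences, which this record does not name, and the sketch assumes the KSC-5601 text is stored in the EUC-KR byte form that Python's `euc_kr` codec reads.

```python
# Hypothetical article delimiters; the real SGML tag names are not given here.
START = b"<DOC>"
END = b"</DOC>"


def split_articles(raw: bytes):
    """Split a raw file into complete articles and count truncated ones.

    An article is treated as truncated when no end marker appears
    before the next start marker (or before the end of the data).
    """
    complete, truncated = [], 0
    pos = raw.find(START)
    while pos != -1:
        next_start = raw.find(START, pos + len(START))
        end = raw.find(END, pos + len(START))
        if end != -1 and (next_start == -1 or end < next_start):
            complete.append(raw[pos:end + len(END)])
        else:
            truncated += 1
        pos = next_start
    return complete, truncated


def has_valid_text(article: bytes) -> bool:
    """Return True if every byte sequence decodes under the euc_kr codec."""
    try:
        article.decode("euc_kr")
        return True
    except UnicodeDecodeError:
        return False


def clean(raw: bytes):
    """Return (kept_articles, truncated_count, invalid_encoding_count)."""
    complete, truncated = split_articles(raw)
    kept = [a for a in complete if has_valid_text(a)]
    return kept, truncated, len(complete) - len(kept)
```

As the description notes, checks of this kind cannot detect corruption that leaves both the article boundaries and the character codes valid, so a clean result here does not guarantee intact content.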
Extent:Corpus size: 221184 KB
Identifier:LDC2000T45
https://catalog.ldc.upenn.edu/LDC2000T45
ISBN: 1-58563-168-X
ISLRN: 210-777-697-418-7
DOI: 10.35111/4wep-9z24
Language:Korean
Language (ISO639):kor
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2000T45
Rights Holder:Portions Copyright 1994-2000, Korean Press Agency, All Rights Reserved
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2000T45
DateStamp:  2020-11-30
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: Cole, Andy; Walker, Kevin. 2000. Linguistic Data Consortium.
Terms: area_Asia country_KR dcmi_Text iso639_kor olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2000T45
Up-to-date as of: Thu Oct 24 7:29:27 EDT 2024