OLAC Record
oai:www.ldc.upenn.edu:LDC2008T03

Metadata
Title:ACE 2005 English SpatialML Annotations
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Mani, Inderjeet, et al. ACE 2005 English SpatialML Annotations LDC2008T03. Web Download. Philadelphia: Linguistic Data Consortium, 2008
Contributor:Mani, Inderjeet
Hitzeman, Janet
Richer, Justin
Harris, David
Date (W3CDTF):2008
Date Issued (W3CDTF):2008-01-22
Description:*Introduction* The ACE (Automatic Content Extraction) program focuses on developing automatic content extraction technology to support automatic processing of human language in text form. The kind of information recognized and extracted from text includes entities, values, temporal expressions, relations and events.
SpatialML is a mark-up language for representing spatial expressions in natural language documents. SpatialML's focus is primarily on geography and culturally-relevant landmarks, rather than biology, cosmology, geology, or other regions of the spatial language domain. The goal is to allow for potentially better integration of text collections with resources such as databases that provide spatial information about a domain, including gazetteers, physical feature databases and mapping services.
In ACE 2005 English SpatialML Annotations, the authors applied SpatialML tags to the English training data (originally annotated for entities, relations and events) in ACE 2005 Multilingual Training Corpus, LDC2006T06. (NOTE: 2005 ACE training data and evaluation data were distributed as e-corpora (LDC2005E18, LDC2005E23) to participants in the 2005 ACE evaluation. Some of the files in ACE 2005 English SpatialML Annotations may originate from one of those e-corpora, not from LDC2006T06).
The SpatialML annotation scheme is intended to emulate earlier progress on time expressions such as TIMEX2, TimeML and the 2005 ACE guidelines. The main SpatialML tag is the PLACE tag. The central goal of SpatialML is to map PLACE information in text to data from gazetteers and other databases to the extent possible. Therefore, semantic attributes such as country abbreviations, country subdivision and dependent area abbreviations (e.g., US states), and geo-coordinates are used to help establish such a mapping. LINK and PATH tags express relations between places, such as inclusion relations and trajectories of various kinds.
Information in the tag along with the tagged location string should be sufficient to uniquely determine the mapping, when such a mapping is possible. This also means that redundant information is not included in the tag. To the extent possible, SpatialML leverages ISO and other standards towards the goal of making the scheme compatible with existing and future corpora. The SpatialML guidelines are compatible with existing guidelines for spatial annotation and existing corpora within the ACE research program. In particular, the English Annotation Guidelines for Entities (Version 5.6.6 2006.08.01) were exploited, specifically the GPE, Location, and Facility entity tags, and the Physical relation tags, all of which are mapped to SpatialML tags. Ideas were also borrowed from Toponym Resolution Markup Language of Leidner (2006), the research of Schilder et al. (2004) and the annotation scheme in Garbin and Mani (2005). Information recorded in the annotation is compatible with the feature types in the Alexandria Digital Library. This corpus also leverages the integrated gazetteer database (IGDB) of Mardis and Burger (2005). Last but not least, this annotation scheme can be related to the Geography Markup Language (GML) defined by the Open Geospatial Consortium (OGC), as well as Google Earth's Keyhole Markup Language (KML), to express geographical features. SpatialML goes beyond these schemes, however, in terms of providing a richer markup for natural language that includes semantic features and relationships that allow mapping to existing resources such as gazetteers. Such a markup can be useful for (i) disambiguation, (ii) integration with mapping services, and (iii) spatial reasoning. In relation to (iii), it is possible to use spatial reasoning not only for integration with applications, but for better information extraction, e.g., for disambiguating a place name based on the locations of other place names in the document. 
SpatialML goes to some length to represent topological relationships among places, derived from the RCC8 Calculus (Randell et al. 1992, Cohn et al. 1997). Additional information about SpatialML is contained in the paper "SpatialML: Annotation Scheme for Marking Spatial Expressions in Natural Language," which is included in the online documentation for this corpus. Please direct all questions about this corpus to Janet Hitzeman (hitz@mitre.org).
*Samples* For an example of the data in the corpus, please examine this sample.
Extent:Corpus size: 23552 KB
Identifier:LDC2008T03
https://catalog.ldc.upenn.edu/LDC2008T03
ISBN: 1-58563-458-1
ISLRN: 472-226-418-389-7
DOI: 10.35111/0m4d-qr30
Language:English
Language (ISO639):eng
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2008T03
Rights Holder:Portions © 2003 Agence France-Presse, © 2003 The Associated Press, © 2003 Cable News Network, LP, LLLP, © 2007 The MITRE Corporation, © 2003 New York Times, © 2003 Xinhua News Agency, © 2003, 2005, 2006, 2008 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text
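
Illustrative sketch (not part of the LDC record): the Description says that PLACE tags carry semantic attributes, such as country and subdivision abbreviations, that help map a tagged string to a gazetteer entry. The Python fragment below shows one way such inline tags could be read with the standard library. The sample sentence and the attribute names (id, country, state) are assumptions made up for illustration; the actual attribute inventory is defined in the SpatialML guidelines distributed with the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample text; the attribute names are illustrative assumptions,
# not copied from the corpus or from the SpatialML specification.
SAMPLE = (
    '<doc>She flew from <PLACE id="p1" country="US" state="MA">Boston</PLACE> '
    'to <PLACE id="p2" country="FR">Paris</PLACE>.</doc>'
)


def extract_places(xml_text):
    """Return one dict per PLACE element: its text plus all of its attributes."""
    root = ET.fromstring(xml_text)
    places = []
    for elem in root.iter("PLACE"):
        record = {"text": (elem.text or "").strip()}
        record.update(elem.attrib)
        places.append(record)
    return places


places = extract_places(SAMPLE)
for place in places:
    print(place)
```

Keeping each PLACE as a plain dictionary leaves the later gazetteer lookup (not shown here) free to use whichever attributes happen to be present on a given tag.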

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2008T03
DateStamp:  2020-11-30
GetRecord:  OAI-PMH request for simple DC format
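
Illustrative sketch (not part of the record): the GetRecord entries above refer to OAI-PMH requests for this item in the OLAC and simple DC formats. Under OAI-PMH, such a request is an HTTP query with the arguments verb, identifier and metadataPrefix. The Python fragment below builds both request URLs. BASE_URL is a hypothetical placeholder, because the repository endpoint is not stated in this record, and "olac" is assumed to be the metadata prefix used for the OLAC format; "oai_dc" is the standard prefix for simple Dublin Core.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the real OAI-PMH base URL is not given in this record.
BASE_URL = "https://example.org/oai"

OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2008T03"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record and one format."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# "olac" is an assumed prefix for the OLAC format; "oai_dc" is simple Dublin Core.
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")
print(olac_url)
print(dc_url)
```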

Search Info

Citation: Mani, Inderjeet; Hitzeman, Janet; Richer, Justin; Harris, David. 2008. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2008T03
Up-to-date as of: Tue May 7 7:24:53 EDT 2024