OLAC Record
oai:www.ldc.upenn.edu:LDC2010T15

Metadata
Title:Message Understanding Conference 7 Timed (MUC7_T)
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Tomanek, Katrin, and Udo Hahn. Message Understanding Conference 7 Timed (MUC7_T) LDC2010T15. Web Download. Philadelphia: Linguistic Data Consortium, 2010
Contributor:Tomanek, Katrin
Hahn, Udo
Date (W3CDTF):2010
Date Issued (W3CDTF):2010-09-17
Description:*Introduction*
Message Understanding Conference 7 Timed (MUC7_T), Linguistic Data Consortium (LDC) catalog number LDC2010T15 and ISBN 1-58563-560-X, was developed by researchers at Jena University Language & Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Germany. It is a re-annotation of a portion of the MUC7 corpus (Linguistic Data Consortium, LDC2001T02), which consists of New York Times news stories annotated for use in the Message Understanding Conference 7 (MUC7) evaluation. The series of MUC evaluations in the 1990s focused on emerging information extraction technologies. Further information about NIST's MUC7 evaluation can be found on the MUC project website.
MUC7_T consists of 100 articles from the MUC7 corpus training set, reannotated for named entities (persons, locations and organizations), with a time stamp recording the time measured for each linguistic decision. The corpus was developed for two principal purposes: to evaluate selective sampling strategies, such as Active Learning, and to build predictive models of annotation cost. The annotation was performed by two advanced students of linguistics with good English language skills, who followed the original guidelines of the MUC7 named entity task (available in the online documentation for the MUC7 corpus).
*Data*
The data is stored in XML format. There is an element anno_example for each annotation example, with the original MUC7 document as its text context. The MUC7 document was tokenized using the Stanford Tokenizer, with white spaces marking token boundaries. The tokenizer is part of the Stanford Parser package, which can be obtained from The Stanford Natural Language Processing Group. The following attributes are used for the element anno_example:
  anno_time: The time it took to annotate the annotation unit of this annotation example (time in milliseconds).
  anno_unit_tokens: All tokens of the annotation unit.
  anno_unit_offset: Offsets for the tokens of the annotation unit relative to all tokens in the annotation example.
  anno_unit_labels: Labels for the tokens of the annotation unit (these labels are taken from MUC7).
  doc_id: ID of the document of the annotation example.
  sent_id: ID of the sentence of the annotation example.
  anno_unit_id: ID of the unit of the annotation example.
  muc7_org_filename: The name of the original MUC7 document from which this annotation example is taken.
*Directory Structure*
The directory structure of the corpus is as follows:
  data: This subdirectory contains the MUC7_T data; the data for annotators A and B are in separate folders. For each annotator, there is a version of MUC7_T with CNP-level and with sentence-level annotations.
  docs: This subdirectory contains detailed documentation as well as publications describing applications of MUC7_T. There is also a small JavaDoc for the Java tools (see the tools subdirectory below).
  dtd: This subdirectory contains the Document Type Definition (DTD) for the data files.
  tools: This subdirectory contains a small Java API which allows users to read the MUC7_T XML data so that each annotation example is represented by a Java object. The API includes the source code and a jar package. The source code has been tested with Java 1.5 and Java 1.6.
*Updates*
Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus at LDC2010T15.
*Samples*
The following XML excerpts are representative of the data in this corpus: CNP, Sentence Level
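As a rough illustration of the anno_example attributes described above, the following Python sketch reads annotation examples and sums the recorded annotation time. It is not part of the corpus distribution (which ships a Java API in the tools subdirectory). The root element name, the whitespace-separated format of the token and label attributes, and every value in the sample are assumptions made for illustration; the DTD in the dtd subdirectory is the authoritative description of the format.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: the <corpus> root element, the attribute value
# formats and all values are invented for illustration only.
SAMPLE = """<corpus>
  <anno_example anno_time="3200"
                anno_unit_tokens="New York Times"
                anno_unit_labels="ORG ORG ORG"
                doc_id="1" sent_id="2" anno_unit_id="3"
                muc7_org_filename="example.sgm">The paper said the New York Times ...</anno_example>
</corpus>"""


def read_examples(xml_text):
    """Return one dict per anno_example element found in xml_text."""
    root = ET.fromstring(xml_text)
    examples = []
    for el in root.iter("anno_example"):
        examples.append({
            # anno_time is documented as milliseconds.
            "anno_time_ms": int(el.get("anno_time")),
            # Assumption: tokens and labels are whitespace-separated.
            "tokens": el.get("anno_unit_tokens").split(),
            "labels": el.get("anno_unit_labels").split(),
            "doc_id": el.get("doc_id"),
            "sent_id": el.get("sent_id"),
            "anno_unit_id": el.get("anno_unit_id"),
            "source_file": el.get("muc7_org_filename"),
        })
    return examples


examples = read_examples(SAMPLE)
# Total annotation time in seconds, e.g. as input to a cost model.
total_seconds = sum(e["anno_time_ms"] for e in examples) / 1000.0
```

For a real data file, ET.parse(path).getroot() could replace ET.fromstring, once the actual element and attribute layout has been checked against the DTD.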
Extent:Corpus size: 142336 KB
Identifier:LDC2010T15
https://catalog.ldc.upenn.edu/LDC2010T15
ISBN: 1-58563-560-X
ISLRN: 895-206-642-518-8
DOI: 10.35111/m7m6-db83
Language:English
Language (ISO639):eng
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2010T15
Rights Holder: Portions © 1996 New York Times, © 2001, 2010 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2010T15
DateStamp:  2021-02-17
GetRecord:  OAI-PMH request for simple DC format
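
The GetRecord entries in this record correspond to OAI-PMH GetRecord requests. A minimal sketch of how such a request URL could be assembled is shown below. The verb, identifier and metadataPrefix parameters come from the OAI-PMH protocol; the base URL is a placeholder, not the real endpoint, and the "olac" prefix is assumed to be the one the repository advertises for OLAC-format records.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Placeholder only: substitute the repository's actual OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# "oai_dc" (simple Dublin Core) is required by OAI-PMH; "olac" is assumed here.
dc_url = get_record_url(BASE_URL, "oai:www.ldc.upenn.edu:LDC2010T15", "oai_dc")
olac_url = get_record_url(BASE_URL, "oai:www.ldc.upenn.edu:LDC2010T15", "olac")

# Round-trip check: the query string decodes back to the original parameters.
params = parse_qs(urlsplit(dc_url).query)
```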

Search Info

Citation: Tomanek, Katrin; Hahn, Udo. 2010. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2010T15
Up-to-date as of: Tue May 7 7:25:03 EDT 2024