OLAC Record: Hispanic-English Database

OLAC Record
oai:www.ldc.upenn.edu:LDC2014S05

Metadata

Title: Hispanic-English Database

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Byrne, William, et al. Hispanic-English Database LDC2014S05. Web Download. Philadelphia: Linguistic Data Consortium, 2014

Contributor: Byrne, William

Knodt, Eva

Bernstein, Jared

Emami, Farzhad

Date (W3CDTF): 2014

Date Issued (W3CDTF): 2014-05-15

Description: *Introduction*

Hispanic-English Database contains approximately 30 hours of English and Spanish conversational and read speech with transcripts (24 hours) and metadata collected from 22 non-native English speakers between 1996 and 1998. The corpus was developed by Entropic Research Laboratory, Inc., a developer of speech recognition and speech synthesis software toolkits that was acquired by Microsoft in 1999.

Participants were adult native speakers of Spanish as spoken in Central America and South America who resided in the Palo Alto, California area, had lived in the United States for at least one year and demonstrated a basic ability to understand, read and speak English. They read a total of 2200 sentences: each of the 22 speakers read 50 sentences in Spanish and 50 in English. The Spanish sentence prompts were a subset of the materials in LATINO-40 Spanish Read News, and the English sentence prompts were taken from the TIMIT database. Conversations were task-oriented, drawing on exercises similar to those used in English as a second language instruction and designed to engage the speakers in collaborative, problem-solving activities.

*Data*

Read speech was recorded on two wideband channels with a Shure SM10A head-mounted microphone in a quiet laboratory environment. The conversational speech was simultaneously recorded on four channels, two of which were used to place phone calls to each subject in two separate offices and to record the incoming speech of the two channels into separate files.

The audio was originally saved in the Entropic Audio (ESPS) format using a 16 kHz sampling rate and 16-bit samples. Audio files were converted from the ESPS format to flac compressed .wav files. ESPS headers were removed and are presented in this release as *.hdr files that include demographic and technical data.

Transcripts were developed with the Entropic Annotator tool and are time-aligned with speaker turns. The transcription conventions were based on those used in the LDC Switchboard and CALLHOME collections. Transcript files are denoted with a .lab extension.

Data files and their corresponding label files are stored in subdirectories named using a speaker-pair id and session number. The first three letters identify the speaker on channel A. The last three letters identify the speaker on channel B. Wideband audio files contain *.wb.flac in their file name, and narrow band audio files are denoted with *.nb.flac in the file name.

*Samples*

Please view these samples:
* Read Speech
* Conversational Speech
* Transcripts

*Updates*

None at this time.
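The naming conventions described above can be sketched in a few lines of Python. This is only an illustration: the file suffixes (.wb.flac, .nb.flac, .lab, .hdr) come from the description, but the exact subdirectory pattern is not given in this record, so the sketch assumes a six-letter speaker-pair id followed by an optional separator and a session number. The directory name `abcdef_1` and the function names are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Suffix-to-kind mapping taken from the corpus description.
SUFFIX_KINDS = {
    ".wb.flac": "wideband audio",
    ".nb.flac": "narrowband audio",
    ".lab": "transcript",
    ".hdr": "ESPS header",
}

# ASSUMPTION: a subdirectory name is a six-letter speaker-pair id,
# an optional separator, then the session number. The record does not
# state the exact pattern, so adjust this to the real release layout.
PAIR_DIR = re.compile(r"^([A-Za-z]{3})([A-Za-z]{3})[^A-Za-z0-9]*(\d+)$")


def classify(filename):
    """Return the kind of corpus file based on its suffix, or None."""
    name = PurePosixPath(filename).name
    for suffix, kind in SUFFIX_KINDS.items():
        if name.endswith(suffix):
            return kind
    return None


def split_pair_dir(dirname):
    """Split a speaker-pair directory name into channel A, channel B, session."""
    match = PAIR_DIR.match(dirname)
    if match is None:
        return None
    return {
        "channel_a": match.group(1),  # first three letters: speaker on channel A
        "channel_b": match.group(2),  # last three letters: speaker on channel B
        "session": int(match.group(3)),
    }
```

For example, under these assumptions `split_pair_dir("abcdef_1")` would report speaker `abc` on channel A and speaker `def` on channel B for session 1.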

Extent: Corpus size: 3005258 KB

Format: Sampling Rate: 8000

Sampling Format: pcm

Identifier: LDC2014S05

https://catalog.ldc.upenn.edu/LDC2014S05

ISBN: 1-58563-633-9

ISLRN: 838-711-181-871-3

DOI: 10.35111/mss2-fb97

Language: Spanish

English

Language (ISO639): spa

eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2014S05

Rights Holder: Portions © 2014 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2014S05

DateStamp: 2021-11-16

GetRecord: OAI-PMH request for simple DC format
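The GetRecord entries listed in this record correspond to OAI-PMH requests, which take a `verb`, an `identifier` and a `metadataPrefix` as query parameters. The following minimal sketch builds such request URLs for this record without sending them. The base URL is a placeholder, not the archive's actual endpoint, and `olac` as the metadata prefix for the OLAC format is an assumption; `oai_dc` is the prefix the OAI-PMH specification defines for simple Dublin Core.

```python
from urllib.parse import urlencode

# ASSUMPTION: placeholder endpoint; replace with the repository's real
# OAI-PMH base URL, which this record does not state.
BASE_URL = "https://example.org/oai"

RECORD_ID = "oai:www.ldc.upenn.edu:LDC2014S05"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL (no network access)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# "olac" is assumed to be the prefix for the OLAC format;
# "oai_dc" is the standard prefix for simple Dublin Core.
olac_url = get_record_url(RECORD_ID, "olac")
dc_url = get_record_url(RECORD_ID, "oai_dc")
```

The URLs are only constructed here; fetching them would require the repository's real base URL.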

Search Info
Citation: Byrne, William; Knodt, Eva; Bernstein, Jared; Emami, Farzhad. 2014. Linguistic Data Consortium.
Terms: area_Europe country_ES country_GB dcmi_Sound iso639_eng iso639_spa olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2014S05
Up-to-date as of: Wed Oct 29 7:01:27 EDT 2025
