OLAC Record: West Point Arabic Speech

OLAC Record
oai:www.ldc.upenn.edu:LDC2002S02

Metadata

Title: West Point Arabic Speech

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: LaRocca, Stephen A., and Rajaa Chouairi. West Point Arabic Speech LDC2002S02. Web Download. Philadelphia: Linguistic Data Consortium, 2002

Contributor: LaRocca, Stephen A.

Contributor: Chouairi, Rajaa

Date (W3CDTF): 2002

Date Issued (W3CDTF): 2002-08-20

Description:

*Introduction*

West Point Arabic Speech was produced by the Linguistic Data Consortium (LDC), catalog number LDC2002S02 and ISBN 1-58563-199-x. It contains speech data that was collected and processed by members of the Department of Foreign Languages at the United States Military Academy at West Point and the Center for Technology Enhanced Language Learning (CTELL) as part of an effort called "Project Santiago." The original purpose of this corpus was to train acoustic models for automatic speech recognition that could be used as an aid in teaching Arabic to West Point cadets.

*Data*

The corpus consists of 8,516 speech files, totaling 1.7 gigabytes or 11.42 hours of speech data. Each speech file represents one person reciting one prompt from one of four prompt scripts. The utterances were recorded using a Shure SM10A microphone and a RANE Model MS1 pre-amplifier. The files were recorded as 16-bit PCM low-byte-first ("little-endian") raw audio files, with a sampling rate of 22.05 kHz. They were then converted to NIST SPHERE format. Approximately 7,200 of the recordings are from native informants and approximately 1,200 are from non-native informants.

The following tables show the breakdown of corpus content in terms of male, female, native and non-native speakers.

Number of speakers

              male    female     total
native:         41        34        75
non-native:     25        10        35
totals:         66        44       110

Hours of data

              male    female     total
native:        6.0       4.4      10.4
non-native:   0.74      0.28      1.02
totals:       6.74      4.68     11.42

Megabytes of data

              male    female     total
native:        918       667      1585
non-native:  111.9      42.8     154.7
totals:     1029.9     709.8    1739.7

Number of speech files

              male    female     total
native:       4107      3163      7270
non-native:    883       363      1246
totals:       4990      3526      8516

Some of the recording sessions include a handful of utterances that were cut short due to pronunciation mistakes or unexpected interruptions (e.g. phones ringing or doors slamming). These partial utterances have been retained in the waveform directories and are distinguished from the full-sentence recordings by a trailing "-u" in the filename, before the extension (e.g. "s1_080-u.sph" instead of "s1_080.sph"). The above tables describe all data; both the complete and partial utterances are accounted for. 168 of the 8,516 speech files are partial utterances, and the remaining 8,348 are complete.

*Updates*

There are no updates at this time.
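The "-u" naming convention described above can be applied mechanically when preparing file lists. The following is a minimal Python sketch, not part of the corpus distribution: the function names `is_partial_utterance` and `split_by_completeness` are hypothetical, and the only rule assumed is the one stated in the description (a trailing "-u" before the extension marks a partial utterance).

```python
from pathlib import PurePosixPath


def is_partial_utterance(filename: str) -> bool:
    """Return True if the file stem ends in "-u" (a cut-short recording)."""
    # PurePosixPath(...).stem drops the final extension, e.g. ".sph".
    return PurePosixPath(filename).stem.endswith("-u")


def split_by_completeness(filenames):
    """Split an iterable of filenames into (complete, partial) lists."""
    complete, partial = [], []
    for name in filenames:
        if is_partial_utterance(name):
            partial.append(name)
        else:
            complete.append(name)
    return complete, partial


# The two example names come from the record's description.
complete, partial = split_by_completeness(["s1_080.sph", "s1_080-u.sph"])
```

Run over the full corpus, a split like this would be expected to yield the 8,348 complete and 168 partial files reported in the description, assuming every file follows the stated convention.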

Format: Sampling Rate: 22050

Format: Sampling Format: 1-channel pcm
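As a rough cross-check of the format fields against the corpus size, the raw audio payload can be estimated from the sampling rate, the 16-bit sample width given in the description, and the single channel. This is a back-of-the-envelope sketch in Python; the helper name `pcm_bytes` is hypothetical, and the estimate ignores file headers and the rounding of the reported hours.

```python
SAMPLE_RATE_HZ = 22_050   # from "Sampling Rate: 22050"
BYTES_PER_SAMPLE = 2      # 16-bit PCM, per the description
CHANNELS = 1              # from "1-channel pcm"

# Uncompressed data rate: 22050 samples/s * 2 bytes * 1 channel.
BYTES_PER_SECOND = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS  # 44100


def pcm_bytes(duration_hours: float) -> int:
    """Estimate the raw PCM payload size in bytes for a given duration."""
    seconds = duration_hours * 3600
    return round(seconds * BYTES_PER_SECOND)


total_bytes = pcm_bytes(11.42)       # total hours reported for the corpus
total_mib = total_bytes / 2**20      # the same figure in units of 2**20 bytes
```

Under these assumptions the estimate is about 1.81 billion bytes, or roughly 1729 when divided by 2**20, which is in the same range as the 1739.7 megabytes reported in the description. An exact match is not expected.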

Identifier: LDC2002S02

https://catalog.ldc.upenn.edu/LDC2002S02

ISBN: 1-58563-199-x

ISLRN: 223-969-897-944-9

DOI: 10.35111/b12f-w956

Language: Arabic

Language (ISO639): ara

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2002S02

Rights Holder: Portions © 2002 United States Military Academy, © 2002 Trustees of the University of Pennsylvania. The SANTIAGO Arabic corpus was developed at the United States Military Academy. All information contained herein is the sole and exclusive property of the United States Military Academy.

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2002S02

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
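The GetRecord entries refer to OAI-PMH requests for this record. As an illustration only, the Python sketch below builds such a request URL from the OAI identifier. The base URL `https://example.org/oai` is a placeholder, since the actual endpoint is not given in this record, and the function name `get_record_url` is hypothetical. `oai_dc` is the metadata prefix OAI-PMH defines for simple Dublin Core; the prefix `olac` for the OLAC format is an assumption here.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real base URL is not stated in this record.
BASE_URL = "https://example.org/oai"
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2002S02"


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


simple_dc_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")
olac_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")  # assumed prefix
```

`urlencode` percent-encodes the colons in the identifier, so the identifier is safe to pass as a query parameter.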

Search Info
Citation: LaRocca, Stephen A.; Chouairi, Rajaa. 2002. Linguistic Data Consortium.
Terms: dcmi_Sound iso639_ara olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2002S02
Up-to-date as of: Wed Oct 29 7:00:10 EDT 2025
