OLAC Record: ARL Urdu Speech Database, Training Data

OLAC Record
oai:www.ldc.upenn.edu:LDC2007S03

Metadata

Title: ARL Urdu Speech Database, Training Data

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Appen Pty Ltd. ARL Urdu Speech Database, Training Data LDC2007S03. Web Download. Philadelphia: Linguistic Data Consortium, 2007

Contributor: Appen Pty Ltd

Date (W3CDTF): 2007

Date Issued (W3CDTF): 2007-02-20

Description: *Introduction* ARL Urdu Speech Database, Training Data is a collection of recorded speech with transcripts from 200 adult native Urdu speakers from Pakistan and Northern India. It was developed in 2006 by Appen Pty Ltd, Sydney, Australia. The U.S. Army Research Laboratory (ARL) provided this corpus to the Linguistic Data Consortium for distribution.

Urdu is an Indo-Aryan language spoken throughout South Asia that developed under the Mughal Empire and Delhi Sultanate between 1200 AD and 1800 AD. It has Persian, Turkish and Arabic influences, but is in fact a dialect of Hindustani. The word "Urdu" refers to the standardized register of Hindustani, but there are many non-standard idiolects as well. Urdu is the twentieth most spoken language in the world. It is the native language of over 60 million people, the official language of Pakistan, and one of India's national languages. Urdu is also spoken in Afghanistan.

The distribution of speaker dialects in the corpus is as follows:

Accent               Number of Speakers
South Sindh          29
North Sindh          30
South Punjab         27
North Punjab         29
Capital Area         29
North West Regions   30
Baluchistan          26

The database is divided into two parts: a training set containing approximately 80% of the data and a test set comprising 20% of the data. This release consists of approximately 80% of the complete dataset (training and test).

*Data* Each speaker was presented with 400 prompts to read: sentences, place names, and person names. Two microphones set at different distances from the speaker were used for the recordings. The recorded speech was stored in raw format files, with headers stored in separate directories. Each utterance was transcribed in the label file corresponding to its recording. The transcriptions were encoded in UTF-8. Punctuation was omitted and numbers were written out in full.

*Update* Earlier versions were missing the content list file. This is now available as part of the complete download file.
09/14/18 - The test data for this corpus was originally held back and is now available as part of the download. New downloads after the indicated date will contain the full corpus.

*Samples* For an example of the data in this corpus, please listen to the following audio sample (.wav format)
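
As a quick consistency check on the accent distribution given in the description, the per-accent counts can be tallied and compared with the stated total of 200 speakers. This is only an illustrative sketch: the dictionary name and the `share` helper are hypothetical, and the counts are copied from the record.

```python
# Speaker counts per accent, copied from the record's description.
speakers_by_accent = {
    "South Sindh": 29,
    "North Sindh": 30,
    "South Punjab": 27,
    "North Punjab": 29,
    "Capital Area": 29,
    "North West Regions": 30,
    "Baluchistan": 26,
}

# Total number of speakers across all accent groups.
total_speakers = sum(speakers_by_accent.values())


def share(accent: str) -> float:
    """Return the fraction of all speakers that belong to one accent group."""
    return speakers_by_accent[accent] / total_speakers


if __name__ == "__main__":
    print(f"total speakers: {total_speakers}")
    for accent in speakers_by_accent:
        print(f"{accent}: {share(accent):.1%}")
```

The counts add up to the 200 speakers stated in the introduction, so the table and the prose agree.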

Extent: Corpus size: 35254880 KB

Format: Sampling Rate: 22050

Format: Sampling Format: pcm
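
The record states only that the audio is headerless ("raw") PCM sampled at 22050 Hz. A minimal sketch of loading such a file follows. It assumes 16-bit signed, little-endian, single-channel samples, none of which is confirmed by this record; check the separately stored header files before relying on it. The function names are hypothetical.

```python
import array
import sys

SAMPLE_RATE_HZ = 22050  # from the record's Format field


def read_raw_pcm(path: str, byteorder: str = "little") -> array.array:
    """Load a headerless PCM file as 16-bit signed samples.

    ASSUMPTION: 16-bit signed, mono samples; the record does not state
    the sample width, byte order, or channel count.
    """
    samples = array.array("h")  # "h" = signed 16-bit integer
    with open(path, "rb") as handle:
        data = handle.read()
    # Drop a trailing partial sample, if any, so frombytes() accepts the data.
    data = data[: len(data) - (len(data) % samples.itemsize)]
    samples.frombytes(data)
    # array uses the machine's native byte order; swap if the file differs.
    if sys.byteorder != byteorder:
        samples.byteswap()
    return samples


def duration_seconds(num_samples: int, sample_rate: int = SAMPLE_RATE_HZ) -> float:
    """Duration of a mono recording with the given number of samples."""
    return num_samples / sample_rate
```

If the header files indicate a different sample width or byte order, the type code and the `byteorder` argument would need to change accordingly.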

Identifier: LDC2007S03

https://catalog.ldc.upenn.edu/LDC2007S03

ISBN: 1-58563-412-3

ISLRN: 513-040-223-174-0

DOI: 10.35111/6z57-s580

Language: Urdu

Language (ISO639): urd

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2007S03

Rights Holder: Portions © 2006 U.S. Army Research Laboratory, © 2007 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2007S03

DateStamp: 2025-12-17

GetRecord: OAI-PMH request for simple DC format
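
The GetRecord entries above refer to OAI-PMH requests. As a sketch of how such a request URL is put together, the snippet below combines the protocol's `verb`, `identifier`, and `metadataPrefix` parameters. The base URL is a placeholder because the repository endpoint is not given in this record. `oai_dc` is the standard OAI-PMH prefix for simple Dublin Core; `olac` is assumed here to be the prefix for the OLAC format.

```python
from urllib.parse import urlencode

# ASSUMPTION: placeholder endpoint; the real base URL is not in this record.
EXAMPLE_BASE_URL = "https://example.org/oai"

# Identifier taken from the record's OaiIdentifier field.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2007S03"


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # Simple Dublin Core and (assumed) OLAC variants of the same request.
    print(get_record_url(EXAMPLE_BASE_URL, OAI_IDENTIFIER, "oai_dc"))
    print(get_record_url(EXAMPLE_BASE_URL, OAI_IDENTIFIER, "olac"))
```

The response to such a request is an XML document, which is what the "Pre-generated XML file" entry appears to offer without a live query.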

Search Info
Citation: Appen Pty Ltd. 2007. Linguistic Data Consortium.
Terms: area_Asia country_PK dcmi_Sound dcmi_Text iso639_urd olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2007S03
Up-to-date as of: Wed Jul 8 7:30:27 EDT 2026
