Language Collections

ACORNS: Acquisition of Communication and Recognition Skills. This project aims at simulating embodied language learning, inspired by the Memory-Prediction theory of intelligence. ACORNS intends to build a full computational implementation of sensory information processing. ACORNS considers linguistic units as emergent patterns. Thus, the research will not only address the issues conventionally investigated in statistical pattern recognition, but also the representations that are formed in memory.
The ANDES corpus aims to make available a number of existing sets of recorded and transcribed language materials from the Andes, particularly on Quechua and Andean Spanish, collected over the last forty years by researchers in Latin America, North America, and Europe. Analytic tools will be made available or, in specific cases, developed to access the material. It will be possible to study corpora comparatively across countries (Bolivia, Ecuador, Peru) or regions (e.g. Cuzco Quechua versus Tarma Quechua in Peru, or Media Lengua from Cotopaxi and Imbabura in Ecuador), but also to study corpora cross-linguistically (e.g. possessive constructions in Quechua and possibly related...
The Akha are one of the numerous hill tribes of Southeast Asia. They came to Southeast Asia from the border area of Burma and Yunnan and are currently settled in northern Thailand, Laos and Vietnam. The Akha language belongs to the Sino-Tibetan language family and has only an oral tradition, with no written records. The Akha collection consists of audio recordings taken from cassettes and reel-to-reel tapes, as well as word lists, narratives and a multitude of photos of the work on the Akha language from the estate of Friedhelm Scholz. The latter are accessible in the Heidelberg University Archives (https://www.uni-heidelberg.de/uniarchiv/). Friedhelm Scholz studied Anthropology and...
Lokono is a critically endangered Northern Arawakan language spoken in the peri-coastal areas of the Guianas (Guyana, Suriname, French Guiana). Today, in every Lokono village there remains only a small number of elderly native speakers.
The Auslan and Australian English Corpus is the first bilingual, multimodal documentation of a deaf signed language (Auslan, the language of the Australian deaf community) and its ambient spoken language (Australian English). It aims to facilitate the direct comparison of face-to-face, multimodal talk produced by ten deaf signers and ten hearing speakers from the same city (Melbourne). Documentation and early development were supported by an Australian Research Council grant DP140102124 to Trevor Johnston, Adam Schembri, Kearsy Cormier and Onno Crasborn. Archiving was supported by UK Arts and Humanities Research Council (AH/N00924X/1) funding to Kearsy Cormier. Additional corpus enrichment...
BBC is the repository for the linguistic corpora produced by the Language Study Unit of the Free University of Bolzano. The Language Study Unit is devoted to the analysis of how linguistic systems, languages and language varieties coexist and are related to one another within society, groups and individuals. It emphasizes research in the field of language contact and face-to-face interaction in multilingual contexts, as well as language acquisition and learning, as they are exemplified in the area of South Tyrol. Data-driven research is based both on qualitative and quantitative approaches, which range from ethnographic fieldwork to statistical analysis; examined phenomena – code-choice and...
Biak is one of the many languages of the province of Papua (formerly Irian Jaya) in Indonesia. The language belongs to the Austronesian language family, which is the family with the highest number of languages in the world.
Corpora of the CLARIN NL project (2009-2015). These include D-LUCEA, DiscAn, Soundbites, VALID and lessla.
The CORP-ORAL project aims to create a European Portuguese spontaneous speech corpus, complete with orthographic transcription and the prosodic marking of speech breaks/boundaries of the entire corpus, as well as phonetic transcription of a selection of chunks. The main goal is to make this structured information available online to the scientific community for the creation, training and further improvement of speech synthesis and recognition programs. CORP-ORAL is being structured in the following general on-going work phases: a) The recording of 60 hours of conversations between two European Portuguese speakers per conversation (at a time); with a Marantz PMD670 professional recorder...
The ongoing project aims to provide linguists interested in the structure and history of Cape Verdean Creole with natural and experimental data on the highly divergent local varieties, most of which have received very little academic attention so far. The recordings are at the same time a testimony to the regional way of life, and the vocabulary that comes with it, which is subject to drastic change through rapidly improving infrastructure and a growing tourism industry.
The data on the Carib language were collected by Dr. Berend Hoff in the period 1955-1965. See: B.J. Hoff, The Carib Language, Phonology, Morphology, Text and Word Index. Verhandelingen van het Koninklijk Instituut voor Taal-, Land-, en Volkenkunde (Royal Institute of Linguistics and Anthropology) Vol. 55 (1968), Martinus Nijhoff: The Hague. The original recordings on tape were digitized in 2006 in Leiden, by Berend Hoff. The latter owes a debt of gratitude to the Phonetic Laboratory of Leiden University for making available the necessary facilities, and especially to his colleague Jos Pacilly, for instruction and assistance, and for the final organization of the material. The corpus is...
An audio and video corpus of the moribund language Chachapoyas [Quechuan, Peru]. Compiled by Aviva Shimelman (nomdecrayon@gmail.com) February 2015.
The Corpus of the Transcribed Ukrainian Speech (CTUS – Корпус українського транскрибованого усного мовлення, КУТУМ) was created by Olena Plakhotnikova (Taras Shevchenko National University of Kyiv, Institute of Philology, Laboratory of Experimental Phonetics). The recordings contained in CTUS consist mostly of read works of fiction in Ukrainian (particularly Ukrainian standard speech). The speech signal was segmented into syntagms: minimal syntactic-semantic components of speech which consist of one or more words linked to each other by structure and meaning and are characterized by a typical intonation contour.
DBD
The Dutch Bilingual Database comprises data (over 1,500 sessions) originating from Dutch, Sranan, Sarnami, Papiamentu, Arabic, Berber and Turkish speakers.
The songs in this collection were recorded and annotated as part of the project 'Metre and Melody in Dinka Speech and Song', a project carried out by researchers from the University of Edinburgh and the School of Oriental and African Studies in London, and funded by the UK Arts and Humanities Research Council as part of their 'Beyond Text' programme. The project aimed to understand the interplay between traditional Dinka musical forms and the Dinka language (which distinguishes words not just by different consonants and vowels but also by means of rhythm, pitch and voice quality), and to learn more about the way the song tradition responded to the disruptions of the long...
Unsorted donated corpora
The project on the language of perception in Douala (with special attention to smell and taste) is part of a larger project on olfactory language and cognition of the research group 'Meaning, Culture and Cognition', Centre for Language Studies, Radboud University Nijmegen (principal investigator Asifa Majid, funded by the NWO).
This is a corpus of four European sign languages. It contains linguistically annotated video files of Sign Language of the Netherlands (Nederlandse Gebarentaal), British Sign Language, and Swedish Sign Language; data include narratives, dialogues, small lexicons, and poetry. In addition, parts of a corpus of German Sign Language (Deutsche Gebärdensprache) that were previously published on paper are included.