International Summer School on

Language Documentation: Methods and Technology

within the programme

Documentation of Endangered Languages (DoBeS)

of the Volkswagen Foundation

University of Frankfurt / Main

1st - 11th of September, 2004

Lecture series abstracts


Lecture 1: Language Documentation: What for?

Nikolaus Himmelmann, Ruhr-Universität Bochum

This lecture has three parts. First, it defines language documentation as a field of linguistic inquiry and practice in its own right, which needs to be distinguished from language description in the sense of producing grammars and dictionaries. Second, it points to the more or less obvious fact that language documentation is particularly relevant in the context of language endangerment. Finally, it gives a short preview of the lecture series.

Ad 1) Language documentation is concerned with providing a multifunctional record of the linguistic behavior and knowledge found in a given speech community. The core of such a record consists in recordings of a broad range of communicative events which commonly occur in the speech community. This is complemented by documents of list-like linguistic phenomena such as morphological paradigms, expressions for numbers and measures, folk taxonomies (for plants, animals, musical instruments and styles, and other artifacts), etc. To be accessible to, and useful for, a broad range of users, including the speech community, the recordings of communicative events have to be transcribed, translated and annotated. Furthermore, the documentation as a whole as well as every single document contained in it has to be accompanied by a set of metadata which contain information regarding participants, recording equipment, access rights, etc.

Ad 2) Language documentation is particularly relevant for endangered languages simply because no primary data can be collected any more once a language is no longer spoken. But the relevance of language documentation is not confined to work on endangered languages. It is also relevant to strengthening the empirical basis of (descriptive) linguistics in that it makes accessible the corpus of primary data on which descriptive generalizations, and all further theorizing that builds on them, are based.

This part will also briefly discuss issues in defining language endangerment and the relationship between language documentation and supporting language maintenance. As for language endangerment, it will be argued that a list approach specifying a dozen or so putatively independent indicators of endangerment (such as low speaker numbers, no transmission to younger generations, etc.) and then computing endangerment rankings is essentially misguided. Instead, it is claimed that there is a single major indicator of endangerment, i.e. reduction in usage domains, from which all other indicators are derived. Furthermore, one needs to distinguish clearly between indicators of endangerment – which are generally of a sociolinguistic nature – and causes of endangerment, which are of a more general, socio-economic nature.

Ad 3) The following lectures highlight various issues in actually compiling a language documentation from both a practical and a theoretical point of view. There are essentially three parts to the series.
The first two lectures (lectures 2 and 3) deal with general (i.e. not specifically linguistic) ethical and practical issues which have to be considered before one actually gets to the field: How to interact with speech communities and individual speakers and how to capture, store and process relevant data. These issues are interrelated in that data capture and processing is not just a technological issue but also has to pay attention to sensitivities and interests of the community and the individual speakers contributing data.
The next four lectures (4-7) pertain to the actual recording and processing of relevant data from an anthropological and linguistic point of view. Lectures 4 (Ethnography of communication) and 5 (Lexical knowledge) address the issue of what and how to document, given the goal of creating a lasting and multifunctional record of a speech community. Lectures 6 and 7 deal with the ethnographic and linguistic commentary (or annotation) that is necessary to make recordings and other documents useful to a broad range of users. In both instances, the challenge consists in conceiving of innovative and appropriate formats for relevant annotations.
The final two lectures (8 and 9) return to more practical issues having to do with possible uses and the archiving of language documentations. Lecture 8 takes up the perhaps most important aspect in working with an endangered speech community, i.e. the possibility of getting involved in community language work. Once again, while community language work is of obvious importance in the context of language endangerment, it is also relevant in other settings. Any kind of language documentation will profit both in quantity and quality if the speech community gets actively involved and keeps on expanding and improving the collection independently of a documentation project run by outsiders. Lecture 9 finally summarizes the whole series from the archiving point of view, thus complementing and expanding lecture 3. Apart from an obvious focus on technological issues, a main concern will be a critical review of the different interests and goals of the three major groups involved in the archiving process: the people handing material to the archive, the people running and maintaining the archive, and the users of archival sources.

Nikolaus P. Himmelmann is Professor of Linguistics at the Ruhr-Universität Bochum, Germany. His major research interests include morphosyntactic typology, grammaticisation theory, prosody and grammar as well as language documentation and description. He has done fieldwork in the Philippines (Tagalog), Sulawesi (Tomini-Tolitoli languages) and East Timor (Waima’a) and published both descriptive materials and contributions to language documentation theory and method.



Lecture 2: Cooperative fieldwork with speech communities and speakers: ethics and practicalities

Arienne Dwyer, University of Kansas

Synopsis: What to expect and what is expected of you; what to avoid; how to establish rapport in a community and with host area governments; how to establish a cooperative work team with a speech community.

The quality of linguistic fieldwork depends on the personal relationships between the researcher and his or her indigenous counterparts, and on a flexible but nevertheless carefully organised workflow that accounts for their mutual learner-teacher relationship. While the linguist does not have a thorough knowledge of the language and culture under investigation, the native speakers are not trained linguists. In order to learn the language and familiarise themselves with the culture, the linguists rely on the native speakers as their teachers, and the better these understand the linguists' research aims and methods, the more interesting their contributions will be. Ideally they become so involved that their native speakers' perspectives lead to new insights, a revision of the researcher's hypotheses or even a reframing of the research goals.

The lecture will show how mutual learner-teacher relationships can be developed, and address a number of practical questions such as:

·  how to find a community and live there,

·  how to find indigenous counterparts,

·  how to deal with their different interests and talents,

·  how to organise cooperative work,

·  how to pay consultants, and

·  how to deal with ethical and legal questions, such as consent.

 Keeping in mind that in spite of the best intentions and careful planning, things might go wrong, the lecture will also discuss the most common problems of researchers in their relationships with indigenous people.

Arienne M. Dwyer is Assistant Professor of Linguistic Anthropology at the University of Kansas (Lawrence, U.S.A.). Her research interests include language contact, ethnopoetics, and phonological typology, as well as multimedia archiving and indigenous language documentation and revitalization. Work in her major fieldwork areas of Chinese Turkestan (Xinjiang - Uyghur dialects), northern Tibet (Salar, Monguor, and Northwestern Chinese), and Kyrgyzstan (Kyrgyz) has resulted in a number of past and forthcoming publications and a workshop for Tibetan ethnographers.


Lecture 3: Handling your data

Peter Austin, SOAS, London

This lecture will be about data capture, storage, archiving, processing and use, and will cover issues such as recording and storing sound and video, fonts and Unicode, proprietary and non-proprietary file formats, structuring data with archiving in mind, and dealing with metadata and intellectual property rights. It will not give too many technical details (which will be discussed in lecture tutorials and seminars instead) but suggestions for good practice. A full abstract is given below, after the abstract for lecture 9.

Peter Austin studied at the Australian National University, completing a BA with first class Honours in Asian Studies (Japanese and Linguistics) in 1974, and a PhD in 1978 on the Diyari language spoken in the far north of South Australia. He taught at the University of Western Australia (1978), held a Harkness Fellowship at UCLA and MIT (1979-80), and in 1981 set up the Department of Linguistics at La Trobe University. In 1989 he was instrumental in establishing Japanese language teaching at La Trobe. In 1996 he was appointed Foundation Professor of Linguistics at the University of Melbourne, and joined SOAS in January 2003.
Peter's research interests cover descriptive, theoretical and applied linguistics. He has extensive fieldwork experience on Australian Aboriginal languages (northern New South Wales, northern South Australia, and north-west Western Australia) and has co-authored the first fully page-formatted hypertext dictionary on the World Wide Web, a bilingual dictionary of Gamilaraay (Kamilaroi), northern New South Wales, as well as publishing seven bilingual dictionaries of Aboriginal languages. Since 1995 he has been carrying out research on Sasak and Sumbawan, Austronesian languages spoken on Lombok and Sumbawa islands, eastern Indonesia. His theoretical research is mainly on syntax and focuses on Lexical Functional Grammar, morpho-syntactic typology, computer-aided lexicography and multi-media for endangered languages. He has also published on historical and comparative linguistics, typology, and Aboriginal history and biography.



Lecture 4: The ethnography of language

Jane Hill, University of Arizona

Synopsis: This addresses the issues of what events to record and how to identify important speech events in a community, as well as the basics of data sampling and building a corpus which represents all major types of speaking events in a given community.

The results of field work undertaken for linguistic documentation will be shaped by how well investigators understand the ethnographic contexts of the target language or languages. The classic rules-of-thumb of the ethnography of speaking remain useful, and will be reviewed. Especially important here is work on variation across communities and internal to them on how to accomplish key speech acts in elicitation. New work on communities of practice in education invites ethnography of the novice-expert relationship that may be very important in documentation. Work on the ethnography of language has increasingly turned toward the study of how distributions of influence and access to both material and symbolic resources, understandings about the nature of persons and selves, and understandings about the nature of language itself are shaped by ideology. In speech communities where endangered languages are in use, ideological issues are likely to be especially fraught. Language ideologies may be expressed discursively in highly salient formulations, but may also perpetuate themselves sub rosa through a web of presuppositions and other indexical devices that require careful analysis. Thus this lecture will have three major components: foundations in the ethnography of language, communities of practice in language learning, and approaches to language ideology. Illustrative examples will be drawn from experiences in communities with endangered languages.

Jane H. Hill is Regents' Professor of Anthropology and Linguistics at the University of Arizona. She is a specialist on Native American languages, focussing on the Uto-Aztecan family, with fieldwork on Cupeño, Tohono O’odham, and Nahuatl. Her interests include linguistic documentation, the historical linguistics of the Uto-Aztecan language family, language contact and multilingualism in the U.S. Southwest and Mexico, and the way popular ideas about these phenomena shape the uses of language in communities in those regions, especially in the construction of White racist culture. She is the author of A Grammar of Cupeño, to appear from University of California Publications in Linguistics.



Lecture 5: Lexical knowledge

John B. Haviland, CIESAS-Sureste & Reed College

Synopsis: This lecture deals with things to pay attention to when eliciting and annotating lexical materials (understood in a broader sense, including all kinds of taxonomy as well as morphological paradigms where applicable): How to deal with lexical knowledge which is not accessible through "texts", and how to organize lexical materials.

Language documentation necessarily implies compiling a lexicon, with maximal denotational coverage.  I will begin with familiar cautions—still worth repeating—about obscure ostension, referential indeterminacy, and other practical and conceptual paradoxes of field lexicography.  I will move on to consider different sorts of lexical units (roots vs. stems, word classes, clitics and potentially mysterious particles, for example) and difficulties posed by the distinction between “functional” and “content” lexical forms.  To bridge the gap between extension and intension in specifying meanings I will outline certain techniques from semantic typology which help round out semantic domains, and potentially plug paradigmatic and semantic gaps in lexicons based on textual corpora.  Finally, I will introduce and illustrate different sorts of organizational principles for the resulting dictionaries or thesauri, and demonstrate other computationally tractable mechanisms for representing lexical materials. 

John Haviland is professor of Linguistics and Anthropology at Reed College, and a researcher and professor at CIESAS, Mexico. His work concentrates on Tzotzil (Mayan) speaking peasant corn farmers from Zinacantán, Chiapas, Mexico, and on speakers of Guugu Yimithirr (Paman), especially at the Hopevale Aboriginal Community, near Cooktown, in northern Queensland, Australia. He is currently engaged in a multimedia study of the coordination of gesture, gaze, and speech (supported by a KDI initiative grant from the National Science Foundation), and in the creation of the Archivo de los Idiomas Indígenas de Chiapas, at CIESAS-Sureste, with support from CONACyT (México).


Lecture 6: Ethnographic commentary

Bruna Franchetto, Museu Nacional – Federal University of Rio de Janeiro – CNPq, Brazil

Synopsis: Things an ethnographer needs to know in order to make full use of the recordings and transcriptions which constitute the core of a documentation; working creatively with recorded texts (using textual segments as a basis for further ethnographic elicitation); incorporating ethnographic analyses in a documentation.

 Ethnographical information is a crucial component of any endangered language documentation. If the wider goal is not simply to collect texts and a lexicon, but also to present and preserve the cultural heritages of the documented languages, then detailed ethnographical information must be associated with the annotation of linguistic data. However, the integration of linguistic and ethnographic data in a comprehensive archive is no easy task. Drawing on the experience of the Kuikuro Project (DOBES Program, Brazil), I shall address the following key topics:

 (1) What does an ethnologist or anthropologist look for when he or she accesses or consults this kind of archive? In other words, what would they wish to find? And what may be relevant for a linguist but irrelevant for an anthropologist? Do researchers with different theoretical backgrounds (for example, structuralists versus culturalists) search for different things? We interviewed a number of anthropologists in Brazil and the replies obtained can be summarized into the following points: (i) semantic fields versus thematic fields; which are the relevant thematic fields? (ii) how to allow for the multiple relations existing in more ‘sophisticated’ thematic fields, such as shamanism? (iii) how to avoid restricting oneself to information of use only to people working with a specific human group, thus curtailing the possibility of comparison? (iv) what should a lexicon contain? Carefully elaborated translations and definitions; entries should be treated as conceptual categories; particular attention should be given to terms/concepts which cut across different thematic fields; the contexts where examples occur should be carefully noted; the problem of distinguishing between primitive and derived terms (extensions of meaning) [overlap with lecture 5]; (v) a well produced morphological analysis, enabling non-amateur etymologies of terms/concepts.

 (2) How to treat the social and cultural information associated with annotated texts. How ethnographic notes or comments associated with the text units (utterances) can complement and deepen the information given in the metadata.

 (3) Problems of translation, understood in the widest possible sense, ranging from kinds of transcription and annotation that allow the basic characteristics of verbal performances to be recovered (especially the most elaborate examples from the point of view of form, rhythm, register, vocabulary and meanings) to translation properly speaking, working from a source-language to a target-language.

 (4) What topics should be included in a sketch ethnography, taken as one component of an archive containing the documentation of a language?

 (5) Taking into account the global structure of the Kuikuro language archive, I shall show how its different components inter-relate and examine ways of establishing links between them, including between the lexicon or annotated texts and (other) texts, images (drawings, photos, videos, iconographies, maps, etc.), written material produced by indigenous speakers and researchers, and other components such as sketch ethnography, history, sociolinguistics, comparative studies, etc.

Bruna Franchetto is professor of anthropology and linguistics at the Federal University of Rio de Janeiro (Brazil). Her major research interests include language documentation and description, morphology and syntax, ergativity, comparative and historical linguistics, verbal arts, and linguistic anthropology. She has done fieldwork on languages of the Carib family spoken in Brazil, especially the Upper Xingu Carib Language (Kuikuro). She has published linguistic and ethnological descriptive materials as well as contributions to formal theories of grammar.


Lecture 7: Linguistic commentary

Eva Schultze-Berndt, Universität Graz

Synopsis: Things a linguist needs to know in order to make full use of the recordings and transcriptions which constitute the core of a documentation. Notes on glossing and free translations; working creatively with recorded texts (using textual segments as a basis for further linguistic elicitation); incorporating grammatical analyses in a documentation.

The current standard format for examples and texts in grammars and other linguistic publications combines a phonemic or orthographic transcription, a morpheme-by-morpheme interlinear gloss, and a free translation. This lecture will deal with the ways in which this format can be used and expanded upon, from a linguistic point of view, in the documentation of audio- or video-recorded primary speech data. Illustration will be provided mostly from my own fieldwork on Northern Australian languages.

The processes of transcription and translation obviously involve a considerable degree of linguistic analysis. Transcription, unless a narrow phonetic transcription is used, involves decisions about the phonological system and segmentation into words and larger units such as clauses or intonation units, while glossing and translating involve decisions about the meaning and function of linguistic units. Other issues arising in the context of translation are the choice of metalanguage(s), the conventions to be followed in interlinear glossing, and the question of whether to opt for a more literal or a more idiomatic translation (or both). In the documentation of endangered languages in particular, the question of how to represent code-switching phenomena also often arises.

For a recorded speech event to be a lasting document useful to a broad range of users, it is important that the annotation includes information beyond that provided by the standard format. Information on the ethnographic background of a speech event, dealt with in a different lecture, can obviously be crucial for understanding a text. An annotation capturing the nonverbal context of an utterance or text (especially when it is not video-recorded), including the presence and spatial configuration of discourse participants and discourse referents, can be crucial for the linguistic analysis in itself, e.g. in the case of deictics, spatial terminology in general, or focus phenomena.

Finally, the process of working with a text together with native speakers will often elicit further speech events, which may take the form of paraphrases, translations or commentaries. In an ideal documentation, these will be treated as documents in their own right and linked to the original text through extensive cross-referencing.

The lecture will also discuss possible differences in annotation between different types of documented speech events. For example, a narrative is particularly likely to elicit commentaries from various speakers, while for a conversation, the question of detail of transcription is particularly pressing, and a single overheard utterance needs to be carefully annotated with respect to the verbal and nonverbal context. Finally, for grammatical elicitation, it makes sense to document the researcher's question as well as the response.

Eva Schultze-Berndt is Professor of Linguistics at the University of Graz, Austria. Her research interests include typology, grammar of spoken language, lexical semantics, complex and secondary predication, language contact, and language documentation and description, and her areal focus is on Northern Australian languages.


Lecture 8: Community language work

Ulrike Mosel, Universität Kiel

Synopsis: Working together with communities; things to consider to get the community actively involved in documentation and language maintenance; how to run workshops with communities.

Linguistic fieldwork, especially language documentation, relies heavily on the working relationship between the professional linguist and the indigenous language workers. This is a challenging relationship because, except for their enthusiastic intention of doing something useful for the language, the two parties share little common ground in terms of background and aims. The lecture will first outline the differences between the linguists' and the community's language documentation with respect to text editions, grammars and dictionaries, and then in turn discuss the kind of input the linguist can give into the community's language work.

Apart from helping to create an orthography or solve existing orthographical problems, the linguist's input mainly relates to organisation and training. Thus in order to avoid disappointment, the linguist's first task is to identify the prospective language workers' skills and their needs for training, and then help them to set realistic goals.

Content and methods of training also depend on the envisaged outcome of the language work as well as on the available time and money. Drawing from experiences in the Primary Education Materials Project in Samoa (1997 – 2000) and the Language Documentation Project of Teop in Bougainville, Papua New Guinea (2000 - ), the lecture will deal with two kinds of training: workshops and the individual apprenticeship within a team of language workers.


Ulrike Mosel holds the chair of General and Comparative Linguistics at the University of Kiel, Germany. After she had concluded her studies of Semitic languages at the University of Munich with a PhD thesis on Classical Arabic grammaticography, she specialised in the languages of the South Pacific with a focus on linguistic typology, grammaticography and lexicography. Her books include a syntax of Tolai, a Samoan reference grammar (together with Even Hovdhaugen), and a sketch grammar of Saliba. She supervised twelve PhD theses on descriptive grammar and typology and wrote with Samoan teachers a Samoan grammar for teachers and a Samoan monolingual school dictionary. Currently she is working on the documentation of the Teop language in Bougainville (Papua New Guinea).



Lecture 9: Archiving challenges

Peter Wittenburg, MPI Nijmegen; Jost Gippert, Universität Frankfurt

Synopsis: Issues in constructing, maintaining and using archives; input/output perspectives; current archiving concepts; regulating access. What standards do we need and which are already implemented today? Documentation in a long-term perspective.

We will begin with an analysis of the differing tasks and priorities of the various parties involved in the archiving process. The major focus will be on the potential conflict between data requirements which are relevant for archiving (= long-term perspective) and the needs of the creator(s) and immediate users (short-term requirements). Last, we will describe ways to combine both issues.

1. Short-term vs Long-term Requirements

Three major groups are involved in the process of archiving language materials: the compilers (usually linguists or teams of linguists and native speakers), the archivists, and various user groups (which may include the compilers). These three groups have different interests and priorities, as briefly summarized in the following table:






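[Table: the assignment of entries to rows and columns was lost in extraction. Surviving entries: users in 500 years; long-living representations; data safety; data organization; creation tools; current presentation tools; high quality; explicitness and consistency; presentation & revitalization.]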





2. Long-term Requirements

There are three major points here: (1) The time-line, i.e. the chances of survival and average life-times. (2) The bit-stream preservation task (which will only briefly be touched upon). (3) The interpretation task and its consequences for decisions today. This task has two sub-tasks, one pertaining to structural aspects and one to contents.

3. Short-term Requirements

While the archivists' primary focus is on choosing useful representation formats and proper data structures, compilers and current users focus on presentation and efficiency aspects that are directly linked to currently available tools (such as Word, Shoebox, HTML, etc.).

The task here is to review some popular creation and usage scenarios and their possible advantages, but also to point out their inherent problems for long-term preservation. The latter will be discussed with examples from DOBES projects. We will also discuss problems with open standards such as Unicode.

4. Bridging the Gap

We need pragmatic solutions which bridge the gap between long- and short-term requirements. The archivist cannot assume that the archiving requirements completely dominate the production process. Conversely, the researcher has to adhere to certain rules to allow the archivists to do their job. We report on an emerging ‘etiquette in language archiving’ and technological developments that will help to improve on current input-output relations.


Peter Wittenburg

Jost Gippert is Professor of Comparative Linguistics and Dean of the Faculty of Linguistics and Cultural Sciences at the University of Frankfurt / Main, Germany. His major research interests comprise Indo-European and Caucasian languages both from a synchronic and a historical point of view as well as computational linguistics, language documentation and description. He has done fieldwork in the Caucasus and the Indian subcontinent. As the founder and leader of the TITUS project, he has been engaged in the digital preparation and dissemination of linguistic data since 1987.




Lecture 3: Handling your data [Full abstract]

Peter Austin, SOAS, London


Methods of Documentation

A language documentation project should aim to collect/create audio, video, graphic and text documentation material covering use of language in a variety of social and cultural contexts. The priorities for collecting, recording, analysing, and archiving are:

Projects will typically create materials in several types of media:

Together, these will form the language documentation and should contain a range of linguistic materials, such as:

Researchers should collect, and appropriately record, metadata for all of the collected materials (see also below).

The methods and terminology should be aimed at making knowledge about the language accessible to a wide audience: not only academics, but also community members, as well as learners and teachers. Hence, although the materials researchers prepare may well be of interest for later theoretical analysis, within a documentation project they should avoid expressing the language data in terms of any particular linguistic theory, except where absolutely necessary.

Technical issues

Types of media and their properties

Each type of media – video, audio, text, and metadata – has its own strengths and weaknesses for language documentation, and so a good documentation will consist of a combination of materials in different media.

Video materials are immediate, rich in authenticity, and multi-dimensional in content. Video is often of particular interest to endangered language communities, and can be produced independently within communities without assistance from researchers. 

On the negative side, video is more difficult to produce, and may cause methodological problems for researchers or people appearing in the video. Video is also harder to process, much more demanding to transfer and store, difficult to access without time-consuming annotation, and difficult to preserve in the long-term.

Compared to video, audio materials contain less information, but the relative simplicity and familiarity of sound recording can result in a better linguistic record. The smaller sizes and the nature of digital audio files mean that they are easier to work with, and there is a range of common and easy-to-use software for editing and presenting sound.

Text, traditionally the main method of recording linguistic materials, is compact, stable, and easy to store, access, index, and reuse. However, when used to document language usage, text always involves some kind of abstraction and analysis. This abstraction and analysis provides new representations and generalisations, while at the same time losing information that was in the original event or recording. Therefore, text resources that also retain their connections to an original recording (preferably, a connection that can be followed via a link or other explicit reference), are much stronger forms of language documentation.
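One way to keep such a connection explicit is to store, with every piece of text, the recording it derives from and the time span it covers. The following Python sketch is only an illustration under stated assumptions: the class name `Segment`, its fields, the file name and the placeholder strings are all hypothetical and do not correspond to any particular annotation tool or archive format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A stretch of transcribed speech tied to its source recording."""
    media_file: str      # hypothetical file name of the recording
    start_ms: int        # start of the segment, in milliseconds
    end_ms: int          # end of the segment, in milliseconds
    transcription: str
    translation: str

    def reference(self) -> str:
        """Return an explicit reference (file plus time span in seconds)."""
        start = self.start_ms / 1000
        end = self.end_ms / 1000
        return f"{self.media_file}#t={start:.3f},{end:.3f}"


segment = Segment(
    media_file="session01.wav",
    start_ms=12500,
    end_ms=15250,
    transcription="(transcription placeholder)",
    translation="(translation placeholder)",
)
print(segment.reference())  # session01.wav#t=12.500,15.250
```

Because the time span travels with the text, a reader of the transcription can always return to the original event rather than relying on the abstraction alone.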

Metadata is “data about data” – structured information describing characteristics of events, recordings and other data files. Although usually in the form of text, it can be considered an independent type of media because it is obtained and used entirely differently from all other types of media. Typically, metadata is collected and stored according to some formal specifications. Several types of metadata can be distinguished:

Which of these types of metadata researchers collect and store depends on the type of materials described, the usage and audience that the materials are likely to have, and the formal specification adopted. Metadata is particularly important for effective archiving and discovery of materials. There are several tools for creating, editing, depositing and searching metadata. The Open Language Archives Community (OLAC) has a metadata specification that is used by a number of researchers, while IMDI (ISLE Metadata Initiative) is an alternative used by others. The Text Encoding Initiative (TEI) also offers specifications of data and metadata formats, and the US-based E-MELD (Electronic Metastructure for Endangered Languages Data) project offers advice on data and metadata formats.
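To make the idea of metadata "collected and stored according to some formal specification" concrete, here is a minimal Python sketch. It is not an implementation of OLAC or IMDI: the field names, the list of required fields and the sample record are all hypothetical, chosen only to illustrate checking a record against a specification.

```python
from dataclasses import dataclass, field, asdict

# Hypothetical required fields. A real project would take these from the
# specification it adopts (for example OLAC or IMDI), not from this sketch.
REQUIRED = ("title", "language", "participants", "access_rights")


@dataclass
class Metadata:
    title: str = ""
    language: str = ""
    participants: list = field(default_factory=list)
    recording_equipment: str = ""
    access_rights: str = ""


def missing_fields(record):
    """Return the names of required fields that are empty in this record."""
    values = asdict(record)
    return [name for name in REQUIRED if not values[name]]


# Hypothetical record for one recording session.
record = Metadata(
    title="Sasak narrative, text 023",
    language="Sasak",
    participants=["speaker A", "researcher"],
)
print(missing_fields(record))
```

A check of this kind can be run before materials are deposited, so that gaps in the metadata are found while the information is still easy to obtain.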

For further information, see Bird & Simons 2003, and the following websites:

Data formats

Researchers need to choose formats for their data. Choosing the best data formats can be complex, and formats can change as technologies and conventions evolve.

It is important to distinguish at least the following:

To accomplish data encoding, some linguists use databases or SIL’s Shoebox, which marks data structures using “field markers” at the beginning of lines. XML, however, offers the ability to encode more complex and explicit structures, and is a more robust archiving format (see below).
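As a rough illustration of field-marked data, the Python sketch below reads a small sample into explicit records. The markers \lx, \ps and \ge follow the common MDF convention and are an assumption here, as are the function name and the sample itself; real Shoebox files use many more markers and conventions than this handles.

```python
def parse_field_marked(text):
    """Split field-marked text into records, each a list of (marker, value) pairs.

    Assumes every field starts with a backslash marker at the start of a line,
    a blank line separates records, and unmarked lines continue the previous field.
    """
    records, current = [], []
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            # Blank line: close the current record, if there is one.
            if current:
                records.append(current)
                current = []
        elif line.startswith("\\"):
            marker, _, value = line[1:].partition(" ")
            current.append((marker, value.strip()))
        elif current:
            # Continuation line: append to the value of the previous field.
            marker, value = current[-1]
            current[-1] = (marker, (value + " " + line.strip()).strip())
    if current:
        records.append(current)
    return records


# Hypothetical sample using forms from the Sasak examples in this section.
sample = r"""\lx tunggu
\ps vtr
\ge to await, to wait for

\lx dait
\ps vtr
\ge meet
"""

records = parse_field_marked(sample)
for record in records:
    print(dict(record))
```

The structure is implicit in the order of the lines, which is why an export to a more explicit format is usually needed before archiving.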

In some cases, there will be standard or conventional choices. In character encoding, for example, most texts in “Traditional Chinese” characters have Big5 encoding, although Unicode (ISO 10646) is a recently favoured option.
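A minimal Python sketch of moving text from a legacy character encoding to Unicode is given below. The two-character sample string is an illustrative assumption, not data from any project; the codec names "big5" and "utf-8" are those provided by Python's standard library.

```python
# Hypothetical sample: two Traditional Chinese characters meaning "language".
text = "語言"

big5_bytes = text.encode("big5")      # legacy encoding of the same characters
decoded = big5_bytes.decode("big5")   # back to Unicode code points
utf8_bytes = decoded.encode("utf-8")  # a Unicode encoding of the same text

# The characters are identical; only the stored bytes differ.
print(decoded == text, len(big5_bytes), len(utf8_bytes))
```

The point of the round trip is that the character encoding has to be known and recorded: the same bytes read with the wrong encoding give garbled text.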

For file encoding, it is generally best to use open, non-proprietary, formats. Proprietary formats, such as those produced by MS-Word or FileMaker Pro, can be changed by the software publishers, and may be commercial secrets, so they make poor choices for archiving and long-term access to the data.

However, making the best choices may not always be easy. Some researchers distinguish between archive format, working format and distribution (or publication) format for data and develop ways to transfer or flow data (through export and import routines) between the different formats. Proprietary software tools may be the most efficient tools for working with data, so materials might be prepared using them but then exported to more standard or archivable formats; this needs careful planning. And some formats, such as PDF (“Portable Document Format”, created by Adobe Systems), are proprietary but fully open, and can be created and read by many software products.

Thus for text data we might have:

Archive format:        Unicode XML plain text
Working format:        MS Access database
Publication format:    PDF or MS Word RTF file


Text formats – XML

The Extensible Markup Language (XML) is a document description language used to describe the content of structured documents: each part of a structured document is described within a defined and logical structure (stored in an XML Document Type Definition, or DTD). As a result, XML documents can be created, processed and transformed in a principled way by editors, stylesheets (written in the Extensible Stylesheet Language, XSL), transformations and other document processing agents.


XML is a subset of SGML (Standard Generalised Markup Language) and separates structure from format, which are confused in HTML (Hypertext Markup Language, used on the World Wide Web). XML uses markup declarations to describe structure (in a Document Type Definition), and markup to structure content.



Document Type Definitions are made up of element type declarations, which define the elements that make up a document type by setting out their name and their content:

         <!ELEMENT NAME CONTENT>

Inside the XML markup document this will appear as <NAME>CONTENT</NAME>. CONTENT can be defined as a set of attributes (properties), one or all of which may or must be present.

There are various DTDs that have already been formulated, such as a DTD for dictionaries or one for annotated text.


Linguistic Examples

1. Dictionary skeleton

1. dictionaries contain entries

2. attributes of entries are: form, category, meaning specification

3. meaning specification can be for polysemes or homophones

4. polyseme and homophone attributes are: gloss, example form, example gloss (free translation)


eg. printed Sasak dictionary entry:

tunggu  vtr

         1. to await, wait for, eg. tunggungkè ‘He waited for her’

         2. to guard, eg. Léq kuri bedait kance patih si tunggu kuri ‘At the door she met the bodyguard who was guarding it’


Represented in XML as:



<DIC>
    <ENTRY>
        <ENTRY_FORM> tunggu </ENTRY_FORM>
        <ENTRY_CAT> vtr </ENTRY_CAT>
        <POLYSEME>
            <GLOSS> to await, to wait for </GLOSS>
            <EX_FORM> tunggungkè …. </EX_FORM>
            <EX_GLOSS> He waited for her </EX_GLOSS>
        </POLYSEME>
        <POLYSEME>
            <GLOSS> to guard </GLOSS>
            <EX_FORM> Léq kuri bedait kance patih si tunggu kuri </EX_FORM>
            <EX_GLOSS> At the door she met the bodyguard who was guarding it </EX_GLOSS>
        </POLYSEME>
    </ENTRY>
</DIC>

This is usually written as:


<?xml version="1.0" standalone="no"?>
<!DOCTYPE DIC SYSTEM "dic.dtd">
<DIC>
<ENTRY>
<ENTRY_FORM> tunggu </ENTRY_FORM> <ENTRY_CAT> vtr </ENTRY_CAT>
<POLYSEME> <GLOSS> to await, to wait for </GLOSS> <EX_FORM> tunggungkè …. </EX_FORM> <EX_GLOSS> He waited for her </EX_GLOSS> </POLYSEME>
<POLYSEME> <GLOSS> to guard </GLOSS> <EX_FORM> Léq kuri bedait kance patih si tunggu kuri </EX_FORM> <EX_GLOSS> At the door she met the bodyguard who was guarding it </EX_GLOSS> </POLYSEME>
</ENTRY>
</DIC>


2. Corpus skeleton

1. a corpus contains texts

2. texts contain sentences

3. sentence properties are: sentence form, sentence gloss, sentence reference

4. sentences contain words

5. word properties are: word form, word gloss, word function

6. words contain morphemes

7. morpheme properties are morpheme form, morpheme gloss, morpheme category
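The skeleton above can be sketched as nested data types that are then serialized to XML. This is only an illustration in Python: the class names and the wrapper element names SENTENCE, WORD and MORPH are assumptions (the property element names follow the Sasak example in this section), and only the first two words of the example sentence are filled in.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Morpheme:
    form: str
    gloss: str
    cat: str


@dataclass
class Word:
    form: str
    gloss: str
    func: str
    morphemes: list = field(default_factory=list)


@dataclass
class Sentence:
    ref: str
    form: str
    gloss: str
    words: list = field(default_factory=list)


def add_text(parent, tag, text):
    """Append a child element that holds only text."""
    ET.SubElement(parent, tag).text = text


def sentence_to_xml(sentence):
    # SENTENCE, WORD and MORPH are assumed wrapper names, not from the source.
    s_el = ET.Element("SENTENCE")
    add_text(s_el, "SENTENCE_REF", sentence.ref)
    add_text(s_el, "SENTENCE_FORM", sentence.form)
    add_text(s_el, "SENTENCE_GLOSS", sentence.gloss)
    for word in sentence.words:
        w_el = ET.SubElement(s_el, "WORD")
        add_text(w_el, "WORD_FORM", word.form)
        add_text(w_el, "WORD_GLOSS", word.gloss)
        add_text(w_el, "WORD_FUNC", word.func)
        for morph in word.morphemes:
            m_el = ET.SubElement(w_el, "MORPH")
            add_text(m_el, "MORPH_FORM", morph.form)
            add_text(m_el, "MORPH_GLOSS", morph.gloss)
            add_text(m_el, "MORPH_CAT", morph.cat)
    return s_el


sentence = Sentence(
    ref="Sas-t023s109",
    form="Ie bedait kance patih si tunggu kuri",
    gloss="She met the bodyguard who was guarding the door",
    words=[
        Word("Ie", "she", "pro", [Morpheme("ie", "3sg", "pro")]),
        Word("bedait", "meet", "vi",
             [Morpheme("be", "detrans", "pref"), Morpheme("dait", "meet", "vtr")]),
    ],
)

sentence_el = sentence_to_xml(sentence)
print(ET.tostring(sentence_el, encoding="unicode"))
```

Each level of the skeleton (sentence, word, morpheme) becomes one level of nesting, so the containment relations in the list are explicit in the output.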


eg. printed Sasak text

              Ie bedait kance patih si tunggu kuri

              ‘She met the bodyguard who was guarding the door’

Represented in XML as:

        <SENTENCE_REF> Sas-t023s109 </SENTENCE_REF>
        <SENTENCE_FORM> Ie bedait kance patih si tunggu kuri </SENTENCE_FORM>
        <SENTENCE_GLOSS> She met the bodyguard who was guarding the door </SENTENCE_GLOSS>

                <WORD_FORM> Ie </WORD_FORM>
                <WORD_GLOSS> she </WORD_GLOSS>
                <WORD_FUNC> pro </WORD_FUNC>

                        <MORPH_FORM> ie </MORPH_FORM>
                        <MORPH_GLOSS> 3sg </MORPH_GLOSS>
                        <MORPH_CAT> pro </MORPH_CAT>

                <WORD_FORM> bedait </WORD_FORM>
                <WORD_GLOSS> meet </WORD_GLOSS>
                <WORD_FUNC> vi </WORD_FUNC>

                        <MORPH_FORM> be </MORPH_FORM>
                        <MORPH_GLOSS> detrans </MORPH_GLOSS>
                        <MORPH_CAT> pref </MORPH_CAT>

                        <MORPH_FORM> dait </MORPH_FORM>
                        <MORPH_GLOSS> meet </MORPH_GLOSS>
                        <MORPH_CAT> vtr </MORPH_CAT>

This is usually written as:



XSL allows us to present some aspects of a document in a specified form, eg. all and only the example sentences with their headword, or all transitive verbs with their example numbers. We do not normally write XML by hand but export it from working-format documents.
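The kind of selection described here can be illustrated without an XSL processor. The sketch below is not an XSL stylesheet: it uses Python's standard xml.etree.ElementTree module as a stand-in, applied to the dictionary entry from this section. The function name and the optional category filter are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# The Sasak dictionary entry from this section, without the DOCTYPE line.
DIC_XML = """<DIC>
<ENTRY>
<ENTRY_FORM> tunggu </ENTRY_FORM> <ENTRY_CAT> vtr </ENTRY_CAT>
<POLYSEME> <GLOSS> to await, to wait for </GLOSS>
<EX_FORM> tunggungkè …. </EX_FORM> <EX_GLOSS> He waited for her </EX_GLOSS> </POLYSEME>
<POLYSEME> <GLOSS> to guard </GLOSS>
<EX_FORM> Léq kuri bedait kance patih si tunggu kuri </EX_FORM>
<EX_GLOSS> At the door she met the bodyguard who was guarding it </EX_GLOSS> </POLYSEME>
</ENTRY>
</DIC>"""


def examples_with_headword(root, category=None):
    """Return (headword, example number, example form) for every example.

    If category is given, only entries of that category are included.
    """
    rows = []
    for entry in root.iter("ENTRY"):
        if category and entry.findtext("ENTRY_CAT", "").strip() != category:
            continue
        headword = entry.findtext("ENTRY_FORM", "").strip()
        for number, sense in enumerate(entry.findall("POLYSEME"), start=1):
            rows.append((headword, number, sense.findtext("EX_FORM", "").strip()))
    return rows


root = ET.fromstring(DIC_XML)
for row in examples_with_headword(root, category="vtr"):
    print(row)
```

Because the structure is explicit, the same file can be presented in several ways without editing the data itself.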


Linguist’s Tools

Writing XML documents by hand is tedious unless there is a tool (eg. an editor) that can help (by checking markup tags, correcting syntax, ensuring DTD compliance etc). Mostly, we store our documents in a database and export the data into an XML-structured file for data exchange or transformation. There are numerous tools available to help with linguistic analysis: either general-purpose tools that the user must design and program (eg. MS Access, FileMaker Pro) or special-purpose tools that are used for certain tasks (eg. Shoebox, ELAN).
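The export step from a database to an XML file can be sketched as follows. This is an illustration only: an in-memory SQLite table stands in for the working database (not MS Access or FileMaker Pro), the table and column names are hypothetical, and the element names follow the dictionary example in this section.

```python
import sqlite3
import xml.etree.ElementTree as ET

# An in-memory SQLite table stands in for the working-format database.
# The table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lexicon (form TEXT, cat TEXT, gloss TEXT)")
conn.executemany(
    "INSERT INTO lexicon VALUES (?, ?, ?)",
    [("tunggu", "vtr", "to await, to wait for"), ("dait", "vtr", "meet")],
)


def export_lexicon(connection):
    """Build a DIC element with one ENTRY (holding one POLYSEME) per row."""
    dic = ET.Element("DIC")
    query = "SELECT form, cat, gloss FROM lexicon ORDER BY form"
    for form, cat, gloss in connection.execute(query):
        entry = ET.SubElement(dic, "ENTRY")
        ET.SubElement(entry, "ENTRY_FORM").text = form
        ET.SubElement(entry, "ENTRY_CAT").text = cat
        sense = ET.SubElement(entry, "POLYSEME")
        ET.SubElement(sense, "GLOSS").text = gloss
    return dic


dic = export_lexicon(conn)
# Serialize as UTF-8 with an XML declaration, ready to be written to a file.
xml_bytes = ET.tostring(dic, encoding="utf-8", xml_declaration=True)
print(xml_bytes.decode("utf-8"))
```

Keeping the export as a script means it can be re-run whenever the working data changes, so the archive copy does not have to be edited by hand.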

The special purpose tools that some people use are:

1.     MediaTagger for multimedia annotations (from MPI-Nijmegen)

2.     Transcriber for audio annotations (free)

3.     Shoebox for text and lexicon annotations (SIL, $20)

4.     Praat for speech analysis and annotation (free)

5.     ELAN for audio and video annotation (from MPI-Nijmegen)

6.     Converter programs to transfer data between formats (from MPI-Nijmegen)

For further information, see:

Sound and video formats

Real-time media (audio and video, especially video) is the area where there is the most rapid technological change, the most difficulty in making the best choice, and the most uncertainty about long-term preservation.

For video, MPEG-2 format is currently recommended, although this might be superseded by MPEG-4 in the future. These formats involve “lossy” compression: using them causes some of the video signal to be permanently lost. While formats with lossy compression may not be ideal for documentation and archiving, it is currently virtually impossible to store uncompressed digital video.

For sound, uncompressed data at CD quality (44.1 kHz, 16 bit), encoded as WAV or CD-Audio, is best. Many people are finding that Sony’s minidisk (MD) provides a convenient, robust method of recording, and that the sound quality is adequate for language documentation. However, MD typically uses a proprietary compressed format (“ATRAC”) which is unsuited to archiving, so the data must be converted to a more suitable format for long-term storage.
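To give a sense of the storage that uncompressed sound requires, the following Python sketch works out the size of CD-quality audio. The function name is hypothetical, and two channels (stereo) is an assumption; a mono field recording needs half as much.

```python
def pcm_bytes(seconds, sample_rate=44_100, bits=16, channels=2):
    """Size in bytes of uncompressed PCM audio, excluding file headers."""
    return seconds * sample_rate * (bits // 8) * channels


one_hour = pcm_bytes(3600)
print(one_hour, "bytes, about", round(one_hour / 1_000_000), "MB per hour")
```

At 44,100 samples per second, 2 bytes per sample and 2 channels, one second takes 176,400 bytes, so one hour of stereo recording takes about 635 MB.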

Recording equipment and storage media

For video and sound you should distinguish between data formats and the physical equipment used to acquire and store the data.

Each kind of recording equipment has its strengths and weaknesses in usability, convenience, accuracy, expense, and recording media. While DAT, for example, will provide better quality sound recording than MD, DAT recorders are more expensive, more fragile, prone to problems in extreme climates, and their tapes are more difficult to obtain, so in many cases MD will be a better choice. For both DAT and minidisk, the physical media are poor choices for long-term preservation, so the data should be migrated to CD or DVD as soon as possible.

At present, recording equipment is changing rapidly: the newest Sony Hi-MD minidisk recorders can record uncompressed WAV sound data, and some machines record sound direct to CD or solid-state memory. New video recorders can record direct to mini-DVD.


Archiving involves preparing materials so that they are as informative and explicitly expressed as possible, encoding them in the best ways to ensure their long-term accessibility, and then storing them safely. Archiving is for the potential benefit of the language community, for the safekeeping of the researcher’s own work, and for use by other researchers and interested people in the future.

All the usable resources that a project creates should be deposited with a reliable archive, with very few exceptions. Exceptions may include raw notes, or unedited video and audio. Normally an archive will allow the researcher to reserve access to some materials for research purposes for a certain period of time during and after the research, but materials should remain accessible to those who provided the data and possibly other language community members, except under special circumstances.

Several archives, such as the DoBeS archive at MPI Nijmegen and the ELAR archive at SOAS, are digital archives: all the materials are stored electronically, using computers. This enables not only the inclusion of media such as sound and video, but also the addition of data that provide integration and navigation among the materials.

Digital archiving involves much more than handing over data files so that others can store them or work on them. Archivists encourage you to produce rich, structured documentations that match the capabilities of the digital medium. Important layers of linguistic representation can be added through the use of suitable technologies to structure data and make links between various items of data. It is recommended to make as much linkage as possible within the data: for example, between transcriptions and sound/video (e.g. time-aligned annotation, or data showing the relationship between text and the time-location in the audio/video); and links between analysed text material and a lexicon or grammar.
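The link between a transcription and a time-location in the audio or video can be pictured with a small data structure. The Python sketch below is a simplified illustration, not the format of any particular annotation tool: the class and function names, the media file name and the time offsets are all hypothetical, while the Sasak forms and translations are taken from the examples earlier in this section.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    start_ms: int        # offset into the media file where the utterance starts
    end_ms: int          # offset where it ends
    transcription: str
    translation: str


# Placeholder file name and invented time offsets, for illustration only.
MEDIA_FILE = "recording.wav"
tier = [
    Annotation(0, 2400, "Ie bedait kance patih si tunggu kuri",
               "She met the bodyguard who was guarding the door"),
    Annotation(2400, 4100, "tunggungkè", "He waited for her"),
]


def annotation_at(annotations, time_ms):
    """Return the annotation whose time span contains time_ms, or None."""
    for annotation in annotations:
        if annotation.start_ms <= time_ms < annotation.end_ms:
            return annotation
    return None


hit = annotation_at(tier, 1000)
print(MEDIA_FILE, hit.start_ms, hit.end_ms, hit.translation)
```

With the link stored explicitly, a user can move from any line of text to the matching stretch of the recording, and from a point in the recording to its transcription.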

All archive deposits need to be accompanied by metadata (described above), describing the sources and other characteristics of the recordings and data files.

Archiving and dissemination

Dissemination of digital materials, typically via the World Wide Web, is an entirely different process from archiving. Publishing materials on the World Wide Web is not a form of archiving: 

·       archived materials are typically more comprehensive than would normally be published on the World Wide Web

·       typically, web-based materials have no guarantee of preservation

·       archives contain some materials that are not currently publishable due to sensitivities but may be important for future revitalisation of the language, or research of various kinds

Documentation should also include properly described records of the status (or restrictions, sensitivities etc) of materials. Typically, an archive will provide World Wide Web access to a catalogue of materials and, where appropriate, access to materials themselves. In all cases, restrictions and sensitivities expressed by the language community will be respected.

An archive catalogue will inform the public about the existence of materials (hence allowing them to be ‘discovered’ through internet searching). The catalogue will not contain the actual content of the materials. Archives’ policies differ, but some materials will be made available to various users, subject to the conditions/restrictions attached to materials or parts of materials, and depending on the type of user.

Intellectual property, ownership and financial issues

In general, intellectual property (IP) rights and sensitivities are not acceptable reasons for not archiving materials that are collected in a documentation project. In fact, descriptions of IP rights, sensitivities and other conditions should be collected by the researcher as part of the research and archived together with the materials. Endangered languages archives will generally respect expressed IP rights and conditions of access.

It is important to discuss and negotiate with all relevant parties early in a project the ownership of intellectual property (IP) arising from the research activities. Record the results of discussion, and, where relevant, the IP status of each item resulting from the research. Typically, if you do not formulate IP ownership, it will be assumed that it rests with the information provider where identifiable, or otherwise with the institution where the research was carried out.

Another aspect to discuss and negotiate is the distribution of any royalty/income generated as a result of publishing documentation materials. The formulation may be different from the formulation of IP ownership. Researchers are encouraged to formulate an income distribution that includes a benefit to the language community.