International Summer School on

Language Documentation: Methods and Technology

within the programme

Documentation of Endangered Languages (DoBeS)

of the Volkswagen Foundation

University of Frankfurt / Main

1st - 11th of September, 2004


Lecture tutorial abstracts

 

Lecture Tutorial: Field situations

Tutorials are offered for four different regions:

Sub-Saharan Africa

Mainland Asia (including the Indian subcontinent, Caucasus, Turkey and Siberia)

Insular Southeast Asia and Pacific

North and South America

 

In each of these tutorials, regional specialists will briefly report on their own field situations addressing issues such as: major cultural norms one is prone to miss the first time around; how to get started; obtaining research permits; best seasons for travel; etc. There will be ample time for students to ask questions about typical field situations in these areas.

 

 

Lecture Tutorial: Ethical issues concerning authorship, copyrights, etc.

Tutorials are offered for four different regions:

Sub-Saharan Africa

Mainland Asia (including the Indian subcontinent, Caucasus, Turkey and Siberia)

Insular Southeast Asia and Pacific

North and South America

In each of these tutorials, regional specialists will briefly report on examples from their own fieldwork. Questions to be addressed include: How should intellectual property rights, linguistic taboos, secrecy issues, etc. be handled? How can one make sure that people understand authorship and copyright issues? How should consent be documented?

 

 

Lecture Tutorial: Capacity building in indigenous communities

Tutorials are offered for four different regions:

Sub-Saharan Africa

Mainland Asia (including the Indian subcontinent, Caucasus, Turkey and Siberia)

Insular Southeast Asia and Pacific

North and South America

In each of these tutorials, regional specialists will briefly report on examples from work they are closely familiar with. What works and what does not work when actively involving native speakers in documentation and maintenance work? What is the role of linguists in language maintenance and revitalization? Who supports language work in the area?

 

 

Lecture Tutorial: Documentation theory

Nikolaus Himmelmann, Frank Seifart

Language documentation as a linguistic endeavor in its own right (rather than an ancillary task in an analytical project) has only recently been (re-)introduced to mainstream linguistics. Most discussions of language documentation pertain to practical and technological aspects. But there are also quite a few theoretical problems involved in the idea of 'documenting a language'. This tutorial provides the opportunity to discuss some of these problems. Possible topics include: What does it mean to 'theorize documentation'? What features characterize a good documentation? How can one document negative evidence (and is it important)? What are the competing motivations for language documentation, and how do they influence the choice of documents?
 

Nikolaus P. Himmelmann is Professor of Linguistics at the Ruhr-Universität Bochum, Germany. His major research interests include morphosyntactic typology, grammaticisation theory, prosody and grammar as well as language documentation and description. He has done fieldwork in the Philippines (Tagalog), Sulawesi (Tomini-Tolitoli languages) and East Timor (Waima’a) and published both descriptive materials and contributions to language documentation theory and method.

Frank Seifart is a member of the DoBeS team “Documenting the Languages of the People of the Center (North West Amazon)” and a PhD candidate at the Language and Cognition Group of the Max Planck Institute for Psycholinguistics in Nijmegen (The Netherlands). His major research interests are in Amazonian languages, morphosyntactic typology in general, the typology of systems of nominal classification in particular, language contact, tones, and language documentation theory and practice. He has done fieldwork in various indigenous communities in the North West Amazon, in particular the Miraña communities in South Colombia.

 

 

 

Lecture tutorial: “Language endangerment”

 

Sebastian Drude

 

What exactly is an endangered language, and what is language endangerment? How and why do languages disappear? What is the factual and theoretical background for the relatively recent interest in these topics, and what is the general situation in the world's major regions? Why should we care, and what can be done? This tutorial will introduce some basic concepts and facts, and will, above all, provide the opportunity to discuss these fundamental topics in this new field.

 

Sebastian Drude is a teaching and research assistant at the Free University Berlin and a visiting researcher at the Museu Paraense Emílio Goeldi, Belém, Brazil. He has been carrying out fieldwork among the Awetí Indians in central Brazil since 1998, and since 2000 he has been the principal researcher in the DOBES "Awetí Language Documentation Project". His research interests include theoretical and practical issues in language description and documentation, Integrational Linguistics, language typology, and comparative linguistics among the Tupian languages of lowland South America. Sebastian finished his Ph.D. at the Free University Berlin in 2002 with a theoretical study in lexicography, taking Paraguayan Guaraní as an example.

 

 

Lecture tutorial: “Practising video recording”

 

Jochen Cholin

 

After a short theoretical introduction, the participants will have the opportunity to try different cameras and to
familiarize themselves with the audio and video equipment. A practical session follows in which participants perform
tasks supervised by the course instructor. Every participant will be provided with a handout containing an equipment
list, instructions and a checklist.


After an internship as a camera and sound assistant, Jochen Cholin completed his professional training at a TV
production company in Cologne. Subsequently, he worked as a freelance EB cameraman and as an instructor. Afterwards,
he was employed for a few months on the technical staff of the DOBES project at the Max Planck Institute in Nijmegen,
The Netherlands. Thereafter, Jochen Cholin worked in the production and post-production of advertising films for an
advertising film production company in Düsseldorf. He has worked for all public and private broadcasting stations across
Europe. In addition, Jochen Cholin works on his own film productions. In 2003, he began studying social pedagogy
with media pedagogy as his major field of study.

 

 

 

Lecture tutorial: “Editing and digitizing videos”

 

Maarten Bisseling, Peter Wittenburg

 

In order to guarantee long-term storage of and easy access to behavioral data in language archives, it is important to structure and store media data in a unified way.

Different formats of recordings, e.g. analog video tapes (like Hi-8, VHS) or digital video tapes (like DV), should therefore be transformed to a common digital format, e.g. MPEG-2.

In a practical session we will show how to digitize data from different sources. To that end, we will go into the following subjects at a basic level:

-        Hardware and software configuration

-        the main features of digitizing settings

-        transcoding of digital formats

In order to maintain a unified structure of data, the language archives require users to store only data that is described by metadata. Metadata often describe only the relevant part of a recording (e.g. an interview). Material that is not described by metadata should then be cut.

The seminar will show how to edit (cut, insert parts of) digital video data.
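
As a rough illustration of the transcoding and cutting steps described above, the following Python sketch assembles a command line for the ffmpeg tool. This is an assumption made for illustration only: the abstract does not say which software is used in the seminar, and the file names, time stamps and bitrate are hypothetical. The sketch only builds and prints the command; it does not run it.

```python
import shlex

def build_transcode_command(source, target, start=None, end=None,
                            video_bitrate="6000k"):
    """Build an ffmpeg command that converts a recording to MPEG-2.

    start/end (HH:MM:SS) optionally keep only the part of the recording
    that is described by metadata, e.g. a single interview.
    """
    cmd = ["ffmpeg", "-i", source]
    if start is not None:
        cmd += ["-ss", start]        # cut: where the kept segment begins
    if end is not None:
        cmd += ["-to", end]          # cut: where the kept segment ends
    cmd += ["-c:v", "mpeg2video",    # MPEG-2 video encoder
            "-b:v", video_bitrate,   # hypothetical target bitrate
            "-c:a", "mp2",           # MPEG audio layer II
            target]
    return cmd

# Hypothetical file names and time stamps.
command = build_transcode_command("tape01.dv", "interview.mpg",
                                  start="00:02:10", end="00:17:45")
print(shlex.join(command))
```

Keeping the cut points as parameters means the same function covers both plain format conversion and trimming to the meta-described segment.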

 

Maarten Bisseling studied biology in Nijmegen and Amsterdam. Currently he works at the Max Planck Institute for Psycholinguistics in Nijmegen, where he coordinates the digitisation of recordings (audio, video, photos), mainly for DoBeS program members. This involves editing media files, testing new programs and procedures, and maintaining the quality and consistency of the media files.

 

 


 

Lecture tutorial: “Annotation tools: ELAN”

Han Sloetjes

 ELAN is an annotation tool that allows you to create, edit, visualize, and search annotations for video and audio data. It was developed at the Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands. ELAN is designed for the analysis of spoken language, sign language, and gesture, but it can be used by everybody who works with media files, i.e., with video and/or audio data, for purposes of annotation, analysis and documentation.

ELAN supports:

· the display of speech and/or video signals, together with their annotations;

· time linking of annotations (mostly transcription lines) to media streams;

· linking of annotations to other annotations;

· an unlimited number of annotation tiers as defined by the users;

· different character sets;

· the export as tab-delimited text files;

· the import and export between ELAN and Shoebox;

· several search options.

For each of those topics, basic information will be given. Based on that, the use of the relevant features will be explained step by step. Practical training with the computer will be the main part of the seminar tutorial.
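
To give an idea of what can be done with a tab-delimited export, here is a minimal Python sketch that reads such a file and groups the annotations by tier. The column layout (tier name, begin time in milliseconds, end time in milliseconds, annotation text) is an assumption; the actual columns depend on the options chosen when exporting. The sample rows are invented.

```python
import csv
import io
from collections import defaultdict

# Invented sample; assumed columns: tier, begin (ms), end (ms), annotation.
SAMPLE = (
    "utterance\t0\t1500\thello there\n"
    "gloss\t0\t1500\tgreeting\n"
    "utterance\t1500\t2800\tgoodbye\n"
)

def read_annotations(text):
    """Return {tier name: [(begin_ms, end_ms, value), ...]}."""
    tiers = defaultdict(list)
    reader = csv.reader(io.StringIO(text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for tier, begin, end, value in reader:
        tiers[tier].append((int(begin), int(end), value))
    return dict(tiers)

def overlapping(tiers, tier, start_ms, end_ms):
    """Annotations on one tier that overlap the given time interval."""
    return [a for a in tiers.get(tier, [])
            if a[0] < end_ms and a[1] > start_ms]

tiers = read_annotations(SAMPLE)
print(sorted(tiers))
print(overlapping(tiers, "utterance", 0, 1000))
```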

 

Han Sloetjes worked in several positions in the graphic industry before joining the Imaging and Bioinformatics group at the Hubrecht Laboratory (Netherlands Institute for Developmental Biology) in 2001. As a Java and internet developer he was engaged there in the project “3D digital atlas of zebrafish development”. Since early 2003 he has been working as a software developer at the Max Planck Institute for Psycholinguistics in Nijmegen. He is involved in the development of metadata tools and, for the most part, the annotation tool ELAN.

 

Lecture tutorial: “Useful things to know about HTML/XML for website creation”

Paul Trilsbeek

 

HTML stands for HyperText Markup Language and is the language that is used to create web pages. This lecture tutorial will give an introduction to the language itself and to some tools that are often used to create web pages. It is mainly intended for people with little or no prior experience in this area.

XML stands for eXtensible Markup Language. Unlike HTML, XML is not a fixed language but rather a standard for defining your own data description language. The basic principles of XML will be briefly discussed in the tutorial and a comparison will be made between XML and HTML. The main focus of the tutorial will, however, be on HTML. As an example we will create a web page for one of the endangered languages in the DOBES program.
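
The difference between the two languages can be sketched in a few lines of Python. In the example below, the XML record uses element names that we made up ourselves (they are hypothetical and not part of any standard), whereas the generated page uses only tags from the fixed HTML vocabulary. The sample content is taken from the speaker information in this programme.

```python
from html import escape
import xml.etree.ElementTree as ET

# XML: the element names <language>, <name>, <region> are our own choice.
RECORD = """<language>
  <name>Awetí</name>
  <region>Central Brazil</region>
</language>"""

def record_to_html(xml_text):
    """Turn the self-defined XML record into a minimal HTML page."""
    root = ET.fromstring(xml_text)
    name = escape(root.findtext("name", default=""))
    region = escape(root.findtext("region", default=""))
    # HTML: html, head, title, body, h1 and p are fixed, predefined tags.
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n'
        f'<head><meta charset="utf-8"><title>{name}</title></head>\n'
        f"<body><h1>{name}</h1><p>Spoken in: {region}</p></body>\n"
        "</html>\n"
    )

print(record_to_html(RECORD))
```

Keeping the data in XML and generating the HTML from it means the same record can later be presented in other ways without retyping it.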

 

Paul Trilsbeek studied Sonology at the conservatory in The Hague, the Netherlands. After his studies he worked as a music technologist at the Music Mind Machine project at the University of Nijmegen, where he was responsible for the technical support of the project. There he also created some web-based demos and interfaces for applications developed within the group. Currently he is employed at the Max Planck Institute for Psycholinguistics in Nijmegen. His activities there include Macintosh computer support, web development and handling technical issues regarding material for the DOBES archive.

 


 

 Lecture tutorial: “Metadata tools: IMDI editor and browser”

 Peter Wittenburg

Currently, many language resources are being generated in disciplines such as corpus linguistics, anthropology and language and speech engineering. However, only a few of them are available on the internet. Moreover, most of these resources are not publicly available at all, and only very few people are aware of them.

Linguists, anthropologists and members of endangered-language communities need standards for describing the main characteristics of resources, such as the name of the language spoken and the speaker’s age, sex and educational background. They also need tools that help in generating and accessing these kinds of metadata descriptions.

The seminar will present step by step:

·        How to create metadata;

·        How to make these kinds of descriptions available on a local computer or on the internet;

·        How to browse, search and access these resources.

The seminar will mainly consist of practical training with the respective software.
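
The following Python sketch illustrates the general idea of creating and searching metadata descriptions. It deliberately uses a simplified record with invented element names and invented sample values; it is not the actual IMDI schema and does not use the IMDI editor or browser.

```python
import xml.etree.ElementTree as ET

def make_session(name, language, age, sex, education):
    """Build one simplified metadata record (NOT the real IMDI schema)."""
    session = ET.Element("Session", {"name": name})
    ET.SubElement(session, "Language").text = language
    speaker = ET.SubElement(session, "Speaker")
    ET.SubElement(speaker, "Age").text = str(age)
    ET.SubElement(speaker, "Sex").text = sex
    ET.SubElement(speaker, "Education").text = education
    return session

def find_by_language(sessions, language):
    """Names of all sessions recorded in the given language."""
    return [s.get("name") for s in sessions
            if s.findtext("Language") == language]

# Invented sample records.
sessions = [
    make_session("interview_01", "Language A", 54, "female", "primary school"),
    make_session("story_02", "Language B", 71, "male", "secondary school"),
]

print(ET.tostring(sessions[0], encoding="unicode"))
print(find_by_language(sessions, "Language B"))
```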

 

 

Lecture tutorial: “Introduction to Shoebox and Toolbox”

Irina Nevskaya

Both Shoebox and the Field Linguist's Toolbox are computer programs that help field linguists and anthropologists to integrate various kinds of text data: lexical, cultural, grammatical, etc. They have flexible options for selecting, sorting, and displaying data. They are especially useful for helping researchers to build a dictionary as they use it to analyze and interlinearize text. The name "Shoebox" is a nostalgic reference to pre-computer times, when linguists used to store cards with language examples in old shoeboxes. The name "Toolbox" reflects the multiple tools available within the program, including a database, interlinearization of text, concordancing and wordlists.

For most linguists and anthropologists, managing data on the computer is time-consuming. They collect thousands of data items when learning a language and culture. Shoebox and Toolbox accompany researchers through all stages of their fieldwork. Because these programs integrate various kinds of data and make them quickly available, field workers can spend less time on the computer and invest more time in interacting with and learning from people.

The main points of interest to be discussed during the lecture/tutorial:

1. Functions of the programs: Find, concordance, export, import, sorting, filtering, browse view.

2. Language encodings and database types, interlineary analysis.

3. The Keyman program and Shoebox rendering engines.

4. Encoding different types of derivational stems and word formulas in Toolbox.

5. Unicode encoding.
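
Shoebox and Toolbox store their databases as plain text in which each field starts with a backslash marker. The Python sketch below reads such a file and produces a sorted wordlist. The markers used here (\lx for the lexeme, \ps for the part of speech, \ge for the English gloss) are commonly used dictionary markers, but which markers a given project uses is an assumption, and the entries are invented.

```python
# Invented sample database; each record starts with the \lx marker.
SAMPLE = (
    "\\lx tamu\n\\ps n\n\\ge house\n"
    "\n"
    "\\lx siko\n\\ps v\n\\ge run\n"
)

def parse_records(text, record_marker="lx"):
    """Split backslash-marked text into records of (marker, value) pairs."""
    records = []
    current = None
    for line in text.splitlines():
        if not line.startswith("\\"):
            continue  # blank or continuation lines are ignored in this sketch
        marker, _, value = line[1:].partition(" ")
        if marker == record_marker:
            current = []              # a new record begins
            records.append(current)
        if current is not None:
            current.append((marker, value.strip()))
    return records

records = parse_records(SAMPLE)
wordlist = sorted(dict(r)["lx"] for r in records)
print(wordlist)
```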

 

Irina A. Nevskaya is chief researcher at the Institute of Linguistics, Siberian Division of the Russian Academy of Sciences, Novosibirsk, and associate professor at the Kuzbass State Pedagogical Academy, Novokuzneck, Russia. At present, she holds a teaching contract at the University of Frankfurt.
 

 

Lecture tutorial: "Useful Things to Know about Web-Archives - an introduction"

Peter Wittenburg 

Web-Archives are a very popular way to offer archived material to everyone with the help of ordinary web browsers. Some of the material in these archives can only be accessed when the user has appropriate access rights and has been properly authenticated. Currently, there is also a strong move towards an Open Access policy, since there is a danger that much of our cultural heritage will be privatized. DOBES strongly supports the Open Access policy; however, all persons involved have to adhere to strict legal and ethical principles, which are normally defined by codes of conduct, usage rules, copyright declarations, etc. In the DOBES case, for example, Open Access therefore means that as much material as possible should be openly available and that no commercial interests may be involved.

Further, we can distinguish between different technical solutions in the realization of web-archives. Most holdings - in particular those from small projects or individuals - are realized using ordinary web presentation technologies. They are created like exhibitions in museums and are stored in the usual presentation formats. Typically this means that people assemble guided tours based on HTML, Flash and other presentation technologies. Other archives, such as the DOBES archive, follow the eternal rules of archiving in that they store the material in standardized archival formats such as XML and MPEG. Together with an unbiased organizational structure based on widely accepted metadata formats such as IMDI, this allows every user to derive his or her own specialized exhibitions. We will look at examples of both models: those where the representation standards are primary and special presentations (shows) are secondary, and vice versa. We will point out the advantages and disadvantages of both approaches. Many examples show that a focus on presentation formats can endanger the long-term accessibility of the material. We will also show examples and introduce technologies that allow us to generate attractive presentations from standard archival formats.

Furthermore, we will discuss and show how different media can be integrated in language archives that contain many different linguistic data types. Such types include annotated recordings comprising audio, video and textual material; multimedia lexica that are centered around lexemes and are highly structured; and notes of all sorts, such as sketch grammars or field notes, that are highly unstructured. Often there are close relations between elements of such resources, and users want to exploit these. We will discuss technologies that support the different media types and the different linguistic types, and show some examples.

Finally, we will discuss some future perspectives that have to do with the so-called world-wide Data-GRID initiative. Here the idea is to integrate various archives within the different scientific domains to form one integrated virtual archive for the data of interest for a specific domain, making it possible for researchers to have access to the data they require through one “port” only.

 

Peter Wittenburg

 

Lecture tutorial: "Useful things to know about encoding"

Jost Gippert

The tutorial will address the basics of the computational encoding of linguistic data. It will start with an overview of encoding standards, from 7-bit ASCII via 8-bit ANSI / ISO to Unicode, and discuss problems of the conversion of data across encoding systems. The main focus will be on the following questions:
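
As a small illustration of the step from 7-bit ASCII via 8-bit code pages to Unicode mentioned above, the following Python sketch encodes a word containing a non-ASCII letter in three encodings and shows what goes wrong when bytes are read with the wrong one. The choice of cp1252 as the 8-bit "ANSI" code page and the helper names are assumptions made for this example.

```python
def can_encode(text, encoding):
    """True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

def convert(data, source, target):
    """Convert bytes from one encoding to another via Unicode text."""
    return data.decode(source).encode(target)

word = "Miraña"

print(can_encode(word, "ascii"))    # 7-bit ASCII has no "ñ"
print(word.encode("cp1252"))        # 8-bit: one byte for "ñ"
print(word.encode("utf-8"))         # Unicode (UTF-8): two bytes for "ñ"

# Correct conversion: state the source encoding explicitly.
print(convert(b"Mira\xf1a", "cp1252", "utf-8"))

# Typical conversion problem: UTF-8 bytes misread as cp1252.
print(b"Mira\xc3\xb1a".decode("cp1252"))
```

The last line shows why a file's encoding has to be known, not guessed: the bytes themselves are intact, but interpreting them with the wrong table yields garbled text.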

 

Jost Gippert is Professor of Comparative Linguistics and Dean of the Faculty of Linguistics and Cultural Sciences at the University of Frankfurt / Main, Germany. His major research interests comprise Indo-European and Caucasian languages, from both a synchronic and a historical point of view, as well as computational linguistics, language documentation and description. He has done fieldwork in the Caucasus and the Indian subcontinent. As the founder and leader of the TITUS project, he has been engaged in the digital preparation and dissemination of linguistic data since 1987.

 

 

Special lecture tutorial (in German): “Editing and digitizing videos”

Romuald Skiba

 

In order to guarantee long-term storage of and easy access to behavioral data in language archives, it is important to structure and store media data in a unified way.

Different formats of recordings, e.g. analog video tapes (like Hi-8, VHS) or digital video tapes (like DV), should therefore be transformed to a common digital format, e.g. MPEG-2.

In a practical session we will show how to digitize data from different sources. To that end, we will go into the following subjects at a basic level:

-        Hardware and software configuration

-        the main features of digitizing settings

-        transcoding of digital formats

In order to maintain a unified structure of data, the language archives require users to store only data that is described by metadata. Metadata often describe only the relevant part of a recording (e.g. an interview). Material that is not described by metadata should then be cut.

The seminar will show how to edit (cut, insert parts of) digital video data. The language of the tutorial is German.

 


Special lecture tutorial (in German): “Annotation tools: ELAN”

Romuald Skiba

 

ELAN is an annotation tool that allows you to create, edit, visualize, and search annotations for video and audio data. It was developed at the Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands. ELAN is designed for the analysis of spoken language, sign language, and gesture, but it can be used by everybody who works with media files, i.e., with video and/or audio data, for purposes of annotation, analysis and documentation.

ELAN supports:

· the display of speech and/or video signals, together with their annotations;

· time linking of annotations (mostly transcription lines) to media streams;

· linking of annotations to other annotations;

· an unlimited number of annotation tiers as defined by the users;

· different character sets;

· the export as tab-delimited text files;

· the import and export between ELAN and Shoebox;

· several search options.

For each of those topics, basic information will be given. Based on that, the use of the relevant features will be explained step by step. Practical training with the computer will be the main part of the seminar tutorial. The language of the tutorial is German.

 

 

Special lecture tutorial (in German): “Metadata tools: IMDI editor and browser”

 Romuald Skiba

Currently, many language resources are being generated in disciplines such as corpus linguistics, anthropology and language and speech engineering. However, only a few of them are available on the internet. Moreover, most of these resources are not publicly available at all, and only very few people are aware of them.

Linguists, anthropologists and members of endangered-language communities need standards for describing the main characteristics of resources, such as the name of the language spoken and the speaker’s age, sex and educational background. They also need tools that help in generating and accessing these kinds of metadata descriptions.

The seminar will present step by step:

·        How to create metadata;

·        How to make these kinds of descriptions available on a local computer or on the internet;

·        How to browse, search and access these resources.

The seminar will mainly consist of practical training with the respective software. The language of the tutorial is German.

 

Romuald Skiba (Dr. phil.)

Professional background: Second language acquisition, German as a second Language (DaF), Computer aided language learning (CALL), Special languages (Fachsprachen), Methodology, Corpus linguistics. Editor of the journal “Language Archive Newsletter - LAN” (www.mpi.nl/LAN).

Professional affiliation: Study of Linguistics in Poland and Germany. From 1987 on, a researcher and university teacher in Germany (FU Berlin, TU Berlin, TU Cottbus). Since 2000 at the Max-Planck-Institute for Psycholinguistics in Nijmegen: Acquisition group and corpus manager for the DoBeS project (Documentation of Endangered Languages).

 

 

Fieldwork tutors

 

David Rood (PhD in Linguistics, Univ. of Calif., Berkeley, 1969) is Professor of Linguistics in the Department of Linguistics, University of Colorado, Boulder. He has worked with Wichita (the language of a small tribe in Oklahoma; there are now only 7 living speakers) since 1965, and with Lakota (a large, still viable tribe with many speakers in the northern plains of the US and Canada) since the early 70s. He has written both technical linguistic material and pedagogical material for both languages.

 

Dagmar Jung (PhD University of New Mexico, Albuquerque) is Assistant Professor in the Department of Linguistics, University of Cologne. Her research interests include morphology, grammaticisation, typology and universals, descriptive and documentary linguistics as well as anthropological linguistics. She has done extensive fieldwork on Jicarilla (Apachean) but has also worked with speakers of Guarani, Nepali, Kurdish and Urningangk (Northern Australia).
 

Sonja Gippert-Fritz (PD Dr.):
Studies in General Linguistics and Slavic languages (Univ. of Klagenfurt; Mag.phil. 1979), Comparative Linguistics, Indo-Aryan and Iranian languages (Univ. of Vienna; doctoral thesis on Ossetic, Dr.phil. 1984). Habilitation thesis on Maldivian grammar and dialectology submitted in 1997; since 1998 Privat-Dozent of Modern Indology at the South Asia Institute, University of Heidelberg. Extensive field research stays in the Caucasus, in the Maldives and in Sri Lanka.

 

Rainer Vossen, Dr. phil. (Cologne), Dr. phil. habil. (Bayreuth), is Professor of African Linguistics at the Johann Wolfgang Goethe University of Frankfurt/Main, Germany. He has published extensively on Nilotic, Khoisan, Bantu and Mande languages and language history.

 

 

Round Table Discussion: "Documentation and the Media"

Chair / Introduction:

Ulrike Mosel

Participants with introductory statements:

Peter Austin

Jost Gippert

Ruth Spriggs

Vera Szoellesi

Peter Wittenburg

Selection of topics:

  1. Why should linguists and speech communities care about the media? What do they want the media to do? How can they employ the media for their purposes? Are there any dangers?
  2. How should linguists and speech communities present themselves and their work in the media?
  3. What kind of materials are useful for the media? What do people use web pages and internet archives for? What kind of misuses occur?
  4. To what extent should archives like the DOBES archive be open to the media and the general public?