Participants
1. Anthony Aristar (Eastern-Michigan University, USA)
2. Helen Aristar-Dry (Eastern-Michigan University, USA)
3. Sebastian Drude (MPI, Nijmegen, Holland)
4. Judith Eckle-Kohler (Darmstadt University, Germany)
5. Alexandra Ernst (Frankfurt University, Germany)
6. Ralf Gehrke (Frankfurt University, Germany)
7. Rüdiger Gleim (Frankfurt University, Germany)
8. Jolanta Gelumbeckaitė (Frankfurt University, Germany)
9. Jost Gippert (Frankfurt University, Germany)
10. Jeff Good (Buffalo University, USA)
11. Armin Hoenen (Frankfurt University, Germany)
12. Zahurul Islam (Frankfurt University, Germany)
13. Thomas Jügel (Frankfurt University, Germany)
14. Andy Lücking (Frankfurt University, Germany)
15. Alexander Mehler (Frankfurt University, Germany)
16. Brent Miller (Eastern-Michigan University, USA)
17. Roland Mittmann (Frankfurt University, Germany)
18. Irina Nevskaya (Frankfurt University, Germany)
19. Justin Petro (Eastern-Michigan University, USA)
20. Timothy Price (Frankfurt University, Germany)
21. Gary Simons (SIL, USA)
22. Maria Sukhareva (Frankfurt University, Germany)
23. Menzo Windhouwer (MPI, Nijmegen, Holland)
Lectures
1. Jost Gippert (Frankfurt University, Germany):
RELISH meets LOEWE.
Presentation.
2. Brent Miller (Eastern-Michigan University, USA):
The LEGO Project.
Presentation.
LEGO (Lexicon Enhancement via the GOLD Ontology, NSF BCS-0753321), a project run by The LINGUIST List in collaboration with the University at Buffalo, has digitized a number of lexicons and tagged them with concepts from GOLD (the General Ontology for Linguistic Description). Because the data is linked to GOLD, the lexicons have been made interoperable, and users will be able to search for morpho-syntactic information at a fine-grained level across languages. The project has focused on digitizing lexicons of endangered and under-documented languages, as some of these lexicons are the only documented resource available for their language. With permission from our data contributors, an XML export of each lexicon will be made available, and the current export formats will be expanded to include the finished RELISH schema. Currently, 18 lexicons and wordlists have been 'published' to the development site, with 10 more nearing completion. As work continues on adding data to the site and refining the interface, LEGO will become a useful resource for linguists, semanticists, typologists, and researchers from other disciplines interested in language information.
3. Justin Petro (Eastern-Michigan University, USA):
LEGO and RELISH 1: Harmonizing Lexicon Terminology.
Presentation.
There is a general consensus in the field of linguistics that lexical data needs to be digitized and archived in a way that conforms to established standards for digital interoperability and an integrated cyberinfrastructure. When the lexical data in question is the only documentation of an endangered language, the need for such standards becomes even more pressing. Unfortunately, in tackling this issue, diverging standards for digital language documentation have developed in Europe and the U.S. RELISH (Rendering Endangered Languages Lexicons Interoperable through Standards Harmonization, NEH Grant HG 50010-09), a collaborative project between the Institute for Language and Information Technologies (ILIT) at Eastern Michigan University, the Max-Planck-Institute for Psycholinguistics, and the Johann Wolfgang Goethe-Universität Frankfurt, is working to harmonize key European and American digital standards, establish a unified way of referencing lexicon structure and linguistic concepts, and develop a procedure for migrating heterogeneous lexicons to a standards-compliant XML format, using six lexicons of endangered languages to demonstrate its utility. Not only will the project give these important linguistic records the widest possible accessibility for scholarly research, it will also serve as a crucial step towards greater scientific collaboration across continents.
4. Helen Aristar-Dry (Eastern-Michigan University, USA):
LEGO and RELISH 2: Harmonizing Lexicon Structure.
Presentation.
This presentation discusses the RELISH schema, a serialization of the LMF data model which is based upon the LEGO project's version of LIFT (Lexicon Interchange Format). LIFT is an XML schema created by SIL International for the storage and exchange of lexical information; it is designed to be an interchange format among SIL tools, e.g., Toolbox, FieldWorks, and WeSay. Development of the RELISH schema was facilitated by restrictions on the LEGO version of LIFT (LL-LIFT), which is constrained to allow only one instance of grammatical information per entry. This restriction allowed us to achieve a one-to-one mapping of most elements between LL-LIFT and LMF. In addition, references to data categories in the ISOcat registry were added for LMF compliance, and TEI feature structures were added to allow for the distribution of notes in LIFT.
5. Sebastian Drude (MPI, Nijmegen, Holland):
The Language Archive at the MPI.
Presentation.
Since 1 September, the Max-Planck-Institute for Psycholinguistics in Nijmegen has had a new unit called “The Language Archive” (TLA). It continues the work begun by the Technical Group under Peter Wittenburg: building a sustainable digital infrastructure for valuable data, in particular data on endangered languages collected by the more than 50 DOBES projects over more than 10 years, but also data on other languages and research data of other types.
The Language Archive has as its main tasks to:
digitize and archive language resources deposited by any researcher worldwide,
provide access to archived language resources while respecting legal and ethical restrictions,
develop tools, services and infrastructures that can be freely used for research purposes,
set up regional archives worldwide to foster language documentation and revitalization,
organize and participate in education and training activities, and
give help and support.
In this talk I will briefly present the TLA, its mission, achievements and ongoing and future work.
6. Menzo Windhouwer (MPI, Nijmegen, Holland), Irina Nevskaya (Frankfurt University, Germany):
On the way to a Relation Registry for ISOcat data categories.
Presentation.
To foster reuse in various contexts, the ISOcat Data Category Registry offers a list of elementary data categories. The Relation Registry, called RELcat, will allow (groups of) users to specify ontological relationships between data categories and thus semantically describe specific contexts. Although the relation types supported by RELcat will not be fixed, a core taxonomy of relation types is provided. This presentation will describe the ideas behind the Relation Registry and its first implementation, RELcat. The taxonomy of relation types and useful extensions will also be discussed.
7. Judith Eckle-Kohler (Darmstadt University, Germany):
What computers need to know about linguistics - data categories for computational lexicons in ISOcat.
At UKP-Lab, there is ongoing work on the large-scale integration of lexical-semantic resources, such as WordNet, FrameNet, VerbNet, Wikipedia and Wiktionary. In this context, we have developed a comprehensive lexicon model in the ISO-standard Lexical Markup Framework. This lexicon model currently covers two languages, English and German. The talk will focus on the syntactic part of this lexicon model, where the valency and syntactic behaviour of verbs and other predicate-like lexemes are specified. The ISOcat data categories required for a fine-grained specification of valency are considered in detail.
8. Alexander Mehler, Rüdiger Gleim, Alexandra Ernst, Andy Lücking (Frankfurt University, Germany):
Linguistic networks.
Presentation.
Linguistic Networks is a framework for modeling networking in the domain of language and language use. It comprises instruments for the analysis and synthesis of networks as well as visualization tools for network exploration. Mainly using the example of the Patrologia Latina corpus, we (i) describe the preprocessing architecture underlying network induction, (ii) show how time series of these networks are generated, (iii) exemplify explorative analyses of linguistic networks by means of sample lemma and word form networks, and (iv) introduce an online platform for visualizing those networks.
9. Jeff Good (Buffalo University, USA):
Wordlists as a test case for semantic and structural interoperation.
Presentation.
Wordlists are a type of resource widely employed in field and comparative linguistics in order to produce lexical data that can be readily compared. While their lexical content tends to be relatively sparse, they are available for thousands of the world's languages, resulting in coverage that cannot be matched by other kinds of lexical resources. This talk will discuss the structure of wordlists and suggest that they can serve as a good test dataset for the development of ontologies for semantic concepts and for the use of linked data in linguistics.
10. Armin Hoenen & Thomas Jügel (Frankfurt University, Germany):
Building a digital lexicon - trial and error - the cases of Latin and Avestan.
Presentation.
In the context of the LOEWE project we are developing a lexical representation for Avestan, an extinct Indo-European language. The basis for the representation will be a data model that was implemented as a relational database schema called eLexicon and in an RDF representation format. The eLexicon was the basis for building a concrete Late Latin instance, which in turn served as a basis for PoS tagging of the Patrologia Latina, the biggest corpus of medieval Latin. This presentation describes the data structure of the model, its application, and the process of building a Latin lexicon, which started with a merger of existing lexica such as the AGFL Latin instance. This approach had to be abandoned because the lexica had different tagsets, contained errors, and dealt differently with some phenomena. This exemplifies the usefulness of a standard for lexicon building as envisaged by the RELISH symposium. The second approach builds the lexicon on the basis of morphologically expanded word forms. This approach is called generative, in opposition to the analytic approach, where words are segmented and analyzed morpheme by morpheme. The final part of the presentation will deal with the case of Avestan and its lexicon creation. How do we convert written data from manuscripts and from an Avestan lexicon into computer-readable form, and which decisions must a digitizing editor make? Finally, the practical application of the Avestan lexicon to various tasks such as collocation analyses, PoS tagging and text-technological analyses will be explained. The successful use of a wiki platform for academic exchange in establishing an annotation standard for Avestan will also be discussed.
References
Gleim et al., in CL 2011, for the eLexicon
Mehler et al. (2011), in Leonardo, for the PL and the Latin linguistic networks
11. Roland Mittmann (Frankfurt University, Germany):
Harmonizing Dictionary Digitization - A Practical Report from the Project "Old German Reference Corpus" with an Outlook to Standardization.
Presentation.
The project "Referenzkorpus Altdeutsch" ("Old German Reference Corpus") aims to create an annotated corpus of all Old High German and Old Saxon texts. To this end, it makes use of several printed dictionaries that have since been digitized. The talk presents our approach to performing this digitization. Within the project, the digitized dictionaries serve only as a source for the automatic annotation of texts and are thus not themselves intended for publication. The talk concludes, however, with an outlook on what could and would have to be done to adapt these dictionaries to a standard encoding in order to facilitate their use beyond the project.
12. Jolanta Gelumbeckaitė (Frankfurt University, Germany):
Referenzkorpus Altlitauisch (SLIEKKAS/KALT).
Presentation.
The deeply annotated reference corpus of Old Lithuanian (1500–1800) is intended to cover all texts in the Lithuanian language from the beginning of the continuous written tradition up to about 1800. The digitized texts are annotated throughout with structural, positional, and morphosyntactic annotations (multi-layer stand-off), which enables complex multi-level querying of the corpus.
The Old Lithuanian corpus is intended to provide a previously unavailable resource, above all for research in historical linguistics, but also for literary and cultural-historical research on (Old) Lithuanian. In particular, the Referenzkorpus Altlitauisch (Old Lithuanian Reference Corpus) is meant to make possible the two greatest desiderata, namely the compilation of a historical dictionary of Lithuanian and the preparation of a grammar of Old Lithuanian.
13. Maria Sukhareva & Zahurul Islam (Frankfurt University, Germany):
A Three-step Model of Language Detection in Multilingual Ancient Texts.
Presentation.
Ancient corpora contain various multilingual patterns. This poses numerous problems for their manual annotation and automatic processing. We introduce a lexicon building system, called Lexicon Expander, that has an integrated language detection module, the Language Detection (LD) Toolkit. The Lexicon Expander post-processes the output of the LD Toolkit, which improves f-score and accuracy values. Furthermore, the functionality of the Lexicon Expander includes manual editing of lexical entries and automatic morphological expansion by means of a morphological grammar.
14. Timothy Price (Frankfurt University, Germany):
Moving beyond 'Presentation vs. Content': leveraging the wealth of semantic categorization already at hand.
Presentation.
From dealing with multiple variables to staying on task during long-term recursive processes, computers do several things with which the human mind struggles. Yet computers fail when it comes to understanding nuance. Getting a computer to handle nuance, e.g. in language, requires a seemingly never-ending string of meta-tools (tools to build tools to build yet more tools), requiring ever increasing input of human effort and expense. If there's a shortcut around the problem of human-inputted semantic mark-up, it might just exist in centuries' worth of formalized lexicography.
15. Gary Simons (SIL, USA):
Endangered languages and endangered language families: A global assessment.
Presentation.
This presentation will report on the results of two recent studies that have been based on the Ethnologue's global database of languages and their situations (see http://www.ethnologue.com). The first is a paper co-authored with Doug Whalen, “Endangered language families”, that is to appear in Language, March 2012. In it we find that of the 372 linguistic stocks that had living members in 1950 (including 250 reconstructable groups and 122 isolates), 6% have already fallen silent. Combining data from Ethnologue and the UNESCO Atlas of the World's Languages in Danger (see http://www.unesco.org/culture/languagesatlas/) we conclude that a further 23% of linguistic stocks are moribund since none of their member languages are being passed on to children. The presentation will show additional findings.
The second is a paper co-authored with Paul Lewis, “The world’s languages in crisis: A 20-year update” which will be presented later this month at the 26th Linguistics Symposium of the University of Wisconsin-Milwaukee. This year's symposium is on the theme of language death, endangerment, documentation, and revitalization. The study is based on language vitality estimates being gathered for the upcoming edition of the Ethnologue. In it we conclude that of the 7331 languages known to have been living in 1950, 69% are still vital in that they are normally passed on to children. Of the remainder, 14% are already dead or dying and 17% are clearly in trouble. One finding is that the situation differs markedly in different parts of the world. The presentation will show contrasting profiles by world regions.
05.10.2011