
Alignment

A test alignment of the text Pavasario linksmybės with the early German translation by G. H. F. Nesselmann (1869) was created based on the guidelines (in German). This enables various methods of evaluation and visualization, some of which have been implemented as examples. Specifically, an alignment dictionary was created, which is sorted by Lithuanian lemmata and their word forms and lists their corresponding translations. The format is very simple and links to other sections of the site, e.g. the lexicon. Creating a German equivalent would have required a manually proofed lemmatization, word class assignment and morphological analysis. Such data could then serve as a basis for analyses of the translation, such as a contrastive word class analysis. This would enable us to address questions such as: Which word classes are typically translated by another word class? How often and in what contexts did the translator resort to a word class transition?
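The structure of such an alignment dictionary can be illustrated with a minimal Python sketch. It assumes that each alignment is available as a triple of Lithuanian lemma, Lithuanian word form and German translation; the function name and the toy records are invented for illustration and are not taken from the project data.

```python
from collections import defaultdict


def build_alignment_dictionary(alignments):
    """Group German translations by Lithuanian lemma, then by word form."""
    entries = defaultdict(lambda: defaultdict(set))
    for lemma, word_form, translation in alignments:
        entries[lemma][word_form].add(translation)
    # Plain code-point sorting is used here; a real dictionary would
    # probably need Lithuanian collation instead.
    return {
        lemma: {form: sorted(trans) for form, trans in sorted(forms.items())}
        for lemma, forms in sorted(entries.items())
    }


# Invented toy records (lemma, word form, German translation).
toy_alignments = [
    ("saulė", "saulelė", "die Sonne"),
    ("saulė", "saulės", "der Sonne"),
    ("žiema", "žiemos", "des Winters"),
    ("saulė", "saulelė", "Sonne"),
]

alignment_dictionary = build_alignment_dictionary(toy_alignments)
for lemma, forms in alignment_dictionary.items():
    for form, translations in forms.items():
        print(lemma, form, translations)
```

Using sets for the translations means that repeated alignments of the same word form collapse into one entry; if frequencies matter, a counter per word form would be the more natural choice.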

While we did automatically pre-process the German text using modern tools, we believe that their results should not be presented as significant without detailed manual verification, on account of the genre, age and type of the text used, and since they have only been applied to a small part of the corpus in any case. We used MarMot, the RNNTagger and the TreeTagger to estimate the number of out-of-vocabulary words. The text contained 415 tokens unknown to the TreeTagger (out of a total of around 6300). This figure is not staggeringly high, but does indicate that a certain amount of work would need to be done, especially if the tools were applied to the entire corpus. On the page of the alignment dictionary we have linked two types of visualization: firstly a matrix with colour-coded fields and secondly a parallel vertical view of the two language versions, with aligned tokens connected by lines crossing the centre. Both types of visualization are familiar and in use in machine translation research.
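A count of this kind can be reproduced with a short Python sketch. It assumes that the tagger output has three tab-separated columns (token, tag, lemma) and that unknown tokens carry the lemma `<unknown>`, as TreeTagger prints when token and lemma output are enabled. The function name and the sample lines are invented; this is not necessarily how the figure above was obtained.

```python
def count_unknown_tokens(tagger_output):
    """Return (unknown, total) for 'token<TAB>tag<TAB>lemma' lines."""
    unknown = 0
    total = 0
    for line in tagger_output.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        total += 1
        if fields[2] == "<unknown>":
            unknown += 1
    return unknown, total


# Invented sample output, not taken from the corpus.
sample = "Die\tART\tdie\nSonne\tNN\tSonne\nwecket\tVVFIN\t<unknown>\n"
unknown, total = count_unknown_tokens(sample)
print(unknown, total)

# Share of unknown tokens implied by the figures quoted in the text.
oov_share = 415 / 6300
print(f"{oov_share:.1%}")
```

On the figures quoted above, the unknown tokens make up roughly 6.6 % of the German text.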

Statistics: All following figures have been rounded slightly where necessary. In the alignment, 4300 Lithuanian tokens were translated by 6400 German ones, which may already hint at explicitation or at an effort to keep the meter aesthetically appealing through a wider choice of words. 1400 Lithuanian lemmata corresponded to 2000 German ones (according to RNNTagger, not verified). The possible error rate aside, the grammatical divergence between the two languages surely plays a certain role here. 2000 Lithuanian word forms corresponded to 2700 German ones. On these rounded figures, the ratio of word forms to lemmata is thus roughly 1.4 for Lithuanian and 1.35 for German, i.e. nearly identical. Lithuanian used 63 multi-word expressions, of which 56 were phrasemes (marked with ‘y’ in the alignment according to the guidelines). The German translation, on the other hand, used 1535 multi-word expressions, which may indicate a larger number of paraphrases (i.e. a certain level of explicitation). Such explicitation is almost certainly present, but the bulk of these multi-word expressions are the result of our annotation practice of aligning articles together with their nouns, with similar rules for prepositions etc. Thus most of these multi-word expressions are an artifact of the annotation, of the grammatical divergence between the two languages and/or of their arbitrary orthographical differences.
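The ratios implied by these figures can be recomputed with a few lines of Python. The sketch uses only the rounded counts quoted above, so the results are approximations; the variable and function names are chosen for illustration.

```python
# Rounded counts as quoted in the text.
counts = {
    "lt": {"tokens": 4300, "lemmata": 1400, "word_forms": 2000},
    "de": {"tokens": 6400, "lemmata": 2000, "word_forms": 2700},
}


def forms_per_lemma(lang):
    """Average number of distinct word forms per lemma."""
    return counts[lang]["word_forms"] / counts[lang]["lemmata"]


# German tokens per Lithuanian token across the whole alignment.
token_expansion = counts["de"]["tokens"] / counts["lt"]["tokens"]

print(f"word forms per lemma (lt): {forms_per_lemma('lt'):.2f}")
print(f"word forms per lemma (de): {forms_per_lemma('de'):.2f}")
print(f"German tokens per Lithuanian token: {token_expansion:.2f}")
```

On the rounded counts this gives about 1.43 word forms per lemma for Lithuanian, 1.35 for German, and about 1.49 German tokens per Lithuanian token.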

In 47 cases tokens from one Lithuanian line of verse were translated by tokens spread over multiple German lines. This was usually a result of divergent enjambment. 750 German tokens were unaligned according to the annotation (which is itself only preliminary, since it takes time to develop a working routine), meaning they were “invented” for the German translation. 261 of these were simply commas. Conversely 384 Lithuanian tokens remained untranslated, though of these too 110 were commas. Thus there seems to be some difference between the two authors’ punctuation conventions and style. More interesting is the observation that here, too, the number of German tokens is higher, which fits well with the above thesis that translation involves a certain level of compensation in order to achieve metrical aesthetics and a translation that is at least semantically accurate. A translation which is accurate word-for-word while also preserving the meter is presumably impossible.
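Setting the commas aside makes the comparison of unaligned tokens clearer. The following Python sketch uses only the counts quoted above; the names are illustrative.

```python
# Unaligned tokens and the commas among them, as quoted in the text.
unaligned = {
    "de": {"total": 750, "commas": 261},
    "lt": {"total": 384, "commas": 110},
}

# Unaligned tokens that are not commas.
non_comma = {
    lang: figures["total"] - figures["commas"]
    for lang, figures in unaligned.items()
}

# Share of commas among the unaligned tokens.
comma_share = {
    lang: figures["commas"] / figures["total"]
    for lang, figures in unaligned.items()
}

for lang in unaligned:
    print(lang, non_comma[lang], f"{comma_share[lang]:.1%}")
```

Without commas, 489 unaligned German tokens stand against 274 unaligned Lithuanian ones, so the German surplus remains after punctuation is discounted.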

Finally, the largest number of alignments is of the 1:1 type (one Lithuanian token corresponding to one German token). This prominently includes high-frequency function words, which account for 2700 of these alignments. In 1500 alignments, however, one Lithuanian token corresponds to multiple German ones. Given the low number of phrasemes, this serves to explain the high number of German multi-word expressions (used here in the technical sense of “a sequence of more than one token”, not “figure of speech”). In only 4 instances was a Lithuanian phraseme rendered in its entirety by just one German word. Indeed there are also few instances (3 in total) in which multiple Lithuanian tokens were rendered by multiple German ones. This is naturally due – at least in part – to the direction of translation and the alignment attuned to it, which seeks to differentiate the translation equivalents in the Lithuanian text as finely as possible but does not make a similar effort for the German text (where it would be less worthwhile in any case).
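The alignment types discussed here can be counted with a small Python sketch. It assumes that each alignment is stored as a pair of token sequences (Lithuanian, German); this representation, the function name and the toy alignments are hypothetical and do not reflect the project's actual file format.

```python
from collections import Counter


def alignment_type(lt_tokens, de_tokens):
    """Classify an alignment as 1:1, 1:m, n:1, n:m or unaligned."""
    if not lt_tokens or not de_tokens:
        return "unaligned"
    lt_side = "1" if len(lt_tokens) == 1 else "n"
    de_side = "1" if len(de_tokens) == 1 else "m"
    return f"{lt_side}:{de_side}"


# Invented toy alignments: (Lithuanian tokens, German tokens).
toy_alignments = [
    (("ir",), ("und",)),                       # 1:1
    (("saulelė",), ("die", "Sonne")),          # 1:m, article aligned with its noun
    (("lt_a", "lt_b"), ("de_a",)),             # n:1, placeholder tokens
    (("lt_c", "lt_d"), ("de_b", "de_c")),      # n:m, placeholder tokens
    ((), (",",)),                              # unaligned German comma
]

type_counts = Counter(alignment_type(lt, de) for lt, de in toy_alignments)
print(type_counts)
```

Because articles are aligned together with their nouns, alignments such as the second toy example fall into the 1:m class even when no paraphrase is involved.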

If you would like to help the CorDon project by aligning some texts yourself, you can find the (German) guidelines here and the unaligned texts here.

