EANC Parser

The electronic collection of EANC texts was processed by the EANC Parser, a morphological parsing (lemmatization) program developed by Corpus Technologies. The EANC Parser assigns token markup tags to each wordform, provided that the respective lexeme is present in the EANC grammatical wordlist.

An example of the EANC Parser output is provided for your reference.

To make lemmatization possible, we first had to work out a formal and exhaustive classification of Standard Eastern Armenian (SEA) inflection types for both nominal and verbal categories. Each inflectable lexeme in the EANC wordlist was then assigned a specific tag corresponding to the relevant inflection type (e.g. N11). The result is the EANC grammatical wordlist.

Currently, 72,6% of all tokens in EANC are analyzed unambiguously, 17% have ambiguous analyses, and 7,5% are not recognized. The parsing success rate varies by genre. The highest percentage of unrecognized tokens occurs, unsurprisingly, in oral discourse.

EANC Parser: Ambiguity distribution

as of January 2009

# of analyses  Comment                         Fiction  Science  Press  Other written  Oral discourse  EANC total
1              unambiguous                     75,4%    67,0%    72,6%  69,4%          65,2%           72,6%
2              ambiguous (homonymous)          15,1%    9,6%     12,7%  12,3%          12,8%           13,3%
3              ambiguous (homonymous)          1,8%     2,1%     2,0%   1,8%           1,6%            1,9%
4 - 7          ambiguous (homonymous)          1,8%     2,1%     2,0%   1,8%           1,6%            1,9%
Subtotal       ambiguous                       18,7%    13,7%    16,7%  15,8%          16,0%           17,0%
1?             hypothetic (not in dictionary)  0,0%     1,3%     0,6%   0,7%           0,2%            0,5%
0              not recognized                  5,4%     11,9%    7,7%   7,0%           12,3%           7,5%
Special tokens Cyrillic, Latin, digits         0,3%     6,3%     2,8%   5,6%           6,0%            2,4%
Total                                          100%     100%     100%   99%            100%            100%
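The "Subtotal ambiguous" row can be read as the sum of the three ambiguous rows (2, 3, and 4 - 7 analyses). The short Python sketch below re-adds the published figures to show that the small mismatches are consistent with rounding to one decimal place. The variable and function names are illustrative, and the 0,15 tolerance is an assumption chosen for this check, not something stated in the source.

```python
# Percentages copied from the table (decimal commas written as points).
COLUMNS = ["Fiction", "Science", "Press", "Other written",
           "Oral discourse", "EANC total"]
TWO = [15.1, 9.6, 12.7, 12.3, 12.8, 13.3]
THREE = [1.8, 2.1, 2.0, 1.8, 1.6, 1.9]
FOUR_TO_SEVEN = [1.8, 2.1, 2.0, 1.8, 1.6, 1.9]
SUBTOTAL = [18.7, 13.7, 16.7, 15.8, 16.0, 17.0]


def subtotal_gaps():
    """Return, per column, (sum of ambiguous rows) minus the published subtotal."""
    gaps = {}
    for name, a, b, c, published in zip(COLUMNS, TWO, THREE,
                                        FOUR_TO_SEVEN, SUBTOTAL):
        gaps[name] = round(a + b + c - published, 1)
    return gaps


def consistent(tolerance=0.15):
    """True if every column agrees with its subtotal up to rounding noise."""
    return all(abs(gap) <= tolerance for gap in subtotal_gaps().values())


if __name__ == "__main__":
    for name, gap in subtotal_gaps().items():
        print(f"{name}: {gap:+.1f}")
    print("consistent:", consistent())
```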

Ambiguous analysis (homonymy)

Homonymy, both regular and coincidental, is quite common in SEA. For example, the infinitive and the perfective converb are regularly homonymous in the -ե (-e) conjugation type (e.g. գրել grel ‘to write’). An example of coincidental homonymy is հարգի hargi: it can be interpreted either as an adjective ‘respectable’ or as a 3rd person present subjunctive of the verb հարգել hargel ‘to respect’.

The EANC Parser does not use contextual or syntactic information and deals exclusively with wordforms. As a result, grammatical queries may return false matches due to homonymy. As an example, a query for subjunctive will return sentences where հարգի hargi is an adjective and not a subjunctive.
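The effect of context-independent parsing can be illustrated with a toy lookup in Python. This is only a sketch of the general idea, not the EANC Parser's implementation: the dictionary layout, the function names and the tag labels ("A", "V", "sbjv", and so on) are all assumptions made for the example.

```python
# Toy wordlist: each wordform maps to ALL of its analyses as (lemma, tags).
# The tag labels are illustrative and are not the EANC tag set.
WORDLIST = {
    "հարգի": [
        ("հարգի", {"A"}),                       # adjective 'respectable'
        ("հարգել", {"V", "sbjv", "prs", "3"}),  # subjunctive of 'to respect'
    ],
    "գրել": [
        ("գրել", {"V", "inf"}),         # infinitive 'to write'
        ("գրել", {"V", "cvb", "pfv"}),  # perfective converb
    ],
}


def analyze(wordform):
    """Return every analysis of a wordform, ignoring context ([] if unknown)."""
    return WORDLIST.get(wordform, [])


def matches(wordform, required_tags):
    """A token matches a query if ANY of its analyses carries all the tags."""
    return any(required_tags <= tags for _, tags in analyze(wordform))


if __name__ == "__main__":
    # հարգի is returned for a subjunctive query even where it is an adjective,
    # because the lookup never sees the surrounding sentence.
    print(matches("հարգի", {"sbjv"}))
    print(matches("հարգի", {"A"}))
```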

The noise level can be cut down by adding specific constraints to a query, e.g. by introducing another wordform that is supposed to co-occur with the relevant reading. Another option is to exclude homonymous hits explicitly (click Advanced under the token query line). In some cases, however, this may significantly decrease the overall number of hits.
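The trade-off in excluding homonymous hits can be sketched as follows. The Python example assumes that "excluding homonymous hits" means keeping only tokens with exactly one analysis. That reading, the placeholder wordforms and the tag labels are assumptions for illustration, not a description of the EANC search engine.

```python
# Each hit is (wordform, list of tag sets, one set per analysis).
# "wordform_a" and "wordform_b" are placeholders, not real corpus forms.
HITS = [
    ("հարգի", [{"A"}, {"V", "sbjv"}]),       # homonymous: adjective or subjunctive
    ("wordform_a", [{"V", "sbjv"}]),         # unambiguous subjunctive
    ("wordform_b", [{"N"}, {"V", "sbjv"}]),  # homonymous: noun or subjunctive
]


def filter_hits(hits, required_tags, exclude_homonymous=False):
    """Keep hits with a matching analysis; optionally drop ambiguous tokens."""
    kept = []
    for wordform, analyses in hits:
        if not any(required_tags <= tags for tags in analyses):
            continue
        if exclude_homonymous and len(analyses) > 1:
            continue
        kept.append(wordform)
    return kept


if __name__ == "__main__":
    print(filter_hits(HITS, {"sbjv"}))
    print(filter_hits(HITS, {"sbjv"}, exclude_homonymous=True))
```

In this toy data the strict filter removes the false match but also drops wordform_b, which might have been a genuine subjunctive in context; this is the loss of hits mentioned above.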

Sometimes, one of the possible grammatical interpretations of a wordform is much more common than the other(s). For the most frequent cases of extremely improbable homonymy, the second readings have been discarded (e.g. the locative asum from the noun as).

Zero analysis

For some tokens, the EANC Parser is unable to provide any morphological analysis. These tokens are lexical items and/or wordforms that are not currently included in the EANC grammatical wordlist, specifically:

recent loanwords
neologisms
elements of code-switching to Russian or English
some abbreviations
some proper names
some technical terms
some Western Armenian spellings
most obsolete spellings
distorted spellings
cases of inflectional variance not included in the wordlist (mainly applicable to oral discourse)
scanning errors
typos and misspellings in the original texts

There is an ongoing effort to cut down the error rate by filling the gaps in the EANC wordlist and copy-editing the texts.

Hypothetical analysis

For some abbreviations followed by a dash and case/number markers, the EANC Parser suggests a "hypothesis". Lemmas displayed in the pop-up window for such tokens are followed by a question mark.
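A hypothesis of this kind could be produced along the following lines. The Python sketch rests on several assumptions that are not taken from the source: that abbreviations are runs of Armenian capital letters, that the dash is an ASCII hyphen, and that the marker inventory is the small illustrative subset listed in CASE_MARKERS. The function name and the case labels are likewise hypothetical.

```python
import re

# Illustrative subset of case markers; the parser's real inventory is not
# given in the text, so these entries are assumptions.
CASE_MARKERS = {
    "ի": "gen/dat",
    "ից": "abl",
    "ով": "ins",
    "ում": "loc",
}

# Abbreviation: Armenian capitals (U+0531..U+0556), a hyphen, then a marker.
_MARKERS = "|".join(sorted(CASE_MARKERS, key=len, reverse=True))
PATTERN = re.compile(r"^([\u0531-\u0556]+)-(" + _MARKERS + r")$")


def hypothesize(token):
    """Return (lemma followed by '?', case label) or None if no pattern fits."""
    match = PATTERN.match(token)
    if match is None:
        return None
    abbreviation, marker = match.groups()
    return abbreviation + "?", CASE_MARKERS[marker]


if __name__ == "__main__":
    print(hypothesize("ՀՀ-ի"))
    print(hypothesize("ՀՀ-ում"))
    print(hypothesize("գրել"))
```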

Periphrastic constructions

Another important consequence of context-independent parsing is that elements of periphrastic constructions typical of the SEA verbal system are analyzed as independent morphosyntactic units. In EANC, it is impossible to build a single token query for such periphrastic constructions. Adding an auxiliary to the query (as an element of the close context of the converb) may provide an indirect way to find target constructions.

For example, a query consisting of an imperfective converb immediately followed by the auxiliary verb է ē ‘to be’ in the present tense will find occurrences of the periphrastic imperfective present.
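The logic of such a two-token query can be sketched in Python as a search for adjacent tokens. This is an illustration under assumed names: the sentence representation, the function find_pairs and the tag labels are hypothetical and do not reflect the EANC query syntax.

```python
# A sentence as a list of (wordform, analyses); each analysis is a tag set.
# The tag labels are illustrative and are not the EANC tag set.
SENTENCE = [
    ("նա", [{"PRON"}]),                 # 'he/she'
    ("գրում", [{"V", "cvb", "ipfv"}]),  # imperfective converb of 'to write'
    ("է", [{"V", "aux", "prs"}]),       # auxiliary 'to be', present tense
]


def has_tags(token, required_tags):
    """True if any analysis of the token carries all the required tags."""
    _, analyses = token
    return any(required_tags <= tags for tags in analyses)


def find_pairs(tokens, first_tags, second_tags):
    """Return start indices where a first_tags token is directly followed
    by a second_tags token."""
    return [
        i
        for i in range(len(tokens) - 1)
        if has_tags(tokens[i], first_tags)
        and has_tags(tokens[i + 1], second_tags)
    ]


if __name__ == "__main__":
    # Finds the converb + auxiliary sequence starting at index 1.
    print(find_pairs(SENTENCE, {"cvb", "ipfv"}, {"aux", "prs"}))
```

Because each token is still matched on its own analyses, this approach inherits the homonymy noise described earlier; the adjacency requirement only narrows it.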