EANC software is developed by Corpus Technologies and is designed as a scalable and, ultimately, language-independent software platform for corpus studies.
The system is built so that corpora of structurally different languages can be indexed and made available for search, provided that such corpora follow the specific markup standards developed by Corpus Technologies (see sample text). Although EANC Parser and user interface are inherently language-dependent, the rest of the system is designed to support virtually any morphological structure, markup detail, or alphabet.
The EANC interface runs under Windows, Mac OS, and Linux and supports major Internet browsers. Please note that EANC may not display Armenian characters correctly if your computer does not support Unicode. If you encounter a problem while working with EANC, please file an error report (follow the link in the lower right corner of the main window, under the main search form) or contact us directly.
An important objective of the EANC team is to make the EANC search functionality easily accessible to the user. You do not need to register with EANC or to download any external software to be able to use EANC.
EANC database software consists of four major parts: EANC Parser, Indexer, Server, and User interface with Client.
The collection of raw electronic texts is first processed by EANC Parser (a Perl program), which adds XML-compliant or tab-delimited metatext and token markup. Next, the resulting files are processed by Indexer to create the corpus database structure. Server implements the search and sorting algorithms that run over the corpus database. Finally, User interface and Client provide web access to the EANC database and its search functionality.
Indexer is a PHP+MySQL program that extracts address information for each token and each markup element from the XML output provided by EANC Parser. The output of Indexer is a set of hash tables that establish a pointer connection between each unique lexeme, wordform and grammatical attribute occurring in EANC and their respective positions (addresses) in the corpus data files. The corpus data files represent a non-relational database consisting of binary address arrays. Sorting keys for each token are also stored in the data files. This allows output contexts to be sorted by specific key criteria, for example alphabetically or by period/genre.
Server is a C++ program which implements the core search algorithms over the corpus data files via the ISAM method. The search algorithms are designed to minimize response time for the most common queries. Given the size of EANC (well over 100 million tokens), response time may exceed the standard 0.5-0.8 second threshold for some contextual queries, such as searches for complex collocation sequences of frequent grammatical attributes.
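The pointer tables produced by Indexer can be pictured with a small in-memory sketch. The Python code below is only an illustration and not the EANC implementation (Indexer itself is written in PHP and stores binary address arrays); the token records, the `build_index` and `lookup` functions, and the attribute labels are all hypothetical.

```python
from collections import defaultdict

# Hypothetical token records: (wordform, lexeme, grammatical attributes).
TOKENS = [
    ("books", "book", {"N", "pl"}),
    ("read", "read", {"V", "pst"}),
    ("book", "book", {"N", "sg"}),
    ("reads", "read", {"V", "prs"}),
]


def build_index(tokens):
    """Map every wordform, lexeme and attribute to the positions where it occurs."""
    index = {
        "wordform": defaultdict(list),
        "lexeme": defaultdict(list),
        "gram": defaultdict(list),
    }
    for position, (wordform, lexeme, grams) in enumerate(tokens):
        index["wordform"][wordform].append(position)
        index["lexeme"][lexeme].append(position)
        for gram in sorted(grams):
            index["gram"][gram].append(position)
    return index


def lookup(index, **criteria):
    """Return the sorted positions that satisfy every criterion.

    Example criteria: lexeme="book", gram="pl".
    """
    result = None
    for kind, value in criteria.items():
        positions = set(index[kind].get(value, []))
        result = positions if result is None else result & positions
    return sorted(result or [])


index = build_index(TOKENS)
book_positions = lookup(index, lexeme="book")
plural_book_positions = lookup(index, lexeme="book", gram="pl")
```

Each lookup touches only the address lists of the requested keys, so the cost depends on how frequent those keys are rather than on the size of the whole corpus. This is one plausible reason why queries over very frequent grammatical attributes are the slow case.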
Many queries may correspond to a large number of matches in EANC; however, only up to 10,000 matches are displayed to the user. These 10,000 are drawn from various parts of the Corpus proportionally to the way all matches are distributed throughout EANC, so as to form a representative sample (if a subcorpus has been defined, the same distribution sampling is performed over the subcorpus).
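The text does not say how the proportional draw is carried out. The Python sketch below shows one possible scheme under stated assumptions: matches are grouped by corpus part, each part receives a quota proportional to its share of all matches (largest-remainder rounding), and the quota is filled with evenly spaced matches. The function names and the part labels are hypothetical.

```python
def proportional_quotas(counts, limit):
    """Split `limit` across parts in proportion to their match counts."""
    total = sum(counts.values())
    if total <= limit:
        return dict(counts)  # everything fits, nothing to sample
    # Integer arithmetic: floor share first, then hand out what is left
    # to the parts with the largest remainders.
    quotas = {part: n * limit // total for part, n in counts.items()}
    remainders = {part: n * limit % total for part, n in counts.items()}
    leftover = limit - sum(quotas.values())
    ranked = sorted(counts, key=lambda part: remainders[part], reverse=True)
    for part in ranked[:leftover]:
        quotas[part] += 1
    return quotas


def sample_matches(matches_by_part, limit=10_000):
    """Return at most `limit` matches, distributed like the full result set."""
    counts = {part: len(matches) for part, matches in matches_by_part.items()}
    quotas = proportional_quotas(counts, limit)
    sample = []
    for part, matches in matches_by_part.items():
        k = quotas[part]
        # Take k evenly spaced matches so the sample spans the whole part.
        sample.extend(matches[i * len(matches) // k] for i in range(k))
    return sample


# Hypothetical result set: match addresses grouped by corpus part.
demo = {
    "fiction": list(range(60)),
    "press": list(range(100, 130)),
    "oral": list(range(200, 210)),
}
demo_sample = sample_matches(demo, limit=10)
```

With 60, 30 and 10 matches and a limit of 10, the three parts contribute 6, 3 and 1 matches respectively, mirroring their shares of the full result set.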
The main Search Form is the central element of the EANC user interface. It is used to build queries for:
Additionally, the main Search Form provides ample tools to build contextual (collocation) queries where an arbitrary number of tokens and/or attributes may appear in a sequence having specific distances between the elements.
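A contextual query of this kind can be thought of as a chain of token conditions, each with an allowed distance from the previously matched token. The Python sketch below illustrates that idea on a toy sentence; it is not the EANC query format or the Server algorithm, and the token fields, condition syntax and function names are assumptions made for the example.

```python
def matches(token, condition):
    """Check a token against a condition such as {"lexeme": "book"} or {"gram": "N"}."""
    for key, value in condition.items():
        if key == "gram":
            if value not in token["gram"]:
                return False
        elif token.get(key) != value:
            return False
    return True


def find_sequences(tokens, query):
    """Find all position tuples that satisfy the query.

    `query` is a list of (condition, min_dist, max_dist). Distances are counted
    from the previously matched token (min_dist >= 1) and are ignored for the
    first element.
    """
    results = []

    def extend(path, step):
        if step == len(query):
            results.append(tuple(path))
            return
        condition, lo, hi = query[step]
        start = path[-1]
        last = min(start + hi, len(tokens) - 1)
        for pos in range(start + lo, last + 1):
            if matches(tokens[pos], condition):
                extend(path + [pos], step + 1)

    first_condition = query[0][0]
    for pos, token in enumerate(tokens):
        if matches(token, first_condition):
            extend([pos], 1)
    return results


# Hypothetical annotated sentence: "the old book was read".
SENTENCE = [
    {"wordform": "the", "lexeme": "the", "gram": {"DET"}},
    {"wordform": "old", "lexeme": "old", "gram": {"ADJ"}},
    {"wordform": "book", "lexeme": "book", "gram": {"N"}},
    {"wordform": "was", "lexeme": "be", "gram": {"V"}},
    {"wordform": "read", "lexeme": "read", "gram": {"V"}},
]

# An adjective immediately followed by a noun.
adj_noun = find_sequences(SENTENCE, [({"gram": "ADJ"}, 0, 0), ({"gram": "N"}, 1, 1)])
# The lexeme "book" followed by a verb one or two tokens later.
book_verb = find_sequences(SENTENCE, [({"lexeme": "book"}, 0, 0), ({"gram": "V"}, 1, 2)])
```

This sketch scans the token list directly for clarity. A production system would more likely start from the indexed positions of the rarest element and check the neighbouring positions from there.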
When the user defines a search query, the user interface transmits that query to Client. Client is a PHP program that pre-processes user input in the User interface, builds and sends a query to Server, and then receives and post-processes the search output. Client is also responsible for more advanced interface operations, such as displaying token markup or transliterating the output.