Composition
EANC is designed as a comprehensive corpus with the aim of including as many Standard Eastern Armenian texts as practicable. As of March 2009, EANC comprises about 110 million tokens. Overall, we have been guided by the goal of comprehensive representation: all literary, scientific and oral texts available to us have been indexed for search. The only exceptions are certain widely available texts, such as electronic press and legal documents, whose presence has been limited for the sake of balance among genres.
Because of its comprehensive nature, EANC differs from the corpora of the "major" languages, such as the Russian National Corpus or the British National Corpus, which select their texts. BNC additionally imposes a limit on the number of words per document, truncating longer texts. EANC, on the other hand, includes the great majority of all extant Eastern Armenian literary texts. In this respect, EANC is similar to the Czech National Corpus or the Slovak National Corpus.
The written discourse subcorpus of EANC includes 836 fiction texts, both prose and poetry (including 206 translated fiction titles), 7,858 newspaper issues and a sizeable collection of scientific and other non-fiction texts.
The Standard Eastern Armenian (SEA) oral discourse subcorpus (about 3 million tokens) is an important structural element of EANC. It consists of spontaneous dialogs, task-oriented interviews, TV talk shows, films, and other audio recordings, all transcribed for EANC. Recently added samples of online communication occupy an intermediate position between the oral and written registers; they have been placed in the oral subcorpus.
Each of the 9,960 document entries in EANC is labeled with metatext information specifying genre and other bibliographic details (e.g., date of creation or publication, name of the author).
EANC Composition (as of March 2009)

| Written discourse | # tokens | % EANC | # of docs |
|---|---|---|---|
| Fiction | | | |
| prose: novels | 29 909 172 | 27,1% | 371 (incl. 99 translated) |
| prose: short stories | 5 959 142 | 5,4% | 183 (incl. 56 translated) |
| prose: plays | 1 411 030 | 1,3% | 55 (incl. 8 translated) |
| prose subtotal | 37 279 344 | 33,8% | 609 |
| poetry | 3 648 160 | 3,3% | 227 (incl. 43 translated) |
| Press | 47 264 735 | 42,9% | 7 858 |
| Non-fiction | | | |
| science | 13 875 930 | 12,6% | 113 (incl. 22 translated) |
| essays, memoirs, official, religious | 4 735 997 | 4,3% | 379 (incl. 8 translated) |
| Written discourse total | 106 804 166 | 96,8% | 9 186 |
| Oral discourse | # tokens | % EANC | # of docs |
| Oral spontaneous discourse | 1 029 646 | 0,94% | 208 |
| Oral public discourse | 1 933 899 | 1,76% | 543 |
| Oral task-oriented discourse | 70 010 | 0,06% | 22 |
| + Online communication | 442 399 | 0,40% | 1 |
| Oral subcorpus total | 3 475 954 | 3,2% | 774 |
| EANC Total | 110 280 120 | 100% | 9 960 |
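The subtotals and shares in the composition table above can be re-derived from its individual rows. The following is a minimal sketch of that arithmetic in Python; the dictionary layout and the names `WRITTEN`, `ORAL`, `totals` and `share` are illustrative assumptions, not part of EANC, and the figures are simply copied from the table.

```python
# Token and document counts copied from the composition table (March 2009).
# Each value is a (tokens, documents) pair; the layout is illustrative only.
WRITTEN = {
    "prose: novels": (29_909_172, 371),
    "prose: short stories": (5_959_142, 183),
    "prose: plays": (1_411_030, 55),
    "poetry": (3_648_160, 227),
    "press": (47_264_735, 7_858),
    "science": (13_875_930, 113),
    "essays, memoirs, official, religious": (4_735_997, 379),
}
ORAL = {
    "spontaneous": (1_029_646, 208),
    "public": (1_933_899, 543),
    "task-oriented": (70_010, 22),
    "online communication": (442_399, 1),
}


def totals(rows):
    """Sum the (tokens, documents) pairs of a group of table rows."""
    tokens = sum(t for t, _ in rows.values())
    docs = sum(d for _, d in rows.values())
    return tokens, docs


def share(tokens, whole):
    """Percentage of `whole`, rounded to one decimal place."""
    return round(100 * tokens / whole, 1)


written_tokens, written_docs = totals(WRITTEN)
oral_tokens, oral_docs = totals(ORAL)
corpus_tokens = written_tokens + oral_tokens
corpus_docs = written_docs + oral_docs

print("written:", written_tokens, written_docs)
print("oral:   ", oral_tokens, oral_docs)
print("total:  ", corpus_tokens, corpus_docs)
print("shares: ", share(written_tokens, corpus_tokens),
      share(oral_tokens, corpus_tokens))
```

Run as written, the sums agree with the "Written discourse total", "Oral subcorpus total" and "EANC Total" rows, which suggests the table is internally consistent.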
Most of the texts in EANC have been acquired by scanning and optical character recognition of various printed sources. Some fiction titles and the modern press, however, have been downloaded from open internet archives (for more information and credits see Armenian texts online). The entire oral corpus consists of texts transcribed by EANC from 2006 to 2008 and by Victoria Khurshudian from 2003 to 2005. The following table shows EANC composition by type of source.
EANC composition - tokens by source type

| Written discourse | OCR: tokens | OCR: % EANC | downloaded: tokens | downloaded: % EANC | other: tokens | other: % EANC |
|---|---|---|---|---|---|---|
| Fiction | 38 672 087 | 36,2% | 1 580 876 | 1,5% | 674 541 | 0,6% |
| Press | 12 709 536 | 11,9% | 34 555 199 | 32,4% | | |
| Non-fiction | 15 571 293 | 14,6% | 2 222 181 | 2,1% | 818 453 | 0,8% |
| Written discourse total | 66 952 916 | 62,7% | 38 358 256 | 35,9% | 1 492 994 | 1,4% |

Online communication: 442 399 tokens, 100% downloaded
Oral discourse: 3 033 555 tokens, 100% transcribed
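The column totals of the source-type table above can be checked the same way. This is a sketch only: the names `WRITTEN_BY_SOURCE`, `EANC_TOTAL` and `column_totals` are illustrative, the blank "other" cell for Press is assumed to mean zero tokens, and the figures are copied from the tables. The calculation suggests that the percentages printed in this table are shares of the written discourse total rather than of the whole corpus.

```python
# Tokens per genre and acquisition method, copied from the source-type table.
# Assumption: the empty "other" cell for press stands for 0 tokens.
WRITTEN_BY_SOURCE = {
    "fiction": {"ocr": 38_672_087, "downloaded": 1_580_876, "other": 674_541},
    "press": {"ocr": 12_709_536, "downloaded": 34_555_199, "other": 0},
    "non-fiction": {"ocr": 15_571_293, "downloaded": 2_222_181, "other": 818_453},
}
EANC_TOTAL = 110_280_120  # "EANC Total" row of the composition table


def column_totals(table):
    """Add up the tokens of every genre for each acquisition method."""
    sums = {}
    for row in table.values():
        for source, tokens in row.items():
            sums[source] = sums.get(source, 0) + tokens
    return sums


by_source = column_totals(WRITTEN_BY_SOURCE)
written_total = sum(by_source.values())

# Shares relative to the written subcorpus and to the whole corpus.
share_of_written = {s: round(100 * t / written_total, 1)
                    for s, t in by_source.items()}
share_of_corpus = {s: round(100 * t / EANC_TOTAL, 1)
                   for s, t in by_source.items()}

print(by_source)
print(written_total)
print(share_of_written)
print(share_of_corpus)
```

The `share_of_written` values reproduce the "Written discourse total" percentages in the table, whereas `share_of_corpus` gives somewhat lower figures for the OCR and downloaded columns.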