Composition

EANC is designed as a comprehensive corpus that aims to include as many Standard Eastern Armenian (SEA) texts as practicable. As of March 2009, EANC comprises about 110 million tokens. Our guiding goal has been comprehensive representation: all literary, scientific and oral texts available to us have been indexed for search. The only exceptions are certain widely available text types, such as electronic press and legal documents, whose share has been limited for the sake of balance among genres.

Because of its comprehensive nature, EANC differs fundamentally from the corpora of "major" languages, such as the Russian National Corpus or the British National Corpus (BNC), which select their texts. The BNC additionally imposes a limit on the number of words per document, truncating longer texts. EANC, by contrast, includes the great majority of extant Eastern Armenian literary texts. In this respect, EANC is similar to the Czech National Corpus or the Slovak National Corpus.

The written discourse subcorpus of EANC includes 836 fiction texts, both prose and poetry (including 206 translated fiction titles), 7,858 newspaper issues and a sizeable collection of scientific and other non-fiction texts.

The SEA oral discourse subcorpus (about 3 million tokens) is an important structural element of EANC. It comprises spontaneous dialogs, task-oriented interviews, TV talk shows, films, and other audio recordings, all transcribed for EANC. Recently added samples of online communication are intermediate between the oral and written registers; they have been placed in the oral subcorpus.

Each of the 9,960 document entries in EANC is labeled with metatext information specifying genre and other bibliographic details (e.g. date of creation or publication, name of the author).

EANC composition as of March 2009

| Subcorpus | # tokens | % EANC | # of docs | Notes |
|---|---|---|---|---|
| Written discourse | | | | |
| Fiction, prose: novels | 29,909,172 | 27.1% | 371 | incl. 99 translated |
| Fiction, prose: short stories | 5,959,142 | 5.4% | 183 | incl. 56 translated |
| Fiction, prose: plays | 1,411,030 | 1.3% | 55 | incl. 8 translated |
| Fiction, prose subtotal | 37,279,344 | 33.8% | 609 | |
| Fiction, poetry | 3,648,160 | 3.3% | 227 | incl. 43 translated |
| Press | 47,264,735 | 42.9% | 7,858 | |
| Non-fiction: science | 13,875,930 | 12.6% | 113 | incl. 22 translated |
| Non-fiction: essays, memoirs, official, religious | 4,735,997 | 4.3% | 379 | incl. 8 translated |
| Written discourse total | 106,804,166 | 96.8% | 9,186 | |
| Oral discourse | | | | |
| Oral spontaneous discourse | 1,029,646 | 0.94% | 208 | |
| Oral public discourse | 1,933,899 | 1.76% | 543 | |
| Oral task-oriented discourse | 70,010 | 0.06% | 22 | |
| + Online communication | 442,399 | 0.40% | 1 | |
| Oral subcorpus total | 3,475,954 | 3.2% | 774 | |
| EANC total | 110,280,120 | 100% | 9,960 | |
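The "% EANC" column follows directly from the token counts: each share is the row's token count divided by the corpus total. The following is a minimal sketch of that arithmetic, not part of EANC itself; the dictionary keys and the helper name `shares` are illustrative assumptions, while the numbers are copied from the table above.

```python
# Token counts copied from the composition table (March 2009).
# The keys are illustrative labels, not official EANC identifiers.
TOKENS = {
    "prose: novels": 29_909_172,
    "prose: short stories": 5_959_142,
    "prose: plays": 1_411_030,
    "poetry": 3_648_160,
    "press": 47_264_735,
    "science": 13_875_930,
    "essays, memoirs, official, religious": 4_735_997,
    "oral spontaneous": 1_029_646,
    "oral public": 1_933_899,
    "oral task-oriented": 70_010,
    "online communication": 442_399,
}


def shares(counts):
    """Return each category's share of the total, in percent."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


corpus_total = sum(TOKENS.values())
corpus_shares = shares(TOKENS)

# Print every category with its token count and share of the corpus.
for name, pct in corpus_shares.items():
    print(f"{name:40s} {TOKENS[name]:>12,d} {pct:6.2f}%")
print(f"{'total':40s} {corpus_total:>12,d}")
```

Summing the leaf categories this way reproduces the corpus total of 110,280,120 tokens, so the subtotal rows in the table are consistent with their components.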



Most of the texts in EANC were acquired by scanning printed sources and applying optical character recognition (OCR). Some fiction titles, however, as well as the modern press, were downloaded from open internet archives (for more information and credits, see Armenian texts online). The entire oral corpus consists of texts transcribed by EANC from 2006 to 2008 and by Victoria Khurshudian from 2003 to 2005. The following chart shows EANC composition by type of source.

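EANC composition: tokens by source type

Percentages in the written-discourse rows are shares of the written-discourse total (106,804,166 tokens).

| Written discourse | OCR tokens | OCR % | Downloaded tokens | Downloaded % | Other tokens | Other % |
|---|---|---|---|---|---|---|
| Fiction | 38,672,087 | 36.2% | 1,580,876 | 1.5% | 674,541 | 0.6% |
| Press | 12,709,536 | 11.9% | 34,555,199 | 32.4% | | |
| Non-fiction | 15,571,293 | 14.6% | 2,222,181 | 2.1% | 818,453 | 0.8% |
| Written discourse total | 66,952,916 | 62.7% | 38,358,256 | 35.9% | 1,492,994 | 1.4% |

| Other subcorpora | Tokens | % | Source |
|---|---|---|---|
| Online communication | 442,399 | 100% | downloaded |
| Oral discourse | 3,033,555 | 100% | transcribed |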
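
The written-discourse percentages in the chart above sum to roughly 100%, which suggests they are shares of the written-discourse total rather than of the whole corpus. The sketch below checks that reading. The variable names are illustrative, and the empty "other" cell for Press is assumed to mean zero tokens.

```python
# Token counts by source type, copied from the chart above.
# Assumption: the blank "other" cell in the Press row means 0 tokens.
SOURCES = {
    "Fiction": {"OCR": 38_672_087, "downloaded": 1_580_876, "other": 674_541},
    "Press": {"OCR": 12_709_536, "downloaded": 34_555_199, "other": 0},
    "Non-fiction": {"OCR": 15_571_293, "downloaded": 2_222_181, "other": 818_453},
}

# Total number of written-discourse tokens across all genres and sources.
written_total = sum(sum(row.values()) for row in SOURCES.values())

# Tokens per source type, summed over the three written genres.
by_source = {
    src: sum(row[src] for row in SOURCES.values())
    for src in ("OCR", "downloaded", "other")
}

# Share of each source type relative to the written-discourse total, in percent.
source_share = {src: 100 * n / written_total for src, n in by_source.items()}

for src, n in by_source.items():
    print(f"{src:12s} {n:>12,d} {source_share[src]:5.1f}%")
```

Under this reading the computed column totals and shares agree with the "Written discourse total" row, and the written total matches the 106,804,166 tokens reported in the composition table.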