The Corpus

The corpus used in this project is the METU Turkish Corpus (MTC), which contains approximately 2 million words. The original MTC files include informative tags, such as the author of the text and the paragraph boundaries; these tags were removed to obtain raw text files. The MTC has been divided into 4 subcorpora with nearly equal genre distributions. The files in the subcorpora are encoded in UTF-8. The tables below show the genre distribution of the MTC, the genre distribution of each subcorpus, and the genres excluded from the subcorpora. METU-TDB consists of annotations done on Subcorpus-1 of the MTC.
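One way to divide a file list into genre-balanced parts is a round-robin assignment that runs genre by genre. The sketch below is only an illustration of that idea, not the procedure actually used for the MTC; the function name `split_by_genre` and the `(file_name, genre)` input format are assumptions made for this example.

```python
from collections import defaultdict


def split_by_genre(files, n_parts=4):
    """Split (file_name, genre) pairs into n_parts genre-balanced lists.

    Hypothetical helper: files are dealt out round-robin, one genre at a
    time, and the position in the rotation carries over between genres so
    that both the per-genre counts and the part sizes differ by at most one.
    """
    by_genre = defaultdict(list)
    for name, genre in files:
        by_genre[genre].append(name)

    parts = [[] for _ in range(n_parts)]
    position = 0  # running index shared by all genres
    for genre in sorted(by_genre):
        for name in sorted(by_genre[genre]):
            parts[position % n_parts].append((name, genre))
            position += 1
    return parts
```

Applied to a list with the genre counts given in the first table, this yields three parts of 197 files and one of 196, although the assignment of individual files will not match the real subcorpora.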

Genre distribution of the METU Turkish Corpus

| Genre tag | Category count | % |
|---|---|---|
| Novel | 123 | 15.63 |
| Story | 114 | 14.49 |
| Research-Survey | 49 | 6.23 |
| Article | 38 | 4.83 |
| Travel | 19 | 2.41 |
| Interview | 7 | 0.89 |
| Memoir | 18 | 2.29 |
| News | 419 | 53.24 |
| TOTAL | 787 | 100.00 |

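Genre distributions in the 4 subcorpora of the METU Turkish Corpus

| Genre | I. File count | I. % | II. File count | II. % | III. File count | III. % | IV. File count | IV. % |
|---|---|---|---|---|---|---|---|---|
| Novel | 31 | 15.74 | 30 | 15.23 | 31 | 15.82 | 31 | 15.74 |
| Story | 28 | 14.21 | 29 | 14.72 | 28 | 14.29 | 29 | 14.72 |
| Research-Survey | 13 | 6.60 | 12 | 6.09 | 12 | 6.12 | 12 | 6.09 |
| Article | 9 | 4.57 | 10 | 5.08 | 9 | 4.59 | 10 | 5.08 |
| Travel | 5 | 2.54 | 5 | 2.54 | 4 | 2.04 | 5 | 2.54 |
| Interview | 2 | 1.02 | 2 | 1.02 | 2 | 1.02 | 1 | 0.51 |
| Memoir | 4 | 2.03 | 5 | 2.54 | 5 | 2.55 | 4 | 2.03 |
| News | 105 | 53.30 | 104 | 52.79 | 105 | 53.57 | 105 | 53.30 |
| Total | 197 | | 197 | | 196 | | 197 | |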

File counts of the genres excluded from the subcorpora

| Genre | File count |
|---|---|
| Column | 83 |
| Essay | 76 |
| Total Excluded | 159 |
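The percentages and totals in the tables above follow directly from the file counts. As a quick consistency check, the short sketch below recomputes them from the counts in the first and second tables; the variable names are chosen for this example only.

```python
# File counts per genre, copied from the first table.
counts = {
    "Novel": 123,
    "Story": 114,
    "Research-Survey": 49,
    "Article": 38,
    "Travel": 19,
    "Interview": 7,
    "Memoir": 18,
    "News": 419,
}

total = sum(counts.values())

# Share of each genre in percent, rounded to two decimals as in the table.
percentages = {genre: round(100 * n / total, 2) for genre, n in counts.items()}

# Totals of subcorpora I to IV, copied from the second table.
subcorpus_totals = [197, 197, 196, 197]

for genre, share in percentages.items():
    print(f"{genre:<16}{counts[genre]:>5}{share:>8.2f}")
print(f"{'TOTAL':<16}{total:>5}")
print("Subcorpus totals add up to corpus total:", sum(subcorpus_totals) == total)
```

The same calculation with a subcorpus total as the denominator (for example 31 / 197 for Novel in Subcorpus I) gives the per-subcorpus percentages.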