The Corpus
The corpus used in this project is the METU Turkish Corpus (MTC), which contains approximately 2 million words. The original MTC files include informative tags, such as the author of the text and the paragraph boundaries; these tags were removed to obtain raw text files. The MTC has been divided into 4 subcorpora with nearly equal genre distributions. The files in the subcorpora are encoded in UTF-8. The tables below show the genre distribution of the MTC, the genre distribution of each subcorpus, and the genres excluded from the subcorpora. METU-TDB consists of annotations made on Subcorpus-1 of the MTC.
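The division into genre-balanced subcorpora can be illustrated with a small sketch. The source does not describe the procedure actually used to split the MTC, so the round-robin assignment below, the function name `split_by_genre`, and the synthetic file names are all assumptions made for illustration; only the genre counts are taken from the MTC genre distribution.

```python
from collections import defaultdict

# Genre counts as reported for the MTC; file names below are hypothetical.
MTC_GENRE_COUNTS = {
    "Novel": 123, "Story": 114, "Research-Survey": 49, "Article": 38,
    "Travel": 19, "Interview": 7, "Memoir": 18, "News": 419,
}


def split_by_genre(files, n_parts=4):
    """Deal (file_name, genre) pairs into n_parts lists, genre by genre.

    Assumed procedure (not the documented MTC one): files are grouped by
    genre and dealt out round-robin, with the dealing position carried over
    between genres so that the overall part sizes also stay balanced.
    """
    by_genre = defaultdict(list)
    for name, genre in files:
        by_genre[genre].append(name)

    parts = [[] for _ in range(n_parts)]
    slot = 0
    for genre in sorted(by_genre):
        for name in sorted(by_genre[genre]):
            parts[slot % n_parts].append((name, genre))
            slot += 1
    return parts


# Build a synthetic file list with the reported genre counts.
files = [
    (f"{genre}_{i:03d}.txt", genre)
    for genre, count in MTC_GENRE_COUNTS.items()
    for i in range(count)
]
parts = split_by_genre(files)

# Per-genre file counts in each part, e.g. per_genre["Novel"] has 4 entries.
per_genre = {
    genre: [sum(1 for _, g in part if g == genre) for part in parts]
    for genre in MTC_GENRE_COUNTS
}
```

With these counts the sketch produces parts of 197, 197, 197 and 196 files, and the count for any genre differs by at most one file between parts.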
Genre distribution of the METU Turkish Corpus

| Genre tag | Category count | % |
|---|---|---|
| Novel | 123 | 15.63 |
| Story | 114 | 14.49 |
| Research-Survey | 49 | 6.23 |
| Article | 38 | 4.83 |
| Travel | 19 | 2.41 |
| Interview | 7 | 0.89 |
| Memoir | 18 | 2.29 |
| News | 419 | 53.24 |
| TOTAL | 787 | 100.00 |
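The percentage column can be recomputed from the counts as a quick consistency check. This is a minimal sketch; the variable names are illustrative and not part of the corpus distribution.

```python
# Counts copied from the genre distribution of the MTC.
genre_counts = {
    "Novel": 123, "Story": 114, "Research-Survey": 49, "Article": 38,
    "Travel": 19, "Interview": 7, "Memoir": 18, "News": 419,
}

total = sum(genre_counts.values())

# Share of each genre as a percentage of all files, rounded to two decimals.
shares = {genre: round(100 * n / total, 2) for genre, n in genre_counts.items()}

for genre, share in shares.items():
    print(f"{genre:<16} {genre_counts[genre]:>4} {share:>6.2f}")
```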
Genre distributions in the 4 subcorpora of the METU Turkish Corpus

| Genre | I. File count | I. % | II. File count | II. % | III. File count | III. % | IV. File count | IV. % |
|---|---|---|---|---|---|---|---|---|
| Novel | 31 | 15.74 | 30 | 15.23 | 31 | 15.82 | 31 | 15.74 |
| Story | 28 | 14.21 | 29 | 14.72 | 28 | 14.29 | 29 | 14.72 |
| Research-Survey | 13 | 6.60 | 12 | 6.09 | 12 | 6.12 | 12 | 6.09 |
| Article | 9 | 4.57 | 10 | 5.08 | 9 | 4.59 | 10 | 5.08 |
| Travel | 5 | 2.54 | 5 | 2.54 | 4 | 2.04 | 5 | 2.54 |
| Interview | 2 | 1.02 | 2 | 1.02 | 2 | 1.02 | 1 | 0.51 |
| Memoir | 4 | 2.03 | 5 | 2.54 | 5 | 2.55 | 4 | 2.03 |
| News | 105 | 53.30 | 104 | 52.79 | 105 | 53.57 | 105 | 53.30 |
| Total | 197 | | 197 | | 196 | | 197 | |
File counts of the genres excluded from the subcorpora

| Genre | File count |
|---|---|
| Column | 83 |
| Essay | 76 |
| Total Excluded | 159 |