METU-TDB Corpus

The Corpus

The corpus used in this project is the METU Turkish Corpus (MTC), which contains approximately 2 million words. The original MTC files include informative tags, such as the author of the text and the paragraph boundaries; these tags were removed to obtain raw text files. The MTC was divided into 4 subcorpora with approximately equal genre distributions. The files in the subcorpora are encoded in UTF-8. The tables below show the genre distribution of the MTC, the genre distribution of each subcorpus, and the genres excluded from the subcorpora. METU-TDB (METU Turkish Discourse Bank) consists of annotations done on Subcorpus-1 of the MTC.
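The text does not say how files were assigned to the subcorpora. As a rough illustration only, one way to obtain near-equal genre distributions is to deal the files of each genre round-robin across the 4 parts. The function name `split_by_genre` and the file names below are hypothetical, and this sketch is not claimed to be the procedure used to build the MTC subcorpora.

```python
from collections import defaultdict


def split_by_genre(files, n_parts=4):
    """Split (file_name, genre) pairs into n_parts lists with similar genre mixes.

    Files of each genre are dealt round-robin, so the per-genre counts
    in any two parts differ by at most one.
    """
    by_genre = defaultdict(list)
    for name, genre in files:
        by_genre[genre].append(name)

    parts = [[] for _ in range(n_parts)]
    for genre in sorted(by_genre):
        for i, name in enumerate(sorted(by_genre[genre])):
            parts[i % n_parts].append((name, genre))
    return parts


# Hypothetical file names; only the per-genre totals come from the tables.
files = [(f"novel_{i:03d}.txt", "Novel") for i in range(123)]
files += [(f"interview_{i:03d}.txt", "Interview") for i in range(7)]

parts = split_by_genre(files)
novel_counts = [sum(1 for _, g in p if g == "Novel") for p in parts]
interview_counts = [sum(1 for _, g in p if g == "Interview") for p in parts]
print(novel_counts, interview_counts)
```

A round-robin deal keeps each genre's count within one file across the parts, which is the same kind of balance the subcorpus table shows, although the exact per-subcorpus counts there differ slightly from what this sketch produces.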

Genre distribution of the METU Turkish Corpus
Genre	File count	%
Novel	123	15.63
Story	114	14.49
Research-Survey	49	6.23
Article	38	4.83
Travel	19	2.41
Interview	7	0.89
Memoir	18	2.29
News	419	53.24
Total	787	100.00
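The % column can be recomputed from the counts as count / total × 100, rounded to two decimals. A minimal sketch, with the counts copied from the table above; the variable names are illustrative.

```python
# Counts per genre, copied from the table above.
genre_counts = {
    "Novel": 123,
    "Story": 114,
    "Research-Survey": 49,
    "Article": 38,
    "Travel": 19,
    "Interview": 7,
    "Memoir": 18,
    "News": 419,
}

total = sum(genre_counts.values())

# Share of each genre as a percentage of all files, rounded to two decimals.
percentages = {
    genre: round(100 * count / total, 2)
    for genre, count in genre_counts.items()
}

for genre, pct in percentages.items():
    print(f"{genre}\t{genre_counts[genre]}\t{pct:.2f}")
print(f"Total\t{total}")
```

Because each percentage is rounded separately, the rounded values need not add up to exactly 100.00.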

Genre distributions in the 4 subcorpora of the METU Turkish Corpus
	Subcorpus I		Subcorpus II		Subcorpus III		Subcorpus IV
Genre	File count	%	File count	%	File count	%	File count	%
Novel	31	15.74	30	15.23	31	15.82	31	15.74
Story	28	14.21	29	14.72	28	14.29	29	14.72
Research-Survey	13	6.60	12	6.09	12	6.12	12	6.09
Article	9	4.57	10	5.08	9	4.59	10	5.08
Travel	5	2.54	5	2.54	4	2.04	5	2.54
Interview	2	1.02	2	1.02	2	1.02	1	0.51
Memoir	4	2.03	5	2.54	5	2.55	4	2.03
News	105	53.30	104	52.79	105	53.57	105	53.30
Total	197		197		196		197

Genres excluded from the subcorpora and their file counts
Genre	File count
Column	83
Essay	76
Total Excluded	159
