Updated June 7, 2024

Corpus size

Texts by subcorpus

| Corpus | Texts | Sentences | Tokens | % of tokens |
| --- | --- | --- | --- | --- |
| Main | 131,488 | 30,499,014 | 374,449,975 | 17.30% |
| ┗ including manually disambiguated | 2,170 | 519,726 | 6,106,069 | 0.30% |
| Media | 2,838,953 | 57,862,754 | 850,630,557 | 39.20% |
| ┗ National media | 2,728,688 | 55,215,073 | 815,141,029 | 37.60% |
| ┗ Regional & international | 110,265 | 2,647,681 | 35,489,528 | 1.60% |
| SynTagRus | 1,304 | 109,886 | 1,568,027 | 0.10% |
| Social networks | 1,753,274 | 14,107,401 | 157,899,671 | 7.30% |
| Spoken | 4,514 | 2,018,401 | 14,554,052 | 0.70% |
| Accentological | 1,340,677 | 13,503,719 | 135,282,644 | 6.20% |
| Multimedia | 1,383 | 1,019,768 | 5,763,881 | 0.30% |
| MultiPARC | 51 | 79,672 | 458,531 | 0.00% |
| ┗ Russian | 21 | 36,567 | 229,200 | 0.00% |
| ┗ English-Russian | 30 | 43,105 | 229,331 | 0.00% |
| Parallel | 7,337 | 13,533,996 | 179,244,125 | 8.30% |
| ┗ English | 1,322 | 3,119,872 | 45,562,451 | 2.10% |
| ┗ Armenian | 28 | 126,636 | 1,570,735 | 0.10% |
| ┗ Bashkir | 124 | 124,270 | 550,387 | 0.00% |
| ┗ Belarusian | 312 | 1,162,868 | 10,916,697 | 0.50% |
| ┗ Bulgarian | 59 | 418,988 | 5,159,914 | 0.20% |
| ┗ Buryat | 7 | 30,750 | 401,516 | 0.00% |
| ┗ Spanish | 140 | 364,298 | 5,443,988 | 0.30% |
| ┗ Italian | 126 | 302,264 | 4,930,970 | 0.20% |
| ┗ Chinese | 1,075 | 253,500 | 4,422,747 | 0.20% |
| ┗ Korean | 185 | 12,300 | 73,752 | 0.00% |
| ┗ Latvian | 245 | 410,438 | 4,398,564 | 0.20% |
| ┗ Lithuanian | 65 | 72,244 | 702,448 | 0.00% |
| ┗ German | 294 | 2,194,004 | 31,742,120 | 1.50% |
| ┗ Polish | 54 | 501,800 | 6,355,629 | 0.30% |
| ┗ Portuguese | 38 | 88,572 | 1,602,412 | 0.10% |
| ┗ Romanian | 31 | 60,140 | 903,375 | 0.00% |
| ┗ Serbian | 37 | 144,027 | 1,903,176 | 0.10% |
| ┗ Slovene | 53 | 173,172 | 1,989,747 | 0.10% |
| ┗ Ukrainian | 865 | 919,426 | 9,383,774 | 0.40% |
| ┗ Finnish | 320 | 299,184 | 3,741,431 | 0.20% |
| ┗ French | 67 | 498,180 | 7,631,429 | 0.40% |
| ┗ Khakas | 331 | 126,710 | 1,194,970 | 0.10% |
| ┗ Hindi β | 9 | 9,292 | 122,347 | 0.00% |
| ┗ Czech | 553 | 333,360 | 4,372,828 | 0.20% |
| ┗ Swedish | 787 | 1,344,054 | 16,520,159 | 0.80% |
| ┗ Estonian | 95 | 192,493 | 2,158,315 | 0.10% |
| ┗ Japanese | 103 | 31,512 | 453,279 | 0.00% |
| ┗ Multilingual | 12 | 219,642 | 5,034,965 | 0.20% |
| Dialect | 2,014 | 125,156 | 599,258 | 0.00% |
| Educational | 1,247 | 1,184,926 | 13,761,608 | 0.60% |
| From 2 to 15 | 75 | 413,781 | 4,408,536 | 0.20% |
| Poetry | 101,521 | 1,340,752 | 13,879,558 | 0.60% |
| Russian classics β | 27,289 | 1,544,467 | 18,556,005 | 0.90% |
| Historical | 10,825 | 813,088 | 14,877,877 | 0.70% |
| ┗ Old East Slavic | 301 | | 838,928 | 0.00% |
| ┗ Inscriptions | 663 | | 5,228 | 0.00% |
| ┗ Birchbark letters | 1,230 | 1,230 | 23,598 | 0.00% |
| ┗ Middle Russian | 7,212 | 379,522 | 8,745,540 | 0.40% |
| ┗ Church Slavonic | 1,419 | 432,336 | 5,264,583 | 0.20% |
| Panchronic | 141,035 | 30,890,027 | 384,096,728 | 17.70% |
| Total | 6,362,987 | 169,046,808 | 2,170,031,033 | 100% |

Text types

Information about texts in the Main corpus

| Text type | Texts | Sentences | Tokens | % of tokens |
| --- | --- | --- | --- | --- |
| Non-fiction | 120,563 | 16,724,139 | 223,140,501 | 59.50% |
| Fiction | 10,967 | 14,998,103 | 151,878,171 | 40.50% |
| Total | 131,530 | 31,722,242 | 375,018,672 | 100% |

Fiction

| Genre | Texts | Sentences | Tokens | % of tokens |
| --- | --- | --- | --- | --- |
| Crime | 139 | 860,979 | 7,722,512 | 5.0% |
| Children's literature | 848 | 764,249 | 6,635,985 | 4.3% |
| Nonfiction | 462 | 1,083,362 | 12,453,680 | 8.1% |
| Drama | 307 | 617,011 | 3,155,297 | 2.0% |
| Historical prose | 282 | 1,319,374 | 14,437,611 | 9.4% |
| Love story | 55 | 169,273 | 1,542,336 | 1.0% |
| No genre | 6,094 | 8,191,709 | 86,466,150 | 56.1% |
| Translation | 16 | 13,415 | 185,172 | 0.1% |
| Adventure | 280 | 570,595 | 5,828,821 | 3.8% |
| Miscellaneous | 80 | 27,709 | 351,910 | 0.2% |
| Sentimental fiction | 30 | 10,867 | 167,334 | 0.1% |
| Sci-fi | 733 | 999,272 | 9,724,283 | 6.3% |
| Humour and satire | 1,569 | 604,686 | 5,585,078 | 3.6% |
| Total | 10,895 | 15,232,501 | 154,256,169 | 100% |

Non-fiction

| Domain | Texts | Sentences | Tokens | % of tokens |
| --- | --- | --- | --- | --- |
| Day-to-day life | 6,308 | 3,270,147 | 33,360,180 | 14.7% |
| Official and business | 3,534 | 332,593 | 5,102,771 | 2.3% |
| Technical | 1,210 | 120,232 | 1,621,124 | 0.7% |
| Journalism | 97,896 | 9,983,031 | 136,953,541 | 60.5% |
| Advertising | 2,153 | 84,875 | 853,096 | 0.4% |
| Academic | 7,759 | 2,442,550 | 39,858,190 | 17.6% |
| Theological | 1,219 | 373,689 | 5,298,688 | 2.3% |
| Electronic communication | 888 | 352,547 | 3,474,946 | 1.5% |
| Total | 120,967 | 16,959,664 | 226,522,536 | 100% |


Creation date

Texts in the Main corpus by date


Parts of speech

Tokens by part of speech (disambiguated corpus only)
