Updated June 7, 2024

Corpus size

Texts by subcorpus

| Corpus | Texts | Sentences | Tokens | % of tokens |
| --- | --- | --- | --- | --- |
| Main | 131,488 | 30,499,014 | 374,449,975 | 17.30% |
| ┗ including manually disambiguated | 2,170 | 519,726 | 6,106,069 | 0.30% |
| Media | 2,838,953 | 57,862,754 | 850,630,557 | 39.20% |
| ┗ National media | 2,728,688 | 55,215,073 | 815,141,029 | 37.60% |
| ┗ Regional & international | 110,265 | 2,647,681 | 35,489,528 | 1.60% |
| SynTagRus | 1,304 | 109,886 | 1,568,027 | 0.10% |
| Social networks | 1,753,274 | 14,107,401 | 157,899,671 | 7.30% |
| Spoken | 4,514 | 2,018,401 | 14,554,052 | 0.70% |
| Accentological | 1,340,677 | 13,503,719 | 135,282,644 | 6.20% |
| Multimedia | 1,383 | 1,019,768 | 5,763,881 | 0.30% |
| MultiPARC | 51 | 79,672 | 458,531 | 0.00% |
| ┗ Russian | 21 | 36,567 | 229,200 | 0.00% |
| ┗ English-Russian | 30 | 43,105 | 229,331 | 0.00% |
| Parallel | 7,337 | 13,533,996 | 179,244,125 | 8.30% |
| ┗ English | 1,322 | 3,119,872 | 45,562,451 | 2.10% |
| ┗ Armenian | 28 | 126,636 | 1,570,735 | 0.10% |
| ┗ Bashkir | 124 | 124,270 | 550,387 | 0.00% |
| ┗ Belarusian | 312 | 1,162,868 | 10,916,697 | 0.50% |
| ┗ Bulgarian | 59 | 418,988 | 5,159,914 | 0.20% |
| ┗ Buryat | 7 | 30,750 | 401,516 | 0.00% |
| ┗ Spanish | 140 | 364,298 | 5,443,988 | 0.30% |
| ┗ Italian | 126 | 302,264 | 4,930,970 | 0.20% |
| ┗ Chinese | 1,075 | 253,500 | 4,422,747 | 0.20% |
| ┗ Korean | 185 | 12,300 | 73,752 | 0.00% |
| ┗ Latvian | 245 | 410,438 | 4,398,564 | 0.20% |
| ┗ Lithuanian | 65 | 72,244 | 702,448 | 0.00% |
| ┗ German | 294 | 2,194,004 | 31,742,120 | 1.50% |
| ┗ Polish | 54 | 501,800 | 6,355,629 | 0.30% |
| ┗ Portuguese | 38 | 88,572 | 1,602,412 | 0.10% |
| ┗ Romanian | 31 | 60,140 | 903,375 | 0.00% |
| ┗ Serbian | 37 | 144,027 | 1,903,176 | 0.10% |
| ┗ Slovene | 53 | 173,172 | 1,989,747 | 0.10% |
| ┗ Ukrainian | 865 | 919,426 | 9,383,774 | 0.40% |
| ┗ Finnish | 320 | 299,184 | 3,741,431 | 0.20% |
| ┗ French | 67 | 498,180 | 7,631,429 | 0.40% |
| ┗ Khakas | 331 | 126,710 | 1,194,970 | 0.10% |
| ┗ Hindi β | 9 | 9,292 | 122,347 | 0.00% |
| ┗ Czech | 553 | 333,360 | 4,372,828 | 0.20% |
| ┗ Swedish | 787 | 1,344,054 | 16,520,159 | 0.80% |
| ┗ Estonian | 95 | 192,493 | 2,158,315 | 0.10% |
| ┗ Japanese | 103 | 31,512 | 453,279 | 0.00% |
| ┗ Multilingual | 12 | 219,642 | 5,034,965 | 0.20% |
| Dialect | 2,014 | 125,156 | 599,258 | 0.00% |
| Educational | 1,247 | 1,184,926 | 13,761,608 | 0.60% |
| From 2 to 15 | 75 | 413,781 | 4,408,536 | 0.20% |
| Poetry | 101,521 | 1,340,752 | 13,879,558 | 0.60% |
| Russian classics β | 27,289 | 1,544,467 | 18,556,005 | 0.90% |
| Historical | 10,825 | 813,088 | 14,877,877 | 0.70% |
| ┗ Old East Slavic | 301 | | 838,928 | 0.00% |
| ┗ Inscriptions | 663 | | 5,228 | 0.00% |
| ┗ Birchbark letters | 1,230 | 1,230 | 23,598 | 0.00% |
| ┗ Middle Russian | 7,212 | 379,522 | 8,745,540 | 0.40% |
| ┗ Church Slavonic | 1,419 | 432,336 | 5,264,583 | 0.20% |
| Panchronic | 141,035 | 30,890,027 | 384,096,728 | 17.70% |
| Total | 6,362,987 | 169,046,808 | 2,170,031,033 | 100% |
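
The percentage column appears to be each subcorpus's token count divided by the overall token total, rounded to one decimal place. The following is a minimal Python sketch under that assumption; the names `TOKENS`, `TOTAL_TOKENS`, and `token_share` are hypothetical and only illustrate the arithmetic for three rows copied from the table above.

```python
# Token counts copied from three rows of the subcorpus table, plus its total row.
TOKENS = {
    "Main": 374_449_975,
    "Media": 850_630_557,
    "Parallel": 179_244_125,
}
TOTAL_TOKENS = 2_170_031_033


def token_share(tokens: int, total: int = TOTAL_TOKENS) -> float:
    """Return tokens as a percentage of total, rounded to one decimal place.

    Assumption: the page rounds to one decimal; this is inferred from the
    published figures, not documented by the source.
    """
    return round(100 * tokens / total, 1)


shares = {name: token_share(count) for name, count in TOKENS.items()}
for name, share in shares.items():
    # Two decimals are printed only to mirror the table's display format.
    print(f"{name}: {share:.2f}%")
```

Under this assumption the three computed shares match the published 17.30%, 39.20%, and 8.30%.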

Text types

Information about texts in the Main corpus

| Text type | Texts | Sentences | Tokens | % of tokens |
| --- | --- | --- | --- | --- |
| Non-fiction | 120,563 | 16,724,139 | 223,140,501 | 59.50% |
| Fiction | 10,967 | 14,998,103 | 151,878,171 | 40.50% |
| Total | 131,530 | 31,722,242 | 375,018,672 | 100% |

Fiction

| Genre | Texts | Sentences | Tokens | % of tokens |
| --- | --- | --- | --- | --- |
| Crime | 139 | 860,979 | 7,722,512 | 5.0% |
| Children's literature | 848 | 764,249 | 6,635,985 | 4.3% |
| Nonfiction | 462 | 1,083,362 | 12,453,680 | 8.1% |
| Drama | 307 | 617,011 | 3,155,297 | 2.0% |
| Historical prose | 282 | 1,319,374 | 14,437,611 | 9.4% |
| Love story | 55 | 169,273 | 1,542,336 | 1.0% |
| No genre | 6,094 | 8,191,709 | 86,466,150 | 56.1% |
| Translation | 16 | 13,415 | 185,172 | 0.1% |
| Adventure | 280 | 570,595 | 5,828,821 | 3.8% |
| Miscellaneous | 80 | 27,709 | 351,910 | 0.2% |
| Sentimental fiction | 30 | 10,867 | 167,334 | 0.1% |
| Sci-fi | 733 | 999,272 | 9,724,283 | 6.3% |
| Humour and satire | 1,569 | 604,686 | 5,585,078 | 3.6% |
| Total | 10,895 | 15,232,501 | 154,256,169 | 100% |

Non-fiction

| Domain | Texts | Sentences | Tokens | % of tokens |
| --- | --- | --- | --- | --- |
| Day-to-day life | 6,308 | 3,270,147 | 33,360,180 | 14.7% |
| Official and business | 3,534 | 332,593 | 5,102,771 | 2.3% |
| Technical | 1,210 | 120,232 | 1,621,124 | 0.7% |
| Journalism | 97,896 | 9,983,031 | 136,953,541 | 60.5% |
| Advertising | 2,153 | 84,875 | 853,096 | 0.4% |
| Academic | 7,759 | 2,442,550 | 39,858,190 | 17.6% |
| Theological | 1,219 | 373,689 | 5,298,688 | 2.3% |
| Electronic communication | 888 | 352,547 | 3,474,946 | 1.5% |
| Total | 120,967 | 16,959,664 | 226,522,536 | 100% |

| Text topic | Texts | Sentences | Tokens | % of tokens |
| --- | --- | --- | --- | --- |
| Administration and management | 17,352 | 1,432,831 | 17,313,749 | 4.6% |
| Army and armed conflict | 12,702 | 1,271,972 | 15,623,927 | 4.2% |
| Archaeology | 21 | 2,034 | 29,368 | 0.0% |
| Astrology, parapsychology, esoterica | 432 | 101,154 | 1,035,185 | 0.3% |
| Astronomy | 449 | 41,821 | 649,728 | 0.2% |
| Business, commerce, economics, finance | 12,339 | 772,790 | 10,336,228 | 2.8% |
| Biology | 1,202 | 225,567 | 3,512,071 | 0.9% |
| Military affairs | 12 | 12,459 | 235,429 | 0.1% |
| Geography | 461 | 219,107 | 3,661,940 | 1.0% |
| Geodesy | 1 | 613 | 15,250 | 0.0% |
| Geology | 631 | 132,920 | 1,876,853 | 0.5% |
| Mining industry | 393 | 27,414 | 422,038 | 0.1% |
| Home and home economy | 1,325 | 92,597 | 1,122,868 | 0.3% |
| Leisure and entertainment | 5,878 | 479,007 | 4,835,165 | 1.3% |
| Natural science | 685 | 203,688 | 2,357,321 | 0.6% |
| Natural history | 30 | 13,663 | 209,619 | 0.1% |
| Health and medicine | 6,114 | 532,349 | 6,607,054 | 1.8% |
| IT | 665 | 85,394 | 1,295,556 | 0.3% |
| Art and culture | 18,094 | 3,370,208 | 39,702,197 | 10.6% |
| Art history | 122 | 37,886 | 572,173 | 0.2% |
| History | 5,236 | 1,792,984 | 27,041,126 | 7.2% |
| Crime | 10,700 | 376,771 | 3,899,899 | 1.0% |
| Culturology | 355 | 128,026 | 2,054,761 | 0.5% |
| Light industry, food industry | 329 | 24,575 | 372,466 | 0.1% |
| Forestry | 94 | 9,848 | 146,354 | 0.0% |
| Logic | 1 | 3,464 | 51,840 | 0.0% |
| Mathematics | 218 | 43,041 | 610,062 | 0.2% |
| Machinery | 25 | 2,026 | 30,965 | 0.0% |
| Metallurgy | 21 | 2,098 | 32,409 | 0.0% |
| Science and technology | 11,190 | 2,278,086 | 36,021,595 | 9.6% |
| Education | 4,126 | 671,779 | 7,440,823 | 2.0% |
| Politics and society | 34,380 | 4,056,351 | 53,209,214 | 14.2% |
| Political science | 18 | 7,301 | 117,753 | 0.0% |
| Law | 3,701 | 311,517 | 4,689,965 | 1.2% |
| Nature | 4,582 | 490,472 | 5,609,959 | 1.5% |
| Industry | 5,093 | 354,633 | 4,366,983 | 1.2% |
| Accidents | 230 | 9,838 | 93,509 | 0.0% |
| Psychology | 706 | 170,791 | 2,635,726 | 0.7% |
| Travel | 2,330 | 967,866 | 12,788,062 | 3.4% |
| Religion | 6,972 | 1,114,064 | 14,750,077 | 3.9% |
| Agriculture | 2,150 | 196,473 | 2,310,054 | 0.6% |
| Sociology | 485 | 125,290 | 1,976,834 | 0.5% |
| Sport | 4,200 | 292,914 | 3,564,947 | 0.9% |
| Statistics | 368 | 15,999 | 230,886 | 0.1% |
| Construction, architecture | 2,237 | 175,700 | 2,078,298 | 0.6% |
| Technology | 8,254 | 598,351 | 7,527,780 | 2.0% |
| Transport | 4,984 | 218,267 | 2,348,824 | 0.6% |
| Physics | 1,338 | 125,083 | 1,871,882 | 0.5% |
| Philology | 976 | 361,234 | 5,710,387 | 1.5% |

Date of creation

Texts in the Main corpus by date

| --- | --- | --- | --- | --- |

Parts of speech

Tokens by part of speech (disambiguated corpus only)

| --- | --- | --- |