Updated November 14, 2023

Corpus size

Texts by subcorpus

Corpus Texts Sentences Tokens (https://gorgeous-preface-418.notion.site/af49c73fecb14eb3b90f0102b2a06da2) % of tokens
Main 131,488 30,499,014 374,449,975 17.8%
┗ including manually disambiguated 2,170 519,726 6,106,069 0.3%
Media 2,717,817 54,264,799 788,960,499 37.6%
┗ National media 2,660,024 52,487,609 765,537,565 36.5%
┗ Regional & international 57,793 1,777,190 23,422,934 1.1%
SynTagRus 1,304 107,129 1,529,501 0.1%
Social networks 1,753,274 14,197,458 157,921,573 7.5%
Spoken 4,396 1,965,934 13,984,358 0.7%
Accentological 1,335,740 13,393,954 134,239,098 6.4%
Multimedia 1,383 1,019,775 5,763,881 0.3%
MultiPARC 51 79,672 458,531 0.0%
┗ Russian 21 36,567 229,200 0.0%
┗ English-Russian 30 43,105 229,331 0.0%
Parallel 6,722 13,100,414 173,288,365 8.3%
┗ English 1,189 3,056,510 44,477,958 2.1%
┗ Armenian 28 126,636 1,570,738 0.1%
┗ Bashkir 124 124,270 550,387 0.0%
┗ Belarusian 312 1,162,868 10,916,697 0.5%
┗ Bulgarian 59 418,988 5,159,914 0.2%
┗ Buryat 7 30,750 401,516 0.0%
┗ Spanish 139 359,218 5,385,735 0.3%
┗ Italian 126 302,264 4,930,991 0.2%
┗ Chinese 1,075 253,500 4,422,747 0.2%
┗ Korean 185 12,300 73,752 0.0%
┗ Latvian 245 410,442 4,400,017 0.2%
┗ Lithuanian 65 72,244 702,448 0.0%
┗ German 287 2,118,530 30,544,451 1.5%
┗ Polish 54 501,800 6,355,630 0.3%
┗ Portuguese 28 25,136 566,675 0.0%
┗ Romanian 31 60,140 903,377 0.0%
┗ Serbian 37 144,027 1,903,178 0.1%
┗ Slovene 53 173,172 1,989,749 0.1%
┗ Ukrainian 865 919,426 9,383,774 0.4%
┗ Finnish 320 299,184 3,741,431 0.2%
┗ French 61 462,184 7,123,534 0.3%
┗ Hindi 9 9,292 123,176 0.0%
┗ Czech 529 301,344 3,947,051 0.2%
┗ Swedish 787 1,344,054 16,520,159 0.8%
┗ Estonian 95 192,493 2,158,315 0.1%
┗ Multilingual 12 219,642 5,034,965 0.2%
Dialect 2,014 125,156 599,258 0.0%
Educational 1,247 1,184,926 13,761,608 0.7%
From 2 to 15 75 431,193 4,419,420 0.2%
Poetry 96,702 1,208,251 13,404,836 0.6%
Russian classics β 25,882 1,513,674 17,549,885 0.8%
Historical 11,120 800,150 14,931,707 0.7%
┗ Old East Slavic 248 - 807,904 0.0%
┗ Inscriptions 663 - 5,228 0.0%
┗ Birchbark letters 1,230 1,230 23,598 0.0%
┗ Middle Russian 7,560 383,769 8,843,355 0.4%
┗ Church Slavonic 1,419 415,151 5,251,622 0.3%
Panchronic 140,326 30,944,770 383,815,697 18.3%
Total 6,229,541 164,836,269 2,099,078,192 100%
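The percentage in the last column appears to be each row's token count divided by the grand total in the final row, rounded to one decimal place. A minimal Python sketch of that arithmetic follows, using a few token counts copied from the table; the helper name `share` is hypothetical, and the rounding rule is an assumption that happens to reproduce the listed values.

```python
# Token counts copied from a few rows of the table above.
TOTAL_TOKENS = 2_099_078_192  # grand total from the final row

token_counts = {
    "Main": 374_449_975,
    "Media": 788_960_499,
    "Parallel": 173_288_365,
    "Panchronic": 383_815_697,
}


def share(tokens: int, total: int = TOTAL_TOKENS) -> float:
    """Percentage of the total, rounded to one decimal place (assumed rule)."""
    return round(100 * tokens / total, 1)


shares = {name: share(count) for name, count in token_counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

Rows shown as 0.0% are small but non-empty subcorpora whose share rounds down at one decimal place.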

Text types

Information on texts in the Main corpus

Text type Texts Sentences Tokens % of tokens
Non-fiction 120,563 16,724,139 223,140,501 59.5%
Fiction 10,967 14,998,103 151,878,171 40.5%
Total 131,530 31,722,242 375,018,672 100%

Fiction

Genre Texts Sentences Tokens % of tokens
Crime 139 860,979 7,722,512 5.0%
Children's literature 848 764,249 6,635,985 4.3%
Nonfiction 462 1,083,362 12,453,680 8.1%
Drama 307 617,011 3,155,297 2.0%
Historical prose 282 1,319,374 14,437,611 9.4%
Love story 55 169,273 1,542,336 1.0%
No genre 6,094 8,191,709 86,466,150 56.1%
Translation 16 13,415 185,172 0.1%
Adventure 280 570,595 5,828,821 3.8%
Miscellaneous 80 27,709 351,910 0.2%
Sentimental fiction 30 10,867 167,334 0.1%
Sci-fi 733 999,272 9,724,283 6.3%
Humour and satire 1,569 604,686 5,585,078 3.6%
Total 10,895 15,232,501 154,256,169 100%
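The final row of the genre table can be checked against the rows above it. The sketch below is one way to do that in Python, assuming every data row ends with four whitespace-separated fields (texts, sentences, tokens, percentage) and that everything before them is the label; `parse_row` is a hypothetical helper, not part of any corpus tooling.

```python
# Genre rows copied from the table above (label, texts, sentences, tokens, %).
ROWS = """\
Crime 139 860,979 7,722,512 5.0%
Children's literature 848 764,249 6,635,985 4.3%
Nonfiction 462 1,083,362 12,453,680 8.1%
Drama 307 617,011 3,155,297 2.0%
Historical prose 282 1,319,374 14,437,611 9.4%
Love story 55 169,273 1,542,336 1.0%
No genre 6,094 8,191,709 86,466,150 56.1%
Translation 16 13,415 185,172 0.1%
Adventure 280 570,595 5,828,821 3.8%
Miscellaneous 80 27,709 351,910 0.2%
Sentimental fiction 30 10,867 167,334 0.1%
Sci-fi 733 999,272 9,724,283 6.3%
Humour and satire 1,569 604,686 5,585,078 3.6%
""".splitlines()


def parse_row(line: str) -> tuple[str, int, int, int]:
    """Split from the right so that multi-word labels stay intact."""
    label, texts, sentences, tokens, _pct = line.rsplit(maxsplit=4)

    def to_int(field: str) -> int:
        return int(field.replace(",", ""))  # drop thousands separators

    return label, to_int(texts), to_int(sentences), to_int(tokens)


parsed = [parse_row(row) for row in ROWS]

# Column-wise sums of texts, sentences, and tokens.
totals = tuple(sum(column) for column in zip(*(p[1:] for p in parsed)))
print(totals)
```

Splitting from the right rather than the left is what keeps labels such as "Children's literature" or "Humour and satire" in one piece.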

Non-fiction

Domain Texts Sentences Tokens % of tokens
Day-to-day life 6,308 3,270,147 33,360,180 14.7%
Official and business 3,534 332,593 5,102,771 2.3%
Technical 1,210 120,232 1,621,124 0.7%
Journalism 97,896 9,983,031 136,953,541 60.5%
Advertising 2,153 84,875 853,096 0.4%
Academic 7,759 2,442,550 39,858,190 17.6%
Theological 1,219 373,689 5,298,688 2.3%
Electronic communication 888 352,547 3,474,946 1.5%
Total 120,967 16,959,664 226,522,536 100%

Text topic Texts Sentences Tokens % of tokens
Administration and management 17,352 1,432,831 17,313,749 4.6%
Army and armed conflict 12,702 1,271,972 15,623,927 4.2%
Archaeology 21 2,034 29,368 0.0%
Astrology, parapsychology, esoterica 432 101,154 1,035,185 0.3%
Astronomy 449 41,821 649,728 0.2%
Business, commerce, economics, finance 12,339 772,790 10,336,228 2.8%
Biology 1,202 225,567 3,512,071 0.9%
Military affairs 12 12,459 235,429 0.1%
Geography 461 219,107 3,661,940 1.0%
Geodesy 1 613 15,250 0.0%
Geology 631 132,920 1,876,853 0.5%
Mining industry 393 27,414 422,038 0.1%
Home and home economy 1,325 92,597 1,122,868 0.3%
Leisure and entertainment 5,878 479,007 4,835,165 1.3%
Natural science 685 203,688 2,357,321 0.6%
Natural history 30 13,663 209,619 0.1%
Health and medicine 6,114 532,349 6,607,054 1.8%
IT 665 85,394 1,295,556 0.3%
Art and culture 18,094 3,370,208 39,702,197 10.6%
Art history 122 37,886 572,173 0.2%
History 5,236 1,792,984 27,041,126 7.2%
Crime 10,700 376,771 3,899,899 1.0%
Culturology 355 128,026 2,054,761 0.5%
Light industry, food industry 329 24,575 372,466 0.1%
Forestry 94 9,848 146,354 0.0%
Logic 1 3,464 51,840 0.0%
Mathematics 218 43,041 610,062 0.2%
Machinery 25 2,026 30,965 0.0%
Metallurgy 21 2,098 32,409 0.0%
Science and technology 11,190 2,278,086 36,021,595 9.6%
Education 4,126 671,779 7,440,823 2.0%
Politics and society 34,380 4,056,351 53,209,214 14.2%
Political science 18 7,301 117,753 0.0%
Law 3,701 311,517 4,689,965 1.2%