Normalization

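In Russian-language data analysis, lemmatization (reducing each word to its dictionary form) is used more often than stemming. This exercise works with words.txt, the file produced by the tokenization step.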

Import the required library.
```
from pymystem3 import Mystem
```
This line loads the Mystem class from the pymystem3 library.
- pymystem3

Set the file paths.
```
word_input_file = 'words.txt'
lemmatized_output_file = 'lemmatized_words.txt'
```

Create a Mystem object for lemmatization.
```
mystem = Mystem()
```
Create an empty list, lemmatized_words, to hold the lemmatized words.
```
lemmatized_words = []
```
The lemmatized words will be appended to this list in order.

Convert the word lists to lemmas.
```
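with open(word_input_file, 'r', encoding='utf-8') as infile:
    for line in infile:
        # Strip surrounding whitespace, then split the line into words
        word_list = line.strip().split()
        # lemmatize() also returns the spaces between words and a trailing '\n';
        # keep only the items that contain actual text
        lemmas = mystem.lemmatize(' '.join(word_list))
        lemmatized_list = [lemma for lemma in lemmas if lemma.strip()]
        lemmatized_words.append(lemmatized_list)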
```
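First, open the file whose path is stored in word_input_file (words.txt) in read mode. For each line, strip() removes the whitespace at both ends, and split() then breaks the line into words on whitespace. The resulting words are stored in word_list.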
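
To see what these two calls do to a single line, here is a minimal sketch. The sample string is made up for illustration and is not taken from words.txt.

```python
# A made-up line with leading/trailing whitespace and uneven spacing
line = "  мама   мыла раму \n"

# strip() removes whitespace (including the newline) from both ends
stripped = line.strip()

# split() with no arguments splits on any run of whitespace
word_list = stripped.split()

print(word_list)  # ['мама', 'мыла', 'раму']
```

Note that split() with no arguments already ignores leading and trailing whitespace, so the strip() call mainly makes the intent explicit.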

To lemmatize the words, join the items of word_list back into a single string (' '.join(word_list)) and pass that string to the mystem.lemmatize() method, which returns the lemmas as a list.

Append the resulting lemma list (lemmatized_list) to the lemmatized_words list.
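
One detail worth knowing: lemmatize() returns the whitespace between words and a trailing newline as separate list items alongside the lemmas. The sketch below uses a hard-coded list shaped like that output (modeled on the example in the pymystem3 README, so treat the exact values as an assumption) and a hypothetical helper, drop_whitespace_tokens, to keep only the lemmas. It does not call Mystem, so it runs without the library installed.

```python
def drop_whitespace_tokens(tokens):
    """Keep only items that contain non-whitespace characters."""
    return [token for token in tokens if token.strip()]

# Hard-coded stand-in for mystem.lemmatize("Красивая мама красиво мыла раму")
raw_output = ['красивый', ' ', 'мама', ' ', 'красиво', ' ', 'мыть', ' ', 'рама', '\n']

lemmas = drop_whitespace_tokens(raw_output)
print(lemmas)  # ['красивый', 'мама', 'красиво', 'мыть', 'рама']
```

Without this filtering, joining the list with ' ' would produce extra spaces between words and an extra newline at the end of each line.
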
Save the lemmatization result to a text file.
```
with open(lemmatized_output_file, 'w', encoding='utf-8') as lemmatized_file:
    for lemmatized_list in lemmatized_words:
        # Write each lemma list as one space-separated line
        lemmatized_file.write(' '.join(lemmatized_list) + '\n')
```
The loop reads the lemma lists stored in lemmatized_words one at a time, joins the lemmas in each list with spaces, and writes the result to the text file, one line per list.
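
The same write pattern can be tried on a small example. This sketch writes to a temporary file rather than lemmatized_words.txt; the sample lemma lists and the helper name write_lemma_lines are made up for illustration.

```python
import os
import tempfile

def write_lemma_lines(lemma_lists, path):
    """Write each list of lemmas as one space-separated line."""
    with open(path, 'w', encoding='utf-8') as out_file:
        for lemma_list in lemma_lists:
            out_file.write(' '.join(lemma_list) + '\n')

# Made-up lemma lists standing in for lemmatized_words
sample = [['мама', 'мыть', 'рама'], ['кошка', 'спать']]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'lemmas_demo.txt')
    write_lemma_lines(sample, path)
    # Read the file back to check what was written
    with open(path, 'r', encoding='utf-8') as in_file:
        written = in_file.read()

print(written)
```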

Read lemmatized_output_file back and print its first 10 lines.
```
print("\nFirst 10 lines of the lemmatization result:")
with open(lemmatized_output_file, 'r', encoding='utf-8') as f:
    for _ in range(10):
        print(f.readline().strip())
```
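
If the file has fewer than 10 lines, readline() returns an empty string once the end of the file is reached, so the loop above prints blank lines for the remainder. One alternative is itertools.islice, which simply stops early. The helper name head and the demo file below are illustrative, not part of the exercise.

```python
import os
import tempfile
from itertools import islice

def head(path, n=10):
    """Return up to the first n lines of a text file, without line endings."""
    with open(path, 'r', encoding='utf-8') as f:
        return [line.rstrip('\n') for line in islice(f, n)]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a two-line demo file, shorter than the requested 10 lines
    demo_path = os.path.join(tmp_dir, 'demo.txt')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('мама мыть рама\nкошка спать\n')
    first_lines = head(demo_path)

print(first_lines)  # ['мама мыть рама', 'кошка спать']
```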

Putting all the steps so far together gives the complete code:
```
from pymystem3 import Mystem

word_input_file = 'words.txt'
lemmatized_output_file = 'lemmatized_words.txt'

mystem = Mystem()

lemmatized_words = []

with open(word_input_file, 'r', encoding='utf-8') as infile:
    for line in infile:
        word_list = line.strip().split()
        # lemmatize() also returns spaces and a trailing '\n'; keep only real lemmas
        lemmas = mystem.lemmatize(' '.join(word_list))
        lemmatized_list = [lemma for lemma in lemmas if lemma.strip()]
        lemmatized_words.append(lemmatized_list)

with open(lemmatized_output_file, 'w', encoding='utf-8') as lemmatized_file:
    for lemmatized_list in lemmatized_words:
        lemmatized_file.write(' '.join(lemmatized_list) + '\n')

print("\nFirst 10 lines of the lemmatization result:")
with open(lemmatized_output_file, 'r', encoding='utf-8') as f:
    for _ in range(10):
        print(f.readline().strip())
```


Click the arrow button to run the cell.

The first 10 lines of the lemmatization result are printed.

The normalization step is complete. The lemmatized_words.txt file is saved automatically.

Double-click words.txt and lemmatized_words.txt to compare them.
- words.txt
- lemmatized_words.txt
words.txt contains inflected word forms, whereas in lemmatized_words.txt every word has been converted to its lemma (dictionary form).
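
The comparison can also be done in code. The sketch below pairs each original word with its lemma and lists the ones that changed. It assumes both inputs have the same number of lines and the same number of tokens per line, which may not hold for every file; the function name changed_pairs and the sample data are hypothetical.

```python
def changed_pairs(original_lines, lemmatized_lines):
    """Return (word, lemma) pairs where the lemma differs from the word."""
    pairs = []
    for original, lemmatized in zip(original_lines, lemmatized_lines):
        # Pair up tokens by position within each line
        for word, lemma in zip(original.split(), lemmatized.split()):
            if word != lemma:
                pairs.append((word, lemma))
    return pairs

# Made-up sample lines standing in for words.txt and lemmatized_words.txt
original_lines = ['мама мыла раму', 'кошка спит']
lemmatized_lines = ['мама мыть рама', 'кошка спать']

print(changed_pairs(original_lines, lemmatized_lines))
# [('мыла', 'мыть'), ('раму', 'рама'), ('спит', 'спать')]
```

To run it on the real files, the lines of words.txt and lemmatized_words.txt can be passed in place of the two sample lists.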