Stopword Handling

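This step removes function words, which carry grammatical rather than lexical meaning, while keeping content words. Stopword lists differ from one developer to another, so choose a list that suits your purpose.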

Searching for the keyword stopwords on code hosting platforms (GitHub, PyPI, Hugging Face Hub, and so on) turns up many stopword lists.
If you use the spaCy library, you can access its stopword list after loading a Russian model.

In this exercise, we load the stopwords through the nltk.corpus.stopwords module provided by nltk.

1. Download the stopwords data with the nltk library, then import the stopwords module.
```
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
```
2. Load the Russian stopword list.
```
russian_stops = stopwords.words("russian")
print(russian_stops)
print('Length:', len(russian_stops))
```
This prints the current stopword list, followed by its length (that is, the number of stopwords).

3. Add a user-defined stopword list.
```
add_stopword_list = ["было", "какая", "к", "''"]

for word in add_stopword_list:
    russian_stops.append(word)
```
NLTK provides 151 Russian stopwords. Here we add four more entries ("было", "какая", "к", "''") before continuing. The loop appends each word to the existing stopword list (russian_stops).
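
Because append does not check for membership, any word that is already in the list ends up there twice, and the reported length grows accordingly. The following is a minimal sketch of a duplicate-safe variant. The helper name extend_unique and the short base_stops list are hypothetical stand-ins for illustration and are not part of nltk.

```python
def extend_unique(stop_list, extra_words):
    """Append each word from extra_words only if it is not already present.

    Hypothetical helper for illustration; returns the number of words added.
    """
    added = 0
    for word in extra_words:
        if word not in stop_list:
            stop_list.append(word)
            added += 1
    return added


# Stand-in for stopwords.words("russian"); the real list is much longer.
base_stops = ["и", "в", "не", "к"]
extra = ["было", "какая", "к", "''"]

added_count = extend_unique(base_stops, extra)
print(base_stops)
print('added:', added_count, 'length:', len(base_stops))
```

In this stand-in data "к" is already present, so only three of the four words are added.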

4. Print the modified stopword list.

```
print(russian_stops)
print('Length:', len(russian_stops))
```

Putting steps 1 through 4 together gives the following complete code.

```
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

russian_stops = stopwords.words("russian")
print(russian_stops)
print('Length:', len(russian_stops))

add_stopword_list = ["было", "какая", "к", "''"]

for word in add_stopword_list:
    russian_stops.append(word)

print(russian_stops)
print('Length:', len(russian_stops))
```

5. Click the arrow button to run the cell and check the output.

Now let's remove the stopwords and save the result. We will work with the lemmatized_words.txt file produced by the normalization step.

6. Insert a new code cell and set the file paths.

```
lemmatized_input_file = 'lemmatized_words.txt'
filtered_output_file = 'filtered_lemmatized_words.txt'
```

7. Create an empty list to store the words that remain after the stopwords are removed.
```
result = []
```
8. Read the lemmas.
```
with open(lemmatized_input_file, 'r', encoding='utf-8') as infile:
    for line in infile:
        word_list = line.strip().split()
        for word in word_list:
            # split() already removes surrounding whitespace from each word
            if word not in russian_stops:
                result.append(word)
```
line.strip() removes whitespace from both ends of the line, and split() divides the words on that line at whitespace. Each word that is not in the stopword list (russian_stops) is added to the result list.
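
A membership test on a list scans it element by element, so for large files it can help to convert the stopword list to a set first. The sketch below shows the same filtering logic under that assumption. filter_stopwords is a hypothetical helper, and io.StringIO stands in for the lemmatized_words.txt file so that the example runs without it.

```python
import io


def filter_stopwords(lines, stop_words):
    """Return the whitespace-separated words in lines that are not stopwords."""
    stop_set = set(stop_words)  # set lookup avoids scanning the whole list
    kept = []
    for line in lines:
        # split() with no arguments also discards surrounding whitespace
        for word in line.split():
            if word not in stop_set:
                kept.append(word)
    return kept


# Stand-ins for the lemmatized file and for russian_stops.
sample_file = io.StringIO("я идти к дом\nона читать книга и газета\n")
sample_stops = ["я", "к", "она", "и"]

filtered = filter_stopwords(sample_file, sample_stops)
print(filtered)
```

The result is the same as with a list; only the lookup cost per word changes.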

9. Save the result to a text file.
```
with open(filtered_output_file, 'w', encoding='utf-8') as filtered_file:
    for word in result:
        filtered_file.write(word + '\n')
```
Each word stored in the result list is written to the text file on its own line: a newline character (\n) is written after every word.
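
To confirm that each word really lands on its own line, the file can be read back after writing. This is a minimal sketch, assuming a temporary directory and the hypothetical file name demo_filtered.txt as stand-ins for filtered_lemmatized_words.txt.

```python
import os
import tempfile

words = ["идти", "дом", "читать"]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_filtered.txt")  # hypothetical name

    # Write one word per line, each followed by a newline character.
    with open(demo_path, 'w', encoding='utf-8') as out_file:
        for word in words:
            out_file.write(word + '\n')

    # Read the file back; splitlines() drops the trailing newline characters.
    with open(demo_path, 'r', encoding='utf-8') as in_file:
        read_back = in_file.read().splitlines()

print(read_back == words)
```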

10. Print the length of the result and a sample.
```
print('Length:', len(result))
print('Sample:', result[:100])
```
len(result) gives the total number of words that remain after stopword removal. The first 100 of those words are printed as a sample.

Putting steps 6 through 10 together gives the following complete code.

```
lemmatized_input_file = 'lemmatized_words.txt'
filtered_output_file = 'filtered_lemmatized_words.txt'

result = []

with open(lemmatized_input_file, 'r', encoding='utf-8') as infile:
    for line in infile:
        word_list = line.strip().split()
        for word in word_list:
            # split() already removes surrounding whitespace from each word
            if word not in russian_stops:
                result.append(word)

with open(filtered_output_file, 'w', encoding='utf-8') as filtered_file:
    for word in result:
        filtered_file.write(word + '\n')

print('Length:', len(result))
print('Sample:', result[:100])
```
