Tokenization

Import the required libraries.
```
import re
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
```
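We import re for regular expressions and nltk, a library for natural language processing. From nltk we import the sentence tokenization function (sent_tokenize) and the word tokenization function (word_tokenize).
Download the Punkt tokenizer data for nltk.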
```
nltk.download('punkt_tab')
```
nltk can be used in Colab without a separate installation, but the models it relies on must be downloaded before use. For example, word_tokenize and sent_tokenize require the Punkt tokenizer models, which the punkt_tab package provides.

Set the file paths.

input_file = '/content/cleaned_rus_news.txt'
sentence_output_file = 'sentences.txt'
word_output_file = 'words.txt'

We use cleaned_rus_news.txt, the output of the cleaning step, as the input file.

Open the file and read the cleaned text into a variable.

with open(input_file, 'r', encoding='utf-8') as infile:
    text = infile.read()
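
Files uploaded to /content in Colab are typically lost when the runtime is reset, so a missing input file is a likely failure at this step. The following is a minimal sketch, not part of the original notebook, that checks for the file before reading it. The helper name `read_text_file` is hypothetical.

```python
from pathlib import Path


def read_text_file(path):
    """Return the UTF-8 contents of `path`, or raise a clear error if it is missing."""
    file_path = Path(path)
    if not file_path.is_file():
        # Fail early with the offending path in the message.
        raise FileNotFoundError(f"Input file not found: {file_path}")
    return file_path.read_text(encoding='utf-8')
```

With this helper, the read step would become `text = read_text_file(input_file)`.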

Now perform sentence tokenization and word tokenization.

sentences = sent_tokenize(text)
words = [word_tokenize(sentence) for sentence in sentences]
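
To make the shape of the result concrete: `sentences` is a list of strings, and `words` is a list of token lists, one per sentence. The sketch below mimics that shape with a deliberately naive regex splitter, so it runs without nltk or any downloaded model. The `naive_*` functions are hypothetical stand-ins and do not reproduce the Punkt algorithm, which also handles abbreviations and other edge cases.

```python
import re


def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' followed by whitespace (illustration only)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


def naive_word_tokenize(sentence):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r'\w+|[^\w\s]', sentence)


sample = "Hello world. How are you?"
sentences = naive_sent_tokenize(sample)
words = [naive_word_tokenize(sentence) for sentence in sentences]

print(sentences)  # ['Hello world.', 'How are you?']
print(words)      # [['Hello', 'world', '.'], ['How', 'are', 'you', '?']]
```

If the input is Russian, as the file name cleaned_rus_news.txt suggests, note that nltk's `sent_tokenize` and `word_tokenize` both accept a `language` argument (for example `language='russian'`). Whether that improves the splits for this particular data is an assumption that has not been tested here.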

Save the sentences and the words to separate text files, one sentence per line.

with open(sentence_output_file, 'w', encoding='utf-8') as sent_file:
    for sentence in sentences:
        sent_file.write(sentence + '\n')
with open(word_output_file, 'w', encoding='utf-8') as word_file:
    for word_list in words:
        word_file.write(' '.join(word_list) + '\n')
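
Because each line of the word file holds one sentence with its tokens joined by single spaces, the nested list can be rebuilt later by splitting each line on whitespace. A minimal sketch, assuming no token itself contains whitespace. The sample data and the file name `words_demo.txt` are made up for illustration.

```python
words = [['Hello', 'world', '.'], ['How', 'are', 'you', '?']]

# Write one sentence per line, tokens separated by single spaces.
with open('words_demo.txt', 'w', encoding='utf-8') as word_file:
    for word_list in words:
        word_file.write(' '.join(word_list) + '\n')

# Read the file back and split each line into its tokens.
with open('words_demo.txt', 'r', encoding='utf-8') as word_file:
    restored = [line.split() for line in word_file]

print(restored == words)  # True
```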

Print the first 10 lines of each result file.

print("First 10 lines of the sentence tokenization result:")
with open(sentence_output_file, 'r', encoding='utf-8') as f:
    for _ in range(10):
        print(f.readline().strip())
print("\nFirst 10 lines of the word tokenization result:")
with open(word_output_file, 'r', encoding='utf-8') as f:
    for _ in range(10):
        print(f.readline().strip())
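
The loop above calls `readline()` exactly 10 times, so a file with fewer than 10 lines yields empty strings and prints blank lines. One possible alternative, shown here as a sketch rather than the notebook's own method, uses `itertools.islice` to stop at the end of the file. The names `head` and `head_demo.txt` are hypothetical.

```python
from itertools import islice


def head(path, n=10):
    """Return up to the first n lines of a text file, without trailing newlines."""
    with open(path, 'r', encoding='utf-8') as f:
        return [line.rstrip('\n') for line in islice(f, n)]


# Demo with a three-line file: only three lines come back, with no blank padding.
with open('head_demo.txt', 'w', encoding='utf-8') as f:
    f.write('one\ntwo\nthree\n')

for line in head('head_demo.txt'):
    print(line)
```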

Putting all of the steps so far together gives the complete code below.

import re
import nltk

from nltk.tokenize import sent_tokenize, word_tokenize

# Download the Punkt tokenizer data used by sent_tokenize and word_tokenize.
nltk.download('punkt_tab')

input_file = '/content/cleaned_rus_news.txt'
sentence_output_file = 'sentences.txt'
word_output_file = 'words.txt'

# Read the cleaned text.
with open(input_file, 'r', encoding='utf-8') as infile:
    text = infile.read()

# Split the text into sentences, then each sentence into word tokens.
sentences = sent_tokenize(text)
words = [word_tokenize(sentence) for sentence in sentences]

# Write one sentence per line to each output file.
with open(sentence_output_file, 'w', encoding='utf-8') as sent_file:
    for sentence in sentences:
        sent_file.write(sentence + '\n')
with open(word_output_file, 'w', encoding='utf-8') as word_file:
    for word_list in words:
        word_file.write(' '.join(word_list) + '\n')

# Show the first 10 lines of each result file.
print("First 10 lines of the sentence tokenization result:")
with open(sentence_output_file, 'r', encoding='utf-8') as f:
    for _ in range(10):
        print(f.readline().strip())
print("\nFirst 10 lines of the word tokenization result:")
with open(word_output_file, 'r', encoding='utf-8') as f:
    for _ in range(10):
        print(f.readline().strip())


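Click the arrow (run) button to execute the cell.

The output shows the first 10 lines of each result file.

Tokenization is now complete. Two files, sentences.txt and words.txt, are saved automatically.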