Bigrams

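A bigram is the n = 2 case of an n-gram model: a pair of two consecutive words in a text. In natural language processing, bigrams are mainly used to capture which words follow one another, which helps in understanding the relationships between words and their context.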
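To make the definition concrete, here is a minimal sketch of a general n-gram helper, where n = 2 gives bigrams. The function name `ngrams` and the sample word list are assumptions made for illustration and are not part of this tutorial's script.

```python
def ngrams(words, n):
  """Return every run of n consecutive items in words, as tuples."""
  # A list shorter than n yields an empty range, so the result is [].
  return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Hypothetical sample data, already lemmatized.
sample = ['the', 'cat', 'sit', 'on', 'the', 'mat']

print(ngrams(sample, 2))  # bigrams: pairs of neighboring words
print(ngrams(sample, 3))  # trigrams, for comparison
```

A list of k words produces k - 1 bigrams, so the six-word sample above yields five pairs.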

Using lemmatized_words.txt, the data preprocessed in the normalization step, as the input file, let's print the 20 most frequent bigrams.

Import the required library.
```
from collections import Counter
```

Set the file paths.

```
input_file = 'lemmatized_words.txt'
bigrams_output_file = 'bigrams.txt'
```

Create an empty list to hold the bigrams.
```
bigrams = []
```
Open the file and generate the bigrams.
```
with open(input_file, 'r', encoding='utf-8') as infile:
  for line in infile:
    words = line.strip().split() 
    bigrams.extend([(words[i], words[i + 1]) for i in range(len(words) - 1)])
```
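Each line is split on whitespace into a list of words. strip() removes leading and trailing whitespace, including the newline at the end of the line, and split() with no arguments splits on runs of whitespace. A list comprehension then builds the bigrams: for every index i from 0 to len(words) - 2, words[i] and words[i + 1] are two consecutive words, and each pair is added to the bigrams list as a tuple. Because each line is processed on its own, no bigram spans a line break, and a line with fewer than two words contributes nothing.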
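The pairing step can be tried on a single line in isolation. The sketch below uses a made-up sample sentence and compares the index-based comprehension from this tutorial with an equivalent `zip` formulation, which is offered only as an alternative and is not what the script above uses.

```python
# Hypothetical sample line, including the trailing newline a file would have.
line = 'the cat sit on the mat\n'
words = line.strip().split()

# Index-based pairing, as in the tutorial code.
by_index = [(words[i], words[i + 1]) for i in range(len(words) - 1)]

# Alternative: pair each word with its successor using zip.
by_zip = list(zip(words, words[1:]))

print(by_index)
print(by_index == by_zip)

# A one-word line has no neighbor to pair with, so it produces no bigrams.
single = 'cat\n'.strip().split()
print([(single[i], single[i + 1]) for i in range(len(single) - 1)])
```

Both forms produce the same list of tuples; the index-based version keeps words[i] and words[i + 1] visible, which matches the explanation above.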
Count how often each bigram occurs.
```
bigram_counts = Counter(bigrams)
```
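As a small illustration of what `Counter` does with a list of tuples, the sketch below counts a hand-written list of bigrams. The sample data is invented for the example.

```python
from collections import Counter

# Hypothetical bigram list; tuples are hashable, so they can be Counter keys.
sample_bigrams = [
  ('the', 'cat'), ('cat', 'sit'), ('the', 'cat'),
  ('sit', 'on'), ('the', 'cat'), ('cat', 'sit'),
]

counts = Counter(sample_bigrams)

print(counts[('the', 'cat')])  # 3
print(counts[('on', 'mat')])   # 0: a missing key counts as zero
print(len(counts))             # 3 distinct bigrams
print(counts.most_common(2))   # [(('the', 'cat'), 3), (('cat', 'sit'), 2)]
```

Note that len(counts) is the number of distinct bigrams, not the total number of pairs in the list.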

Save the results to a file.

```
with open(bigrams_output_file, 'w', encoding='utf-8') as outfile:
  for bigram, count in bigram_counts.items():
    # '\n' ends each record so that every bigram gets its own line.
    outfile.write(f"{bigram[0]} {bigram[1]}: {count}\n")
```

Each line of the file has the form word1 word2: count.
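If the saved file needs to be read back later, the word1 word2: count format can be parsed line by line. The following is a minimal sketch under the assumption that each record sits on its own line and that words contain no spaces; the helper name `parse_bigram_line` and the sample lines are hypothetical.

```python
from collections import Counter

def parse_bigram_line(line):
  """Parse one 'word1 word2: count' line into ((word1, word2), count)."""
  # Split at the last ': ' so a colon inside a word does not break parsing.
  pair, _, count = line.rstrip('\n').rpartition(': ')
  word1, word2 = pair.split(' ')
  return (word1, word2), int(count)

# Hypothetical lines, as they would appear in the output file.
lines = ['the cat: 3\n', 'cat sit: 2\n']

restored = Counter(dict(parse_bigram_line(line) for line in lines))
print(restored[('the', 'cat')])
```

Splitting from the right is a deliberate choice here: the count is always the last field, whereas the words themselves could in principle contain punctuation.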

Print the number of distinct bigrams, followed by the 20 most frequent ones.

```
print('Number of bigrams:', len(bigram_counts))
print('\nTop 20 bigrams by frequency:')
for bigram, count in bigram_counts.most_common(20):
  print(f"{bigram[0]} {bigram[1]}: {count}")
```

The output uses the same word1 word2: count format.

Putting all the steps so far together gives the complete code below.

```
from collections import Counter

# Input (lemmatized text) and output file paths.
input_file = 'lemmatized_words.txt'
bigrams_output_file = 'bigrams.txt'

bigrams = []

# Build bigrams line by line; pairs never span a line break.
with open(input_file, 'r', encoding='utf-8') as infile:
  for line in infile:
    words = line.strip().split()
    bigrams.extend([(words[i], words[i + 1]) for i in range(len(words) - 1)])

# Count how often each (word1, word2) tuple occurs.
bigram_counts = Counter(bigrams)

# Write one "word1 word2: count" record per line.
with open(bigrams_output_file, 'w', encoding='utf-8') as outfile:
  for bigram, count in bigram_counts.items():
    outfile.write(f"{bigram[0]} {bigram[1]}: {count}\n")

print('Number of bigrams:', len(bigram_counts))
print('\nTop 20 bigrams by frequency:')
for bigram, count in bigram_counts.most_common(20):
  print(f"{bigram[0]} {bigram[1]}: {count}")
```


Click the arrow button to run the cell.

The output is displayed as follows.

The bigram extraction is now complete. The script saves the results to bigrams.txt automatically.