😎 First time meeting a studying 징징알파카 (Jingjing Alpaca)?

2. Measuring document similarity with NLTK: "Who is the original author of this novel?"

징징알파카 (Jingjing Alpaca) 2022. 10. 17. 15:55

Written on 2022-10-17

<This post was written as study notes, with reference to the GitHub repository for the book 실전 파이썬 핸즈온 프로젝트 (Practical Python Hands-On Projects)>

https://www.onlybook.co.kr/m/entry/python-projects

 

실전 파이썬 핸즈온 프로젝트 (Practical Python Hands-On Projects): building your own Python portfolio to strengthen problem-solving and practical application skills. Written by Lee Vaughan (리 본) | Translated by Oh Hyun-seok (오현석) | 420 pages | 28,000 KRW | Published May 31, 2022 | 185*240*20 | ISBN13 9791189909406

💕 NLTK

: Short for Natural Language Toolkit, a Python package for natural language processing and analysis

: Provides a range of features such as tokenization, morphological analysis, and part-of-speech tagging

 

💕 Measuring document similarity with NLTK

: ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ๋ฅผ ํ†ตํ•ด ์•„์„œ ์ฝ”๋‚œ ๋„์ผ์ด๋‚˜ H. G. ์›ฐ์Šค ์ค‘ ๋ˆ„๊ฐ€ ใ€Ž์žƒ์–ด๋ฒ„๋ฆฐ ์„ธ๊ณ„ใ€๋ฅผ ์ผ๋Š”์ง€๋ฅผ ๊ฒฐ์ •

: NLTK, matplotlib ๋“ฑ์˜ ๋ชจ๋“ˆ์€ ๋ฌผ๋ก ์ด๊ณ  ๋ถˆ์šฉ์–ด(stop words), ํ’ˆ์‚ฌ, ์–ดํœ˜์˜ ํ’๋ถ€ํ•จ, ์ž์นด๋“œ ์œ ์‚ฌ์„ฑ(Jaccard similarity) ๋“ฑ์˜ ์Šคํƒ€์ผ๋กœ๋ฉ”ํŠธ๋ฆฌ(stylometry) ๊ธฐ๋ฒ•์„ ํ™œ์šฉ
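
Of the techniques listed above, Jaccard similarity is the easiest to show in isolation: it is the size of the intersection of two sets divided by the size of their union. The sketch below is a minimal, standard-library illustration on toy word lists. The function name `jaccard_similarity` and the sample sentences are assumptions made for this example, not code from the book; a real run would presumably use NLTK tokens from the three novels instead.

```python
def jaccard_similarity(words_a, words_b):
    """Return |A & B| / |A | B| for the unique words in two token lists."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    if not union:  # both inputs empty: define the similarity as 0.0
        return 0.0
    return len(set_a & set_b) / len(union)


# Toy data (hypothetical), split on whitespace for simplicity.
known = "the hound ran over the moor".split()
unknown = "the beast ran over the plateau".split()

# 3 shared words out of 7 distinct words overall.
print(jaccard_similarity(known, unknown))
```

A score closer to 1.0 means the two vocabularies overlap more, so the unknown text would be compared against each known author and the higher score taken as weak evidence of authorship.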

# Build a punctuation heatmap
import math
from string import punctuation      # ASCII punctuation marks such as quotes, periods, and question marks
import nltk
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns

PUNCT_SET = set(punctuation)

def main():
    # Load the text files into a dictionary keyed by author
    strings_by_author = dict()
    strings_by_author['doyle'] = text_to_string('hound.txt')
    strings_by_author['wells'] = text_to_string('war.txt')
    strings_by_author['unknown'] = text_to_string('lost.txt')

    # Tokenize each text, keeping only the punctuation marks
    punct_by_author = make_punct_dict(strings_by_author)

    # Convert the punctuation marks to numeric values and plot the heatmaps
    plt.ion()  # interactive mode, so plt.show() does not block

    for author in punct_by_author:
        heat = convert_punct_to_number(punct_by_author, author)
        arr = np.array(heat[:6561])  # trim to the largest square size (81 * 81)
        arr_reshaped = arr.reshape(int(math.sqrt(len(arr))),
                                   int(math.sqrt(len(arr))))
        fig, ax = plt.subplots(figsize=(7, 7))
        sns.heatmap(arr_reshaped,
                    cmap=ListedColormap(['blue', 'yellow']),    # colormap object built from the listed colors
                    square=True,
                    ax=ax)
        ax.set_title('Heatmap Semicolons {}'.format(author))
    plt.show()

# # """Read a text file and return a list of strings"""
# def text_to_string(filename):
#     strings = []
#     with open(filename) as f:
#         strings.append(f.read())
#     return '\n'.join(strings)

def text_to_string(filename):
    """Read a text file and return a string."""
    with open(filename) as infile:
        return infile.read()

def make_punct_dict(strings_by_author):
    """Return dictionary of tokenized punctuation by corpus by author."""
    punct_by_author = dict()

    for author in strings_by_author:
        # Split the text into word-level tokens
        tokens = nltk.word_tokenize(strings_by_author[author])

        # Keep only the tokens that are punctuation marks
        punct_by_author[author] = ([token for token in tokens
                                    if token in PUNCT_SET])
        print("Number of punctuation marks in {} = {}"
              .format(author, len(punct_by_author[author])))
    return punct_by_author


def convert_punct_to_number(punct_by_author, author):
    """Return list of punctuation marks converted to numerical values."""
    heat_vals = []
    for char in punct_by_author[author]:
        # Semicolons become 1; every other punctuation mark becomes 2
        if char == ';':
            value = 1
        else:
            value = 2
        heat_vals.append(value)
    return heat_vals

if __name__ == '__main__':
    main()
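
The slice `heat[:6561]` in `main()` hard-codes 81 * 81, so every heatmap is drawn on a square grid of the same size. As a hedged sketch of how such a limit could be derived instead of typed in, the helper below takes the punctuation counts for each author and returns the side length of the largest square that the shortest list can fill. The name `largest_common_square` and the sample counts are hypothetical, not from the original script, and `math.isqrt` requires Python 3.8 or later.

```python
import math


def largest_common_square(counts):
    """Return the side of the largest square grid that every count can fill."""
    counts = list(counts)
    if not counts:
        raise ValueError("counts must not be empty")
    # The shortest list limits the grid; isqrt gives the integer square root.
    return math.isqrt(min(counts))


# Hypothetical punctuation counts per author, for illustration only.
counts_by_author = {"doyle": 10000, "wells": 7000, "unknown": 6700}
side = largest_common_square(counts_by_author.values())
print(side, side * side)  # prints "81 6561"
```

With a value like this, each author's list could be trimmed with `heat[:side * side]` and reshaped to `(side, side)`, which keeps the heatmaps comparable without relying on a fixed number.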