😎 공부하는 징징알파카는 처음이지? (First time meeting a studying Jingjing Alpaca?)

3. Summarizing Text with NLTK – "Let's extract the key sentences and make a word cloud"


징징알파카 (Jingjing Alpaca) 2022. 10. 17. 15:59

Written 2022-10-17

<This post was written while studying, with reference to the GitHub repository of the book 실전 파이썬 핸즈온 프로젝트 (Practical Python Hands-On Projects)>

https://www.onlybook.co.kr/m/entry/python-projects

 

실전 파이썬 핸즈온 프로젝트 (Practical Python Hands-On Projects)

Building your own Python portfolio to strengthen problem solving and practical skills. By Lee Vaughan | Translated by Oh Hyun-seok | 420 pages | 28,000 KRW | Published May 31, 2022 | 185*240*20 | ISBN13 9791189909406

www.onlybook.co.kr

 

 

💕 Summarizing Text with NLTK
: Scrape a famous speech from the web, such as Martin Luther King Jr.'s "I Have a Dream", and summarize its main points. The same approach can turn the text of a novel into attractive advertising or promotional copy.
: Uses BeautifulSoup, Requests, regex, NLTK, collections, wordcloud, matplotlib, and more
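Before the full script, here is a minimal sketch of the idea behind it: count how often each non-stop word occurs, then score every sentence by the average frequency of its words and keep the top few. It uses only the standard library, so the regex-based sentence splitter, the tiny `STOP_WORDS` set, the `summarize` helper, and the sample text are illustrative assumptions and not part of the book's code, which relies on NLTK for tokenizing and stop words.

```python
import re
from collections import Counter

# Tiny stand-in stop-word list (assumption); the script uses NLTK's list.
STOP_WORDS = {"a", "the", "is", "of", "and", "to", "in", "on", "it", "that"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def word_frequencies(text):
    """Count lowercase alphabetic words, skipping stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def score_sentences(text, max_words):
    """Score each sentence as the mean frequency of its words."""
    freq = word_frequencies(text)
    scores = {}
    for sent in split_sentences(text):
        words = re.findall(r"[a-z]+", sent.lower())
        if not words or len(words) > max_words:
            scores[sent] = 0.0  # empty or too long: left with a zero score
            continue
        # Stop words add 0 to the sum but still count toward the length.
        scores[sent] = sum(freq[w] for w in words) / len(words)
    return scores


def summarize(text, num_sents, max_words=25):
    """Return the num_sents highest-scoring sentences."""
    ranked = Counter(score_sentences(text, max_words)).most_common(num_sents)
    return [sent for sent, _ in ranked]


# Made-up sample text, only for illustration.
sample = ("Cats sleep a lot. Dogs bark. Cats chase mice and cats purr. "
          "The weather is nice.")
print(summarize(sample, num_sents=2))
# -> ['Cats chase mice and cats purr.', 'Cats sleep a lot.']
```

Dividing by the sentence length keeps long sentences from winning just because they contain more words, and the `max_words` cutoff keeps the summary lines short.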
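"""
Summarize a speech by scoring its sentences on word frequency.

To run this program install requests, beautifulsoup4 and nltk, and download
the NLTK 'stopwords' and 'punkt' data ('punkt_tab' on recent NLTK releases).
"""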

from collections import Counter
import re
import requests
import bs4
import nltk
from nltk.corpus import stopwords

def main():
    # Use web scraping to obtain the text.
    url = 'http://www.analytictech.com/mb021/mlk.htm'
    page = requests.get(url)
    page.raise_for_status()
    soup = bs4.BeautifulSoup(page.text, 'html.parser')
    p_elems = [element.text for element in soup.find_all('p')]

    speech = ' '.join(p_elems)  # Make sure to join on a space!

    # Fix a typo, then remove extra whitespace, digits, and punctuation.
    speech = speech.replace(')mowing', 'knowing')
    speech = re.sub(r'\s+', ' ', speech)
    speech_edit = re.sub(r'[^a-zA-Z]', ' ', speech)
    speech_edit = re.sub(r'\s+', ' ', speech_edit)

    # Request input.
    while True:
        max_words = input("Enter max words per sentence for summary: ")
        num_sents = input("Enter number of sentences for summary: ")
        if max_words.isdigit() and num_sents.isdigit():
            break
        else:
            print("\nInput must be in whole numbers.\n")
                      
    # Run the functions that generate the sentence scores.
    speech_edit_no_stop = remove_stop_words(speech_edit)
    word_freq = get_word_freq(speech_edit_no_stop)
    sent_scores = score_sentences(speech, word_freq, max_words)

    # Print the top-ranked sentences.
    counts = Counter(sent_scores)
    summary = counts.most_common(int(num_sents))
    print("\nSUMMARY:")
    for i in summary:
        print(i[0])

def remove_stop_words(speech_edit):
    """Remove stop words from string and return string."""
    stop_words = set(stopwords.words('english'))
    # Keep only the tokens that are not English stop words.
    kept_words = [word for word in nltk.word_tokenize(speech_edit)
                  if word.lower() not in stop_words]
    return ' '.join(kept_words)

def get_word_freq(speech_edit_no_stop):
    """Return a dictionary of word frequency in a string."""
    word_freq = nltk.FreqDist(nltk.word_tokenize(speech_edit_no_stop.lower()))
    return word_freq

def score_sentences(speech, word_freq, max_words):
    """Return dictionary of sentence scores based on word frequency."""
    sent_scores = dict()

    # Tokenize the speech into sentences.
    sentences = nltk.sent_tokenize(speech)
    for sent in sentences:
        sent_scores[sent] = 0
        words = nltk.word_tokenize(sent.lower())
        sent_word_count = len(words)
        # Sentences longer than max_words keep a score of 0.
        if sent_word_count <= int(max_words):
            for word in words:
                if word in word_freq:
                    sent_scores[sent] += word_freq[word]
            # Normalize by length so long sentences are not favored.
            sent_scores[sent] = sent_scores[sent] / sent_word_count
    return sent_scores

if __name__ == '__main__':
    main()
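
The title and the library list mention a word cloud, but the script above stops at the summary. The following is a rough sketch of how that step could look, assuming the `wordcloud` and `matplotlib` packages are installed. The helper names `top_word_counts` and `show_word_cloud`, the small `STOP_WORDS` set, and the sample sentence are made up for illustration and do not come from the book's code.

```python
import re
from collections import Counter

# Small illustrative stop-word list (assumption); swap in
# nltk.corpus.stopwords.words('english') to match the script above.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "we", "i"}


def top_word_counts(text, limit=50):
    """Return {word: count} for the most frequent non-stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return dict(counts.most_common(limit))


def show_word_cloud(frequencies):
    """Draw the word frequencies as a word cloud in a matplotlib window."""
    # Imported here so top_word_counts works without the plotting packages.
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()


# Made-up sample sentence; in the script this would be the scraped speech.
text = "I have a dream. We have a dream of freedom, and freedom will ring."
freqs = top_word_counts(text, limit=3)
print(freqs)
```

Calling `show_word_cloud(freqs)` opens a window in which more frequent words are drawn larger. With the script above, the cleaned speech text could be passed to `top_word_counts` in place of the sample sentence.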