😎 First time meeting a studying 징징알파카 (whiny alpaca)?

4. Sending Secure Messages with Encryption Techniques – “Let's Write an Undecipherable Ciphertext”



징징알파카 2022. 10. 20. 11:29

Written 221020

<This post was written while studying from the GitHub repository for the book 실전 파이썬 핸즈온 프로젝트 (Real World Python).>

https://www.onlybook.co.kr/m/entry/python-projects

 

실전 파이썬 핸즈온 프로젝트 (Real World Python): Building your own Python portfolio to strengthen problem-solving and practical skills. Written by Lee Vaughan | translated by 오현석 | 420 pages | 28,000 won | published May 31, 2022 | 185*240*20 | ISBN13 9791.

https://github.com/rlvaugh/Real_World_Python

 

GitHub - rlvaugh/Real_World_Python: Code and supporting files for book Real World Python


 

 

🖤 Writing Ciphertext with NLTK

: Recreate in digital form the one-time pad scheme from Ken Follett's best-selling spy novel The Key to Rebecca, and share ciphertext that no one else can crack with your friends.

: Uses the collections module
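
The idea described above can be sketched roughly as follows. This is a minimal illustration and not the book's code: it assumes both parties hold the same key text (a short stand-in sentence here), and the names `build_index`, `encrypt`, and `decrypt` are hypothetical. It uses `defaultdict` from the collections module to record every position of each character in the key text, then replaces each message character with one of those positions chosen at random.

```python
import random
from collections import defaultdict


def build_index(key_text):
    """Map each character of the shared key text to every position where it occurs."""
    index = defaultdict(list)
    for position, char in enumerate(key_text):
        index[char].append(position)
    return index


def encrypt(message, key_text, rng=random):
    """Replace each character with a randomly chosen position of it in the key text."""
    index = build_index(key_text)
    cipher = []
    for char in message:
        if char not in index:
            raise ValueError(f"{char!r} does not occur in the key text")
        cipher.append(rng.choice(index[char]))
    return cipher


def decrypt(cipher, key_text):
    """Look each position up in the same key text to recover the message."""
    return ''.join(key_text[position] for position in cipher)


# Stand-in for a shared book; an assumption made only for this example.
key_text = "the quick brown fox jumps over the lazy dog"
secret = encrypt("meet at dusk", key_text)
print(secret)                     # a list of positions, different on each run
print(decrypt(secret, key_text))  # meet at dusk
```

A scheme like this is only as strong as the secrecy of the key text. A true one-time pad also requires random key material that is never reused, which this sketch does not enforce.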

 

from collections import Counter
import re
import requests
import bs4
import nltk
from nltk.corpus import stopwords

def main():
    # Get the text of the speech by web scraping
    url = 'http://www.analytictech.com/mb021/mlk.htm'
    page = requests.get(url)
    page.raise_for_status()
    # BeautifulSoup extracts data from HTML and XML documents
    soup = bs4.BeautifulSoup(page.text, 'html.parser')
    p_elems = [element.text for element in soup.find_all('p')]

    speech = ' '.join(p_elems)  # Make sure to join on a space!

    # Fix a typo and remove extra whitespace, digits, and punctuation
    speech = speech.replace(')mowing', 'knowing')
    speech = re.sub(r'\s+', ' ', speech)
    speech_edit = re.sub(r'[^a-zA-Z]', ' ', speech)
    speech_edit = re.sub(r'\s+', ' ', speech_edit)

    # Ask the user for the summary limits
    while True:
        max_words = input("Enter max words per sentence for summary: ")
        num_sents = input("Enter number of sentences for summary: ")
        if max_words.isdigit() and num_sents.isdigit():
            break
        else:
            print("\nInput must be in whole numbers.\n")
                      
    # Run the functions that generate the sentence scores
    speech_edit_no_stop = remove_stop_words(speech_edit)
    word_freq = get_word_freq(speech_edit_no_stop)
    sent_scores = score_sentences(speech, word_freq, max_words)

    # Print the top-ranked sentences
    counts = Counter(sent_scores)
    summary = counts.most_common(int(num_sents))
    print("\nSUMMARY:")
    for i in summary:
        print(i[0])

def remove_stop_words(speech_edit):
    """Remove stop words from a string and return the string."""
    stop_words = set(stopwords.words('english'))
    speech_edit_no_stop = ''
    for word in nltk.word_tokenize(speech_edit):
        if word.lower() not in stop_words:
            speech_edit_no_stop += word + ' '  
    return speech_edit_no_stop

def get_word_freq(speech_edit_no_stop):
    """Return a word frequency dictionary for a string."""
    word_freq = nltk.FreqDist(nltk.word_tokenize(speech_edit_no_stop.lower()))
    return word_freq

def score_sentences(speech, word_freq, max_words):
    """Return a dictionary of sentence scores based on word frequency."""
    sent_scores = dict()
    sentences = nltk.sent_tokenize(speech)
    for sent in sentences:
        sent_scores[sent] = 0
        words = nltk.word_tokenize(sent.lower())
        sent_word_count = len(words)
        if sent_word_count <= int(max_words):
            for word in words:
                if word in word_freq.keys():
                    sent_scores[sent] += word_freq[word]
            sent_scores[sent] = sent_scores[sent] / sent_word_count
    return sent_scores

if __name__ == '__main__':
    main()
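
To see how the scoring in the listing above behaves without downloading a page or the NLTK data, here is a small standard-library sketch. It is a simplification and not the listing itself: a regex tokenizer and a tiny hand-written `STOP_WORDS` set stand in for NLTK (both are assumptions for this example), and sentences longer than `max_words` are skipped rather than stored with a score of 0.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; the listing uses NLTK's English stop words.
STOP_WORDS = {'a', 'an', 'and', 'the', 'is', 'of', 'to', 'in', 'on', 'it'}


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r'[a-z]+', text.lower())


def score_sentences(text, max_words):
    """Return {sentence: average frequency of its words}, skipping overlong sentences."""
    word_freq = Counter(w for w in tokenize(text) if w not in STOP_WORDS)
    scores = {}
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    for sent in re.split(r'(?<=[.!?])\s+', text.strip()):
        words = tokenize(sent)
        if words and len(words) <= max_words:
            # Stop words are absent from word_freq, so they contribute 0.
            scores[sent] = sum(word_freq[w] for w in words) / len(words)
    return scores


text = "Cats chase mice. Mice fear cats and dogs. The weather is mild."
scores = score_sentences(text, max_words=10)
for sentence, score in Counter(scores).most_common(2):
    print(f"{score:.2f}  {sentence}")
```

Dividing by the sentence length keeps long sentences from winning simply because they contain more words, and `Counter.most_common` then picks the highest-scoring sentences, as in `main()`.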

 

 

 
