😎 공부하는 징징알파카는 처음이지?

[Kaggle] Best Book to Read in 2021: Exploratory Data Analysis (plotly Visualization)


징징알파카 2022. 11. 11. 09:55

<This post was written as study notes, with reference to dhelee's blog and Kaggle>

https://velog.io/@dhelee/TIL-Day22-%EB%8D%B0%EC%9D%B4%ED%84%B0-%EC%8B%9C%EA%B0%81%ED%99%94-%EC%9B%B9-%ED%8E%98%EC%9D%B4%EC%A7%80-%EB%A7%8C%EB%93%A4%EA%B8%B0#2-flask-%EC%9B%B9-%EB%A7%8C%EB%93%A4%EA%B8%B0

 

[TIL Day22] Building a Data Visualization Web Page

1. Exploratory data analysis 2. Building a Flask web app 3. Deploying the web page with pythonanywhere

velog.io

https://www.kaggle.com/datasets/shashwatwork/best-book-ever-data-for-2021?select=books_1.Best_Books_Ever.csv 

 

 

🍀 Content

The dataset contains 25 variables and 52,478 records for the books on the GoodReads Best Books Ever list.

The data was retrieved in two sets: the first 30,000 books, then the remaining 22,478.

 

🍀 Exploratory Data Analysis

🔷 1. Loading the Data & Libraries

 
import pandas as pd
import numpy as np

import plotly.figure_factory as ff
import plotly.offline as py 
import statistics
import plotly.express as px
import matplotlib.pyplot as plt
# Read only the columns used in this analysis
data = pd.read_csv('Best_Books_ever.csv', usecols=['title', 'series', 'author', 'rating', 'language', 'genres', 'characters', 'pages', 'publishDate', 'awards', 'numRatings', 'likedPercent', 'price'])
data.head()

data.info()

 

 

 

🔷 2. Data Preprocessing

  • To change a column's datatype, use pd.to_numeric()
  • With errors='coerce', values that cannot be converted become NaN instead of raising an error
  • Plot data.isnull().mean(axis=0).plot.barh() to check the ratio of missing values per column
data['price'] = pd.to_numeric(data['price'], errors='coerce')
data['pages'] = pd.to_numeric(data['pages'], errors='coerce')
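To see what errors='coerce' does in isolation, here is a minimal sketch on a made-up Series (the sample values are assumptions for illustration, not rows from the dataset):

```python
import pandas as pd

# Hypothetical raw values: two clean numbers, one malformed string, one missing
raw = pd.Series(['5.99', '12.50', '1.743.28', None])

# errors='coerce' turns anything unparseable into NaN instead of raising
prices = pd.to_numeric(raw, errors='coerce')

print(prices.tolist())
print(prices.isnull().sum())  # number of values that became NaN
```

Without errors='coerce', the malformed string would raise a ValueError and stop the conversion.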
  • Checking missing values
# Missing value chart
data.isnull().mean(axis=0).plot.barh()
plt.title("Ratio of missing values per column")

 

  • Drop the rows that have no price or pages
# Keep only rows where both price and pages are present
data.dropna(subset=['price', 'pages'], inplace=True)
data.reset_index(drop=True, inplace=True) # reindex so the positional loops below line up
# Missing value chart (after dropping rows)
data.isnull().mean(axis=0).plot.barh()
plt.title("Ratio of missing values per column")
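The missing-ratio check and the row removal can be tried on a tiny frame first. This is a sketch with made-up values; the names df and clean are only for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical mini-frame with gaps in price and pages
df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'price': [5.0, np.nan, 7.5, 3.0],
    'pages': [300.0, 250.0, np.nan, 120.0],
})

# isnull() gives booleans; their column-wise mean is the missing ratio
ratio_before = df.isnull().mean(axis=0)

# Keep only rows where both price and pages are present, then renumber
clean = df.dropna(subset=['price', 'pages']).reset_index(drop=True)
ratio_after = clean.isnull().mean(axis=0)

print(ratio_before)
print(ratio_after)
```

Resetting the index matters here because later steps loop over range(len(data)) and look rows up by position.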

 

  • Derive new columns for a more effective EDA
# Whether the book is part of a series: 'is_series' (1 = series, 0 = not)
data['is_series'] = data['series'].notnull().astype(int)


# Number of characters: 'num_characters' (0 when the list is empty)
# Note: splitting on ',' overcounts entries whose names contain a comma
data['num_characters'] = data['characters'].apply(
    lambda s: 0 if s == '[]' else len(s.split(',')))


# Number of awards received: 'num_awards' (0 when the list is empty)
data['num_awards'] = data['awards'].apply(
    lambda s: 0 if s == '[]' else len(s.split(',')))
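Counting by split(',') overcounts when an entry itself contains a comma. A possible alternative, assuming each cell holds a Python-style list literal such as "['a', 'b']" (which the '[]' check suggests), is to parse the cell before counting. The helper name count_items and the sample strings are made up for illustration:

```python
import ast

def count_items(cell):
    """Count the entries in a string that holds a Python list literal."""
    # literal_eval safely parses "['a', 'b']" into a real list
    return len(ast.literal_eval(cell))

# Hypothetical cell values
empty = "[]"
two_awards = "['Award A (2001)', 'Award B, Best Novel (2002)']"

print(count_items(empty))
print(count_items(two_awards))
print(len(two_awards.split(',')))  # naive comma split, for comparison
```

It could be applied with data['awards'].apply(count_items), provided every cell really is a valid list literal.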

 

  • Each book has several genres (1:M), so pick a single main_genre
    • Count the books per genre and take the 15 most frequent genres as the main_genre categories
    • Everything else is classified as etc
# Count the frequency of each genre
genre_dict = {}
for genres in data['genres']:
    if genres == '[]':
        continue
    # Strip the leading "['" and trailing "']", then split on the "', '" separator
    lst = genres[2:-2].split("', '")
    for s in lst:
        genre_dict[s] = genre_dict.get(s, 0) + 1
genre_dict

 

# Keep only the top 15 genres; the rest become etc
import operator
genre_lst = sorted(genre_dict.items(), key=operator.itemgetter(1), reverse=True)[:15]

# Selected genre categories
genre_lst
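The count-then-take-top-N step can also be written with collections.Counter. A minimal sketch on made-up genre lists (the books variable is hypothetical):

```python
from collections import Counter

# Hypothetical genre lists, one per book
books = [
    ['Fiction', 'Fantasy'],
    ['Fiction', 'Romance'],
    ['Nonfiction'],
    ['Fiction', 'Fantasy', 'Young Adult'],
]

# Counter tallies every genre across all books
genre_counts = Counter(g for genres in books for g in genres)

# most_common(n) returns the n most frequent (genre, count) pairs
top2 = genre_counts.most_common(2)
print(top2)
```

most_common(15) would give the same kind of (genre, count) list as genre_lst.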

 

  • Final dataset used for the analysis
# Assign each book to a main genre; books that match none become etc
data['main_genre'] = 'etc'
for i in range(len(data)):
    for g, num in genre_lst:
        # Substring test on the raw string: 'Fiction' also matches 'Historical Fiction'
        if g in data['genres'][i]:
            data.loc[i, 'main_genre'] = g  # .loc avoids chained assignment
            break
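The `g in data['genres'][i]` check is a substring test on the raw string, so a frequent short name can capture books from a more specific genre. The sketch below contrasts it with an exact match on the parsed list. The ranking, the sample cell, and both function names are made up, and it assumes the cell is a Python-style list literal:

```python
import ast

# Hypothetical ranking, most frequent first
top_genres = ['Fiction', 'Fantasy', 'Nonfiction']

def main_genre_substring(cell):
    """First top genre that appears anywhere in the raw string."""
    return next((g for g in top_genres if g in cell), 'etc')

def main_genre_exact(cell):
    """First top genre that equals one of the book's genre names."""
    genres = set(ast.literal_eval(cell))
    return next((g for g in top_genres if g in genres), 'etc')

cell = "['Historical Fiction', 'Fantasy']"  # made-up example row
print(main_genre_substring(cell))
print(main_genre_exact(cell))
```

Which behavior is preferable depends on whether 'Historical Fiction' should count as 'Fiction' for this analysis.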

 

  • Check the final dataset
del data['series']
del data['genres']
del data['characters']
del data['awards']

data.head(5)

 

 

🔷 3. Exploratory Data Analysis

data.describe()

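  • Draw a heatmap to see the correlations between the numeric variables
    • With plotly.express, px.imshow(data.corr(numeric_only=True)) draws the heatmap
# numeric_only=True skips text columns such as title and author (pandas 2.0+ raises without it)
fig = px.imshow(data.corr(numeric_only=True), template='plotly_dark', title='Heatmap')
fig.show()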
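On pandas 2.0 and later, corr() raises an error if the frame still contains text columns, unless numeric_only=True is passed. A minimal sketch of the matrix that the heatmap displays, on a made-up frame:

```python
import pandas as pd

# Hypothetical mini-frame mixing text and numeric columns
df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'rating': [4.1, 3.9, 4.5, 4.0],
    'pages': [320, 210, 540, 280],
    'price': [6.4, 4.2, 10.8, 5.6],  # exactly proportional to pages
})

# Pearson correlation over numeric columns only; 'title' is skipped
corr = df.corr(numeric_only=True)
print(corr.round(2))
```

Each cell of the heatmap is one entry of this matrix, so the diagonal is always 1.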

 

 

  • To see whether series books and standalone books are rated differently, compare the rating distributions of the two groups
    • Draw a distplot with plotly.figure_factory
    • ff.create_distplot(hist_data, group_labels, bin_size=.2, colors=colors)
# Rating distributions of series vs. standalone books

# group data
hist_data = [data[data['is_series'] == 1]['rating'], data[data['is_series'] == 0]['rating']]
group_labels = ['is_series', 'not_series']
colors = ['#2BCDC1', '#F66095']

 

# create distplot
fig = ff.create_distplot(hist_data, group_labels, bin_size=.2, colors=colors)
fig.update_layout(title_text='Rating Distribution', template='plotly_dark')
fig.show()
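A numeric summary per group can back up what the distplot shows. This is a sketch on made-up ratings, using the same 0/1 is_series encoding:

```python
import pandas as pd

# Hypothetical ratings; is_series follows the 0/1 encoding used above
df = pd.DataFrame({
    'is_series': [1, 1, 1, 0, 0, 0],
    'rating': [4.2, 4.0, 4.4, 3.8, 4.0, 3.6],
})

# Count, center and spread of rating for each group
summary = df.groupby('is_series')['rating'].agg(['count', 'mean', 'median', 'std'])
print(summary)
```

Running the same groupby on data gives the actual group means and spreads to read alongside the plot.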

 

  • Draw a boxplot to compare the rating distributions across genres
    • px.box(data, x="main_genre", y="rating", color='main_genre')
# Rating distribution by genre

fig = px.box(data, x="main_genre", y="rating", color='main_genre', template='plotly_dark')
fig['layout'].update(title='Rating Distributions by Genre')
fig.show()
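For reading the boxes: each one spans the first to third quartile, and points far outside that span are drawn as outliers. A rough sketch of that calculation with pandas on made-up ratings follows. plotly may compute quartiles slightly differently, so treat the numbers as approximate:

```python
import pandas as pd

# Hypothetical ratings for one genre
ratings = pd.Series([3.0, 3.8, 3.9, 4.0, 4.1, 4.2, 4.3])

q1, median, q3 = ratings.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are the usual outlier candidates
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ratings[(ratings < lower_fence) | (ratings > upper_fence)]

print(q1, median, q3)
print(outliers.tolist())
```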

 

  • A book that is too thick probably won't get read much! -> visualize the relationship between likedPercent and pages
    • Use a density heatmap
    • px.density_heatmap(data, x="pages", y="likedPercent", marginal_x="histogram", marginal_y="histogram", range_x=[0, 500], range_y=[80, 100])
# Density heatmap of likedPercent vs. pages

fig = px.density_heatmap(data, x="pages", y="likedPercent", marginal_x="histogram", marginal_y="histogram", range_x=[0, 500], range_y=[80, 100], template='plotly_dark')
fig['layout'].update(title='Density Heatmap of LikedPercent vs Pages')
fig.show()

The 200~229 page range appears to do better than the 440~449 range

Books of 300~349 pages are the most popular
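Observations like these can be checked numerically by binning pages and summarizing each bin. A sketch on made-up rows follows; the 100-page bin width is an arbitrary choice for illustration, not the heatmap's own binning:

```python
import pandas as pd

# Hypothetical books
df = pd.DataFrame({
    'pages': [120, 210, 220, 310, 330, 340, 445, 480],
    'likedPercent': [88, 95, 93, 92, 94, 96, 85, 90],
})

# 100-page bins over the same 0-500 range the heatmap shows
bins = list(range(0, 501, 100))
df['page_bin'] = pd.cut(df['pages'], bins=bins, right=False)

# Number of books and median likedPercent in each page range
by_bin = df.groupby('page_bin', observed=True)['likedPercent'].agg(['count', 'median'])
print(by_bin)
```

The count column corresponds to the density in the heatmap, while the median gives a direct comparison of likedPercent between page ranges.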

 

 

  • Find out which genres the books on the best books list belong to, and in what proportions
    • pie chart
    • px.pie(df2, values=values, names=labels)
# Share of each genre

# count values by main_genre
df2 = pd.DataFrame(data['main_genre'].value_counts()).reset_index()
df2.columns = ['main_genre', 'counts']

labels = df2['main_genre'].tolist()
values = df2['counts'].tolist()

fig = px.pie(df2, values=values, names=labels, template='plotly_dark')
fig.update_traces(textposition='inside')
fig.update_layout(uniformtext_minsize=12, uniformtext_mode='hide')
fig['layout'].update(title='Genre Ratio')  # boxmode applies to box plots only, so it is dropped
fig.show()
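The percentages shown on the pie can also be obtained directly with value_counts(normalize=True). A minimal sketch on a made-up main_genre column:

```python
import pandas as pd

# Hypothetical main_genre column
main_genre = pd.Series(['Fiction', 'Fiction', 'Fantasy', 'etc', 'Fiction'])

# normalize=True returns fractions instead of raw counts
share = main_genre.value_counts(normalize=True).mul(100).round(1)
print(share)
```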

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 
