[Kaggle] Best Book to Read in 2021: Exploratory Data Analysis (plotly visualization)
징징알파카 2022. 11. 11. 09:55
<This post was written while studying, with reference to dhelee's blog and Kaggle.>
[TIL Day22] Building a data visualization web page
1. Exploratory data analysis 2. Building a Flask web app 3. Deploying the web page with pythonanywhere
velog.io
Contents
The dataset contains 52,478 records with 25 variables, covering the books on the GoodReads Best Books Ever list.
The data was retrieved in two sets: the first 30,000 books, then the remaining 22,478.
Exploratory Data Analysis
🔷 1. Loading the data & libraries
import pandas as pd
import numpy as np
import plotly.figure_factory as ff
import plotly.offline as py
import statistics
import plotly.express as px
import matplotlib.pyplot as plt
data = pd.read_csv('Best_Books_ever.csv', usecols=['title', 'series', 'author', 'rating', 'language', 'genres', 'characters', 'pages', 'publishDate', 'awards', 'numRatings', 'likedPercent', 'price'])
data.head()
data.info()
🔷 2. Data preprocessing
- Use pd.to_numeric() to change a column's datatype
- With errors='coerce', values that would raise a parsing error are converted to NaN instead
- Plot data.isnull().mean(axis=0).plot.barh() to check the ratio of missing values per column
data['price'] = pd.to_numeric(data['price'], errors='coerce')
data['pages'] = pd.to_numeric(data['pages'], errors='coerce')
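As a minimal sketch of what errors='coerce' does, the snippet below runs pd.to_numeric on a small made-up Series. The sample strings are hypothetical and only stand in for the kind of unparseable entries a raw price column might hold.

```python
import pandas as pd

# Hypothetical sample values; not taken from the actual dataset
raw = pd.Series(["5.20", "12", "1.267.61", None, "free"], dtype="object")

# errors='coerce' replaces anything that cannot be parsed as a number with NaN
price = pd.to_numeric(raw, errors="coerce")

print(price.tolist())
print("unparseable or missing:", price.isnull().sum())
```

Without errors='coerce', the default errors='raise' would stop at the first string that is not a valid number.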
- Check the missing values
'''Missing Value Chart'''
data.isnull().mean(axis=0).plot.barh()
plt.title("Ratio of missing values per columns")
- Remove the rows where price or pages is missing
data.drop(data[data['price'].isnull()].index, inplace=True)
data.drop(data[data['pages'].isnull()].index, inplace=True)
data.reset_index(drop=True, inplace=True) # reindex
'''Missing Value Chart'''
data.isnull().mean(axis=0).plot.barh()
plt.title("Ratio of missing values per columns")
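The drop-by-index pattern above can also be written with dropna(subset=...). The sketch below shows that equivalent form on a hypothetical four-row frame, together with the per-column missing ratio that the bar chart displays.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the book data
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "price": [3.5, np.nan, 7.0, 1.0],
    "pages": [120.0, 300.0, np.nan, 88.0],
})

# Share of missing values per column (the quantity plotted with .plot.barh())
missing_ratio = df.isnull().mean(axis=0)

# Drop rows lacking price or pages, then renumber the index from 0
cleaned = df.dropna(subset=["price", "pages"]).reset_index(drop=True)

print(missing_ratio)
print(cleaned)
```

Resetting the index matters here because the later loops access rows by position-like labels 0..len(data)-1.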
- Derive new columns for a more effective EDA
# Whether the book is part of a series: 'is_series'
data['is_series'] = 1
data.loc[data['series'].isnull(), 'is_series'] = 0
# Number of characters: 'num_characters'
data['num_characters'] = 0
for i in range(len(data)):
    if data['characters'][i] == '[]':
        continue
    else:
        # assign through .loc to avoid chained assignment
        data.loc[i, 'num_characters'] = len(data['characters'][i].split(','))
# Number of awards received: 'num_awards'
data['num_awards'] = 0
for i in range(len(data)):
    if data['awards'][i] == '[]':
        continue
    else:
        data.loc[i, 'num_awards'] = len(data['awards'][i].split(','))
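The derived columns can also be computed without explicit loops. The sketch below is one possible alternative, not the approach used in this post: it parses each stringified list with ast.literal_eval and counts its entries. The helper name count_items and all sample values are hypothetical.

```python
import ast
import pandas as pd

def count_items(cell):
    """Count the entries of a stringified Python list such as "['a', 'b']"."""
    if not isinstance(cell, str):
        return 0  # treat a missing value as an empty list
    return len(ast.literal_eval(cell))

# Hypothetical rows in the same shape as the series/characters/awards columns
df = pd.DataFrame({
    "series": ["Series A #1", None],
    "characters": ["['Character X', 'Character Y']", "[]"],
    "awards": ["['Award X (2009), Category Y']", "[]"],
})

df["is_series"] = df["series"].notnull().astype(int)
df["num_characters"] = df["characters"].apply(count_items)
df["num_awards"] = df["awards"].apply(count_items)

print(df[["is_series", "num_characters", "num_awards"]])
```

One difference from split(','): an entry that itself contains a comma, like the award in the first row, is counted once here, whereas splitting on commas would count it twice.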
- Each book is tagged with multiple genres (1:M), so define a single main_genre
- Count the number of books per genre and pick the 15 most frequent genres as the main_genre categories
- Classify everything else as etc
# Count how often each genre occurs
genre_dict = {}
for i in range(len(data)):
    if data['genres'][i] == '[]':
        continue
    # strip the leading "['" and trailing "']", then split on the "', '" separator
    lst = data['genres'][i][2:-2].split("', '")
    for s in lst:
        genre_dict[s] = genre_dict.get(s, 0) + 1
genre_dict
# Keep only the top 15 genres; the rest are classified as etc
import operator
genre_lst = sorted(genre_dict.items(), key=operator.itemgetter(1), reverse=True)[:15]
# The selected genre categories
genre_lst
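The same frequency count can be sketched with collections.Counter from the standard library. The genre strings below are hypothetical examples in the same stringified-list form as the genres column, and most_common(2) plays the role of the top-15 cut.

```python
import ast
from collections import Counter

# Hypothetical genre cells; not taken from the actual dataset
genre_cells = [
    "['Fiction', 'Fantasy', 'Young Adult']",
    "['Fiction', 'Classics']",
    "[]",
    "['Fantasy', 'Fiction']",
]

genre_counts = Counter()
for cell in genre_cells:
    # literal_eval turns the string back into a real Python list
    genre_counts.update(ast.literal_eval(cell))

# most_common(n) returns the n most frequent (genre, count) pairs, like genre_lst
top_genres = genre_counts.most_common(2)
print(top_genres)
```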
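- Final dataset used for the analysis
# Reassign each book to a main genre; etc if none of the top genres applies
data['main_genre'] = 'etc'
for i in range(len(data)):
    for g, num in genre_lst:
        # substring test on the raw string, so 'Fiction' also matches 'Science Fiction'
        if g in data['genres'][i]:
            data.loc[i, 'main_genre'] = g
            break
- Check the final dataset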
del data['series']
del data['genres']
del data['characters']
del data['awards']
data.head(5)
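The main_genre assignment above tests `g in data['genres'][i]` against the raw string, which is a substring check. If exact genre names are wanted instead, one option is to parse the list first. The sketch below is an alternative under that assumption, not what this post ran; pick_main_genre and the sample inputs are hypothetical.

```python
import ast

def pick_main_genre(cell, top_genres):
    """Return the first top genre that appears as an exact entry in the cell."""
    genres = set(ast.literal_eval(cell))
    for g in top_genres:
        if g in genres:
            return g
    return "etc"

# Hypothetical top genres, ordered from most to least frequent
top = ["Fiction", "Fantasy"]

exact = pick_main_genre("['Science Fiction', 'Fantasy']", top)
fallback = pick_main_genre("['Poetry']", top)
print(exact, fallback)

# For comparison: the substring check matches 'Fiction' inside 'Science Fiction'
substring_hit = "Fiction" in "['Science Fiction', 'Fantasy']"
print(substring_hit)
```

Switching to exact matching would change the genre counts, so the charts that follow would not be directly comparable with the ones in this post.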
🔷 3. Exploratory data analysis
data.describe()
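- Draw a heatmap to examine the correlations between the numeric variables
- With plotly.express, px.imshow(data.corr(numeric_only=True)) draws the heatmap (numeric_only=True skips text columns such as title and author)
fig = px.imshow(data.corr(numeric_only=True), template='plotly_dark', title='Heatmap')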
fig.show()
- To see whether books in a series and standalone books have different rating distributions, plot the rating distribution of the two groups
- Draw a distplot with plotly.figure_factory
- ff.create_distplot(hist_data, group_labels, bin_size=.2, colors=colors)
# Rating distribution of series vs. standalone books
# group data
hist_data = [data[data['is_series'] == 1]['rating'], data[data['is_series'] == 0]['rating']]
group_labels = ['is_series', 'not_series']
colors = ['#2BCDC1', '#F66095']
# create distplot
fig = ff.create_distplot(hist_data, group_labels, bin_size=.2, colors=colors)
fig.update_layout(title_text='Rating Distribution', template='plotly_dark')
fig.show()
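Alongside the distplot, a small numeric summary can make the comparison between the two groups easier to read. The sketch below uses groupby on a hypothetical six-row frame; the ratings are made up and say nothing about the real result.

```python
import pandas as pd

# Hypothetical ratings; 1 = part of a series, 0 = standalone
df = pd.DataFrame({
    "is_series": [1, 1, 1, 0, 0, 0],
    "rating": [4.2, 4.0, 4.4, 3.8, 4.0, 3.6],
})

# Count, mean and median rating per group
summary = df.groupby("is_series")["rating"].agg(["count", "mean", "median"])
print(summary)
```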
- Draw a boxplot to compare the rating distributions across genres
- px.box(data, x="main_genre", y="rating", color='main_genre')
# Rating distribution by genre
fig = px.box(data, x="main_genre", y="rating", color='main_genre', template='plotly_dark')
fig['layout'].update(title='Rating Distributions by Genre')
fig.show()
- If a book is too thick, fewer people will probably get through it! -> visualize the relationship between likedPercent and pages
- Use a density heatmap
- px.density_heatmap(data, x="pages", y="likedPercent", marginal_x="histogram", marginal_y="histogram", range_x=[0, 500], range_y=[80, 100])
# Density heatmap of likedPercent vs. pages
fig = px.density_heatmap(data, x="pages", y="likedPercent", marginal_x="histogram", marginal_y="histogram", range_x=[0, 500], range_y=[80, 100], template='plotly_dark')
fig['layout'].update(title='Density Heatmap of LikedPercent vs Pages')
fig.show()
Books of 200~229 pages appear to do better than books of 440~449 pages
Books of 300~349 pages are the most popular
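The page-range observations above are read off the heatmap by eye. One way to check such readings numerically is to bin pages explicitly with pd.cut and summarize likedPercent per bin. The sketch below uses a made-up eight-row frame and hypothetical 100-page bin edges, so its numbers are illustrative only.

```python
import pandas as pd

# Hypothetical data; not taken from the actual dataset
df = pd.DataFrame({
    "pages": [120, 210, 225, 310, 330, 345, 445, 480],
    "likedPercent": [90, 95, 93, 96, 94, 92, 89, 91],
})

# Left-closed bins: [0, 100), [100, 200), ..., [400, 500)
bins = [0, 100, 200, 300, 400, 500]
df["page_bin"] = pd.cut(df["pages"], bins=bins, right=False)

# Number of books and mean likedPercent per page range (empty bins are skipped)
per_bin = df.groupby("page_bin", observed=True)["likedPercent"].agg(["count", "mean"])
print(per_bin)

# The bin holding the most books
busiest = per_bin["count"].idxmax()
print(busiest)
```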
- Find out which genres the books on the best books list belong to, and in what proportions
- pie chart
- px.pie(df2, values=values, names=labels)
# Share of each genre
# count values by main_genre
df2 = pd.DataFrame(data['main_genre'].value_counts()).reset_index()
df2.columns = ['main_genre', 'counts']
labels = df2['main_genre'].tolist()
values = df2['counts'].tolist()
fig = px.pie(df2, values=values, names=labels, template='plotly_dark')
fig.update_traces(textposition='inside')
fig.update_layout(uniformtext_minsize=12, uniformtext_mode='hide')
fig['layout'].update(title='Genre Ratio', boxmode='group')
fig.show()
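The proportions shown in the pie chart can also be obtained directly as numbers with value_counts(normalize=True). A minimal sketch on a hypothetical main_genre column:

```python
import pandas as pd

# Hypothetical main_genre values; not taken from the actual dataset
genres = pd.Series(["Fiction", "Fantasy", "Fiction", "etc", "Fiction"], name="main_genre")

# normalize=True returns each category's share instead of its raw count
share = genres.value_counts(normalize=True)
print(share)
```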