[Kaggle] Smart Home Dataset with weather Information
์ง์ง์ํ์นด 2022. 9. 16. 16:21 (written 220916)
<This post was written while studying the code and notebook by koheimuramatsu on Kaggle :-) >
https://www.kaggle.com/code/koheimuramatsu/change-detection-forecasting-in-smart-home/notebook
Change Detection & Forecasting in Smart Home
energy data from house appliances and weather information
- Understand the relationship between each appliance's energy consumption and the time period
- Detect anomalous use of home appliances
- Understand the relationship between weather information and solar power generation
Code implementation
1️⃣ Load libraries
- changefinder : a library for online change-point detection
- HoloViews : designed to make data analysis and visualization seamless and simple
- shap : a game-theoretic approach to explaining the output of any machine learning model
!pip install changefinder
!conda install -c pyviz holoviews bokeh -y
!pip install lightgbm
!conda install -c conda-forge shap -y
import numpy as np
import pandas as pd
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')
from matplotlib import pyplot as plt
import seaborn as sns
import os
import changefinder
from scipy import stats
from statsmodels.tsa.api import VAR
from statsmodels.tsa.stattools import grangercausalitytests
from statsmodels.tsa.stattools import adfuller
from fbprophet import Prophet  # on Prophet >= 1.0 the package is named prophet: from prophet import Prophet
from sklearn.metrics import mean_absolute_error
import shap
shap.initjs()
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder
from tabulate import tabulate
from IPython.display import HTML, display
2️⃣ Load data
df = pd.read_csv("HomeC.csv/HomeC.csv",low_memory=False)
print(f'HomeC.csv : {df.shape}')
df.head(3)
- Weather information
- temperature: a physical quantity that expresses hot and cold
- humidity: the concentration of water vapor present in the air
- visibility: the meteorological optical range, defined as the length of atmosphere a beam of light travels through
- apparentTemperature: the temperature equivalent perceived by humans, resulting from the combined effects of air temperature, relative humidity and wind speed
- pressure: falling air pressure indicates that bad weather is coming, while rising air pressure indicates good weather
- windSpeed: a fundamental atmospheric quantity caused by air moving from high pressure to low pressure, usually due to changes in temperature
- cloudCover: the fraction of the sky obscured by clouds when observed from a particular location
- windBearing: in meteorology, an azimuth of 000° is used only when no wind is blowing, while 360° means the wind is blowing from the north; any direction measured relative to true north is a "true bearing"
- dewPoint: the atmospheric temperature (varying with pressure and humidity) at which water droplets begin to condense and dew can form
- precipProbability: a measure of the probability that at least a minimum amount of precipitation will occur within the specified forecast period and location
- precipIntensity: a measure of the amount of rain that falls over time
3️⃣ Preprocessing
df.columns
![](https://blog.kakaocdn.net/dn/qxfF4/btrMeQx8B7v/odFBdDTcD93abNdyuOuN2K/img.png)
df.columns = [i.replace(' [kW]', '') for i in df.columns]
- Sum the related columns and drop the ones that are no longer needed
df['Furnace'] = df[['Furnace 1','Furnace 2']].sum(axis=1)
df['Kitchen'] = df[['Kitchen 12','Kitchen 14','Kitchen 38']].sum(axis=1)
df.drop(['Furnace 1','Furnace 2','Kitchen 12','Kitchen 14','Kitchen 38','icon','summary'], axis=1, inplace=True)
- Drop NaN values (only the last row contains them)
df[df.isnull().any(axis=1)]
df = df[0:-1]
- Some invalid values are present: cloudCover contains the string 'cloudCover'
df['cloudCover'].unique()
df[df['cloudCover']=='cloudCover'].shape
df['cloudCover'].replace(['cloudCover'], method='bfill', inplace=True)
df['cloudCover'] = df['cloudCover'].astype('float')
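The cleaning step above replaces each invalid placeholder with the next valid value (a backward fill) and then casts the column to float. As a rough illustration of what that does, here is a minimal pure-Python sketch. The function name `backfill_invalid` and the toy values are assumptions made for this example and are not part of the notebook.

```python
def backfill_invalid(values, invalid="cloudCover"):
    """Replace each invalid placeholder with the next valid value (backward fill)."""
    cleaned = list(values)
    next_valid = None
    # Walk from the end so every placeholder can take the value that follows it.
    for i in range(len(cleaned) - 1, -1, -1):
        if cleaned[i] == invalid:
            cleaned[i] = next_valid  # stays None when nothing valid follows
        else:
            next_valid = cleaned[i]
    # Cast to float, mirroring the astype('float') step.
    return [None if v is None else float(v) for v in cleaned]


# Hypothetical column content: the placeholder string mixed with numeric strings.
raw = ["cloudCover", "cloudCover", "0.75", "0.75", "cloudCover", "0.31"]
filled = backfill_invalid(raw)
print(filled)
```

A placeholder at the very end of the column has nothing to copy from, which is why the sketch leaves it as None.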
4️⃣ datetime information
- The data was collected at 1-minute intervals, but the timestamps in the time column increase in 1-second steps
pd.to_datetime(df['time'], unit='s').head(3)
- Create a new date range in 1-minute steps
df['time'] = pd.DatetimeIndex(pd.date_range('2016-01-01 05:00', periods=len(df), freq='min'))
df.head(3)
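The mismatch described above can be seen without pandas. The sketch below is only an illustration: the epoch value is an assumption chosen to correspond to the 2016-01-01 05:00 start used in the post (read as UTC), and `rebuild_minute_index` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def rebuild_minute_index(start, n_rows):
    """Return n_rows timestamps spaced one minute apart, starting at start."""
    return [start + timedelta(minutes=i) for i in range(n_rows)]


# Assumed epoch seconds for 2016-01-01 05:00:00 UTC.
start = datetime.fromtimestamp(1451624400, tz=timezone.utc)

# What the raw column looks like: consecutive rows differ by one second.
as_seconds = [start + timedelta(seconds=i) for i in range(3)]
# What the post rebuilds: consecutive rows differ by one minute.
as_minutes = rebuild_minute_index(start, 3)

print(as_seconds[-1].isoformat())
print(as_minutes[-1].isoformat())
```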
- To use date-time information such as year, month and day in the EDA and modeling stages, extract it from the time column
df['year'] = df['time'].apply(lambda x : x.year)
df['month'] = df['time'].apply(lambda x : x.month)
df['day'] = df['time'].apply(lambda x : x.day)
df['weekday'] = df['time'].apply(lambda x : x.day_name())
df['weekofyear'] = df['time'].apply(lambda x : x.weekofyear)
df['hour'] = df['time'].apply(lambda x : x.hour)
df['minute'] = df['time'].apply(lambda x : x.minute)
df.head(3)
5️⃣ Timing information
- Night : 22:00 - 23:59 / 00:00 - 03:59
- Morning : 04:00 - 11:59
- Afternoon : 12:00 - 16:59
- Evening : 17:00 - 21:59
def hours2timing(x):
    if x in [22,23,0,1,2,3]:
        timing = 'Night'
    elif x in range(4, 12):
        timing = 'Morning'
    elif x in range(12, 17):
        timing = 'Afternoon'
    elif x in range(17, 22):
        timing = 'Evening'
    else:
        timing = 'X'
    return timing
df['timing'] = df['hour'].apply(hours2timing)
df.head(3)
6️⃣ Removing Duplicate Columns
fig = plt.subplots(figsize=(10, 8))
corr = df.corr()  # on pandas >= 2.0, pass numeric_only=True to skip the string columns (weekday, timing)
sns.heatmap(corr[corr>0.9], vmax=1, vmin=-1, center=0)
plt.show()
- 'use' and 'House overall', and likewise 'gen' and 'Solar', have correlation coefficients above roughly 0.95, so each pair needs to be merged into a single new column
df['use_HO'] = df['use']
df['gen_Sol'] = df['gen']
df.drop(['use','House overall','gen','Solar'], axis=1, inplace=True)
df.head(3)
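The duplicate-column check above relies on the Pearson correlation coefficient. The following sketch shows the same idea on toy data in plain Python. The column values are made up and `near_duplicates` is a hypothetical helper, so treat it as an illustration of the criterion rather than the notebook's code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def near_duplicates(columns, threshold=0.95):
    """List column pairs whose correlation exceeds the threshold."""
    names = sorted(columns)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if pearson(columns[first], columns[second]) > threshold:
                pairs.append((first, second))
    return pairs


# Toy data: 'use' and 'House overall' carry the same signal, 'Fridge' does not.
toy = {
    "use": [1.0, 2.0, 3.0, 4.0],
    "House overall": [1.0, 2.0, 3.0, 4.0],
    "Fridge": [0.3, 0.1, 0.4, 0.1],
}
duplicates = near_duplicates(toy)
print(duplicates)
```

Keeping one column from each flagged pair avoids feeding the same information into later models twice.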
7️⃣ EDA
- House Appliances
use = hv.Distribution(df['use_HO']).opts(title="Total Energy Consumption Distribution", color="red")
gen = hv.Distribution(df['gen_Sol']).opts(title="Total Energy Generation Distribution", color="blue")
(use + gen).opts(opts.Distribution(xlabel="Energy Consumption", ylabel="Density", xformatter='%.1fkw', width=400, height=300,tools=['hover'],show_grid=True))
![](https://blog.kakaocdn.net/dn/F0vYd/btrMhVMa0qf/dKKXSio3jrPuUzkmT1L4RK/img.png)
dw = hv.Distribution(df[df['Dishwasher']<1.5]['Dishwasher'],label="Dishwasher").opts(color="red")
ho = hv.Distribution(df[df['Home office']<1.5]['Home office'],label="Home office").opts(color="blue")
fr = hv.Distribution(df[df['Fridge']<1.5]['Fridge'],label="Fridge Distribution").opts(color="orange")
wc = hv.Distribution(df[df['Wine cellar']<1.5]['Wine cellar'],label="Wine cellar").opts(color="green")
gd = hv.Distribution(df[df['Garage door']<1.5]['Garage door'],label="Garage door").opts(color="purple")
ba = hv.Distribution(df[df['Barn']<1.5]['Barn'],label="Barn").opts(color="grey")
we = hv.Distribution(df[df['Well']<1.5]['Well'],label="Well").opts(color="pink")
mcr = hv.Distribution(df[df['Microwave']<1.5]['Microwave'],label="Microwave").opts(color="yellow")
lr = hv.Distribution(df[df['Living room']<1.5]['Living room'],label="Living room").opts(color="brown")
fu = hv.Distribution(df[df['Furnace']<1.5]['Furnace'],label="Furnace").opts(color="skyblue")
ki = hv.Distribution(df[df['Kitchen']<1.5]['Kitchen'],label="Kitchen").opts(color="lightgreen")
(dw * ho * fr * wc * gd * ba * we * mcr * lr * fu * ki).opts(opts.Distribution(xlabel="Energy Consumption", ylabel="Density", xformatter='%.1fkw',title='Energy Consumption of Appliances Distribution',
width=800, height=350,tools=['hover'],show_grid=True))
- Weather Information
temp = hv.Distribution(df['temperature'],label="temperature").opts(color="red")
apTemp = hv.Distribution(df['apparentTemperature'],label="apparentTemperature").opts(color="orange")
temps = (temp * apTemp).opts(opts.Distribution(title='Temperature Distribution')).opts(legend_position='top',legend_cols=2)
hmd = hv.Distribution(df['humidity']).opts(color="yellow", title='Humidity Distribution')
vis = hv.Distribution(df['visibility']).opts(color="blue", title='Visibility Distribution')
prs = hv.Distribution(df['pressure']).opts(color="green", title='Pressure Distribution')
wnd = hv.Distribution(df['windSpeed']).opts(color="purple", title='WindSpeed Distribution')
cld = hv.Distribution(df['cloudCover']).opts(color="grey", title='CloudCover Distribution')
prc = hv.Distribution(df['precipIntensity']).opts(color="skyblue", title='PrecipIntensity Distribution')
dew = hv.Distribution(df['dewPoint']).opts(color="lightgreen", title='DewPoint Distribution')
(temps + hmd + vis + prs + wnd + cld + prc + dew).opts(opts.Distribution(xlabel="Values", ylabel="Density", width=400, height=300,tools=['hover'],show_grid=True)).cols(4)
8️⃣ Time Series Analysis
- Energy consumption peaks from July to September
- Energy generation has no large peak, but it rises gradually from January to July and then slowly declines
def groupByMonth(col):
    return df[[col,'month']].groupby('month').agg({col:['mean']})[col]
def groupByWeekday(col):
    weekdayDf = df.groupby('weekday').agg({col:['mean']})
    weekdayDf.columns = [f"{i[0]}_{i[1]}" for i in weekdayDf.columns]
    weekdayDf['week_num'] = [['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'].index(i) for i in weekdayDf.index]
    weekdayDf.sort_values('week_num', inplace=True)
    weekdayDf.drop('week_num', axis=1, inplace=True)
    return weekdayDf
def groupByTiming(col):
    timingDf = df.groupby('timing').agg({col:['mean']})
    timingDf.columns = [f"{i[0]}_{i[1]}" for i in timingDf.columns]
    timingDf['timing_num'] = [['Morning','Afternoon','Evening','Night'].index(i) for i in timingDf.index]
    timingDf.sort_values('timing_num', inplace=True)
    timingDf.drop('timing_num', axis=1, inplace=True)
    return timingDf
df = df.set_index(df['time'])
use = hv.Curve(df['use_HO'].resample('D').mean()).opts(title="Total Energy Consumption Time-Series by Day", color="red", ylabel="Energy Consumption")
gen = hv.Curve(df['gen_Sol'].resample('D').mean()).opts(title="Total Energy Generation Time-Series by Day", color="blue", ylabel="Energy Generation")
(use + gen).opts(opts.Curve(xlabel="Day", yformatter='%.1fkw', width=400, height=300,tools=['hover'],show_grid=True,fontsize={'title':11}))
use = hv.Curve(groupByMonth('use_HO')).opts(title="Total Energy Consumption Time-Series by Month", color="red", ylabel="Energy Consumption")
gen = hv.Curve(groupByMonth('gen_Sol')).opts(title="Total Energy Generation Time-Series by Month", color="blue", ylabel="Energy Generation")
(use + gen).opts(opts.Curve(xlabel="Month", yformatter='%.1fkw', width=400, height=300,tools=['hover'],show_grid=True,fontsize={'title':10})).opts(shared_axes=False)
- Intuitively, energy consumption and generation have no weekly trend
- In practice there appears to be a slight trend, but the change in values is negligible
use = hv.Curve(groupByWeekday('use_HO')).opts(title="Total Energy Consumption Time-Series by Weekday", color="red", ylabel="Energy Consumption")
gen = hv.Curve(groupByWeekday('gen_Sol')).opts(title="Total Energy Generation Time-Series by Weekday", color="blue", ylabel="Energy Generation")
(use + gen).opts(opts.Curve(xlabel="Weekday", yformatter='%.2fkw', width=400, height=300,tools=['hover'],show_grid=True, xrotation=20,fontsize={'title':10})).opts(shared_axes=False)
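- Energy consumption is low during the day and high at night
- Energy generation is high during the day and low at night
- During the day the residents are away from home while energy is being generated
- At night consumption increases because the residents return home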
use = hv.Curve(groupByTiming('use_HO')).opts(title="Total Energy Consumption Time-Series by Timing", color="red", ylabel="Energy Consumption")
gen = hv.Curve(groupByTiming('gen_Sol')).opts(title="Total Energy Generation Time-Series by Timing", color="blue", ylabel="Energy Generation")
(use + gen).opts(opts.Curve(xlabel="Timing", yformatter='%.1fkw', width=400, height=300,tools=['hover'],show_grid=True,fontsize={'title':10})).opts(shared_axes=False)
- Home office, Fridge, Wine cellar, Living room and Furnace show clear time-series trends
- This is because these appliances have to keep the room temperature constant or adjust it to a comfortable level depending on the season
dw = hv.Curve(df['Dishwasher'].resample('D').mean(),label="Dishwasher Time-Series by Day").opts(color="red")
ho = hv.Curve(df['Home office'].resample('D').mean(),label="Home office Time-Series by Day").opts(color="blue")
fr = hv.Curve(df['Fridge'].resample('D').mean(),label="Fridge Time-Series by Day").opts(color="orange")
wc = hv.Curve(df['Wine cellar'].resample('D').mean(),label="Wine cellar Time-Series by Day").opts(color="green")
gd = hv.Curve(df['Garage door'].resample('D').mean(),label="Garage door Time-Series by Day").opts(color="purple")
ba = hv.Curve(df['Barn'].resample('D').mean(),label="Barn Time-Series by Day").opts(color="grey")
we = hv.Curve(df['Well'].resample('D').mean(),label="Well Time-Series by Day").opts(color="pink")
mcr = hv.Curve(df['Microwave'].resample('D').mean(),label="Microwave Time-Series by Day").opts(color="yellow")
lr = hv.Curve(df['Living room'].resample('D').mean(),label="Living room Time-Series by Day").opts(color="brown")
fu = hv.Curve(df['Furnace'].resample('D').mean(),label="Furnace Time-Series by Day").opts(color="skyblue")
ki = hv.Curve(df['Kitchen'].resample('D').mean(),label="Kitchen Time-Series by Day").opts(color="lightgreen")
(dw + ho + fr + wc + gd + ba + we + mcr + lr + fu + ki).opts(opts.Curve(xlabel="Day", ylabel="Energy Consumption", yformatter='%.2fkw' , \
width=400, height=300,tools=['hover'],show_grid=True)).cols(6)
- Monthly energy consumption of each appliance
dw = hv.Curve(groupByMonth('Dishwasher'),label="Dishwasher Time-Series by Month").opts(color="red")
ho = hv.Curve(groupByMonth('Home office'),label="Home office Time-Series by Month").opts(color="blue")
fr = hv.Curve(groupByMonth('Fridge'),label="Fridge Time-Series by Month").opts(color="orange")
wc = hv.Curve(groupByMonth('Wine cellar'),label="Wine cellar Time-Series by Month").opts(color="green")
gd = hv.Curve(groupByMonth('Garage door'),label="Garage door Time-Series by Month").opts(color="purple")
ba = hv.Curve(groupByMonth('Barn'),label="Barn Time-Series by Month").opts(color="grey")
we = hv.Curve(groupByMonth('Well'),label="Well Time-Series by Month").opts(color="pink")
mcr = hv.Curve(groupByMonth('Microwave'),label="Microwave Time-Series by Month").opts(color="yellow")
lr = hv.Curve(groupByMonth('Living room'),label="Living room Time-Series by Month").opts(color="brown")
fu = hv.Curve(groupByMonth('Furnace'),label="Furnace Time-Series by Month").opts(color="skyblue")
ki = hv.Curve(groupByMonth('Kitchen'),label="Kitchen Time-Series by Month").opts(color="lightgreen")
(dw + ho + fr + wc + gd + ba + we + mcr + lr + fu + ki).opts(opts.Curve(xlabel="Month", ylabel="Energy Consumption", yformatter='%.2fkw', \
width=400, height=300,tools=['hover'],show_grid=True)).opts(shared_axes=False).cols(6)
- The appliances' energy consumption has no weekly trend.
dw = hv.Curve(groupByWeekday('Dishwasher'),label="Dishwasher Time-Series by Weekday").opts(color="red")
ho = hv.Curve(groupByWeekday('Home office'),label="Home office Time-Series by Weekday").opts(color="blue")
fr = hv.Curve(groupByWeekday('Fridge'),label="Fridge Time-Series by Weekday").opts(color="orange")
wc = hv.Curve(groupByWeekday('Wine cellar'),label="Wine cellar Time-Series by Weekday").opts(color="green")
gd = hv.Curve(groupByWeekday('Garage door'),label="Garage door Time-Series by Weekday").opts(color="purple")
ba = hv.Curve(groupByWeekday('Barn'),label="Barn Time-Series by Weekday").opts(color="grey")
we = hv.Curve(groupByWeekday('Well'),label="Well Time-Series by Weekday").opts(color="pink")
mcr = hv.Curve(groupByWeekday('Microwave'),label="Microwave Time-Series by Weekday").opts(color="yellow")
lr = hv.Curve(groupByWeekday('Living room'),label="Living room Time-Series by Weekday").opts(color="brown")
fu = hv.Curve(groupByWeekday('Furnace'),label="Furnace Time-Series by Weekday").opts(color="skyblue")
ki = hv.Curve(groupByWeekday('Kitchen'),label="Kitchen Time-Series by Weekday").opts(color="lightgreen")
(dw + ho + fr + wc + gd + ba + we + mcr + lr + fu + ki).opts(opts.Curve(xlabel="Weekday", ylabel="Energy Consumption", yformatter='%.2fkw', \
width=400, height=300,tools=['hover'],show_grid=True, xrotation=20)).opts(shared_axes=False).cols(6)
- Overall, energy consumption increases slightly from the evening through the night
- This is because residents come home from work and start productive activities
dw = hv.Curve(groupByTiming('Dishwasher'),label="Dishwasher Time-Series by Timing").opts(color="red")
ho = hv.Curve(groupByTiming('Home office'),label="Home office Time-Series by Timing").opts(color="blue")
fr = hv.Curve(groupByTiming('Fridge'),label="Fridge Time-Series by Timing").opts(color="orange")
wc = hv.Curve(groupByTiming('Wine cellar'),label="Wine cellar Time-Series by Timing").opts(color="green")
gd = hv.Curve(groupByTiming('Garage door'),label="Garage door Time-Series by Timing").opts(color="purple")
ba = hv.Curve(groupByTiming('Barn'),label="Barn Time-Series by Timing").opts(color="grey")
we = hv.Curve(groupByTiming('Well'),label="Well Time-Series by Timing").opts(color="pink")
mcr = hv.Curve(groupByTiming('Microwave'),label="Microwave Time-Series by Timing").opts(color="yellow")
lr = hv.Curve(groupByTiming('Living room'),label="Living room Time-Series by Timing").opts(color="brown")
fu = hv.Curve(groupByTiming('Furnace'),label="Furnace Time-Series by Timing").opts(color="skyblue")
ki = hv.Curve(groupByTiming('Kitchen'),label="Kitchen Time-Series by Timing").opts(color="lightgreen")
(dw + ho + fr + wc + gd + ba + we + mcr + lr + fu + ki).opts(opts.Curve(xlabel="Timing", ylabel="Energy Consumption", yformatter='%.2fkw', \
width=400, height=300,tools=['hover'],show_grid=True)).opts(shared_axes=False).cols(6)
9️⃣ Weather Time-Series
temp = hv.Curve(df['temperature'].resample('D').mean(),label="temperature").opts(color="red")
apTemp = hv.Curve(df['apparentTemperature'].resample('D').mean(),label="apparentTemperature").opts(color="orange")
temps = (temp * apTemp).opts(opts.Curve(title='Temperature Time-Series by Day')).opts(legend_position='top',legend_cols=2)
hmd = hv.Curve(df['humidity'].resample('D').mean()).opts(color="yellow", title='Humidity Time-Series by Day')
vis = hv.Curve(df['visibility'].resample('D').mean()).opts(color="blue", title='Visibility Time-Series by Day')
prs = hv.Curve(df['pressure'].resample('D').mean()).opts(color="green", title='Pressure Time-Series by Day')
wnd = hv.Curve(df['windSpeed'].resample('D').mean()).opts(color="purple", title='WindSpeed Time-Series by Day')
cld = hv.Curve(df['cloudCover'].resample('D').mean()).opts(color="grey", title='CloudCover Time-Series by Day')
prc = hv.Curve(df['precipIntensity'].resample('D').mean()).opts(color="skyblue", title='PrecipIntensity Time-Series by Day')
dew = hv.Curve(df['dewPoint'].resample('D').mean()).opts(color="lightgreen", title='DewPoint Time-Series by Day')
(temps + hmd + vis + prs + wnd + cld + prc + dew).opts(opts.Curve(xlabel="Day", ylabel="Values", width=400, height=300,tools=['hover'],show_grid=True)).cols(4)
🔟 Correlation Analysis
- There is no notable correlation among the appliances
fig,ax = plt.subplots(figsize=(10, 8))
corr = df[['Dishwasher','Home office','Fridge','Wine cellar','Garage door','Barn','Well','Microwave','Living room','Furnace','Kitchen']].corr()
sns.heatmap(corr, annot=True, vmin=-1.0, vmax=1.0, center=0)
ax.set_title('Correlation of Appliances',size=20)
plt.show()
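- Correlation among the weather variables
- temperature is related to apparentTemperature and dewPoint
- humidity is related to visibility, windSpeed, cloudCover and dewPoint
- visibility is related to humidity, windSpeed, cloudCover and precipIntensity
- cloudCover is related to humidity, visibility and precipIntensity
- precipIntensity is related to visibility and cloudCover
- dewPoint is related to temperature, apparentTemperature and humidity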
fig,ax = plt.subplots(figsize=(10, 8))
corr = df[['temperature','apparentTemperature','humidity','visibility','pressure','windSpeed','cloudCover','precipIntensity','dewPoint']].corr()
sns.heatmap(corr, annot=True, vmin=-1.0, vmax=1.0, center=0)
ax.set_title('Correlation of Weather Information',size=20)
plt.show()
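- Some appliances are affected by weather information
- Fridge is related to temperature, apparentTemperature and dewPoint
- Wine cellar is related to temperature, apparentTemperature and dewPoint
- Furnace is related to temperature, apparentTemperature, windSpeed and dewPoint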
fig,ax = plt.subplots(figsize=(20, 12))
corr = df[['use_HO','gen_Sol','Dishwasher','Home office','Fridge','Wine cellar','Garage door','Barn','Well','Microwave','Living room','Furnace','Kitchen',\
'temperature','apparentTemperature','humidity','visibility','pressure','windSpeed','cloudCover','precipIntensity','dewPoint']].corr()
sns.heatmap(corr, annot=True, vmin=-1.0, vmax=1.0, center=0)
ax.set_title('Correlation of Appliances & Weather Information',size=20)
plt.show()
1️⃣1️⃣ Model
Change detection : detect excessive energy consumption in advance to prevent higher utility charges
Future consumption forecasting : predict future energy consumption and generation from weather information in order to optimize the energy supply
▶ Case1. Detect Changes in Energy Consumption
- It should be possible to capture the change points of the energy-usage trend from the energy consumption data
- By capturing changes in the consumption trend, we can consider increasing the energy supply in months when consumption is likely to rise
- and reducing the energy supply in months when it is likely to fall
- change point
- A change point is a point at which the trend of a time series changes over time
- An outlier represents a momentary abnormal state (a sharp decrease or increase)
- At a change point, the abnormal state persists instead of returning to the original state
- ChangeFinder
- An algorithm used to detect change points
- It computes the change score from the log-likelihood based on the SDAR (Sequentially Discounting AR) algorithm
- SDAR introduces a discounting parameter into the AR algorithm to reduce the influence of past data, so it can learn robustly even from non-stationary time series
- Training STEP1
- Train a time-series model at each data point using the SDAR algorithm
- Based on the trained time-series model, compute the likelihood of the data point that appears at the next time step
- Compute the log loss and use it as the outlier score
- $\mathrm{Score}(x_t) = -\log P_{t-1}(x_t \mid x_1, x_2, \dots, x_{t-1})$
- Smoothing Step
- Smooth the outlier scores within the smoothing window ($W$)
- Smoothing attenuates the scores caused by outliers, which makes it possible to tell whether an abnormal state has persisted for a long time
- $\mathrm{Score\_smoothed}(x_t) = \frac{1}{W} \sum_{i=t-W+1}^{t} \mathrm{Score}(x_i)$
- Training STEP2
- Train a model with the SDAR algorithm on the scores obtained from smoothing
- Based on the trained time-series model, compute the likelihood of the data point that appears at the next time step
- Compute the log loss and use it as the change score
- Hyperparameter Tuning
- Discounting parameter $r$ ($0 < r < 1$) : the smaller this value, the greater the influence of past data points and the greater the variation of the change score
- Order parameter for AR, order : how many past data points are included in the model
- Smoothing window, smooth : the larger this parameter, the easier it is to capture essential changes rather than outliers, but if it is too large it becomes hard to capture the changes themselves
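The two training steps and the smoothing step can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not the changefinder implementation: it assumes an order-0 Gaussian model with a discounted mean and variance in place of a full SDAR model, and the class name `DiscountedGaussian`, the variance floor and the toy series are all assumptions.

```python
import math


class DiscountedGaussian:
    """Order-0 stand-in for SDAR: mean and variance updated with discounting rate r."""

    def __init__(self, r, init_var=1.0, min_var=1e-6):
        self.r, self.mean, self.var, self.min_var = r, 0.0, init_var, min_var

    def score_then_update(self, x):
        # Log loss of x under the model learned from earlier points only.
        loss = 0.5 * math.log(2 * math.pi * self.var) + (x - self.mean) ** 2 / (2 * self.var)
        # Discounted update: older observations fade at rate (1 - r).
        self.mean = (1 - self.r) * self.mean + self.r * x
        self.var = max((1 - self.r) * self.var + self.r * (x - self.mean) ** 2, self.min_var)
        return loss


def change_scores(series, r=0.1, smooth=3):
    """STEP1 outlier score -> moving average over `smooth` points -> STEP2 change score."""
    stage1, stage2 = DiscountedGaussian(r), DiscountedGaussian(r)
    recent, scores = [], []
    for x in series:
        recent = (recent + [stage1.score_then_update(x)])[-smooth:]
        smoothed = sum(recent) / len(recent)
        scores.append(stage2.score_then_update(smoothed))
    return scores


# Toy series: the level shifts from about 1 to about 5 at index 40 and stays there.
low = [1.0 + (0.1 if i % 2 else -0.1) for i in range(40)]
high = [5.0 + (0.1 if i % 2 else -0.1) for i in range(40)]
scores = change_scores(low + high)
peak = scores.index(max(scores))
print(peak)
```

In this toy setting the largest change score coincides with the level shift, while the small wobble before and after it barely registers.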
def chng_detection(col, _r=0.01, _order=1, _smooth=10):
    cf = changefinder.ChangeFinder(r=_r, order=_order, smooth=_smooth)
    ch_df = pd.DataFrame()
    ch_df[col] = df[col].resample('D').mean()
    # calculate the change score
    ch_df['change_score'] = [cf.update(i) for i in ch_df[col]]
    # threshold: third quartile plus three times the interquartile range
    ch_score_q1 = stats.scoreatpercentile(ch_df['change_score'], 25)
    ch_score_q3 = stats.scoreatpercentile(ch_df['change_score'], 75)
    thr_upper = ch_score_q3 + (ch_score_q3 - ch_score_q1) * 3
    anom_score = hv.Curve(ch_df['change_score'])
    anom_score_th = hv.HLine(thr_upper).opts(color='red', line_dash="dotdash")
    # use .iloc for positional access on the datetime-indexed series
    anom_points = [[ch_df.index[i],ch_df[col].iloc[i]] for i, score in enumerate(ch_df["change_score"]) if score > thr_upper]
    org = hv.Curve(ch_df[col],label=col).opts(yformatter='%.1fkw')
    detected = hv.Points(anom_points, label=f"{col} detected").opts(color='red', legend_position='bottom', size=5)
    return ((anom_score * anom_score_th).opts(title=f"{col} Change Score & Threshold") + \
            (org * detected).opts(title=f"{col} Detected Points")).opts(opts.Curve(width=800, height=300, show_grid=True, tools=['hover'])).cols(1)
- Build a model that can detect change points in energy consumption
- It captures the change points in July (sharp rise) and September (sharp drop), where the data trend shifts
chng_detection('use_HO', _r=0.001, _order=1, _smooth=3)
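The detection rule inside chng_detection flags a day when its change score exceeds Q3 + 3 × (Q3 − Q1). A minimal sketch of that rule follows, assuming linearly interpolated percentiles. The helper names and the toy scores are hypothetical.

```python
def percentile(values, q):
    """Percentile with linear interpolation between neighbours, q in [0, 100]."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def upper_threshold(scores, k=3.0):
    """Third quartile plus k times the interquartile range."""
    q1, q3 = percentile(scores, 25), percentile(scores, 75)
    return q3 + (q3 - q1) * k


# Toy change scores with one obvious spike at the end.
toy_scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0]
threshold = upper_threshold(toy_scores)
flagged = [i for i, s in enumerate(toy_scores) if s > threshold]
print(threshold, flagged)
```

A multiplier of 3 is stricter than the common 1.5 used for box-plot whiskers, so only pronounced spikes in the change score are reported.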
▶ Case2. Predict Future Energy Consumption
- It is possible to predict each appliance's energy consumption from weather information
- Predicting energy consumption lets us estimate the required energy supply from weather information, which makes energy optimization possible
- Three models: VAR, Prophet and LightGBM
1) MODEL VAR
- The vector autoregression (VAR) model is the multivariate extension of the autoregressive (AR) model
- It can forecast with several variables, and forecast accuracy is expected to improve compared with forecasting from a single variable
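To make the multivariate idea concrete, the sketch below iterates a first-order VAR, y_t = c + A y_{t-1}, for two variables. The coefficients, the intercept and the starting values are invented for illustration and are not estimates from the dataset. `var1_forecast` is a hypothetical helper, not a statsmodels function.

```python
def var1_forecast(last, intercept, coef, steps):
    """Iterate y_t = c + A @ y_{t-1} and return the forecast path."""
    path = []
    current = list(last)
    for _ in range(steps):
        current = [
            intercept[i] + sum(coef[i][j] * current[j] for j in range(len(current)))
            for i in range(len(current))
        ]
        path.append(current)
    return path


# Variable order: [temperature, use]. All numbers are made up.
intercept = [10.0, 0.0]
coef = [
    [0.5, 0.0],    # temperature depends only on its own previous value
    [0.25, 0.5],   # use depends on previous temperature and previous use
]
path = var1_forecast(last=[20.0, 4.0], intercept=intercept, coef=coef, steps=2)
print(path)
```

The second row of the coefficient matrix is what a single-variable AR model cannot express: the previous temperature feeds into the next consumption value.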
def grangerTestPlot(weather_info, applicances, _maxlag):
    grangerTest_df = pd.DataFrame()
    for weather in weather_info:
        for appliance in applicances:
            test_result = grangercausalitytests(df[[appliance, weather]], maxlag=_maxlag, verbose=False)
            p_values = [round(test_result[i][0]['ssr_chi2test'][1],4) for i in range(1, _maxlag+1)]
            min_p_value = np.min(p_values)
            grangerTest_df.loc[appliance, weather] = min_p_value
    fig,ax = plt.subplots(figsize=(10, 8))
    sns.heatmap(grangerTest_df, vmax=1, vmin=0, center=1, annot=True)
    ax.set_title('Granger Causality Test Result',size=20)
    plt.xlabel("Weather Information",size=15)
    plt.ylabel("Energy Consumption",size=15)
    plt.show()
- Select the total household energy consumption and one typical appliance, and run the Granger Causality Test against the weather information
- For both selected consumption variables, the p-values of pressure and visibility exceed 5%
- No causal relationship was observed between them
grangerTestPlot(
weather_info=['temperature', 'humidity', 'visibility', 'pressure', 'windSpeed', 'cloudCover', 'windBearing', 'precipIntensity','dewPoint'], \
applicances=['use_HO','Wine cellar'], \
_maxlag=12)
- Many time-series modeling methods require the data to be stationary, so stationarity matters for time-series modeling
- The mean of the data is constant
- The variance of the data is constant
- The covariance of the data is constant
- The Augmented Dickey-Fuller Test is used to test for stationarity
- In the ADF test, the p-values of all variables are within 5%
for i in ['temperature', 'humidity','windSpeed', 'cloudCover', 'windBearing', 'precipIntensity','dewPoint','use_HO','Wine cellar']:
    print(f"p-value {i} : {adfuller(df[i].resample('H').mean(), autolag='AIC', regression = 'ct')[1]}")
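The stationarity conditions listed above (constant mean and constant variance) can be illustrated by comparing the two halves of a series. The sketch below is only a rough intuition aid and is not the ADF test. Both toy series and the helper name `half_stats` are assumptions.

```python
from statistics import mean, pvariance


def half_stats(series):
    """Return (mean, variance) for the first and the second half of a series."""
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    return (mean(first), pvariance(first)), (mean(second), pvariance(second))


# A series that keeps oscillating around zero versus one that keeps rising.
stationary_like = [1.0, -1.0] * 50
trending = [float(i) for i in range(100)]

print(half_stats(stationary_like))
print(half_stats(trending))
```

For the oscillating series both halves have the same mean and variance, whereas the rising series has a clearly different mean in its second half, which is the kind of behaviour a stationarity test is meant to rule out.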
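- Forecast results for total energy consumption and Wine cellar energy consumption, with weather information added as explanatory variables
- Both are generally predicted well, but only within a very short time horizon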
var_df = df.resample('H').mean()
def var_train(cols=['temperature', 'humidity', 'visibility', 'windSpeed', 'windBearing', 'dewPoint','Furnace', 'use_HO'], max_order=10, train_ratio=0.9,test_ratio=0.1):
    #make dataframe for training
    tr,te = [int(len(var_df) * i) for i in [train_ratio, test_ratio]]
    train, test = var_df[0:tr], var_df[tr:]
    #model training
    var_func = VAR(train[cols], freq='H')
    var_func.select_order(max_order)
    model = var_func.fit(maxlags=max_order, ic='aic', trend='ct')
    model_result = model.summary()
    #make predict dataframe
    varForecast_df = pd.DataFrame(model.forecast(model.endog, steps=len(test)),columns=cols)
    varForecast_df.index = test.index
    return varForecast_df, model_result
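The training function splits the hourly data chronologically: the earliest part is used for fitting and the final part is held out, with no shuffling. A minimal sketch of that split, using a hypothetical helper `chronological_split` and a toy list in place of the DataFrame:

```python
def chronological_split(rows, train_ratio):
    """Split time-ordered rows into an early training part and a late test part."""
    cut = int(len(rows) * train_ratio)
    return rows[:cut], rows[cut:]


# Eight time-ordered observations, identified here by their position.
hours = list(range(8))
train_part, test_part = chronological_split(hours, train_ratio=0.75)
print(train_part, test_part)
```

Keeping the order intact matters for forecasting, because a shuffled split would let the model see observations that come after the ones it is asked to predict.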
- Model evaluation
varForecast_df, model_result = var_train(cols=['temperature', 'humidity','windSpeed', 'cloudCover', 'windBearing', 'precipIntensity','dewPoint','use_HO','Wine cellar'], \
max_order=48, train_ratio=0.99,test_ratio=0.01)
#evaluation with MAE
var_use_mae = mean_absolute_error(var_df['use_HO'][-len(varForecast_df):], varForecast_df['use_HO'])
((hv.Curve(var_df['use_HO'], label='use_HO').opts(color='blue')\
* hv.Curve(varForecast_df['use_HO'], label='use_HO predicted').opts(color='red', title='VAR Result - Total Energy Consumption')).opts(legend_position='bottom') + \
(hv.Curve(var_df['use_HO'][-int(len(var_df)*0.05):], label='use_HO').opts(color='blue') \
* hv.Curve(varForecast_df['use_HO'], label='use_HO predicted').opts(color='red', title='VAR Result Enlarged - Total Energy Consumption')).opts(legend_position='bottom'))\
.opts(opts.Curve(xlabel="Time", yformatter='%.2fkw', width=800, height=300, show_grid=True, tools=['hover'])).opts(shared_axes=False).cols(1)
var_wine_mae = mean_absolute_error(var_df['Wine cellar'][-len(varForecast_df):], varForecast_df['Wine cellar'])
((hv.Curve(var_df['Wine cellar'], label='Wine cellar').opts(color='blue')\
* hv.Curve(varForecast_df['Wine cellar'], label='Wine cellar predicted').opts(color='red', title='VAR Result - Wine Cellar Energy Consumption')).opts(legend_position='bottom') + \
(hv.Curve(var_df['Wine cellar'][-int(len(var_df)*0.05):], label='Wine Cellar').opts(color='blue') \
* hv.Curve(varForecast_df['Wine cellar'], label='Wine cellar predicted').opts(color='red', title='VAR Result Enlarged - Wine Cellar Energy Consumption')).opts(legend_position='bottom'))\
.opts(opts.Curve(xlabel="Time", yformatter='%.2fkw', width=800, height=300, show_grid=True, tools=['hover'])).opts(shared_axes=False).cols(1)
print(model_result)
2) MODEL Prophet
prf_df = df.resample('H').mean()
def prophet_train(train_ratio=0.99, test_ratio=0.01, trg='use_HO', regressors=['temperature', 'humidity']):
    #make dataframe for training
    tr,te = [int(len(prf_df) * i) for i in [train_ratio, test_ratio]]
    train, test = prf_df[0:tr], prf_df[tr:]
    prophet_df = pd.DataFrame()
    prophet_df["ds"] = train.index
    prophet_df['y'] = train[trg].values
    #add regressors
    for i in regressors:
        prophet_df[i] = train[i].values
    #train model by Prophet
    m = Prophet()
    #include additional regressors into the model
    for i in regressors:
        m.add_regressor(i)
    m.fit(prophet_df)
    #make dataframe for prediction
    future = pd.DataFrame()
    future['ds'] = test.index
    #add regressors
    for i in regressors:
        future[i] = test[i].values
    #predict the future
    prophe_result = m.predict(future)
    prfForecast_df = pd.DataFrame()
    prfForecast_df[trg] = prophe_result.yhat
    prfForecast_df.index = prophe_result.ds
    return prfForecast_df
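- Total energy consumption appears to be predicted roughly, but Wine cellar energy consumption is predicted only slightly
#total energy consumption: this run is not shown in the post, but prf_use_mae is needed by the evaluation table, so it is assumed to mirror the Wine cellar run
prfForecast_df = prophet_train(trg='use_HO',regressors=['temperature', 'humidity','windSpeed', 'cloudCover', 'windBearing', 'precipIntensity','dewPoint'])
prf_use_mae = mean_absolute_error(prf_df['use_HO'][-len(prfForecast_df):], prfForecast_df['use_HO'])
prfForecast_df = prophet_train(trg='Wine cellar',regressors=['temperature', 'humidity','windSpeed', 'cloudCover', 'windBearing', 'precipIntensity','dewPoint'])
#evaluation with MAE
prf_wine_mae = mean_absolute_error(prf_df['Wine cellar'][-len(prfForecast_df):], prfForecast_df['Wine cellar'])
((hv.Curve(prf_df['Wine cellar'], label='Wine cellar').opts(color='blue')\
* hv.Curve(prfForecast_df['Wine cellar'], label='Wine cellar predicted').opts(color='red', title='Prophet Result - Wine cellar Energy Consumption')).opts(legend_position='bottom') + \
(hv.Curve(prf_df['Wine cellar'][-int(len(prf_df)*0.05):], label='Wine cellar').opts(color='blue') \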
* hv.Curve(prfForecast_df['Wine cellar'], label='Wine cellar predicted').opts(color='red', title='Prophet Result Enlarged - Wine cellar Energy Consumption')).opts(legend_position='bottom'))\
.opts(opts.Curve(xlabel="Time", yformatter='%.2fkw', width=800, height=300, show_grid=True, tools=['hover'])).opts(shared_axes=False).cols(1)
3) MODEL LightGBM Regressor
- Building a time-series regression model lets us predict future energy consumption and understand the relationship between energy consumption and weather information
_lgbm_df = df.resample('H').mean()
_lgbm_df['weekday'] = LabelEncoder().fit_transform(pd.Series(_lgbm_df.index).apply(lambda x : x.day_name())).astype(np.int8)
_lgbm_df['timing'] = LabelEncoder().fit_transform(_lgbm_df['hour'].apply(hours2timing)).astype(np.int8)
def lgbm_train(cols=['temperature','dewPoint','use_HO'],trg='use_HO',train_ratio=0.8,valid_ratio=0.1,test_ratio=0.1):
    #make dataframe for training
    lgbm_df = _lgbm_df[cols]
    tr,vd,te = [int(len(lgbm_df) * i) for i in [train_ratio, valid_ratio, test_ratio]]
    X_train, Y_train = lgbm_df[0:tr].drop([trg], axis=1), lgbm_df[0:tr][trg]
    X_valid, Y_valid = lgbm_df[tr:tr+vd].drop([trg], axis=1), lgbm_df[tr:tr+vd][trg]
    X_test = lgbm_df[tr+vd:tr+vd+te+2].drop([trg], axis=1)
    lgb_train = lgb.Dataset(X_train, Y_train)
    lgb_valid = lgb.Dataset(X_valid, Y_valid, reference=lgb_train)
    #model training
    params = {
        'task' : 'train',
        'boosting':'gbdt',
        'objective' : 'regression',
        'metric' : {'mse'},
        'num_leaves':200,
        'drop_rate':0.05,
        'learning_rate':0.1,
        'seed':0,
        'feature_fraction':1.0,
        'bagging_fraction':1.0,
        'bagging_freq':0,
        'min_child_samples':5
    }
    #note: LightGBM >= 4.0 no longer accepts early_stopping_rounds here; pass callbacks=[lgb.early_stopping(100)] instead
    gbm = lgb.train(params, lgb_train, num_boost_round=100, valid_sets=[lgb_train, lgb_valid], early_stopping_rounds=100)
    #make predict dataframe
    pre_df = pd.DataFrame()
    pre_df[trg] = gbm.predict(X_test, num_iteration=gbm.best_iteration)
    pre_df.index = lgbm_df.index[tr+vd:tr+vd+te+2]
    return pre_df, gbm, X_train
Predict total energy consumption
lgbmForecast_df, model, x_train = lgbm_train(\
cols=['temperature', 'humidity', 'visibility', 'apparentTemperature',\
'pressure', 'windSpeed', 'cloudCover', 'windBearing', 'precipIntensity',\
'dewPoint', 'precipProbability','year', 'month','day', 'weekday', 'weekofyear', \
'hour', 'timing','use_HO'],\
trg='use_HO',train_ratio=0.9,valid_ratio=0.09,test_ratio=0.01)
#calculate SHAP value for model interpretation
explainer = shap.TreeExplainer(model=model,feature_perturbation='tree_path_dependent')
shap_values = explainer.shap_values(X=x_train)
- The prediction of total energy consumption is more accurate than the result of the Prophet model
- Time information such as 'weekday' and 'timing', which was not included in the Prophet model, may be what makes the difference
- Looking at the SHAP feature analysis, positive changes in features such as 'weekofyear', 'timing', 'hour', 'dewPoint' and 'temperature', and negative changes in features such as 'weekday' and 'cloudCover', appear to contribute to an increase in total energy consumption
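SHAP attributes a prediction to features using Shapley values from game theory. The sketch below computes exact Shapley values by brute force for a two-feature toy model. It illustrates the definition only and is not how shap.TreeExplainer computes them. The value function and its numbers are invented for this example.

```python
from itertools import permutations


def shapley_values(names, value_fn):
    """Average each feature's marginal contribution over all feature orderings."""
    totals = {name: 0.0 for name in names}
    orders = list(permutations(names))
    for order in orders:
        present = set()
        for name in order:
            before = value_fn(frozenset(present))
            present.add(name)
            totals[name] += value_fn(frozenset(present)) - before
    return {name: total / len(orders) for name, total in totals.items()}


def toy_prediction(known):
    """Hypothetical model output when only the features in `known` are available."""
    value = 1.0                       # baseline prediction
    if "hour" in known:
        value += 2.0
    if "temperature" in known:
        value += 1.0
    if {"hour", "temperature"} <= known:
        value += 1.0                  # interaction between the two features
    return value


contributions = shapley_values(["hour", "temperature"], toy_prediction)
print(contributions)
```

The contributions add up to the difference between the full prediction and the baseline, and the interaction term is shared equally between the two features.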
- Model evaluation (MAE)
#evaluation with MAE
lgbm_use_mae = mean_absolute_error(_lgbm_df['use_HO'][-len(lgbmForecast_df):], lgbmForecast_df['use_HO'])
((hv.Curve(_lgbm_df['use_HO'], label='use_HO').opts(color='blue')\
* hv.Curve(lgbmForecast_df['use_HO'], label='use_HO predicted').opts(color='red', title='LightGBM Result - Total Energy Consumption')).opts(legend_position='bottom') + \
(hv.Curve(_lgbm_df['use_HO'][-int(len(_lgbm_df)*0.05):], label='use_HO').opts(color='blue') \
* hv.Curve(lgbmForecast_df['use_HO'], label='use_HO predicted').opts(color='red', title='LightGBM Result Enlarged - Total Energy Consumption')).opts(legend_position='bottom'))\
.opts(opts.Curve(xlabel="Time", yformatter='%.2fkw', width=800, height=300, show_grid=True, tools=['hover'])).opts(shared_axes=False).cols(1)
- force_plot
shap.force_plot(base_value=explainer.expected_value, shap_values=shap_values, features=x_train, feature_names=x_train.columns)
- summary_plot
shap.summary_plot(shap_values=shap_values, features=x_train, feature_names=x_train.columns, plot_type="violin")
Predict Wine cellar energy consumption
lgbmForecast_df, model, x_train = lgbm_train(\
cols=['temperature', 'humidity', 'visibility', 'apparentTemperature',\
'pressure', 'windSpeed', 'cloudCover', 'windBearing', 'precipIntensity',\
'dewPoint', 'precipProbability','year', 'month','day', 'weekday', 'weekofyear', \
'hour', 'timing','Wine cellar'],\
trg='Wine cellar',train_ratio=0.9,valid_ratio=0.09,test_ratio=0.01)
#calculate SHAP value for model interpretation
explainer = shap.TreeExplainer(model=model,feature_perturbation='tree_path_dependent')
shap_values = explainer.shap_values(X=x_train)
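- The prediction of Wine cellar energy consumption is more accurate than the result of the Prophet model
- Time information such as 'weekofyear' and 'hour', which was not included in the Prophet model, may be what makes the difference
- Looking at the SHAP feature analysis, positive changes in features such as 'weekofyear', 'hour', 'dewPoint' and 'windSpeed', and negative changes in features such as 'humidity' and 'cloudCover', appear to contribute to an increase in Wine cellar energy consumption
- Model evaluation (MAE)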
#evaluation with MAE
lgbm_wine_mae = mean_absolute_error(_lgbm_df['Wine cellar'][-len(lgbmForecast_df):], lgbmForecast_df['Wine cellar'])
((hv.Curve(_lgbm_df['Wine cellar'], label='Wine cellar').opts(color='blue')\
* hv.Curve(lgbmForecast_df['Wine cellar'], label='Wine cellar predicted').opts(color='red', title='LightGBM Result - Wine cellar Energy Consumption')).opts(legend_position='bottom') + \
(hv.Curve(_lgbm_df['Wine cellar'][-int(len(_lgbm_df)*0.05):], label='Wine cellar').opts(color='blue') \
* hv.Curve(lgbmForecast_df['Wine cellar'], label='Wine cellar predicted').opts(color='red', title='LightGBM Result Enlarged - Wine cellar Energy Consumption')).opts(legend_position='bottom'))\
.opts(opts.Curve(xlabel="Time", yformatter='%.2fkw', width=800, height=300, show_grid=True, tools=['hover'])).opts(shared_axes=False).cols(1)
shap.force_plot(base_value=explainer.expected_value, shap_values=shap_values, features=x_train, feature_names=x_train.columns)
shap.summary_plot(shap_values=shap_values, features=x_train, feature_names=x_train.columns, plot_type="violin")
1️⃣2️⃣ Evaluation - MAE
display(HTML('<h3>Evaluation - MAE</h3>'+tabulate([['Total Energy Consumption',var_use_mae,prf_use_mae,lgbm_use_mae],['Wine cellar Energy Consumption',var_wine_mae,prf_wine_mae,lgbm_wine_mae]],\
["Target", "VAR", "Prophet","LightGBM Regressor"], tablefmt="html")))
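The table compares the models by mean absolute error (MAE), the average of the absolute differences between actual and predicted values. A minimal sketch of the metric with made-up numbers follows. The function name `mae` is chosen here to avoid clashing with the sklearn import used in the post.

```python
def mae(actual, predicted):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy values in kW; a lower MAE means the forecast is closer on average.
actual_kw = [1.0, 2.0, 3.0]
predicted_kw = [1.5, 2.0, 2.0]
error = mae(actual_kw, predicted_kw)
print(error)
```

Because MAE is expressed in the same unit as the target, the values for total consumption and for the Wine cellar are not directly comparable with each other, only across models for the same target.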
1️⃣3️⃣ Conclusions
- Each appliance's energy consumption shows a consistent tendency
- The ChangeFinder model we built captures trend changes in energy consumption at an early stage
- The models built with Prophet and LightGBM were shown to be able to predict future energy consumption
- Weather information and time information turned out to be very useful for prediction