[Kaggle] Time-series data analysis using LSTM

μ§•μ§•μ•ŒνŒŒμΉ΄ 2022. 9. 20. 11:30

Written 220920

<This post was written while studying, referring to AMIR REZAEIAN's code and notebook on Kaggle :-) >

https://www.kaggle.com/code/amirrezaeian/time-series-data-analysis-using-lstm-tutorial/notebook

http://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption#

 


 

 

 

😎 ν”„λ‘œμ νŠΈ μ†Œκ°œ

  • The Individual Household Electric Power Consumption dataset
  • How to build the simplest LSTM (Long Short-Term Memory) recurrent neural network for the data
  • Measurements of one household's electric power consumption, at a one-minute sampling rate, over a period of 4 years
  • Contains 2,075,259 measurements collected from a house located in Sceaux (7 km from Paris, France) between December 2006 and November 2010 (47 months)

 

😎 Dataset Information

  • Notes
    • (global_active_power*1000/60 - sub_metering_1 - sub_metering_2 - sub_metering_3) is the active energy consumed every minute (in watt-hours) in the household by electrical equipment not measured in sub-meterings 1, 2 and 3 (see the sketch after this list)
    • The dataset contains some missing measurements (nearly 1.25% of the rows)
      • All calendar timestamps are present in the dataset, but the measurement values are missing for some timestamps
      • A missing value is represented by the absence of a value between two consecutive semicolon separators
  • Attribute information
    • date : date in format dd/mm/yyyy
    • time : time in format hh:mm:ss
    • global_active_power : household global minute-averaged active power (in kilowatts)
    • global_reactive_power : household global minute-averaged reactive power (in kilowatts)
    • voltage : minute-averaged voltage (in volts)
    • global_intensity : household global minute-averaged current intensity (in amperes)
    • sub_metering_1 : energy sub-metering No. 1 (in watt-hours of active energy). Corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are gas-powered, not electric)
    • sub_metering_2 : energy sub-metering No. 2 (in watt-hours of active energy). Corresponds to the laundry room, containing a washing machine, a tumble-dryer, a refrigerator and a light
    • sub_metering_3 : energy sub-metering No. 3 (in watt-hours of active energy). Corresponds to an electric water heater and an air conditioner
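
A minimal sketch (my addition, not in the original notebook) of the "unmetered" active energy described in the note above, using the dataframe `df` loaded in the code section below; global_active_power is in kilowatts, so multiplying by 1000/60 converts one minute of power into watt-hours before subtracting the three sub-meter readings:

unmetered = (df['Global_active_power'] * 1000 / 60
             - df['Sub_metering_1']
             - df['Sub_metering_2']
             - df['Sub_metering_3'])
print(unmetered.describe())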

 

 

 

😎 Code Implementation

1️⃣ Package load

import sys 
import numpy as np # linear algebra
from scipy.stats import randint
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv), data manipulation as in SQL
import matplotlib.pyplot as plt # used for plotting graphs
import seaborn as sns # used for plotting interactive graphs
from sklearn.model_selection import train_test_split # to split the data into two parts
from sklearn.model_selection import KFold # use for cross validation
from sklearn.preprocessing import StandardScaler # for normalization
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline # pipeline making
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectFromModel
from sklearn import metrics # to check the error and accuracy of the model
from sklearn.metrics import mean_squared_error,r2_score

## for Deep-learing:
import keras
from keras.layers import Dense
from keras.models import Sequential
from keras.utils.np_utils import to_categorical
from tensorflow.keras.optimizers import SGD
from keras.callbacks import EarlyStopping
from keras.utils import np_utils
import itertools
from keras.layers import LSTM
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
from keras.layers import Dropout

 

2️⃣ Data load

df = pd.read_csv('household_power_consumption/household_power_consumption.txt', sep=';', 
                 parse_dates={'dt' : ['Date', 'Time']}, infer_datetime_format=True, 
                 low_memory=False, na_values=['nan','?'], index_col='dt')
  • 1) The data contains the strings 'nan' and '?' -> convert both to numpy NaN so they are handled identically
  • 2) Merge the 'Date' and 'Time' columns into a single 'dt' column
  • 3) Set 'dt' as the index so the data becomes a time series

df.info()
df.isnull().sum()
df.dtypes
df.shape
df.describe()
df.columns

 

  • Handling NaN values
droping_list_all=[]

for j in range(0,7):
    if not df.iloc[:, j].notnull().all():
        droping_list_all.append(j)        
        #print(df.iloc[:,j].unique())
droping_list_all

  • Filling with the mean
for j in range(0,7):
    df.iloc[:,j] = df.iloc[:,j].fillna(df.iloc[:,j].mean())

df.isnull().sum()

 

3️⃣ Data visualization

  • Resample over a day and take the mean and the sum of Global_active_power
  • The mean and the sum of the resampled data appear to have a similar structure
df.Global_active_power.resample('D').sum().plot(title='Global_active_power resampled over day for sum') 
plt.tight_layout()
plt.show()   

df.Global_active_power.resample('D').mean().plot(title='Global_active_power resampled over day for mean', color='red') 
plt.tight_layout()
plt.show()


# μ–˜λ„ κ°€λŠ₯
# t = df.Global_active_power.resample('D').agg(['sum', 'mean'])

# t.plot(subplots = True, title='Global_active_power resampled over day')
# plt.show()

The sum and mean plots look very similar.

  • Mean and std of 'Global_intensity' resampled over a day
r = df.Global_intensity.resample('D').agg(['mean', 'std'])

r.plot(subplots = True, title='Global_intensity resampled over day')
plt.show()

  • Mean and std of 'Global_reactive_power' resampled over a day
r2 = df.Global_reactive_power.resample('D').agg(['mean', 'std'])

r2.plot(subplots = True, title='Global_reactive_power resampled over day', color='purple')
plt.show()

 

  • Mean of 'Global_active_power' resampled over a month
df['Global_active_power'].resample('M').mean().plot(kind='bar', label = "mean", color = "pink")

plt.xticks(rotation=60)
plt.ylabel('Global_active_power')

plt.title('Global_active_power per month (averaged over month)')
plt.legend()
plt.show()

 

 

  • Mean of 'Global_active_power' resampled over a quarter
df['Global_active_power'].resample('Q').mean().plot(kind='bar', label = "mean", color = "royalblue")

plt.xticks(rotation=60)
plt.ylabel('Global_active_power')

plt.title('Global_active_power per quarter (averaged over quarter)')
plt.legend()
plt.show()

 

  • Mean of 'Voltage' resampled over a month
df['Voltage'].resample('M').mean().plot(kind='bar', label = "mean", color = "olive")

plt.xticks(rotation=60)
plt.ylabel('Voltage')

plt.title('Voltage per month (averaged over month)')
plt.legend()
plt.show()

 

 

  • Mean of 'Sub_metering_1' resampled over a month
df['Sub_metering_1'].resample('M').mean().plot(kind='bar', label = "mean", color = "brown")

plt.xticks(rotation=60)
plt.ylabel('Sub_metering_1')

plt.title('Sub_metering_1 per month (averaged over month)')
plt.legend()
plt.show()

πŸ”Ό The monthly mean of 'Voltage' is almost constant compared to the other features.

 

 

  • Means of several features resampled over a day
cols = [0, 1, 2, 3, 5, 6]
i = 1
groups=cols
values = df.resample('D').mean().values

# plot each column
plt.figure(figsize=(15, 10))
for group in groups:
	plt.subplot(len(cols), 1, i)
	plt.plot(values[:, group])
	plt.title(df.columns[group], y=0.75, loc='right')
	i += 1
plt.show()

 

  • Resampled over a week, taking the mean
df.Global_reactive_power.resample('W').mean().plot(color='y', legend=True)
df.Global_active_power.resample('W').mean().plot(color='r', legend=True)
df.Sub_metering_1.resample('W').mean().plot(color='b', legend=True)
df.Global_intensity.resample('W').mean().plot(color='g', legend=True)

plt.show()

Is there a periodic pattern in each of them?

 


 

  • Histogram of the means of different features resampled over a month
df.Global_active_power.resample('M').mean().plot(kind='hist', color='r', legend=True )
df.Global_reactive_power.resample('M').mean().plot(kind='hist',color='b', legend=True)
df.Global_intensity.resample('M').mean().plot(kind='hist', color='g', legend=True)
df.Sub_metering_1.resample('M').mean().plot(kind='hist', color='y', legend=True)
plt.show()

# df.Voltage.resample('M').sum().plot(kind='hist', color='g', legend=True)  # omitted: its scale doesn't match the other features, so the plot looks odd

 

  • Correlation between Global_intensity and Global_active_power
    • pct_change : percentage change
      • A method that returns the difference between consecutive rows of an object as a percentage of the earlier value (a tiny demo follows this list)
      • (current row - previous row) ÷ previous row, analogous to (selling price - buying price) ÷ buying price
      • To get the change over a specific N periods, pass pct_change(periods=N)
    • df.pct_change(periods=1, fill_method='pad', limit=None, freq=None, **kwargs)
      • periods : the gap to compare across (default +1, i.e. the immediately preceding value)
      • fill_method : how to fill missing values before computing {ffill : fill with the previous value / bfill : fill with the next value}
      • limit : how many missing values to fill
      • freq : increment to use with a time-series API
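
A tiny demo (my addition) of the pct_change arithmetic on a toy series:

s = pd.Series([100, 110, 99])
print(s.pct_change())
# -> NaN, 0.10, -0.10 : each value compared with the previous one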
data_returns = df.pct_change()

# jointplot : draws a scatter plot and histograms at the same time; works with numeric data only
sns.jointplot(x='Global_intensity', y='Global_active_power', data=data_returns)  

plt.show()

  • Correlation between Voltage and Global_active_power
sns.jointplot(x='Voltage', y='Global_active_power', data=data_returns)  

plt.show()

 

μœ„μ˜ 두 κ·Έλž˜ν”„μ—μ„œ 'Global_incentity'와 'Global_active_power'λŠ” 상관관계가 μžˆμŒμ„ μ•Œ 수 있음
'Voltage', 'Global_active_power'λŠ” 상관 관계가 적음
μ‚°μ λ„λž€
: 두 λ³€μˆ˜μ˜ 관계λ₯Ό λ³΄μ—¬μ£ΌλŠ” 자료 ν‘œμ‹œ 방법
: 각 츑정값은 두 λ³€μˆ˜λ₯Ό μ˜λ―Έν•˜λŠ” (x, y)

- When variable y increases as variable x increases, the two variables have a positive correlation.

 

- When variable y decreases as variable x increases, the two variables have a negative correlation.
- If there is no particular relationship between the two variables, they are unrelated.
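
To quantify what the scatter plots suggest, a minimal check (my addition, not in the original notebook) using the Pearson correlation of the percentage changes computed above:

print(data_returns['Global_intensity'].corr(data_returns['Global_active_power']))
print(data_returns['Voltage'].corr(data_returns['Global_active_power']))
# the first coefficient should be much closer to 1 than the second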

[Reference] https://ko.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/introduction-to-scatterplots/a/scatterplots-and-correlation-review

 

 

 

4️⃣ Correlations among features

  • μ—΄ κ°„μ˜ 상관 관계
plt.matshow(df.corr(method='spearman'),vmax=1,vmin=-1,cmap='PRGn')
plt.title('without resampling', size=15)
plt.colorbar()
plt.show()

 

  • Correlations of the feature means resampled over a month and over a year
plt.matshow(df.resample('M').mean().corr(method='spearman'),vmax=1,vmin=-1,cmap='PRGn')
plt.title('resampled over month', size=15)
plt.colorbar()
plt.margins(0.02)
plt.matshow(df.resample('A').mean().corr(method='spearman'),vmax=1,vmin=-1,cmap='PRGn')
plt.title('resampled over year', size=15)
plt.colorbar()
plt.show()

μœ„μ—μ„œ 보면 λ¦¬μƒ˜ν”Œλ§ 기술둜 νŠΉμ§• κ°„μ˜ 상관관계λ₯Ό λ³€κ²½ν•  수 있음

 

 

5️⃣ Machine Learning: LSTM

- Apply a recurrent neural network (LSTM), which is well suited to time-series and sequential problems
    : This approach is the best if you have a lot of data
- Frame the supervised learning problem as predicting Global_active_power at the current time (t), given the Global_active_power measurement and the other features at the previous time step

 

  • Restructure the data into hourly units to shorten computation time and get quick results for testing the model (the original data is given per minute)
  • The size of the data shrinks from 2,075,259 to 34,589 rows (β‰ˆ 2,075,259 / 60), but the overall structure of the data is preserved
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
	n_vars = 1 if type(data) is list else data.shape[1]
	dff = pd.DataFrame(data)
	cols, names = list(), list()
	# input sequence (t-n, ... t-1)
	for i in range(n_in, 0, -1):
		cols.append(dff.shift(i))
		names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
	# forecast sequence (t, t+1, ... t+n)
	for i in range(0, n_out):
		cols.append(dff.shift(-i))
		if i == 0:
			names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
		else:
			names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
	# put it all together
	agg = pd.concat(cols, axis=1)
	agg.columns = names
	# drop rows with NaN values
	if dropnan:
		agg.dropna(inplace=True)
	return agg
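
A quick illustration (my addition) of what series_to_supervised produces, on a toy two-feature array:

toy = np.array([[1, 10], [2, 20], [3, 30]])
print(series_to_supervised(toy, n_in=1, n_out=1))
# columns: var1(t-1), var2(t-1), var1(t), var2(t); the first row is dropped
# because shifting leaves NaNs in it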
## resampling of data over hour
df_resample = df.resample('h').mean() 
df_resample.shape

 

  • Scale all features to the range [0, 1]
  • Train on the resampled (hourly) data
values = df_resample.values 


## full data without resampling
#values = df.values

# ensure all data is float
#values = values.astype('float32')
# normalize features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
# frame as supervised learning
reframed = series_to_supervised(scaled, 1, 1)

# drop columns we don't want to predict
reframed.drop(reframed.columns[[8,9,10,11,12,13]], axis=1, inplace=True)
print(reframed.head())

This shows the 7 input variables (the input series at the previous step) and the 1 output variable, 'Global_active_power', at the current time (hourly here, because of the resampling).

 

πŸ’™ Splitting the data into train and test sets

  • μ€€λΉ„λœ 데이터 μ„ΈνŠΈλ₯Ό train와 test set둜 λ‚˜λˆ”
  • λͺ¨λΈμ˜ ꡐ윑 속도λ₯Ό 높이기 μœ„ν•΄ 데이터 μ²«ν•΄μ—λ§Œ λͺ¨λΈμ„ train ν•œ ν›„ ν–₯ν›„ 3λ…„ λ™μ•ˆ 데이터λ₯Ό 평가
# split into train and test sets
values = reframed.values

n_train_time = 365*24
train = values[:n_train_time, :]
test = values[n_train_time:, :]

# split into input and outputs
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]

# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
  • Reshape the input into the 3D format expected by the LSTM, i.e. [samples, time steps, features]

 

πŸ’™ Model architecture

  • 1) An LSTM with 100 neurons in the first visible layer
  • 2) 20% dropout
  • 3) 1 neuron in the output layer for predicting Global_active_power
  • 4) The input shape is 1 time step with 7 features
  • 5) The mean squared error (MSE) loss function and the efficient Adam version of stochastic gradient descent are used
model = Sequential()
model.add(LSTM(100, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dropout(0.2))

model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

 

  • 6) λͺ¨λΈμ€ batch sizeκ°€ 70인 20개의 training epoch 에 적합할 것
# fit network
history = model.fit(train_X, train_y, epochs=20, batch_size=70, validation_data=(test_X, test_y), verbose=2, shuffle=False)

 

  • 7) Visualizing the loss
# summarize history for loss

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper right')
plt.show()

 

  • 8) Making predictions + RMSE
# make a prediction
yhat = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], 7))

# invert scaling for forecast: the scaler was fit on all 7 columns,
# so pad the 1-column prediction with the other 6 features before inverting
inv_yhat = np.concatenate((yhat, test_X[:, -6:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]

# invert scaling for actual
test_y = test_y.reshape((len(test_y), 1))
inv_y = np.concatenate((test_y, test_X[:, -6:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]

# calculate RMSE
rmse = np.sqrt(mean_squared_error(inv_y, inv_yhat))
print('Test RMSE: %.3f' % rmse)

Whoa, that's way too high!

 

λͺ¨λΈμ„ κ°œμ„ ν•˜λ €λ©΄ epoch와 batch_sizeλ₯Ό μ‘°μ •ν•˜κΈ°

 

  • Time steps: every step is 1 hour (the time steps can easily be converted to a real time index; a sketch follows the plot below)
  • For demo purposes, the goal is to compare the predictions over the first 200 hours!
aa=[x for x in range(200)]

plt.plot(aa, inv_y[:200], marker='.', label="actual")
plt.plot(aa, inv_yhat[:200], 'r', label="prediction")
plt.ylabel('Global_active_power', size=15)
plt.xlabel('Time step', size=15)
plt.legend(fontsize=15)
plt.show()

Quite uneven...
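
As mentioned above, the integer step axis can be swapped for real timestamps; a sketch (my addition), assuming the test rows line up with df_resample starting at n_train_time + 1 (the one-hour offset comes from series_to_supervised dropping its first row):

t_index = df_resample.index[n_train_time + 1 : n_train_time + 1 + 200]
plt.plot(t_index, inv_y[:200], marker='.', label="actual")
plt.plot(t_index, inv_yhat[:200], 'r', label="prediction")
plt.ylabel('Global_active_power', size=15)
plt.legend(fontsize=15)
plt.show()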

 

6️⃣ Final

  • Used an LSTM neural network, which is state of the art for sequential problems
  • To shorten computation time and get results quickly, trained the model on the first year of data (resampled over hours) and tested it on the rest of the data
  • Built a very simple LSTM network to show that reasonable predictions can be obtained
    • BUT the number of rows is very large, so the computation is very time-consuming
    • Ideally, the last part of the code would be written with Spark (MLlib) running on a GPU
    • CNNs would also be useful here because the data is correlated (CNN layers are a good way to probe the local structure of the data); a sketch follows
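
For reference, a minimal sketch (my addition, not from the notebook) of such a Conv1D front-end, assuming a longer input window of 24 hourly steps with 7 features (the current 1-step window is too short for convolutions to be meaningful):

from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras.layers.convolutional import Conv1D, MaxPooling1D

model_cnn = Sequential()
model_cnn.add(Conv1D(filters=64, kernel_size=3, activation='relu',
                     input_shape=(24, 7)))  # (timesteps, features)
model_cnn.add(MaxPooling1D(pool_size=2))
model_cnn.add(LSTM(100))
model_cnn.add(Dropout(0.2))
model_cnn.add(Dense(1))
model_cnn.compile(loss='mean_squared_error', optimizer='adam')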

