[DACON] HAICon2020 Industrial Control System Security Threat Detection AI & LSTM
μ§μ§μνμΉ΄ · 2022. 9. 7. 09:08
<This post was written while studying the SIlab team's code from the DACON competition and DACON's HAI 2.0 Baseline post :-) >
HAI 2.0 Baseline: https://dacon.io/competitions/official/235624/codeshare/1570?page=1&dtype=recent
[2nd place] SIlab: https://dacon.io/competitions/official/235624/codeshare/1831
π 1. Competition Overview
- Cyber-security threats against national infrastructure and industrial control systems keep increasing
- A dataset that accurately reflects the characteristics of real control systems and covers diverse control-system attack types is an essential resource for research on AI-based security techniques
- Using HAI 1.0, an industrial control system security dataset collected from a control-system testbed, participants train on normal-operation data only and compete on how well their machine learning / deep learning models detect attacks and other abnormal situations
π 2. Data Layout
- Training dataset (3 files)
  - File names: 'train1.csv', 'train2.csv', 'train3.csv'
  - Description: data collected under normal operation (each file is time-continuous)
  - Column 1 ('time'): observation timestamp
  - Columns 2-80 ('C01', 'C02', …, 'C79'): state observations
- Validation dataset (1 file)
  - File name: 'validation.csv'
  - Data collected under 5 attack scenarios (time-continuous)
  - Column 1 ('time'): observation timestamp
  - Columns 2-80 ('C01', 'C02', …, 'C79'): state observations
  - Column 81: attack label (normal: '0', attack: '1')
- Test dataset (4 files)
  - File names: 'test1.csv', 'test2.csv', 'test3.csv', 'test4.csv'
  - Column 1 ('time'): observation timestamp
  - Columns 2-80 ('C01', 'C02', …, 'C79'): state observations
- HAI 2.0 evaluation tool
  - File name: eTaPR-1.12-py3-none-any.whl
π 3. Implementation
1οΈβ£ Data Loading
- !pip install /path/to/eTaPR-1.12-py3-none-any.whl
!pip install /home/ubuntu/coding/220906/eTaPR-1.12-py3-none-any.whl
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import datetime
import time
import keras
import tensorflow as tf
from keras.models import Model, Sequential, load_model
from keras import optimizers
from keras.layers import Input, Bidirectional, LSTM, Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from sklearn.metrics import mean_squared_error
from pathlib import Path
from datetime import timedelta
import dateutil.parser   # imported explicitly so dateutil.parser.parse works below
from tqdm.notebook import trange
from TaPR_pkg import etapr
- Data set
TRAIN_DATASET = sorted([x for x in Path("/home/ubuntu/coding/220906/HAI 2.0/training").glob("*.csv")])
TRAIN_DATASET
TEST_DATASET = sorted([x for x in Path("/home/ubuntu/coding/220906/HAI 2.0/testing").glob("*.csv")])
TEST_DATASET
VALIDATION_DATASET = sorted([x for x in Path("/home/ubuntu/coding/220906/HAI 2.0/validation").glob("*.csv")])
VALIDATION_DATASET
def dataframe_from_csv(target):
    # Strip stray whitespace from column names while reading
    return pd.read_csv(target).rename(columns=lambda x: x.strip())

def dataframe_from_csvs(targets):
    # Concatenate several CSVs into one dataframe
    return pd.concat([dataframe_from_csv(x) for x in targets])
- Data load
- Train dataset load: normal-operation data only (each file is time-continuous); 'time' is the timestamp and 'C01'-'C79' are the state observations
TRAIN_DF_RAW = dataframe_from_csvs(TRAIN_DATASET)
TRAIN_DF_RAW
# Later the data turned out to be too large, so I cut it down to 50,000 rows:
# TRAIN_DF_RAW = dataframe_from_csvs(TRAIN_DATASET)
# TRAIN_DF_RAW = TRAIN_DF_RAW[40000:90000]
# TRAIN_DF_RAW
- The training dataset contains no attacks; 'time' is the only time field
- Every other field is an unidentified sensor/actuator value
- Anomalies are detected over the whole process, so only the "attack" label field is used
2οΈβ£ Data Preprocessing
- The training dataset contains every sensor/actuator field
- Fields can exist in the test dataset that do not exist in the training dataset
- Fields never seen during training cannot be tested against, so the field names are fixed from the training dataset (the field constants this post relies on are sketched just below)
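The snippets that follow reference TIMESTAMP_FIELD, ATTACK_FIELD, and VALID_COLUMNS_IN_TRAIN_DATASET, but the post never defines them. A minimal sketch of what they would be, following the HAI 2.0 baseline convention (assumed, not shown in this post):
TIMESTAMP_FIELD = "time"    # observation-time column
ATTACK_FIELD = "attack"     # attack label column (validation set only)
# all sensor/actuator columns present in the training data
VALID_COLUMNS_IN_TRAIN_DATASET = TRAIN_DF_RAW.columns.drop([TIMESTAMP_FIELD])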
- Min and max values obtained from the training dataset:
TAG_MIN = TRAIN_DF_RAW[VALID_COLUMNS_IN_TRAIN_DATASET].min()
TAG_MAX = TRAIN_DF_RAW[VALID_COLUMNS_IN_TRAIN_DATASET].max()
- Min-Max Normalization
  - (X - MIN) / (MAX - MIN)
  - Each feature is rescaled so its minimum maps to 0, its maximum to 1, and every other value to somewhere in between
  - A field whose value never changes has MIN == MAX; such fields are mapped to all zeros to avoid division by zero
def normalize(df):
    ndf = df.copy()
    for c in df.columns:
        if TAG_MIN[c] == TAG_MAX[c]:
            # Constant field: avoid division by zero, map to all zeros
            ndf[c] = df[c] - TAG_MIN[c]
        else:
            ndf[c] = (df[c] - TAG_MIN[c]) / (TAG_MAX[c] - TAG_MIN[c])
    return ndf
- ewm (exponentially weighted function)
  - Applies exponentially decaying weights so that recent data has more influence than older data
  - Chained with mean() to obtain an exponentially weighted moving average
- Usage
  - df.ewm(com=None, span=None, halflife=None, alpha=None, min_periods=0, adjust=True, ignore_na=False, axis=0, times=None, method='single')
  - The weights are governed mainly by alpha, the smoothing (decay) factor; it can be computed automatically from com / span / halflife, or set directly via alpha
- The normalized data is passed through the exponentially weighted function to smooth out sensor noise (the resulting TRAIN_DF is sketched just below)
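TRAIN_DF itself is never shown in the post. A plausible reconstruction, mirroring how TEST_DF is built in the test section later (normalize, then smooth with alpha=0.9):
TRAIN_DF = normalize(TRAIN_DF_RAW[VALID_COLUMNS_IN_TRAIN_DATASET]).ewm(alpha=0.9).mean()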
- Check whether the DataFrame holds any value above 1, any value below 0, or any NaN
- np.any(): True if any element of the array matches the condition, False if none do
def boundary_check(df):
    x = np.array(df, dtype=np.float32)
    return np.any(x > 1.0), np.any(x < 0), np.any(np.isnan(x))

boundary_check(TRAIN_DF)
3οΈβ£ Model Setup & Input/Output Definition
- The baseline detects anomalies with a stacked RNN (GRU cells)
- Only normal data may be used for training, and it carries no labels, so this has to be unsupervised learning
- A sliding window takes slices of the time series and the model learns the pattern of each window
- The baseline sets the sliding window to 90 seconds (HAI is sampled once per second); this post uses a 60-second window (WINDOW_SIZE = 60 below)
- Model input/output
  - Input: the first WINDOW_SIZE - 1 seconds of the window (89 seconds in the baseline, 59 here)
  - Output: the value of the window's last second
- At detection time, the model's prediction is compared with the value that actually arrived; a large difference is treated as an anomaly
- The (assumed) reasoning: a large error means a pattern never seen in the training data
- HaiDataset: the baseline implements this as a PyTorch Dataset; here it is rewritten as a plain function that returns numpy arrays for Keras
- While slicing the dataset, each sliding window is checked for validity
- A valid window is one whose first and last timestamps are exactly WINDOW_SIZE - 1 seconds apart
- stride parameter: the step taken between windows
- Every window could be used, but with a 1-second stride adjacent windows are nearly identical
- To finish training quickly, windows are sampled every 10 seconds
- Setting the stride to 1 so the model sees all the data would probably do better
WINDOW_GIVEN = 59
WINDOW_SIZE= 60
def HaiDataset(timestamps, df, stride=1, attacks=None):
    ts = np.array(timestamps)
    tag_values = np.array(df, dtype=np.float32)
    # Keep only windows whose first and last timestamps are exactly
    # WINDOW_SIZE - 1 seconds apart (i.e. no gap across file boundaries)
    valid_idxs = []
    for L in trange(len(ts) - WINDOW_SIZE + 1):
        R = L + WINDOW_SIZE - 1
        if dateutil.parser.parse(ts[R]) - dateutil.parser.parse(ts[L]) == timedelta(seconds=WINDOW_SIZE - 1):
            valid_idxs.append(L)
    valid_idxs = np.array(valid_idxs, dtype=np.int32)[::stride]
    n_idxs = len(valid_idxs)
    print("# of valid windows:", n_idxs)
    # Only the validation set carries attack labels; train/test do not
    if attacks is not None:
        attacks = np.array(attacks, dtype=np.float32)
        with_attack = True
    else:
        with_attack = False
    timestamp, X, y, att = list(), list(), list(), list()
    if with_attack:
        for i in valid_idxs:
            last = i + WINDOW_SIZE - 1
            # X: first WINDOW_GIVEN rows, y: the window's last row, att: its label
            seq_time, seq_x, seq_y, seq_attack = ts[last], tag_values[i:i + WINDOW_GIVEN], tag_values[last], attacks[last]
            timestamp.append(seq_time)
            X.append(seq_x)
            y.append(seq_y)
            att.append(seq_attack)
        return np.array(timestamp), np.array(X), np.array(y), np.array(att)
    else:
        for i in valid_idxs:
            last = i + WINDOW_SIZE - 1
            seq_time, seq_x, seq_y = ts[last], tag_values[i:i + WINDOW_GIVEN], tag_values[last]
            timestamp.append(seq_time)
            X.append(seq_x)
            y.append(seq_y)
        return np.array(timestamp), np.array(X), np.array(y)
- Building the training windows
ts, X_train, y_train = HaiDataset(TRAIN_DF_RAW[TIMESTAMP_FIELD], TRAIN_DF, stride=1)
- The dataset loads without problems
- SIlab's model uses 3 bidirectional GRU layers (the code below substitutes bidirectional LSTM layers of the same shape)
- Hidden cell size is 100
- No dropout is used
HAI_DATASET_TRAIN = HaiDataset(TRAIN_DF_RAW[TIMESTAMP_FIELD], TRAIN_DF, stride=10)
HAI_DATASET_TRAIN[0]
- Validation dataset load
- The validation set is normalized with the same training-set min/max as above
- Then boundary-checked
VALIDATION_DF_RAW = dataframe_from_csvs(VALIDATION_DATASET)
VALIDATION_DF_RAW
VALIDATION_DF = normalize(VALIDATION_DF_RAW[VALID_COLUMNS_IN_TRAIN_DATASET])
boundary_check(VALIDATION_DF)
ts, X_valid, y_valid, attack = HaiDataset(VALIDATION_DF_RAW[TIMESTAMP_FIELD], VALIDATION_DF, attacks=VALIDATION_DF_RAW[ATTACK_FIELD])
- EarlyStopping setup
- To prevent overfitting, the EarlyStopping callback stops training at an appropriate point
EarlyStopping
EarlyStopping(monitor='val_loss', min_delta=0, patience=0, mode='auto')
- monitor: the quantity watched to decide when to stop early
  - val_loss or val_accuracy is typically used (default: val_loss)
- min_delta: the minimum change that counts as an improvement
  - A change smaller than min_delta is treated as no improvement (default: 0)
- patience: how many epochs to wait for an improvement before actually stopping (default: 0)
- mode: the criterion for deciding that the monitored quantity has stopped improving
  - If monitor is val_loss, training should stop when the value stops decreasing, so use min
  - For val_accuracy, use max (default: auto)
  - auto: chosen automatically from the monitored quantity's name
  - min: stop when the monitored value stops decreasing
  - max: stop when the monitored value stops increasing
- ModelCheckpoint setup
ModelCheckpoint
tf.keras.callbacks.ModelCheckpoint(
    filepath, monitor='val_loss', verbose=0, save_best_only=False,
    save_weights_only=False, mode='auto', save_freq='epoch', options=None, **kwargs
)
- Saves the model's weights mid-training whenever the specified condition is met
- If training takes a long time, a checkpoint is written each time the model achieves a better validation score
- Even after a memory overflow or crash, the saved weights can be reloaded and training resumed
| Argument | Description |
| --- | --- |
| filepath | Path where the model is saved |
| monitor | Quantity used to decide when to save |
| verbose | 0 or 1; with 1, a "saved" message is shown each time a checkpoint is written; with 0 the model is saved silently |
| save_best_only | True or False; if True, only the best model according to the monitored quantity is kept; if False, a model is saved every epoch as filepath{epoch} (model0, model1, model2, ...) |
| save_weights_only | True or False; if True, only the weights are saved; if False, the full model (layers and weights) is saved |
| mode | 'auto', 'min', or 'max'; for val_acc use max (higher is better), for val_loss use min (lower is better); with auto, min/max is inferred automatically |
| save_freq | 'epoch' or an integer; 'epoch' saves every epoch, an integer saves after that many batches |
| options | Optional tf.train.CheckpointOptions; used e.g. to save to a different directory in a distributed setting |
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=4)
mc = ModelCheckpoint('best_LSTM_model_60seq_TEST.h5', monitor='val_loss', mode='min', verbose=1, save_best_only=True)
4οΈβ£ Training Model
- LSTM (Long Short-Term Memory)
- Learns separate short-term and long-term memories, then merges the two to predict event probabilities → retains far more of the past
- LSTM uses the concept of gates to keep memories selectively
- C (cell state) is the long-term memory
- H (hidden state) is the short-term memory
- x, h, and C are all vectors
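For reference, the textbook LSTM gate equations behind the description above (standard formulation, not from the original post; \odot is element-wise multiplication):
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)          (forget gate)
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)          (input gate)
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)   (candidate memory)
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t       (long-term cell state)
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)          (output gate)
h_t = o_t \odot \tanh(C_t)                            (short-term hidden state)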
## LSTM Model define
n_features = TRAIN_DF.shape[1]
inputs = Input(shape=(WINDOW_GIVEN, n_features))
first = Bidirectional(LSTM(100, return_sequences=True))(inputs)
second = Bidirectional(LSTM(100, return_sequences=True))(first)
third = Bidirectional(LSTM(100))(second)
lstm_out = Dense(n_features)(third)
# Skip connection: the model effectively predicts the offset from the window's first row
aux_input = Input(shape=(n_features,), name='aux_input')
outputs = keras.layers.add([lstm_out, aux_input])
model_60seq = Model(inputs=[inputs, aux_input], outputs=outputs)
model_60seq.compile(loss='mean_squared_error', optimizer='Adam')
model_60seq.summary()
- Collect the aux inputs into arrays
- For both train and valid, the aux input is the first row of each window (it is added to the LSTM output in the model above)
aux_train = []
for i in range(len(X_train)):
    aux_train.append(X_train[i][0])   # first timestep of each training window
aux_train = np.array(aux_train)

aux_valid = []
for i in range(len(X_valid)):
    aux_valid.append(X_valid[i][0])   # first timestep of each validation window
aux_valid = np.array(aux_valid)
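Since X_train and X_valid are 3-D numpy arrays, the two loops above are equivalent to simple slices:
aux_train = X_train[:, 0]   # first timestep of every window
aux_valid = X_valid[:, 0]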
- Model training
- with early stopping!
hist = model_60seq.fit([X_train, aux_train], y_train, batch_size=512, epochs=32, shuffle=True,
                       callbacks=[es, mc], validation_data=([X_valid, aux_valid], y_valid))
- Save the model
model_60seq.save('best_LSTM_model_60seq.h5') #keras h5
5οΈβ£ Skip learning
## model load
model_60seq = load_model('best_LSTM_model_60seq.h5')
- Validation prediction
## Validation set Prediction
y_pred=model_60seq.predict([X_valid,aux_valid])
tmp = []
for i in range(len(y_pred)):
    tmp.append(abs(y_valid[i] - y_pred[i]))
- Computing the anomaly score: the absolute error averaged over the features of each window
ANOMALY_SCORE=np.mean(tmp,axis=1)
- Turning scores into labels with a threshold
def put_labels(distance, threshold):
    # 1 where the anomaly score strictly exceeds the threshold, else 0
    xs = np.zeros_like(distance)
    xs[distance > threshold] = 1
    return xs
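A quick sanity check with made-up scores around the threshold used below:
put_labels(np.array([0.010, 0.030, 0.027]), 0.027)
# -> array([0., 1., 0.]): only scores strictly above the threshold are flagged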
- labels
THRESHOLD = 0.027
LABELS = put_labels(ANOMALY_SCORE, THRESHOLD)
LABELS, LABELS.shape
- attack labels
ATTACK_LABELS = put_labels(np.array(VALIDATION_DF_RAW[ATTACK_FIELD]), threshold=0.5)
ATTACK_LABELS, ATTACK_LABELS.shape
- What does fill_blank do? It re-expands the window-level labels back onto the full timestamp axis: timestamps with no valid window get 0, and matching timestamps get their predicted label
def fill_blank(check_ts, labels, total_ts):
    def ts_generator():
        for t in total_ts:
            yield dateutil.parser.parse(t)

    def label_generator():
        for t, label in zip(check_ts, labels):
            yield dateutil.parser.parse(t), label

    g_ts = ts_generator()
    g_label = label_generator()
    final_labels = []
    try:
        current = next(g_ts)
        ts_label, label = next(g_label)
        while True:
            if current > ts_label:
                # labelled timestamp lags behind: advance the labels
                ts_label, label = next(g_label)
                continue
            elif current < ts_label:
                # no label for this timestamp: fill with 0
                final_labels.append(0)
                current = next(g_ts)
                continue
            # timestamps match: copy the predicted label
            final_labels.append(label)
            current = next(g_ts)
            ts_label, label = next(g_label)
    except StopIteration:
        return np.array(final_labels, dtype=np.int8)
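A toy example (made-up timestamps) of what fill_blank does:
demo_total = ["2020-07-01 00:00:00", "2020-07-01 00:00:01", "2020-07-01 00:00:02",
              "2020-07-01 00:00:03", "2020-07-01 00:00:04"]
demo_check = demo_total[2:]           # labels exist only for the last 3 timestamps
demo_labels = np.array([0, 1, 1])
fill_blank(demo_check, demo_labels, demo_total)
# -> array([0, 0, 0, 1, 1], dtype=int8): unlabelled timestamps are padded with 0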
- final labels
%%time
FINAL_LABELS = fill_blank(ts, LABELS, np.array(VALIDATION_DF_RAW[TIMESTAMP_FIELD]))
FINAL_LABELS.shape
ATTACK_LABELS.shape[0] == FINAL_LABELS.shape[0]
# Evaluation tool: eTaPR (enhanced Time-series aware Precision and Recall)
TaPR = etapr.evaluate(anomalies=ATTACK_LABELS, predictions=FINAL_LABELS)
print(f"F1: {TaPR['f1']:.3f} (TaP: {TaPR['TaP']:.3f}, TaR: {TaPR['TaR']:.3f})")
print(f"# of detected anomalies: {len(TaPR['Detected_Anomalies'])}")
print(f"Detected anomalies: {TaPR['Detected_Anomalies']}")
6οΈβ£ Moving Average
## MOVING AVERAGE
seq60_10mean = []
for idx in range(len(ANOMALY_SCORE)):
    if idx >= 10:
        # average the mean of the 10 preceding scores with the current score
        seq60_10mean.append((ANOMALY_SCORE[idx-10:idx].mean() + ANOMALY_SCORE[idx]) / 2)
    else:
        seq60_10mean.append(ANOMALY_SCORE[idx])
seq60_10mean = np.array(seq60_10mean)
print(seq60_10mean.shape)
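The same smoothing can be written vectorized with pandas (an equivalent sketch, not from the original post):
s = pd.Series(ANOMALY_SCORE)
prev10 = s.rolling(10).mean().shift(1)   # mean of the 10 preceding scores
smoothed = ((prev10 + s) / 2).where(s.index >= 10, s).to_numpy()
# np.allclose(smoothed, seq60_10mean) -> True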
THRESHOLD = 0.019
LABELS = put_labels(seq60_10mean, THRESHOLD)
LABELS, LABELS.shape
%%time
FINAL_LABELS = fill_blank(ts, LABELS, np.array(VALIDATION_DF_RAW[TIMESTAMP_FIELD]))
FINAL_LABELS.shape
ATTACK_LABELS.shape[0] == FINAL_LABELS.shape[0]
TaPR = etapr.evaluate(anomalies=ATTACK_LABELS, predictions=FINAL_LABELS)
print(f"F1: {TaPR['f1']:.3f} (TaP: {TaPR['TaP']:.3f}, TaR: {TaPR['TaR']:.3f})")
print(f"# of detected anomalies: {len(TaPR['Detected_Anomalies'])}")
print(f"Detected anomalies: {TaPR['Detected_Anomalies']}")
- Same procedure as above, for the test data
############### Test data Start
TEST_DF_RAW = dataframe_from_csvs(TEST_DATASET)
TEST_DF_RAW = TEST_DF_RAW[:11960]   # truncated, as with the training data, to keep runtime manageable
TEST_DF_RAW
TEST_DF = normalize(TEST_DF_RAW[VALID_COLUMNS_IN_TRAIN_DATASET]).ewm(alpha=0.9).mean()
TEST_DF
boundary_check(TEST_DF)
CHECK_TS, X_test, y_test = HaiDataset(TEST_DF_RAW[TIMESTAMP_FIELD], TEST_DF, attacks=None)
aux_test = []
for i in range(len(X_test)):
    aux_test.append(X_test[i][0])
aux_test = np.array(aux_test)
## Model Prediction
y_pred=model_60seq.predict([X_test,aux_test])
tmp = []
for i in range(len(y_test)):
    tmp.append(abs(y_test[i] - y_pred[i]))
ANOMALY_SCORE=np.mean(tmp,axis=1)
# Moving Average
seq60_10mean = []
for idx in range(len(ANOMALY_SCORE)):
    if idx >= 10:
        seq60_10mean.append((ANOMALY_SCORE[idx-10:idx].mean() + ANOMALY_SCORE[idx]) / 2)
    else:
        seq60_10mean.append(ANOMALY_SCORE[idx])
seq60_10mean = np.array(seq60_10mean)
print(seq60_10mean.shape)
### Threshold Setting
THRESHOLD=0.019
LABELS_60seq = put_labels(seq60_10mean, THRESHOLD)
LABELS_60seq, LABELS_60seq.shape
submission = pd.read_csv('/home/ubuntu/coding/220906/HAI 2.0/sample_submission.csv')
submission.index = submission['time']
submission.loc[CHECK_TS,'attack'] = LABELS_60seq
submission
submission.to_csv('SILab_Moving_Average.csv', index=False)
7οΈβ£ Gray Area Smoothing
submission = pd.read_csv('/home/ubuntu/coding/220906/HAI 2.0/sample_submission.csv')
submission.index = submission['time']
submission['attack_1']=0
submission['anomaly_score']=0
submission.loc[CHECK_TS,'anomaly_score'] = seq60_10mean
submission.loc[CHECK_TS,'attack_1'] = LABELS_60seq
LABELS_60seq=submission['attack_1']
seq60_10mean=submission['anomaly_score']
def Gray_Area(attacks):
    # Find the start index, end index, and length of every constant run of labels
    start = []   # run start points
    finish = []  # run end points
    c = []       # run lengths
    com = 0
    count = 0
    for i in range(1, len(attacks)):
        if attacks[i - 1] != attacks[i]:
            if com == 0:
                start.append(i)
                count = count + 1
                com = 1
            elif com == 1:
                finish.append(i - 1)
                c.append(count)
                count = 0
                start.append(i)
                count = count + 1
        else:
            count = count + 1
    finish.append(len(attacks) - 1)
    c.append(finish[len(finish) - 1] - start[len(start) - 1] + 1)
    # Any run shorter than 10 samples inherits the label that precedes it
    for i in range(0, len(start)):
        if c[i] < 10:
            s = start[i]
            f = finish[i] + 1
            g1 = [1 for _ in range(c[i])]  # temporary attack run
            g0 = [0 for _ in range(c[i])]  # temporary normal run
            if attacks[start[i] - 1] == 1:
                attacks[s:f] = g1  # absorb into the surrounding attack
            else:
                attacks[s:f] = g0  # absorb into the surrounding normal region
    return attacks
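A toy run on made-up labels: a 3-sample attack burst is shorter than 10, so it is absorbed into the surrounding normal region:
demo = np.array([0]*5 + [1]*3 + [0]*5)
Gray_Area(demo.copy())   # copy, since the function mutates its argument
# -> all zeros: the short burst is relabelled to match its neighbours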
gray_LABELS_60seq=Gray_Area(LABELS_60seq)
submission = pd.read_csv('/home/ubuntu/coding/220906/HAI 2.0/sample_submission.csv')
submission.index = submission['time']
submission['attack']=gray_LABELS_60seq
submission
submission.to_csv('SILab_Gray_Area_Smoothing.csv', index=False)
8οΈβ£ Attack Point and Finish Point Policy
- The same architecture is retrained with a 10-second window; at the end, the finer 10seq labels are blended with the 60seq labels around attack boundaries (see the commented loop at the end of this section)
WINDOW_GIVEN = 9
WINDOW_SIZE= 10
ts, X_train, y_train = HaiDataset(TRAIN_DF_RAW[TIMESTAMP_FIELD], TRAIN_DF, stride=1)
ts, X_valid, y_valid, attack = HaiDataset(VALIDATION_DF_RAW[TIMESTAMP_FIELD],
                                          VALIDATION_DF, attacks=VALIDATION_DF_RAW[ATTACK_FIELD])
## LSTM Model define
n_features = TRAIN_DF.shape[1]
inputs = Input(shape=(WINDOW_GIVEN, n_features))
first = Bidirectional(LSTM(100, return_sequences=True))(inputs)
second = Bidirectional(LSTM(100, return_sequences=True))(first)
third = Bidirectional(LSTM(100))(second)
lstm_out = Dense(n_features)(third)
aux_input = Input(shape=(n_features,), name='aux_input')
outputs = keras.layers.add([lstm_out, aux_input])
model_10seq = Model(inputs=[inputs, aux_input], outputs=outputs)
model_10seq.compile(loss='mean_squared_error', optimizer='Adam')
model_10seq.summary()
aux_train = []
for i in range(len(X_train)):
    aux_train.append(X_train[i][0])
aux_train = np.array(aux_train)

aux_valid = []
for i in range(len(X_valid)):
    aux_valid.append(X_valid[i][0])
aux_valid = np.array(aux_valid)
# note: es and mc are reused from above, so the checkpoint still writes to the 60seq filename
hist = model_10seq.fit([X_train, aux_train], y_train, batch_size=512, epochs=32, shuffle=True,
                       callbacks=[es, mc], validation_data=([X_valid, aux_valid], y_valid))
model_10seq.save('best_LSTM_model_10seq.h5') #keras h5
## model load
model_10seq = load_model('best_LSTM_model_10seq.h5')
CHECK_TS, X_test, y_test = HaiDataset(TEST_DF_RAW[TIMESTAMP_FIELD], TEST_DF, attacks=None)
aux_test = []
for i in range(len(X_test)):
    aux_test.append(X_test[i][0])
aux_test = np.array(aux_test)
## Model Prediction
y_pred=model_10seq.predict([X_test,aux_test])
tmp = []
for i in range(len(y_test)):
    tmp.append(abs(y_test[i] - y_pred[i]))
ANOMALY_SCORE=np.mean(tmp,axis=1)
# Moving Average
seq10_10mean = []
for idx in range(len(ANOMALY_SCORE)):
    if idx >= 10:
        seq10_10mean.append((ANOMALY_SCORE[idx-10:idx].mean() + ANOMALY_SCORE[idx]) / 2)
    else:
        seq10_10mean.append(ANOMALY_SCORE[idx])
seq10_10mean = np.array(seq10_10mean)
print(seq10_10mean.shape)
### Threshold Setting
THRESHOLD=0.008
LABELS_10seq = put_labels(seq10_10mean, THRESHOLD)
LABELS_10seq, LABELS_10seq.shape
submission = pd.read_csv('./HAI 2.0/sample_submission.csv')
submission.index = submission['time']
submission['attack_1']=0
submission.loc[CHECK_TS,'attack_1'] = LABELS_10seq
LABELS_10seq=submission['attack_1']
gray_LABELS_10seq=Gray_Area(LABELS_10seq)
# Maintain a 60-sample history of the 60seq labels (59 zeros + the current label)
Queue = [0 for i in range(59)]
Label = []
for i in range(len(gray_LABELS_60seq)):
    Queue.append(gray_LABELS_60seq[i])
    N = Queue.count(1)
    if N >= 60:
        # deep inside a long attack: if the recent anomaly score spiked,
        # prefer the finer-grained 10seq label to pin down the attack boundary
        if seq60_10mean[i-100:i].max() > 0.1:
            Label.append(gray_LABELS_10seq[i])
        else:
            Label.append(gray_LABELS_60seq[i])
    elif N >= 1:
        Label.append(gray_LABELS_60seq[i])
    else:
        # (this branch is identical to the one above in the original code)
        Label.append(gray_LABELS_60seq[i])
    del Queue[0]
Label = np.array(Label)
submission = pd.read_csv('/home/ubuntu/coding/220906/HAI 2.0/sample_submission.csv')
submission.index = submission['time']
submission['attack']=Label
submission