๐Ÿ˜Ž ๊ณต๋ถ€ํ•˜๋Š” ์ง•์ง•์•ŒํŒŒ์นด๋Š” ์ฒ˜์Œ์ด์ง€?

[DACON] HAICon2020 ์‚ฐ์—…์ œ์–ด์‹œ์Šคํ…œ ๋ณด์•ˆ์œ„ํ˜‘ ํƒ์ง€ AI & ๋น„์ง€๋„ ๊ธฐ๋ฐ˜ Autoencoder ๋ณธ๋ฌธ

๐Ÿ‘ฉ‍๐Ÿ’ป ์ธ๊ณต์ง€๋Šฅ (ML & DL)/Serial Data

[DACON] HAICon2020 Industrial Control System Security Threat Detection AI & Unsupervised Autoencoder

์ง•์ง•์•ŒํŒŒ์นด 2022. 9. 8. 13:14

Written 220908

<๋ณธ ๋ธ”๋กœ๊ทธ๋Š” dacon ๋Œ€ํšŒ์—์„œ์˜  ๋ฐ์ดํฌ๋ฃจ 2๊ธฐ Team Zoo ํŒ€ ์ฝ”๋“œ์™€ dacon์˜ HAI 2.0 Baseline ๊ธ€์„ ์ฐธ๊ณ ํ•ด์„œ ๊ณต๋ถ€ํ•˜๋ฉฐ ์ž‘์„ฑํ•˜์˜€์Šต๋‹ˆ๋‹ค :-) >

https://dacon.io/codeshare/5141?dtype=recent

[Team Zoo] Special Part 4. Anomaly Detection Based on Unsupervised Learning (feat. time series)

https://dacon.io/competitions/official/235624/codeshare/1570?page=1&dtype=recent

HAI 2.0 Baseline: HAICon2020 Industrial Control System Security Threat Detection AI Competition

 

 

 

1๏ธโƒฃ Libraries & Data Load

!pip install /path/to/eTaPR-1.12-py3-none-any.whl

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

from tqdm.notebook import trange
from TaPR_pkg import etapr
from pathlib import Path
import time

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import load_model
from tensorflow.keras.callbacks import EarlyStopping

 

 

2๏ธโƒฃ Data Preprocessing

1) Data load

TRAIN_DATASET = sorted([x for x in Path("/home/ubuntu/coding/220906/HAI 2.0/training").glob("*.csv")])
TRAIN_DATASET

TEST_DATASET  = sorted([x for x in Path("/home/ubuntu/coding/220906/HAI 2.0/testing").glob("*.csv")])
TEST_DATASET

VALIDATION_DATASET  = sorted([x for x in Path("/home/ubuntu/coding/220906/HAI 2.0/validation").glob("*.csv")])
VALIDATION_DATASET

def dataframe_from_csv(target):
    return pd.read_csv(target, engine='python').rename(columns=lambda x: x.strip())

def dataframe_from_csvs(targets):
    return pd.concat([dataframe_from_csv(x) for x in targets])
TRAIN_DF_RAW = dataframe_from_csvs(TRAIN_DATASET)
TRAIN_DF_RAW = TRAIN_DF_RAW[:30720]
TRAIN_DF_RAW

๋ฐ์ดํ„ฐ ์–‘ ๋„ˆ๋ฌด ์ปค์„œ ์ค„์ž„ ใ…œใ…œ

2) Variables setting

  • ์ „์ฒด ๋ฐ์ดํ„ฐ๋ฅผ ๋Œ€์ƒ์œผ๋กœ ์ด์ƒ์„ ํƒ์ง€ํ•˜๋ฏ€๋กœ "attack" ํ•„๋“œ๋งŒ ์‚ฌ์šฉ
  • VALID_COLUMNS_IN_TRAIN_DATASET์€ ํ•™์Šต ๋ฐ์ดํ„ฐ์…‹์— ์žˆ๋Š” ๋ชจ๋“  ์„ผ์„œ/์•ก์ถ”์—์ดํ„ฐ ํ•„๋“œ
  • ํ•™์Šต ์‹œ ๋ณด์ง€ ๋ชปํ–ˆ๋˜ ํ•„๋“œ์— ๋Œ€ํ•ด์„œ ํ…Œ์ŠคํŠธ๋ฅผ ํ•  ์ˆ˜ ์—†์œผ๋ฏ€๋กœ ํ•™์Šต ๋ฐ์ดํ„ฐ์…‹์„ ๊ธฐ์ค€์œผ๋กœ ํ•„๋“œ ์ด๋ฆ„
TIMESTAMP_FIELD = "time"
IDSTAMP_FIELD = 'id'
ATTACK_FIELD = "attack"
VALID_COLUMNS_IN_TRAIN_DATASET = TRAIN_DF_RAW.columns.drop([TIMESTAMP_FIELD])
VALID_COLUMNS_IN_TRAIN_DATASET

 

3) Data Normalization

  • Min-Max Normalization
    • (X - MIN) / (MAX - MIN)
    • For every feature, the minimum value maps to 0, the maximum to 1, and all other values to something in between
TAG_MIN = TRAIN_DF_RAW[VALID_COLUMNS_IN_TRAIN_DATASET].min()
TAG_MAX = TRAIN_DF_RAW[VALID_COLUMNS_IN_TRAIN_DATASET].max()
def normalize(df):
    ndf = df.copy()
    for c in df.columns:
        if TAG_MIN[c] == TAG_MAX[c]:
            # constant column: shift to 0 instead of dividing by zero
            ndf[c] = df[c] - TAG_MIN[c]
        else:
            ndf[c] = (df[c] - TAG_MIN[c]) / (TAG_MAX[c] - TAG_MIN[c])
    return ndf
TRAIN_DF = normalize(TRAIN_DF_RAW[VALID_COLUMNS_IN_TRAIN_DATASET])

 

  • Pandas Dataframe์— ์žˆ๋Š” ๊ฐ’ ์ค‘ 1 ์ดˆ๊ณผ์˜ ๊ฐ’์ด ์žˆ๋Š”์ง€, 0 ๋ฏธ๋งŒ์˜ ๊ฐ’์ด ์žˆ๋Š”์ง€, NaN์ด ์žˆ๋Š”์ง€ ์ ๊ฒ€
    • np.any( ) : ๋ฐฐ์—ด์˜ ๋ฐ์ดํ„ฐ ์ค‘ ์กฐ๊ฑด๊ณผ ๋งž๋Š” ๋ฐ์ดํ„ฐ๊ฐ€ ์žˆ์œผ๋ฉด True, ์ „ํ˜€ ์—†์œผ๋ฉด False
def boundary_check(df):
    x = np.array(df, dtype=np.float32)
    print(x)
    return np.any(x > 1.0), np.any(x < 0), np.any(np.isnan(x))
boundary_check(TRAIN_DF)

 

 

3๏ธโƒฃ Model

  • Autoencoder
    • Mainly used for generating or reconstructing images
    • After training the model on normal images, feeding in an abnormal image and decoding it gives the reconstruction error, the difference between the characteristics of the normal image and the decoded image
    • Regions with low reconstruction error are judged normal, and regions with high reconstruction error are judged abnormal (see the sketch below)

 

  • Autoencoder์˜ ๋ ˆ์ด์–ด๋ฅผ LSTM์œผ๋กœ ๊ตฌ์„ฑํ•˜์—ฌ ์‹œํ€ธ์Šค ํ•™์Šต์ด ๊ฐ€๋Šฅ
    • !D-Convolution layer๋ฅผ ์ ์šฉํ•˜์—ฌ timestamp์™€ feature ์ •๋ณด๋ฅผ ์„ธ๋ฐ€ํ•˜๊ฒŒ ์ด๋™ํ•˜๋ฉด์„œ ํ•™์Šต

 

  • Encoder-Decoder LSTM (= seq2seq)
    • The input is sequential data and the output is sequential data too
    • (Problem) The input and output sequences may have different lengths
    • (Solution) Encoding: convert inputs of various lengths into a fixed-length vector
      • An Encoder-Decoder LSTM model can take time-series inputs of various lengths and produce time-series outputs of various lengths
      • The LSTM Autoencoder compresses the variable-length time-series input into a fixed-length vector and hands it to the Decoder as input
      • There is a step that converts the input data into an encoded feature vector (see the sketch below)

# Builds sliding windows of length `timesteps` over X, plus the label that
# follows each window. (Defined for reference; below, the data is simply
# reshaped to a single timestep per sample instead.)
def temporalize(X, y, timesteps):
    output_X = []
    output_y = []
    for i in range(len(X) - timesteps - 1):
        t = []
        for j in range(1, timesteps + 1):
            t.append(X[[(i + j + 1)], :])
        output_X.append(t)
        output_y.append(y[i + timesteps + 1])
    return np.squeeze(np.array(output_X)), np.array(output_y)
train = np.array(TRAIN_DF)
x_train = train.reshape(train.shape[0], 1, train.shape[1])  # (samples, 1 timestep, features)
x_train.shape

 

 

 

โ–ถ Model details

  • Conv1D
    • filters : the number of output channels produced by the convolution
    • kernel_size : how many timestamps to look at at once (= window_size)
    • padding : how much padding to apply in each direction
    • dilation : the spacing at which the kernel taps are applied within the kernel
    • stride : default = 1, the step size the convolution layer moves by
  • LSTM
    • units : the output dimension is the only thing to set
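
A quick shape check of these parameters (the batch and feature sizes are illustrative assumptions):

probe = np.zeros((2, 1, 79), dtype=np.float32)  # (batch, timesteps, features)

conv = layers.Conv1D(filters=512, kernel_size=64, padding='same')
print(conv(probe).shape)  # (2, 1, 512): 'same' padding keeps the timestep axis, filters set the channels

lstm = layers.LSTM(units=64, return_sequences=False)
print(lstm(probe).shape)  # (2, 64): units sets the output dimension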

 

  • ๋ชจ๋ธ์˜ ๊ตฌ์กฐ
    • Conv1D - Dense์ธต - LSTM - Dense์ธต์œผ๋กœ encoder ์™€ decoder๊ฐ€ ๋Œ€์นญ์ด ๋˜๋„๋ก ์„ค๊ณ„
    • ํŒŒ๋ผ๋ฏธํ„ฐ๋Š” ์ฃผ๋กœ filters, kernel_size, Dense, LSTM์˜ units ๊ฐ’์„ ์กฐ์ ˆ
    • Conv1D ๋ ˆ์ด์–ด๋ฅผ ์ถ”๊ฐ€ํ•˜๊ฑฐ๋‚˜ maxpooling๊ณผ ๊ฐ™์ด ๊ธฐ์กด์˜ CNN ๋ชจ๋ธ๊ณผ ๋™์ผํ•œ ๋ฐฉ์‹ ์ ์šฉ ๊ฐ€๋Šฅ
def conv_auto_model(x):
    n_steps = x.shape[1]
    n_features = x.shape[2]

    keras.backend.clear_session()

    model = keras.Sequential(
        [
            layers.Input(shape=(n_steps, n_features)),
            # --- encoder ---
            layers.Conv1D(filters=512, kernel_size=64, padding='same', data_format='channels_last',
                          dilation_rate=1, activation="linear"),
            layers.Dense(128),
            layers.LSTM(
                units=64, activation="relu", name="lstm_1", return_sequences=False
            ),  # compress the sequence into a fixed-length vector
            layers.Dense(64),
            layers.RepeatVector(n_steps),  # repeat that vector for every decoder step
            # --- decoder (mirrors the encoder) ---
            layers.Dense(64),
            layers.LSTM(
                units=64, activation="relu", name="lstm_2", return_sequences=True
            ),
            layers.Dense(128),
            layers.Conv1D(filters=512, kernel_size=64, padding='same', data_format='channels_last',
                          dilation_rate=1, activation="linear"),
            layers.TimeDistributed(layers.Dense(x.shape[2], activation='linear'))
        ]
    )
    return model
model1 = conv_auto_model(x_train)
model1.compile(optimizer='adam', loss='mse')
model1.summary()

 

4๏ธโƒฃ Model fit

  • epoch์„ 3์œผ๋กœ ํ•˜๊ณ , earlystopping์„ ์‚ฌ์šฉ
early_stopping = EarlyStopping(monitor='val_loss', patience=5)

epochs = 3
batch = 64

# fit
history = model1.fit(x_train, x_train,
                     epochs=epochs, batch_size=batch,
                     validation_split=0.2, callbacks=[early_stopping]).history

model1.save('model1.h5')

plt.plot(history['loss'], label='train loss')
plt.plot(history['val_loss'], label='valid loss')
plt.legend()
plt.xlabel('Epoch'); plt.ylabel('loss')
plt.show()

 

  • Save the model and load it back
model1.save('best_AutoEncoder_model1.h5') #keras h5
model = load_model('best_AutoEncoder_model1.h5')

 

 

 

5๏ธโƒฃ Anomaly Detection

  • ํ•™์Šต๋œ ๋ชจ๋ธ์„ ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ์…‹์— ์ ์šฉ
VALIDATION_DF_RAW = dataframe_from_csvs(VALIDATION_DATASET)
VALIDATION_DF_RAW.to_csv('VALIDATION_DF_RAW.csv')
VALIDATION_DF_RAW

 

  • ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ ์…‹์—์„œ๋Š” ์ •์ƒ ๋ฐ์ดํ„ฐ๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์ •๊ทœํ™”๋ฅผ ์ง„ํ–‰
VALIDATION_DF = normalize(VALIDATION_DF_RAW[VALID_COLUMNS_IN_TRAIN_DATASET])
boundary_check(VALIDATION_DF)

  • ์‹œ๊ฐํ™”๋กœ ์ผ์ • ๊ตฌ๊ฐ„์—์„œ 0๊ณผ 1 ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚˜๋Š” ๊ฒƒ์„ ํ™•์ธ
VALIDATION_DF.plot()

VALIDATION_DF['C75'].plot()

 

6๏ธโƒฃ Data cleaning

  • validation set์—์„œ ์กฐ๊ธˆ ๋” ์ •๊ตํ•˜๊ฒŒ threshold ์กฐ์ ˆ ๋ฐ ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•˜๊ธฐ ์œ„ํ•ด์„œ ํ•ด๋‹น ๋ณ€์ˆ˜์˜ ๊ฐ’์„ ์ •์ƒ ๋ฒ”์œ„์— ๋งž๊ฒŒ ์ž„์˜๋กœ ์กฐ์ ˆ
# valid ๊ทธ๋ž˜ํ”„๋ฅผ ๋ณด๊ณ  ์•ž๋ถ€๋ถ„ ์ •์ƒ์ธ๋ฐ ๊ฐ’์ด ํŠ€๋Š” ๋ณ€์ˆ˜๊ฐ€ ์žˆ์–ด์„œ ์กฐ์ ˆ
VALIDATION_DF['C75'][:2110] = 0.95
val = np.array(VALIDATION_DF)
x_val = val.reshape(val.shape[0], 1, val.shape[1])
x_val.shape

 

  • ๋ชจ๋ธ์˜ ๊ฒฐ๊ณผ๊ฐ€ 3์ฐจ์›์˜ ํ˜•ํƒœ์ด๊ธฐ ๋•Œ๋ฌธ์— ๋ณต์›๋œ ๊ฒฐ๊ณผ์™€์˜ ์ฐจ์ด๋ฅผ ํ™•์ธํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” 2์ฐจ์›์œผ๋กœ ๋‹ค์‹œ ๋ฐ”๊พธ๊ธฐ
def flatten(X):
    # (samples, timesteps, features) -> (samples, features),
    # keeping only the last timestep of each window
    flattened_X = np.empty((X.shape[0], X.shape[2]))  # sample x features array.
    for i in range(X.shape[0]):
        flattened_X[i] = X[i, (X.shape[1]-1), :]
    return(flattened_X)

def scale(X, scaler):
    # apply an already-fitted scaler to every window (defined for reference, not used below)
    for i in range(X.shape[0]):
        X[i, :, :] = scaler.transform(X[i, :, :])

    return X
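
For example (a hypothetical dummy array), flatten keeps only the last timestep of each window:

dummy = np.arange(24, dtype=np.float32).reshape(4, 2, 3)  # (samples, timesteps, features)
print(flatten(dummy).shape)  # (4, 3)
print(flatten(dummy)[0])     # last timestep of the first window: [3. 4. 5.]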

 

  • ๋ชจ๋ธ์˜ ์˜ํ•ด ์žฌ๊ตฌ์„ฑ๋œ ๊ฐ’์„ ์‹ค์ œ ๊ฐ’๊ณผ ์ฐจ์ด๋ฅผ ๊ตฌํ•ด์„œ ์žฌ๊ตฌ์„ฑ ์†์‹ค(reconstruction error) ๊ฐ’์„ ๊ตฌํ•˜๊ธฐ
    •  
      ์ •์ƒ์ธ ๊ฒฝ์šฐ ๋ชจ๋ธ์ด ์ž˜ ํ•™์Šต๋˜์–ด ๋ณต์›์ด ์ž˜ ๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์— reconstruction error ๊ฐ’์ด ์ž‘๊ฒŒ ๋‚˜์˜ฌ ๊ฒƒ
    • ๊ณต๊ฒฉ์ธ ๊ฒฝ์šฐ ์ •๊ทœํ™”๋œ ๊ฐ’์—์„œ 0๊ณผ 1์„ ๋ฒ—์–ด๋‚˜๊ธฐ ๋•Œ๋ฌธ์— reconstruction error ๊ฐ’์ด ํฌ๊ฒŒ ๋‚˜์˜ฌ ๊ฒƒ
start = time.time()
valid_x_predictions = model.predict(x_val)
print(valid_x_predictions.shape)

error = flatten(x_val) - flatten(valid_x_predictions)
print(error.shape)

valid_mse = np.mean(np.power(error, 2), axis=1)
print(valid_mse.shape)
print(time.time()-start)

7๏ธโƒฃ Precision Recall Curve

  • threshold์˜ ๊ฒฝ์šฐ Recall๊ณผ Precision์˜ ๊ฐ’์ด ๊ต์ฐจ๋˜๋Š” ์ง€์ ์„ ๊ธฐ์ค€
error_df = pd.DataFrame({'Reconstruction_error': valid_mse, 'True_class':list(VALIDATION_DF_RAW['attack'])})
precision_rt, recall_rt, threshold_rt = metrics.precision_recall_curve(error_df['True_class'], error_df['Reconstruction_error'])

plt.figure(figsize=(8,5))
plt.plot(threshold_rt, precision_rt[1:], label='Precision')
plt.plot(threshold_rt, recall_rt[1:], label='Recall')
plt.xlabel('Threshold'); plt.ylabel('Precision/Recall')
plt.legend()
#plt.show()

 

# take the crossing point of precision and recall (argmin of the gap is robust
# even when the two are never exactly equal)
index_cnt = int(np.argmin(np.abs(precision_rt[:-1] - recall_rt[:-1])))
print('precision: ', precision_rt[index_cnt], ', recall: ', recall_rt[index_cnt])

# fixed Threshold
threshold_fixed = threshold_rt[index_cnt]
print('threshold: ',threshold_fixed)

 

 

8๏ธโƒฃ Predict Validation Data set

  • We confirm that the reconstruction error comes out small on normal data, and clearly high on the abnormal regions
error_df = pd.DataFrame({'Reconstruction_error': valid_mse ,
                         'True_class': list(VALIDATION_DF_RAW['attack'])})
groups = error_df.groupby('True_class')
fig, ax = plt.subplots(figsize=(20,20))

for name, group in groups:
    ax.plot(group.index, group.Reconstruction_error, marker='o', ms=3.5, linestyle='',
            label= "Break" if name == 1 else "Normal")
    
ax.hlines(threshold_fixed, ax.get_xlim()[0], ax.get_xlim()[1], colors="r", zorder=100, label='Threshold')
ax.legend()

plt.title("Reconstruction error for different classes")
plt.ylabel("Reconstruction error")
plt.xlabel("Data point index")

 

 

9๏ธโƒฃ ์ด๋™ํ‰๊ท 

  • ์ด๋™ํ‰๊ท  ๊ฐ’์„ ํ†ตํ•ด ์ •์ƒ์ธ ๊ตฌ๊ฐ„์€ ํ‰๊ท ์ ์œผ๋กœ ๋” ๋‚ฎ๊ฒŒ ํ•˜๊ณ , ๋น„์ •์ƒ์ธ ๊ตฌ๊ฐ„์€ ํ‰๊ท ์ ์œผ๋กœ ๋” ๋†’์€ ๊ฐ’์„ ๋‚˜ํƒ€๋‚ด๋„๋ก ํ•จ
error_df

 

#์ด๋™ํ‰๊ท 
mean_window = error_df['Reconstruction_error'].rolling(50).mean()
window_error = mean_window.fillna(0)
window_error

 

  • ํ™•์‹คํ•˜๊ฒŒ ๊ณต๊ฒฉ์ธ ๊ตฌ๊ฐ„ ์žก๊ธฐ
window_error_df = pd.DataFrame({'Reconstruction_error': window_error ,
                         'True_class': list(VALIDATION_DF_RAW['attack'])})
groups = window_error_df.groupby('True_class')
fig, ax = plt.subplots(figsize=(20,20))

for name, group in groups:
    ax.plot(group.index, group.Reconstruction_error, marker='o', ms=3.5, linestyle='',
            label= "Break" if name == 1 else "Normal")
    
ax.hlines(threshold_fixed, ax.get_xlim()[0], ax.get_xlim()[1], colors="r", zorder=100, label='Threshold')
ax.legend()

plt.title("Reconstruction error for different classes")
plt.ylabel("Reconstruction error")
plt.xlabel("Data point index")

  • threshold ๊ฐ’์œผ๋กœ validation set ์˜ ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธ
pred_y = [1 if e > threshold_fixed else 0 for e in window_error_df['Reconstruction_error'].values]
pred_y = np.array(pred_y)
pred_y.shape

 

๐Ÿ”Ÿ ํ‰๊ฐ€

ATTACK_LABELS = np.array(VALIDATION_DF_RAW[ATTACK_FIELD])
FINAL_LABELS = np.array(pred_y)

ATTACK_LABELS.shape[0] == FINAL_LABELS.shape[0]

TaPR = etapr.evaluate(anomalies=ATTACK_LABELS, predictions=FINAL_LABELS)

print(f"F1: {TaPR['f1']:.3f} (TaP: {TaPR['TaP']:.3f}, TaR: {TaPR['TaR']:.3f})")
print(f"# of detected anomalies: {len(TaPR['Detected_Anomalies'])}")
print(f"Detected anomalies: {TaPR['Detected_Anomalies']}")
  • ๋น„์ •์ƒ ๊ตฌ๊ฐ„์—์„œ ํŠน์ดํ–ˆ๋˜ ๋ณ€์ˆ˜์˜ ๋ฐ์ดํ„ฐ ๊ฐ’์„ ์กฐ์ ˆํ•˜๊ณ  ์ด๋™ ํ‰๊ท  ๊ฐ’์„ ํ™œ์šฉํ•˜์˜€์„ ๋•Œ
  • validation set์—์„œ TaPR ์ ์ˆ˜๊ฐ€ 99.8์ด๋ผ๋Š” ๋†’์€ ์ ์ˆ˜๊ฐ€ ๋‚˜์˜ด

 

 

1๏ธโƒฃ1๏ธโƒฃ Predict Test Data set

  • ์œ„์™€ ์ „์ฒ˜๋ฆฌ, ์ •๊ทœํ™”, ๋ชจ๋ธ ๋™์ผ
TEST_DF_RAW = dataframe_from_csvs(TEST_DATASET)
TEST_DF_RAW

TEST_DF = normalize(TEST_DF_RAW[VALID_COLUMNS_IN_TRAIN_DATASET]).ewm(alpha=0.9).mean()  # normalization plus exponential smoothing
TEST_DF

TEST_DF.plot()

boundary_check(TEST_DF)

 

test = np.array(TEST_DF)
x_test = test.reshape(test.shape[0], 1, test.shape[1])
x_test.shape

start = time.time()
test_x_predictions = model.predict(x_test)

print(test_x_predictions.shape)

test_mse = np.mean(np.power(flatten(x_test) - flatten(test_x_predictions), 2), axis=1)

print(test_mse.shape)
print(time.time()-start)

test_error = pd.DataFrame({'Reconstruction_error': test_mse})

  • Since the labels of the test set cannot be known, the moving-average window and the threshold were changed little by little, submitting and then adjusting based on the results

# moving average of the test reconstruction error (the window size of 50 is
# assumed to match the validation step above; `test_d` is needed below)
test_d = test_error['Reconstruction_error'].rolling(50).mean().fillna(0)

movemean_test = pd.DataFrame({'Reconstruction_error': test_d})
pred_y_test = [1 if e > 0.000425 else 0 for e in movemean_test['Reconstruction_error'].values]
pred_y_test = np.array(pred_y_test)
pred_y_test.shape

submission = pd.read_csv('HAI 2.0/sample_submission.csv')
submission.index = submission['time']
submission['attack'] = pred_y_test
submission['attack'].value_counts()

test_error_df = pd.DataFrame({'Reconstruction_error': test_d,
                         'True_class': list(submission['attack'])})
groups = test_error_df.groupby('True_class')
fig, ax = plt.subplots(figsize=(20,20))

for name, group in groups:
    ax.plot(group.index, group.Reconstruction_error, marker='o', ms=3.5, linestyle='',
            label= "Break" if name == 1 else "Normal")
    
ax.hlines(0.000425, ax.get_xlim()[0], ax.get_xlim()[1], colors="r", zorder=100, label='Threshold')
ax.legend()

plt.title("Reconstruction error for different classes")
plt.ylabel("Reconstruction error")
plt.xlabel("Data point index")
