

[DACON] HAICon2020 Industrial Control System Security Threat Detection AI & LSTM

μ§•μ§•μ•ŒνŒŒμΉ΄ 2022. 9. 7. 09:08

Written on 220907

<This post was written while studying the SIlab team's code from the DACON competition and DACON's HAI 2.0 Baseline post :-) >

https://dacon.io/competitions/official/235624/codeshare/1570?page=1&dtype=recent 

 

HAI 2.0 Baseline: HAICon2020 Industrial Control System Security Threat Detection AI Competition (dacon.io)

https://dacon.io/competitions/official/235624/codeshare/1831

 

[2nd place] SIlab: HAICon2020 Industrial Control System Security Threat Detection AI Competition (dacon.io)

 

 

😎 1. Competition Overview

  • 졜근 κ΅­κ°€κΈ°λ°˜μ‹œμ„€ 및 μ‚°μ—…μ‹œμ„€μ˜ μ œμ–΄μ‹œμŠ€ν…œμ— λŒ€ν•œ 사이버 λ³΄μ•ˆμœ„ν˜‘μ΄ μ§€μ†μ μœΌλ‘œ 증가
  • ν˜„μž₯ μ œμ–΄μ‹œμŠ€ν…œμ˜ νŠΉμ„±μ„ μ •ν™•ν•˜κ²Œ λ°˜μ˜ν•˜κ³ , λ‹€μ–‘ν•œ μ œμ–΄μ‹œμŠ€ν…œ 사이버곡격 μœ ν˜•μ„ ν¬ν•¨ν•˜λŠ” 데이터셋은 AI기반 λ³΄μ•ˆκΈ°μˆ  연ꡬλ₯Ό μœ„ν•œ ν•„μˆ˜μ μΈ μš”μ†Œ
  • μ‚°μ—…μ œμ–΄μ‹œμŠ€ν…œ λ³΄μ•ˆ 데이터셋 HAI 1.0 (μ œμ–΄μ‹œμŠ€ν…œ ν…ŒμŠ€νŠΈλ² λ“œ) 을 ν™œμš©ν•˜μ—¬ 정상 μƒν™©μ˜ λ°μ΄ν„°λ§Œμ„ ν•™μŠ΅ν•˜μ—¬ 곡격 및 비정상 상황을 탐지할 수 μžˆλŠ” μ΅œμ‹ μ˜ λ¨Έμ‹ λŸ¬λ‹, λ”₯λŸ¬λ‹ λͺ¨λΈμ„ κ°œλ°œν•˜κ³  μ„±λŠ₯을 κ²½μŸν•˜λŠ” λŒ€νšŒ

 

 

 

😎 2. Dataset Composition

  •  ν•™μŠ΅ 데이터셋 (3개)
    • 파일λͺ… : 'train1.csv', 'train2.csv', 'train3.csv'
    • μ„€λͺ… : 정상적인 운영 μƒν™©μ—μ„œ μˆ˜μ§‘λœ 데이터(각 νŒŒμΌλ³„λ‘œ μ‹œκ°„ 연속성을 가짐)
    • Column1 ('time') : κ΄€μΈ‘ μ‹œκ°
    • Column2, 3, …, 80 ('C01', 'C02', …, 'C79') : μƒνƒœ κ΄€μΈ‘ 데이터
  • 검증 데이터셋 (1개)
    • 파일λͺ… : 'validation.csv'
    • 5가지 곡격 μƒν™©μ—μ„œ μˆ˜μ§‘λœ 데이터(μ‹œκ°„ 연속성을 가짐)
    • Column1 ('time') : κ΄€μΈ‘ μ‹œκ°
    • Column2, 3, …, 80 ('C01', 'C02', …, 'C79') : μƒνƒœ κ΄€μΈ‘ 데이
    • Column81 : 곡격 라벨 (정상:'0', 곡격:'1')
  • ν…ŒμŠ€νŠΈ 데이터셋 (4개)
    • 파일λͺ… : 'test1.csv', 'test2.csv', 'test3.csv', 'test4.csv'
    • Column1 ('time') : κ΄€μΈ‘ μ‹œκ°
    • Column2, 3, …, 80 ('C01', 'C02', …, 'C79') : μƒνƒœ κ΄€μΈ‘ 데이터
  • HAI 2.0
    • 파일λͺ… : eTaPR-1.12-py3-none-any.whl

 

 

 

😎 3. Code Implementation

 

1️⃣ Data Loading

  • !pip install /파일경둜/eTaPR-1.12-py3-none-any.whl
!pip install /home/ubuntu/coding/220906/eTaPR-1.12-py3-none-any.whl

 

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import tensorflow as tf
import keras
from keras.models import Model, load_model
from keras.layers import Input, Bidirectional, LSTM, Dense
from keras.callbacks import EarlyStopping, ModelCheckpoint

from pathlib import Path
from datetime import timedelta
import dateutil.parser            # dateutil.parser.parse is used to validate the sliding windows
from tqdm.notebook import trange  # progress bar for the window-indexing loop
from TaPR_pkg import etapr        # eTaPR evaluation metric from the wheel installed above

 

  • Data set
TRAIN_DATASET = sorted([x for x in Path("/home/ubuntu/coding/220906/HAI 2.0/training").glob("*.csv")])
TRAIN_DATASET

 

TEST_DATASET = sorted([x for x in Path("/home/ubuntu/coding/220906/HAI 2.0/testing").glob("*.csv")])
TEST_DATASET

 

VALIDATION_DATASET = sorted([x for x in Path("/home/ubuntu/coding/220906/HAI 2.0/validation").glob("*.csv")])
VALIDATION_DATASET

 

def dataframe_from_csv(target):
    return pd.read_csv(target).rename(columns=lambda x: x.strip())

def dataframe_from_csvs(targets):
    return pd.concat([dataframe_from_csv(x) for x in targets])

 

  • Data load
    • Train dataset load
      • μ„€λͺ… : 정상적인 운영 μƒν™©μ—μ„œ μˆ˜μ§‘λœ 데이터(각 νŒŒμΌλ³„λ‘œ μ‹œκ°„ 연속성을 가짐)
      • Column1 ('time') : κ΄€μΈ‘ μ‹œκ°
      • Column2, 3, …, 80 ('C01', 'C02', …, 'C79') : μƒνƒœ κ΄€μΈ‘ 데이터
TRAIN_DF_RAW = dataframe_from_csvs(TRAIN_DATASET)
TRAIN_DF_RAW


# Later I cut the data down to 50,000 rows because the full set was too large.
# TRAIN_DF_RAW = dataframe_from_csvs(TRAIN_DATASET)
# TRAIN_DF_RAW = TRAIN_DF_RAW[40000:90000]
# TRAIN_DF_RAW

  • ν•™μŠ΅ 데이터셋은 곡격을 받지 μ•Šμ€ ν‰μƒμ‹œ 데이터이고 μ‹œκ°„μ„ λ‚˜νƒ€λ‚΄λŠ” ν•„λ“œμΈ time
  • λ‚˜λ¨Έμ§€ ν•„λ“œλŠ” λͺ¨λ‘ λΉ„μ‹λ³„ν™”λœ μ„Όμ„œ/μ•‘μΆ”μ—μ΄ν„°μ˜ κ°’
  • 전체 데이터λ₯Ό λŒ€μƒμœΌλ‘œ 이상을 νƒμ§€ν•˜λ―€λ‘œ "attack" ν•„λ“œλ§Œ μ‚¬μš©

 

2️⃣ Data Preprocessing

  •  ν•™μŠ΅ 데이터셋에 μžˆλŠ” λͺ¨λ“  μ„Όμ„œ/앑좔에이터 ν•„λ“œλ₯Ό λ‹΄κ³  있음
    • ν•™μŠ΅ 데이터셋에 μ‘΄μž¬ν•˜μ§€ μ•ŠλŠ” ν•„λ“œκ°€ ν…ŒμŠ€νŠΈ 데이터셋에 μ‘΄μž¬ν•˜λŠ” κ²½μš°κ°€ 있음
    • ν•™μŠ΅ μ‹œ 보지 λͺ»ν–ˆλ˜ ν•„λ“œμ— λŒ€ν•΄μ„œ ν…ŒμŠ€νŠΈλ₯Ό ν•  수 μ—†μœΌλ―€λ‘œ ν•™μŠ΅ 데이터셋을 κΈ°μ€€μœΌλ‘œ ν•„λ“œ 이름
  • ν•™μŠ΅ λ°μ΄ν„°μ…‹μ—μ„œ μ΅œμ†Ÿκ°’ μ΅œλŒ“κ°’μ„ 얻은 κ²°κ³Ό
TAG_MIN = TRAIN_DF_RAW[VALID_COLUMNS_IN_TRAIN_DATASET].min()
TAG_MAX = TRAIN_DF_RAW[VALID_COLUMNS_IN_TRAIN_DATASET].max()
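The min/max lines above (and the preprocessing below) rely on a few constants that are never shown in this post: TIMESTAMP_FIELD, ATTACK_FIELD, and VALID_COLUMNS_IN_TRAIN_DATASET. A minimal sketch of what they presumably look like, following the HAI 2.0 baseline's naming (my assumption, not the author's exact code):

# Assumed definitions, not shown in the original post.
TIMESTAMP_FIELD = "time"    # observation timestamp column
ATTACK_FIELD = "attack"     # attack label column in validation.csv (0: normal, 1: attack)
# keep only the sensor/actuator columns that actually exist in the training set
VALID_COLUMNS_IN_TRAIN_DATASET = TRAIN_DF_RAW.columns.drop([TIMESTAMP_FIELD])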

 

  • Min-Max Normalization (μ΅œμ†Œ-μ΅œλŒ€ μ •κ·œν™”)
    • (X - MIN) / (MAX-MIN)
    • λͺ¨λ“  feature에 λŒ€ν•΄ 각각의 μ΅œμ†Œκ°’ 0, μ΅œλŒ€κ°’ 1둜, 그리고 λ‹€λ₯Έ 값듀은 0κ³Ό 1 μ‚¬μ΄μ˜ κ°’μœΌλ‘œ λ³€ν™˜
    • 값이 μ „ν˜€ λ³€ν•˜μ§€ μ•ŠλŠ” ν•„λ“œ 경우 μ΅œμ†Ÿκ°’κ³Ό μ΅œλŒ“κ°’μ΄ 같을 것 -> 이런 ν•„λ“œλ₯Ό λͺ¨λ‘ 0
def normalize(df):
    ndf = df.copy()
    for c in df.columns:
        if TAG_MIN[c] == TAG_MAX[c]:
            ndf[c] = df[c] - TAG_MIN[c]
        else:
            ndf[c] = (df[c] - TAG_MIN[c]) / (TAG_MAX[c] - TAG_MIN[c])
    return ndf

 

  • ewm (μ§€μˆ˜κ°€μ€‘ν•¨μˆ˜)
    • 였래된 데이터에 μ§€μˆ˜κ°μ‡ λ₯Ό μ μš©ν•˜μ—¬ 졜근 데이터가 더 큰 영ν–₯을 끼지도둝 κ°€μ€‘μΉ˜λ₯Ό μ£ΌλŠ” ν•¨μˆ˜
    • μΆ”κ°€ λ©”μ„œλ“œλ‘œ mean() 을 μ‚¬μš©ν•΄μ„œ μ§€μˆ˜κ°€μ€‘ν‰κ· μœΌλ‘œ μ‚¬μš© 
      • μ‚¬μš©λ²•
        • df.ewm(com=None, span=None, halflife=None, alpha=None, min_periods=0, adjust=True, ignore_na=False, axis=0, times=None, method='single')
        • 주둜 κ°€μ€‘μΉ˜λ₯Ό κ²°μ •ν•˜λŠ” μš”μ†ŒλŠ” alpha!! ν‰ν™œκ³„μˆ˜(κ°μ‡ κ³„μˆ˜)
          ====>  com / span / halflifeλ₯Ό 톡해 μžλ™ κ³„μ‚°ν•˜λ„λ‘ ν•˜κ±°λ‚˜, alphaλ₯Ό 톡해 직접 μ„€μ •
       
  • exponential weighted function을 ν†΅κ³Όμ‹œν‚¨ κ²°κ³Όμž…λ‹ˆλ‹€. μ„Όμ„œμ—μ„œ λ°œμƒν•˜λŠ” noiseλ₯Ό smoothing μ‹œμΌœμ£ΌκΈ° 

 

  • Pandas Dataframe에 μžˆλŠ” κ°’ 쀑 1 초과의 값이 μžˆλŠ”μ§€, 0 미만의 값이 μžˆλŠ”μ§€, NaN이 μžˆλŠ”μ§€ 점검
    • np.any( ) : λ°°μ—΄μ˜ 데이터 쀑 쑰건과 λ§žλŠ” 데이터가 있으면 True, μ „ν˜€ μ—†μœΌλ©΄ False
def boundary_check(df):
    x = np.array(df, dtype=np.float32)
    return np.any(x > 1.0), np.any(x < 0), np.any(np.isnan(x))
boundary_check(TRAIN_DF)

 

3️⃣ ν•™μŠ΅ λͺ¨λΈ μ„€μ • & 데이터 μž…μΆœλ ₯ μ •μ˜

  • 베이슀라인 λͺ¨λΈμ€ Stacked RNN(GRU cells)을 μ΄μš©ν•΄μ„œ 이상을 탐지
    • 정상 λ°μ΄ν„°λ§Œ ν•™μŠ΅ν•΄μ•Ό ν•˜κ³ , 정상 λ°μ΄ν„°μ—λŠ” μ–΄λ– ν•œ label도 μ—†μœΌλ―€λ‘œ unsupervised learning을 ν•΄μ•Ό 함
  • μŠ¬λΌμ΄λ”© μœˆλ„μš°λ₯Ό 톡해 μ‹œκ³„μ—΄ λ°μ΄ν„°μ˜ 일뢀λ₯Ό κ°€μ Έμ™€μ„œ ν•΄λ‹Ή μœˆλ„μš°μ˜ νŒ¨ν„΄μ„ κΈ°μ–΅
    • μŠ¬λΌμ΄λ”© μœˆλ„μš°λŠ” 90초(HAIλŠ” 1μ΄ˆλ§ˆλ‹€ μƒ˜ν”Œλ§λ˜μ–΄ μžˆμŠ΅λ‹ˆλ‹€)둜 μ„€μ •
  • λͺ¨λΈμ˜ μž…μΆœλ ₯
    • μž…λ ₯ : μœˆλ„μš°μ˜ μ•žλΆ€λΆ„ 89μ΄ˆμ— ν•΄λ‹Ήν•˜λŠ” κ°’
    • 좜λ ₯ : μœˆλ„μš°μ˜ κ°€μž₯ λ§ˆμ§€λ§‰ 초(90번째 초)의 κ°’
  • 탐지 μ‹œμ—λŠ” λͺ¨λΈμ΄ 좜λ ₯ν•˜λŠ” κ°’(μ˜ˆμΈ‘κ°’)κ³Ό μ‹€μ œλ‘œ λ“€μ–΄μ˜¨ κ°’μ˜ μ°¨λ₯Ό 보고 차이가 크면 μ΄μƒμœΌλ‘œ κ°„μ£Ό
    • λ§Žμ€ μ˜€μ°¨κ°€ λ°œμƒν•œλ‹€λŠ” 것은 기쑴에 ν•™μŠ΅ λ°μ΄ν„°μ…‹μ—μ„œ λ³Έ 적이 μ—†λŠ” νŒ¨ν„΄μ΄κΈ° λ•Œλ¬Έ (κ°€μ •)

 

  • HaiDataset : PyTorch의 Dataset μΈν„°νŽ˜μ΄μŠ€λ₯Ό μ •μ˜ν•œ 것
    • 데이터셋을 읽을 λ•ŒλŠ” μŠ¬λΌμ΄λ”© μœˆλ„μš°κ°€ μœ νš¨ν•œ 지 점검
    • 정상적인 μœˆλ„μš°λΌλ©΄ μ›λ„μš°μ˜ 첫 μ‹œκ°κ³Ό λ§ˆμ§€λ§‰ μ‹œκ°μ˜ μ°¨κ°€ 89초
  • stride νŒŒλΌλ―Έν„° : μŠ¬λΌμ΄λ”©μ„ ν•  λ•Œ 크기
    • 전체 μœˆλ„μš°λ₯Ό λͺ¨λ‘ ν•™μŠ΅ν•  μˆ˜λ„ μžˆμ§€λ§Œ, μ‹œκ³„μ—΄ λ°μ΄ν„°μ—μ„œλŠ” μŠ¬λΌμ΄λ”© μœˆλ„μš°λ₯Ό 1μ΄ˆμ”© μ μš©ν•˜λ©΄ 이전 μœˆλ„μš°μ™€ λ‹€μŒ μœˆλ„μš°μ˜ 값이 거의 κ°™μŒ
    • ν•™μŠ΅μ„ λΉ λ₯΄κ²Œ 마치기 μœ„ν•΄ 10μ΄ˆμ”© κ±΄λ„ˆλ›°λ©΄μ„œ 데이터λ₯Ό μΆ”μΆœ
      • μŠ¬λΌμ΄λ”© 크기λ₯Ό 1둜 μ„€μ •ν•˜μ—¬ λͺ¨λ“  데이터셋을 보게 ν•˜λ©΄ 더 쒋을 것

 

WINDOW_GIVEN = 59   # length of the model input (seconds)
WINDOW_SIZE = 60    # full window: 59 input seconds + 1 target second

def HaiDataset(timestamps, df, stride=1, attacks=None):
    ts = np.array(timestamps)
    tag_values = np.array(df, dtype=np.float32)
    valid_idxs = []
    # a window starting at L is valid only if its first and last timestamps are
    # exactly WINDOW_SIZE - 1 seconds apart, i.e. there is no gap inside the window
    for L in trange(len(ts) - WINDOW_SIZE + 1):
        R = L + WINDOW_SIZE - 1
        if dateutil.parser.parse(ts[R]) - dateutil.parser.parse(ts[L]) == timedelta(seconds=WINDOW_SIZE - 1):
            valid_idxs.append(L)
    valid_idxs = np.array(valid_idxs, dtype=np.int32)[::stride]   # keep every stride-th window
    n_idxs = len(valid_idxs)
    print("# of valid windows:", n_idxs)

    if attacks is not None:
        attacks = np.array(attacks, dtype=np.float32)
        with_attack = True
    else:
        with_attack = False

    timestamp, X, y, att = list(), list(), list(), list()

    if with_attack:
        for i in valid_idxs:
            last = i + WINDOW_SIZE - 1
            # input: the first WINDOW_GIVEN seconds, target: the last second, label: attack flag of the last second
            timestamp.append(ts[last])
            X.append(tag_values[i:i + WINDOW_GIVEN])
            y.append(tag_values[last])
            att.append(attacks[last])
        return np.array(timestamp), np.array(X), np.array(y), np.array(att)
    else:
        for i in valid_idxs:
            last = i + WINDOW_SIZE - 1
            timestamp.append(ts[last])
            X.append(tag_values[i:i + WINDOW_GIVEN])
            y.append(tag_values[last])
        return np.array(timestamp), np.array(X), np.array(y)

 

  • ν›ˆλ ¨ 데이터 λ‚˜λˆ„κΈ°
ts, X_train, y_train = HaiDataset(TRAIN_DF_RAW[TIMESTAMP_FIELD], TRAIN_DF, stride=1)

 

  • 데이터셋이 잘 λ‘œλ“œλ¨
    • λͺ¨λΈμ€ 3μΈ΅ bidirectional GRUλ₯Ό μ‚¬μš©
    • Hidden cell의 ν¬κΈ°λŠ” 100
    • Dropout은 μ‚¬μš©ν•˜μ§€ μ•ŠμŒ
HAI_DATASET_TRAIN = HaiDataset(TRAIN_DF_RAW[TIMESTAMP_FIELD], TRAIN_DF, stride=10)
HAI_DATASET_TRAIN[0]

 

  • Validation dataset load
    • Validation도 μœ„μ™€ 같이 μ •κ·œν™”
    • μ κ²€ν•˜κΈ°
VALIDATION_DF_RAW = dataframe_from_csvs(VALIDATION_DATASET)
VALIDATION_DF_RAW

 

VALIDATION_DF = normalize(VALIDATION_DF_RAW[VALID_COLUMNS_IN_TRAIN_DATASET])
boundary_check(VALIDATION_DF)

ts, X_valid, y_valid, attack = HaiDataset(VALIDATION_DF_RAW[TIMESTAMP_FIELD], VALIDATION_DF, attacks=VALIDATION_DF_RAW[ATTACK_FIELD])

 

  • EarlyStopping 지정
    • 과적합을 λ°©μ§€ν•˜κΈ° μœ„ν•΄μ„œλŠ” EarlyStopping μ΄λΌλŠ” μ½œλ°±ν•¨μˆ˜λ₯Ό μ‚¬μš©ν•˜μ—¬ μ μ ˆν•œ μ‹œμ μ— ν•™μŠ΅μ„ μ‘°κΈ° μ’…λ£Œ

EarlyStopping

EarlyStopping(monitor = 'val_loss', min_delta = 0, patience = 0, mode = 'auto')
  • monitor : the quantity watched to decide when to stop training early
    • usually val_loss or val_accuracy (default : val_loss)
  • min_delta : the minimum change that counts as an improvement
    • a change smaller than min_delta is treated as no improvement (default = 0)
  • patience : how many epochs to wait for an improvement before stopping, instead of stopping immediately (default = 0)
  • mode : the criterion used to decide that the monitored quantity has stopped improving
    • if monitor is set to val_loss, training should stop when the value stops decreasing, so use min
    • for val_accuracy, use max (default = auto)
      • auto : chosen automatically from the name set in monitor
      • min : stop training when the monitored value stops decreasing
      • max : stop training when the monitored value stops increasing

Source : https://sevillabk.github.io/1-early-stopping/

  • ModelCheckpoint 지정

ModelCheckpoint

tf.keras.callbacks.ModelCheckpoint(
    filepath, monitor='val_loss', verbose=0, save_best_only=False,
    save_weights_only=False, mode='auto', save_freq='epoch', options=None, **kwargs
)
  • λͺ¨λΈμ΄ ν•™μŠ΅ν•˜λ©΄μ„œ μ •μ˜ν•œ μ‘°κ±΄μ„ λ§Œμ‘±ν–ˆμ„ λ•Œ Model의 weight 값을 쀑간 μ €μž₯
  • ν•™μŠ΅μ‹œκ°„μ΄ κ½€ μ˜€λž˜κ±Έλ¦°λ‹€λ©΄, λͺ¨λΈμ΄ κ°œμ„ λœ validation scoreλ₯Ό λ„μΆœν•΄λ‚Ό λ•Œλ§ˆλ‹€ weightλ₯Ό 쀑간 μ €μž₯함
    • 쀑간에 memory overflowλ‚˜ crashκ°€ λ‚˜λ”λΌλ„ λ‹€μ‹œ weightλ₯Ό λΆˆλŸ¬μ™€μ„œ ν•™μŠ΅ μ΄μ–΄λ‚˜κ°ˆ 수 있음 ( μ‹œκ°„ save)
Arguments
  • filepath : path where the model will be saved
  • monitor : the quantity used to decide when to save the model
  • verbose : 0 or 1
    • 1 : prints a "saved" message on screen whenever the model is saved
    • 0 : saves the model silently, with nothing shown on screen
  • save_best_only : True or False
    • True : the model is saved only when the monitored value is the best seen so far
    • False : the model is saved every epoch as filepath{epoch} (model0, model1, model2, ...)
  • save_weights_only : True or False
    • True : only the model's weights are saved
    • False : the model's layers and weights are both saved
  • mode : 'auto', 'min', 'max'
    • for val_acc, higher accuracy is better, so use max
    • for val_loss, a smaller loss is better, so use min
    • with auto, min or max is chosen automatically from the monitored quantity
  • save_freq : 'epoch' or an integer
    • 'epoch' : the model is saved every epoch
    • integer : the model is saved after that many batches
  • options : a tf.train.CheckpointOptions object
    • used to save the model to a different directory in a distributed environment

Source : https://deep-deep-deep.tistory.com/53

es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=4)
mc = ModelCheckpoint('best_LSTM_model_60seq_TEST.h5', monitor='val_loss', mode='min', verbose=1, save_best_only=True)

 

 

 

 

 

4️⃣ Training the Model

  • LSTM (Long Short-Term Memory)
    • 단기 λ©”λͺ¨λ¦¬μ™€ μž₯κΈ° λ©”λͺ¨λ¦¬λ₯Ό λ‚˜λˆ  ν•™μŠ΅ ν›„, 두 λ©”λͺ¨λ¦¬λ₯Ό 병합해 이벀트 ν™•λ₯ μ„ 예츑 → 과거의 정보λ₯Ό 훨씬 잘 반영
    • LSTM은 κ²Œμ΄νŠΈλΌλŠ” κ°œλ…μœΌλ‘œ 선별 기얡을 확보
      • C(Cell state)λŠ” μž₯κΈ° κΈ°μ–΅ μƒνƒœ
      • H(Hidden State)λŠ” 단기 κΈ°μ–΅ μƒνƒœ
      • xλ‚˜ h, CλŠ” λͺ¨λ‘ λ°μ΄ν„°μ˜ 벑터

## LSTM Model define

n_features = TRAIN_DF.shape[1]

# 3 stacked bidirectional LSTM layers, 100 units each, no dropout
inputs = Input(shape=(WINDOW_GIVEN, n_features))
first = Bidirectional(LSTM(100, return_sequences=True))(inputs)
second = Bidirectional(LSTM(100, return_sequences=True))(first)
third = Bidirectional(LSTM(100))(second)

lstm_out = Dense(n_features)(third)

# auxiliary input: the first timestep of each window, added to the LSTM output
aux_input = Input(shape=(n_features,), name='aux_input')

outputs = keras.layers.add([lstm_out, aux_input])

model_60seq = Model(inputs=[inputs, aux_input], outputs=outputs)

model_60seq.compile(loss='mean_squared_error', optimizer='Adam')

model_60seq.summary()

 

 

  • aux_train으둜 리슀트 정리
    • train, valid λ‘˜λ‹€ γ„± γ„±
aux_train=[]

for i in range(len(X_train)):
    aux_train.append(X_train[i][0])
aux_train=np.array(aux_train)
aux_valid=[]
for i in range(len(X_valid)):
    aux_valid.append(X_valid[i][0])
aux_valid=np.array(aux_valid)
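Since X_train and X_valid are plain 3-D NumPy arrays of shape (windows, 59, features), the same aux arrays can also be built without the loops (an equivalent alternative, not the author's code):

aux_train = X_train[:, 0]   # first timestep of every training window
aux_valid = X_valid[:, 0]   # first timestep of every validation window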

 

  • λͺ¨λΈ ν›ˆλ ¨
    • earlystoppoing!
hist=model_60seq.fit([X_train, aux_train], y_train, batch_size=512, epochs=32, shuffle=True, 
               callbacks=[es, mc], validation_data=([X_valid,aux_valid],y_valid))

 

  • λͺ¨λΈμ €μž₯
model_60seq.save('best_LSTM_model_60seq.h5') #keras h5

 

 

 

5️⃣ Skip Training (Load the Saved Model)

## model load
model_60seq = load_model('best_LSTM_model_60seq.h5')
  • Validation 예츑
## Validation set Prediction

y_pred=model_60seq.predict([X_valid,aux_valid])
tmp=[]
for i in range(len(y_pred)):
    # absolute error between prediction and ground truth for the last second of each window
    tmp.append(abs(y_valid[i]-y_pred[i]))

  • anomaly score κ΅¬ν•˜κΈ°
ANOMALY_SCORE=np.mean(tmp,axis=1)   # one score per window: the error averaged over all features
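Equivalently, because y_valid and y_pred are arrays of the same shape, the score can be computed in a single vectorized line (an alternative sketch, not the author's code):

ANOMALY_SCORE = np.mean(np.abs(y_valid - y_pred), axis=1)   # mean absolute error per window, averaged over the features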
  • threshold μ΄μš©ν•΄ label κ°’ λ„£κΈ°
def put_labels(distance, threshold):
    xs = np.zeros_like(distance)
    xs[distance > threshold] = 1
    return xs
  • labels
THRESHOLD = 0.027

LABELS = put_labels(ANOMALY_SCORE, THRESHOLD)
LABELS, LABELS.shape

  • attack labels
ATTACK_LABELS = put_labels(np.array(VALIDATION_DF_RAW[ATTACK_FIELD]), threshold=0.5)
ATTACK_LABELS, ATTACK_LABELS.shape

  • what's this.?
def fill_blank(check_ts, labels, total_ts):
    def ts_generator():
        for t in total_ts:
            yield dateutil.parser.parse(t)

    def label_generator():
        for t, label in zip(check_ts, labels):
            yield dateutil.parser.parse(t), label

    g_ts = ts_generator()
    g_label = label_generator()
    final_labels = []

    try:
        current = next(g_ts)
        ts_label, label = next(g_label)
        while True:
            if current > ts_label:
                ts_label, label = next(g_label)
                continue
            elif current < ts_label:
                final_labels.append(0)
                current = next(g_ts)
                continue
            final_labels.append(label)
            current = next(g_ts)
            ts_label, label = next(g_label)
    except StopIteration:
        return np.array(final_labels, dtype=np.int8)
  • final labels
%%time
FINAL_LABELS = fill_blank(ts, LABELS, np.array(VALIDATION_DF_RAW[TIMESTAMP_FIELD]))
FINAL_LABELS.shape

ATTACK_LABELS.shape[0] == FINAL_LABELS.shape[0]

# Evaluation tool: eTaPR (enhanced Time-series aware Precision and Recall)
TaPR = etapr.evaluate(anomalies=ATTACK_LABELS, predictions=FINAL_LABELS)

print(f"F1: {TaPR['f1']:.3f} (TaP: {TaPR['TaP']:.3f}, TaR: {TaPR['TaR']:.3f})")
print(f"# of detected anomalies: {len(TaPR['Detected_Anomalies'])}")
print(f"Detected anomalies: {TaPR['Detected_Anomalies']}")

                     

6️⃣ Moving Average

## MOVING AVERAGE

seq60_10mean=[]
for idx in range(len(ANOMALY_SCORE)):
    if idx >= 10:
        # blend the mean of the previous 10 scores 50/50 with the current score
        seq60_10mean.append((ANOMALY_SCORE[idx-10:idx].mean()+ANOMALY_SCORE[idx])/2)
    else:
        # not enough history yet: keep the raw score
        seq60_10mean.append(ANOMALY_SCORE[idx])

seq60_10mean=np.array(seq60_10mean)
print(seq60_10mean.shape)

THRESHOLD = 0.019

LABELS = put_labels(seq60_10mean, THRESHOLD)
LABELS, LABELS.shape

%%time
FINAL_LABELS = fill_blank(ts, LABELS, np.array(VALIDATION_DF_RAW[TIMESTAMP_FIELD]))
FINAL_LABELS.shape

 

ATTACK_LABELS.shape[0] == FINAL_LABELS.shape[0]

 

TaPR = etapr.evaluate(anomalies=ATTACK_LABELS, predictions=FINAL_LABELS)
print(f"F1: {TaPR['f1']:.3f} (TaP: {TaPR['TaP']:.3f}, TaR: {TaPR['TaR']:.3f})")
print(f"# of detected anomalies: {len(TaPR['Detected_Anomalies'])}")
print(f"Detected anomalies: {TaPR['Detected_Anomalies']}")

What is going on.............. why is it 0..............

  • test data μœ„μ™€ 같이
############### Test data Start

TEST_DF_RAW = dataframe_from_csvs(TEST_DATASET)
TEST_DF_RAW = TEST_DF_RAW[:11960]   # only the first 11,960 rows of the concatenated test data are used here
TEST_DF_RAW

TEST_DF = normalize(TEST_DF_RAW[VALID_COLUMNS_IN_TRAIN_DATASET]).ewm(alpha=0.9).mean()
TEST_DF

boundary_check(TEST_DF)

CHECK_TS, X_test, y_test = HaiDataset(TEST_DF_RAW[TIMESTAMP_FIELD], TEST_DF, attacks=None)

aux_test=[]
for i in range(len(X_test)):
    aux_test.append(X_test[i][0])
aux_test=np.array(aux_test)
## Model Prediction

y_pred=model_60seq.predict([X_test,aux_test])
tmp=[]
for i in range(len(y_test)):
    tmp.append(abs(y_test[i]-y_pred[i]))
ANOMALY_SCORE=np.mean(tmp,axis=1)
# Moving Average

seq60_10mean=[]
for idx in range(len(ANOMALY_SCORE)):
    if idx >= 10:
        seq60_10mean.append((ANOMALY_SCORE[idx-10:idx].mean()+ANOMALY_SCORE[idx])/2)
    else:
        seq60_10mean.append(ANOMALY_SCORE[idx])

seq60_10mean=np.array(seq60_10mean)
print(seq60_10mean.shape)

### Threshold Setting

THRESHOLD=0.019

LABELS_60seq = put_labels(seq60_10mean, THRESHOLD)
LABELS_60seq, LABELS_60seq.shape

submission = pd.read_csv('/home/ubuntu/coding/220906/HAI 2.0/sample_submission.csv')
submission.index = submission['time']
submission.loc[CHECK_TS,'attack'] = LABELS_60seq

submission

submission.to_csv('SILab_Moving_Average.csv', index=False)

 

 

 

 

 

 

7️⃣ Gray Area Smoothing

submission = pd.read_csv('/home/ubuntu/coding/220906/HAI 2.0/sample_submission.csv')
submission.index = submission['time']
submission['attack_1']=0
submission['anomaly_score']=0

# spread the window-level scores/labels onto the full submission timeline
submission.loc[CHECK_TS,'anomaly_score'] = seq60_10mean
submission.loc[CHECK_TS,'attack_1'] = LABELS_60seq

# pull them back out as full-length Series aligned with every timestamp
LABELS_60seq=submission['attack_1']
seq60_10mean=submission['anomaly_score']
# Merge short runs (fewer than 10 samples) of identical labels into the label that
# precedes them, smoothing out brief 0/1 flips in the predictions.
def Gray_Area(attacks):
    start = []  # start index of each run
    finish = []  # end index of each run
    c = []  # length of each run
    com = 0  # becomes 1 once the first label change has been seen
    count = 0
    for i in range(1, len(attacks)):
        if attacks[i - 1] != attacks[i]:
            if com == 0:
                start.append(i)
                count = count + 1
                com = 1
            elif com == 1:
                finish.append(i - 1)
                c.append(count)
                count = 0
                start.append(i)
                count = count + 1
        else:
            count = count + 1

    finish.append(len(attacks) - 1)
    c.append(finish[len(finish) - 1] - start[len(start) - 1] + 1)

    for i in range(0, len(start)):
        if c[i] < 10:
            s = start[i]
            f = finish[i] + 1
            g1 = [1 for i in range(c[i])] # Temp Attack list
            g0 = [0 for i in range(c[i])]  # Temp Normal List
            if attacks[start[i] - 1] == 1:
                attacks[s:f] = g1  # change to attack
            else:
                attacks[s:f] = g0  # change to normal

    return attacks
gray_LABELS_60seq=Gray_Area(LABELS_60seq)
submission = pd.read_csv('/home/ubuntu/coding/220906/HAI 2.0/sample_submission.csv')
submission.index = submission['time']

submission['attack']=gray_LABELS_60seq

submission

submission.to_csv('SILab_Gray_Area_Smoothing.csv', index=False)

 

 

8️⃣ Attack Point and Finish Point Policy

# Rebuild the windows with a 10-second window: 9 input seconds + 1 target second
WINDOW_GIVEN = 9
WINDOW_SIZE= 10
ts, X_train, y_train = HaiDataset(TRAIN_DF_RAW[TIMESTAMP_FIELD], TRAIN_DF, stride=1)

ts, X_valid, y_valid, attack = HaiDataset(VALIDATION_DF_RAW[TIMESTAMP_FIELD], 
                                          VALIDATION_DF, attacks=VALIDATION_DF_RAW[ATTACK_FIELD])

## LSTM Model define

n_features = TRAIN_DF.shape[1]

inputs=Input(shape=(WINDOW_GIVEN , n_features))
first=Bidirectional(LSTM(100, return_sequences=True))(inputs)
second=Bidirectional(LSTM(100, return_sequences=True))(first)
third=Bidirectional(LSTM(100))(second)

lstm_out=Dense(n_features)(third)

aux_input = Input(shape=(n_features,), name='aux_input')

outputs = keras.layers.add([lstm_out, aux_input])

model_10seq=Model(inputs=[inputs, aux_input], outputs=outputs)

model_10seq.compile(loss='mean_squared_error', optimizer='Adam')

model_10seq.summary()

aux_train=[]
for i in range(len(X_train)):
    aux_train.append(X_train[i][0])
aux_train=np.array(aux_train)
aux_valid=[]
for i in range(len(X_valid)):
    aux_valid.append(X_valid[i][0])
aux_valid=np.array(aux_valid)
hist=model_10seq.fit([X_train, aux_train], y_train, batch_size=512, epochs=32, shuffle=True, 
               callbacks=[es, mc], validation_data=([X_valid,aux_valid],y_valid))

model_10seq.save('best_LSTM_model_10seq.h5') #keras h5
## model load
model_10seq = load_model('best_LSTM_model_10seq.h5')
CHECK_TS, X_test, y_test = HaiDataset(TEST_DF_RAW[TIMESTAMP_FIELD], TEST_DF, attacks=None)

aux_test=[]
for i in range(len(X_test)):
    aux_test.append(X_test[i][0])
aux_test=np.array(aux_test)
## Model Prediction

y_pred=model_10seq.predict([X_test,aux_test])
tmp=[]
for i in range(len(y_test)):
    tmp.append(abs(y_test[i]-y_pred[i]))
ANOMALY_SCORE=np.mean(tmp,axis=1)
# Moving Average

seq10_10mean=[]
for idx in range(len(ANOMALY_SCORE)):
    if idx >= 10:
        seq10_10mean.append((ANOMALY_SCORE[idx-10:idx].mean()+ANOMALY_SCORE[idx])/2)
    else:
        seq10_10mean.append(ANOMALY_SCORE[idx])

seq10_10mean=np.array(seq10_10mean)
print(seq10_10mean.shape)

### Threshold Setting

THRESHOLD=0.008

LABELS_10seq = put_labels(seq10_10mean, THRESHOLD)

LABELS_10seq, LABELS_10seq.shape

submission = pd.read_csv('./HAI 2.0/sample_submission.csv')
submission.index = submission['time']
submission['attack_1']=0

submission.loc[CHECK_TS,'attack_1'] = LABELS_10seq

LABELS_10seq=submission['attack_1']
gray_LABELS_10seq=Gray_Area(LABELS_10seq)
# Attack start/finish point policy: keep a rolling queue of the last 60 labels
# from the 60-seq model. If the whole last minute is flagged as attack and the
# recent anomaly score is high, fall back to the 10-seq model's label (sharper
# around attack boundaries); otherwise keep the 60-seq label.
Queue=[0 for i in range(59)]

Label=[]

for i in range(len(gray_LABELS_60seq)):
    Queue.append(gray_LABELS_60seq[i])
    N=Queue.count(1)   # number of attack labels among the last 60

    if N>=60:
        if seq60_10mean[i-100:i].max() > 0.1:
            Label.append(gray_LABELS_10seq[i])
        else:
            Label.append(gray_LABELS_60seq[i])

    elif N>=1:
        Label.append(gray_LABELS_60seq[i])

    else:
        Label.append(gray_LABELS_60seq[i])

    del Queue[0]

Label=np.array(Label)

Label=np.array(Label)
submission = pd.read_csv('/home/ubuntu/coding/220906/HAI 2.0/sample_submission.csv')
submission.index = submission['time']

submission['attack']=Label

submission

 

 

 

 

 
