
[Kaggle] Credit Card Anomaly Detection

์ง•์ง•์•ŒํŒŒ์นด 2022. 11. 2. 11:23

Written 221102

<This post was written while studying ysjang0926's notebook on Kaggle>

https://www.kaggle.com/code/ysjang0926/kor-introduction-to-anomaly-detection-r01/notebook

 

[kor]Introduction to Anomaly Detection_r01


๐ŸŠ ์‹ ์šฉ ์นด๋“œ ์‚ฌ๊ธฐ ๊ฐ์ง€

: It is important for credit card companies to recognize fraudulent credit card transactions so that customers are not charged for items they did not purchase

  • Dataset

: Contains transactions made with credit cards in September 2013 by European cardholders

: 284,807 transactions over two days, of which 492 are frauds

: The dataset is highly unbalanced; the positive class (frauds) accounts for 0.172% of all transactions

: Contains only numerical input variables, which are the result of a PCA transformation

: Features V1, V2, ... V28 are the principal components obtained with PCA; the only features not transformed with PCA are 'Time' and 'Amount'

     : 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset

     : 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning

     : 'Class' is the response variable; it takes value 1 in case of fraud and 0 otherwise

: Given the class imbalance ratio, accuracy should be measured using the AUPRC (Area Under the Precision-Recall Curve)

  • Anomaly / Novelty / Outlier

- Anomaly

: An observation whose essential characteristics differ from most of the data; it lies far from the existing distribution and is presumed to have been generated in an entirely different way

: An observation that is very different from the others, as if it were produced by a different mechanism

- Novelty

: An observation that is essentially the same kind of data but of a different type

: An observation with a new pattern not previously seen in the data
: Novelty data differs in that it is not treated as an Anomaly but is included in the Normal category

- Outlier

: An observation that deviates from the overall pattern of the data

: Anomaly and Outlier are distinguished by whether the data is a time series or not (i.e., whether time is involved)

  • Anomaly : an observation whose pattern along a temporal or ordered flow differs from the usual patterns
  • Outlier : an observation whose position, regardless of time, deviates from the usual pattern of the observations
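The time-independent Outlier idea above can be sketched with a common rule of thumb (Tukey's IQR fences). This example is not from the referenced notebook, and the `amounts` array is made-up illustrative data:

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Illustrative amounts: one transaction far from the overall pattern
amounts = np.array([10.0, 12.0, 9.5, 11.0, 10.5, 500.0])
print(iqr_outliers(amounts))  # [False False False False False  True]
```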

 

 

๐ŸŠ ์ฝ”๋“œ ๊ตฌํ˜„

1. Data load

from keras.layers import Input, Dense
from keras.models import Model, Sequential
from keras import regularizers
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.manifold import TSNE
from sklearn import preprocessing 
import matplotlib.pyplot as plt
import pandas as pd 
import numpy as np
import seaborn as sns
sns.set(style="whitegrid")
np.random.seed(203)

data = pd.read_csv("creditcard.csv")
data.head()

- Time : seconds elapsed between this transaction and the first transaction in the dataset
- V1 ~ V28 : the result of a PCA dimensionality reduction (V1-V28) applied to protect user IDs and sensitive features
- Amount : transaction amount
- Class : target variable (0: normal, 1: fraud)
    - A total of 284,807 transactions are provided
    - 492 of them are fraud transactions (Fraud Transaction)

 

 

colors = ["#0101DF", "#DF0101"]
LABELS = ["Non-Fraud", "Fraud"]

sns.countplot(x='Class', data=data, palette=colors)  # keyword x= is required in seaborn >= 0.12
plt.title('Class Distribution \n (0 : Non-Fraud / 1 : Fraud)', fontsize=13)
plt.xticks(range(2), LABELS)
plt.xlabel("Class")
plt.ylabel("Frequency");

 

# rename_axis/reset_index(name=...) keeps the column names stable across pandas versions
vc = data['Class'].value_counts().rename_axis('Target').reset_index(name='Count')

vc['percent'] = vc['Count'].apply(lambda x: round(100 * float(x) / len(data), 2))
vc

 

  • Distributions of Amount and Time
fig, ax = plt.subplots(1, 2, figsize=(18,4))

amount_val = data['Amount'].values
time_val = data['Time'].values

# sns.distplot is deprecated; histplot(kde=True) is the modern equivalent
sns.histplot(amount_val, ax=ax[0], color='r', kde=True, stat='density')
ax[0].set_title('Distribution of Transaction Amount', fontsize=14)
ax[0].set_xlim([min(amount_val), max(amount_val)])

sns.histplot(time_val, ax=ax[1], color='b', kde=True, stat='density')
ax[1].set_title('Distribution of Transaction Time', fontsize=14)
ax[1].set_xlim([min(time_val), max(time_val)])

 

 

 

2. Data Preprocessing

- One of the biggest problems is that the target is extremely imbalanced: fraud transactions make up only 0.17% of the data
- Only 1,000 sampled rows of the non-fraud transactions are used
non_fraud = data[data['Class'] == 0].sample(1000)
fraud = data[data['Class'] == 1]

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([non_fraud, fraud]).sample(frac=1).reset_index(drop=True)
X = df.drop(['Class'], axis=1).values
Y = df["Class"].values

 

 

3. Visualization

- What is t-SNE (t-Distributed Stochastic Neighbor Embedding)?
    - A method for reducing high-dimensional, complex data down to two dimensions
    - Often used to visualize high-dimensional data; because similar structures are grouped together during the reduction, it helps with understanding the structure of the data
    - It computes similarities around each data point in the high-dimensional space, then places the points in 2D so that structures that were similar in the original feature space end up close together and dissimilar structures end up far apart; in other words, it tries to preserve each point's neighborhood information

- Differences from PCA
    - PCA uses a linear method, whereas t-SNE is a non-linear dimensionality reduction method
    - PCA : computes the eigenvectors of the covariance matrix
    - t-SNE : computes the similarities between points in the high-dimensional space and the corresponding similarities in the low-dimensional space
    - Because the input features cannot be recovered from a t-SNE embedding, it is hard to infer anything from the t-SNE result alone (t-SNE is mainly used as a visualization tool)
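To make the linear vs. non-linear contrast concrete, a minimal sketch (not part of the notebook; `X_demo` is random stand-in data) runs both reductions side by side:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Toy stand-in for the sampled transaction matrix (1,000 points, 30 features)
X_demo = rng.normal(size=(1000, 30))

X_pca = PCA(n_components=2).fit_transform(X_demo)                     # linear projection
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X_demo)   # non-linear embedding

print(X_pca.shape, X_tsne.shape)  # (1000, 2) (1000, 2)
```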

- How t-SNE works
    - Compute similarities around each data point to form clusters
    - Pick one data point of interest and draw a normal distribution with that value as its mean
        - → nearby points (in the same cluster) get a high similarity (the probability value under the normal distribution), while distant points get a low similarity
    - Compute the similarities for all data points and lay them out like a correlation matrix
        - → points in the same cluster have high similarity
    - Randomly project the data points into the lower dimension and compute the pairwise similarities there (as was done with the normal distribution)
    - In the low-dimensional space a t-distribution is used instead, hence the name t-SNE
        - The t-distribution is used to spread the clusters out more sparsely
def tsne_plot(x1, y1, name="graph.png"):
    tsne = TSNE(n_components=2, random_state=0)
    X_t = tsne.fit_transform(x1)

    plt.figure(figsize=(12, 8))
    plt.scatter(X_t[np.where(y1 == 0), 0], X_t[np.where(y1 == 0), 1], marker='o', color='g', alpha=0.8, label='Non Fraud')
    plt.scatter(X_t[np.where(y1 == 1), 0], X_t[np.where(y1 == 1), 1], marker='o', color='r', alpha=0.8, label='Fraud')

    plt.legend(loc='best')
    plt.savefig(name)  # the name argument was previously unused
    plt.show()
  • Every point represents a transaction; Non-Fraud transactions are drawn in green and Fraud transactions in red
  • The two axes are the components extracted by t-SNE
tsne_plot(X, Y, "original.png")

The classes are not well separated

 

 

4. AutoEncoder

- An Autoencoder is a special type of neural network architecture in which the output is identical to the input
- It is trained in an unsupervised manner to learn very low-level features of the input data; these low-level features are then transformed back to reconstruct the actual data
- It models an identity function whose objective is to output (reconstruct) the input as-is
    - An Encoder that summarizes the characteristics of the input data
    - A Decoder that reconstructs the data from the summarized information
    => Because the input itself serves as the label, the network learns on its own, which is why this is called self-supervised learning

- Assumption : normal observations will be reconstructed better than anomalous observations
- The part where the dimension is reduced is called the Bottleneck (latent vector), and the goal of the Autoencoder is to learn this z part effectively
- It uses an MSE loss function; because the few neurons in the middle form a severe bottleneck, the network must learn an effective representation that compresses the input into a low-dimensional code the Decoder can use to reproduce the original input
- Input data in the high-dimensional space is mapped to a low-dimensional space as a latent representation, and must then be reconstructed back into the original high-dimensional space
- If the data is reconstructed well, the model can be said to have captured the low-dimensional feature space of the data well

===> An Autoencoder detects anomalies by computing the difference between the input data and the reconstructed data (the Reconstruction Error)
  • 1) An Autoencoder takes some data as input, compresses the information, then expands it again to return a result as close as possible to the input; through this 'reconstruction' it encodes the pattern of that data
  • 2) The credit card transaction dataset is imbalanced: it consists mostly of normal transactions, with only a small number of fraudulent ones
  • 3) If the Autoencoder is trained only on normal transactions, fraudulent transactions given as input are reconstructed poorly; this principle is used to detect credit card fraud
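Point 3) is usually operationalized as a threshold on the per-sample reconstruction error. The helper and the 99th-percentile threshold below are illustrative assumptions, not the notebook's code (the notebook instead visualizes the latent space and trains a classifier on it):

```python
import numpy as np

def reconstruction_errors(model, x):
    """Per-sample MSE between inputs and their autoencoder reconstructions."""
    x_hat = model.predict(x, verbose=0)
    return np.mean(np.square(x - x_hat), axis=1)

# Hypothetical usage once the `autoencoder` defined later is trained on normal data:
#   errors = reconstruction_errors(autoencoder, x_scale)
#   threshold = np.percentile(reconstruction_errors(autoencoder, x_norm), 99)
#   suspected_fraud = errors > threshold
```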
## input layer 
input_layer = Input(shape=(X.shape[1],))

## encoding part
encoded = Dense(100, activation='tanh', activity_regularizer=regularizers.l1(10e-5))(input_layer)
encoded = Dense(50, activation='relu')(encoded)

## decoding part
decoded = Dense(50, activation='tanh')(encoded)
decoded = Dense(100, activation='tanh')(decoded)

## output layer
output_layer = Dense(X.shape[1], activation='relu')(decoded)
  • The network structure (neuron counts) is input → 100 → 50 → 50 → 100 → input, with the middle (Bottleneck Layer) dimension being 50
  • For visualization, this representation is later reduced to two dimensions again with t-SNE

 

autoencoder = Model(input_layer, output_layer)

# Adadelta: adapts the learning rate based on a moving window of gradient updates,
# instead of accumulating all past gradients (an extension of Adagrad)
autoencoder.compile(optimizer="adadelta", loss="mse")

 

  • Because each variable has a different range, Min-Max scaling is applied to bring them onto the same scale
x = data.drop(["Class"], axis=1)
y = data["Class"].values

x_scale = preprocessing.MinMaxScaler().fit_transform(x.values)
x_norm, x_fraud = x_scale[y == 0], x_scale[y == 1]
  • Train the model
autoencoder.fit(x_norm[0:2000], x_norm[0:2000], 
                batch_size = 256, epochs = 10, 
                shuffle = True, validation_split = 0.20);

 

  • Build another network from sequential layers, adding the trained weights only up to the third layer, where the latent representation lives
hidden_representation = Sequential()
hidden_representation.add(autoencoder.layers[0])
hidden_representation.add(autoencoder.layers[1])
hidden_representation.add(autoencoder.layers[2])
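Copying layers into a Sequential model works, but with the Keras functional API the same encoder can be built by wrapping the existing tensors in a second Model that shares weights. A minimal sketch with illustrative layer sizes and a 30-feature input (not the notebook's code):

```python
from keras.layers import Input, Dense
from keras.models import Model

# Toy functional-API graph mirroring the autoencoder's shape (sizes are illustrative)
inp = Input(shape=(30,))
h = Dense(100, activation='tanh')(inp)
z = Dense(50, activation='relu')(h)            # latent representation
out = Dense(30, activation='relu')(Dense(100, activation='tanh')(z))

ae = Model(inp, out)         # the full autoencoder
encoder = Model(inp, z)      # shares the trained weights with `ae`; no layer copying
print(encoder.output_shape)  # (None, 50)
```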
 
  • Feed the raw inputs through it to generate the hidden representations for the two classes, Fraud and Non-Fraud
norm_hid_rep = hidden_representation.predict(x_norm[:3000])
fraud_hid_rep = hidden_representation.predict(x_fraud)

 

  • We can see that the Fraud and Non-Fraud transactions become visible and linearly separable
rep_x = np.append(norm_hid_rep, fraud_hid_rep, axis = 0)

y_n = np.zeros(norm_hid_rep.shape[0])
y_f = np.ones(fraud_hid_rep.shape[0])

rep_y = np.append(y_n, y_f)
tsne_plot(rep_x, rep_y, "latent_representation.png")

The classes are now well separated
 
 
 

5. Simple Linear Classifier

train_x, val_x, train_y, val_y = train_test_split(rep_x, rep_y, test_size=0.25)

clf = LogisticRegression(solver="lbfgs").fit(train_x, train_y)

pred_y = clf.predict(val_x)
print ("Classification Report: ")
print (classification_report(val_y, pred_y))

print ("")
# round() returns a plain float, which has no .astype method; use str() instead
print('Logistic Regression Accuracy Score: ', str(round(accuracy_score(val_y, pred_y) * 100, 3)) + '%')

  • ์ด 873๊ฑด์˜ Test Set ์ค‘ Non-Fraud๋Š” 742๊ฑด & Fraud๋Š” 114๊ฑด
  • Fraud๋Š” 129๊ฑด ์ค‘ 114๊ฑด์€ ๋งž์ถ”๊ณ  16๊ฑด์€ ํ‹€๋ฆฐ ๊ฒƒ์„ ํ™•์ธ
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_pred=pred_y, y_true=val_y)
cmp = ConfusionMatrixDisplay(cm)

cmp.plot(cmap=plt.cm.Blues)
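Since the dataset notes recommend AUPRC under this imbalance, the classifier could also be scored with sklearn's `average_precision_score` (a standard AUPRC estimate). The sketch below uses made-up stand-in data; in the post's context one would pass `val_y` and `clf.predict_proba(val_x)[:, 1]` instead:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Toy imbalanced, well-separated data standing in for (rep_x, rep_y)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (300, 5)), rng.normal(3, 1, (30, 5))])
y_demo = np.array([0] * 300 + [1] * 30)

tr_x, te_x, tr_y, te_y = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)
scores = LogisticRegression().fit(tr_x, tr_y).predict_proba(te_x)[:, 1]
auprc = average_precision_score(te_y, scores)
print(round(auprc, 3))  # high AUPRC on this easily separable toy data
```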
