
ADTK (Anomaly Detection Toolkit): Open-Source Time Series Anomaly Detection

징징알파카 2022. 9. 6. 09:52

Written 2022-09-06

<This post was written as study notes, with reference to analyticsindiamag's blog :-) >

https://analyticsindiamag.com/a-hands-on-guide-to-anomaly-detection-in-time-series-using-adtk/

 

A hands-on guide to anomaly detection in time series using ADTK

ADTK is an open-source python package for time series anomaly detection. The name ADTK stands for Anomaly detection toolkit.

analyticsindiamag.com

 

 

😎1. What is ADTK?

  • Anomalous data are rare in historical records, which makes it hard to build a supervised model
  • Even so, anomalies still have to be detected in order to understand the event types of interest; ADTK provides three kinds of components for this
    • Detector
      • Detects anomalies in the data
      • Includes modules for many kinds of time series detection, such as threshold detection, seasonal detection, and regression detection
    • Transformer
      • Transforms the data
      • Includes modules such as rolling aggregation and seasonal decomposition
    • Aggregator (rule chain)
      • Identifies and aggregates the time points flagged by all the rules
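
To make the three roles concrete, here is a minimal sketch in plain pandas. It is not ADTK code and says nothing about how ADTK is implemented; the series, the window size, and both cut-off values are made up for illustration.

```python
import pandas as pd

# Hypothetical daily series with one obvious spike.
idx = pd.date_range("2022-01-01", periods=10, freq="D")
s = pd.Series([10, 11, 10, 12, 11, 50, 11, 10, 12, 11], index=idx, dtype=float)

# "Transformer" role: derive a new series from the raw one (3-day rolling median).
rolling_median = s.rolling(window=3, center=True, min_periods=1).median()

# "Detector" role: each rule returns a boolean series marking suspicious points.
above_limit = s > 30                               # fixed threshold rule
far_from_median = (s - rolling_median).abs() > 20  # deviation from the local level

# "Aggregator" role: combine the rules (here a point is flagged if either fires).
combined = above_limit | far_from_median
print(combined[combined].index.tolist())
```

Swapping `|` for `&` would flag only the points that every rule agrees on, which is the stricter way to aggregate.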

 

 

 

 

😎2. Code implementation

 

1️⃣ Data Load

  • yfinance : a library that provides data scraped from Yahoo Finance
pip install yfinance
  • Pull stock price data for State Bank of India (ticker SBIN.NS)
from datetime import datetime as dt
from dateutil.relativedelta import relativedelta
import yfinance as yf

# Download the last year of daily prices
end = dt.today()
start = end - relativedelta(years=1)
data = yf.download('SBIN.NS', start=start, end=end)
data.tail()
  • The variables are the open, high, low, and close prices and the trading volume

2️⃣ Data preprocessing

  • The stock market is closed on weekends! -> missing values occur
    • These gaps need to be filled, but first check how many dates are missing from the data
import pandas as pd

# pandas.date_range() : general pandas function that returns a fixed-frequency DatetimeIndex
# normalize=True resets start/end to midnight so the dates line up with data.index
pd.date_range(start=start, end=end, normalize=True).difference(data.index)
# difference : returns the dates in the full range that are absent from data.index
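
To see what the difference check returns without downloading anything, here is a small sketch on made-up data. The business-day index and the August 2022 window are assumptions for illustration; real market data would also be missing public holidays.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded prices: business days only.
trading_days = pd.bdate_range("2022-08-01", "2022-08-14")
prices = pd.Series(range(len(trading_days)), index=trading_days, dtype=float)

# Every calendar day in the same window.
all_days = pd.date_range("2022-08-01", "2022-08-14")

# Dates in the full calendar that the price index does not contain.
missing = all_days.difference(prices.index)
print(missing.strftime("%Y-%m-%d").tolist())
```

Only the two weekends in the window come back, which is the list of dates the next step has to fill.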

  • Fill the missing values
    • reindex : conforms the data to a new index along an axis; labels not present in the old index get NaN
idx = pd.date_range(start, end, normalize=True)

# reindex : rows for dates that were not in the original index are created as NaN
data = data.reindex(idx)
# ffill : replace each missing value with the last valid value above it
data = data.ffill()

 

  • Check again for missing dates
pd.date_range(start=start, end=end, normalize=True).difference(data.index)
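
One pitfall with reindex is worth a small demonstration: if the new index carries a time of day, none of its labels match a midnight-stamped index and every row becomes NaN. The sketch below uses made-up prices and an assumed 09:52 timestamp to mimic what `dt.today()` produces.

```python
import pandas as pd

# Hypothetical prices stamped at midnight, business days only.
trading_days = pd.bdate_range("2022-08-01", "2022-08-12")
prices = pd.DataFrame(
    {"Close": [float(i) for i in range(len(trading_days))]},
    index=trading_days,
)

# Start/end with a time of day, as datetime.today() would give.
start = pd.Timestamp("2022-08-01 09:52:00")
end = pd.Timestamp("2022-08-12 09:52:00")

# Labels carry 09:52, so none match the midnight index: all NaN, nothing to ffill.
bad = prices.reindex(pd.date_range(start, end)).ffill()

# normalize=True resets the labels to midnight, so they match and ffill works.
good = prices.reindex(pd.date_range(start, end, normalize=True)).ffill()

print(bad["Close"].isna().all())
print(good["Close"].isna().sum())
```

With the normalized range, the weekend rows simply repeat Friday's value.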

 

 

3️⃣ Data validation and visualization

  • To work with ADTK's modules, first validate the data with the package
# ADTK : a Python package for unsupervised / rule-based time series anomaly detection
from adtk.data import validate_series

data = validate_series(data)
print(data)
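
As far as I understand the ADTK documentation, validate_series fixes things like an unsorted time index and duplicated timestamps (keeping the first value). The sketch below is only a rough pandas approximation of those two fixes on made-up data; it is not ADTK's own code, and the real function checks more than this.

```python
import pandas as pd

# Hypothetical series with an unsorted index and one duplicated timestamp.
idx = pd.to_datetime(["2022-08-03", "2022-08-01", "2022-08-02", "2022-08-02"])
s = pd.Series([3.0, 1.0, 2.0, 99.0], index=idx)

# Sort by time; a stable sort keeps duplicates in their original order.
cleaned = s.sort_index(kind="mergesort")

# Keep only the first value for any repeated timestamp.
cleaned = cleaned[~cleaned.index.duplicated(keep="first")]
print(cleaned)
```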

When I first ran this, every value was still NaN after fillna, so I just carried on without the preprocessing..
Data without preprocessing

  • Visualization
from adtk.visualization import plot

plot(data)

 

4️⃣ Seasonal anomaly detection

  • Detect anomalies where the Volume variable violates its seasonal pattern
from adtk.detector import SeasonalAD

seasonal_vol = SeasonalAD()
anomalies = seasonal_vol.fit_detect(data['Volume'])
anomalies.value_counts()
plot(data, anomaly=anomalies, anomaly_color="orange", anomaly_tag="marker")
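
To get a feel for what "violating a seasonal pattern" means, here is a simplified pandas sketch: remove a weekly profile, then flag residuals far outside the interquartile range. This is a stand-in I wrote for illustration, not SeasonalAD's implementation; the data, the weekly period, and the factor `c` are all assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical daily volume: a weekly profile plus a small week-to-week level shift.
idx = pd.date_range("2022-01-03", periods=8 * 7, freq="D")  # starts on a Monday
weekly_profile = np.array([100, 120, 110, 130, 150, 20, 10], dtype=float)
week_shift = np.repeat(np.array([-3, -2, -1, 0, 0, 1, 2, 3], dtype=float), 7)
volume = pd.Series(np.tile(weekly_profile, 8) + week_shift, index=idx)
volume.iloc[30] += 200  # break the pattern on one day

# Seasonal component: the typical (median) value for each day of the week.
seasonal = volume.groupby(volume.index.dayofweek).transform("median")
residual = volume - seasonal

# Flag residuals more than c * IQR outside the quartiles (c is an assumed factor).
q1, q3 = residual.quantile([0.25, 0.75])
c = 3.0
iqr = q3 - q1
anomalies = (residual < q1 - c * iqr) | (residual > q3 + c * iqr)
print(anomalies[anomalies].index.tolist())
```

A low Saturday or Sunday volume is not flagged here because it matches the weekly profile; only the day that departs from its own weekday's usual level is.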

 

5️⃣ Threshold detection

  • Detect the points where the time series goes outside the given thresholds
print('Average closing price', data['Close'].mean())
print('Minimum closing price', data['Close'].min())
print('Maximum closing price', data['Close'].max())
from adtk.detector import ThresholdAD
threshold_val = ThresholdAD(high=530, low=180)
anomalies_thresh = threshold_val.detect(data['Close'])
  • How many anomalies are there?
    • Count of True values
anomalies_thresh.value_counts()
  • Visualization
from adtk.visualization import plot
plot(data, anomaly=anomalies_thresh, ts_linewidth=1, ts_markersize=3, anomaly_markersize=5, anomaly_color='black');
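
Fixed bounds like 530 and 180 have to be picked by eye from the mean, minimum, and maximum. One alternative is to derive the bounds from quantiles of the data itself. The sketch below shows that idea in plain pandas on made-up prices; the 5% and 95% cut-offs are assumptions, and this is not ADTK code (ADTK's QuantileAD detector is built around the same idea).

```python
import pandas as pd

# Hypothetical closing prices.
idx = pd.date_range("2022-08-01", periods=11, freq="D")
close = pd.Series(
    [400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500],
    index=idx,
    dtype=float,
)

# Assumed cut-offs: treat the lowest and highest 5% of observed prices as unusual.
low = close.quantile(0.05)
high = close.quantile(0.95)

# Boolean series: True where the price falls outside the derived bounds.
outside = (close < low) | (close > high)
print(outside.value_counts())
```

The trade-off is that quantile bounds always flag some fraction of the points, even when nothing unusual happened, while fixed bounds only fire when the price really leaves the chosen range.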

 

 

 

ํ•˜๋‹ค๊ฐ€ ๊ฒฐ์ธก์น˜ ์ฑ„์šฐ๋Š” ๋ถ€๋ถ„์—์„œ data ๊ฐ’์ด NaN ๊ฐ’์œผ๋กœ ๋‚˜์™“๋‹ค..

์ด์œ ๊ฐ€ ๊ถ๊ธˆํ•˜๋‹ค reindex๋Š” ์ฒจ์ด๋ผ..
