๐Ÿ˜Ž ๊ณต๋ถ€ํ•˜๋Š” ์ง•์ง•์•ŒํŒŒ์นด๋Š” ์ฒ˜์Œ์ด์ง€?

[Kaggle]Breast Cancer Wisconsin (Diagnostic) Data Set_์œ ๋ฐฉ์•” ๋ถ„๋ฅ˜ ๋ณธ๋ฌธ

๐Ÿ‘ฉ‍๐Ÿ’ป ์ปดํ“จํ„ฐ ๊ตฌ์กฐ/Kaggle

[Kaggle]Breast Cancer Wisconsin (Diagnostic) Data Set_์œ ๋ฐฉ์•” ๋ถ„๋ฅ˜

์ง•์ง•์•ŒํŒŒ์นด 2022. 1. 28. 01:27
728x90
๋ฐ˜์‘ํ˜•

Written 2022-01-28

<๋ณธ ๋ธ”๋กœ๊ทธ๋Š” Kaggle ์„ ์ฐธ๊ณ ํ•ด์„œ ๊ณต๋ถ€ํ•˜๋ฉฐ ์ž‘์„ฑํ•˜์˜€์Šต๋‹ˆ๋‹ค>

https://bigdaheta.tistory.com/33

 

[๋จธ์‹ ๋Ÿฌ๋‹] ์บ๊ธ€(kaggle)์˜ˆ์ œ - ์œ„์Šค์ฝ˜์‹  ์œ ๋ฐฉ์•” ์˜ˆ์ธก ๋ฐ์ดํ„ฐ ๋ถ„์„ (Wisconsin Diagnostic breast cancer datase

์œ„์Šค์ฝ˜์‹  ์œ ๋ฐฉ์•” ๋ฐ์ดํ„ฐ ์„ธํŠธ๋Š” ์ข…์–‘์˜ ํฌ๊ธฐ, ๋ชจ์–‘ ๋“ฑ์˜ ๋‹ค์–‘ํ•œ ์†์„ฑ ๊ฐ’์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•ด๋‹น ์ข…์–‘์ด ์•…์„ฑ(malignmant)์ธ์ง€ ์–‘์„ฑ (benign)์ธ์ง€๋ฅผ ๋ถ„๋ฅ˜ํ•œ ๋ฐ์ดํ„ฐ ์„ธํŠธ์ด๋‹ค. ์ด ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ์•™์ƒ๋ธ”(ํˆฌํ‘œ,

bigdaheta.tistory.com

 

 

 

 

์œ ๋ฐฉ์•” ์•…์„ฑ ์ข…์–‘์ธ์ง€
์–‘์„ฑ ์ข…์–‘์ธ์ง€
์ด์ง„ ๋ถ„๋ฅ˜ ํ•ด๋ณด์ž!

 

 

 

1. ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ import

import numpy as np
import matplotlib.pyplot as plt

import pandas as pd
from sklearn import datasets

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import roc_curve, auc

from sklearn.model_selection import learning_curve, validation_curve
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score


2. ๋ฐ์ดํ„ฐ ๊ฐ€์ ธ์˜ค๊ธฐ

data = pd.read_csv("data.csv")
data

data.info()

  • ์ผ๋‹จ float ์ด๋‚˜ int ๋งŒ ๋ณด์ธ๋‹ค... object ๋Š” ์•ˆ๋ณด์ธ๋‹ค
data.isnull().sum()

  • Unnamed: 32 is all nulls!


3. ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ

1) Encode diagnosis as numbers

data["diagnosis"].unique()

  • '์–‘์„ฑ์ข…์–‘(Benign tumor)'๊ณผ '์•…์„ฑ์ข…์–‘(Malignant tumor)
  • ์•…์„ฑ์ข…์–‘์„ 1, ์–‘์„ฑ์–‘์„ฑ์„ 0
def encode_diagnosis(label):
    # malignant 'M' -> 1, benign 'B' -> 0
    if 'M' in label :
        return 1
    else :
        return 0

data["판정"] = data["diagnosis"].apply(encode_diagnosis)   # 판정 = diagnosis label
data.head(10)
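
The same encoding as a one-line sketch with pandas map:

data["판정"] = data["diagnosis"].map({"M": 1, "B": 0})   # M -> 1, B -> 0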


2) ํ•„์š”์—†๋Š”๊ฑฐ drop

data.columns

  • id is not needed, diagnosis has already been encoded into 판정, and Unnamed: 32 is all null, so drop all three!
total = data.drop(columns= ["id", "diagnosis", "Unnamed: 32"])
total.head()


4. Train/test split

X = total.drop(['판정'], axis = 1) # drop the label along the column axis
y = total["ํŒ์ •"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
  • test_size sets the fraction of the data held out as the test set.
  • stratify takes the class variable: the split keeps the same class ratio within each stratum (verified in the sketch below).
  • shuffle=True (the default) draws the split at random; shuffle=False instead takes the rows in their original order. Note that stratify requires shuffle=True.
  • random_state fixes the random seed so the split is reproducible; any number will do.
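
A quick check, not in the original post, that stratification kept the malignant ratio stable across splits:

# Fraction of malignant (1) samples in the full data, the train split, and the test split
print(y.mean(), y_train.mean(), y_test.mean())   # all three should be nearly equal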

5. ๋ชจ๋ธ ๊ตฌ์ถ•

1) Voting: which classifier does best?

  • ๋‹ค์ˆ˜์˜ ๋ถ„๋ฅ˜๊ธฐ์˜ ๋ ˆ์ด๋ธ” ๊ฐ’ ๊ฒฐ์ • ํ™•๋ฅ ์„ ๋ชจ๋‘ ๋”ํ•˜๊ณ  ํ‰๊ท ๋‚ธ ํ™•๋ฅ  ๊ฐ’์ด ๊ฐ€์žฅ ๋†’์€ ๋ ˆ์ด๋ธ”์„ ์ตœ์ข… ๋ณดํŒ… ๊ฒฐ๊ณผ๊ฐ’์œผ๋กœ ์„ ์ • => ์†Œํ”„ํŠธ ๋ณดํŒ…
  • ๋‹ค์ˆ˜์˜ ๋ถ„๋ฅ˜๊ธฐ๊ฐ€ ์˜ˆ์ธกํ•œ ์˜ˆ์ธก๊ฐ’์„ ์ตœ์ข… ๋ณดํŒ… ๊ฒฐ๊ณผ๊ฐ’์œผ๋กœ ์„ ์ • => ํ•˜๋“œ ๋ณดํŒ…
logistic = LogisticRegression( solver = "liblinear",
                        penalty = "l2",
                        C = 0.001,
                        random_state = 1)
tree = DecisionTreeClassifier(max_depth = None,
                        criterion="entropy",
                        random_state=1)
knn = KNeighborsClassifier(n_neighbors=1,
                        p = 2,
                        metric = "minkowski")

voting_estimators = [("logistic", logistic), ("tree", tree), ("knn", knn)]

voting = VotingClassifier(estimators=voting_estimators,
                        voting = "soft")

clf_labels1 = ["Logistic regression", "Decision Tree", "KNN", "Majority voting"]

all_clf1 = [logistic, tree, knn, voting]

2) ๋ฐฐ๊น… : ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ๋ถ„๋ฅ˜ ๋ชจ๋ธ ์ค‘ ํ•œ๊ฐ€์ง€ ๋ชจ๋ธ์—๋งŒ ์ง‘์ค‘ ๋ชจ๋ธ ๊ตฌ์ถ•

tree = DecisionTreeClassifier (max_depth= None,
                        criterion="entropy",
                        random_state=1)
forest = RandomForestClassifier(criterion="gini",
                        n_estimators=500,   # number of trees in the forest
                        random_state=1)
clf_labels2 = ["Decision Tree", "Random Forest"]
all_clf2 = [tree, forest]
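
For comparison, plain bagging without Random Forest's per-split feature sampling can be sketched with BaggingClassifier (an illustration, not part of the original notebook; base_estimator is the pre-1.2 scikit-learn spelling used throughout this post):

from sklearn.ensemble import BaggingClassifier

# 500 trees, each fit on a bootstrap sample of the training rows
bagging = BaggingClassifier(base_estimator=DecisionTreeClassifier(criterion="entropy",
                                                                  random_state=1),
                            n_estimators=500,
                            bootstrap=True,
                            random_state=1)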

3) ๋ถ€์ŠคํŒ…

tree = DecisionTreeClassifier(max_depth=1,
                    criterion="entropy",
                    random_state=1)
adaboost = AdaBoostClassifier(base_estimator=tree,   # renamed to 'estimator' in scikit-learn >= 1.2
                    n_estimators=500,
                    learning_rate = 0.1,
                    random_state=1)

clf_labels3 = ["Decision Tree", "Ada Boost"]
all_clf3 = [tree, adaboost]
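
As a sanity check that is not in the original post, staged_predict shows how test accuracy evolves as boosting rounds are added:

# Accuracy on the test set after each boosting round
adaboost.fit(X_train, y_train)
staged_acc = [accuracy_score(y_test, y_hat)
              for y_hat in adaboost.staged_predict(X_test)]

plt.plot(range(1, len(staged_acc) + 1), staged_acc)
plt.xlabel("Boosting rounds")
plt.ylabel("Test accuracy")
plt.show()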

4) Compare ROC AUC with 10-fold cross-validation

for clf, label in zip(all_clf1, clf_labels1) :
    scores = cross_val_score(estimator=clf,
                    X = X_train,
                    y = y_train,
                    cv = 10,
                    scoring ="roc_auc")
    print("ROC AUC : %.3f ( +/- %.3f) [%s]"
            % (scores.mean(), scores.std(), label))

  • ์Œ Voting ์ด ๋†’๋‹ค
for clf, label in zip(all_clf2, clf_labels2) :
    scores = cross_val_score(estimator=clf,
                    X = X_train,
                    y = y_train,
                    cv = 10,
                    scoring ="roc_auc")
    print("ROC AUC : %.3f ( +/- %.3f) [%s]"
            % (scores.mean(), scores.std(), label))

  • ๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ๊ฐ€ ์ข‹๋‹ค
for clf, label in zip(all_clf3, clf_labels3) :
    scores = cross_val_score(estimator=clf,
                    X = X_train,
                    y = y_train,
                    cv = 10,
                    scoring ="roc_auc")
    print("ROC AUC : %.3f ( +/- %.3f) [%s]"
            % (scores.mean(), scores.std(), label))

  • AdaBoost does better.


6. Plot ROC curves

All three comparisons share the same plotting code, so define one helper and call it per classifier group:

colors = ["orange", "pink", "blue", "green"]
linestyles = [':', "--", "-.", "-"]

def plot_roc_curves(classifiers, labels, title):
    # Fit each classifier, then plot its ROC curve on the held-out test set
    for clf, label, clr, ls in zip(classifiers, labels, colors, linestyles):
        clf.fit(X_train, y_train)
        y_pred = clf.predict_proba(X_test)[:, 1]          # probability of class 1 (malignant)
        fpr, tpr, thresholds = roc_curve(y_true=y_test, y_score=y_pred)
        roc_auc = auc(x=fpr, y=tpr)
        plt.plot(fpr, tpr, color=clr, linestyle=ls,
                 label="%s (auc = %.3f)" % (label, roc_auc))

    plt.legend(loc="lower right")
    plt.plot([0, 1], [0, 1], linestyle="--", color="gray", linewidth=2)   # chance diagonal
    plt.xlim([-0.1, 1.1])
    plt.ylim([-0.1, 1.1])
    plt.grid(alpha=0.5)
    plt.xlabel("False positive rate (FPR)")
    plt.ylabel("True positive rate (TPR)")
    plt.title(title)
    plt.show()

plot_roc_curves(all_clf1, clf_labels1, "Voting")
plot_roc_curves(all_clf2, clf_labels2, "RandomForest")
plot_roc_curves(all_clf3, clf_labels3, "AdaBoost")


7. ์ •์˜ค ๋ถ„๋ฅ˜ํ‘œ

forest.fit(X_train, y_train)

y_pred = forest.predict(X_test)

print("RandomForest")
print("์ž˜๋ชป ๋ถ„๋ฅ˜๋œ ์ƒ˜ํ”Œ ๊ฐœ์ˆ˜ : %d" %(y_test != y_pred).sum())
print("์ •ํ™•๋„ : %.3f" % accuracy_score(y_test, y_pred))
print("์ •๋ฐ€๋„ : %.3f" % precision_score(y_true = y_test, y_pred = y_pred))
print("์žฌํ˜„์œจ : %.3f" % recall_score(y_true=y_test, y_pred=y_pred))
print("F1 : %.3f" % f1_score(y_true=y_test, y_pred=y_pred))

adaboost.fit(X_train, y_train)

y_pred = adaboost.predict(X_test)

print("AdaBoost")
print("Misclassified samples : %d" % (y_test != y_pred).sum())
print("Accuracy : %.3f" % accuracy_score(y_test, y_pred))
print("Precision : %.3f" % precision_score(y_true = y_test, y_pred = y_pred))
print("Recall : %.3f" % recall_score(y_true=y_test, y_pred=y_pred))
print("F1 : %.3f" % f1_score(y_true=y_test, y_pred=y_pred))


8. ์ตœ์ ํ™”

voting.get_params()

  • get_params() lists every tunable parameter; nested ones follow the <estimator>__<parameter> naming convention that the grid below uses.


params = {"logistic__C" : [0.001, 0.1, 100.0],
        "tree__max_depth" : [1, 2, 3, 4, 5],
        "knn__n_neighbors" : [1, 2, 3, 4, 5]}

grid = GridSearchCV(estimator=voting,
                    param_grid=params,
                    cv = 10,
                    scoring = "roc_auc",
                    )

grid.fit(X_train, y_train)

for i, _ in enumerate(grid.cv_results_["mean_test_score"]) :
    print("%.3f +/- %.3f %r"
            %(grid.cv_results_["mean_test_score"][i],
            grid.cv_results_["std_test_score"][i],
            grid.cv_results_["params"][i]))

print("์ตœ์ ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ : %s" %grid.best_params_)
print("ACU : %.3f" % grid.best_score_)


9. ํŠน์„ฑ ์ค‘์š”๋„

  • ๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ๋Š” ๋ณ„๋„์˜ ํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹ ํ•„์š” ์—†์Œ
feat_labels = X.columns
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]   # feature indices sorted from most to least important

for i in range(X_train.shape[1]) :
    print("%2d) %-*s %f" % (i + 1, 30, feat_labels[indices[i]],
                                        importances[indices[i]]))

plt.title("Feature Importances")
plt.bar(range(X_train.shape[1]), importances[indices], align="center")

plt.xticks(range(X_train.shape[1]), feat_labels[indices], rotation = 90)

plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
plt.show()
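
As a follow-up sketch, not in the original post, SelectFromModel can keep only the features above mean importance:

from sklearn.feature_selection import SelectFromModel

# Keep features whose importance exceeds the mean importance of the fitted forest
sfm = SelectFromModel(forest, threshold="mean", prefit=True)
X_train_selected = sfm.transform(X_train)
print("Selected %d of %d features" % (X_train_selected.shape[1], X_train.shape[1]))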
