[Python] 48. Analyzing picher_stats_2017.csv, VIF, r2-score, time series analysis with ARIMA

1. Analyzing picher_stats_2017.csv

2. VIF

3. r2-score

4. Time series analysis with ARIMA

# -*- coding: utf-8 -*-
"""
Created on Mon Jul 11 09:09:30 2022

@author: p011v
"""
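'''
    Evaluation criteria for machine learning models
        Regression metrics : error measures, so the closer to 0 the more accurate the model
            mse : mean_squared_error
                mean of (actual - predicted)**2
                smaller values mean higher accuracy
            rmse : square root of the mse
                 smaller values mean higher accuracy
        Classification metrics : the closer to 1 the more accurate the model
              accuracy  : (TP + TN) / (TP + TN + FP + FN)
              precision : TP / (TP + FP)
              recall    : TP / (TP + FN)
              f1-score  : harmonic mean of precision and recall, 2 * (precision * recall) / (precision + recall)

              TP : actual T, predicted T
              TN : actual F, predicted F
              FP : actual F, predicted T
              FN : actual T, predicted F
'''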
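
The formulas in the docstring above can be checked by hand with a small pure-Python sketch. The helper names and the sample values are made up for illustration and are not part of the original script; the helpers assume at least one predicted positive and one actual positive, so no division by zero is handled.

```python
from math import sqrt

def regression_errors(actual, predicted):
    """Return (mse, rmse) for two equal-length sequences of numbers."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    mse = sum(squared) / len(squared)
    return mse, sqrt(mse)

def classification_scores(actual, predicted):
    """Return (accuracy, precision, recall, f1) for boolean labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # actual T, predicted T
    tn = sum(1 for a, p in pairs if not a and not p)  # actual F, predicted F
    fp = sum(1 for a, p in pairs if not a and p)      # actual F, predicted T
    fn = sum(1 for a, p in pairs if a and not p)      # actual T, predicted F
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical values, only to exercise the formulas
mse, rmse = regression_errors([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
accuracy, precision, recall, f1 = classification_scores(
    [True, True, False, False, True],
    [True, False, False, True, True],
)
print(mse, rmse)
print(accuracy, precision, recall, f1)
```
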
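########## 1. Analyzing picher_stats_2017.csv
# 1. Predict the pitchers' salaries
import pandas as pd
picher = pd.read_csv("data/picher_stats_2017.csv")
picher.info()
picher.팀명.unique()
# 2. One-hot encode the team name (팀명) with pd.get_dummies() so it can be added to the picher dataset
# dtype=int keeps the dummy columns numeric (newer pandas versions default to bool)
onehot_team = pd.get_dummies(picher["팀명"], dtype=int)
onehot_team.head()
# 3. Remove the 팀명 column
del picher["팀명"]
# 4. Append the onehot_team columns to picher
picher = pd.concat([picher, onehot_team], axis=1)
picher.info()
# 5. Rename the 연봉(2018) (2018 salary) column to y
picher = picher.rename(columns={"연봉(2018)":"y"})
picher.info()
# 6. Split into independent and dependent variables
#   independent variables: every column except 선수명 (player name) and y
#   dependent variable: the y column
# .copy() avoids SettingWithCopyWarning when the columns are scaled in place below
X = picher[picher.columns.difference(['선수명','y'])].copy()
Y = picher["y"]
X.info()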

# 7. Standardize the data with a user-defined scaling function
def standard_scaling(df, scale_columns) :
    for col in scale_columns :
        s_mean = df[col].mean() # mean of column col
        s_std = df[col].std()   # standard deviation of column col
        # apply : runs the given function on each element of the column
        df[col] = df[col].apply(lambda x : (x-s_mean) / s_std) # z-score
    return df

# The one-hot columns (onehot_team.columns) are not scaled
scale_columns = X.columns.difference(onehot_team.columns)
picher_df = standard_scaling(X, scale_columns) # scales X in place; picher_df and X are the same object
picher_df.head()
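
As a sanity check on what the scaling step does to a single column, here is a minimal standard-library sketch of the same z-score computation. The function name and the four sample values are illustrative assumptions; `statistics.stdev` is the sample standard deviation (n - 1 in the denominator), which matches the default of pandas `Series.std()`.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize a list of numbers to mean 0 and sample standard deviation 1."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(v - m) / s for v in values]

# Hypothetical column values
scaled = z_scores([10.0, 20.0, 30.0, 40.0])
print(scaled)
print(mean(scaled), stdev(scaled))
```

After scaling, the column has mean 0 and standard deviation 1 up to floating-point error, so features measured in very different units become comparable.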
# 8. Split into training and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
    train_test_split(X, Y, test_size=0.2, random_state=10)
# 9. Regression analysis
# statsmodels : regression analysis module
import statsmodels.api as sm # pip install statsmodels
# Add a constant column so the model can estimate an intercept term
X_train = sm.add_constant(X_train)
X_train.head()
'''
    OLS : ordinary least squares, a model for linear regression.
        - quantifies the effect of each independent variable on the dependent variable
'''
model = sm.OLS(y_train, X_train).fit() # create and fit the model
model.summary()
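'''
    R-squared (coefficient of determination) : share of the variance of the dependent variable
        explained by the independent variables. Between 0 and 1 for this fit;
        it never decreases when more independent variables are added.
    Adj. R-squared (adjusted R-squared) : R-squared recomputed to account for the sample size
        and the number of independent variables.
    P > |t| : p-value of each feature's (column's, variable's) t-statistic, indicating whether the feature is significant.
              Features with a value below 0.05 are treated as significant in the regression.
              Three features are significant here: WAR, 연봉(2017), 한화.
    coef : regression coefficient, the estimated effect of each independent variable on the dependent variable.
        => a criterion for selecting independent variables
'''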
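
The adjustment mentioned in the docstring above uses the standard textbook formula, adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1), where n is the sample size and p the number of independent variables excluding the intercept. The sketch below only illustrates that formula; the function name and the input numbers are hypothetical and are not taken from the model summary.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared; n_features excludes the intercept."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Same hypothetical R-squared and sample size, different numbers of features
few_features = adjusted_r2(0.9, n_samples=100, n_features=5)
many_features = adjusted_r2(0.9, n_samples=100, n_features=30)
print(few_features, many_features)
```

With the same R², the model that needed more features gets the lower adjusted value, which is why the adjusted figure is the safer one to compare across models of different size.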
# 10. Visualization
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [20, 16]
plt.rc('font', family='Malgun Gothic')
plt.rcParams['axes.unicode_minus'] = False
coefs = model.params.tolist()
coefs_series = pd.Series(coefs)
x_labels = model.params.index.tolist()
ax = coefs_series.plot(kind='bar')
ax.set_title('feature_coef_graph')
ax.set_xlabel('x_features')
ax.set_ylabel('coef')
ax.set_xticklabels(x_labels)

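########## 2. VIF
'''
    VIF (variance inflation factor):
        Independent variables should be independent of one another; the less they are correlated, the better.
    Multicollinearity : when independent variables are highly correlated with each other,
                the variance of the estimated coefficients is inflated and the coefficients become unreliable.
                It is better to keep only one of a group of highly correlated independent variables.
    In this example, only one of FIP and KFIP should be selected.
'''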
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) \
                     for i in range(X.shape[1])]
vif["features"] = X.columns
print(vif.round(1))
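
To see why two nearly identical features produce a large VIF, the textbook definition VIF_j = 1 / (1 - R_j²) can be worked through for the simplest case of two predictors, where R_j² is just the squared Pearson correlation between them. This is a sketch of the definition only, not of how statsmodels computes its values; the helper names and all numbers are made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def vif_two_predictors(xs, ys):
    """VIF when the model has exactly two predictors: 1 / (1 - r**2)."""
    r = pearson_r(xs, ys)
    return 1 / (1 - r ** 2)

# Hypothetical values: fip and kfip move almost together, walks does not
fip = [3.1, 3.8, 4.2, 4.9, 5.5]
kfip = [3.0, 3.9, 4.1, 5.0, 5.4]
walks = [2.5, 4.0, 3.1, 2.2, 3.6]

vif_collinear = vif_two_predictors(fip, kfip)
vif_unrelated = vif_two_predictors(fip, walks)
print(round(vif_collinear, 1), round(vif_unrelated, 2))
```

A common rule of thumb treats a VIF above about 10 as a sign of problematic multicollinearity; the nearly collinear pair lands far above that, while the weakly correlated pair stays close to the minimum of 1.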

# Regression analysis
from sklearn import linear_model

lr = linear_model.LinearRegression() # choose the algorithm
# Select the independent variables
X = picher_df[["FIP", "WAR", "볼넷/9", "삼진/9", "연봉(2017)"]]
Y
# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.2, random_state=19)

lr = lr.fit(X_train, y_train) # train
predict_2018_salary = lr.predict(X) # predict
# Add the predictions to picher_df as the 예측연봉 (predicted salary) column, aligned on X's index
picher_df["예측연봉"] = pd.Series(predict_2018_salary, index=X.index)
picher_df["예측연봉"].head()
picher_df.info()
picher_df["연봉(2017)"].head() # standardized values
picher["연봉(2017)"].head()    # original values
# Replace the standardized 연봉(2017) column in picher_df with the original values
del picher_df["연봉(2017)"]
picher_df["연봉(2017)"] = picher["연봉(2017)"]
picher_df["y"] = picher["y"]
picher_df["선수명"] = picher["선수명"]
picher_df.info()
picher_df[["선수명", "연봉(2017)", "y", "예측연봉"]].head()
picher[["선수명", "연봉(2017)", "y"]].head()
# Sort by 2018 salary in descending order
result_df = picher_df.sort_values(by=["y"], ascending=False)
result_df.head()
result_df = result_df[["선수명", "연봉(2017)", "y", "예측연봉"]]
result_df.head()

# Plot last year's salary, the predicted salary, and the actual salary
# for the 10 highest-paid players in 2018 whose 2017 and 2018 salaries differ
result_df.info()
# Columns: player name, last year's salary (2017), actual salary (2018), predicted salary (2018)
result_df.columns = \
    ["선수명", "작년연봉(2017)", "실제연봉(2018)", "예측연봉(2018)"]
# Keep only the players whose 2017 and 2018 salaries differ
result_df = \
    result_df[result_df["작년연봉(2017)"] != result_df["실제연봉(2018)"]]
result_df = result_df.sort_values(by=["실제연봉(2018)"], ascending=False)
result_df = result_df.iloc[:10, :]
result_df.plot(x = "선수명", \
               y = ["작년연봉(2017)", "예측연봉(2018)", "실제연봉(2018)"], kind="bar")
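########## 3. r2-score
# Evaluating the regression
# 1. Print the r2-score
# r2-score : coefficient of determination, the share of the variance in the dependent
#            variable explained by the independent variables.
#            At most 1; it can fall below 0 on held-out data when the model fits worse than predicting the mean.
lr.score(X_train, y_train) #0.9150591192570362
lr.score(X_test, y_test)   #0.9038759653889866

# 2. Print the rmse
# mse(mean_squared_error) : smaller values mean better performance.
# rmse : square root of the mse. Smaller values mean better performance
from math import sqrt # square root function
from sklearn.metrics import mean_squared_error # mse function
y_predictions = lr.predict(X_train) # predict on the training data
# y_train : actual values
# y_predictions : predicted values
print(sqrt(mean_squared_error(y_train, y_predictions))) # training rmse : 7893.462873347693
y_predictions = lr.predict(X_test)
print(sqrt(mean_squared_error(y_test, y_predictions))) # test rmse : 13141.866063591076
# Overfitting : the model performs better on the training data than on the test data (here training rmse < test rmse)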
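
For reference, both scores used in this section can be reproduced from their definitions, R² = 1 - SS_res / SS_tot and RMSE = sqrt(mean squared error). The sketch below uses hypothetical helper names and made-up numbers, and also shows how R² becomes negative when predictions are worse than simply predicting the mean.

```python
from math import sqrt

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def rmse(actual, predicted):
    """Root mean squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))

# Hypothetical values
actual = [10.0, 20.0, 30.0, 40.0]
close_predictions = [12.0, 18.0, 33.0, 39.0]
reversed_predictions = [40.0, 30.0, 20.0, 10.0]

r2_close = r_squared(actual, close_predictions)
r2_reversed = r_squared(actual, reversed_predictions)
print(r2_close, rmse(actual, close_predictions))
print(r2_reversed, rmse(actual, reversed_predictions))
```
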

########## 4. Time series analysis
# Time series data : data measured at successive points in time
'''
    1. https://www.blockchain.com/ko/charts/market-price
    2. Choose the csv format and download the file
    3. Save market-price to the data folder
    4. Create market-price-test.csv containing only the 5 data points from 2022-7-7 to 7-11
        market-price.csv : training data
        market-price-test.csv : test data, only 5 records
'''
file_path = "data/market-price.csv"
bitcoin_df = pd.read_csv(file_path, names=['day', 'price'], header=0)
bitcoin_df.info()
# Convert the day column to datetime
bitcoin_df["day"] = pd.to_datetime(bitcoin_df["day"])
bitcoin_df.info()
# Use the day column as the index
bitcoin_df.set_index("day", inplace=True)
bitcoin_df.info()
# Visualization
bitcoin_df.plot()

# Choose the algorithm
#
'''
    ARIMA : an algorithm for time series analysis
        AR(AutoRegressive) : computes the current value from past values
        I(Integrated) : differencing, the current value minus the previous value
        MA(Moving Average) : infers the current term from the errors of previous terms
'''
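
The differencing step behind the "I" in ARIMA can be illustrated without any library. The sketch below, with hypothetical helper names and made-up prices, takes the first difference of a short series and then rebuilds the original by cumulative summation, which is in essence how forecasts of the differenced series are mapped back to price levels.

```python
def difference(series):
    """First difference: each value minus the previous one."""
    return [curr - prev for prev, curr in zip(series, series[1:])]

def undifference(first_value, diffs):
    """Rebuild the original series from its first value and its differences."""
    restored = [first_value]
    for d in diffs:
        restored.append(restored[-1] + d)
    return restored

# Hypothetical prices
prices = [100.0, 103.0, 101.0, 106.0, 110.0]
diffs = difference(prices)
restored = undifference(prices[0], diffs)
print(diffs)
print(restored)
```
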
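# Legacy ARIMA class: it was removed in statsmodels 0.13 in favor of
# statsmodels.tsa.arima.model.ARIMA, so the code below needs an older statsmodels version
from statsmodels.tsa.arima_model import ARIMA
'''
     bitcoin_df.price.values : the data to analyze
     order=(2, 1, 2)
       2 : AR order. The previous 2 observations are used to infer the current value
       1 : I order. The series is differenced once (current value - previous value)
       2 : MA order. The errors of the previous 2 terms are used to infer the current value
'''
model = ARIMA(bitcoin_df.price.values, order=(2, 1, 2))
import statsmodels
statsmodels.__version__ # check the installed version
model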
model_fit = model.fit(trend='c', full_output=True, disp=True) # fit the model
fig = model_fit.plot_predict()
# resid : residuals, the part of the series the model does not explain
residuals = pd.DataFrame(model_fit.resid)
residuals.plot()

forecast_data = model_fit.forecast(steps=5) # forecast the next 5 days
forecast_data # forecast output
'''
    1st array: forecast values for the 5 days
    2nd array: standard errors
    3rd array: [lower bound, upper bound] of each forecast
'''

pred_y = forecast_data[0].tolist()
pred_y_lower = []
pred_y_upper = []
for low_up in forecast_data[2] :
    pred_y_lower.append(low_up[0])
    pred_y_upper.append(low_up[1])
        
plt.plot(pred_y, color="gold") # 5-day forecast
plt.plot(pred_y_lower, color="red") # 5-day forecast lower bound
plt.plot(pred_y_upper, color="blue") # 5-day forecast upper bound

# Read the actual data
bitcoin_test_df = pd.read_csv\
    ("data/market-price-test.csv", names=["ds", "y"], header=0)
test_y = bitcoin_test_df.y.values
test_y
plt.plot(pred_y, color="gold") # 5-day forecast
plt.plot(test_y, color="green") # 5-day actual values
plt.plot(pred_y_lower, color="red") # 5-day forecast lower bound
plt.plot(pred_y_upper, color="blue") # 5-day forecast upper bound

License: Attribution, NonCommercial, NoDerivatives
