Data analysis using movie attendance data
In [3]:
# Import packages
import pandas as pd
import lightgbm as lgb
In [11]:
# Load the train, test, and submission data
train = pd.read_csv('data/movies_train.csv')
test = pd.read_csv('data/movies_test.csv')
submission = pd.read_csv('data/submission.csv')
In [12]:
# Show the first 5 rows of the train data
train.head()
Out[12]:
| | title | distributor | genre | release_time | time | screening_rat | director | dir_prev_bfnum | dir_prev_num | num_staff | num_actor | box_off_num |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 개들의 전쟁 | 롯데엔터테인먼트 | 액션 | 2012-11-22 | 96 | 청소년 관람불가 | 조병옥 | NaN | 0 | 91 | 2 | 23398 |
1 | 내부자들 | (주)쇼박스 | 느와르 | 2015-11-19 | 130 | 청소년 관람불가 | 우민호 | 1161602.50 | 2 | 387 | 3 | 7072501 |
2 | 은밀하게 위대하게 | (주)쇼박스 | 액션 | 2013-06-05 | 123 | 15세 관람가 | 장철수 | 220775.25 | 4 | 343 | 4 | 6959083 |
3 | 나는 공무원이다 | (주)NEW | 코미디 | 2012-07-12 | 101 | 전체 관람가 | 구자홍 | 23894.00 | 2 | 20 | 6 | 217866 |
4 | 불량남녀 | 쇼박스(주)미디어플렉스 | 코미디 | 2010-11-04 | 108 | 15세 관람가 | 신근호 | 1.00 | 1 | 251 | 2 | 483387 |
In [13]:
# Show the first 5 rows of the test data
test.head()
Out[13]:
| | title | distributor | genre | release_time | time | screening_rat | director | dir_prev_bfnum | dir_prev_num | num_staff | num_actor |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 용서는 없다 | 시네마서비스 | 느와르 | 2010-01-07 | 125 | 청소년 관람불가 | 김형준 | 3.005290e+05 | 2 | 304 | 3 |
1 | 아빠가 여자를 좋아해 | (주)쇼박스 | 멜로/로맨스 | 2010-01-14 | 113 | 12세 관람가 | 이광재 | 3.427002e+05 | 4 | 275 | 3 |
2 | 하모니 | CJ 엔터테인먼트 | 드라마 | 2010-01-28 | 115 | 12세 관람가 | 강대규 | 4.206611e+06 | 3 | 419 | 7 |
3 | 의형제 | (주)쇼박스 | 액션 | 2010-02-04 | 116 | 15세 관람가 | 장훈 | 6.913420e+05 | 2 | 408 | 2 |
4 | 평행 이론 | CJ 엔터테인먼트 | 공포 | 2010-02-18 | 110 | 15세 관람가 | 권호영 | 3.173800e+04 | 1 | 380 | 1 |
In [14]:
# Show the first 5 rows of the submission data
submission.head()
Out[14]:
| | title | box_off_num |
---|---|---|
0 | 용서는 없다 | 0 |
1 | 아빠가 여자를 좋아해 | 0 |
2 | 하모니 | 0 |
3 | 의형제 | 0 |
4 | 평행 이론 | 0 |
In [15]:
# Show the last 5 rows of the train data
train.tail()
Out[15]:
| | title | distributor | genre | release_time | time | screening_rat | director | dir_prev_bfnum | dir_prev_num | num_staff | num_actor | box_off_num |
---|---|---|---|---|---|---|---|---|---|---|---|---|
595 | 해무 | (주)NEW | 드라마 | 2014-08-13 | 111 | 청소년 관람불가 | 심성보 | 3833.0 | 1 | 510 | 7 | 1475091 |
596 | 파파로티 | (주)쇼박스 | 드라마 | 2013-03-14 | 127 | 15세 관람가 | 윤종찬 | 496061.0 | 1 | 286 | 6 | 1716438 |
597 | 살인의 강 | (주)마운틴픽쳐스 | 공포 | 2010-09-30 | 99 | 청소년 관람불가 | 김대현 | NaN | 0 | 123 | 4 | 2475 |
598 | 악의 연대기 | CJ 엔터테인먼트 | 느와르 | 2015-05-14 | 102 | 15세 관람가 | 백운학 | NaN | 0 | 431 | 4 | 2192525 |
599 | 베를린 | CJ 엔터테인먼트 | 액션 | 2013-01-30 | 120 | 15세 관람가 | 류승완 | NaN | 0 | 363 | 5 | 7166532 |
In [16]:
# Print the number of rows and columns of each DataFrame
print(train.shape)
print(test.shape)
print(submission.shape)
(600, 12)
(243, 11)
(243, 2)
In [18]:
# Summary of the train data
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 600 non-null object
1 distributor 600 non-null object
2 genre 600 non-null object
3 release_time 600 non-null object
4 time 600 non-null int64
5 screening_rat 600 non-null object
6 director 600 non-null object
7 dir_prev_bfnum 270 non-null float64
8 dir_prev_num 600 non-null int64
9 num_staff 600 non-null int64
10 num_actor 600 non-null int64
11 box_off_num 600 non-null int64
dtypes: float64(1), int64(5), object(6)
memory usage: 56.4+ KB
In [19]:
# Summary of the test data
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243 entries, 0 to 242
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 243 non-null object
1 distributor 243 non-null object
2 genre 243 non-null object
3 release_time 243 non-null object
4 time 243 non-null int64
5 screening_rat 243 non-null object
6 director 243 non-null object
7 dir_prev_bfnum 107 non-null float64
8 dir_prev_num 243 non-null int64
9 num_staff 243 non-null int64
10 num_actor 243 non-null int64
dtypes: float64(1), int64(4), object(6)
memory usage: 21.0+ KB
In [22]:
# Print summary statistics for the train data
# count: number of non-missing values
# mean: mean
# std: standard deviation
# min: minimum (including outliers, i.e. values outside the fences)
# 25%: first quartile (the value one quarter of the way up the sorted data)
# 50%: median (second quartile)
# 75%: third quartile
# max: maximum (including outliers)
train.describe()
Out[22]:
| | time | dir_prev_bfnum | dir_prev_num | num_staff | num_actor | box_off_num |
---|---|---|---|---|---|---|
count | 600.000000 | 2.700000e+02 | 600.000000 | 600.000000 | 600.000000 | 6.000000e+02 |
mean | 100.863333 | 1.050443e+06 | 0.876667 | 151.118333 | 3.706667 | 7.081818e+05 |
std | 18.097528 | 1.791408e+06 | 1.183409 | 165.654671 | 2.446889 | 1.828006e+06 |
min | 45.000000 | 1.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 1.000000e+00 |
25% | 89.000000 | 2.038000e+04 | 0.000000 | 17.000000 | 2.000000 | 1.297250e+03 |
50% | 100.000000 | 4.784236e+05 | 0.000000 | 82.500000 | 3.000000 | 1.259100e+04 |
75% | 114.000000 | 1.286569e+06 | 2.000000 | 264.000000 | 4.000000 | 4.798868e+05 |
max | 180.000000 | 1.761531e+07 | 5.000000 | 869.000000 | 25.000000 | 1.426277e+07 |
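The comment on `min` mentions outliers lying outside the "fences". As a minimal sketch of how such fences are commonly derived from the quartiles (Tukey's 1.5 × IQR rule), the snippet below uses toy data rather than the movie data set; the helper name `iqr_fences` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def iqr_fences(series: pd.Series, k: float = 1.5):
    """Return (lower, upper) Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy data (not the movie data set): one value sits far from the rest.
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])
lower, upper = iqr_fences(toy)

# Values outside the fences are the ones a box plot would draw as outliers.
outliers = toy[(toy < lower) | (toy > upper)]
```

Note that `describe()` reports the raw minimum and maximum, so any such outliers are included in those two rows.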
In [23]:
# Values shown in scientific (e) notation are hard to read,
# so set a fixed number of decimal places for display
pd.options.display.float_format = '{:.1f}'.format
In [25]:
# Check the new float format
train.describe()
Out[25]:
| | time | dir_prev_bfnum | dir_prev_num | num_staff | num_actor | box_off_num |
---|---|---|---|---|---|---|
count | 600.0 | 270.0 | 600.0 | 600.0 | 600.0 | 600.0 |
mean | 100.9 | 1050442.9 | 0.9 | 151.1 | 3.7 | 708181.8 |
std | 18.1 | 1791408.3 | 1.2 | 165.7 | 2.4 | 1828005.9 |
min | 45.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
25% | 89.0 | 20380.0 | 0.0 | 17.0 | 2.0 | 1297.2 |
50% | 100.0 | 478423.6 | 0.0 | 82.5 | 3.0 | 12591.0 |
75% | 114.0 | 1286568.6 | 2.0 | 264.0 | 4.0 | 479886.8 |
max | 180.0 | 17615314.0 | 5.0 | 869.0 | 25.0 | 14262766.0 |
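Setting `pd.options.display.float_format` changes the display globally until it is reset. If the format is only needed for one output, a scoped alternative is `pd.option_context`. The sketch below uses a small toy DataFrame (an assumption for illustration, not the movie data).

```python
import pandas as pd

# Toy frame with values that differ by several orders of magnitude.
df = pd.DataFrame({"value": [1.26e6, 12591.0, 0.5]})

# The format applies only inside the with-block and is restored afterwards.
with pd.option_context("display.float_format", "{:.1f}".format):
    formatted = df.to_string()

print(formatted)
```

With this pattern there is no need for a separate reset step, because the previous setting comes back as soon as the block exits.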
In [26]:
# Check statistics per group
# Take the genre and box_off_num columns from the train data,
# group by genre and average box_off_num,
# then sort by box_off_num in ascending order
train[['genre', 'box_off_num']].groupby('genre').mean().sort_values('box_off_num')
Out[26]:
genre | box_off_num |
---|---|
뮤지컬 | 6627.0 |
다큐멘터리 | 67172.3 |
서스펜스 | 82611.0 |
애니메이션 | 181926.7 |
멜로/로맨스 | 425968.0 |
미스터리 | 527548.2 |
공포 | 590832.5 |
드라마 | 625689.8 |
코미디 | 1193914.0 |
SF | 1788345.7 |
액션 | 2203974.1 |
느와르 | 2263695.1 |
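A group mean alone does not show how many films stand behind it or how skewed the group is. As a hedged sketch, named aggregation can report the mean, median, and count side by side. The data below is a toy example with made-up genre labels, not the movie data set.

```python
import pandas as pd

# Toy data: genre "C" has a single, very large value.
toy = pd.DataFrame({
    "genre": ["A", "A", "B", "B", "B", "C"],
    "box_off_num": [100, 300, 10, 20, 30, 5000],
})

# One row per genre with several statistics, sorted by the mean.
summary = (
    toy.groupby("genre")["box_off_num"]
       .agg(mean="mean", median="median", count="count")
       .sort_values("mean")
)
print(summary)
```

Applied to the train data, the same call would show whether a high or low genre average rests on only a handful of films.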
In [27]:
# Reset the float display format
pd.reset_option('display.float_format')
In [28]:
# Confirm the format was reset
train.describe()
Out[28]:
| | time | dir_prev_bfnum | dir_prev_num | num_staff | num_actor | box_off_num |
---|---|---|---|---|---|---|
count | 600.000000 | 2.700000e+02 | 600.000000 | 600.000000 | 600.000000 | 6.000000e+02 |
mean | 100.863333 | 1.050443e+06 | 0.876667 | 151.118333 | 3.706667 | 7.081818e+05 |
std | 18.097528 | 1.791408e+06 | 1.183409 | 165.654671 | 2.446889 | 1.828006e+06 |
min | 45.000000 | 1.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 1.000000e+00 |
25% | 89.000000 | 2.038000e+04 | 0.000000 | 17.000000 | 2.000000 | 1.297250e+03 |
50% | 100.000000 | 4.784236e+05 | 0.000000 | 82.500000 | 3.000000 | 1.259100e+04 |
75% | 114.000000 | 1.286569e+06 | 2.000000 | 264.000000 | 4.000000 | 4.798868e+05 |
max | 180.000000 | 1.761531e+07 | 5.000000 | 869.000000 | 25.000000 | 1.426277e+07 |
In [29]:
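# Check the correlation coefficients (a value from -1 to 1 measuring how strongly two variables move together linearly)
# As a rough rule of thumb, an absolute value of 0.4 or more is often read as a meaningful correlation.
train.corr()
Out[29]:
| | time | dir_prev_bfnum | dir_prev_num | num_staff | num_actor | box_off_num |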
---|---|---|---|---|---|---|
time | 1.000000 | 0.264675 | 0.306727 | 0.623205 | 0.114153 | 0.441452 |
dir_prev_bfnum | 0.264675 | 1.000000 | 0.131822 | 0.323521 | 0.083818 | 0.283184 |
dir_prev_num | 0.306727 | 0.131822 | 1.000000 | 0.450706 | 0.014006 | 0.259674 |
num_staff | 0.623205 | 0.323521 | 0.450706 | 1.000000 | 0.077871 | 0.544265 |
num_actor | 0.114153 | 0.083818 | 0.014006 | 0.077871 | 1.000000 | 0.111179 |
box_off_num | 0.441452 | 0.283184 | 0.259674 | 0.544265 | 0.111179 | 1.000000 |
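The output above only covers the numeric columns. In pandas 2.0 and later, `DataFrame.corr()` no longer drops string columns silently, so calling it on a frame that contains columns such as `title` raises an error. A minimal sketch of the explicit form, using a toy frame whose column names mirror the data set (the values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "time": [90, 100, 110, 120],
    "num_staff": [10, 20, 30, 40],
    "box_off_num": [400, 300, 200, 100],
})

# Restrict the computation to numeric columns explicitly.
corr = toy.corr(numeric_only=True)

# Correlation of every feature with the target, weakest to strongest.
target_corr = corr["box_off_num"].drop("box_off_num").sort_values()
print(target_corr)
```

The same `numeric_only=True` argument can be passed in the heatmap call, for example `sns.heatmap(train.corr(numeric_only=True), annot=True)`.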
In [30]:
# Visualize the correlations as a heatmap
import seaborn as sns
sns.heatmap(train.corr(), annot = True)
Out[30]:
<AxesSubplot:>
In [37]:
# Check missing values
# Count the missing values per column and divide by the number of rows (600)
# to get the fraction missing (0.55 means 55%)
train.isna().sum() / 600
Out[37]:
title 0.00
distributor 0.00
genre 0.00
release_time 0.00
time 0.00
screening_rat 0.00
director 0.00
dir_prev_bfnum 0.55
dir_prev_num 0.00
num_staff 0.00
num_actor 0.00
box_off_num 0.00
dtype: float64
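Dividing by a hard-coded 600 only works for this exact training set. A small sketch of an equivalent that adapts to any number of rows: because `isna()` returns booleans, their mean is the fraction of missing values. The frame below is toy data for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "dir_prev_bfnum": [np.nan, 10.0, np.nan, 5.0],
    "dir_prev_num": [0, 2, 0, 1],
})

missing_ratio = toy.isna().mean()        # fraction between 0 and 1
missing_percent = missing_ratio * 100    # the same figure as a percentage
print(missing_percent)
```

The same expression works unchanged on `test`, which has 243 rows instead of 600.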
In [39]:
# Select the rows of the train data where dir_prev_bfnum is missing
# and sum their dir_prev_num values (a total of 0 means none of these directors has a previous film)
train[train['dir_prev_bfnum'].isna()]['dir_prev_num'].sum()
Out[39]:
0
In [41]:
# Missing values are currently stored as NaN
# Replace NaN with 0
train['dir_prev_bfnum'].fillna(0, inplace = True)
In [46]:
# Confirm that the missing values were replaced
train.isna().sum()
Out[46]:
title 0
distributor 0
genre 0
release_time 0
time 0
screening_rat 0
director 0
dir_prev_bfnum 0
dir_prev_num 0
num_staff 0
num_actor 0
box_off_num 0
dtype: int64
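Calling `fillna(..., inplace=True)` on a column selection triggers chained-assignment warnings in recent pandas releases and may not update the parent DataFrame when copy-on-write is enabled. A hedged sketch of the assignment form, together with an explicit check of the assumption that justifies filling with 0, is shown below on toy data whose column names mirror the data set.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "dir_prev_bfnum": [np.nan, 1161602.5, np.nan],
    "dir_prev_num": [0, 2, 0],
})

# Assumption behind filling with 0: every row with a missing average
# belongs to a director with no previous films.
missing_rows = toy["dir_prev_bfnum"].isna()
no_prior_films = (toy.loc[missing_rows, "dir_prev_num"] == 0).all()

# Assign the filled column back instead of using inplace=True on a selection.
toy["dir_prev_bfnum"] = toy["dir_prev_bfnum"].fillna(0)
```

If `no_prior_films` were ever False, filling with 0 would hide genuinely unknown values, so the check is worth keeping next to the fill.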
In [47]:
# Run the same check on the test data: sum dir_prev_num over the rows where dir_prev_bfnum is missing
test[test['dir_prev_bfnum'].isna()]['dir_prev_num'].sum()
Out[47]:
0
In [48]:
# Fill the missing values in the test data with 0 in the same way
test['dir_prev_bfnum'].fillna(0, inplace = True)
In [49]:
# Define the model
model = lgb.LGBMRegressor(random_state = 777, n_estimators = 1000)
In [50]:
# features are the inputs (X), target is the output (y)
features = ['time', 'dir_prev_num', 'num_staff', 'num_actor']
target = ['box_off_num']
In [51]:
# Assign the train and test data to variables
X_train, X_test, y_train = train[features], test[features], train[target]
LightGBM
In [52]:
# Fit the model on the train data
model.fit(X_train, y_train)
Out[52]:
LGBMRegressor(n_estimators=1000, random_state=777)
In [53]:
# Store a copy of the submission data in a new variable
singleLGBM = submission.copy()
In [55]:
# Show the first 5 rows of singleLGBM
singleLGBM.head()
Out[55]:
| | title | box_off_num |
---|---|---|
0 | 용서는 없다 | 0 |
1 | 아빠가 여자를 좋아해 | 0 |
2 | 하모니 | 0 |
3 | 의형제 | 0 |
4 | 평행 이론 | 0 |
In [61]:
# Write the predictions into singleLGBM['box_off_num']
singleLGBM['box_off_num'] = model.predict(X_test)
# Set a fixed number of decimal places for readability
pd.options.display.float_format = '{:.1f}'.format
# Check the result
singleLGBM
Out[61]:
| | title | box_off_num |
---|---|---|
0 | 용서는 없다 | 2817995.2 |
1 | 아빠가 여자를 좋아해 | 375377.2 |
2 | 하모니 | -569324.3 |
3 | 의형제 | 1581189.0 |
4 | 평행 이론 | -527780.6 |
... | ... | ... |
238 | 해에게서 소년에게 | 500784.4 |
239 | 울보 권투부 | 1013858.4 |
240 | 어떤살인 | 1682067.7 |
241 | 말하지 못한 비밀 | 300216.3 |
242 | 조선안방 스캔들-칠거지악 2 | 11390.0 |
243 rows × 2 columns
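Two of the first five predictions (하모니 and 평행 이론) are negative, which is impossible for an attendance count. Two common remedies are sketched below as assumptions, not as steps from the original notebook: clipping predictions at zero, or training on `log1p` of the target and mapping predictions back with `expm1`. The arrays reuse values printed earlier in this notebook.

```python
import numpy as np

# First five predictions from the output above.
raw_preds = np.array([2817995.2, 375377.2, -569324.3, 1581189.0, -527780.6])

# Option 1: clip impossible negative counts to zero after prediction.
clipped = np.clip(raw_preds, 0, None)

# Option 2 (assumes the model is refit on the transformed target):
# train on log1p(y), then invert predictions with expm1.
y = np.array([23398, 7072501, 6959083, 217866, 483387])  # box_off_num from train.head()
y_log = np.log1p(y)
y_back = np.expm1(y_log)  # round trip recovers the original scale
```

Clipping is the smaller change; the log transform also reduces the influence of the few very large box-office values on the squared-error loss.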
In [62]:
# Save the result as a CSV file
singleLGBM.to_csv('data/singleLGBM.csv', index = False)
k-fold cross validation
In [66]:
# Import packages
from sklearn.model_selection import KFold
# Use 5 splits and shuffle the data
k_fold = KFold(n_splits = 5, shuffle = True, random_state = 777)
In [67]:
# Each split of X_train has 480 training rows and 120 validation rows
for train_idx, val_idx in k_fold.split(X_train):
    print(len(train_idx), len(val_idx))
    break
480 120
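The loop above prints only the first split. As a small sketch, the check below confirms that all five folds have the same 480/120 sizes and that every row lands in a validation set exactly once. It uses a placeholder array with 600 rows instead of the real `X_train`.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.zeros((600, 4))  # placeholder with the same number of rows as the training set
k_fold = KFold(n_splits=5, shuffle=True, random_state=777)

# Sizes of the (train, validation) index arrays for every fold.
fold_sizes = [(len(tr), len(va)) for tr, va in k_fold.split(X)]

# Collect all validation indices; together they should cover each row once.
val_indices = np.concatenate([va for _, va in k_fold.split(X)])
print(fold_sizes)
```

Because `random_state` is fixed, calling `split` twice yields the same folds, which is what makes the two passes above consistent.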
In [68]:
# Train LGBMRegressor models with cross validation
# NOTE: fit() returns the estimator itself, so this loop refits one object five times
# and `models` ends up holding five references to the same (last-fit) model.
model = lgb.LGBMRegressor(random_state = 777, n_estimators = 1000)
models = []
for train_idx, val_idx in k_fold.split(X_train):
    x_t = X_train.iloc[train_idx]
    y_t = y_train.iloc[train_idx]
    x_val = X_train.iloc[val_idx]
    y_val = y_train.iloc[val_idx]
    models.append(model.fit(x_t, y_t, eval_set=(x_val, y_val), early_stopping_rounds = 100, verbose = 100))
C:\Users\Jung_dayoung\anaconda3\lib\site-packages\lightgbm\sklearn.py:726: UserWarning: 'early_stopping_rounds' argument is deprecated and will be removed in a future release of LightGBM. Pass 'early_stopping()' callback via 'callbacks' argument instead.
_log_warning("'early_stopping_rounds' argument is deprecated and will be removed in a future release of LightGBM. "
C:\Users\Jung_dayoung\anaconda3\lib\site-packages\lightgbm\sklearn.py:736: UserWarning: 'verbose' argument is deprecated and will be removed in a future release of LightGBM. Pass 'log_evaluation()' callback via 'callbacks' argument instead.
_log_warning("'verbose' argument is deprecated and will be removed in a future release of LightGBM. "
[100] valid_0's l2: 2.70572e+12
[100] valid_0's l2: 3.90847e+12
[100] valid_0's l2: 3.50344e+12
[100] valid_0's l2: 1.45977e+12
[100] valid_0's l2: 1.77214e+12
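In the loop above a single `model` object is refit on every fold, and `fit` returns that same object, so the list ends up with five references to one estimator. A minimal sketch of the usual pattern, creating a fresh estimator per fold, follows. To stay self-contained it uses a toy stand-in estimator (`MeanRegressor`) and a helper (`fit_per_fold`); both names are hypothetical and not part of LightGBM or the original notebook.

```python
import numpy as np
from sklearn.model_selection import KFold


class MeanRegressor:
    """Toy stand-in estimator (hypothetical): predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


def fit_per_fold(make_model, X, y, n_splits=5, seed=777):
    """Fit one freshly created model per fold and return all of them."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    models = []
    for train_idx, _ in kf.split(X):
        model = make_model()  # new estimator object for every fold
        model.fit(X[train_idx], y[train_idx])
        models.append(model)
    return models


X = np.arange(20, dtype=float).reshape(-1, 1)
y = np.arange(20, dtype=float)
models = fit_per_fold(MeanRegressor, X, y)

# Five distinct objects, unlike a list built from one refitted estimator.
distinct_objects = len({id(m) for m in models})
```

With LightGBM, `make_model` could be something like `lambda: lgb.LGBMRegressor(random_state=777, n_estimators=1000)`. As the deprecation warnings in the output suggest, early stopping and logging would then be passed to `fit` through `callbacks=[lgb.early_stopping(100), lgb.log_evaluation(100)]` together with `eval_set=[(x_val, y_val)]`.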
In [69]:
models
Out[69]:
[LGBMRegressor(n_estimators=1000, random_state=777),
LGBMRegressor(n_estimators=1000, random_state=777),
LGBMRegressor(n_estimators=1000, random_state=777),
LGBMRegressor(n_estimators=1000, random_state=777),
LGBMRegressor(n_estimators=1000, random_state=777)]
In [71]:
# Collect the predictions from each model
preds = []
for model in models:
    preds.append(model.predict(X_test))
len(preds)
Out[71]:
5
In [72]:
# Inspect the predictions
preds
Out[72]:
[array([3367422.08211024, 961138.88337016, 1097929.67851313,
2097270.81999921, 781476.31131047, 123133.23396977,
84085.01896248, 199222.92670303, 124854.97973097,
1072684.78820647, 657040.55769984, 1644701.47160779,
924735.72080619, 191699.58415836, 723218.39948755,
2395613.42888462, 113180.54906592, 1882765.92812296,
97286.47282983, 417270.42113431, 161482.3367577 ,
63495.73363115, 598057.42337284, 249411.25868881,
308424.71884273, 1319518.60852241, 202273.52740684,
1076683.13768137, 491636.41945325, 183406.6380314 ,
1740233.45816734, 86696.17528125, 440430.29773088,
2040944.13061099, 113375.45695331, 531746.29484356,
106207.35971699, 169560.79581187, 158101.0895316 ,
95037.29701059, 466355.50624629, 197548.63265325,
198810.0901353 , 682743.4340129 , 679265.22397749,
617812.91125278, 219182.85833837, 83712.10602149,
1206257.86593038, 213886.45118276, 225829.13565632,
573638.25784044, 213008.72404464, 1256363.95636108,
1114803.4971861 , 743531.85320002, 491360.56067236,
844903.11979141, 1294842.58551457, 3224861.58056056,
620636.85839868, 1997931.01104478, 425408.70724227,
187911.96484382, 146932.14876274, 60290.34778046,
1553051.98103674, 61273.32234969, 434403.58150991,
295641.77082639, 3270724.75199692, 956695.86295249,
52421.39380137, 1193829.08957371, 109115.94774032,
114532.36574566, 1889840.31759081, 263117.7221664 ,
110722.44800397, 129405.69463079, 295288.27657346,
2208289.14524844, 127156.40131508, 2356552.01519359,
52421.39380137, 1582094.56763018, 450263.02347743,
180314.7209139 , 207866.80825891, 237980.96274482,
92809.10191796, 217726.13148671, 588167.04361702,
2269569.15414991, 413549.08320869, 504238.61127942,
121235.38111888, 294348.6064813 , 365600.03475309,
1021310.81474734, 1255348.78654001, 1121695.37239584,
236858.1355381 , 180837.57476421, 390892.54870688,
112045.06953824, 3091294.61560971, 71489.84756224,
3949551.73068699, 117269.53815136, 76582.3994148 ,
1933601.27793109, 1675206.65522009, 294348.6064813 ,
97940.27258456, 945839.01956416, 610770.20862757,
1804885.34137216, 199766.90646407, 70766.27122776,
81754.00329429, 1021310.81474734, 846364.16218315,
77206.4852859 , 598474.09814662, 169200.18001164,
73757.25966134, 68560.19329018, 363573.18970532,
439100.02360785, 53273.19743239, 883179.93430519,
96841.00528099, 69705.57216142, 720189.94864194,
4682200.86959811, 157547.86592702, 598577.33917498,
103530.67260964, 219913.31885361, 69705.57216142,
3393838.48355355, 3747750.51209052, 137237.96352743,
205694.9146793 , 1343063.74316061, 105778.55398733,
113406.67661411, 2090467.44678197, 676486.18009377,
1653182.84275506, 151567.24115244, 82132.69602781,
450200.37816199, 123824.9061514 , 841869.56483077,
95800.41312684, 118291.73959934, 2085585.62459758,
2372839.45720666, 247688.59858277, 1334681.62533764,
101692.7552419 , 503472.70498147, 64636.19839218,
320805.50593895, 2623266.33617439, 174088.14927962,
72116.10600518, 80293.56098003, 89828.86338213,
219193.02374581, 328023.58124177, 60290.34778046,
516294.37658866, 236774.09865655, 163101.99672652,
95037.29701059, 4401076.73195112, 3540272.0939466 ,
80652.10214941, 64007.05829165, 91088.74649744,
451288.50611234, 2108891.7577948 , 222324.39435632,
230969.76551419, 119869.73646973, 202629.14049356,
3372525.98485469, 194055.24146137, 82570.78055486,
107755.73232124, 98100.67856341, 64007.05829165,
1960017.88124655, 439387.54981251, 1311678.25959343,
52266.56751161, 612357.4359533 , 277611.14005727,
100834.87848026, 52421.39380137, 320518.12600116,
622991.23928914, 905631.603816 , 2927834.24849227,
562358.56357488, 111381.83047422, 781476.31131047,
798596.64645281, 2310966.59797285, 5133725.87086224,
1281440.30372974, 171353.72979582, 89630.73883889,
1555941.58912764, 124073.12498191, 66108.35792552,
201459.85880026, 108993.88378627, 3134539.90969621,
148239.45542642, 53451.2389687 , 1307718.56304709,
214406.20173983, 194835.86837801, 3367422.08211024,
199766.90646407, 3823577.41600747, 68447.35094925,
739960.30606392, 506712.38290343, 150255.49362263,
62093.65018694, 64707.81660264, 151784.52074685,
80757.16568265, 177095.84545845, 210255.47503947,
360547.62990175, 181508.22558258, 55171.21699634]),
 array([...]),
 array([...]),
 array([...]),
 array([...])]  # each of these four arrays is identical to the first array
In [73]:
import numpy as np
# Copy submission into kfoldLightGBM
kfoldLightGBM = submission.copy()
# Store the fold-averaged predictions in kfoldLightGBM['box_off_num']
kfoldLightGBM['box_off_num'] = np.mean(preds, axis = 0)
# Show the first 5 rows of kfoldLightGBM
kfoldLightGBM.head()
Out[73]:
| | title | box_off_num |
---|---|---|
0 | 용서는 없다 | 3367422.1 |
1 | 아빠가 여자를 좋아해 | 961138.9 |
2 | 하모니 | 1097929.7 |
3 | 의형제 | 2097270.8 |
4 | 평행 이론 | 781476.3 |
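The call `np.mean(preds, axis = 0)` averages position by position across the fold models, so each test row gets one value. A minimal sketch of that operation in plain Python, using made-up numbers (the three prediction lists and the helper name `mean_over_folds` are hypothetical, not from the notebook):

```python
# Hypothetical predictions from three fold models for four test rows
preds = [
    [100.0, 200.0, 300.0, 400.0],
    [110.0, 190.0, 330.0, 380.0],
    [90.0, 210.0, 270.0, 420.0],
]

def mean_over_folds(fold_preds):
    """Average position-wise across folds, like np.mean(..., axis=0) on a 2-D input."""
    return [sum(column) / len(column) for column in zip(*fold_preds)]

averaged = mean_over_folds(preds)
print(averaged)
```

The result has one entry per test row, not one per fold, which is why it can be assigned directly to the `box_off_num` column.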
In [76]:
# Save the result to kfoldLightGBM.csv
kfoldLightGBM.to_csv('data/kfoldLightGBM.csv', index = False)
feature engineering¶
In [77]:
features
Out[77]:
['time', 'dir_prev_num', 'num_staff', 'num_actor']
In [80]:
train.columns
Out[80]:
Index(['title', 'distributor', 'genre', 'release_time', 'time',
'screening_rat', 'director', 'dir_prev_bfnum', 'dir_prev_num',
'num_staff', 'num_actor', 'box_off_num'],
dtype='object')
In [81]:
train.genre
Out[81]:
0 액션
1 느와르
2 액션
3 코미디
4 코미디
...
595 드라마
596 드라마
597 공포
598 느와르
599 액션
Name: genre, Length: 600, dtype: object
In [82]:
# Convert the genre values of the train data to integers
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
train['genre'] = le.fit_transform(train['genre'])
train['genre']
Out[82]:
0 10
1 2
2 10
3 11
4 11
..
595 4
596 4
597 1
598 2
599 10
Name: genre, Length: 600, dtype: int32
In [90]:
# Convert the genre values of the test data to integers (reusing the encoder fitted on train)
test['genre'] = le.transform(test['genre'])
test['genre']
Out[90]:
0 2
1 5
2 4
3 10
4 1
..
238 4
239 3
240 2
241 4
242 5
Name: genre, Length: 243, dtype: int32
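The encoder is fitted on the train data only and then reused on the test data, so both sets share the same label-to-integer mapping. A rough sketch of that idea in plain Python follows. The helper names and the `unknown=-1` fallback are assumptions of this sketch; the real `LabelEncoder.transform` raises a `ValueError` for a label it did not see during `fit` instead of returning a placeholder. The codes also differ from the notebook output because only a handful of genres are used here.

```python
def fit_label_mapping(values):
    """Map each distinct label to an integer in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

def encode(values, mapping, unknown=-1):
    """Encode labels with a fitted mapping; unseen labels get `unknown` (sketch-only behavior)."""
    return [mapping.get(value, unknown) for value in values]

# A few genre values taken from the tables in this post
train_genre = ['액션', '느와르', '액션', '코미디', '코미디', '드라마', '공포']
test_genre = ['느와르', '멜로/로맨스', '드라마', '액션', '공포']

mapping = fit_label_mapping(train_genre)   # fitted on train only
train_codes = encode(train_genre, mapping)
test_codes = encode(test_genre, mapping)   # reuses the train mapping

print(mapping)
print(test_codes)
```

In this toy split '멜로/로맨스' never appears in the train list, so it falls back to -1. In the notebook the test genres were all present in train, which is why `le.transform` ran without error.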
In [94]:
# Redefine features
features = ['time', 'dir_prev_num', 'num_staff', 'num_actor', 'dir_prev_bfnum', 'genre']
In [95]:
# Assign the variables needed for training
X_train, X_test, y_train = train[features], test[features], train[target]
In [96]:
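# Train LGBMRegressor with k-fold
# Note: early_stopping_rounds and verbose are deprecated in this LightGBM version (see the
# warnings below); the callbacks argument with lgb.early_stopping() and lgb.log_evaluation()
# is the suggested replacement.
model = lgb.LGBMRegressor(random_state = 777, n_estimators = 1000)
models = []
for train_idx, val_idx in k_fold.split(X_train):
    x_t = X_train.iloc[train_idx]
    y_t = y_train.iloc[train_idx]
    x_val = X_train.iloc[val_idx]
    y_val = y_train.iloc[val_idx]
    # model.fit returns the estimator itself, so every entry appended here is the same object
    models.append(model.fit(x_t, y_t, eval_set = (x_val, y_val), early_stopping_rounds = 100, verbose = 100))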
C:\Users\Jung_dayoung\anaconda3\lib\site-packages\lightgbm\sklearn.py:726: UserWarning: 'early_stopping_rounds' argument is deprecated and will be removed in a future release of LightGBM. Pass 'early_stopping()' callback via 'callbacks' argument instead.
_log_warning("'early_stopping_rounds' argument is deprecated and will be removed in a future release of LightGBM. "
C:\Users\Jung_dayoung\anaconda3\lib\site-packages\lightgbm\sklearn.py:736: UserWarning: 'verbose' argument is deprecated and will be removed in a future release of LightGBM. Pass 'log_evaluation()' callback via 'callbacks' argument instead.
_log_warning("'verbose' argument is deprecated and will be removed in a future release of LightGBM. "
[100] valid_0's l2: 2.62067e+12
[100] valid_0's l2: 4.39227e+12
[100] valid_0's l2: 3.29841e+12
[100] valid_0's l2: 1.56499e+12
[100] valid_0's l2: 1.60118e+12
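One thing to watch in the loop above: `fit` on a scikit-learn style estimator returns the estimator itself, so appending `model.fit(...)` five times stores five references to one object, which ends up holding only the last fold's fit. This is consistent with the k-fold results in this post, where the tail of the last printed prediction array (..., 181508.22558258, 55171.21699634) equals the tail of the averaged `kfoldLightGBM` table (181508.2, 55171.2). The sketch below illustrates the effect with a hypothetical `TinyModel` class rather than LightGBM, and shows one possible alternative: creating a fresh estimator per fold.

```python
class TinyModel:
    """Hypothetical stand-in for an estimator whose fit() returns self."""

    def __init__(self):
        self.mean_ = None

    def fit(self, y):
        # "Training" here is just remembering the mean of the targets
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self):
        return self.mean_

# Made-up target values for three folds
folds = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [100.0, 200.0, 300.0]]

# Pattern from the notebook: one estimator refitted in every fold
shared = TinyModel()
shared_models = [shared.fit(y) for y in folds]

# Alternative: a fresh estimator for each fold
fresh_models = [TinyModel().fit(y) for y in folds]

shared_preds = [m.predict() for m in shared_models]
fresh_preds = [m.predict() for m in fresh_models]

print(shared_preds)  # every entry comes from the last fit
print(fresh_preds)   # one distinct result per fold
```

With LightGBM the same idea would mean constructing `lgb.LGBMRegressor(...)` inside the loop; the submission files in this post were produced with the shared-estimator pattern, so their numbers would change.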
In [97]:
X_test.head()
Out[97]:
| | time | dir_prev_num | num_staff | num_actor | dir_prev_bfnum | genre |
---|---|---|---|---|---|---|
0 | 125 | 2 | 304 | 3 | 300529.0 | 2 |
1 | 113 | 4 | 275 | 3 | 342700.2 | 5 |
2 | 115 | 3 | 419 | 7 | 4206610.7 | 4 |
3 | 116 | 2 | 408 | 2 | 691342.0 | 10 |
4 | 110 | 1 | 380 | 1 | 31738.0 | 1 |
In [98]:
# Predict with X_test
preds = []
for model in models:
    preds.append(model.predict(X_test))
len(preds)
Out[98]:
5
In [101]:
# Copy submission into feLightGBM
feLightGBM = submission.copy()
# Store the mean of the predictions in feLightGBM['box_off_num']
feLightGBM['box_off_num'] = np.mean(preds, axis = 0)
# Save the result to feLightGBM.csv
feLightGBM.to_csv('data/feLightGBM.csv', index = False)
In [102]:
# Import the required package
from sklearn.model_selection import GridSearchCV
In [105]:
# Search for the best parameters
model = lgb.LGBMRegressor(random_state = 777, n_estimators = 1000)
params = {
    'learning_rate': [0.1, 0.01, 0.003],
    'min_child_samples': [20, 30]
}
gs = GridSearchCV(estimator = model,
                  param_grid = params,
                  scoring = 'neg_mean_squared_error',
                  cv = k_fold)
In [106]:
# Fit the grid search
gs.fit(X_train, y_train)
Out[106]:
GridSearchCV(cv=KFold(n_splits=5, random_state=777, shuffle=True),
estimator=LGBMRegressor(n_estimators=1000, random_state=777),
param_grid={'learning_rate': [0.1, 0.01, 0.003],
'min_child_samples': [20, 30]},
scoring='neg_mean_squared_error')
In [107]:
# Print the best parameters
gs.best_params_
Out[107]:
{'learning_rate': 0.003, 'min_child_samples': 30}
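To make the size of this search concrete, the sketch below enumerates the same parameter grid with `itertools.product`. It only illustrates what an exhaustive grid covers and is not how `GridSearchCV` is implemented; the variable names `candidates` and `total_fits` are my own.

```python
from itertools import product

# Same grid as in the notebook
params = {
    'learning_rate': [0.1, 0.01, 0.003],
    'min_child_samples': [20, 30],
}
n_splits = 5  # from KFold(n_splits=5, ...) in the output above

names = list(params)
candidates = [
    dict(zip(names, combo))
    for combo in product(*(params[name] for name in names))
]
total_fits = len(candidates) * n_splits

for candidate in candidates:
    print(candidate)
print(len(candidates), "candidates,", total_fits, "cross-validation fits")
```

Three learning rates times two `min_child_samples` values give 6 candidates, each scored on 5 folds, not counting the final refit on the full training data that `GridSearchCV` performs by default.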
In [108]:
# Train with the tuned parameters
model = lgb.LGBMRegressor(random_state = 777, n_estimators = 1000,
                          learning_rate = 0.003, min_child_samples = 30)
models = []
for train_idx, val_idx in k_fold.split(X_train):
    x_t = X_train.iloc[train_idx]
    y_t = y_train.iloc[train_idx]
    x_val = X_train.iloc[val_idx]
    y_val = y_train.iloc[val_idx]
    models.append(model.fit(x_t, y_t, eval_set = (x_val, y_val), early_stopping_rounds = 100, verbose = 100))
[100] valid_0's l2: 2.56673e+12
[200] valid_0's l2: 2.45583e+12
[300] valid_0's l2: 2.42575e+12
[400] valid_0's l2: 2.43392e+12
[100] valid_0's l2: 4.89194e+12
[200] valid_0's l2: 4.40922e+12
[300] valid_0's l2: 4.19146e+12
[400] valid_0's l2: 4.05951e+12
[500] valid_0's l2: 3.96931e+12
[600] valid_0's l2: 3.91727e+12
[700] valid_0's l2: 3.88462e+12
[800] valid_0's l2: 3.87695e+12
[900] valid_0's l2: 3.87088e+12
[100] valid_0's l2: 3.14361e+12
[200] valid_0's l2: 2.79286e+12
[300] valid_0's l2: 2.59302e+12
[400] valid_0's l2: 2.47608e+12
[500] valid_0's l2: 2.40386e+12
[600] valid_0's l2: 2.36407e+12
[700] valid_0's l2: 2.38505e+12
[100] valid_0's l2: 1.60592e+12
[200] valid_0's l2: 1.40227e+12
[300] valid_0's l2: 1.30053e+12
[400] valid_0's l2: 1.25184e+12
[500] valid_0's l2: 1.23543e+12
[600] valid_0's l2: 1.23595e+12
[100] valid_0's l2: 1.96107e+12
[200] valid_0's l2: 1.75478e+12
[300] valid_0's l2: 1.64513e+12
[400] valid_0's l2: 1.58132e+12
[500] valid_0's l2: 1.54801e+12
[600] valid_0's l2: 1.52159e+12
[700] valid_0's l2: 1.50655e+12
[800] valid_0's l2: 1.49834e+12
[900] valid_0's l2: 1.50018e+12
In [109]:
# Collect the predictions
preds = []
for model in models:
    preds.append(model.predict(X_test))
In [110]:
# Print the best score (negative mean squared error)
gs.best_score_
Out[110]:
-2334525343085.6494
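Because the scoring is `neg_mean_squared_error`, the best score is a negated MSE and is hard to read in audience-count units. A small sketch of converting it to an RMSE, using the value printed above:

```python
import math

best_score = -2334525343085.6494  # gs.best_score_ from the output above

mse = -best_score       # undo the sign flip applied by the scorer
rmse = math.sqrt(mse)   # back to the unit of box_off_num (viewers)

print(f"RMSE: {rmse:,.0f} viewers")
```

The result is on the order of 1.5 million viewers, which gives a more intuitive sense of the cross-validated error than the raw score.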
In [112]:
# Save the predictions to a file
gslgbm = submission.copy()
gslgbm['box_off_num'] = np.mean(preds, axis = 0)
gslgbm.to_csv('data/gslgbm.csv', index = False)
a. lightGBM (base model)¶
In [113]:
singleLGBM
Out[113]:
| | title | box_off_num |
---|---|---|
0 | 용서는 없다 | 2817995.2 |
1 | 아빠가 여자를 좋아해 | 375377.2 |
2 | 하모니 | -569324.3 |
3 | 의형제 | 1581189.0 |
4 | 평행 이론 | -527780.6 |
... | ... | ... |
238 | 해에게서 소년에게 | 500784.4 |
239 | 울보 권투부 | 1013858.4 |
240 | 어떤살인 | 1682067.7 |
241 | 말하지 못한 비밀 | 300216.3 |
242 | 조선안방 스캔들-칠거지악 2 | 11390.0 |
243 rows × 2 columns
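The base model predicts negative audience counts for some rows (for example -569324.3 for 하모니), which cannot occur in reality. One simple option, not something the notebook does, is to clip predictions at zero before submitting. A minimal sketch with values copied from the table above; the helper name `clip_at_zero` is hypothetical:

```python
# First five predictions from the singleLGBM table above
raw_preds = [2817995.2, 375377.2, -569324.3, 1581189.0, -527780.6]

def clip_at_zero(values):
    """Replace negative predictions with 0.0 and leave the rest unchanged."""
    return [max(0.0, value) for value in values]

clipped = clip_at_zero(raw_preds)
print(clipped)
```

On a pandas column the equivalent would be `Series.clip(lower=0)`.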
b. k-fold lightGBM (k-fold model)¶
In [114]:
kfoldLightGBM
Out[114]:
| | title | box_off_num |
---|---|---|
0 | 용서는 없다 | 3367422.1 |
1 | 아빠가 여자를 좋아해 | 961138.9 |
2 | 하모니 | 1097929.7 |
3 | 의형제 | 2097270.8 |
4 | 평행 이론 | 781476.3 |
... | ... | ... |
238 | 해에게서 소년에게 | 177095.8 |
239 | 울보 권투부 | 210255.5 |
240 | 어떤살인 | 360547.6 |
241 | 말하지 못한 비밀 | 181508.2 |
242 | 조선안방 스캔들-칠거지악 2 | 55171.2 |
243 rows × 2 columns
c. feature engineering (fe)¶
In [115]:
feLightGBM
Out[115]:
| | title | box_off_num |
---|---|---|
0 | 용서는 없다 | 3395492.7 |
1 | 아빠가 여자를 좋아해 | 823543.9 |
2 | 하모니 | 1162055.4 |
3 | 의형제 | 2184689.1 |
4 | 평행 이론 | 809328.8 |
... | ... | ... |
238 | 해에게서 소년에게 | 81854.0 |
239 | 울보 권투부 | 54816.4 |
240 | 어떤살인 | 410490.0 |
241 | 말하지 못한 비밀 | 139172.4 |
242 | 조선안방 스캔들-칠거지악 2 | 28897.8 |
243 rows × 2 columns
d. grid search (hyperparameter tuning)¶
In [116]:
gslgbm
Out[116]:
| | title | box_off_num |
---|---|---|
0 | 용서는 없다 | 2974959.7 |
1 | 아빠가 여자를 좋아해 | 982313.1 |
2 | 하모니 | 1283210.4 |
3 | 의형제 | 1681758.5 |
4 | 평행 이론 | 909584.5 |
... | ... | ... |
238 | 해에게서 소년에게 | 78861.3 |
239 | 울보 권투부 | 127602.2 |
240 | 어떤살인 | 447047.3 |
241 | 말하지 못한 비밀 | 276243.3 |
242 | 조선안방 스캔들-칠거지악 2 | 58072.2 |
243 rows × 2 columns
Code for blog upload¶
In [117]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))
Submission results