[ML] Splitting Data with train_test_split

koosco! ・ 2022. 8. 22. 09:57

A model learns from data. If it only ever sees its training data, it can fit that data too closely (overfitting) and then struggle to make accurate predictions when new data arrives.

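For that reason, the dataset first needs to be split into training data and test data. After fitting the model on the training data, we compare its training score with its score on the test data to judge whether the model has learned well.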
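As a rough illustration of that comparison, the sketch below fits a classifier on the training portion and scores it on both portions. The choice of DecisionTreeClassifier, the use of sklearn.datasets.load_iris (instead of the seaborn loader used elsewhere in this post), and the random_state values are assumptions made only for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: load iris from scikit-learn's bundled copy so no download is needed.
X, y = load_iris(return_X_y=True)

# Hold out part of the data; the model never sees it during fitting.
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

# An unpruned tree can memorize the training set, which makes any gap easy to see.
model = DecisionTreeClassifier(random_state=0)
model.fit(train_X, train_y)

train_score = model.score(train_X, train_y)  # accuracy on data the model was fit on
test_score = model.score(test_X, test_y)     # accuracy on held-out data

print(f"train accuracy: {train_score:.3f}")
print(f"test accuracy:  {test_score:.3f}")
```

A training score that is much higher than the test score is a sign of overfitting.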

train_test_split from sklearn.model_selection takes the arrays you pass in and splits them into training data and test data.

import seaborn as sns
from sklearn.model_selection import train_test_split

# Load the iris dataset as a DataFrame
iris = sns.load_dataset('iris')

# The first four columns are the features; 'species' is the target
data = iris[iris.columns[:4]].to_numpy()
target = iris['species'].to_numpy()
train_X, test_X, train_y, test_y = train_test_split(data, target)

train_test_split returns, in order, the training inputs, the test inputs, the training targets, and the test targets.
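The return order can be checked by looking at the shapes of the four arrays. This is a minimal sketch that assumes scikit-learn's bundled load_iris in place of the seaborn loader; the random_state value is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

# Inputs come first (train, then test), followed by targets (train, then test).
print(train_X.shape, test_X.shape)
print(train_y.shape, test_y.shape)
```

With the default settings, 25% of the 150 samples (rounded up to 38) go to the test set, leaving 112 for training.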

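Parameter	Description
*arrays	Accepts lists, ndarrays, scipy-sparse matrices, or DataFrames
test_size	default: None (0.25 is used when train_size is also None); float between 0 and 1: proportion of the dataset to put in the test set; int: number of test samples
train_size	default: None (set to 1 - test_size); float between 0 and 1: proportion of the dataset to put in the training set; int: number of training samples
random_state	Sets the seed used for shuffling
shuffle	Whether to shuffle the data before splitting; if shuffle=False, stratify must be None
stratify	Stratification option; splits the train and test data so that class proportions are preserved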
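The parameters in the table can be tried out as follows. This is only a sketch: it assumes scikit-learn's bundled load_iris instead of the seaborn loader, and the variable names and the random_state value 42 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# A float test_size is a proportion: 20% of 150 samples -> 30 test samples.
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

# Same random_state -> the same shuffle, so the split is reproducible.
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = np.array_equal(a_test, b_test)

# An int test_size is an absolute count of test samples.
c_train, c_test, _, _ = train_test_split(X, y, test_size=30, random_state=42)

# shuffle=False keeps the original order: the test set is the tail of the data.
d_train, d_test, _, _ = train_test_split(X, y, test_size=30, shuffle=False)
tail_is_test = np.array_equal(d_test, X[-30:])

print(len(a_test), len(c_test), same_split, tail_is_test)
```

Fixing random_state is mainly useful when results need to be compared across runs; leaving it as None gives a different split each time.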

* stratify

- stratify makes the split use stratified sampling. When you pass the target array, the train and test data are split so that both follow the class proportions of the target.

import seaborn as sns

iris = sns.load_dataset('iris')
iris['species'].value_counts()

import numpy as np
from sklearn.model_selection import train_test_split

data = iris[iris.columns[:4]].to_numpy()
target = iris['species'].to_numpy()
# Without stratify, the per-class counts in the training set can be uneven
train_X, test_X, train_y, test_y = train_test_split(data, target)
np.unique(train_y, return_counts=True)

import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split

iris = sns.load_dataset('iris')
data = iris[iris.columns[:4]].to_numpy()
target = iris['species'].to_numpy()

# With stratify=target, each species keeps its share of the training set
train_X, test_X, train_y, test_y = train_test_split(data, target, stratify=target)
np.unique(train_y, return_counts=True)
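To see the two behaviours side by side, the sketch below counts classes after a plain split and after a stratified one. It assumes scikit-learn's bundled load_iris (integer labels 0, 1, 2 rather than species names) and an explicit test_size=0.2 so that the class counts divide evenly; neither choice comes from the code above.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 50 samples in each of the 3 classes

# Plain random split: per-class counts in the training set can drift apart.
_, _, plain_train_y, _ = train_test_split(X, y, test_size=0.2, random_state=0)

# Stratified split: each class keeps its share in both parts.
_, _, strat_train_y, strat_test_y = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print("plain train:     ", sorted(Counter(plain_train_y.tolist()).items()))
print("stratified train:", sorted(Counter(strat_train_y.tolist()).items()))
print("stratified test: ", sorted(Counter(strat_test_y.tolist()).items()))
```

Because iris has 50 samples per class, the stratified split puts 40 of each class in the training set and 10 of each in the test set. Stratifying tends to matter most when the classes are imbalanced or the dataset is small.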
