[Big Data Analysis Engineer (빅데이터분석기사)] Practical Exam Prep, Day 4 - Past Exam No. 4

Author: pental

I wrote this post for my own study.
If it raises a copyright issue or any other problem,
please email pental@kakao.com or leave a comment on this post and I will take care of it.

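This post summarizes the parts I felt I needed to study further after taking Professor Jang Eun-jin's (장은진) lecture 빅데이터분석기사-2차실기 (Big Data Analysis Engineer, second-round practical exam).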

Task Type 1, Problem 1

- For the age column of the given data, compute the absolute difference between the third quartile and the first quartile and print it as an integer.

- data : basic1.csv

First, do the basic setup such as import pandas, then inspect the data.

import pandas as pd
df = pd.read_csv("./bigdata_csvfile/basic1.csv")
df

id	age	city	f1	f2	f3	f4	f5
0	id01	2.0	서울	NaN	0	NaN	ENFJ
1	id02	9.0	서울	70.0	1	NaN	ENFJ
2	id03	27.0	서울	61.0	1	NaN	ISTJ
3	id04	75.0	서울	NaN	2	NaN	INFP
4	id05	24.0	서울	85.0	2	NaN	ISFJ
...	...	...	...	...	...	...	...
95	id96	92.0	경기	53.0	1	NaN	ENTJ
96	id97	100.0	경기	NaN	0	NaN	INFP
97	id98	39.0	경기	58.0	2	NaN	INFP
98	id99	1.0	경기	47.0	0	NaN	ESFJ
99	id100	47.0	경기	53.0	0	vip	ESFP

The data has 100 rows and 8 columns.

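The problem asks for the absolute difference between the third and first quartiles of the age column, so I use the quantile function to compute the third quartile (Q3) and the first quartile (Q1).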

Q3 = df['age'].quantile(0.75)
Q1 = df['age'].quantile(0.25)
# print(Q3) # 77.0
# print(Q1) # 26.875

To take the absolute value and print it as an integer, I did the following.

result = int(abs(Q3 - Q1)) # 50
print(result)

Answer: 50

# Alternative answer
Q3 = df.age.quantile(.75)
Q1 = df.age.quantile(.25)
result = abs(Q3 - Q1)
print(int(result))
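
The quantile values above include a non-integer (26.875), which comes from interpolation between neighboring values. As a small illustration, the sketch below uses a made-up four-value series (not basic1.csv) and assumes pandas' default linear interpolation. It also shows that int() drops the fractional part rather than rounding.

```python
import pandas as pd

# Hypothetical toy values, not taken from basic1.csv
ages = pd.Series([10, 20, 30, 45])

# Default interpolation is linear: position = (n - 1) * q
q1 = ages.quantile(0.25)  # position 0.75 -> 10 + 0.75 * (20 - 10)
q3 = ages.quantile(0.75)  # position 2.25 -> 30 + 0.25 * (45 - 30)

iqr = abs(q3 - q1)
print(q1, q3, iqr)  # 17.5 33.75 16.25
print(int(iqr))     # 16 (int() drops the fractional part)
```

If the problem asked for rounding instead of truncation, round() would be the function to use, so it is worth reading the wording carefully.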

Task Type 1, Problem 2

In the given social media data, count the rows where the ratio (loves column + wows column) / (reactions column) is greater than 0.4 and less than 0.5, and type = 'video'.
data : fb.csv

import pandas as pd
df = pd.read_csv("./bigdata_csvfile/fb.csv")
display(df)

id	type	reactions	comments	shares	likes	loves	wows	hahas	sads	angrys
0	1	video	529	512	262	432	92	3	1	1
1	2	photo	150	0	0	150	0	0	0	0
2	3	video	227	236	57	204	21	1	1	0
3	4	photo	111	0	0	111	0	0	0	0
4	5	photo	213	0	0	204	9	0	0	0
...	...	...	...	...	...	...	...	...	...	...
7045	7046	photo	89	0	0	89	0	0	0	0
7046	7047	photo	16	0	0	14	1	0	1	0
7047	7048	photo	2	0	0	1	1	0	0	0
7048	7049	photo	351	12	22	349	2	0	0	0
7049	7050	photo	17	0	0	17	0	0	0	0

This data has 7050 rows and 11 columns.

First, to compute the ratio given in the problem, I created a new `calc` column holding the value of (loves + wows) / reactions.

df['calc'] = (df['loves'] + df['wows']) / df['reactions']
display(df['calc'])

0       0.179584
1       0.000000
2       0.096916
3       0.000000
4       0.042254
          ...   
7045    0.000000
7046    0.062500
7047    0.500000
7048    0.005698
7049    0.000000
Name: calc, Length: 7050, dtype: float64

Next, I wrote out the conditions given in the problem.

cond1 = df['calc'] > 0.4
cond2 = df['calc'] < 0.5
cond3 = df['type'] == 'video'
print(len(df[cond1 & cond2 & cond3])) # 50

I split the conditions into separate cond variables, but the same filter can also be written with loc.

df.loc[(df['calc'] > 0.4) & (df['calc'] < 0.5) & (df['type'] == 'video')]

Answer: 50

# Alternative solution
df['ratio'] = (df.loves + df.wows) / df.reactions
result = (df.type == 'video') & (df.ratio > 0.4) & (df.ratio < 0.5) 
print(result.sum())
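
The two strict comparisons on the ratio can also be written with Series.between. The sketch below uses made-up rows (not fb.csv) and includes the boundary values 0.4 and 0.5 plus a row with zero reactions, to show that all three are excluded. It is one possible variant, not the approach used above.

```python
import pandas as pd

# Hypothetical toy rows, not taken from fb.csv
df = pd.DataFrame({
    "type":      ["video", "photo", "video", "video", "video"],
    "reactions": [20, 20, 2, 0, 10],
    "loves":     [8, 8, 1, 0, 4],
    "wows":      [1, 1, 0, 0, 0],
})

ratio = (df["loves"] + df["wows"]) / df["reactions"]  # 0 / 0 becomes NaN

# inclusive="neither" keeps both bounds strict: 0.4 < ratio < 0.5
in_range = ratio.between(0.4, 0.5, inclusive="neither")
mask = in_range & (df["type"] == "video")

print(int(mask.sum()))  # 1: only the first row qualifies
```

Comparisons against NaN evaluate to False, so rows with zero reactions drop out of the count without extra handling.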

Task Type 1, Problem 3

In the given media data, count the rows where date_added falls in January 2018 and country is United Kingdom alone (a sole production).
data : nf.csv

import pandas as pd
df = pd.read_csv("./bigdata_csvfile/nf.csv")

display(df)

show_id	type	title	director	cast	country	date_added	release_year	rating	duration	listed_in
0	s1	Movie	Dick Johnson Is Dead	Kirsten Johnson	NaN	United States	September 25, 2021	2020	PG-13	90 min
1	s2	TV Show	Blood & Water	NaN	Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...	South Africa	September 24, 2021	2021	TV-MA	2 Seasons
2	s3	TV Show	Ganglands	Julien Leclercq	Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...	NaN	September 24, 2021	2021	TV-MA	1 Season
3	s4	TV Show	Jailbirds New Orleans	NaN	NaN	NaN	September 24, 2021	2021	TV-MA	1 Season
4	s5	TV Show	Kota Factory	NaN	Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...	India	September 24, 2021	2021	TV-MA	2 Seasons
...	...	...	...	...	...	...	...	...	...	...
8802	s8803	Movie	Zodiac	David Fincher	Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...	United States	November 20, 2019	2007	R	158 min
8803	s8804	TV Show	Zombie Dumb	NaN	NaN	NaN	July 1, 2019	2018	TV-Y7	2 Seasons

First, convert the date_added column with the pd.to_datetime function so that pandas parses it into a date type automatically.

df['date_added'] = pd.to_datetime(df['date_added'])
display(df['date_added'])

0      2021-09-25
1      2021-09-24
2      2021-09-24
3      2021-09-24
4      2021-09-24
          ...    
8802   2019-11-20
8803   2019-07-01
8804   2019-11-01
8805   2020-01-11
8806   2019-03-02
Name: date_added, Length: 8807, dtype: datetime64[ns]

After this, the `df['date_added']` column has the `datetime64` dtype. ← this is important

Since both the year and the month are needed, I created two new columns, `df['date_added_month']` and `df['date_added_year']`, using `.dt.month` and `.dt.year` to extract the month and the year.

df['date_added_month'] = df['date_added'].dt.month
df['date_added_year'] = df['date_added'].dt.year

Then I set up three conditions: the month is January, the year is 2018, and the country is United Kingdom.

cond1 = df['date_added_month'] == 1
cond2 = df['date_added_year'] == 2018
cond3 = df['country'] == "United Kingdom"

Combine the conditions with & to filter the rows, then print the count.

result = df[cond1 & cond2 & cond3]
print(len(result)) # 6

Answer: 6

# Alternative solution
df.date_added = pd.to_datetime(df.date_added)
con1 = df['date_added'].dt.year == 2018
con2 = df['date_added'].dt.month == 1
con3 = df['country'] == "United Kingdom"

result = con1 & con2 & con3
print(result.sum())
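
The exact match on country is what enforces "United Kingdom alone": a co-production such as "United Kingdom, United States" is a different string and is excluded. The sketch below illustrates this on made-up rows (not nf.csv). The leading space in one date, the explicit format string, and the missing date are assumptions added for illustration, and the month name parsing assumes an English locale.

```python
import pandas as pd

# Hypothetical toy rows, not taken from nf.csv
df = pd.DataFrame({
    "date_added": ["January 5, 2018", " January 20, 2018", "January 3, 2018",
                   "February 1, 2018", None, "January 9, 2019"],
    "country": ["United Kingdom", "United Kingdom", "United Kingdom, United States",
                "United Kingdom", "United Kingdom", "United Kingdom"],
})

# Strip stray whitespace, then parse with an explicit format; missing values become NaT
added = pd.to_datetime(df["date_added"].str.strip(), format="%B %d, %Y")

mask = (
    (added.dt.year == 2018)
    & (added.dt.month == 1)
    & (df["country"] == "United Kingdom")  # exact match excludes co-productions
)
print(int(mask.sum()))  # 2
```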

Task Type 2

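Using the given data, predict the class of the target variable Segmentation for each ID, then save and submit the predictions.
The submission must contain only two columns, ID and Segmentation, and must not include the index.
The evaluation metric is the macro F1 score, and the submission file name is 4th_test_type2.csv.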
train data : test4_type2_train.csv
test data : test4_type2_test.csv

import pandas as pd

train = pd.read_csv("./bigdata_csvfile/test4_type2_train.csv")
test = pd.read_csv("./bigdata_csvfile/test4_type2_test.csv")

# EDA
display(train.head())
display(test.head())

ID	Gender	Ever_Married	Age	Graduated	Profession	Work_Experience	Spending_Score	Family_Size	Var_1	Segmentation
0	464357	Male	No	40	Yes	Artist	7.0	Low	1.0	Cat_6
1	459624	Male	No	18	No	Healthcare	NaN	Low	5.0	Cat_4
2	462672	Male	No	25	No	Healthcare	7.0	Low	4.0	Cat_4
3	463360	Female	Yes	46	Yes	Artist	2.0	Low	1.0	Cat_6
4	462420	Male	No	27	Yes	Healthcare	1.0	Low	3.0	Cat_6

	ID	Gender	Ever_Married	Age	Graduated	Profession	Work_Experience	Spending_Score	Family_Size	Var_1
0	460406	Male	Yes	36	Yes	Healthcare	3.0	Low	2.0	Cat_6
1	466890	Male	Yes	47	Yes	Artist	0.0	Average	6.0	Cat_7
2	466145	Male	Yes	56	No	Entertainment	1.0	Average	3.0	Cat_6
3	465805	Male	Yes	50	Yes	Engineer	NaN	Low	1.0	Cat_3
4	466137	Female	NaN	31	No	Healthcare	0.0	Low	4.0	Cat_6

These are the `head` outputs of `train` and `test`. Next I do some exploratory data analysis (EDA), checking basic information and missing values.

First, I checked the missing values.

print(train.isnull().sum())
print()
print(test.isnull().sum())

ID                   0
Gender               0
Ever_Married        96
Age                  0
Graduated           51
Profession          64
Work_Experience    513
Spending_Score       0
Family_Size        192
Var_1               50
Segmentation         0
dtype: int64

ID                   0
Gender               0
Ever_Married        22
Age                  0
Graduated           16
Profession          23
Work_Experience    142
Spending_Score       0
Family_Size         68
Var_1               12
dtype: int64

Both train and test contain a large number of missing values. I decided to drop all of them.

# Drop missing values

train.dropna(axis = 0, inplace = True)
test.dropna(axis = 0, inplace = True)

print(train.isnull().sum())
print(test.isnull().sum())


ID                 0
Gender             0
Ever_Married       0
Age                0
Graduated          0
Profession         0
Work_Experience    0
Spending_Score     0
Family_Size        0
Var_1              0
Segmentation       0
dtype: int64
ID                 0
Gender             0
Ever_Married       0
Age                0
Graduated          0
Profession         0
Work_Experience    0
Spending_Score     0
Family_Size        0
Var_1              0
dtype: int64

Using `dropna` with `axis = 0`, I deleted every row that contains a missing value. As a result, the number of rows changed.

`train` had 4881 rows before dropping and has 4018 rows after, so 863 rows were removed.
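
One side effect worth checking is that dropna on `test` also removes the IDs of the dropped rows, so those IDs get no prediction. The sketch below uses made-up frames (not the exam data) to count the affected rows and list the lost IDs. Whether the grader expects one prediction per test ID is an assumption here, since the problem statement above does not say so explicitly.

```python
import pandas as pd

# Hypothetical toy frames, not the exam data
train = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Graduated": ["Yes", None, "No", "Yes"],
    "Family_Size": [1.0, 2.0, None, 3.0],
})
test = pd.DataFrame({
    "ID": [10, 11, 12],
    "Graduated": ["Yes", None, "No"],
})

# Rows with at least one missing value are exactly the rows dropna(axis=0) removes
rows_with_na = int(train.isnull().any(axis=1).sum())
train_dropped = train.dropna(axis=0)
print(len(train), len(train_dropped), rows_with_na)  # 4 2 2

# IDs that would receive no prediction if test rows are dropped
test_dropped = test.dropna(axis=0)
lost_ids = sorted(set(test["ID"].tolist()) - set(test_dropped["ID"].tolist()))
print(lost_ids)  # [11]
```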

Before label encoding, I ran the `info` function to see which columns have the `object` type.

print(train.info())

<class 'pandas.core.frame.DataFrame'>
Index: 4018 entries, 0 to 4879
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               4018 non-null   int64  
 1   Gender           4018 non-null   object 
 2   Ever_Married     4018 non-null   object 
 3   Age              4018 non-null   int64  
 4   Graduated        4018 non-null   object 
 5   Profession       4018 non-null   object 
 6   Work_Experience  4018 non-null   float64
 7   Spending_Score   4018 non-null   object 
 8   Family_Size      4018 non-null   float64
 9   Var_1            4018 non-null   object 
 10  Segmentation     4018 non-null   object 
dtypes: float64(2), int64(2), object(7)
memory usage: 376.7+ KB
None

With the object-typed columns identified, I proceed to label encoding, using sklearn's LabelEncoder. The code I used is below.

from sklearn.preprocessing import LabelEncoder
cols = ['Gender', 'Ever_Married', "Graduated", "Profession", "Spending_Score", "Var_1"]

le = LabelEncoder()

for col in cols :
    le.fit(train[col])
    train[col] = le.transform(train[col])
    test[col] = le.transform(test[col])

print(train.info())
print(test.info())

<class 'pandas.core.frame.DataFrame'>
Index: 4018 entries, 0 to 4879
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               4018 non-null   int64  
 1   Gender           4018 non-null   int64  
 2   Ever_Married     4018 non-null   int64  
 3   Age              4018 non-null   int64  
 4   Graduated        4018 non-null   int64  
 5   Profession       4018 non-null   int64  
 6   Work_Experience  4018 non-null   float64
 7   Spending_Score   4018 non-null   int64  
 8   Family_Size      4018 non-null   float64
 9   Var_1            4018 non-null   int64  
 10  Segmentation     4018 non-null   object 
dtypes: float64(2), int64(8), object(1)
memory usage: 376.7+ KB
None
<class 'pandas.core.frame.DataFrame'>
Index: 1288 entries, 0 to 1541
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               1288 non-null   int64  
...
 9   Var_1            1288 non-null   int64  
dtypes: float64(2), int64(8)
memory usage: 110.7 KB

One thing that is easy to miss here: label encoding must be applied to the `test` data as well, not only to `train`.

You can see that the encoded `object` columns are now `int64` (only the target `Segmentation` remains `object`). Now the actual training begins.
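
The encoding loop above refits a single LabelEncoder for each column, so only the mapping of the last column is still available afterwards. As a hedged variant, the sketch below keeps one fitted encoder per column in a dictionary (the name `encoders` and the toy rows are made up for illustration), which makes the learned mappings inspectable later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frames, not the exam data
train = pd.DataFrame({"Gender": ["Male", "Female", "Male"],
                      "Spending_Score": ["Low", "High", "Average"]})
test = pd.DataFrame({"Gender": ["Female", "Male"],
                     "Spending_Score": ["Average", "Low"]})

encoders = {}  # one fitted encoder per column, kept for later inspection
for col in ["Gender", "Spending_Score"]:
    le = LabelEncoder()
    train[col] = le.fit_transform(train[col])  # learn the mapping on train only
    test[col] = le.transform(test[col])        # reuse the same mapping on test
    encoders[col] = le

print(encoders["Gender"].classes_.tolist())  # ['Female', 'Male'] (sorted order)
print(train["Gender"].tolist())              # [1, 0, 1]
print(test["Spending_Score"].tolist())       # [0, 2]
```

Note that transform raises an error if `test` contains a category that never appeared in `train`, so this pattern relies on both sets sharing the same categories.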

from sklearn.model_selection import train_test_split
X = train.drop(columns = ["Segmentation", "ID"])
y = train["Segmentation"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 123)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(3214, 9) (804, 9) (3214,) (804,)

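X holds the independent variables and y the dependent variable. `Segmentation` is naturally removed from X because it is the dependent variable. `ID` is removed as well because it is an identifier that does not help training, so both `Segmentation` and `ID` are dropped.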

Since y is the dependent variable, it is the single column `train["Segmentation"]`.
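
The shapes printed above, (3214, 9) and (804, 9), follow from how the 20% hold-out is sized. The sketch below reproduces only the row counts with placeholder arrays (the zeros and the cyclic labels are made up), relying on scikit-learn rounding the hold-out size up.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 4018                  # row count of train after dropna, from the info output
X = np.zeros((n_rows, 9))      # placeholder features, the values are irrelevant here
y = np.arange(n_rows) % 4      # placeholder labels with four classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# The hold-out size is rounded up, and the rest goes to the training split
print(math.ceil(0.2 * n_rows), X_test.shape[0])    # 804 804
print(n_rows - X_test.shape[0], X_train.shape[0])  # 3214 3214
```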

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators = 95, max_depth = 6, random_state = 123)
rfc.fit(X_train, y_train)
pred = rfc.predict(X_test)
print(pred[:5])

['A' 'C' 'C' 'D' 'A']

I train a random forest. The problem states that the F1 score is the evaluation metric, so I compute the F1 value.

from sklearn.metrics import f1_score, accuracy_score

f1 = f1_score(y_test, pred, average = "macro")
print("f1: ", f1) # f1:  0.505136752530129

acc = accuracy_score(y_test, pred)
print("acc : ", acc) # acc :  0.5211442786069652

accuracy_score is there only to check the accuracy as well.
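
To make the metric concrete: macro F1 is the unweighted mean of the per-class F1 scores. The sketch below recomputes it by hand on made-up labels (not the exam data) and compares the result with f1_score. The helper `f1_for` is a hypothetical name introduced only for this illustration.

```python
from sklearn.metrics import f1_score

# Hypothetical toy labels, not the exam data
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "A"]

def f1_for(label):
    """F1 of one class from its true positives, false positives, false negatives."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn)

# Macro F1 is the unweighted mean of the per-class F1 scores
manual = sum(f1_for(c) for c in ["A", "B", "C"]) / 3
library = f1_score(y_true, y_pred, average="macro")
print(round(manual, 4), round(library, 4))  # 0.6556 0.6556
```

Because every class counts equally, a class the model predicts poorly pulls the macro score down even if that class is rare.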

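Another thing that is easy to miss: so far the predictions were made on the validation split held out from the training data, whereas the actual submission must be made from the test data. So ID is dropped from the test data in the same way.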

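data = test.drop(columns = "ID")
result = rfc.predict(data)
# Build the submission with exactly the two required columns
submission = pd.DataFrame(
    {
        "ID" : test["ID"],
        "Segmentation" : result
    }
)
# to_csv returns None when given a path, so its result is not assigned
submission.to_csv("4th_test_type2.csv", index = False)

display(pd.read_csv("4th_test_type2.csv"))

ID	Segmentation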
0	460406	D
1	466890	C
2	466145	B
3	467572	C
4	460767	B
...	...	...
1283	463734	A
1284	466055	B
1285	465689	D
1286	460401	A
1287	465215	B
1288 rows × 2 columns

Also, since the problem asks for `ID` and `Segmentation`, I added the ID values from `test["ID"]`.
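
Since the problem requires exactly the columns ID and Segmentation with no index, it can help to read the saved file back and check its header before submitting. The sketch below does this with two stand-in rows written to a temporary directory. The values are placeholders for `test["ID"]` and the model's predictions.

```python
import os
import tempfile
import pandas as pd

# Hypothetical IDs and predictions standing in for test["ID"] and rfc.predict(...)
submission = pd.DataFrame({"ID": [460406, 466890], "Segmentation": ["D", "C"]})

path = os.path.join(tempfile.mkdtemp(), "4th_test_type2.csv")
submission.to_csv(path, index=False)  # index=False keeps the index out of the file

# Read the file back and confirm it has exactly the two required columns
check = pd.read_csv(path)
print(check.columns.tolist())  # ['ID', 'Segmentation']
print(check.shape)             # (2, 2)
```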

However, the scores are too low, so I will try to improve the model.

import pandas as pd

train = pd.read_csv("./bigdata_csvfile/test4_type2_train.csv")
test = pd.read_csv("./bigdata_csvfile/test4_type2_test.csv")

cols = ['Ever_Married', "Graduated", "Profession", "Work_Experience", "Family_Size", "Var_1"]

for col in cols :
    train[col] = train[col].fillna(train[col].mode()[0])
    test[col] = test[col].fillna(test[col].mode()[0])

This time, instead of dropping the missing values, I replaced them with the mode (the most frequent value) of each column.
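
The loop above fills `test` with the mode computed from `test` itself. A common alternative, shown here only as a sketch on made-up rows and not as the approach used in this post, is to compute the fill value on `train` and reuse it for `test`, so both sets are preprocessed with the same statistics.

```python
import pandas as pd

# Hypothetical toy frames, not the exam data
train = pd.DataFrame({"Profession": ["Artist", "Artist", None, "Doctor"],
                      "Family_Size": [2.0, None, 2.0, 4.0]})
test = pd.DataFrame({"Profession": [None, "Doctor", "Doctor"],
                     "Family_Size": [None, 1.0, 1.0]})

for col in ["Profession", "Family_Size"]:
    fill_value = train[col].mode()[0]         # most frequent value in train
    train[col] = train[col].fillna(fill_value)
    test[col] = test[col].fillna(fill_value)  # same value reused for test

print(test["Profession"].tolist())   # ['Artist', 'Doctor', 'Doctor']
print(test["Family_Size"].tolist())  # [2.0, 1.0, 1.0]
```

In this toy data the mode of `test` on its own would have been "Doctor" and 1.0, so the two choices give visibly different fills.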

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4881 entries, 0 to 4880
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               4881 non-null   int64  
 1   Gender           4881 non-null   object 
 2   Ever_Married     4881 non-null   object 
 3   Age              4881 non-null   int64  
 4   Graduated        4881 non-null   object 
 5   Profession       4881 non-null   object 
 6   Work_Experience  4881 non-null   float64
 7   Spending_Score   4881 non-null   object 
 8   Family_Size      4881 non-null   float64
 9   Var_1            4881 non-null   object 
 10  Segmentation     4881 non-null   object 
dtypes: float64(2), int64(2), object(7)
memory usage: 419.6+ KB

Label encoding is done the same way as above.

from sklearn.preprocessing import LabelEncoder
cols = ["Gender", "Ever_Married", "Graduated", "Profession" ,"Spending_Score", "Var_1"]

le = LabelEncoder()
for col in cols :
    le.fit(train[col])
    train[col] = le.transform(train[col])
    test[col] = le.transform(test[col])

from sklearn.model_selection import train_test_split
X = train.drop(columns = ["ID", "Segmentation"])
y = train["Segmentation"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators = 95, max_depth = 6, random_state = 42)
rfc.fit(X_train, y_train)
pred = rfc.predict(X_test)

from sklearn.metrics import f1_score, accuracy_score

f1 = f1_score(y_test, pred, average = 'macro')
acc = accuracy_score(y_test, pred)
print("F1 : ", f1) # F1 :  0.4509502343284842
print("ACC : ", acc) # ACC :  0.46980552712384854

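Comparing the scores from dropping the missing values with the scores from mode imputation, the scores are actually lower with imputation. Note that the two runs also use different random_state values (123 and 42) and different sets of rows, so the comparison is only a rough one. ~~So much for that idea~~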
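
A single 80/20 split can swing by a few points depending on which rows land in the hold-out set. One way to get a steadier comparison between preprocessing choices is cross-validation, sketched below on synthetic data from make_classification standing in for the preprocessed X and y. The fold count and the synthetic data settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic four-class data standing in for the preprocessed X and y
X, y = make_classification(n_samples=400, n_features=9, n_informative=5,
                           n_classes=4, random_state=0)

model = RandomForestClassifier(n_estimators=95, max_depth=6, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# "f1_macro" is scikit-learn's scorer name for the macro-averaged F1
scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
print(len(scores), round(scores.mean(), 3))
```

Running the same cross-validation on each preprocessing variant and comparing the mean scores gives a fairer picture than two single splits with different seeds.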

Next, instead of `LabelEncoder`, I will try the `pd.get_dummies()` function.

import pandas as pd

train = pd.read_csv("./bigdata_csvfile/test4_type2_train.csv")
test = pd.read_csv("./bigdata_csvfile/test4_type2_test.csv")

cols = ['Ever_Married', "Graduated", "Profession", "Work_Experience", "Family_Size", "Var_1"]

for col in cols :
    train[col] = train[col].fillna(train[col].mode()[0])
    test[col] = test[col].fillna(test[col].mode()[0])
    
cols = ["Gender", 'Ever_Married', "Graduated", "Profession", "Spending_Score", "Var_1"]
train = pd.get_dummies(train, columns = cols)
test = pd.get_dummies(test, columns = cols)

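from sklearn.model_selection import train_test_split
# Rebuild X and y from the one-hot encoded train; otherwise the split
# silently reuses the label-encoded X and y from the previous run
X = train.drop(columns = ["ID", "Segmentation"])
y = train["Segmentation"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
from sklearn.metrics import f1_score, accuracy_score
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 95, max_depth = 6, random_state=42)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)

f1 = f1_score(y_test, pred, average = 'macro')
acc = accuracy_score(y_test, pred)
print(f1, acc)

X and y have to be rebuilt from the one-hot encoded `train` before splitting. Without that step, the split reuses the label-encoded X and y left over from the previous run, and the scores come out exactly the same as before (0.4509502343284842 0.46980552712384854). That can make it look as if LabelEncoder and pd.get_dummies() are the same thing, but they are not: LabelEncoder maps each category to a single integer code, while pd.get_dummies() creates one indicator column per category.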
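
A related point when `train` and `test` are one-hot encoded separately: a category that appears only in `train` gets no column in `test`, so the two frames can end up with different columns. The sketch below shows this on made-up rows and aligns the columns with reindex, which is one possible fix rather than something done in this post.

```python
import pandas as pd

# Hypothetical toy frames, not the exam data
train = pd.DataFrame({"Age": [40, 18, 25],
                      "Profession": ["Artist", "Doctor", "Engineer"]})
test = pd.DataFrame({"Age": [36, 47],
                     "Profession": ["Doctor", "Artist"]})

# One indicator column per category, unlike LabelEncoder's single integer column
train_oh = pd.get_dummies(train, columns=["Profession"])
test_oh = pd.get_dummies(test, columns=["Profession"])
print(train_oh.columns.tolist())
# ['Age', 'Profession_Artist', 'Profession_Doctor', 'Profession_Engineer']

# test has no "Engineer" row, so its dummy column is missing; add it as all zeros
test_oh = test_oh.reindex(columns=train_oh.columns, fill_value=0)
print(test_oh.columns.equals(train_oh.columns))   # True
print(int(test_oh["Profession_Engineer"].sum()))  # 0
```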

License: Attribution, NonCommercial (저작자표시 비영리)
