[Big Data Analysis Engineer (빅데이터분석기사)] Practical Exam Prep, Day 1

[Big Data Analysis Engineer (빅데이터분석기사)] Practical Exam Prep, Day 1 - Mock Exam

pental 2025. 11. 3. 10:58

This post was written for my own study.
If it raises any copyright or other concerns,
please email pental@kakao.com or leave a comment on this post and I will take care of it.

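These notes cover the parts of Professor Jang Eun-jin's lecture series "빅데이터분석기사-2차실기" (Big Data Analysis Engineer, second-round practical exam) that I felt I needed to study further.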

Task Type 1, Problem 1

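The problem statement was not given explicitly, but judging from the code it was as follows. Based on the given data, find the difference between the death count of the country with the most deaths and the death count of the country ranked fifth from the bottom.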

import pandas as pd
import numpy as np
df = pd.read_csv("./bigdata_csvfile/covid_death_bycountry.csv")
display(df.head())

print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217 entries, 0 to 216
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Country  217 non-null    object
 1   Deaths   217 non-null    object
 2   Cases    217 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 5.2+ KB
None

Some rows near the end of the data hold a dash placeholder instead of a number.
That placeholder is why info reports the Deaths column as object. To replace it and convert the column to integers, do the following.

import pandas as pd
import numpy as np
df = pd.read_csv("./bigdata_csvfile/covid_death_bycountry.csv")
display(df.head())

df['Deaths'] = df['Deaths'].str.replace('—', '0')
df['Deaths'] = df['Deaths'].astype(int)
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217 entries, 0 to 216
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Country  217 non-null    object
 1   Deaths   217 non-null    int64 
 2   Cases    217 non-null    int64 
dtypes: int64(2), object(1)
memory usage: 5.2+ KB
None

Next, the DataFrame method sort_values is used to sort the rows from the highest death count down.

df = df.sort_values('Deaths', ascending = False)

Because sort_values has already put the rows in order, iloc can pick out the first row and the fifth row from the end. Calling iloc directly on df['Deaths'] returns just the value.

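# Avoid the names max/min, which would shadow Python's built-ins
max_deaths = df['Deaths'].iloc[0]   # largest death count
min_deaths = df['Deaths'].iloc[-5]  # fifth row from the end

result = max_deaths - min_deaths
print(result)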

Answer: 1110232
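The cleaning and ranking steps above can be sketched on a tiny hand-made table. The country names and counts below are made up for illustration and are not taken from covid_death_bycountry.csv; the placeholder is written as the escape "\u2014", assuming it is the same dash character that the replace call above targets.

```python
import pandas as pd

# Hypothetical toy data; "\u2014" stands in for the dash placeholder.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E", "F", "G"],
    "Deaths": ["900", "700", "500", "300", "200", "100", "\u2014"],
})

# Replace the placeholder with "0", then convert the column to integers.
toy["Deaths"] = toy["Deaths"].str.replace("\u2014", "0", regex=False).astype(int)

# Sort in descending order so that position 0 holds the maximum.
toy = toy.sort_values("Deaths", ascending=False)

top = toy["Deaths"].iloc[0]              # largest value
fifth_from_end = toy["Deaths"].iloc[-5]  # fifth value counting from the bottom
diff = top - fifth_from_end
print(diff)
```

Note that rows converted to 0 still sit at the bottom of the ranking, so they influence which row iloc[-5] selects.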

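Task Type 1, Problem 2

The following data covers regional birth, death, divorce, marriage, and natural growth rates in Korea from 2000 to 2022.
Among the records for July 1, 2008, print the name of the region with the highest birth rate.
(If missing values are present, replace them with the median of the corresponding column.)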

df = pd.read_csv("./bigdata_csvfile/Korean_demographics_2000-2022.csv")

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4860 entries, 0 to 4859
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Date                 4860 non-null   object 
 1   Region               4860 non-null   object 
 2   Birth                4716 non-null   float64
 3   Birth_rate           4709 non-null   float64
 4   Death                4716 non-null   float64
 5   Death_rate           4709 non-null   float64
 6   Divorce              4716 non-null   float64
 7   Divorce_rate         4709 non-null   float64
 8   Marriage             4716 non-null   float64
 9   Marriage_rate        4709 non-null   float64
 10  Natural_growth       4716 non-null   float64
 11  Natural_growth_rate  4709 non-null   float64
dtypes: float64(10), object(2)
memory usage: 455.8+ KB

print(df.isnull().sum())

Date                     0
Region                   0
Birth                  144
Birth_rate             151
Death                  144
Death_rate             151
Divorce                144
Divorce_rate           151
Marriage               144
Marriage_rate          151
Natural_growth         144
Natural_growth_rate    151
dtype: int64

A fair number of missing values are present. First, because Date is stored as object, convert it with the pandas function to_datetime.

df['Date'] = pd.to_datetime(df['Date'])
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4860 entries, 0 to 4859
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Date                 4860 non-null   datetime64[ns]
 1   Region               4860 non-null   object        
 2   Birth                4716 non-null   float64       
 3   Birth_rate           4709 non-null   float64       
 4   Death                4716 non-null   float64       
 5   Death_rate           4709 non-null   float64       
 6   Divorce              4716 non-null   float64       
 7   Divorce_rate         4709 non-null   float64       
 8   Marriage             4716 non-null   float64       
 9   Marriage_rate        4709 non-null   float64       
 10  Natural_growth       4716 non-null   float64       
 11  Natural_growth_rate  4709 non-null   float64       
dtypes: datetime64[ns](1), float64(10), object(1)
memory usage: 455.8+ KB

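Birth_rate has missing values, and the problem says to fill them with the column median, so fill them using median().

# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df['Birth_rate'] = df['Birth_rate'].fillna(df['Birth_rate'].median())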

Next, store the condition df['Date'] == '2008-07-01' in a variable named cond1 and save the filtered rows to df2.

cond1 = df['Date'] == '2008-07-01'
print(len(df[cond1]))
display(df[cond1].head(5))
df2 = df[cond1]

Sort df2 by Birth_rate in descending order with sort_values, then print the region name with iloc.

df2 = df2.sort_values('Birth_rate', ascending = False)
display(df2.head())
print(df2['Region'].iloc[0])

Answer: Gyeonggi-do
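The same three steps (median fill, date filter, pick the top row) can be sketched on a small made-up table. The region names and rates are hypothetical, and idxmax is used here as an alternative to sorting and taking iloc[0].

```python
import pandas as pd

# Hypothetical toy data with one missing Birth_rate value.
toy = pd.DataFrame({
    "Date": ["2008-07-01", "2008-07-01", "2008-07-01", "2008-08-01", "2008-08-01"],
    "Region": ["X", "Y", "Z", "X", "Y"],
    "Birth_rate": [9.0, None, 12.0, 30.0, 10.0],
})

toy["Date"] = pd.to_datetime(toy["Date"])

# The median is taken over the whole column (9, 10, 12, 30 -> 11.0).
toy["Birth_rate"] = toy["Birth_rate"].fillna(toy["Birth_rate"].median())

# Keep only the rows for the requested date.
july = toy[toy["Date"] == "2008-07-01"]

# idxmax returns the index label of the largest value.
best_region = july.loc[july["Birth_rate"].idxmax(), "Region"]
print(best_region)
```

The order of operations matters: filling before filtering uses the median of the full column, which is how the solution above does it.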

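Task Type 1, Problem 3

In the given sleep data, print the mode of the variable that has the highest correlation coefficient with the sleep duration (Sleep Duration) column.
(The Blood Pressure and Sleep Disorder columns are excluded from the analysis.)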

df = pd.read_csv("./bigdata_csvfile/Sleep_health_and_lifestyle_dataset.csv")
df = df.drop(columns = ['Blood Pressure', 'Sleep Disorder'])

one_hot_data = pd.get_dummies(df[['Gender', 'Occupation', 'BMI Category']])
df = pd.concat([df, one_hot_data], axis = 1)
df = df.drop(columns=['Gender', 'Occupation', 'BMI Category'])
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 25 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Person ID                        374 non-null    int64  
 1   Age                              374 non-null    int64  
 2   Sleep Duration                   374 non-null    float64
 3   Quality of Sleep                 374 non-null    int64  
 4   Physical Activity Level          374 non-null    int64  
 5   Stress Level                     374 non-null    int64  
 6   Heart Rate                       374 non-null    int64  
 7   Daily Steps                      374 non-null    int64  
 8   Gender_Female                    374 non-null    bool   
 9   Gender_Male                      374 non-null    bool   
 10  Occupation_Accountant            374 non-null    bool   
 11  Occupation_Doctor                374 non-null    bool   
 12  Occupation_Engineer              374 non-null    bool   
 13  Occupation_Lawyer                374 non-null    bool   
 14  Occupation_Manager               374 non-null    bool   
 15  Occupation_Nurse                 374 non-null    bool   
 16  Occupation_Sales Representative  374 non-null    bool   
 17  Occupation_Salesperson           374 non-null    bool   
 18  Occupation_Scientist             374 non-null    bool   
 19  Occupation_Software Engineer     374 non-null    bool   
...
 24  BMI Category_Overweight          374 non-null    bool   
dtypes: bool(17), float64(1), int64(7)
memory usage: 29.7 KB

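First, the unneeded Blood Pressure and Sleep Disorder columns are dropped and the categorical columns are one-hot encoded.
Then concat joins the one-hot encoded data to the original DataFrame, and the original pre-encoding columns are dropped.
After that, the correlation coefficients are computed as follows.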

corr = df.corr()
print(corr)

The column most strongly correlated with Sleep Duration is Quality of Sleep, so print its mode as follows.

result = df['Quality of Sleep'].mode()[0]
print(result)

Answer: 8
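Instead of reading the correlation matrix by eye, the most correlated column can be picked programmatically. This is a minimal sketch on made-up numbers; the three column names mirror the dataset, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not taken from the sleep dataset.
toy = pd.DataFrame({
    "Sleep Duration":   [6.0, 6.5, 7.0, 7.5, 8.0, 8.5],
    "Quality of Sleep": [5, 6, 7, 7, 8, 9],
    "Stress Level":     [8, 7, 3, 5, 6, 3],
})

corr = toy.corr()
target = "Sleep Duration"

# Drop the target itself (its self-correlation is always 1.0).
others = corr[target].drop(target)

# Column with the largest correlation coefficient.
best = others.idxmax()

# mode() returns a Series because ties are possible; take the first entry.
mode_value = toy[best].mode()[0]
print(best, mode_value)
```

If "highest correlation" is meant in terms of strength regardless of sign, others.abs().idxmax() would be the variant to use.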

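Task Type 2

The following is a used-car price dataset. Check the notes below and submit your result.
Of the 4009 rows in used_cars_price_data.csv, slice 3800 rows to use as training data and use the rest as test data.
Build a prediction model on the training data, apply it to the test data to predict the target variable (price), and submit the predictions.
The evaluation metric is RMSE.
The result file must contain a price column holding the predictions, with no index.
Submit the result under the following file name. (File name: result.csv)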

df = pd.read_csv("./bigdata_csvfile/used_cars_price_data.csv")

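Preprocessing is needed. df.isnull().sum() showed three columns with missing values, so the missing values in those three columns are filled with the mode using fillna.

# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df['fuel_type'] = df['fuel_type'].fillna(df['fuel_type'].mode()[0])
df['accident'] = df['accident'].fillna(df['accident'].mode()[0])
df['clean_title'] = df['clean_title'].fillna(df['clean_title'].mode()[0])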

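The price column contains special characters such as $ and the comma, so remove them all with str.replace.

# regex=False makes "$" a literal character rather than a regex end-of-string anchor
df['price'] = df['price'].str.replace("$", "", regex = False)
df['price'] = df['price'].str.replace(",", "", regex = False)
df['price'] = df['price'].astype(int)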
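As an alternative sketch, both characters can be stripped in one pass with a regular-expression character class. The price strings below are made up for illustration; inside square brackets the dollar sign is an ordinary character.

```python
import pandas as pd

# Hypothetical price strings in the same "$12,345" style.
prices = pd.Series(["$10,300", "$38,005", "$54,598"])

# [$,] matches either a dollar sign or a comma.
cleaned = prices.str.replace(r"[$,]", "", regex=True).astype(int)
print(cleaned.tolist())
```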

Checking with info afterward shows that everything came out as expected.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4009 entries, 0 to 4008
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   brand         4009 non-null   object
 1   model         4009 non-null   object
 2   model_year    4009 non-null   int64 
 3   milage        4009 non-null   object
 4   fuel_type     4009 non-null   object
 5   engine        4009 non-null   object
 6   transmission  4009 non-null   object
 7   ext_col       4009 non-null   object
 8   int_col       4009 non-null   object
 9   accident      4009 non-null   object
 10  clean_title   4009 non-null   object
 11  price         4009 non-null   int64 
dtypes: int64(2), object(10)
memory usage: 376.0+ KB

The remaining columns are label-encoded by looping over them with LabelEncoder.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

columns = ['brand', 'model', 'milage', 'fuel_type', 'engine', 'transmission', 'ext_col', 'int_col', 'accident', 'clean_title']

for col in columns :
    le.fit(df[col])
    df[col] = le.transform(df[col])

After label encoding, the data is split into train and test.

train = df.iloc[:3800, :]
test = df.iloc[-209:, :]

Use train_test_split from sklearn.model_selection to split into X_train, X_test, y_train, and y_test.

from sklearn.model_selection import train_test_split
X = train.drop(columns = 'price') # 독립변수
y = train['price'] # 종속변수

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 123)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((3040, 11), (760, 11), (3040,), (760,))

The car price is then predicted with the magic of random forest regression.

from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators = 150, max_depth = 20, random_state = 123)
rfr.fit(X_train, y_train)
pred = rfr.predict(X_test)

Since the evaluation metric is RMSE, compute it with root_mean_squared_error.

from sklearn.metrics import root_mean_squared_error
rmse = root_mean_squared_error(y_test, pred)
print('rmse', rmse)

rmse 38204.54661207414
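root_mean_squared_error was added in scikit-learn 1.4, so the import fails on older installations. As a fallback, RMSE can be computed directly from its definition with NumPy. The numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical true and predicted prices.
y_true = np.array([10000.0, 20000.0, 30000.0])
y_pred = np.array([12000.0, 19000.0, 33000.0])

# RMSE = square root of the mean of the squared errors.
rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
print("rmse", rmse)
```

Because the errors are squared before averaging, a few badly mispredicted expensive cars can dominate the score.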

For the final submission, drop the price column from test and predict on the test data.

test_X_data = test.drop(columns = 'price')
pred2 = rfr.predict(test_X_data)

# The problem requires the output column to be named price
pd.DataFrame({'price' : pred2}).to_csv("result.csv", index = False)
print(pd.read_csv("result.csv").head())

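Task Type 3, Problem 1

The provided cardiovascular disease data is to be used to predict whether cardiovascular disease occurs (cardio).
Submit the answer to each question in the required format.
data = cardiovascular_heart_disease_data.csv

1. When testing the alco and cardio variables for independence, find the chi-square statistic.
2. When fitting a logistic regression model with gender, weight, smoke, and cholesterol as independent variables, find the coefficient of the smoke variable.
3. In the logistic regression model from question 2, find the odds ratio of cardiovascular disease for a one-unit increase in the gender variable.
(Round all answers to four decimal places.)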

df = pd.read_csv("./bigdata_csvfile/cardiovascular_heart_disease_data.csv")
display(df)
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   index        70000 non-null  int64  
 1   id           70000 non-null  int64  
 2   age          70000 non-null  int64  
 3   gender       70000 non-null  int64  
 4   height       70000 non-null  int64  
 5   weight       70000 non-null  float64
 6   ap_hi        70000 non-null  int64  
 7   ap_lo        70000 non-null  int64  
 8   cholesterol  70000 non-null  int64  
 9   gluc         70000 non-null  int64  
 10  smoke        70000 non-null  int64  
 11  alco         70000 non-null  int64  
 12  active       70000 non-null  int64  
 13  cardio       70000 non-null  int64  
dtypes: float64(1), int64(13)
memory usage: 7.5 MB
None

1. When testing the alco and cardio variables for independence, find the chi-square statistic.
Chi-square comes up? → always use chi2_contingency from scipy.stats

from scipy.stats import chi2_contingency
data = pd.crosstab(df.alco, df.cardio)
print(data)

cardio      0      1
alco                
0       33080  33156
1        1941   1823

Compute the chi-square statistic

chi2, p_value, dof, exp = chi2_contingency(data)
print(round(chi2, 4))

Answer: 3.6965
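One detail worth knowing: for a 2x2 table, chi2_contingency applies Yates's continuity correction by default (correction=True), and the answer above corresponds to that default. The sketch below rebuilds the crosstab printed earlier as a plain array and compares both settings.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the alco x cardio crosstab shown above.
table = np.array([[33080, 33156],
                  [1941, 1823]])

# Default: Yates's continuity correction is applied because dof == 1.
chi2_corrected, p_corrected, dof, expected = chi2_contingency(table)

# Same test without the continuity correction.
chi2_plain, p_plain, _, _ = chi2_contingency(table, correction=False)

print(round(chi2_corrected, 4), round(chi2_plain, 4))
```

If an exam question does not mention the correction, keeping the default as in the solution above seems the safer reading, though that is my assumption rather than something the lecture stated.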

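2. When fitting a logistic regression model with gender, weight, smoke, and cholesterol as independent variables, find the coefficient of the smoke variable.

When a problem is worded like this, it is a logistic regression analysis, and here I use from statsmodels.formula.api import logit.

When writing the formula, check the ~ and + carefully. Here cardio takes the value 0 or 1, so cardio is the dependent variable, and gender, weight, smoke, and cholesterol are the independent variables, joined with + in the formula.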

from statsmodels.formula.api import logit
result = logit('cardio ~ gender + weight + smoke + cholesterol', data = df).fit().params

Optimization terminated successfully.
         Current function value: 0.654936
         Iterations 5
         
print(round(result.smoke, 4))

Answer: -0.2174

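3. In the logistic regression model from question 2, find the odds ratio of cardiovascular disease for a one-unit increase in the gender variable.

In this case use np.exp → np.exp(result). Exponentiating a logit coefficient turns it into an odds ratio.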

answer = np.exp(result)
print(round(answer.gender, 4))

Answer: 1.0002
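Why np.exp gives the odds ratio: a logit model assumes log(p / (1 - p)) = b0 + b1 * x, so the odds are exp(b0 + b1 * x), and raising x by one unit multiplies the odds by exp(b1). The sketch below checks this numerically with a hypothetical intercept and coefficient (not the fitted values from the model above).

```python
import math

# Hypothetical intercept and coefficient, chosen only for illustration.
b0, b1 = -0.5, 0.3

def odds(x):
    """Odds p / (1 - p) predicted by the logit model at x."""
    p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))  # predicted probability
    return p / (1.0 - p)

# Ratio of the odds after and before a one-unit increase in x.
ratio = odds(2.0) / odds(1.0)
print(ratio, math.exp(b1))
```

The two printed numbers agree up to floating-point error, and the starting value of x does not matter; that is why a single exp(coefficient) summarizes the effect.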

Honestly, I still don't fully understand question 3; it needs a bit more study.

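Task Type 3, Problem 2

The provided heart disease data is to be used to predict whether heart disease occurs (target). Submit the answer to each question in the required format. data = pd.read_csv("heart_disease_data.csv")

1. When running a logistic regression with exang, trestbps, and ca as independent variables and target as the target variable, find the standard error of the ca variable.
2. When building a logistic regression model that classifies target using fbs, thalach, chol, and sex as independent variables, find the misclassification rate on the training (train) data. (Use 800 rows of the full data as training data, set test_size of train_test_split to 0.2, and set random_state to 123.)
3. For the model from question 2, find the precision for heart disease occurrence (1).
(Round all answers to two decimal places.)

1. When running a logistic regression with exang, trestbps, and ca as independent variables and target as the target variable, find the standard error of the ca variable.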

df = pd.read_csv("./bigdata_csvfile/heart_disease_data.csv")
print(df.info())
display(df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1025 non-null   int64  
 1   sex       1025 non-null   int64  
 2   cp        1025 non-null   int64  
 3   trestbps  1025 non-null   int64  
 4   chol      1025 non-null   int64  
 5   fbs       1025 non-null   int64  
 6   restecg   1025 non-null   int64  
 7   thalach   1025 non-null   int64  
 8   exang     1025 non-null   int64  
 9   oldpeak   1025 non-null   float64
 10  slope     1025 non-null   int64  
 11  ca        1025 non-null   int64  
 12  thal      1025 non-null   int64  
 13  target    1025 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 112.2 KB
None

The wording here is again "with ... as independent variables and target as the target variable", so the formula has to be written the same way.

from statsmodels.formula.api import logit
result = logit('target ~ exang + trestbps + ca', data = df).fit().summary()
print(result)

Optimization terminated successfully.
         Current function value: 0.516009
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 target   No. Observations:                 1025
Model:                          Logit   Df Residuals:                     1021
Method:                           MLE   Df Model:                            3
Date:                Thu, 30 Oct 2025   Pseudo R-squ.:                  0.2552
Time:                        09:48:21   Log-Likelihood:                -528.91
converged:                       True   LL-Null:                       -710.12
Covariance Type:            nonrobust   LLR p-value:                 3.046e-78
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.1733      0.598      5.305      0.000       2.001       4.346
exang         -2.1073      0.167    -12.629      0.000      -2.434      -1.780
trestbps      -0.0138      0.004     -3.095      0.002      -0.023      -0.005
ca            -0.8756      0.083    -10.574      0.000      -1.038      -0.713
==============================================================================

Answer: 0.083 (0.08 when rounded to two decimal places, as the problem requires)

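2. When building a logistic regression model that classifies target using fbs, thalach, chol, and sex as independent variables, find the misclassification rate on the training (train) data. (Use 800 rows of the full data as training data, set test_size of train_test_split to 0.2, and set random_state to 123.)

Why is it suddenly asking me to train a model? Do the problems usually come out like this?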

# Q2
train = df.iloc[:800, :]
test = df.iloc[-225:, :]
print(train.info())
print(test.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       800 non-null    int64  
 1   sex       800 non-null    int64  
 2   cp        800 non-null    int64  
 3   trestbps  800 non-null    int64  
 4   chol      800 non-null    int64  
 5   fbs       800 non-null    int64  
 6   restecg   800 non-null    int64  
 7   thalach   800 non-null    int64  
 8   exang     800 non-null    int64  
 9   oldpeak   800 non-null    float64
 10  slope     800 non-null    int64  
 11  ca        800 non-null    int64  
 12  thal      800 non-null    int64  
 13  target    800 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 87.6 KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 225 entries, 800 to 1024
Data columns (total 14 columns):
...
 13  target    225 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 24.7 KB
None

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

X = train[['fbs', 'thalach', 'chol', 'sex']]
y = train['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 123)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(640, 4) (160, 4) (640,) (160,)

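lr = LogisticRegression()
lr.fit(X_train, y_train)
# Evaluated on the held-out split, as in these notes; use lr.predict(X_train) to score the training split instead
pred = lr.predict(X_test)

confusion = confusion_matrix(y_test, pred)
print(confusion)

# Misclassification rate is 1 - accuracy, not 1 - ROC AUC
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, pred)
print(acc)

[[59 19]
 [28 54]]
0.70625

print(round(1 - acc, 2))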

Answer: 0.29

3. For the model from question 2, find the precision for heart disease occurrence (1).

Precision can be read from classification_report.

# Q3 
cla_re = classification_report(y_test, pred)
print(cla_re)

              precision    recall  f1-score   support

           0       0.68      0.76      0.72        78
           1       0.74      0.66      0.70        82

    accuracy                           0.71       160
   macro avg       0.71      0.71      0.71       160
weighted avg       0.71      0.71      0.71       160

Answer: 0.74
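Both the misclassification rate and the precision for class 1 can be checked by hand from the confusion matrix printed above. This sketch assumes scikit-learn's layout, where rows are the true labels and columns are the predicted labels.

```python
# Confusion matrix copied from the output above.
cm = [[59, 19],
      [28, 54]]

tn, fp = cm[0]  # true label 0: correctly rejected, wrongly flagged
fn, tp = cm[1]  # true label 1: missed, correctly flagged

total = tn + fp + fn + tp

# Misclassification rate = share of wrong predictions = 1 - accuracy.
error_rate = (fp + fn) / total

# Precision for class 1 = correct positives / all predicted positives.
precision_1 = tp / (tp + fp)

print(round(error_rate, 2), round(precision_1, 2))
```

The rounded values match the answers to questions 2 and 3.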
