
TechBlog

[Kaggle/python] PUBG Battlegrounds Game Data Analysis

Study/Data Analysis


jiazzang 2023. 10. 10. 23:31

 

📌 Topic

PUBG Battlegrounds game data analysis

📖 Contents

1. Topic definition

2. Game description

3. Data and variables used
4. Analysis process

 


1. Topic definition

  • Topic: PUBG Battlegrounds game data analysis
  • Summary: An analysis project using the Battlegrounds data released by PUBG Developer.
  • Analysis: The work focuses on EDA, preprocessing, and visualization, and also includes a basic RandomForest model that predicts each player's final placement.

 


2. Game description

📖 Game overview

Battlegrounds is a game in which players roam the map collecting weapons, ammunition, armor, and first-aid supplies, kill one another, and try to be the last one alive. At the start of a match the players are aboard a plane; each one drops at a location of their choice and then prepares for combat by looting (picking up items). Winning strategies fall broadly into two types: 1) a passive strategy (hiding in buildings and killing enemies from cover) and 2) an aggressive strategy (rushing in to kill enemies quickly).

 

🚷 Game rules

  • Each match (matchId) can have up to 100 players.
  • Within a match, up to 4 players can belong to the same group.

 

⚠️ Caveat

Inspecting the data shows matches in which more than four players share the same groupId. According to a Kaggle notebook, this is caused by disconnections during the game: players from several groups end up stored in the API's database with the same final placement. groupId should therefore be read as "players who share the same final placement" rather than as a team that definitely played together.

 


3. Data and variables used

(1) Data used

▶ train data (.csv)

  • Rows x Columns: 4,446,966 x 29
  • Each row holds one player's statistics for a single match
  • Target variable: the player's final placement percentile (winPlacePerc)

 

(2) Data source

▶ Kaggle - PUBG Finish Placement Prediction (Kernels Only)

 

(3) Variables used

Data field: Details
DBNOs: Number of enemy players knocked.
assists: Number of enemy players this player damaged that were killed by teammates.
boosts: Number of boost items used.
damageDealt: Total damage dealt. Note: Self inflicted damage is subtracted.
headshotKills: Number of enemy players killed with headshots.
heals: Number of healing items used.
Id: Player's Id.
killPlace: Ranking in match of number of enemy players killed.
killPoints: Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”.
killStreaks: Max number of enemy players killed in a short amount of time.
kills: Number of enemy players killed.
longestKill: Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.
matchDuration: Duration of match in seconds.
matchId: ID to identify match. There are no matches that are in both the training and testing set.
matchType: String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.
rankPoints: Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API's next version, so use with caution. Value of -1 takes place of “None”.
revives: Number of times this player revived teammates.
rideDistance: Total distance traveled in vehicles measured in meters.
roadKills: Number of kills while in a vehicle.
swimDistance: Total distance traveled by swimming measured in meters.
teamKills: Number of times this player killed a teammate.
vehicleDestroys: Number of vehicles destroyed.
walkDistance: Total distance traveled on foot measured in meters.
weaponsAcquired: Number of weapons picked up.
winPoints: Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.
groupId: ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.
numGroups: Number of groups we have data for in the match.
maxPlace: Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.
winPlacePerc: The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.
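The field description does not spell out how winPlacePerc is computed. A commonly assumed reconstruction, not confirmed by the source, is a linear rescaling of the placement by maxPlace. The function name `win_place_perc` and the formula below are assumptions for illustration only.

```python
import math

def win_place_perc(place: int, max_place: int) -> float:
    """Assumed formula: 1.0 for 1st place, 0.0 when place == max_place."""
    if max_place <= 1:
        # With a single placement the ratio is 0/0, so the value is undefined.
        return math.nan
    return (max_place - place) / (max_place - 1)

print(win_place_perc(1, 50))   # first place
print(win_place_perc(50, 50))  # last place
print(win_place_perc(25, 49))  # middle of the field
```

Under this assumption a match with only one placement has no defined percentile, which would be consistent with a missing target value for such a match.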

 


4. Analysis process

(1) Basic setup

⚡ Import the required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
%matplotlib inline

# Apply the style first: plt.style.use() overrides the rcParams that the style
# defines (fivethirtyeight turns axes.grid on), so custom settings go after it.
plt.style.use("fivethirtyeight")
plt.rcParams['axes.unicode_minus'] = False
plt.rcParams['font.family'] = "Malgun Gothic"
plt.rcParams['axes.grid'] = False

color = sns.color_palette()
parameters = {'axes.labelsize': 10,
              'axes.titlesize': 15, 
              'figure.titlesize': 17, 
              'xtick.labelsize': 11, 
              'ytick.labelsize': 14, 
              'legend.fontsize': 12, 
              'legend.title_fontsize': 13}
plt.rcParams.update(parameters)

 

import warnings
warnings.filterwarnings('ignore')

 

(2) Exploring the data

⚡ Load the data

## Load the train data
train = pd.read_csv("./PUBG_train_V2.csv")

 

⚡ Inspect the data

## Inspect the data
train.head()
 
 

⚡ Check the row/column counts and missing values

## Check the number of rows and columns
print(train.shape)
# Count the missing values in each column
print(train.isnull().sum())

 

# Output
(4446966, 29)
Id                 0
groupId            0
matchId            0
assists            0
boosts             0
damageDealt        0
DBNOs              0
headshotKills      0
heals              0
killPlace          0
killPoints         0
kills              0
killStreaks        0
longestKill        0
matchDuration      0
matchType          0
maxPlace           0
numGroups          0
rankPoints         0
revives            0
rideDistance       0
roadKills          0
swimDistance       0
teamKills          0
vehicleDestroys    0
walkDistance       0
weaponsAcquired    0
winPoints          0
winPlacePerc       1
dtype: int64
 

⚡ Optimize memory usage

## Optimize memory usage
# Memory saving function credit to https://www.kaggle.com/gemartin/load-data-reduce-memory-usage
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    #start_mem = df.memory_usage().sum() / 1024**2
    #print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    #end_mem = df.memory_usage().sum() / 1024**2
    #print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    #print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df
train = reduce_mem_usage(train)
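To see what this kind of downcasting buys, and what it can cost, here is a small sketch on made-up values (the `toy` frame is hypothetical). It measures memory before and after a cast and also shows the rounding error that float16 introduces for large distances.

```python
import numpy as np
import pandas as pd

# Toy columns with the default 64-bit dtypes (values are made up).
toy = pd.DataFrame({
    "kills": np.array([0, 1, 3, 72], dtype=np.int64),
    "walkDistance": np.array([0.0, 151.3, 2048.7, 25780.0], dtype=np.float64),
})

before = toy.memory_usage(deep=True).sum()
small = toy.astype({"kills": np.int8, "walkDistance": np.float32})
after = small.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")

# float16 keeps only about three significant digits, so large values get rounded.
as_half = toy["walkDistance"].astype(np.float16)
max_err = (as_half.astype(np.float64) - toy["walkDistance"]).abs().max()
print("largest float16 rounding error:", max_err)
```

Since reduce_mem_usage may pick float16 whenever the value range fits, it can be worth checking whether that precision is acceptable for distance columns and for the target.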

 

(3) Checking distributions

⚡ Per-variable plots

  • The per-variable plots show that variables expected to have a meaningful effect on the target, such as "kills", "longestKill", "walkDistance", "rideDistance", and "swimDistance", have heavily skewed distributions
  • Outliers therefore need to be removed for these variables
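The skew noted above can also be put into numbers instead of being read off a plot. The sketch below uses synthetic, exponentially distributed values as a stand-in for a distance column (the data and the scale are assumptions, not the real distribution).

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed stand-in for a distance column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"walkDistance": rng.exponential(scale=1000.0, size=10_000)})

# Median, 99th percentile, and maximum show how far the tail stretches.
summary = toy["walkDistance"].quantile([0.5, 0.99, 1.0])
skewness = toy["walkDistance"].skew()
print(summary)
print("skewness:", skewness)
```

A large gap between the 99th percentile and the maximum is one simple signal that a threshold-based outlier rule may be reasonable.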

✍️ Distribution of players' kill counts

## Distribution of players' kill counts
plt.figure(figsize=(12,4))
sns.countplot(x='kills', data=train)
plt.title('Number of kills per player')
plt.show()

print("\n", "Mean number of kills:", train['kills'].mean())

 Mean number of kills: 0.9247833241810259

Interpretation

  • Most players have 0 kills; the rest are spread widely, from 1 kill up to 72 kills

 

✍️ Distribution of matchDuration

## Distribution of matchDuration (match length in seconds)
# Note: sns.distplot is deprecated since seaborn 0.11; sns.histplot is the suggested replacement
plt.figure(figsize=(12,4))
sns.distplot(x=train['matchDuration'], bins=10)
plt.title('Distribution of matchDuration')
plt.show()

Interpretation

  • Most players are in matches that last 1,000 seconds or longer

 

✍️ Distribution of players' teamKills

## Distribution of players' team-kill counts
plt.figure(figsize=(12,4))
sns.countplot(x='teamKills', data=train)
plt.title('Number of teamKills per player')
plt.ylim([0, 100000])   # limit the y-axis range
plt.show()

Interpretation

  • Most players never team-kill, but there are more than 80,000 cases with exactly one team kill
  • Possible reasons for team-killing:
    • Simply for fun
    • The teammate is playing too poorly (in other words, seems unlikely to help in the game...)

 

✍️ Distribution of longestKill

## Distribution of longestKill
plt.figure(figsize=(12,4))
sns.distplot(x=train['longestKill'], bins=10)
plt.title('Distribution of longestKill')
plt.xlim([0, 200])   # limit the x-axis range
plt.show()

Interpretation

  • Most values fall within 0 to 25 m, but there are occasional kills from very long range (outlier removal needed)

 

✍️ Distribution of walkDistance

## Distribution of walkDistance
plt.figure(figsize=(12,4))
sns.distplot(x=train['walkDistance'], bins=10)
plt.title('Distribution of walkDistance')
plt.show()

Interpretation

  • Most players walk roughly 0 to 2,500 m in a match, but some exceed 25,000 m (outlier removal needed)

 

✍️ Distribution of rideDistance

## Distribution of rideDistance
plt.figure(figsize=(12,4))
sns.distplot(x=train['rideDistance'], bins=10)
plt.title('Distribution of rideDistance')
plt.show()
 

 

✍️ Distribution of swimDistance

## Distribution of swimDistance
plt.figure(figsize=(12,4))
sns.distplot(x=train['swimDistance'], bins=10)
plt.title('Distribution of swimDistance')
plt.show()

Interpretation

  • Most players swim roughly 0 to 500 m, but some swim 3,500 m or more (outlier removal needed)

 

⚡ How many players join each match?

  • Most matches have 95 to 98 players
  • Matches generally run close to the 100-player maximum
# Count the players per "matchId" and add the result as a new column, playersJoined
train['playersJoined'] = train.groupby('matchId')['matchId'].transform('count')

# Plot only the matches with at least 75 players
plt.figure(figsize=(8,4))
sns.countplot(x=train[train['playersJoined']>=75]['playersJoined'])
plt.title('playersJoined')
plt.show()

 

 

⚡ Which mode has the highest win rate: solo, duo, or squad?

  • Overall, the win rate (winPlacePerc) tends to be highest for solo, then duo, then squad
# Check the match types
train['matchType'].value_counts()
# Output
matchType
squad-fpp           1756186
duo-fpp              996691
squad                626526
solo-fpp             536762
duo                  313591
solo                 181943
normal-squad-fpp      17174
crashfpp               6287
normal-duo-fpp         5489
flaretpp               2505
normal-solo-fpp        1682
flarefpp                718
normal-squad            516
crashtpp                371
normal-solo             326
normal-duo              199
Name: count, dtype: int64

 

# Approximate solo / duo / squad matches by the number of groups in each match
solo = train[train['numGroups'] > 50]
duo = train[(train['numGroups'] > 25) & (train['numGroups'] <= 50)]
squad = train[train['numGroups'] <= 25]

# Plot
f,ax1 = plt.subplots(figsize=(20,10))
sns.pointplot(x='kills', y='winPlacePerc', data=solo, color='black')
sns.pointplot(x='kills', y='winPlacePerc', data=duo, color='#CC0000')
sns.pointplot(x='kills', y='winPlacePerc', data=squad, color='#3399FF')
plt.text(37,0.6, 'Solo', color='black', fontsize=17, style='italic')
plt.text(37,0.55, 'Duo', color='#CC0000', fontsize=17, style='italic')
plt.text(37,0.5, 'squad', color='#3399FF', fontsize=17, style='italic')
plt.xlabel('Number of kills', fontsize=15, color='blue')
plt.ylabel('Win Percentage', fontsize=15, color='blue')
plt.title('Solo vs Duo vs Squad Kills', fontsize=20, color='blue')
plt.grid(True)  # show the grid
plt.show()
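The split above relies on numGroups thresholds. An alternative is to derive the mode from the matchType string itself. The sketch below is an illustration on a handful of matchType values taken from the listing above; the helper `team_mode` and the "other" bucket are assumptions, not part of the original analysis.

```python
import pandas as pd

toy = pd.DataFrame({
    "matchType": ["solo", "solo-fpp", "duo-fpp", "squad",
                  "normal-squad-fpp", "crashfpp", "flaretpp"],
})

def team_mode(match_type: str) -> str:
    """Map a matchType string to solo / duo / squad by keyword."""
    for mode in ("solo", "duo", "squad"):
        if mode in match_type:
            return mode
    return "other"  # event modes such as crashfpp or flaretpp

toy["teamMode"] = toy["matchType"].map(team_mode)
print(toy["teamMode"].value_counts())
```

The two approaches can disagree, for example when a squad match has many small groups, so it may be worth comparing them before drawing conclusions about each mode.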

 

 

⚡ Does the win rate go up with more kills?

  • In most cases, a higher kill count goes along with a higher win rate
# Scatter plot of kill count versus win rate
plt.figure(figsize=(6,4))
plt.scatter(x = train['winPlacePerc'], y = train['kills'], color="red", alpha=0.3)
plt.title('Scatter plot of kill count vs. win rate')
plt.show()
 
 

 

# Box plot by kill-count category
kills = train[['kills', 'winPlacePerc']].copy()  # copy so the new column is not added to a view of train
kills['killsCategories'] = pd.cut(kills['kills'], [-1, 0, 2, 5, 10, 80], labels=['0 kills','1-2 kills', '3-5 kills', '6-10 kills', '11+ kills'])

plt.figure(figsize=(8,4))
sns.boxplot(x="killsCategories", y="winPlacePerc", data=kills)
plt.show()

 

 

(4) Data preprocessing

⚡ Handling missing values

  • "winPlacePerc" has one missing value => on inspection it appears to come from a match with only a single player, so that row is removed
# "winPlacePerc" has one missing value
print("Number of missing values:", train['winPlacePerc'].isnull().sum())
# Output
Number of missing values: 1

# Inspect the row that contains the missing value
train[train['winPlacePerc'].isnull()]
# Drop the row with the missing target (index 2744604 in this data), then reset the index
train = train.dropna(subset=['winPlacePerc']).reset_index(drop=True)
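The claim that the missing target comes from a one-player match can be checked directly with the playersJoined count. This is a small sketch on made-up rows (the `toy` frame is hypothetical).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "matchId": ["m1", "m1", "m2"],
    "winPlacePerc": [1.0, 0.0, np.nan],
})

# Players per match, computed the same way as playersJoined.
toy["playersJoined"] = toy.groupby("matchId")["matchId"].transform("count")

# Look at the match size for every row with a missing target.
missing = toy[toy["winPlacePerc"].isnull()]
print(missing[["matchId", "playersJoined"]])
```

If playersJoined is 1 for the affected row, the placement percentile has nothing to be ranked against, which supports dropping the row.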

 

⚡ Creating derived variables

# Create a derived variable ("healsandboosts") that sums the heal and boost counts
train['healsandboosts'] = train['heals'] + train['boosts']
# Drop the original variables ("heals", "boosts")
train.drop(['heals', 'boosts'], axis=1, inplace=True)

 

⚡ Outlier detection

  • totalDistance: players who never moved during the match yet have 1 or more kills are treated as outliers and removed
  • roadKills: players with more than 10 road kills are treated as outliers and removed
  • longestKill, walkDistance, rideDistance, swimDistance: values of at least 1 km, 10 km, 20 km, and 2 km respectively are treated as outliers and removed
## Remove players who never moved but have 1 or more kills

# Create a variable for the total distance the player moved ("totalDistance")
train['totalDistance'] = train['rideDistance'] + train['walkDistance'] + train['swimDistance']

# Remove the outliers
train['killsWithoutMoving'] = ((train['kills'] > 0) & (train['totalDistance'] == 0))
train.drop(train[train['killsWithoutMoving']].index, inplace=True)

# Drop the helper variables that are no longer needed
train.drop(['totalDistance', 'killsWithoutMoving'], axis=1, inplace=True)
## Remove players with more than 10 "roadKills"
train.drop(train[train['roadKills'] > 10].index, inplace=True)
## Remove rows where "longestKill" is 1 km or more
train.drop(train[train['longestKill'] >= 1000].index, inplace=True)
## Remove rows where walk/ride/swim distance is at least 10 km, 20 km, and 2 km respectively
train.drop(train[train['walkDistance'] >= 10000].index, inplace=True)
train.drop(train[train['rideDistance'] >= 20000].index, inplace=True)
train.drop(train[train['swimDistance'] >= 2000].index, inplace=True)
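The same rules can also be expressed as a single boolean mask, which makes it easy to count how many rows each run removes. This is a sketch on a made-up toy frame with the thresholds used above, not a replacement for the original code.

```python
import pandas as pd

# Toy rows (values are made up); only row 0 satisfies every rule.
toy = pd.DataFrame({
    "kills":        [0, 2, 1, 0, 3],
    "roadKills":    [0, 0, 11, 0, 0],
    "longestKill":  [0.0, 50.0, 10.0, 0.0, 1200.0],
    "walkDistance": [100.0, 0.0, 500.0, 12000.0, 800.0],
    "rideDistance": [0.0, 0.0, 0.0, 0.0, 0.0],
    "swimDistance": [0.0, 0.0, 0.0, 0.0, 0.0],
})

total_distance = toy["walkDistance"] + toy["rideDistance"] + toy["swimDistance"]

# One flag per row: True if any outlier rule fires.
is_outlier = (
    ((toy["kills"] > 0) & (total_distance == 0))
    | (toy["roadKills"] > 10)
    | (toy["longestKill"] >= 1000)
    | (toy["walkDistance"] >= 10000)
    | (toy["rideDistance"] >= 20000)
    | (toy["swimDistance"] >= 2000)
)

cleaned = toy[~is_outlier]
print(is_outlier.sum(), "rows flagged;", len(cleaned), "rows kept")
```

Keeping the mask around also makes it possible to inspect the flagged rows before discarding them.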

 

# Size of the data left after removing all outliers
train.shape
# Output
(4445024, 29)

 

⚡ Encoding categorical variables

  • matchType: one-hot encoding
  • groupId, matchId: converted to the category type (to make use of the relationship between groups and matches)
## "matchType": one-hot encoding
# dtype=int yields 0/1 columns directly instead of boolean True/False
train = pd.get_dummies(train, columns=['matchType'], dtype=int)

 

## "groupId", "matchId": convert to the category type, then encode as numeric codes
train['groupId'] = train['groupId'].astype('category')
train['matchId'] = train['matchId'].astype('category')
train['groupId_cat'] = train['groupId'].cat.codes
train['matchId_cat'] = train['matchId'].cat.codes

# Drop the original variables
train.drop(['groupId', 'matchId'], axis=1, inplace=True)

# Check that the encoding worked
train[['groupId_cat', 'matchId_cat']].head()
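For reference, here is what `cat.codes` produces on a tiny made-up series (the ID strings are hypothetical). Each distinct string gets an integer based on the sorted order of the categories.

```python
import pandas as pd

ids = pd.Series(["b7", "a3", "b7", "c1"])

# Categories are sorted (a3, b7, c1), so the codes are their positions.
codes = ids.astype("category").cat.codes
print(codes.tolist())  # [1, 0, 1, 2]
```

The codes are arbitrary labels rather than quantities, so a tree model can split on them, but their numeric order carries no meaning of its own.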

 

⚡ Removing unneeded variables

## Drop the player ID column
print("Number of unique Id values:", train['Id'].nunique())
train.drop("Id", axis=1, inplace=True)
# Output
Number of unique Id values: 4445024

 

(5) Training and evaluation

## Packages for machine learning
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor

## Sample the data for debugging
sample = 500000
df_sample = train.sample(sample)

## Store the target variable separately
y = df_sample['winPlacePerc']
df = df_sample.drop('winPlacePerc', axis=1)

 

⚡ Splitting the data for validation

## Function that splits the data into train and valid parts
def split_vals(a, n:int):
    return a[:n].copy(), a[n:].copy()
val_perc = 0.12
n_valid = int(val_perc * sample)
n_trn = len(df) - n_valid

# Split
raw_train, raw_valid = split_vals(df_sample, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

 

# Check
print('train:', X_train.shape, 'target:', y_train.shape, 'validation:', X_valid.shape)
# Output
train: (440000, 42) target: (440000,) validation: (60000, 42)

 

⚡ Evaluation metric (MAE)

## Function that prints the evaluation metric (MAE)
def print_score(m: RandomForestRegressor):
    # mean_absolute_error expects (y_true, y_pred); uses the global train/valid splits
    res = ['mae train:', mean_absolute_error(y_train, m.predict(X_train)),
           'mae val:', mean_absolute_error(y_valid, m.predict(X_valid))]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)
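MAE is simply the average absolute gap between the true and predicted values. The sketch below computes it by hand on made-up numbers and compares the result with scikit-learn's `mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true and predicted winPlacePerc values.
y_true = np.array([1.0, 0.5, 0.0])
y_pred = np.array([0.9, 0.7, 0.1])

# Mean of |y_true - y_pred|, computed by hand.
manual = np.abs(y_true - y_pred).mean()
print(manual, mean_absolute_error(y_true, y_pred))
```

Because winPlacePerc lies between 0 and 1, an MAE of 0.06 can be read as being off by about 6 percentile points on average.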

 

⚡ Basic RF Model 1

## Train the basic model
rf = RandomForestRegressor(n_estimators=50,
                           min_samples_leaf=3,
                           max_features='sqrt',
                           n_jobs=-1)
rf.fit(X_train, y_train)
print_score(rf)
# Output
['mae train:', 0.041673329186753184, 'mae val:', 0.06266519503048588]

 

⚡ Feature Importance

## Check the feature importances of the basic model
rf_feature_importance = pd.DataFrame(rf.feature_importances_, X_train.columns, columns=['Feature Importance'])
# Sort by feature importance in descending order
rf_feature_importance = rf_feature_importance.sort_values('Feature Importance', ascending=False)

# Plot the feature importances
plt.figure(figsize=(18,9))
sns.barplot(x='Feature Importance', y=rf_feature_importance.index, orient='h', data=rf_feature_importance)
plt.title("Feature Importance of RF", size=20)

plt.xticks(size=15)
plt.yticks(size=15)
plt.xlabel('feature importance', size=20)
plt.ylabel('columns', size=20)
plt.show()

 

 

⚡ RF Model 2

  • Parameter change: n_estimators 50 -> 80
## Train after adjusting the parameter
rf2 = RandomForestRegressor(n_estimators=80,
                           min_samples_leaf=3,
                           max_features='sqrt',
                           n_jobs=-1)
rf2.fit(X_train, y_train)
print_score(rf2)
# Output
['mae train:', 0.04123858030265405, 'mae val:', 0.06211411133651147]

 

## Check the feature importances of RF Model 2
rf_feature_importance2 = pd.DataFrame(rf2.feature_importances_, X_train.columns, columns=['Feature Importance'])
# Sort by feature importance in descending order
rf_feature_importance2 = rf_feature_importance2.sort_values('Feature Importance', ascending=False)

# Plot the feature importances
plt.figure(figsize=(18,9))
sns.barplot(x='Feature Importance', y=rf_feature_importance2.index, orient='h', data=rf_feature_importance2)
plt.title("Feature Importance of RF", size=20)

plt.xticks(size=15)
plt.yticks(size=15)
plt.xlabel('feature importance', size=20)
plt.ylabel('columns', size=20)
plt.show()

 

 

⚡ Correlation

## Keep only the variables with Feature Importance > 0.05
df_keep = df[rf_feature_importance2[rf_feature_importance2['Feature Importance'] > 0.05].index].copy()
X_train, X_valid = split_vals(df_keep, n_trn)

 

 
## Check the correlations with a heatmap
corr = df_keep.corr()
plt.figure(figsize=(10, 7))
sns.heatmap(corr, cmap="Greens", annot=True, linewidths=0.5, fmt=".3f", cbar = True)
plt.show()

 

 

(6) Final RF model

## Split into train and valid data
val_perc_full = 0.2
n_valid_full = int(val_perc_full * len(train))
n_trn_full = len(train) - n_valid_full

# Separate X and y
y = train['winPlacePerc']
df_full = train.drop('winPlacePerc', axis=1)
# df_full = df_full[to_keep]

# Split
X_train, X_valid = split_vals(df_full, n_trn_full)
y_train, y_valid = split_vals(y, n_trn_full)

# Check
print('train:', X_train.shape, 'target:', y_train.shape, 'validation:', X_valid.shape)
# Output
train: (3556020, 42) target: (3556020,) validation: (889004, 42)

 

## Train the final RF model
rf_final = RandomForestRegressor(n_estimators=80,
                                 min_samples_leaf=3,
                                 max_features='sqrt',
                                 n_jobs=-1)
rf_final.fit(X_train, y_train)
print_score(rf_final)
# Output
['mae train:', 0.0394811623200244, 'mae val:', 0.058680613710969526]
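One thing to keep in mind with a row-based split is that players from the same match can land in both the train and validation parts, while the dataset description says no match appears in both the training and testing sets. A match-level split mirrors that setup more closely. The sketch below uses scikit-learn's `GroupShuffleSplit` on a made-up toy frame; it is an alternative to consider, not what the analysis above does.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy rows (values are made up); matchId_cat plays the role of the match key.
toy = pd.DataFrame({
    "matchId_cat": [0, 0, 0, 1, 1, 2, 2, 2, 3, 3],
    "kills": [0, 1, 2, 0, 3, 1, 0, 0, 4, 1],
    "winPlacePerc": [0.1, 0.5, 1.0, 0.0, 1.0, 0.6, 0.2, 0.0, 1.0, 0.3],
})

# Hold out whole matches: every row of a match goes to the same side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, valid_idx = next(splitter.split(toy, groups=toy["matchId_cat"]))

train_matches = set(toy.iloc[train_idx]["matchId_cat"])
valid_matches = set(toy.iloc[valid_idx]["matchId_cat"])
print("train matches:", train_matches, "valid matches:", valid_matches)
```

With a split like this the validation MAE may come out somewhat higher than the row-based figure, since the model can no longer see other players from the same match during training.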

 


References

Kaggle | PUBG Data Exploration + RF (+ Funny GIFs)

Analysis code

Github | PUBG_Battlegrounds_game_data_analysis