PUBG Data Analysis

Introduction

I am a gamer, so for my final project I would like to analyze some features of a game called PlayerUnknown’s Battlegrounds (PUBG) using the skills I learned in data science. PUBG is a survival game that combines shooting, driving, and running. It is fascinating and has made many people fall in love with it, including me. However, I am not very good at it, so I want to find specific strategies, from a data science standpoint, that can help me and other players become pro players. I will analyze solo mode for this project. The objective of the game is to be the last player alive. When a match starts, 100 players join a game map. They can loot equipment, including weapons, from buildings. A player with better equipment and weapons has a higher chance of survival. The map has a shrinking circle: every player has to stay inside the circle or die, which increases the chance that players encounter each other. The circle keeps getting smaller, and players have to kill others to get items and supplies and survive to the last circle.

Data Preparation

The main data set I used is from Kaggle: PUBG Finish Placement Prediction at https://www.kaggle.com/c/pubg-finish-placement-prediction/data. I also scraped and cleaned another data set from https://pubg.op.gg/weapons, which provides a full analysis of the weapons used in PUBG. Finally, I downloaded kill stats from https://www.kaggle.com/skihikingkevin/pubg-match-deaths#deaths.zip in order to compare weapon effects.

Explanation of Kaggle Data

Website: https://www.kaggle.com/c/pubg-finish-placement-prediction/data

In a PUBG game, up to 100 players start in each match (matchId). Players can be on teams (groupId) which get ranked at the end of the game (winPlacePerc) based on how many other teams are still alive when they are eliminated. In the game, players can pick up different munitions, revive downed-but-not-out (knocked) teammates, drive vehicles, swim, run, shoot, and experience all of the consequences, such as falling too far or running themselves over and eliminating themselves.

Each row contains one player’s post-game stats. The data comes from matches of all types: solos, duos, squads, and custom; there is no guarantee of there being 100 players per match, nor at most 4 players per group.

See Appendix for fields description.

Kill Match Stats Explanation

These are stats about player kills. Each row specifies where the killer and the victim were, using a coordinate system, and which weapon killed the victim.

See Appendix for fields description.

OP.GG Data Explanation

Website: https://pubg.op.gg/weapons

This is the weapon statistics by performance data scraped from the op.gg website. Weapons are among the most important equipment in PUBG and can affect a player’s survival rate. Each weapon has its own performance data in areas such as damage, rate of fire, reload duration, and body hit impact. The other table I scraped is the weapon statistics by usage pattern, which describes how much players prefer each weapon in PUBG.

See Appendix for fields description.

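#load the packages used for scraping and tidying
library(rvest)
library(stringr)
library(stringi)
library(tidyr)
library(dplyr)
library(readr)

#scraping the weapon data from op.gg
url<-"https://pubg.op.gg/weapons"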

url.scrap <- read_html(url)
items <- html_nodes(x   = url.scrap, css = "li") %>% html_text()
items<-items[1:124]

#data cleaning
preference_items<-items[49:86]%>%
  str_remove_all(" ")%>%
  stri_replace_all_regex("\n"," ")

performance_items<-items[87:124]%>%
  str_remove_all(" ")%>%
  stri_replace_all_regex("\n"," ")

#tidy
string<-gsub("\\s+", " ", str_trim(preference_items))
string<-str_sub(string,start=1L,end=-2L)
string<-str_replace_all(string,"%","")
#row 38 (DP-28) is set by hand; the original gsub() call had its arguments in the wrong order
string[38]<-"DP-28 0.00 34.21 0.19 34.64"

#as data frame
tab<-as.data.frame(string)
preference_tab<-tab%>%
  separate(string,c("weapon","K_D","preference","spawn_rate","avg_kill_dist")," ")

#replace - and 0.00 as NAs
preference_tab[preference_tab=="-"]<-NA
preference_tab[preference_tab=="0.00"]<-NA

#convert columns' type
preference_tab<-preference_tab%>%
  type_convert(col_types=cols(K_D = col_double(),
                              preference = col_double(),
                              spawn_rate = col_double(),
                              avg_kill_dist = col_double()))


#divide preference and spawn_rate by 100 to turn percentages into proportions
preference_tab$preference<-preference_tab$preference/100
preference_tab$spawn_rate<-preference_tab$spawn_rate/100

head(preference_tab)

##   weapon  K_D preference spawn_rate avg_kill_dist
## 1    AWM 0.46     0.8611         NA        110.28
## 2    AUG 0.81     0.8589         NA         30.72
## 3  Groza 0.91     0.8534         NA         29.77
## 4    M24 0.25     0.8490         NA        107.35
## 5   Mk14 0.56     0.8372         NA         99.44
## 6   M249 0.86     0.8179         NA         33.98
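The tidying steps above boil down to splitting each cleaned string on spaces, treating "-" as missing, and turning percentages into proportions. Here is a minimal base R sketch of that idea. The two input rows and the helper name `parse_row` are illustrative assumptions, not the actual scraped strings or part of the original pipeline.

```r
# Hypothetical rows, shaped like the cleaned op.gg strings (assumed values)
rows <- c("AWM 0.46 86.11 - 110.28",
          "P92 0.08 61.36 18.56 14.11")

# parse_row is a hypothetical helper: one cleaned string -> one-row data frame
parse_row <- function(s) {
  parts <- strsplit(s, " ", fixed = TRUE)[[1]]
  parts[parts == "-"] <- NA                       # "-" marks a missing value
  data.frame(
    weapon        = parts[1],
    K_D           = as.numeric(parts[2]),
    preference    = as.numeric(parts[3]) / 100,   # percent -> proportion
    spawn_rate    = as.numeric(parts[4]) / 100,   # percent -> proportion
    avg_kill_dist = as.numeric(parts[5]),
    stringsAsFactors = FALSE
  )
}

parsed <- do.call(rbind, lapply(rows, parse_row))
print(parsed)
```

Doing the split by hand makes it easy to see which field ends up in which column before relying on `separate()`.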

Similarly, I will tidy the weapon performance data scraped from op.gg.

string<-gsub("\\s+", " ", str_trim(performance_items))

p_tab<-as.data.frame(string)
performance_tab<-p_tab%>%
  separate(string,c("weapon","damage","rate_fire","reload_duration","body_hit_impact")," ")%>%
  separate(rate_fire,c("rate_fire","tmp"),"s")%>%
  separate(reload_duration,c("reload_duration","tmp"),"s")

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [2].

performance_tab$tmp<-NULL
performance_tab[performance_tab=="-"]<-NA

performance_tab<-performance_tab[,c("weapon","damage","rate_fire","reload_duration","body_hit_impact")]

performance_tab<-performance_tab%>%
  type_convert(col_types=cols( damage = col_integer()))%>%
  type_convert(col_types=cols( rate_fire = col_double()))%>%
  type_convert(col_types=cols( reload_duration = col_double()))%>%
  type_convert(col_types=cols( body_hit_impact= col_integer()))

head(performance_tab)

##     weapon damage rate_fire reload_duration body_hit_impact
## 1      AWM    105      1.85           4.600           40000
## 2 Crossbow    105        NA           3.550            8000
## 3      M24     79      1.80           4.200           20000
## 4   Kar98K     75      1.90           4.000           16000
## 5    Win94     66      0.60           4.000           16000
## 6     Mk14     61      0.09           3.683           20000

We can also join the two tables by weapon name to make a more sophisticated analysis.

weapon_tab<-inner_join(preference_tab,performance_tab,by="weapon")
head(weapon_tab)

##   weapon  K_D preference spawn_rate avg_kill_dist damage rate_fire
## 1    AWM 0.46     0.8611         NA        110.28    105     1.850
## 2    AUG 0.81     0.8589         NA         30.72     43     0.086
## 3  Groza 0.91     0.8534         NA         29.77     49     0.080
## 4    M24 0.25     0.8490         NA        107.35     79     1.800
## 5   Mk14 0.56     0.8372         NA         99.44     61     0.090
## 6   M249 0.86     0.8179         NA         33.98     45     0.075
##   reload_duration body_hit_impact
## 1           4.600           40000
## 2           3.667            9000
## 3           3.000           10000
## 4           4.200           20000
## 5           3.683           20000
## 6           8.200           10000
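An inner join keeps only the weapons that appear in both tables, so it is worth checking which weapons were dropped. The sketch below uses base R `merge()` as a stand-in for `inner_join()`, on two tiny made-up tables; the rows are assumptions for illustration only.

```r
# Toy versions of the two tables (assumed rows, not the scraped data)
pref <- data.frame(weapon = c("AWM", "AUG", "DP-28"),
                   K_D    = c(0.46, 0.81, NA),
                   stringsAsFactors = FALSE)
perf <- data.frame(weapon = c("AWM", "AUG", "Crossbow"),
                   damage = c(105L, 43L, 105L),
                   stringsAsFactors = FALSE)

# merge() performs an inner join by default (all = FALSE)
joined <- merge(pref, perf, by = "weapon")

# Weapons that exist in only one of the tables are silently dropped
dropped <- setdiff(union(pref$weapon, perf$weapon), joined$weapon)

print(joined)
print(dropped)
```

If a weapon you care about shows up in `dropped`, a left or full join (`all.x = TRUE` or `all = TRUE` in `merge()`) would keep it with missing values instead.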

Let’s look at the missingness of the data we scraped from op.gg.

library(Amelia) #provides missmap()
missmap(weapon_tab)

Most of the missing data comes from spawn rates. This is due to the nature of the game: some weapons can only be acquired from airdrops, which means they do not spawn on the map. Other weapons are so new that op.gg may not have collected data for them yet. Since we will not use spawn rate, the missing data has no effect on our analysis.
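The missingness map can be backed up with plain counts. This is a small base R sketch on a made-up table (the three rows are assumptions), showing how many values are missing per column.

```r
# Toy table with the same kind of gaps as weapon_tab (assumed rows)
toy <- data.frame(weapon     = c("AWM", "AUG", "P92"),
                  K_D        = c(0.46, 0.81, 0.08),
                  spawn_rate = c(NA, NA, 0.1856),
                  stringsAsFactors = FALSE)

na_counts <- colSums(is.na(toy))    # number of missing values per column
na_share  <- colMeans(is.na(toy))   # share of missing values per column

print(na_counts)
print(round(na_share, 2))
```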

Import Kaggle Data

We already have the weapon data we need. Now let’s import the match data set downloaded from the Kaggle website. We will use train_V2 for our analysis and leave the test data for the machine learning part. These files are big, so they might take a while to load. Note that Id, matchId, and groupId are encoded as hashes, so don’t bother decoding them; just leave them as they are. The data covers both solo and group modes, and we will analyze solo mode.

train_V2 <-read_csv("~/Desktop/CMSC/CMSC320/train_V2.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Id = col_character(),
##   groupId = col_character(),
##   matchId = col_character(),
##   matchType = col_character()
## )

## See spec(...) for full column specifications.

test_V2 <-read_csv("~/Desktop/CMSC/CMSC320/test_V2.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Id = col_character(),
##   groupId = col_character(),
##   matchId = col_character(),
##   matchType = col_character()
## )
## See spec(...) for full column specifications.

#solo data
col<-c(1,2,3,4,5,6,7,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28,29)
col2<-c(1,2,3,4,5,6,7,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28)
solo_tab<-subset(train_V2[col],matchType=="solo"|matchType=="solo-fpp"|matchType=="normal-solo"|matchType=="normal-solo-fpp")
solo_tab<-solo_tab%>%
  sample_n(202013)

Now we have solo_tab as the solo match data.

#drop some unused columns and nas
solo_tab$matchType<-NULL
solo_tab$DBNOs<-NULL
#keep a copy for the prediction step (note: these are the same rows as the training sample)
solo_test<-solo_tab
solo_tab<-na.omit(solo_tab)
head(solo_tab)

## # A tibble: 6 x 25
##   Id    groupId matchId assists boosts damageDealt heals killPlace
##   <chr> <chr>   <chr>     <dbl>  <dbl>       <dbl> <dbl>     <dbl>
## 1 b7af… 8c51d4… 1c7173…       0      0         0       0        59
## 2 d4bd… 99e408… 0d3987…       0      0       198       0        47
## 3 6ce7… 0756bc… 5af846…       0      0        50.8     0        75
## 4 3199… 624f87… 8efcf8…       0      0        56.8     0        84
## 5 c960… 708147… f3391f…       0      1       168.      3        24
## 6 ea1f… 0d241b… acbf07…       0      5       651.      5         4
## # … with 17 more variables: killPoints <dbl>, kills <dbl>,
## #   killStreaks <dbl>, longestKill <dbl>, matchDuration <dbl>,
## #   maxPlace <dbl>, numGroups <dbl>, rankPoints <dbl>, rideDistance <dbl>,
## #   roadKills <dbl>, swimDistance <dbl>, teamKills <dbl>,
## #   vehicleDestroys <dbl>, walkDistance <dbl>, weaponsAcquired <dbl>,
## #   winPoints <dbl>, winPlacePerc <dbl>
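The solo filter above compares matchType against four values joined with `|`. An equivalent and shorter way to write this is `%in%`, sketched here in base R on a made-up table; the rows are assumptions for illustration.

```r
# The four solo match types used in the filter above
solo_types <- c("solo", "solo-fpp", "normal-solo", "normal-solo-fpp")

# Toy match table (assumed rows)
toy_matches <- data.frame(
  Id        = c("a", "b", "c", "d"),
  matchType = c("solo", "duo-fpp", "normal-solo-fpp", "squad"),
  stringsAsFactors = FALSE
)

# Keep only rows whose matchType is one of the solo types
toy_solo <- toy_matches[toy_matches$matchType %in% solo_types, ]
print(toy_solo)
```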

Import Kill Stats

kill_stat<-read_csv("~/Desktop/kill_stats.csv")

## Parsed with column specification:
## cols(
##   killed_by = col_character(),
##   killer_name = col_character(),
##   killer_placement = col_double(),
##   killer_position_x = col_double(),
##   killer_position_y = col_double(),
##   map = col_character(),
##   match_id = col_character(),
##   time = col_double(),
##   victim_name = col_character(),
##   victim_placement = col_double(),
##   victim_position_x = col_double(),
##   victim_position_y = col_double()
## )

head(kill_stat)

## # A tibble: 6 x 12
##   killed_by killer_name killer_placement killer_position… killer_position…
##   <chr>     <chr>                  <dbl>            <dbl>            <dbl>
## 1 Grenade   KrazyPortu…                5          657725.          146275.
## 2 SCAR-L    nide2Bxiao…               31           93091.          722236.
## 3 S686      Ascholes                  43          366921.          421624.
## 4 Down and… Weirdo7777                 9          472014.          313275.
## 5 M416      Solayuki1                  9          473358.          318340.
## 6 Punch     xuezhiqian…               26          721944.          359575.
## # … with 7 more variables: map <chr>, match_id <chr>, time <dbl>,
## #   victim_name <chr>, victim_placement <dbl>, victim_position_x <dbl>,
## #   victim_position_y <dbl>

Exploratory Data Analysis

First, we know that weapons are one of the most important resources for surviving in the game, so we plot the weapon data first to see which weapons are favored by players. I assume higher-damage weapons have higher Kill/Death ratios, so a player has a higher chance of winning the game with a better weapon. However, I need to fit a regression model and do a hypothesis test to verify such a relation. Let’s first plot the top 20 favored weapons.

library(ggplot2)
library(ggrepel) #provides geom_label_repel()

preference_tab%>%
  arrange(desc(preference))%>%
  top_n(20,preference)%>% #rank by preference; without wt, top_n() uses the last column
  ggplot(mapping=aes(x = reorder(weapon,+preference),y = preference))+
  geom_col(fill = "Navy") +
  geom_label_repel(aes(label = preference), size = 2.5) +
  coord_flip() +
  labs(title = "Top 20 Weapon Picks",
       x = "Weapons",
       y = "Preference")

#now plot kill match stats to see which weapons kill the most
kill_stat%>%
  filter(killed_by!="Down and Out"&killed_by!="Bluezone"&killed_by!="Falling"&killed_by!="Hit by Car")%>%
  group_by(killed_by)%>%
  summarize(sum=n())%>%
  arrange(desc(sum))%>%
  top_n(20)%>%
  ggplot(aes(x=reorder(killed_by,+sum),y=sum))+
  geom_bar(stat="identity")+
  coord_flip()+
  labs(title = "Top 20 Weapon Kill",
       x = "Weapons",
       y = "Sum")

## Selecting by sum

The weapon kill plot matches our weapon preference plot: some high-preference weapons, such as M416, SCAR-L, AKM, and M16A4, also rank high in kills. However, some weapons have a high preference rate but do not kill that many people. Let’s see what these weapons are.

weapon_tab%>%
  ggplot(aes(x=preference,y=K_D))+
  geom_point()+
  labs(title="Preference Vs Kill/Death",
       x= "Preference",
       y = "Kill/Death")

## Warning: Removed 2 rows containing missing values (geom_point).

We can see that these weapons’ preference rates are around 60% to 75%. Let’s find out more about them.

weapon_range<-function(x){
  if(is.na(x)) return(NA) #check for missing values before comparing
  else if(x<20) return("short range")
  else if(x<50) return("middle range") #a distance of exactly 20 now counts as middle range
  else return("long range")
}
#add weapon range
preference_tab<-preference_tab%>%
  mutate(weapon_range=sapply(avg_kill_dist,weapon_range))

weapon_tab<-weapon_tab%>%
  mutate(weapon_range=sapply(avg_kill_dist,weapon_range))

weapon_tab%>%
  filter(preference>=0.6 & preference <=0.75)%>%
  filter(K_D<0.25)

##      weapon  K_D preference spawn_rate avg_kill_dist damage rate_fire
## 1      P18C 0.05     0.7124     0.0126         17.02     23     0.060
## 2 Sawed-off 0.07     0.7080     0.0480         11.87     20     0.250
## 3  Skorpion 0.07     0.6808         NA         16.31     22     0.071
## 4     R1895 0.04     0.6288     0.0184         20.28     55     0.400
## 5     P1911 0.10     0.6145     0.1916         14.94     41     0.110
## 6       P92 0.08     0.6136     0.1856         14.11     35     0.135
##   reload_duration body_hit_impact weapon_range
## 1            2.00            7000  short range
## 2            4.00            8000  short range
## 3            3.10            5000  short range
## 4            0.75            8000 middle range
## 5            2.10            6000  short range
## 6            2.00            7000  short range
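The weapon_range labels above can also be produced for a whole column at once with `cut()`, which avoids the `sapply()` loop and handles missing distances on its own. This is a sketch, not the code used for the table above: the helper name `range_label` is hypothetical, and it assumes that a distance of exactly 20 counts as middle range and exactly 50 as long range.

```r
# range_label is a hypothetical vectorized alternative to weapon_range()
range_label <- function(dist) {
  as.character(cut(dist,
                   breaks = c(-Inf, 20, 50, Inf),
                   labels = c("short range", "middle range", "long range"),
                   right  = FALSE))   # intervals are [low, high)
}

# Distances taken from the tables above, plus one missing value
labels <- range_label(c(17.02, 20.28, 110.28, NA))
print(labels)
```

With `right = FALSE` the boundaries belong to the upper bin; switching to `right = TRUE` would move exactly 20 into the short range instead.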

Weapons with high preference but low Kill/Death are mostly short-range weapons with a high spawn rate. There is one middle-range weapon, the R1895, which has a low K/D of 0.04. Combined with the kill stats plot, M416 is by far the most reliable weapon and is highly preferred by players. So I suggest that new players use M416 when they have the chance, to improve their win placement in a match.

Regression Analysis

Let’s now analyze the relationship between preference and both average kill distance and weapon damage, separately for each weapon range, using regression.

#making plot first
preference_tab%>%
  ggplot(aes(x = avg_kill_dist, y = preference,color=weapon_range))+
  geom_point()+
  geom_smooth(method=lm)+
  labs(title = "Preference vs. Average Kill Distance",
       x = "Average Kill Dist in m",
       y = "Preference")

Surprisingly, middle-range weapons behave differently from the others. From the plot, we see that short-range weapons are more preferred as their kill distance gets longer. For middle-range weapons, preference gets lower as kill distance gets longer. For long-range weapons, preference gets higher as the range gets larger. So we might need a different regression for each of the 3 weapon ranges.

1. Short-range Weapon

#fit a linear model with these attributes
short<-weapon_tab%>%
  filter(weapon_range=="short range")
fit_short<-lm(preference~avg_kill_dist+damage,data=short)
summary(fit_short)

## 
## Call:
## lm(formula = preference ~ avg_kill_dist + damage, data = short)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.137187 -0.042616 -0.008203  0.032083  0.155242 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    0.277807   0.109905   2.528  0.02999 * 
## avg_kill_dist  0.037573   0.009836   3.820  0.00337 **
## damage        -0.008552   0.003394  -2.519  0.03042 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09709 on 10 degrees of freedom
## Multiple R-squared:  0.5942, Adjusted R-squared:  0.5131 
## F-statistic: 7.323 on 2 and 10 DF,  p-value: 0.011

From the statistical analysis, we can see that the p values of both average kill distance (0.003) and damage (0.030) are below 5%. Therefore, we can reject the null hypothesis that average kill distance has no relationship with preference for short-range weapons: preference rises as average kill distance gets longer. Damage is also significant at the 5% level, with a negative coefficient, so among short-range weapons higher damage goes with slightly lower preference.
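Reading p values off the printed summary is error-prone, so it can help to pull them out of the fitted model directly. The sketch below does this with `coef(summary(fit))` on simulated data; the data, the seed, and the effect sizes are all assumptions, chosen only so that the example runs on its own.

```r
set.seed(42)

# Simulated stand-in for the short-range table (assumed values)
n   <- 30
toy <- data.frame(avg_kill_dist = runif(n, 5, 20),
                  damage        = runif(n, 20, 60))
toy$preference <- 0.3 + 0.03 * toy$avg_kill_dist + rnorm(n, sd = 0.05)

fit <- lm(preference ~ avg_kill_dist + damage, data = toy)

# Coefficient matrix: Estimate, Std. Error, t value, Pr(>|t|)
coef_tab <- coef(summary(fit))
p_values <- coef_tab[, "Pr(>|t|)"]

# Names of the terms that are significant at the 5% level
significant <- names(p_values)[p_values < 0.05]
print(round(p_values, 4))
print(significant)
```

The same three lines after the fit work unchanged on fit_short, fit_middle, and fit_long.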

2. Middle-range Weapon

middle<-weapon_tab%>%
  filter(weapon_range=="middle range")
fit_middle<-lm(preference~avg_kill_dist+damage,data=middle)
summary(fit_middle)

## 
## Call:
## lm(formula = preference ~ avg_kill_dist + damage, data = middle)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.19075 -0.11991 -0.03034  0.05397  0.28395 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.914358   0.211842   4.316 0.000838 ***
## avg_kill_dist -0.006756   0.006215  -1.087 0.296808    
## damage        -0.002934   0.002955  -0.993 0.338792    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1788 on 13 degrees of freedom
## Multiple R-squared:  0.1833, Adjusted R-squared:  0.05769 
## F-statistic: 1.459 on 2 and 13 DF,  p-value: 0.2681

From this statistical analysis, the p values for damage and average kill distance are both too high. So we fail to reject the null hypothesis that, for middle-range weapons, average kill distance and damage have no relation with preference. Based on the kill stats plot above, M416 is still the most effective middle-range weapon.

3. Long-range Weapon

long<-weapon_tab%>%
  filter(weapon_range=="long range")
fit_long<-lm(preference~avg_kill_dist+damage,data=long)
#summarize the long-range model
summary(fit_long)

From the statistical analysis for long-range weapons, we find that damage has a strong relationship with preference, since its p value < 5%. We can reject the null hypothesis and conclude that as a long-range weapon’s damage gets higher, its preference gets higher. This is understandable because long-range weapons are often used by snipers, who favor weapons that can kill from hundreds of meters away with just one shot.

Weapons can be very useful when playing PUBG. However, there are some other factors that could determine the win placement in this game. We will analyze Kaggle data next.

Heat Death Map

In this part, I will generate a heat map so players can see where battles take place. Newbie players should avoid these areas of frequent fighting in order to increase their survival rate. We will use Erangel as an example.

library(jpeg)
library(grid) #provides rasterGrob() and unit()
my_image=readJPEG("~/Desktop/erangel.jpg")
set.seed(1)

erangel<-kill_stat%>%
  filter(map=="ERANGEL")%>%
  mutate(pos_x = victim_position_x*1000/800000,
         pos_y = victim_position_y*1000/800000)

erangel%>%
  ggplot(aes(x=pos_x,y=pos_y))+
  annotation_custom(rasterGrob(my_image, width=unit(1,"npc"), height=unit(1,"npc")),-Inf, Inf, -Inf, Inf) +
  stat_bin2d(alpha = 0.7, bins = 100) +
  scale_x_continuous(expand = c(0,0)) +
  scale_y_reverse(expand = c(0,0)) +
  scale_fill_gradient(low = "PURPLE", high = "ORANGE", name = "the Number of Victims", labels = scales::comma, trans = "log10") +
  labs(title="Erangel Death Heat Map",
       x="Position X of Victim",y= "Position Y of Victim")

From the heat map, we can see that areas around cities have a high chance of encounters with other players, and therefore a high death rate. Players fight each other especially often at the center of the map, while fights are less likely at the corners. If you want to survive until the last minute, you should avoid areas with a high heat level, such as the Military Base.
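The heat map rests on two steps: rescaling raw coordinates (0 to 800000 in the code above) to a 0 to 1000 plotting range, and counting victims per grid cell. Here is a base R sketch of both steps on four made-up positions. The coordinates, the helper name `rescale_pos`, and the coarse 10 x 10 grid are assumptions; the plot above uses 100 bins per axis.

```r
# rescale_pos is a hypothetical helper mirroring victim_position * 1000 / 800000
rescale_pos <- function(raw, raw_max = 800000, plot_max = 1000) {
  raw * plot_max / raw_max
}

# Made-up victim positions in raw map units
vx <- c(0, 410000, 800000, 420000)
vy <- c(0, 405000, 800000, 415000)

px <- rescale_pos(vx)
py <- rescale_pos(vy)

# Count victims per cell on a 10 x 10 grid
breaks <- seq(0, 1000, by = 100)
bins <- table(cut(px, breaks, include.lowest = TRUE),
              cut(py, breaks, include.lowest = TRUE))

hot_cell <- which(bins == max(bins), arr.ind = TRUE)
print(max(bins))   # victims in the busiest cell
print(hot_cell)    # row/column index of that cell
```

`stat_bin2d()` performs the same kind of counting internally before mapping the counts to the fill color.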

Machine learning

The last part of this tutorial is to predict win placement with machine learning methods using the Kaggle PUBG data.

Correlations

To find out which other variables might affect win placement, we first generate a correlation plot. Then we can use these variables to build our machine learning model.

library(corrplot)
solo_cor<-cor(solo_tab[4:25])
corrplot(solo_cor, type = "upper", tl.col = "black", tl.srt = 45,tl.cex=0.8)
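As a numeric complement to the correlation plot, the features can be ranked by the absolute value of their correlation with winPlacePerc. The sketch below runs on simulated data; the column values, the seed, and the effect sizes are assumptions, and only the column names follow the Kaggle fields.

```r
set.seed(1)

# Simulated stand-in for solo_tab (assumed values)
n   <- 200
toy <- data.frame(walkDistance = runif(n, 0, 4000),
                  boosts       = rpois(n, 2),
                  teamKills    = rpois(n, 0.1))
toy$winPlacePerc <- pmin(1, pmax(0,
  0.1 + 0.0002 * toy$walkDistance + 0.02 * toy$boosts + rnorm(n, sd = 0.1)))

# Correlation of every feature with the target, target itself removed
cors <- cor(toy)[, "winPlacePerc"]
cors <- cors[names(cors) != "winPlacePerc"]

# Strongest relationships first, regardless of sign
ranked <- sort(abs(cors), decreasing = TRUE)
print(round(ranked, 3))
```

Applied to the real table, the same idea would be `cor(solo_tab[4:25])[, "winPlacePerc"]` followed by the sort.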

For solo mode, the correlation plot shows that boosts, killPlace, walkDistance, and damageDealt are highly correlated with winPlacePerc. Together with kills, we will use these five features to make our prediction. Now let us fit the linear regression model.

win_fit <- lm(winPlacePerc ~ walkDistance + killPlace + boosts+damageDealt + kills, data=solo_tab)
summary(win_fit)

## 
## Call:
## lm(formula = winPlacePerc ~ walkDistance + killPlace + boosts + 
##     damageDealt + kills, data = solo_tab)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.65917 -0.07976 -0.00095  0.08581  1.92141 
## 
## Coefficients:
##                Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)   6.061e-01  1.257e-03  481.960  < 2e-16 ***
## walkDistance  1.389e-04  4.121e-07  336.947  < 2e-16 ***
## killPlace    -5.234e-03  1.782e-05 -293.725  < 2e-16 ***
## boosts        1.955e-02  2.454e-04   79.695  < 2e-16 ***
## damageDealt   3.153e-05  5.839e-06    5.399  6.7e-08 ***
## kills        -2.884e-02  6.212e-04  -46.429  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1392 on 202007 degrees of freedom
## Multiple R-squared:  0.7822, Adjusted R-squared:  0.7822 
## F-statistic: 1.451e+05 on 5 and 202007 DF,  p-value: < 2.2e-16

The dependent variable is winPlacePerc, all p values are < 5%, and the adjusted R-squared is 0.78, which means the model fits the data well for PUBG in solo mode. If you want to be good at this game, these 5 features are the key attributes for winning. For example, a player who covers more ground will encounter more enemies and find more supplies and the high-quality weapons we mentioned before. Such a player is likely to have a higher chance of surviving, but also runs the risk of being killed by other players.

Let’s make some predictions.

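library(DMwR) #provides regr.eval()

#predict.lm() expects newdata=; with data= the argument is ignored and the training fitted values are returned
winpredict <- predict(win_fit ,newdata=solo_test)
#points(solo_tab$winPlacePerc, col = 2)
actuals_preds <- data.frame(cbind(actuals=solo_test$winPlacePerc, predicteds=winpredict)) 
correlation_accuracy <- cor(actuals_preds) #88%

regr.eval(actuals_preds$actuals, actuals_preds$predicteds)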

##        mae        mse       rmse       mape 
## 0.10339583 0.01936817 0.13916958        Inf

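From the correlation table, we see that the correlation between the actual and predicted values is about 88%, which suggests this linear model predicts well. Note that solo_test is a copy of the solo training sample, so this measures in-sample fit rather than accuracy on unseen matches. The RMSE is 0.14; the MAPE is Inf because some actual winPlacePerc values are 0.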
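The four error metrics reported above can be computed by hand, which also shows where the Inf comes from. This base R sketch assumes the usual definitions (MAPE as the mean of |error / actual|); the helper name `eval_metrics` and the four data points are made up for illustration.

```r
# eval_metrics is a hypothetical helper using the standard metric definitions
eval_metrics <- function(actual, predicted) {
  err <- predicted - actual
  c(mae  = mean(abs(err)),
    mse  = mean(err^2),
    rmse = sqrt(mean(err^2)),
    mape = mean(abs(err / actual)))   # divides by the actual value
}

# Made-up values; the first actual is 0, like a last-place winPlacePerc
actual    <- c(0, 0.25, 0.5, 1)
predicted <- c(0.1, 0.25, 0.4, 0.9)

m <- eval_metrics(actual, predicted)
print(m)
```

Because a single actual value of 0 makes MAPE infinite, MAE and RMSE are the more useful numbers for winPlacePerc.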

Conclusion

This final project was really fun for me. I applied the skills I learned in data science, such as data scraping and cleaning, data visualization, linear regression, and machine learning techniques. I also learned how to generate a heat map over an image, which is intriguing. So in order to become a pro player in PUBG, you should learn how to pick a weapon. M416 is by far the most effective weapon for players to use, and it can help you survive. Other factors such as kills, distance walked, boosts, and damage dealt are also important. This may be why PUBG became a hot game in such a short period of time. The game includes many features for players to discover and train. It is not only a shooting or survival game: to be good at it, players really have to think of strategies to survive, which is also clear from my analysis. I hope you enjoyed my work. Maybe you will download PUBG from Steam, start playing, apply my suggestions, and find your own way of becoming a pro!

Appendix

Kaggle data fields Specification:

- DBNOs - Number of enemy players knocked.
- assists - Number of enemy players this player damaged that were killed by teammates.
- boosts - Number of boost items used.
- damageDealt - Total damage dealt. Note: Self inflicted damage is subtracted.
- headshotKills - Number of enemy players killed with headshots.
- heals - Number of healing items used.
- Id - Player’s Id
- killPlace - Ranking in match of number of enemy players killed.
- killPoints - Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”.
- killStreaks - Max number of enemy players killed in a short amount of time.
- kills - Number of enemy players killed.
- longestKill - Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.
- matchDuration - Duration of match in seconds.
- matchId - ID to identify match. There are no matches that are in both the training and testing set.
- matchType - String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.
- rankPoints - Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”.
- revives - Number of times this player revived teammates.
- rideDistance - Total distance traveled in vehicles measured in meters.
- roadKills - Number of kills while in a vehicle.
- swimDistance - Total distance traveled by swimming measured in meters.
- teamKills - Number of times this player killed a teammate.
- vehicleDestroys - Number of vehicles destroyed.
- walkDistance - Total distance traveled on foot measured in meters.
- weaponsAcquired - Number of weapons picked up.
- winPoints - Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.
- groupId - ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.
- numGroups - Number of groups we have data for in the match.
- maxPlace - Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.
- winPlacePerc - The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.

Kill Stats Data:

- killed_by - name of the weapon that killed the player
- killer_name - name of the player who made the kill
- killer_placement - placement of the killer
- killer_position_x - x position of the killer on the map
- killer_position_y - y position of the killer on the map
- map - map played
- match_id - match id
- time - duration of a match
- victim_name - name of the victim
- victim_placement - placement of the victim
- victim_position_x - x position of the victim on the map
- victim_position_y - y position of the victim on the map

Op.gg data:

- weapon - name of a weapon
- damage - damage the weapon deals to a player
- rate_fire - rate of fire of the weapon
- reload_duration - time the weapon takes to reload
- K/D - kill/death ratio of the weapon
- preference - how strongly players prefer the weapon
- spawn_rate - the probability that the weapon spawns on the map
- avg_kill_dist - average kill distance of the weapon
- body_hit_impact - body impact, such as shaking, when the weapon’s bullets hit a player

Reference:

Kaggle https://www.kaggle.com/c/pubg-finish-placement-prediction/data

Op.gg https://pubg.op.gg/weapons

PUBG Data Analysis

Youming Zhang

5/16/2019
