Projections using Hypertuned model through XGboost

All data is from FanGraphs. I have no affiliation with FanGraphs, but please consider contributing to their website if you found this project informative.

1 Project Scope

1.1 Objective

This project is designed to showcase how Using a Percentile Based Worth System values Fantasy Baseball Players through a Inning Pitched (IP) weighted projection

The Categories used for prediction valuation are year-end rankings for the following metrics: - HRs - Runs - RBIs - Batting Average - Stolen Bases - OPS


2 Processing the Data

2.1 Getting Data Into R

2.1.1 Load Libraries

First we need to load the packages that R needs to run the analysis

library(sqldf) #SQL in R
library(skimr) #Summaries and useful for removing low % data
library(ggplot2) #Plotting Functions
library(plyr) #slightly deprecated data cleaning
library(dplyr) #slightly updated data cleaning
library(tidyverse) #tidyverse data cleaning universe
library(caret) #wrapper for creating, tuning and validating models
library(xgboost) #package for creating regression tree model
library(vtreat) # useful package for treating data before modeling 
library(Matrix)
library(Boruta)
library(mgcv)
library(moments) #for measuring skewness
library(data.table) #alternative to dplyr we use to create lags
library(pdp) #partial dependence graphs
library(vip) #variable importance 
library(grid) #put multiple plots on one grid
library(gridExtra) #additional grid functionality
library(janitor) #one function used to clean transposed data set
library(ggpubr) #for qq plot
library(tableHTML)
library(kableExtra)

The # comments generally explain what additional functionality each library adds to R

2.1.2 Load in Data

All data is downloaded from Fan Graphs. From this location. The data is also available on my Github here. There are player level and team data sets


#data read-in
Batter_data <- read_csv("FanGraphs Leaderboard_Hitting50PA.csv")
#Team datasets
FDG_Team = read_csv("FanGraphs Leaderboard_Team.csv")
#Create a prefix for all team stats that starts with T_
FDG_Team2 <- FDG_Team %>% 
  rename_with( ~ paste0("T_", .x))

2.1.3 Checking Team Data

str give information about an object, while skim provides a customizable summary


#Output not shown for space
#str(FDG_Team2)

skim(FDG_Team2) %>%  
  tibble::as_tibble()

2.2 Understanding the Dataset

2.2.1 Exploring the dataset

skim let’s us see how the data was imported into R. Documentation can be found here


#Full Dataset dimensions

skimr::skim(Batter_data) %>% 
  tibble::as_tibble() %>% 
  select(skim_type,skim_variable,complete_rate) %>% 
  filter(complete_rate >0.30) #288 Variables

#skim_type - character or numeric
#skim_variable - name of variable
#complete_rate - % of data that is not missing
#filter - only keep variables that have 30% of data populated

Additionally let’s look at how variables vary by year to see if there are any discrepancies there


#It looks like one year, there were fewer games played, and there is a clear drop off in home runs
Batter_data_dist =
Batter_data %>% 
 group_by(Season) %>% 
  summarize (Games_played = max(G),
             Avg_HR= mean(HR)
             )
Batter_data_dist

ggplot(Batter_data_dist, aes(Season, Avg_HR)) +
  geom_col()+
  ggtitle("Average Home Runs by Year")+
  theme(plot.title = element_text(hjust = 0.5,size = 22,color ="steel blue"))


2.3 Cleaning and Creating Initial Dataset for Model

What are some issues with the data?

  1. Many of Variables, such as K%, are being read in as characters

    • Only Team and Player Name should be characters
  2. There is spotty data coverage in some of the variables (~Variables have less than 30% Coverage)

  3. 2020 Data only includes 60 games worth of data

    • This was a season shortened due to Covid-19
  4. Team Data needs to be appended to Batter Data by Team Name


2.3.1 Cleanly Changing all Variables that are characters to numeric.

There are several ways to do this, we will identify the variables we want to change that are mis-identified. parse_number can be used to pull numbers from these variables. Additional ways to tackle this can be found here


#Select Column names that are characters but not Team or Name, These should be percentages
Batter_data_chars_to_convert <- Batter_data %>% 
  select_if(is.character)%>% select(-Team,-Name) %>% 
  mutate_all (function(x) as.numeric(readr::parse_number(x))/100)
#Note : There are additional ways to do this, this is just one solution


#We can exclude the variables we converted and reintroduce them
Batter_data_num <- Batter_data %>% select(-colnames(Batter_data_chars_to_convert))

Batter_data2 = cbind(Batter_data_num,Batter_data_chars_to_convert) %>% 
  select (colnames(Batter_data)) %>%  #preserve original order 
  dplyr::rename(flyball_perc = `FB%...46`,fastball_perc = `FB%...73`) #rename two ambiguous columns
  
skim(Batter_data2) %>% 
  as_tibble() %>% 
  group_by(skim_type) %>% 
  count()


#Logical variables are R's best guess, in our case they are all NA's and will be removed

The same can be done for the Team Data that is loaded


#Select Column names that are characters but not Team or Name, These should be percentages
FDG_Team2_chars_to_convert <- FDG_Team2 %>% 
  select_if(is.character)%>% select(-T_Team) %>% 
  mutate_all (function(x) as.numeric(readr::parse_number(x))/100)
#Keep in mind, parse number may make actual characters into numerical variables so carefully check your data before using

#We can exclude the variables we converted and reintroduce them
FDG_Team2_num <- FDG_Team2 %>% select(-colnames(FDG_Team2_chars_to_convert))

FDG_Team3 = cbind(FDG_Team2_num,FDG_Team2_chars_to_convert) %>% 
  select (colnames(FDG_Team2)) %>%  #preserve original order
dplyr::rename(T_flyball_perc = `T_FB%...45`,T_fastball_perc = `T_FB%...72`) 

skim(FDG_Team3) %>% 
  as_tibble() %>% 
  group_by(skim_type) %>% 
  count()

2.3.2 Filtering Data with Low Coverage

I choose 30% coverage of data necessary but this can be adjusted up or down. This will also get rid of columns that are all NA.


# Keep variables with enough values (Need 30% data coverage rate here)
Player_cols_to_keep =
skim(Batter_data2) %>% 
  dplyr::select(skim_type, skim_variable, complete_rate) %>% 
  filter (complete_rate > 0.30)

#Transpose Rows to get column names as skim melts the data
Player_cols_to_keep_transpose = t(Player_cols_to_keep) 

#extract the colnames we would like to keep
Player_cols_to_keep = colnames(janitor::row_to_names(Player_cols_to_keep_transpose,row_number = 2))

#Only keep the columns designated to have over 30% of their data populated or greater
Batter_data3 = Batter_data2 %>% 
  select(one_of(Player_cols_to_keep)) 

Repeat the process for Team Variables

Team_cols_to_keep =
skim(FDG_Team3) %>% 
  dplyr::select(skim_type, skim_variable, complete_rate) %>% 
  filter (complete_rate > 0.30)


#Transpose Rows to get column names as skim melts the data
Team_cols_to_keep_transpose = t(Team_cols_to_keep) 

#extract the colnames we would like to keep
Team_cols_to_keep = colnames(janitor::row_to_names(Team_cols_to_keep_transpose,row_number = 2))

#Only keep the columns designated to have over 30% of their data populated or greater
FDG_Team4 = FDG_Team3 %>% 
  select(one_of(Team_cols_to_keep)) 

2.3.3 Creating Variables Normalized by Year

Some Variables will need to be normalized by Innings_Pitched (IP) if they aren’t a percentage already. Remaining Variables are percentages or indices so will not need to be transformed. The full data dictionary for these variables can be found on FanGraph’s website here. for pitching variables and here. for hitting variables.


Batter_data4 = Batter_data3 %>% 
  mutate( #create new variables based on existing variables
    H_PA = H/PA,
    x1B_PA = `1B`/PA, #note: R can't have variables start with a number
    x2b_PA = `2B`/PA,
    x3b_PA = `3B`/PA,
    HR_PA = HR/PA,
    R_PA = R/PA,
    RBI_PA = RBI/PA,
    BB_PA = BB/PA,
    IBB_PA = IBB/PA,
    SO_PA=SO/PA,
    HBP_PA=HBP/PA,
    SF_PA=SF/PA,
    SH_PA=SH/PA,
    GDP_PA= GDP/PA,#ground into double play
    SB_PA=SB/PA,
    CS_PA=CS/PA,
    GB_PA = GB/PA,   #Groundballs
    FB_PA =  FB/PA,  #FlyBalls
    LD_PA = LD/PA,   #LineDrives
    IFFB_PA = IFFB/PA,  #Infield Fly balls
    Pitches_PA= Pitches/PA,
    Balls_PA= Balls/PA,
    Strikes_PA= Strikes/PA,
    IFH_PA= IFH/PA,
    BU_PA= BU/PA,
    BUH_PA= BUH/PA,
    PH_PA= PH/PA,
    Barrels_PA= Barrels/PA,
    HardHits_PA= HardHit/PA
  ) %>% select(-(H:CS),-(GB:BUH),-PH,-Barrels,-HardHit,-Events) #Drop the old variables

#skim(Batter_data4) %>% as_tibble()

Repeat the process for Team Variables


FDG_Team5 = FDG_Team4 %>% 
  mutate( #create new variables based on existing variables
    T_H_T_PA = T_H/T_PA,
    T_x1B_T_PA = T_1B/T_PA, #note: R can't have variables start with a number
    T_x2b_T_PA = T_2B/T_PA,
    T_x3b_T_PA = T_3B/T_PA,
    T_HR_T_PA = T_HR/T_PA,
    T_R_T_PA = T_R/T_PA,
    T_RBI_T_PA = T_RBI/T_PA,
    T_BB_T_PA = T_BB/T_PA,
    T_IBB_T_PA = T_IBB/T_PA,
    T_SO_T_PA=T_SO/T_PA,
    T_HBP_T_PA=T_HBP/T_PA,
    T_SF_T_PA=T_SF/T_PA,
    T_SH_T_PA=T_SH/T_PA,
    T_GDP_T_PA= T_GDP/T_PA,#ground into double play
    T_SB_T_PA=T_SB/T_PA,
    T_CS_T_PA=T_CS/T_PA,
    T_GB_T_PA = T_GB/T_PA,   #Groundballs
    T_FB_T_PA =  T_FB/T_PA,  #FlyBalls
    T_LD_T_PA = T_LD/T_PA,   #LineDrives
    T_IFFB_T_PA = T_IFFB/T_PA,  #Infield Fly balls
    T_Pitches_T_PA= T_Pitches/T_PA,
    T_Balls_T_PA= T_Balls/T_PA,
    T_Strikes_T_PA= T_Strikes/T_PA,
    T_IFH_T_PA= T_IFH/T_PA,
    T_BU_T_PA= T_BU/T_PA,
    T_BUH_T_PA= T_BUH/T_PA,
    T_PH_T_PA= T_PH/T_PA,
    T_Barrels_T_PA= T_Barrels/T_PA,
    T_HardHits_T_PA= T_HardHit/T_PA
  ) %>% select(-(T_H:T_CS),-(T_GB:T_BUH),-T_PH,-T_Barrels,-T_HardHit,-T_Events) #Drop the old variables


#skim(FDG_Team5) %>% as_tibble()

2.3.4 Creating Lagged Variables

There are several ways to lag a dataset BY GROUP.
* Dplyr way is here..
* The data.table (the method used below) is here.

#Note we will only be lagging the player level data, as the previous year's team performance shouldn't impact current performance


#Order the dataset by lag columns
Batter_data5 =  arrange(Batter_data4, playerid,Season) #playerid is the Fangraph id assigned to each player

# Convert dataframe to data.table format
DT_batter = data.table(Batter_data5)

#designate columns to lag - which is all of them
cols1 = colnames(Batter_data5)
anscols = paste("lag", cols1, sep="_")
DT_batter[, (anscols) := data.table::shift(.SD, 1, NA, "lag"),by ='playerid', .SDcols=cols1] #Create 1 period lags by year

Batter_data6 = as.data.frame(DT_batter) %>% select(-lag_playerid, -lag_Team, -lag_Season, -lag_Age,-lag_Name)

ncol(Batter_data5) #287 - no lags
[1] 259
ncol(Batter_data6) #574 - lagged data ~ (287 * 2)-5
[1] 513

2.3.5 Merging Team and Player Data

We can use either the merge function or the SQL functionality provided by the sqldf package to join the lagged player level data to the Team level data


df_batting_init = sqldf(
  "
  select a.*, b.*
  from Batter_data6 a
  left join FDG_Team5 b
  on a.Team = b.T_Team and a.Season = b.T_Season
  
  "
)  %>% select(-T_Team,-T_Season,T_Age,T_G,T_AB)# Unncessary Team Variables


nrow(df_batting_init) - nrow(Batter_data6) #check if any rows are duplicated
[1] 0

3 Creating Rankings for Players Based On Percentiles

We can use Percentile based ranking to get rankings for players from the 2021 season.

3.1 Worth of each stat

3.1.1 Calculating past performance

Each player goes from a 0% to 100% on each percentile stat that is used for creating a scoring opportunity. Data is not yet normalized by PA as certain stats such as HRs and SBs will be worth more when we do.


#Categories I include are:
#Runs (R), Home Runs (HR), Runs Batted In (RBI), Stolen Bases (SB), Batting Average (AVG)
df_batting_init2 =  df_batting_init %>%
#  arrange(player_id,year) %>% 
  group_by(Season) %>% 
  mutate(
    Runs_share = order(order(rank(R_PA,ties.method = 'average'),decreasing = FALSE))/n(),
     HR_share = order(order(rank(HR_PA,ties.method = 'average'),decreasing = FALSE))/n(),
     RBI_share = order(order(rank(RBI_PA,ties.method = 'average'),decreasing = FALSE))/n(),
     SB_share = order(order(rank(SB_PA,ties.method = 'average'),decreasing = FALSE))/n(),
     AVG_share = order(order(rank(AVG,ties.method = 'average'),decreasing = FALSE))/n(),
    OPS_share = order(order(rank(OPS,ties.method = 'average'),decreasing = FALSE))/n(),
    Worth = Runs_share+HR_share+RBI_share+SB_share+AVG_share+OPS_share
    ) %>% 
  ungroup() 

Chart of the Distribution of initial percentiles
As the chart below shows, the data is roughly normal.


skewness((df_batting_init2$Worth))
[1] -0.2
ggplot2::qplot(df_batting_init2$Worth, main="Total Dataset") + geom_histogram(colour="black", fill="lime green")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

min(df_batting_init2$Worth)
[1] 0.031
max(df_batting_init2$Worth)
[1] 5.8
ggpubr::ggqqplot(df_batting_init2$Worth)


shapiro.test(df_batting_init2$Worth)

    Shapiro-Wilk normality test

data:  df_batting_init2$Worth
W = 1, p-value <0.0000000000000002

3.2 2021 Player Rankings - Per PA performance

3.2.1 2021 Player Rankings - Top Worth Player

There are per PA rankings. Players like Byron Buxton which had a great per PA score but can’t stay healthy for a season will be adjusted down.


options(digits=2)

df_batting_init2021 =
df_batting_init2 %>% 
  group_by(Name) %>% 
  filter(Season == 2021) %>% 
  arrange(desc(Worth)) %>% 
  select(Name,Runs_share,HR_share,RBI_share, SB_share,OPS_share,AVG_share,Worth)


df_batting_init2021 %>%
  filter (Worth>3.9) %>% 
  kbl() %>% 
 kable_material(c("striped", "hover","condensed","responsive"),full_width = F,fixed_thead = T)
Name Runs_share HR_share RBI_share SB_share OPS_share AVG_share Worth
Fernando Tatis Jr. 0.99 1.00 0.98 0.96 0.99 0.89 5.8
Ronald Acuna Jr. 1.00 0.98 0.82 0.96 0.99 0.90 5.6
Byron Buxton 1.00 0.99 0.70 0.92 0.99 0.97 5.6
Tyler O'Neill 0.96 0.97 0.86 0.88 0.96 0.91 5.5
Jose Ramirez 0.98 0.93 0.94 0.94 0.94 0.78 5.5
Bryce Harper 0.96 0.95 0.80 0.82 1.00 0.97 5.5
Teoscar Hernandez 0.92 0.90 1.00 0.80 0.93 0.94 5.5
Kyle Tucker 0.87 0.89 0.94 0.86 0.97 0.94 5.5
Shohei Ohtani 0.95 0.99 0.90 0.94 0.98 0.69 5.4
Vladimir Guerrero Jr. 0.99 0.98 0.92 0.49 0.99 0.98 5.3
Frank Schwindel 0.97 0.90 0.95 0.55 0.98 0.99 5.3
Bo Bichette 0.99 0.76 0.85 0.92 0.87 0.95 5.3
Trea Turner 0.96 0.78 0.63 0.97 0.96 1.00 5.3
Nick Castellanos 0.95 0.94 0.97 0.47 0.98 0.97 5.3
Juan Soto 0.96 0.80 0.83 0.70 0.99 0.99 5.3
Brandon Crawford 0.85 0.78 0.94 0.80 0.94 0.95 5.3
Brandon Belt 0.97 0.99 0.89 0.56 0.99 0.85 5.2
Luis Robert 0.84 0.79 0.83 0.81 0.98 1.00 5.2
Javier Baez 0.87 0.93 0.92 0.91 0.85 0.77 5.2
Marcus Semien 0.94 0.96 0.80 0.81 0.93 0.77 5.2
Mike Trout 0.93 0.91 0.67 0.69 1.00 1.00 5.2
Paul Goldschmidt 0.90 0.83 0.84 0.76 0.94 0.94 5.2
Rafael Devers 0.91 0.93 0.97 0.55 0.94 0.88 5.2
George Springer 0.98 0.97 0.84 0.66 0.96 0.76 5.2
Aaron Judge 0.83 0.96 0.88 0.60 0.97 0.92 5.2
A.J. Pollock 0.64 0.87 0.94 0.82 0.94 0.95 5.2
Manny Machado 0.85 0.79 0.95 0.78 0.89 0.88 5.1
Jorge Polanco 0.90 0.88 0.88 0.76 0.87 0.80 5.1
Matt Olson 0.89 0.94 0.94 0.50 0.96 0.83 5.1
Brandon Lowe 0.93 0.97 0.93 0.66 0.92 0.59 5.0
Freddie Freeman 0.98 0.80 0.64 0.66 0.95 0.95 5.0
Ozzie Albies 0.90 0.78 0.88 0.89 0.82 0.71 5.0
Adam Engel 0.89 0.88 0.72 0.97 0.88 0.64 5.0
Yordan Alvarez 0.91 0.91 0.98 0.34 0.93 0.87 5.0
Thairo Estrada 0.85 0.89 0.96 0.55 0.85 0.85 4.9
Jesse Winker 0.93 0.87 0.84 0.36 0.98 0.96 4.9
Avisail Garcia 0.73 0.92 0.96 0.72 0.86 0.74 4.9
Kyle Schwarber 0.95 0.98 0.87 0.36 0.97 0.78 4.9
Joey Votto 0.78 0.98 0.99 0.35 0.97 0.78 4.9
Bryan Reynolds 0.85 0.69 0.79 0.55 0.96 0.96 4.8
Tim Anderson 0.97 0.57 0.54 0.90 0.83 0.98 4.8
Salvador Perez 0.74 0.99 0.99 0.33 0.91 0.84 4.8
Rhys Hoskins 0.86 0.95 0.93 0.53 0.92 0.58 4.8
Jose Altuve 0.98 0.83 0.66 0.54 0.89 0.87 4.8
Max Muncy 0.94 0.95 0.92 0.40 0.95 0.61 4.8
Austin Riley 0.79 0.87 0.93 0.26 0.95 0.96 4.8
Randy Arozarena 0.92 0.62 0.59 0.91 0.85 0.86 4.8
Xander Bogaerts 0.89 0.70 0.73 0.57 0.92 0.94 4.7
Kris Bryant 0.88 0.77 0.68 0.75 0.88 0.78 4.7
Patrick Wisdom 0.85 0.99 0.94 0.64 0.87 0.44 4.7
Starling Marte 0.96 0.41 0.46 1.00 0.89 0.98 4.7
Hunter Renfroe 0.92 0.91 0.96 0.35 0.85 0.71 4.7
C.J. Cron 0.67 0.88 0.96 0.35 0.95 0.89 4.7
Will Smith 0.83 0.87 0.87 0.50 0.91 0.70 4.7
Franmil Reyes 0.59 0.97 0.99 0.58 0.90 0.66 4.7
Trevor Story 0.88 0.73 0.70 0.91 0.83 0.63 4.7
LaMonte Wade Jr. 0.78 0.84 0.84 0.73 0.84 0.65 4.7
Cedric Mullins II 0.77 0.80 0.29 0.95 0.93 0.93 4.7
Darin Ruf 0.72 0.88 0.78 0.51 0.95 0.82 4.7
Chris Owings 0.99 0.34 0.41 0.94 1.00 0.99 4.7
Chas McCormick 0.88 0.79 0.90 0.68 0.72 0.69 4.7
Ketel Marte 0.80 0.69 0.74 0.48 0.96 0.99 4.7
Justin Turner 0.84 0.79 0.81 0.46 0.88 0.87 4.6
Evan Longoria 0.92 0.81 0.91 0.41 0.88 0.73 4.6
Nelson Cruz 0.77 0.91 0.85 0.47 0.88 0.77 4.6
Jake Meyers 0.77 0.68 0.98 0.77 0.72 0.72 4.6
Mookie Betts 0.96 0.76 0.47 0.77 0.91 0.76 4.6
Pete Alonso 0.65 0.94 0.85 0.46 0.92 0.74 4.6
Jared Walsh 0.55 0.87 0.96 0.40 0.91 0.87 4.5
Chris Taylor 0.93 0.65 0.69 0.83 0.78 0.66 4.5
Corey Seager 0.73 0.72 0.79 0.37 0.97 0.97 4.5
Matt Beaty 0.89 0.56 0.97 0.57 0.72 0.82 4.5
Jonathan India 0.92 0.62 0.52 0.78 0.89 0.81 4.5
Willy Adames 0.80 0.81 0.73 0.59 0.86 0.74 4.5
Ryan Mountcastle 0.72 0.92 0.88 0.53 0.81 0.67 4.5
Mitch Haniger 0.94 0.92 0.83 0.33 0.83 0.65 4.5
Rafael Ortega 0.74 0.62 0.41 0.92 0.86 0.93 4.5
Tyrone Taylor 0.59 0.80 0.91 0.82 0.77 0.58 4.5
Brett Phillips 0.97 0.80 0.87 0.97 0.60 0.26 4.5
Jose Abreu 0.70 0.82 0.98 0.33 0.88 0.73 4.4
Carlos Correa 0.95 0.74 0.81 0.15 0.90 0.88 4.4
J.D. Martinez 0.86 0.79 0.90 0.05 0.92 0.91 4.4
Tyler Naquin 0.49 0.76 0.88 0.65 0.84 0.81 4.4
Gavin Sheets 0.67 0.96 1.00 0.29 0.87 0.62 4.4
Yasmani Grandal 0.94 0.96 0.95 0.09 0.97 0.51 4.4
Seth Brown 0.82 0.97 0.90 0.69 0.69 0.32 4.4
Adolis Garcia 0.62 0.87 0.82 0.87 0.66 0.55 4.4
Wander Franco 0.98 0.41 0.71 0.52 0.85 0.92 4.4
Sam Hilliard 0.76 0.95 0.81 0.81 0.70 0.33 4.4
Ryan McMahon 0.76 0.71 0.82 0.62 0.77 0.66 4.3
Nolan Arenado 0.62 0.88 0.93 0.39 0.84 0.67 4.3
Adalberto Mondesi 0.81 0.79 0.69 1.00 0.58 0.44 4.3
Adam Duvall 0.57 0.98 1.00 0.59 0.74 0.42 4.3
Austin Meadows 0.75 0.83 0.99 0.53 0.74 0.47 4.3
Mike Zunino 0.97 1.00 0.95 0.13 0.91 0.34 4.3
Harrison Bader 0.46 0.73 0.69 0.83 0.80 0.79 4.3
Matt Vierling 0.84 0.47 0.21 0.87 0.90 0.99 4.3
Steven Duggar 0.90 0.49 0.62 0.85 0.73 0.69 4.3
J.T. Realmuto 0.54 0.58 0.77 0.85 0.78 0.75 4.3
Kolten Wong 0.84 0.52 0.43 0.85 0.79 0.83 4.3
Mitch Garver 0.54 0.90 0.79 0.43 0.93 0.68 4.3
Lorenzo Cain 0.81 0.51 0.70 0.96 0.60 0.69 4.3
Austin Hays 0.79 0.75 0.76 0.55 0.74 0.68 4.3
Buster Posey 0.89 0.72 0.67 0.07 0.94 0.96 4.2
Luis Urias 0.77 0.73 0.73 0.58 0.81 0.61 4.2
Jake Rogers 0.75 0.85 0.75 0.56 0.83 0.50 4.2
Akil Baddoo 0.70 0.52 0.64 0.94 0.73 0.72 4.2
Yuli Gurriel 0.78 0.45 0.75 0.34 0.90 0.99 4.2
Josh Bell 0.73 0.85 0.89 0.12 0.86 0.73 4.2
Garrett Cooper 0.56 0.67 0.73 0.42 0.90 0.90 4.2
Andrew McCutchen 0.78 0.84 0.79 0.63 0.76 0.38 4.2
Brad Miller 0.82 0.89 0.72 0.56 0.75 0.42 4.2
Joey Gallo 0.87 0.96 0.69 0.61 0.84 0.21 4.2
Eduardo Escobar 0.67 0.84 0.87 0.34 0.80 0.64 4.2
Andrew Benintendi 0.51 0.58 0.76 0.71 0.72 0.86 4.2
Giancarlo Stanton 0.44 0.95 0.96 0.03 0.92 0.84 4.1
Jake Cronenworth 0.87 0.61 0.54 0.51 0.83 0.79 4.1
Lourdes Gurriel Jr. 0.49 0.71 0.89 0.35 0.80 0.86 4.1
Jose Rondon 0.86 0.62 0.41 0.83 0.63 0.75 4.1
Connor Joe 0.42 0.70 0.95 0.22 0.90 0.91 4.1
Bobby Dalbec 0.44 0.92 0.98 0.45 0.81 0.51 4.1
Eric Haase 0.64 0.94 0.92 0.47 0.67 0.44 4.1
Jean Segura 0.75 0.45 0.44 0.73 0.79 0.93 4.1
Mike Yastrzemski 0.83 0.84 0.74 0.54 0.73 0.40 4.1
Eddie Rosario 0.34 0.64 0.87 0.88 0.65 0.71 4.1
Jazz Chisholm Jr. 0.79 0.66 0.46 0.95 0.60 0.60 4.1
Dansby Swanson 0.55 0.75 0.76 0.70 0.71 0.60 4.1
Alex Bregman 0.77 0.56 0.78 0.38 0.76 0.81 4.1
Daulton Varsho 0.70 0.66 0.65 0.79 0.70 0.57 4.1
Taylor Ward 0.80 0.63 0.78 0.44 0.74 0.62 4.0
Francisco Lindor 0.81 0.70 0.64 0.79 0.63 0.43 4.0
Austin Slater 0.66 0.72 0.46 0.97 0.67 0.52 4.0
Alex Verdugo 0.86 0.37 0.46 0.62 0.76 0.93 4.0
Wilmer Flores 0.71 0.75 0.65 0.37 0.78 0.74 4.0
Tony Kemp 0.78 0.35 0.35 0.80 0.82 0.88 4.0
Miguel Sano 0.67 0.92 0.80 0.42 0.77 0.39 4.0
Danny Jansen 0.92 0.90 0.77 0.22 0.75 0.39 4.0
Christian Arroyo 0.58 0.62 0.78 0.48 0.73 0.74 3.9
Wil Myers 0.46 0.64 0.70 0.73 0.73 0.67 3.9
Willson Contreras 0.65 0.78 0.62 0.62 0.77 0.49 3.9
Joey Wendle 0.87 0.38 0.50 0.73 0.66 0.77 3.9
Max Schrock 0.83 0.39 0.46 0.54 0.76 0.92 3.9
Dustin Garneau 0.74 1.00 0.93 0.07 0.87 0.29 3.9

4 Creating Model File

4.1 Additional Data Prep

4.1.1 Remove Variables which are based off current hitting numbers

Not all variables can be used for predictive modeling. Variables that go into the percentile ranking or are non-normalized metrics created after the fact (such as WAR - Wins above Replacement or RE24) should be removed. However, metrics that are normalized by a per pitch basis (such as HR/FB%+) can remain as we expect batters to have similar performance in these metrics one year out.

#Creating a new dataset to keep original intact
df_batting_init3 = df_batting_init2

Lagged Percentile (_share) Variables can be used for predictive modeling. However since these variables were created for the Worth metric they must also be removed for modeling purposes.


#Order the dataset by lag columns
df_batting_init4 =  arrange(df_batting_init3, playerid,Season) #playerid is the Fangraph id assigned to each player

# Convert dataframe to data.table format
DT_batter2 = data.table(df_batting_init4)

#designate columns to lag - which is all of them
cols1 = (c('Runs_share','HR_share','RBI_share', 'SB_share','OPS_share','AVG_share','Worth'))
anscols = paste("lag", cols1, sep="_")
DT_batter2[, (anscols) := data.table::shift(.SD, 1, NA, "lag"),by ='playerid', .SDcols=cols1] #Create 1 period lags by year

#names(DT_batter2)

df_batting_final = as.data.frame(DT_batter2) %>% 
  select(-c(Runs_share,HR_share,RBI_share, SB_share,OPS_share,AVG_share))%>% 
  select(-(G:AVG),-(OBP:BABIP),-(wOBA:Dol),-(`wRC+`:REW),-(`WPA/LI`),-(wFB:wSF),-BsR,-(Def:wGDP),-(`wCH (pi)`:`wCH/C (pi)`),-(`AVG+`),-(`OBP+`:`BABIP+`),-(H_PA:PH_PA)) %>% select (-Name)

4.1.2 Creating Training/Test Split

We split the data into Training Data (which is used to create the model) and test data (which is used to validate the model)


set.seed(15674)  # For reproducibility
# Create index for testing and training data
inTrain <- createDataPartition(y = df_batting_final$Worth, p = 0.80, list = FALSE)
# subset pitching data for training
tr_2021 <- df_batting_final[inTrain,]
# subset the rest to test and validate trained model
te_2021 <- df_batting_final[-inTrain,]

nrow(tr_2021)/nrow(df_batting_final) #check if split is 0.8
[1] 0.8

4.1.3 Treat Missing Data by Imputing Mean Value

Vtreat Package in R is excellent for treating data before using for modeling. Additional documentation can be found here.
Note: The treatment plan also fixes variables names likeHR/FB%+ (which R doesn’t always handle the best) to HR_slash_FB_percent_plus_

treat_plan_2021 <- vtreat::designTreatmentsZ(
  dframe = tr_2021, # training data
  varlist = colnames(tr_2021) %>% .[. != "hitting_score1"], # input variables = all training data columns, except random
  codeRestriction = c("clean", "isBAD", "lev"), # derived variables types (drop cat_P)
  verbose = FALSE) # suppress messages

#clean stands for cleaned numerical variable, isBAD indicates that a value replacement has occurred (which indicates a missing value in this case), and lev is a binary indicator whether a particular value of that categorical variable was present.  

#### Checking Scoreframe

score_frame <- treat_plan_2021$scoreFrame %>% 
  select(varName, origName, code)

head(score_frame)


tr_treated_2021 <- vtreat::prepare(treat_plan_2021, tr_2021)
te_treated_2021 <- vtreat::prepare(treat_plan_2021, te_2021)


Total_dataset1_untreat = as.data.frame(DT_batter2) %>% select(-Name)

treat_plan_2021 <- vtreat::designTreatmentsZ(
  dframe = Total_dataset1_untreat, # training data
  varlist = colnames(Total_dataset1_untreat) %>% .[. != "hitting_score1"], # input variables = all training data columns, except random
  codeRestriction = c("clean", "isBAD", "lev"), # derived variables types (drop cat_P)
  verbose = FALSE) # suppress messages


total_treated_2021_hitting <- vtreat::prepare(treat_plan_2021, Total_dataset1_untreat)



#tr_treated = tr
#te_treated = te

dim(tr_treated_2021) #note there are dummies for each player and team
[1] 3424 1397

4.1.4 Check Distribution of Training Population

The population used for Training should be indicative of Total Population


ggplot2::qplot(tr_treated_2021$Worth, main="Training Set") + geom_histogram(colour="black", fill="limegreen") + theme_bw()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

skewness(tr_treated_2021$Worth) #The skewness is the same as the overall
[1] -0.2

5 Running XGboost Model

To keep things simple with modeling, we’ll turn the training data into simple input variables for caret::train, dropping the response variable and converting the data frame to a matrix. Documentation for this approach to XGboost can be found here.

5.1 Tuning the Model

5.1.1 Initial Non-Tuned Model

Break the data set into x and y inputs with x being a matrix

input_x <- as.matrix(((tr_treated_2021))%>%
   select(-Worth) %>%                      
   select(!ends_with ("_isBAD")))

input_y <- tr_treated_2021$Worth

XGBoost with Default Hyperparameters
The Variable Importance (caret::varImp(xgb_base_2021, scale = F )) from the caret package shows the contribution of each variable to the initial model. As you can see SLG_plus_ (SLG+) takes up much of the importance as it is derived from SLG (one of the key contributors to Worth). These types of variables will be removed during variable selection in the next step.
XGBoost documentation can be found for more general models here.


#Defaults for xgboost model
grid_default <- expand.grid(
  nrounds = 100,
  max_depth = 6,
  eta = 0.3,
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

#This is a blank train_control set, this will be updated after
train_control <- caret::trainControl(
  method = "none",
  verboseIter = FALSE, # no training log
  allowParallel = TRUE # FALSE for reproducible results 
)

xgb_base_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = train_control,
  tuneGrid = grid_default,
  method = "xgbTree",
  verbose = TRUE
)

caret::varImp(xgb_base_2021, scale = F  )
xgbTree variable importance

  only 20 most important variables shown (out of 765)

5.2 Further Variable Selection

5.2.1 Remove redundant and highly correlated variables

Selection Removal Step 1: Check for high correlations
Normally, this step is done early, but those steps were reserved for preparing the data


dep_cor1 <- t(as.data.frame(cor(tr_treated_2021[ , colnames(tr_treated_2021) != "Worth"],
                tr_treated_2021$Worth)))
dep_cor1 <-
as.data.frame(t(as.data.frame(dep_cor1)%>% 
  select(!starts_with("lag")) %>% #remove lag variables
  select(!contains("_isBAD")))) 

dep_cor1 <- tibble::rownames_to_column(dep_cor1,"VARIABLES")%>% #remove indicators for missing data
  filter(V1 > 0.70|V1 < -0.5)

dep_cor1

dep_cor2 <- colnames(row_to_names(t(dep_cor1),row_number = 1))

Let’s Remove variables with high correlation to worth metric (such as wFB/C)


input_x <- as.matrix(((tr_treated_2021))%>%
   select(-Worth) %>% #Remove dependent variable
     select (-all_of(dep_cor2) ) %>%      
select(!ends_with ("_isBAD"))) #indicator variable for missing data

input_y <- tr_treated_2021$Worth

Run the model on the new dataset to make sure the variable importances look fine


#Note Training parameters were set in initial model set up
xgb_base_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = train_control,
  tuneGrid = grid_default,
  method = "xgbTree",
  verbose = TRUE
)

caret::varImp(xgb_base_2021, scale = F  )
xgbTree variable importance

  only 20 most important variables shown (out of 764)

5.3 Model with new data

5.3.1 Tuning All Hyperparameters

A tune grid allows us to test a large amount of hyper-parameters and find the model with the lowest RMSE for predictions.
However, The more values you want to test and the greater the amount of Cross-Fold Validations (method = "cv"), the greater the computational time it will take. More information on the specific parameters can be found here.


# maximum number of trees
nrounds <- 1000

# note to start nrounds from 200, as smaller learning rates result in errors so
# big with lower starting points that they'll mess the scales
tune_grid <- expand.grid(
  nrounds = seq(from = 100, to = nrounds, by = 50),
  eta = c(0.01, 0.025, 0.05, 0.1),
  max_depth = c(2, 4, 6, 8),
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

tune_control <- caret::trainControl(
  method = "cv", # cross-validation
  number = 5, # with n folds 
  ## Note this was # out in the original code
  #index = createFolds(tr_treated$Id_clean), # fix the folds
  verboseIter = FALSE, # no training log
  allowParallel = TRUE # FALSE for reproducible results 
)

Running the initial tuning model

#Note I will be timing these runs to give an estimate on how long this model takes to run
start_time <- Sys.time()

xgb_tune_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid,
  method = "xgbTree",
  verbose = FALSE
  ,verbosity = 0
)

end_time <- Sys.time()

end_time - start_time
Time difference of 13 mins

Tuning Plot and Variable Importance

varImp(xgb_tune_2021, scale = F  ) 
xgbTree variable importance

  only 20 most important variables shown (out of 764)
# helper function for the plots
tuneplot <- function(x, probs = .90) {
  ggplot(x) +
    coord_cartesian(ylim = c(quantile(x$results$RMSE, probs = probs), min(x$results$RMSE))) +
    theme_bw()
}

tuneplot(xgb_tune_2021)

5.3.2 Fine Tuning Model

5.3.2.1 Second Tuning: Maximum Depth and Minimum Child Weight

After fixing the learning rate to 0.1 and we’ll also set maximum depth to 3 +-1 (or +2 if max_depth == 2) to experiment a bit around the suggested best tune in previous step. Then, well fix maximum depth and minimum child weigh

tune_grid2 <- expand.grid(
  nrounds = seq(from = 50, to = nrounds, by = 50),
  eta = xgb_tune_2021$bestTune$eta,
  max_depth = ifelse(xgb_tune_2021$bestTune$max_depth == 2,
    c(xgb_tune_2021$bestTune$max_depth:4),
    xgb_tune_2021$bestTune$max_depth - 1:xgb_tune_2021$bestTune$max_depth + 1),
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = c(1, 2, 3),
  subsample = 1
)

xgb_tune2_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid2,
  method = "xgbTree",
  verbose = TRUE
)

tuneplot(xgb_tune2_2021)


xgb_tune2_2021$bestTune

varImp(xgb_tune2_2021, scale = F  ) 
xgbTree variable importance

  only 20 most important variables shown (out of 764)

5.3.2.2 Third Tuning: Column and Row Sampling


tune_grid3 <- expand.grid(
  nrounds = seq(from = 50, to = nrounds, by = 50),
  eta = xgb_tune_2021$bestTune$eta,
  max_depth = xgb_tune2_2021$bestTune$max_depth,
  gamma = 0,
  colsample_bytree = c(0.4, 0.6, 0.8, 1.0),
  min_child_weight = xgb_tune2_2021$bestTune$min_child_weight,
  subsample = c(0.5, 0.75, 1.0)
)

xgb_tune3_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid3,
  method = "xgbTree",
  verbose = TRUE
)

tuneplot(xgb_tune3_2021, probs = .95)


xgb_tune3_2021$bestTune

varImp(xgb_tune3_2021, scale = F  ) 
xgbTree variable importance

  only 20 most important variables shown (out of 764)

5.3.2.3 Fourth Tuning: Gamma

Next, we again pick the best values from previous step, and now will see whether changing the gamma has any effect on the model fit:

tune_grid4 <- expand.grid(
  nrounds = seq(from = 50, to = nrounds, by = 50),
  eta = xgb_tune_2021$bestTune$eta,
  max_depth = xgb_tune2_2021$bestTune$max_depth,
  gamma = c(0, 0.05,0.1, 0.2,0.4, 0.5, 0.7, 0.9, 1.0),
  colsample_bytree = xgb_tune3_2021$bestTune$colsample_bytree,
  min_child_weight = xgb_tune2_2021$bestTune$min_child_weight,
  subsample = xgb_tune3_2021$bestTune$subsample
)

xgb_tune4_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid4,
  method = "xgbTree",
  verbose = TRUE
)

tuneplot(xgb_tune4_2021)
Warning: The shape palette can deal with a maximum of 6 discrete values because more than 6 becomes difficult to discriminate; you have 9.
Consider specifying shapes manually if you must have them.
Warning: Removed 60 rows containing missing values (geom_point).

xgb_tune4_2021$bestTune

varImp(xgb_tune4_2021, scale = F  ) 
xgbTree variable importance

  only 20 most important variables shown (out of 764)

5.3.2.4 Fifth Tuning: Reducing the Learning Rate

Now, we have tuned the hyperparameters and can start reducing the learning rate to get to the final model:

start_time <- Sys.time()

tune_grid5 <- expand.grid(
  nrounds = seq(from = 100, to = 10000, by = 75),
   eta = c(0.01, 0.015, 0.025,0.035, 0.05,0.75, 0.1),
  max_depth = xgb_tune2_2021$bestTune$max_depth,
  gamma = xgb_tune4_2021$bestTune$gamma,
  colsample_bytree = xgb_tune3_2021$bestTune$colsample_bytree,
  min_child_weight = xgb_tune2_2021$bestTune$min_child_weight,
  subsample = xgb_tune3_2021$bestTune$subsample
)



xgb_tune5_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid5,
  method = "xgbTree",
  verbose = TRUE
)

#tuneplot(xgb_tune5_2021)

end_time <- Sys.time()

end_time - start_time
Time difference of 29 mins
xgb_tune5_2021$bestTune

varImp(xgb_tune5_2021, scale = F  ) 
xgbTree variable importance

  only 20 most important variables shown (out of 764)

5.3.2.5 Fitting Final Model


(final_grid_2021 <- expand.grid(
  nrounds = xgb_tune5_2021$bestTune$nrounds,
  eta = xgb_tune5_2021$bestTune$eta,
  max_depth = xgb_tune5_2021$bestTune$max_depth,
  gamma = xgb_tune5_2021$bestTune$gamma,
  colsample_bytree = xgb_tune5_2021$bestTune$colsample_bytree,
  min_child_weight = xgb_tune5_2021$bestTune$min_child_weight,
  subsample = xgb_tune5_2021$bestTune$subsample
))

(xgb_model_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = train_control,
  tuneGrid = final_grid_2021,
  method = "xgbTree",
  verbose = TRUE
))
eXtreme Gradient Boosting 

3424 samples
 764 predictor

No pre-processing
Resampling: None 
varImp(xgb_model_2021, scale = F  ) 
xgbTree variable importance

  only 20 most important variables shown (out of 764)

5.3.3 Model Performance

5.3.3.1 Checking Model on Test Split Data



y_pred_test <- predict(xgb_model_2021, data.matrix(te_treated_2021))

test_stats= cbind((te_treated_2021$Worth),y_pred_test)

test_statsR2 = cor(test_stats[,1],test_stats[,2])^2

print(test_statsR2)
[1] 0.92
y_pred_train <- predict(xgb_model_2021, data.matrix(tr_treated_2021))

train_stats = cbind((tr_treated_2021$Worth),y_pred_train)

train_statsR2 = cor(train_stats[,1],train_stats[,2])^2

print(train_statsR2)
[1] 0.99
#test dataset
x <- select(te_treated_2021, -Worth)
y <- (te_treated_2021$Worth)

(xgb_model_rmse <- ModelMetrics::rmse(y, predict(xgb_model_2021, newdata = x)))
[1] 0.37
holdout_x <- select(tr_treated_2021, -Worth)
holdout_y <- tr_treated_2021$Worth

(xgb_model_rmse <- ModelMetrics::rmse(holdout_y, predict(xgb_model_2021, newdata = holdout_x)))
[1] 0.11

5.3.3.2 Graphical Representation of Model


ggplot2::ggplot() +
  aes(x = test_stats[,1], y = test_stats[,2]) +
  geom_jitter() +
  xlab("Predicted Values") +
  ylab("Actual Values") +
  ggtitle("Results of Hitting Model on Test Data")+
  theme(plot.title = element_text(hjust = 0.5,size = 22,color ="steel blue"))+
  geom_smooth(method = "lm")
`geom_smooth()` using formula 'y ~ x'

6 Creating 2022 Projections from Model

6.1 Re-fit model for Important Variables

Now that we have an acceptable model, we can use it to create projections for how well we think players should do in 2022 based on their hitting statistics in 2021. First let’s reduce

  1. Only keep variables with high enough importance in model


vip(xgb_model_2021, num_features = 30)  # 10 is the default, 30 gives a visual on the top 30 most important features of the model


unscalevi = vi(xgb_model_2021, method="model") #shows the numbers behind the plot

unscalevi$Importance_perc = with(unscalevi,Importance/sum(Importance)) #adds percentages 

unscalevi # importance by variables

variables_to_keep_2021 = subset(unscalevi, Importance_perc > 0.0010) %>% select(Variable) #Keep Variables that explain at least a small amount [0.1%] of the model. This is a low threshold for inclusion ,but you can adjust this

variables_to_keep_2021b = t(variables_to_keep_2021)

variables_to_keep_2022 = colnames(row_to_names(variables_to_keep_2021b,row_number = 1))

tr_treated_2022 = tr_treated_2021 %>%  select(Worth,one_of(variables_to_keep_2022),playerid,starts_with("Team_lev_x_")) #keep modeled important variables along with team indicator variables

te_treated_2022 = te_treated_2021 %>%  select(Worth,one_of(variables_to_keep_2022),playerid,starts_with("Team_lev_x_"))

input_x_2022 = as.matrix(select(tr_treated_2022, -Worth))

input_y_2022 = tr_treated_2022$Worth
  1. Re-fit model with reduced variable scope


(final_grid_2021 <- expand.grid(
  nrounds = xgb_tune5_2021$bestTune$nrounds,
  eta = xgb_tune5_2021$bestTune$eta,
  max_depth = xgb_tune5_2021$bestTune$max_depth,
  gamma = xgb_tune5_2021$bestTune$gamma,
  colsample_bytree = xgb_tune5_2021$bestTune$colsample_bytree,
  min_child_weight = xgb_tune5_2021$bestTune$min_child_weight,
  subsample = xgb_tune5_2021$bestTune$subsample
))

(xgb_model_2022 <- caret::train(
  x = input_x_2022,
  y = input_y_2022,
  trControl = train_control,
  tuneGrid = final_grid_2021,
  method = "xgbTree",
  verbose = TRUE
))
eXtreme Gradient Boosting 

3424 samples
  79 predictor

No pre-processing
Resampling: None 
vip(xgb_model_2022, num_features = 30)


unscalevi24 = vi(xgb_model_2022, method="model")

unscalevi24$Importance_perc = with(unscalevi24,Importance/sum(Importance)) 

unscalevi24

save(xgb_model_2022,file = '2022_Hitting6x6_Model.Rdata')

hitting6x6 = xgb_model_2022

hittinginput6X6 = input_x_2022

#For anything above breaking_IP we need to create projection table by age or age bucket

#write_csv(unscalevi24,"unscalevi24.csv")
# 2022 Projections Full
First let’s prepare a file for predicting based on our model object
```r
variableslag6xb= row_to_names(as.data.frame(t(variables_to_keep_2022)),row_number = 1) %>% select (starts_with(“lag”))
variables_nolag6xb = (owmr::remove_prefix(variableslag6xb,“lag” , sep = “_“))
Data_Predict_2022a6xb = total_treated_2021_hitting %>% select (one_of(colnames(variables_nolag6xb)),Season,playerid)
colnames(Data_Predict_2022a6xb) <- paste0(“lag_”, colnames(Data_Predict_2022a6xb))
Data_Predict_2022b6xb = total_treated_2021_hitting %>% select (one_of(colnames(variables_nolag6xb))) colnames(Data_Predict_2022b6xb) = colnames(variableslag6xb)
variables_to_keep_2022_nolag6xb = total_treated_2021_hitting %>% select(one_of(variables_to_keep_2022),Season,playerid,starts_with(“Team_lev_x_”))%>% select(-one_of(colnames(Data_Predict_2022b6xb)))
Data_predict_20226xb = sqldf( ” select a.,b. from Data_Predict_2022a6xb a, variables_to_keep_2022_nolag6xb b on b.playerid = a.lag_playerid and b.Season = a.lag_Season ” ) %>% select(-lag_playerid,lag_Season) %>% filter(Season == 2021) %>% select(one_of(variables_to_keep_2022),playerid,starts_with(“Team_lev_x_”))
```

6.2 Create Predictions for Model

6.2.1 Run Projections on Players who Played in 2021

This is the raw prediction score per IP for each pitcher


hitting_predictions6xb = as.data.frame(predict(xgb_model_2022,Data_predict_20226xb))

names(hitting_predictions6xb) = c("Predict_Score")

Data_predict_2022_w_hitting_Predictions6xb = cbind(Data_predict_20226xb,hitting_predictions6xb) %>% select(playerid,Predict_Score)

head(Data_predict_2022_w_hitting_Predictions6xb)
NA


Latest_2022_hittingdata_FP = read_csv("FanGraph_Fantasy_Baseball_Hitting.csv")
Rows: 625 Columns: 28
-- Column specification ----------------------------------------------------------------------------------------------------------------------------------
Delimiter: ","
chr  (3): Name, Team, playerid
dbl (25): G, PA, AB, H, 2B, 3B, HR, R, RBI, BB, SO, HBP, SB, CS, AVG, OBP, SLG, OPS, wOBA, wRC+, WAR, ADP, InterSD, InterSK, IntraSD

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Latest_2022_hittingdata_FP
NA
NA



hitting_Data_NonAdj_Projections6xb = sqldf(
  "
  select a.*,b.Predict_Score
  from Latest_2022_hittingdata_FP a 
  left join 
  Data_predict_2022_w_hitting_Predictions6xb b
  on a.playerid = b.playerid
  "
) %>% filter(ADP<370 | is.na(Predict_Score)==F)


hitting_Data_Adj_Projections6xb =
hitting_Data_NonAdj_Projections6xb %>% 
  mutate(
    Avg_PA = 300,
    AdjPredict_Score_raw = ifelse(is.na(Predict_Score),NA,Predict_Score*(PA/Avg_PA)),
    max_predscore= max(AdjPredict_Score_raw,na.rm = T),
    AdjPredict_Score = ifelse (is.na(AdjPredict_Score_raw),NA,AdjPredict_Score_raw *100/max_predscore),
    WAR_rank = order(order(rank(WAR,ties.method = 'average'),decreasing = TRUE)),
    AdjPredict_Score_Rank = order(order(rank(AdjPredict_Score,ties.method = 'average'),decreasing = TRUE))-sum(is.na(AdjPredict_Score)),
        Ranks_Above_ADP = ADP - AdjPredict_Score_Rank
  ) %>% select (Name,ADP,WAR, WAR_rank,AdjPredict_Score ,AdjPredict_Score_Rank,Ranks_Above_ADP)


  

ggplot2::qplot(hitting_Data_Adj_Projections6xb$AdjPredict_Score, main="Predictions") + geom_histogram(colour="black", fill="grey") + theme_bw()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


7 2022 Projections Full

7.1 Table of hitting Projections (Players who Didn’t Play in 2021 - Recieve an NA)

AdjPredict_Score are normalized to 100


tableexport =
hitting_Data_Adj_Projections6xb %>%
  arrange (ADP,WAR) %>% 
  kbl() %>% 
 kable_material(c("striped", "hover","condensed","responsive"),full_width = F,fixed_thead = T)

save_kable(tableexport,file = "hitting6x6.html")

#tableexport

This is a better formatted Table



ft_dt <- hitting_Data_Adj_Projections6xb[1:nrow(hitting_Data_Adj_Projections6xb), 1:ncol(hitting_Data_Adj_Projections6xb)] %>% 
  filter(AdjPredict_Score_Rank>0)%>%  arrange((AdjPredict_Score_Rank))

ft_dt$ADP <- color_tile("white", "red")(ft_dt$ADP)

ft_dt$WAR <- color_bar("lightblue")(ft_dt$WAR)

ft_dt$AdjPredict_Score<- color_bar("lightblue")(ft_dt$AdjPredict_Score)

ft_dt$WAR_Rank <- color_tile("green","orange")(ft_dt$WAR_rank)

ft_dt$Predict_Rank <- color_tile("green","orange")(ft_dt$AdjPredict_Score_Rank) 


ft_dt$Ranks_Above_ADP <- 
  ifelse(
  ft_dt$Ranks_Above_ADP < 0,
  cell_spec(round(ft_dt$Ranks_Above_ADP,2), color = "red", italic = T),
  cell_spec(round(ft_dt$Ranks_Above_ADP,2), color = "green", italic = T)
)


ft_dt2 <- ft_dt[c("Name", "ADP", "WAR", "AdjPredict_Score", "WAR_Rank","Predict_Rank","Ranks_Above_ADP")]



table_export = 
kbl(ft_dt2, escape = F) %>% 
 kable_material(c("striped", "hover","condensed","responsive"),full_width = F,fixed_thead = T) %>%   column_spec(6, width = "3cm") %>%
  add_header_above(c(" ", "Scores" = 3, "Ranks" = 2," "))
save_kable(table_export,file = "Hitting6x6_updated.html")
  
table_export  
Scores
Ranks
Name ADP WAR AdjPredict_Score WAR_Rank Predict_Rank Ranks_Above_ADP
Marcus Semien 39.0 4.7 100.00 14 1 38
Bryce Harper 9.1 5.2 99.95 8 2 7.1
Vladimir Guerrero Jr. 5.6 5.6 98.11 5 3 2.6
Trea Turner 1.4 5.5 97.08 6 4 -2.6
Paul Goldschmidt 47.6 3.5 96.51 44 5 42.6
José Ramírez 3.4 5.9 96.06 3 6 -2.6
Bo Bichette 4.9 4.7 94.81 13 7 -2.1
Freddie Freeman 19.2 5.1 94.63 9 8 11.2
Rafael Devers 14.5 4.8 94.54 11 9 5.5
Shohei Ohtani 9.2 3.9 94.39 34 10 -0.8
Juan Soto 3.9 7.6 94.04 1 11 -7.1
Nick Castellanos 60.6 2.5 93.49 81 12 48.6
Luis Robert 16.2 4.6 93.25 16 13 3.2
Kyle Tucker 12.7 4.6 93.18 15 14 -1.3
Teoscar Hernández 27.8 2.3 92.29 98 15 12.8
Mookie Betts 15.4 5.8 89.86 4 16 -0.6
Jorge Polanco 77.8 3.1 89.62 58 17 60.8
Cedric Mullins 32.8 3.3 89.55 52 18 14.8
Aaron Judge 36.0 5.3 89.23 7 19 17
Matt Olson 39.0 4.2 88.97 28 20 19
Ozzie Albies 17.6 4.1 88.83 32 21 -3.4
Pete Alonso 46.2 3.5 88.13 43 22 24.2
Mike Trout 13.0 6.6 87.97 2 23 -10
Jose Altuve 73.3 3.8 87.86 36 24 49.3
Manny Machado 22.1 4.5 87.73 20 25 -2.9
Bryan Reynolds 93.0 4.2 87.46 30 26 67
George Springer 57.8 4.3 87.14 26 27 30.8
Tim Anderson 30.9 3.2 87.03 55 28 2.9
Tyler O'Neill 47.1 2.8 86.93 67 29 18.1
Xander Bogaerts 45.8 4.5 85.46 19 30 15.8
Ronald Acuña Jr. 10.4 5.0 84.52 10 31 -20.6
Yordan Alvarez 26.3 4.1 84.50 31 32 -5.7
Trevor Story 36.8 4.2 84.37 29 33 3.8
Javier Báez 62.5 2.8 83.60 68 34 28.5
Jonathan India 90.7 3.8 83.60 35 35 55.7
Brandon Lowe 75.4 4.3 82.47 24 36 39.4
Kris Bryant 84.9 2.4 82.30 86 37 47.9
Austin Riley 51.8 3.1 82.28 59 38 13.8
Frank Schwindel 233.8 0.8 81.82 262 39 194.8
Byron Buxton 48.0 4.4 80.75 23 40 8
Jesse Winker 111.8 3.4 80.71 45 41 70.8
José Abreu 78.4 2.1 80.56 114 42 36.4
Ketel Marte 77.2 3.7 80.38 37 43 34.2
Mitch Haniger 109.0 2.4 80.23 89 44 65
Starling Marte 31.2 2.4 80.18 84 45 -13.8
Salvador Perez 32.2 3.7 79.95 38 46 -13.8
Joey Votto 141.9 1.9 79.69 130 47 94.9
Nolan Arenado 67.7 3.7 79.03 39 48 19.7
Kyle Schwarber 114.5 2.7 79.00 71 49 65.5
J.D. Martinez 88.9 2.0 78.49 116 50 38.9
Rhys Hoskins 126.3 3.0 77.54 61 51 75.3
Randy Arozarena 65.4 2.1 77.53 113 52 13.4
Franmil Reyes 120.9 1.8 77.34 132 53 67.9
Avisaíl García 170.2 2.3 77.03 93 54 116.2
Wander Franco 58.9 4.5 76.70 21 55 3.9
Ryan Mountcastle 119.7 1.6 76.37 161 56 63.7
Brandon Crawford 219.2 2.7 75.82 70 57 162.2
Carlos Correa 100.9 4.7 75.22 12 58 42.9
Brandon Belt 226.3 2.5 74.80 80 59 167.3
Corey Seager 78.3 4.5 74.79 22 60 18.3
Jared Walsh 119.8 2.2 74.41 101 61 58.8
Willy Adames 123.2 3.3 73.96 49 62 61.2
Dansby Swanson 119.1 2.7 73.62 72 63 56.1
C.J. Cron 124.7 1.7 73.33 154 64 60.7
Max Muncy 134.9 3.2 72.27 54 65 69.9
Jake Cronenworth 127.1 3.4 72.23 48 66 61.1
Hunter Renfroe 161.4 1.7 71.92 153 67 94.4
Francisco Lindor 53.5 4.2 71.53 27 68 -14.5
J.T. Realmuto 48.8 3.6 71.20 40 69 -20.2
Joey Gallo 177.0 3.5 71.12 42 70 107
Josh Bell 127.6 1.9 71.02 129 71 56.6
Justin Turner 152.7 3.0 70.54 62 72 80.7
Nelson Cruz 171.9 1.4 70.54 181 73 98.9
Eduardo Escobar 200.5 1.9 69.73 131 74 126.5
Austin Meadows 145.5 1.5 69.72 168 75 70.5
Whit Merrifield 32.5 2.3 69.04 97 76 -43.5
Giancarlo Stanton 97.7 2.3 68.42 99 77 20.7
Kolten Wong 190.8 2.5 68.12 79 78 112.8
Anthony Rizzo 193.6 2.1 67.38 109 79 114.6
Ty France 154.6 2.4 67.17 88 80 74.6
Jean Segura 202.3 2.4 67.02 91 81 121.3
Jazz Chisholm Jr. 76.0 1.8 66.56 138 82 -6
Dylan Carlson 172.0 2.5 66.20 77 83 89
Will Smith 53.2 4.6 65.75 17 84 -30.8
Chris Taylor 139.9 2.3 65.74 95 85 54.9
Austin Hays 243.2 1.7 65.74 150 86 157.2
Adolis García 166.0 1.3 65.73 201 87 79
Enrique Hernández 226.7 2.8 65.73 66 88 138.7
Ryan McMahon 161.2 2.0 65.39 119 89 72.2
Mark Canha 283.4 1.8 65.26 141 90 193.4
Luis Urías 173.3 2.5 64.82 75 91 82.3
Alex Bregman 88.3 4.5 64.79 18 92 -3.7
Lourdes Gurriel Jr. 145.2 1.4 64.25 187 93 52.2
Jonathan Schoop 213.2 1.7 64.03 145 94 119.2
Andrew Benintendi 192.7 1.9 64.02 125 95 97.7
Alex Verdugo 168.6 2.4 63.99 85 96 72.6
Yuli Gurriel 210.6 1.4 63.75 189 97 113.6
Akil Baddoo 156.9 1.6 63.45 158 98 58.9
Yasmani Grandal 101.1 4.0 63.38 33 99 2.1
Jesús Sánchez 226.3 2.0 62.86 115 100 126.3
Mike Yastrzemski 286.5 2.2 62.81 108 101 185.5
AJ Pollock 218.6 1.8 62.28 136 102 116.6
Nathaniel Lowe 251.7 1.8 62.25 133 103 148.7
Miguel Sanó 280.4 0.8 62.04 271 104 176.4
Robbie Grossman 191.3 1.7 61.68 149 105 86.3
Max Kepler 294.6 2.5 61.62 76 106 188.6
Trent Grisham 143.5 3.1 61.50 57 107 36.5
Jeimer Candelario 239.6 3.2 61.14 56 108 131.6
Ian Happ 213.2 2.2 61.05 104 109 104.2
Yoán Moncada 158.7 3.4 60.86 47 110 48.7
Harrison Bader 235.8 2.9 60.62 63 111 124.8
Tyler Naquin 352.7 0.7 60.57 274 112 240.7
Charlie Blackmon 236.7 0.6 59.83 301 113 123.7
Brendan Rodgers 169.9 1.6 59.65 160 114 55.9
Tommy Edman 90.2 2.3 59.59 94 115 -24.8
Christian Yelich 97.2 2.9 59.12 64 116 -18.8
Ramón Laureano 247.0 2.6 58.58 74 117 130
LaMonte Wade Jr. 343.8 1.0 58.23 233 118 225.8
Patrick Wisdom 358.3 1.2 57.95 214 119 239.3
Amed Rosario 170.7 2.0 57.87 118 120 50.7
Jorge Soler 192.2 1.7 57.55 147 121 71.2
Josh Donaldson 187.1 3.3 57.21 51 122 65.1
Eloy Jiménez 65.1 2.2 57.01 105 123 -57.9
Andrew McCutchen 348.9 0.9 56.86 254 124 224.9
Bobby Dalbec 229.2 1.6 56.81 157 125 104.2
Adam Frazier 400.3 2.0 56.78 117 126 274.3
Trey Mancini 205.0 1.4 56.69 180 127 78
Eddie Rosario 181.7 0.9 56.52 251 128 53.7
Brandon Nimmo 324.3 3.2 56.45 53 129 195.3
Willson Contreras 109.9 2.7 56.44 69 130 -20.1
Adalberto Mondesi 57.3 2.0 56.27 123 131 -73.7
Adam Duvall 212.8 1.7 56.11 148 132 80.8
Wil Myers 288.9 1.0 56.03 231 133 155.9
Rafael Ortega 337.3 1.3 55.28 193 134 203.3
Seth Brown 502.2 0.6 55.16 293 135 367.2
Nicky Lopez 250.4 1.8 55.05 140 136 114.4
Matt Chapman 181.5 3.6 54.39 41 137 44.5
Tony Kemp 459.8 1.6 54.22 155 138 321.8
Fernando Tatis Jr. 27.5 3.4 54.02 46 139 -111.5
Josh Rojas 247.6 1.5 53.63 167 140 107.6
Jo Adell 228.4 0.6 53.30 300 141 87.4
Michael Brantley 251.4 2.0 52.69 122 142 109.4
Miguel Rojas 475.9 2.3 52.48 96 143 332.9
Joey Wendle 403.7 1.3 51.66 194 144 259.7
Jesús Aguilar 300.5 1.2 51.03 215 145 155.5
Gavin Sheets 525.8 0.8 50.88 265 146 379.8
Daulton Varsho 93.3 2.1 50.85 110 147 -53.7
J.P. Crawford 452.2 2.5 50.81 83 148 304.2
Tyler Stephenson 138.3 2.9 50.42 65 149 -10.7
Ke'Bryan Hayes 142.6 3.3 49.79 50 150 -7.4
Tommy Pham 274.8 1.9 49.76 126 151 123.8
Anthony Santander 298.7 1.3 49.58 197 152 146.7
Mike Zunino 253.5 2.0 49.34 124 153 100.5
Eugenio Suárez 198.5 2.2 49.10 107 154 44.5
César Hernández 403.0 1.7 48.99 152 155 248
DJ LeMahieu 119.2 2.4 48.80 87 156 -36.8
Lorenzo Cain 451.3 1.5 47.99 176 157 294.3
Alex Kirilloff 177.6 1.5 47.90 165 158 19.6
Evan Longoria 455.5 1.5 47.66 174 159 296.5
Mitch Garver 166.1 2.4 47.43 90 160 6.1
Isiah Kiner-Falefa 296.8 1.6 47.29 163 161 135.8
Michael Conforto 207.5 2.7 46.72 73 162 45.5
Jonathan Villar 253.1 1.1 46.42 225 163 90.1
Jake Meyers 593.2 1.1 46.36 217 164 429.2
Myles Straw 127.0 2.3 46.30 100 165 -38
Eric Hosmer 470.1 0.4 46.29 329 166 304.1
Raimel Tapia 273.9 0.1 46.23 384 167 106.9
Andrew Vaughn 250.3 1.2 46.13 207 168 82.3
Randal Grichuk 419.6 0.6 45.75 292 169 250.6
Luke Voit 253.5 1.3 45.06 199 170 83.5
Connor Joe 341.2 0.8 44.99 255 171 170.2
Gleyber Torres 154.9 2.5 44.96 78 172 -17.1
Brad Miller 551.8 1.4 44.72 179 173 378.8
Pavin Smith 476.9 0.6 44.58 286 174 302.9
Wilmer Flores 472.5 1.2 44.38 208 175 297.5
Nick Madrigal 342.1 1.8 44.36 139 176 166.1
Yandy Díaz 464.5 1.6 43.94 156 177 287.5
Odúbel Herrera 583.2 1.1 43.70 228 178 405.2
Ramón Urías 529.7 1.6 43.62 162 179 350.7
Lane Thomas 269.2 1.1 43.28 226 180 89.2
Paul DeJong 491.4 2.1 43.27 112 181 310.4
Sam Hilliard 477.8 0.5 43.17 309 182 295.8
Joc Pederson 480.8 0.8 43.03 270 183 297.8
David Peralta 518.1 1.1 42.92 221 184 334.1
Luis Arraez 351.3 1.5 42.86 177 185 166.3
Bobby Bradley 491.0 0.5 42.52 308 186 305
Adam Engel 597.3 0.6 42.42 290 187 410.3
Manuel Margot 416.6 1.1 42.41 229 188 228.6
Josh Harrison 412.3 1.1 42.33 224 189 223.3
Jarred Kelenic 141.5 1.5 41.87 169 190 -48.5
Alcides Escobar 586.4 0.2 41.77 374 191 395.4
David Fletcher 417.0 1.3 41.64 191 192 225
Darin Ruf 454.5 0.9 41.61 247 193 261.5
Jeff McNeil 323.0 1.9 41.47 128 194 129
José Iglesias 547.7 1.2 41.30 210 195 352.7
Keibert Ruiz 145.4 2.2 41.24 106 196 -50.6
Garrett Cooper 527.1 1.0 41.17 238 197 330.1
Ji-Man Choi 577.3 0.7 40.70 282 198 379.3
Ben Gamel 579.0 0.9 40.53 242 199 380
Michael A. Taylor 516.5 1.2 40.05 212 200 316.5
Kyle Lewis 399.0 1.5 40.01 170 201 198
Kyle Farmer 528.0 1.1 39.87 220 202 326
Chas McCormick 495.8 1.1 39.85 218 203 292.8
Tyrone Taylor 526.6 1.1 39.84 222 204 322.6
Brian Anderson 495.9 1.8 39.79 137 205 290.9
Anthony Rendon 110.2 4.3 39.62 25 206 -95.8
Elias Díaz 229.7 1.3 39.11 198 207 22.7
Max Stassi 300.6 1.8 38.90 135 208 92.6
Jake Fraley 547.6 0.9 38.79 241 209 338.6
Hunter Dozier 443.8 0.6 38.60 288 210 233.8
Gio Urshela 306.8 1.3 38.29 192 211 95.8
Alec Bohm 311.6 1.4 38.25 188 212 99.6
Kevin Kiermaier 588.0 1.2 38.24 203 213 375
Chad Pinder 570.3 0.6 37.37 291 214 356.3
Leury García 536.2 0.7 37.13 281 215 321.2
Eric Haase 302.9 0.7 36.82 278 216 86.9
Christian Walker 506.6 0.7 36.67 276 217 289.6
Carlos Santana 511.8 0.7 36.64 272 218 293.8
Danny Jansen 403.1 1.9 36.18 127 219 184.1
Carson Kelly 280.7 2.0 36.18 120 220 60.7
Steven Duggar 597.8 0.7 35.95 275 221 376.8
Gavin Lux 266.4 1.5 35.79 178 222 44.4
Rowdy Tellez 342.4 0.8 35.59 268 223 119.4
Marcell Ozuna 174.3 2.2 35.55 103 224 -49.7
Yadiel Hernandez 583.5 0.1 35.52 400 225 358.5
Sean Murphy 252.4 2.5 35.18 82 226 26.4
Abraham Toro 323.7 1.4 35.13 190 227 96.7
Andy Ibáñez 563.1 1.5 35.07 175 228 335.1
Edmundo Sosa 501.9 1.5 34.71 172 229 272.9
Miguel Cabrera 539.9 -0.4 34.53 500 230 309.9
Gary Sánchez 245.3 1.3 34.33 196 231 14.3
Tommy La Stella 561.3 1.2 34.17 213 232 329.3
Kyle Isbel 542.5 0.6 34.00 299 233 309.5
Yoshi Tsutsugo 431.9 0.3 33.96 336 234 197.9
Jordan Luplow 600.6 0.8 33.83 263 235 365.6
Justin Upton 565.1 0.3 33.82 349 236 329.1
Andrés Giménez 282.5 1.7 33.51 146 237 45.5
Brandon Marsh 418.4 1.2 33.43 209 238 180.4
Cody Bellinger 98.9 3.1 33.43 60 239 -140.1
Garrett Hampson 299.6 0.3 33.38 340 240 59.6
Austin Slater 575.2 1.0 33.27 232 241 334.2
Christian Vázquez 201.1 1.7 33.26 142 242 -40.9
Bradley Zimmer 505.3 0.7 32.70 280 243 262.3
Nick Solak 494.8 0.7 32.51 284 244 250.8
Yadier Molina 322.4 1.2 32.43 211 245 77.4
Kole Calhoun 522.9 1.1 32.23 227 246 276.9
Anthony Alford 555.4 0.0 32.21 468 247 308.4
Kevin Pillar 590.7 0.4 32.16 314 248 342.7
Alejandro Kirk 237.1 1.7 32.10 144 249 -11.9
Nick Ahmed 581.5 1.1 31.97 223 250 331.5
Jurickson Profar 540.1 0.3 31.88 341 251 289.1
Bryan De La Cruz 534.7 0.9 31.53 240 252 282.7
Colin Moran 582.8 -0.1 31.44 494 253 329.8
Jed Lowrie 597.2 0.8 31.04 267 254 343.2
Cavan Biggio 319.3 1.4 30.99 184 255 64.3
Didi Gregorius 513.0 0.9 30.82 252 256 257
Lars Nootbaar 572.6 0.2 30.61 375 257 315.6
Ha-Seong Kim 349.8 1.5 30.60 173 258 91.8
Nick Senzel 422.0 1.0 30.27 234 259 163
Corey Dickerson 581.1 0.1 30.04 388 260 321.1
Willie Calhoun 451.5 0.2 30.00 359 261 190.5
Jacob Stallings 421.9 2.2 29.89 102 262 159.9
J.D. Davis 511.6 0.9 29.81 253 263 248.6
Omar Narváez 251.8 1.7 29.56 143 264 -12.2
Santiago Espinal 557.5 0.7 29.54 273 265 292.5
Luis García 556.8 0.7 29.43 277 266 290.8
Yan Gomes 442.1 1.5 29.31 166 267 175.1
Nico Hoerner 506.9 1.4 29.19 182 268 238.9
Aristides Aquino 585.4 0.4 29.07 333 269 316.4
Dominic Smith 456.4 0.5 29.02 304 270 186.4
Elvis Andrus 566.4 0.9 29.00 248 271 295.4
Michael Chavis 569.3 0.0 28.96 477 272 297.3
Ryan Jeffers 480.5 1.6 28.59 159 273 207.5
Thairo Estrada 600.3 0.8 28.55 259 274 326.3
Mike Moustakas 374.8 0.9 28.39 245 275 99.8
Josh Naylor 587.4 0.5 28.26 306 276 311.4
Rougned Odor 504.9 0.6 28.11 289 277 227.9
Jorge Mateo 472.9 0.5 28.09 312 278 194.9
Jake McCarthy 600.8 0.0 27.82 429 279 321.8
Travis d'Arnaud 224.4 1.4 27.34 183 280 -55.6
Oscar Mercado 593.3 0.7 26.97 285 281 312.3
Austin Nola 377.8 1.5 26.90 171 282 95.8
Luis Torrens 446.4 0.7 26.55 279 283 163.4
Victor Reyes 592.8 0.4 26.27 316 284 308.8
Aledmys Díaz 573.3 0.9 25.96 243 285 288.3
Niko Goodrum 571.4 0.7 25.90 283 286 285.4
Alex Dickerson 597.9 0.2 25.34 357 287 310.9
Francisco Mejía 492.9 1.3 25.26 202 288 204.9
James McCann 351.5 1.4 24.39 186 289 62.5
Brett Phillips 590.7 0.3 24.18 337 290 300.7
Daniel Vogelbach 586.6 0.6 23.82 298 291 295.6
Aaron Hicks 524.7 1.6 23.74 164 292 232.7
Tucker Barnhart 471.5 1.3 23.71 200 293 178.5
Brett Gardner 598.6 0.6 23.57 297 294 304.6
Tyler Wade 505.0 0.4 23.51 331 295 210
Victor Robles 474.9 1.0 23.47 237 296 178.9
Stephen Piscotty 594.4 -0.2 23.41 497 297 297.4
Roberto Pérez 545.7 1.3 23.39 195 298 247.7
Donovan Solano 594.0 0.4 23.28 324 299 295
Jason Heyward 594.8 0.8 23.20 269 300 294.8
Kevin Newman 597.1 0.5 22.44 303 301 296.1
Manny Piña 570.7 1.2 22.42 205 302 268.7
Jarren Duran 497.2 0.4 22.24 327 303 194.2
Mitch Moreland 600.9 0.3 21.77 342 304 296.9
Jorge Alfaro 454.9 0.0 21.75 474 305 149.9
Matt Vierling 578.6 0.5 21.62 313 306 272.6
Yu Chang 596.6 0.4 21.45 334 307 289.6
Kyle Higashioka 555.9 1.0 21.33 236 308 247.9
DJ Stewart 600.3 0.5 21.32 305 309 291.3
Cole Tucker 586.6 0.4 20.09 325 310 276.6
Jace Peterson 579.4 0.2 20.02 361 311 268.4
Matt Beaty 591.6 0.2 19.96 358 312 279.6
Dom Nuñez 586.9 0.5 19.74 311 313 273.9
Matt Duffy 586.2 0.5 19.74 310 314 272.2
Pedro Severino 522.6 0.9 19.66 244 315 207.6
Josh VanMeter 597.6 0.4 19.59 323 316 281.6
Martín Maldonado 562.4 1.2 19.36 216 317 245.4
Trevor Larnach 565.1 0.1 19.26 390 318 247.1
Brent Rooker 597.9 0.3 19.16 348 319 278.9
Clint Frazier 472.5 0.3 19.12 345 320 152.5
Lewin Díaz 585.3 0.3 19.00 344 321 264.3
Carter Kieboom 539.8 0.8 18.82 258 322 217.8
Austin Hedges 587.0 0.5 18.76 307 323 264
Mike Brosseau 597.1 0.4 18.20 320 324 273.1
Albert Pujols 597.5 -0.3 17.99 499 325 272.5
Taylor Walls 587.7 1.1 17.93 219 326 261.7
Tom Murphy 539.9 1.2 17.55 204 327 212.9
Brian Goodwin 600.6 0.1 17.21 399 328 272.6
Asdrúbal Cabrera 599.6 0.0 17.10 425 329 270.6
Robinson Chirinos 595.5 0.6 17.08 294 330 265.5
Max Schrock 599.8 0.0 16.79 467 331 268.8
Ehire Adrianza 999.0 0.1 16.42 403 332 667
Derek Hill 999.0 0.1 16.30 387 333 666
Dylan Moore 509.0 0.4 16.21 326 334 175
Jason Castro 599.2 0.8 16.17 256 335 264.2
Keston Hiura 465.4 0.2 16.11 364 336 129.4
Andrew Stevenson 597.8 -0.1 15.79 482 337 260.8
Christian Arroyo 594.0 0.2 15.56 354 338 256
Riley Adams 593.0 0.8 15.52 257 339 254
Jackie Bradley Jr. 589.7 0.2 15.46 379 340 249.7
Kevin Plawecki 581.7 0.8 15.38 261 341 240.7
Austin Barnes 592.5 0.8 15.06 264 342 250.5
Jose Barrero 478.7 1.2 15.06 206 343 135.7
Andrelton Simmons 586.5 1.1 14.85 230 344 242.5
Yonathan Daza 598.7 -0.1 14.83 480 345 253.7
Kelvin Gutierrez 597.6 0.3 14.80 351 346 251.6
Curt Casali 592.1 0.9 14.52 250 347 245.1
Edward Olivares 584.8 0.3 14.31 346 348 236.8
Taylor Ward 600.9 0.4 14.18 322 349 251.9
Danny Santana 999.0 0.1 14.08 406 350 649
Adam Eaton 999.0 0.2 13.97 355 351 648
Nick Gordon 584.3 0.4 13.90 330 352 232.3
Jonah Heim 531.7 0.9 13.76 246 353 178.7
Daz Cameron 599.3 0.2 13.64 378 354 245.3
Harold Castro 599.7 -0.1 13.58 492 355 244.7
Ben Rortvedt 999.0 1.0 12.93 235 356 643
Jake Marisnick 999.0 0.2 12.48 381 357 642
Victor Caratini 582.5 0.8 12.48 260 358 224.5
Guillermo Heredia 999.0 -0.1 12.45 491 359 640
Sheldon Neuse 999.0 0.2 11.95 362 360 639
Shed Long Jr. 999.0 0.2 11.67 367 361 638
Harold Ramirez 574.0 0.2 11.51 373 362 212
Charlie Culberson 597.2 0.0 11.21 432 363 234.2
Jarrod Dyson 999.0 -0.2 11.19 495 364 635
Willi Castro 563.7 0.3 11.01 352 365 198.7
Jon Berti 576.3 0.4 10.83 328 366 210.3
Juan Lagares 999.0 0.1 10.83 417 367 632
Luis Rengifo 600.2 0.4 10.62 317 368 232.2
Wilson Ramos 600.9 0.6 10.62 296 369 231.9
Mauricio Dubón 596.2 0.3 10.62 335 370 226.2
Nomar Mazara 600.4 0.1 10.48 416 371 229.4
Leody Taveras 573.9 0.6 10.47 287 372 201.9
Starlin Castro 597.3 0.4 10.42 332 373 224.3
Jack Mayfield 600.9 0.1 10.36 407 374 226.9
Cristian Pache 573.0 0.8 10.35 266 375 198
Khris Davis 999.0 0.0 10.12 448 376 623
Tomás Nido 601.0 0.2 10.11 360 377 224
Hoy Park 600.7 0.4 9.80 315 378 222.7
Owen Miller 600.7 0.2 9.57 365 379 221.7
Hanser Alberto 600.3 0.2 9.53 356 380 220.3
Maikel Franco 590.0 0.0 9.42 444 381 209
Jose Trevino 596.6 0.3 9.31 343 382 214.6
Josh Reddick 999.0 -0.2 9.29 498 383 616
Zack Collins 598.5 0.6 9.26 302 384 214.5
Shogo Akiyama 600.8 0.0 9.24 449 385 215.8
Zach McKinstry 596.2 0.1 9.22 382 386 210.2
Seby Zavala 598.5 -0.1 9.14 485 387 211.5
David Bote 597.7 0.4 9.11 319 388 209.7
Ronald Torreyes 600.9 0.0 9.06 453 389 211.9
Eli White 599.7 0.1 8.92 414 390 209.7
Michael Perez 999.0 0.4 8.54 318 391 608
Andrew Velazquez 600.7 0.2 8.46 368 392 208.7
Orlando Arcia 600.8 0.3 8.44 350 393 207.8
Wilmer Difo 999.0 -0.1 8.38 487 394 605
Kurt Suzuki 599.5 0.2 8.26 369 395 204.5
Travis Shaw 600.5 0.1 8.16 418 396 204.5
Jake Rogers 600.8 0.3 8.14 347 397 203.8
Renato Núñez 599.6 0.2 8.00 380 398 201.6
Ryan O'Hearn 599.8 0.0 7.85 461 399 200.8
Miguel Andújar 595.5 0.0 7.73 439 400 195.5
Matt Carpenter 597.1 0.0 7.68 458 401 196.1
Kyle Garlick 999.0 0.0 7.67 471 402 597
Jake Cave 999.0 -0.1 7.45 490 403 596
Cal Raleigh 560.8 0.9 7.42 249 404 156.8
Joe Panik 999.0 0.1 7.29 422 405 594
Jose Rojas 600.4 0.0 7.09 450 406 194.4
Andrew Knizner 598.8 0.6 7.00 295 407 191.8
Sergio Alcántara 600.5 0.2 6.99 371 408 192.5
Emmanuel Rivera 999.0 0.1 6.96 393 409 590
Billy McKinney 599.2 0.0 6.74 438 410 189.2
Chris Owings 600.7 0.0 6.73 435 411 189.7
Danny Mendick 999.0 0.1 6.59 383 412 587
Stephen Vogt 999.0 0.1 6.55 409 413 586
Eric Sogard 999.0 0.0 6.49 430 414 585
Daniel Johnson 999.0 0.0 6.46 442 415 584
Ernie Clement 999.0 0.2 6.32 370 416 583
Luis Guillorme 999.0 0.3 6.29 353 417 582
Billy Hamilton 597.8 -0.1 6.28 489 418 179.8
Lewis Brinson 600.5 0.0 6.26 441 419 181.5
Rob Refsnyder 999.0 0.0 6.20 447 420 579
Yonny Hernandez 599.8 0.3 6.19 339 421 178.8
Marwin Gonzalez 600.4 0.2 6.05 376 422 178.4
Phil Gosselin 594.3 0.0 6.04 460 423 171.3
Tim Locastro 598.6 0.1 5.93 401 424 174.6
Jordy Mercer 999.0 -0.1 5.89 483 425 574
Brock Holt 999.0 0.1 5.77 423 426 573
Nick Maton 999.0 0.0 5.73 434 427 572
Sandy León 999.0 0.0 5.67 443 428 571
Isaac Paredes 600.3 0.4 5.55 321 429 171.3
José Peraza 600.8 0.1 5.45 410 430 170.8
Tres Barrera 999.0 0.1 5.38 398 431 568
Travis Jankowski 999.0 0.0 5.38 427 432 567
Taylor Trammell 600.9 0.1 5.18 396 433 167.9
Rodolfo Castro 999.0 0.1 5.14 386 434 565
Pablo Reyes 999.0 0.1 4.97 408 435 564
Taylor Jones 999.0 0.0 4.78 436 436 563
William Contreras 593.2 0.2 4.75 363 437 156.2
Rafael Marchan 999.0 0.1 4.62 392 438 561
Dustin Garneau 999.0 0.1 4.50 402 439 560
José Rondón 999.0 0.0 4.42 446 440 559
Cam Gallagher 600.6 0.3 4.23 338 441 159.6
Reese McGuire 599.1 0.2 4.21 372 442 157.1
DJ Peters 600.5 0.0 4.11 464 443 157.5
Jahmai Jones 600.7 0.1 4.00 411 444 156.7
Yermín Mercedes 596.9 0.1 3.88 415 445 151.9
Luke Williams 999.0 0.0 3.86 455 446 553
Drew Ellis 999.0 0.1 3.84 389 447 552
Roman Quinn 999.0 0.0 3.75 454 448 551
Matt Joyce 999.0 -0.1 3.67 484 449 550
Jake Lamb 999.0 0.1 3.65 404 450 549
Evan White 600.2 0.0 3.59 437 451 149.2
Isan Díaz 600.5 0.1 3.57 421 452 148.5
Richie Martin 999.0 -0.1 3.57 479 453 546
Alan Trejo 999.0 0.0 3.53 426 454 545
Andrew Knapp 600.9 0.0 3.53 462 455 145.9
Alex Jackson 999.0 0.2 3.40 366 456 543
Alex Blandino 999.0 -0.2 3.36 496 457 542
Jake Bauers 600.5 -0.1 3.27 486 458 142.5
Ryan McKenna 999.0 0.1 3.25 391 459 540
Zack Short 999.0 0.1 3.14 385 460 539
Aramis Garcia 999.0 0.0 3.14 457 461 538
Delino DeShields 999.0 0.0 3.13 451 462 537
Magneuris Sierra 999.0 -0.1 3.01 493 463 536
Donovan Walton 999.0 0.1 2.98 405 464 535
Ender Inciarte 600.0 0.0 2.98 465 465 135
Grayson Greiner 999.0 0.0 2.95 456 466 533
Jason Vosler 999.0 0.1 2.85 419 467 532
David Dahl 599.6 0.0 2.65 476 468 131.6
Skye Bolt 999.0 0.2 2.65 377 469 530
Daniel Robertson 999.0 0.0 2.61 440 470 529
John Nogowski 600.8 0.0 2.58 433 471 129.8
Gilberto Celestino 999.0 0.1 2.49 413 472 527
Austin Romine 999.0 0.1 2.47 395 473 526
Luke Raley 999.0 0.1 2.26 420 474 525
Abraham Almonte 999.0 0.0 2.26 459 475 524
Tucupita Marcano 600.6 0.0 2.13 428 476 124.6
Pat Valaika 999.0 0.0 1.99 469 477 522
Rio Ruiz 999.0 0.0 1.80 452 478 521
Jonathan Araúz 999.0 0.0 1.77 424 479 520
Webster Rivas 999.0 0.0 1.71 470 480 519
Wyatt Mathisen 999.0 0.0 1.68 431 481 518
Phillip Evans 999.0 0.0 1.59 478 482 517
Willians Astudillo 601.0 0.0 1.59 466 483 118
Jonathan Davis 999.0 -0.1 1.51 481 484 515
Patrick Mazeika 999.0 0.1 1.49 412 485 514
Ildemaro Vargas 999.0 0.0 1.41 445 486 513
Franchy Cordero 999.0 0.0 1.20 463 487 512
Chad Wallach 999.0 0.0 1.19 475 488 511
Justin Williams 999.0 -0.1 1.03 488 489 510
Edwin Ríos 587.4 0.0 0.91 472 490 97.4
Chance Sisco 999.0 0.1 0.80 397 491 508
Albert Almora Jr. 999.0 0.1 0.64 394 492 507
Robel García 999.0 0.0 0.50 473 493 506
NA
NA
NA
NA
NA
NA
---
title: "Welcome to my 2022 Projections for Hitters 6x6"
author: "Darshan Patel"
date: "`r Sys.Date()`"
output: 
  html_notebook:
    toc: true
    toc_float: true
    number_sections: true
    theme: sandstone
    highlight: tango
    fig_caption: true
    df_print: paged
---

<html>

<p>

Projections using Hypertuned model through XGboost

</p>

<p>

All data is from [FanGraphs.](https://www.fangraphs.com/) I have no affiliation with FanGraphs, but please consider contributing to their [website](https://plus.fangraphs.com/shop/) if you found this project informative.

</p>

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = 'C:/Users/Admin/Documents/Learning Python Folder1/Python Essence Training/Fantasy-Baseball/Data')
options(knitr.table.format = "html") 
options(digits=2)
options(scipen = 100)
```


# Project Scope {.tabset .tabset-pills}

## Objective

This project is designed to showcase how Using a Percentile Based Worth System values Fantasy Baseball Players through a Inning Pitched (IP) weighted projection

The Categories used for prediction valuation are year-end rankings for the following metrics:
-   HRs
-   Runs
-   RBIs
-   Batting Average
-   Stolen Bases
-   OPS

![](IntroChart6x6.png)


***  

# Processing the Data {.tabset .tabset-pills}

## Getting Data Into R

### Load Libraries

<p style="color:black;">

*First we need to load the packages that R needs to run the analysis*

</p>

```{r load library,message = FALSE,warning=FALSE}
library(sqldf) #SQL in R
library(skimr) #Summaries and useful for removing low % data
library(ggplot2) #Plotting Functions
library(plyr) #slightly deprecated data cleaning
library(dplyr) #slightly updated data cleaning
library(tidyverse) #tidyverse data cleaning universe
library(caret) #wrapper for creating, tuning and validating models
library(xgboost) #package for creating regression tree model
library(vtreat) # useful package for treating data before modeling 
library(Matrix)
library(Boruta)
library(mgcv)
library(moments) #for measuring skewness
library(data.table) #alternative to dplyr we use to create lags
library(pdp) #partial dependence graphs
library(vip) #variable importance 
library(grid) #put multiple plots on one grid
library(gridExtra) #additional grid functionality
library(janitor) #one function used to clean transposed data set
library(ggpubr) #for qq plot
library(tableHTML)
library(kableExtra)
```

The \# comments generally explain what additional functionality each library adds to R

### Load in Data

All data is downloaded from Fan Graphs. From this [location](https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2021&month=0&season1=2021&ind=0). The data is also available on my Github [here](https://github.com/dissipation/Fantasy-Baseball). There are player level and team data sets

```{r data read-in, results= 'hide',message=FALSE}

#data read-in
Batter_data <- read_csv("FanGraphs Leaderboard_Hitting50PA.csv")

#Team datasets
FDG_Team = read_csv("FanGraphs Leaderboard_Team.csv")


#Create a prefix for all team stats that starts with T_
FDG_Team2 <- FDG_Team %>% 
  rename_with( ~ paste0("T_", .x))

```

### Checking Team Data

`str` give information about an object, while `skim` provides a customizable summary  

```{r checking team data}

#Output not shown for space
#str(FDG_Team2)

skim(FDG_Team2) %>%  
  tibble::as_tibble()
```
***  

## Understanding the Dataset

### Exploring the dataset

`skim` let's us see how the data was imported into R. Documentation can be found [here](https://cran.r-project.org/web/packages/skimr/vignettes/skimr.html)

```{r}

#Full Dataset dimensions

skimr::skim(Batter_data) %>% 
  tibble::as_tibble() %>% 
  select(skim_type,skim_variable,complete_rate) %>% 
  filter(complete_rate >0.30) #288 Variables

#skim_type - character or numeric
#skim_variable - name of variable
#complete_rate - % of data that is not missing
#filter - only keep variables that have 30% of data populated
```

Additionally let's look at how variables vary by year to see if there are any discrepancies there  

```{r}

#It looks like one year, there were fewer games played, and there is a clear drop off in home runs
Batter_data_dist =
Batter_data %>% 
 group_by(Season) %>% 
  summarize (Games_played = max(G),
             Avg_HR= mean(HR)
             )
Batter_data_dist

ggplot(Batter_data_dist, aes(Season, Avg_HR)) +
  geom_col()+
  ggtitle("Average Home Runs by Year")+
  theme(plot.title = element_text(hjust = 0.5,size = 22,color ="steel blue"))
```

***  

## Cleaning and Creating Initial Dataset for Model

What are some issues with the data?

1.  Many of Variables, such as K%, are being read in as characters

    -   Only Team and Player Name should be characters

2.  There is spotty data coverage in some of the variables (\~Variables have less than 30% Coverage)

3.  2020 Data only includes 60 games worth of data

    -   This was a season shortened due to Covid-19

4.  Team Data needs to be appended to Batter Data by Team Name 

***  

### Cleanly Changing all Variables that are characters to numeric.  
There are several ways to do this, we will identify the variables we want to change that are mis-identified. `parse_number` can be used to pull numbers from these variables. Additional ways to tackle this can be found [here](https://stackoverflow.com/questions/8329059/how-to-convert-character-of-percentage-into-numeric-in-r)

```{r}

#Select Column names that are characters but not Team or Name, These should be percentages
Batter_data_chars_to_convert <- Batter_data %>% 
  select_if(is.character)%>% select(-Team,-Name) %>% 
  mutate_all (function(x) as.numeric(readr::parse_number(x))/100)
#Note : There are additional ways to do this, this is just one solution


#We can exclude the variables we converted and reintroduce them
Batter_data_num <- Batter_data %>% select(-colnames(Batter_data_chars_to_convert))

Batter_data2 = cbind(Batter_data_num,Batter_data_chars_to_convert) %>% 
  select (colnames(Batter_data)) %>%  #preserve original order 
  dplyr::rename(flyball_perc = `FB%...46`,fastball_perc = `FB%...73`) #rename two ambiguous columns
  
skim(Batter_data2) %>% 
  as_tibble() %>% 
  group_by(skim_type) %>% 
  count()


#Logical variables are R's best guess, in our case they are all NA's and will be removed
```

The same can be done for the Team Data that is loaded  


```{r}

#Select Column names that are characters but not Team or Name, These should be percentages
FDG_Team2_chars_to_convert <- FDG_Team2 %>% 
  select_if(is.character)%>% select(-T_Team) %>% 
  mutate_all (function(x) as.numeric(readr::parse_number(x))/100)
#Keep in mind, parse number may make actual characters into numerical variables so carefully check your data before using

#We can exclude the variables we converted and reintroduce them
FDG_Team2_num <- FDG_Team2 %>% select(-colnames(FDG_Team2_chars_to_convert))

FDG_Team3 = cbind(FDG_Team2_num,FDG_Team2_chars_to_convert) %>% 
  select (colnames(FDG_Team2)) %>%  #preserve original order
dplyr::rename(T_flyball_perc = `T_FB%...45`,T_fastball_perc = `T_FB%...72`) 

skim(FDG_Team3) %>% 
  as_tibble() %>% 
  group_by(skim_type) %>% 
  count()
```
***  


### Filtering Data with Low Coverage    
I choose 30% coverage of data necessary but this can be adjusted up or down. This will also get rid of columns that are all `NA`.  
```{r}

# Keep variables with enough values (Need 30% data coverage rate here)
Player_cols_to_keep =
skim(Batter_data2) %>% 
  dplyr::select(skim_type, skim_variable, complete_rate) %>% 
  filter (complete_rate > 0.30)

#Transpose Rows to get column names as skim melts the data
Player_cols_to_keep_transpose = t(Player_cols_to_keep) 

#extract the colnames we would like to keep
Player_cols_to_keep = colnames(janitor::row_to_names(Player_cols_to_keep_transpose,row_number = 2))

#Only keep the columns designated to have over 30% of their data populated or greater
Batter_data3 = Batter_data2 %>% 
  select(one_of(Player_cols_to_keep)) 
```


*Repeat the process for Team Variables*
```{r}
Team_cols_to_keep =
skim(FDG_Team3) %>% 
  dplyr::select(skim_type, skim_variable, complete_rate) %>% 
  filter (complete_rate > 0.30)


#Transpose Rows to get column names as skim melts the data
Team_cols_to_keep_transpose = t(Team_cols_to_keep) 

#extract the colnames we would like to keep
Team_cols_to_keep = colnames(janitor::row_to_names(Team_cols_to_keep_transpose,row_number = 2))

#Only keep the columns designated to have over 30% of their data populated or greater
FDG_Team4 = FDG_Team3 %>% 
  select(one_of(Team_cols_to_keep)) 



```



***  

### Creating Variables Normalized by Year  
Some Variables will need to be normalized by Innings_Pitched (IP) if they aren't a percentage already. Remaining Variables are percentages or indices so will not need to be transformed. The full data dictionary for these variables can be found on FanGraph's website [here.](https://library.fangraphs.com/pitching/complete-list-pitching/) for pitching variables and [here.](https://library.fangraphs.com/offense/offensive-statistics-list/) for hitting variables.  
```{r}

Batter_data4 = Batter_data3 %>% 
  mutate( #create new variables based on existing variables
    H_PA = H/PA,
    x1B_PA = `1B`/PA, #note: R can't have variables start with a number
    x2b_PA = `2B`/PA,
    x3b_PA = `3B`/PA,
    HR_PA = HR/PA,
    R_PA = R/PA,
    RBI_PA = RBI/PA,
    BB_PA = BB/PA,
    IBB_PA = IBB/PA,
    SO_PA=SO/PA,
    HBP_PA=HBP/PA,
    SF_PA=SF/PA,
    SH_PA=SH/PA,
    GDP_PA= GDP/PA,#ground into double play
    SB_PA=SB/PA,
    CS_PA=CS/PA,
    GB_PA = GB/PA,   #Groundballs
    FB_PA =  FB/PA,  #FlyBalls
    LD_PA = LD/PA,   #LineDrives
    IFFB_PA = IFFB/PA,  #Infield Fly balls
    Pitches_PA= Pitches/PA,
    Balls_PA= Balls/PA,
    Strikes_PA= Strikes/PA,
    IFH_PA= IFH/PA,
    BU_PA= BU/PA,
    BUH_PA= BUH/PA,
    PH_PA= PH/PA,
    Barrels_PA= Barrels/PA,
    HardHits_PA= HardHit/PA
  ) %>% select(-(H:CS),-(GB:BUH),-PH,-Barrels,-HardHit,-Events) #Drop the old variables

#skim(Batter_data4) %>% as_tibble()


```

*Repeat the process for Team Variables*
```{r}

FDG_Team5 = FDG_Team4 %>% 
  mutate( #create new variables based on existing variables
    T_H_T_PA = T_H/T_PA,
    T_x1B_T_PA = T_1B/T_PA, #note: R can't have variables start with a number
    T_x2b_T_PA = T_2B/T_PA,
    T_x3b_T_PA = T_3B/T_PA,
    T_HR_T_PA = T_HR/T_PA,
    T_R_T_PA = T_R/T_PA,
    T_RBI_T_PA = T_RBI/T_PA,
    T_BB_T_PA = T_BB/T_PA,
    T_IBB_T_PA = T_IBB/T_PA,
    T_SO_T_PA=T_SO/T_PA,
    T_HBP_T_PA=T_HBP/T_PA,
    T_SF_T_PA=T_SF/T_PA,
    T_SH_T_PA=T_SH/T_PA,
    T_GDP_T_PA= T_GDP/T_PA,#ground into double play
    T_SB_T_PA=T_SB/T_PA,
    T_CS_T_PA=T_CS/T_PA,
    T_GB_T_PA = T_GB/T_PA,   #Groundballs
    T_FB_T_PA =  T_FB/T_PA,  #FlyBalls
    T_LD_T_PA = T_LD/T_PA,   #LineDrives
    T_IFFB_T_PA = T_IFFB/T_PA,  #Infield Fly balls
    T_Pitches_T_PA= T_Pitches/T_PA,
    T_Balls_T_PA= T_Balls/T_PA,
    T_Strikes_T_PA= T_Strikes/T_PA,
    T_IFH_T_PA= T_IFH/T_PA,
    T_BU_T_PA= T_BU/T_PA,
    T_BUH_T_PA= T_BUH/T_PA,
    T_PH_T_PA= T_PH/T_PA,
    T_Barrels_T_PA= T_Barrels/T_PA,
    T_HardHits_T_PA= T_HardHit/T_PA
  ) %>% select(-(T_H:T_CS),-(T_GB:T_BUH),-T_PH,-T_Barrels,-T_HardHit,-T_Events) #Drop the old variables


#skim(FDG_Team5) %>% as_tibble()


```

***  

### Creating Lagged Variables  

There are several ways to lag a dataset **BY GROUP**.\
\* `Dplyr` way is [here.](https://statisticsglobe.com/create-lagged-variable-by-group-in-r).\
\* The `data.table` (the method used below) is [here.](https://stackoverflow.com/questions/26291988/how-to-create-a-lag-variable-within-each-group)  
```{r}
#Note we will only be lagging the player level data, as the previous year's team performance shouldn't impact current performance


#Order the dataset by lag columns
Batter_data5 =  arrange(Batter_data4, playerid,Season) #playerid is the Fangraph id assigned to each player

# Convert dataframe to data.table format
DT_batter = data.table(Batter_data5)

#designate columns to lag - which is all of them
cols1 = colnames(Batter_data5)
anscols = paste("lag", cols1, sep="_")
DT_batter[, (anscols) := data.table::shift(.SD, 1, NA, "lag"),by ='playerid', .SDcols=cols1] #Create 1 period lags by year

Batter_data6 = as.data.frame(DT_batter) %>% select(-lag_playerid, -lag_Team, -lag_Season, -lag_Age,-lag_Name)

ncol(Batter_data5) #287 - no lags
ncol(Batter_data6) #574 - lagged data ~ (287 * 2)-5

```

***    

### Merging Team and Player Data  
We can use either the `merge` function or the SQL functionality provided by the `sqldf` package to join the lagged player level data to the Team level data

```{r}

df_batting_init = sqldf(
  "
  select a.*, b.*
  from Batter_data6 a
  left join FDG_Team5 b
  on a.Team = b.T_Team and a.Season = b.T_Season
  
  "
)  %>% select(-T_Team,-T_Season,T_Age,T_G,T_AB)# Unncessary Team Variables


nrow(df_batting_init) - nrow(Batter_data6) #check if any rows are duplicated


```


***  


# Creating Rankings for Players Based On Percentiles {.tabset .tabset-pills}

We can use Percentile based ranking to get rankings for players from the 2021 season.

## Worth of each stat

### Calculating past performance

Each player goes from a 0% to 100% on each percentile stat that is used for creating a scoring opportunity. Data is not yet normalized by PA as certain stats such as HRs and SBs will be worth more when we do.\

```{r}

#Categories I include are:
#Runs (R), Home Runs (HR), Runs Batted In (RBI), Stolen Bases (SB), Batting Average (AVG)
df_batting_init2 =  df_batting_init %>%
#  arrange(player_id,year) %>% 
  group_by(Season) %>% 
  mutate(
    Runs_share = order(order(rank(R_PA,ties.method = 'average'),decreasing = FALSE))/n(),
     HR_share = order(order(rank(HR_PA,ties.method = 'average'),decreasing = FALSE))/n(),
     RBI_share = order(order(rank(RBI_PA,ties.method = 'average'),decreasing = FALSE))/n(),
     SB_share = order(order(rank(SB_PA,ties.method = 'average'),decreasing = FALSE))/n(),
     AVG_share = order(order(rank(AVG,ties.method = 'average'),decreasing = FALSE))/n(),
    OPS_share = order(order(rank(OPS,ties.method = 'average'),decreasing = FALSE))/n(),
    Worth = Runs_share+HR_share+RBI_share+SB_share+AVG_share+OPS_share
    ) %>% 
  ungroup() 

```

Chart of the Distribution of initial percentiles  
As the chart below shows, the data is roughly normal.
```{r}

skewness((df_batting_init2$Worth))

ggplot2::qplot(df_batting_init2$Worth, main="Total Dataset") + geom_histogram(colour="black", fill="lime green")

min(df_batting_init2$Worth)

max(df_batting_init2$Worth)

ggpubr::ggqqplot(df_batting_init2$Worth)

shapiro.test(df_batting_init2$Worth)
```


***  
## 2021 Player Rankings - Per PA performance

### 2021 Player Rankings - Top Worth Player

There are per PA rankings. Players like Byron Buxton which had a great per PA score but can't stay healthy for a season will be adjusted down.

```{r,warning=FALSE}

options(digits=2)

df_batting_init2021 =
df_batting_init2 %>% 
  group_by(Name) %>% 
  filter(Season == 2021) %>% 
  arrange(desc(Worth)) %>% 
  select(Name,Runs_share,HR_share,RBI_share, SB_share,OPS_share,AVG_share,Worth)


df_batting_init2021 %>%
  filter (Worth>3.9) %>% 
  kbl() %>% 
 kable_material(c("striped", "hover","condensed","responsive"),full_width = F,fixed_thead = T)
```


***  


# Creating Model File {.tabset .tabset-pills}  

## Additional Data Prep  

### Remove Variables which are based off current hitting numbers  

Not all variables can be used for predictive modeling.  Variables that go into the percentile ranking or are non-normalized metrics created after the fact (such as `WAR` - Wins above Replacement or `RE24`) should be removed. However, metrics that are normalized by a per pitch basis (such as `HR/FB%+`) can remain as we expect batters to have similar performance in these metrics one year out.  

```{r}
#Creating a new dataset to keep original intact
df_batting_init3 = df_batting_init2
```

Lagged Percentile (`_share`) Variables can be used for predictive modeling. However since these variables were created for the Worth metric they must also be removed for modeling purposes.  

```{r}

#Order the dataset by lag columns
df_batting_init4 =  arrange(df_batting_init3, playerid,Season) #playerid is the Fangraph id assigned to each player

# Convert dataframe to data.table format
DT_batter2 = data.table(df_batting_init4)

#designate columns to lag - which is all of them
cols1 = (c('Runs_share','HR_share','RBI_share', 'SB_share','OPS_share','AVG_share','Worth'))
anscols = paste("lag", cols1, sep="_")
DT_batter2[, (anscols) := data.table::shift(.SD, 1, NA, "lag"),by ='playerid', .SDcols=cols1] #Create 1 period lags by year

#names(DT_batter2)

df_batting_final = as.data.frame(DT_batter2) %>% 
  select(-c(Runs_share,HR_share,RBI_share, SB_share,OPS_share,AVG_share))%>% 
  select(-(G:AVG),-(OBP:BABIP),-(wOBA:Dol),-(`wRC+`:REW),-(`WPA/LI`),-(wFB:wSF),-BsR,-(Def:wGDP),-(`wCH (pi)`:`wCH/C (pi)`),-(`AVG+`),-(`OBP+`:`BABIP+`),-(H_PA:PH_PA)) %>% select (-Name)


```



### Creating Training/Test Split  
We split the data into Training Data (which is used to create the model) and test data (which is used to validate the model)   
```{r}

set.seed(15674)  # For reproducibility
# Create index for testing and training data
inTrain <- createDataPartition(y = df_batting_final$Worth, p = 0.80, list = FALSE)
# subset pitching data for training
tr_2021 <- df_batting_final[inTrain,]
# subset the rest to test and validate trained model
te_2021 <- df_batting_final[-inTrain,]

nrow(tr_2021)/nrow(df_batting_final) #check if split is 0.8

```

### Treat Missing Data by Imputing Mean Value  
Vtreat Package in R is excellent for treating data before using for modeling. Additional documentation can be found [here.](https://winvector.github.io/vtreat/index.html)  
*Note: The treatment plan also fixes variables names  like`HR/FB%+` (which R doesn't always handle the best) to `HR_slash_FB_percent_plus_`*  
```{r}
treat_plan_2021 <- vtreat::designTreatmentsZ(
  dframe = tr_2021, # training data
  varlist = colnames(tr_2021) %>% .[. != "hitting_score1"], # input variables = all training data columns, except random
  codeRestriction = c("clean", "isBAD", "lev"), # derived variables types (drop cat_P)
  verbose = FALSE) # suppress messages

#clean stands for cleaned numerical variable, isBAD indicates that a value replacement has occurred (which indicates a missing value in this case), and lev is a binary indicator whether a particular value of that categorical variable was present.  

#### Checking Scoreframe

score_frame <- treat_plan_2021$scoreFrame %>% 
  select(varName, origName, code)

head(score_frame)


tr_treated_2021 <- vtreat::prepare(treat_plan_2021, tr_2021)
te_treated_2021 <- vtreat::prepare(treat_plan_2021, te_2021)


Total_dataset1_untreat = as.data.frame(DT_batter2) %>% select(-Name)

treat_plan_2021 <- vtreat::designTreatmentsZ(
  dframe = Total_dataset1_untreat, # training data
  varlist = colnames(Total_dataset1_untreat) %>% .[. != "hitting_score1"], # input variables = all training data columns, except random
  codeRestriction = c("clean", "isBAD", "lev"), # derived variables types (drop cat_P)
  verbose = FALSE) # suppress messages


total_treated_2021_hitting <- vtreat::prepare(treat_plan_2021, Total_dataset1_untreat)



#tr_treated = tr
#te_treated = te

dim(tr_treated_2021) #note there are dummies for each player and team

```


***    


### Check Distribution of Training Population  
The population used for Training should be indicative of Total Population
```{r}

ggplot2::qplot(tr_treated_2021$Worth, main="Training Set") + geom_histogram(colour="black", fill="limegreen") + theme_bw()

skewness(tr_treated_2021$Worth) #The skewness is the same as the overall
```


# Running XGboost Model {.tabset .tabset-pills} 
To keep things simple with modeling, we’ll turn the training data into simple input variables for `caret::train`, dropping the response variable and converting the data frame to a matrix. Documentation for this approach to XGboost can be found [here.](https://www.kaggle.com/pelkoja/visual-xgboost-tuning-with-caret)    

## Tuning the Model

### Initial Non-Tuned Model
Break the data set into x and y inputs with x being a matrix  
```{r}
input_x <- as.matrix(((tr_treated_2021))%>%
   select(-Worth) %>%                      
   select(!ends_with ("_isBAD")))

input_y <- tr_treated_2021$Worth

```

XGBoost with Default Hyperparameters    
The Variable Importance (`caret::varImp(xgb_base_2021, scale = F  )`) from the caret package shows the contribution of each variable to the initial model. As you can see SLG_plus_ (SLG+) takes up much of the importance as it is derived from SLG (one of the key contributors to Worth). These types of variables will be removed during variable selection in the next step.  
*XGBoost documentation can be found for more general models [here.](https://www.kaggle.com/code/rtatman/machine-learning-with-xgboost-in-r/notebook)*

```{r}

#Defaults for xgboost model
grid_default <- expand.grid(
  nrounds = 100,
  max_depth = 6,
  eta = 0.3,
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

#This is a blank train_control set, this will be updated after
train_control <- caret::trainControl(
  method = "none",
  verboseIter = FALSE, # no training log
  allowParallel = TRUE # FALSE for reproducible results 
)

xgb_base_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = train_control,
  tuneGrid = grid_default,
  method = "xgbTree",
  verbose = TRUE
)

caret::varImp(xgb_base_2021, scale = F  )




```

## Further Variable Selection

### Remove redundant and highly correlated variables

Selection Removal Step 1: Check for high correlations\
Normally, this step is done early, but those steps were reserved for preparing the data

```{r}

dep_cor1 <- t(as.data.frame(cor(tr_treated_2021[ , colnames(tr_treated_2021) != "Worth"],
                tr_treated_2021$Worth)))
dep_cor1 <-
as.data.frame(t(as.data.frame(dep_cor1)%>% 
  select(!starts_with("lag")) %>% #remove lag variables
  select(!contains("_isBAD")))) 

dep_cor1 <- tibble::rownames_to_column(dep_cor1,"VARIABLES")%>% #remove indicators for missing data
  filter(V1 > 0.70|V1 < -0.5)

dep_cor1

dep_cor2 <- colnames(row_to_names(t(dep_cor1),row_number = 1))



```
Let's Remove variables with high correlation to worth metric (such as `wFB/C`)

```{r}

input_x <- as.matrix(((tr_treated_2021))%>%
   select(-Worth) %>% #Remove dependent variable
     select (-all_of(dep_cor2) ) %>%      
select(!ends_with ("_isBAD"))) #indicator variable for missing data

input_y <- tr_treated_2021$Worth





```

Run the model on the new dataset to make sure the variable importances look fine
```{r}

#Note Training parameters were set in initial model set up
xgb_base_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = train_control,
  tuneGrid = grid_default,
  method = "xgbTree",
  verbose = TRUE
)

caret::varImp(xgb_base_2021, scale = F  )


```


## Model with new data  

### Tuning All Hyperparameters
A tune grid allows us to test a large amount of hyper-parameters and find the model with the lowest RMSE for predictions.   
However, The more values you want to test and the greater the amount of Cross-Fold Validations (`method = "cv"`), the greater the computational time it will take. More information on the specific parameters can be found [here.](https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/beginners-tutorial-on-xgboost-parameter-tuning-r/tutorial/)

```{r}

# maximum number of trees
nrounds <- 1000

# note to start nrounds from 200, as smaller learning rates result in errors so
# big with lower starting points that they'll mess the scales
tune_grid <- expand.grid(
  nrounds = seq(from = 100, to = nrounds, by = 50),
  eta = c(0.01, 0.025, 0.05, 0.1),
  max_depth = c(2, 4, 6, 8),
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

tune_control <- caret::trainControl(
  method = "cv", # cross-validation
  number = 5, # with n folds 
  ## Note this was # out in the original code
  #index = createFolds(tr_treated$Id_clean), # fix the folds
  verboseIter = FALSE, # no training log
  allowParallel = TRUE # FALSE for reproducible results 
)



```

*Running the initial tuning model*  
```{r}
#Note I will be timing these runs to give an estimate on how long this model takes to run
start_time <- Sys.time()

xgb_tune_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid,
  method = "xgbTree",
  verbose = FALSE
  ,verbosity = 0
)

end_time <- Sys.time()

end_time - start_time

```

*Tuning Plot and Variable Importance*
```{r}
varImp(xgb_tune_2021, scale = F  ) 


# helper function for the plots
tuneplot <- function(x, probs = .90) {
  ggplot(x) +
    coord_cartesian(ylim = c(quantile(x$results$RMSE, probs = probs), min(x$results$RMSE))) +
    theme_bw()
}

tuneplot(xgb_tune_2021)
```

### Fine Tuning Model  
#### Second Tuning: Maximum Depth and Minimum Child Weight  
After fixing the learning rate to 0.1 and we’ll also set maximum depth to 3 +-1 (or +2 if max_depth == 2) to experiment a bit around the suggested best tune in previous step. Then, well fix maximum depth and minimum child weigh

```{r}
tune_grid2 <- expand.grid(
  nrounds = seq(from = 50, to = nrounds, by = 50),
  eta = xgb_tune_2021$bestTune$eta,
  max_depth = ifelse(xgb_tune_2021$bestTune$max_depth == 2,
    c(xgb_tune_2021$bestTune$max_depth:4),
    xgb_tune_2021$bestTune$max_depth - 1:xgb_tune_2021$bestTune$max_depth + 1),
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = c(1, 2, 3),
  subsample = 1
)

xgb_tune2_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid2,
  method = "xgbTree",
  verbose = TRUE
)

tuneplot(xgb_tune2_2021)

xgb_tune2_2021$bestTune

varImp(xgb_tune2_2021, scale = F  ) 
```

#### Third Tuning: Column and Row Sampling

```{r}

tune_grid3 <- expand.grid(
  nrounds = seq(from = 50, to = nrounds, by = 50),
  eta = xgb_tune_2021$bestTune$eta,
  max_depth = xgb_tune2_2021$bestTune$max_depth,
  gamma = 0,
  colsample_bytree = c(0.4, 0.6, 0.8, 1.0),
  min_child_weight = xgb_tune2_2021$bestTune$min_child_weight,
  subsample = c(0.5, 0.75, 1.0)
)

xgb_tune3_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid3,
  method = "xgbTree",
  verbose = TRUE
)

tuneplot(xgb_tune3_2021, probs = .95)

xgb_tune3_2021$bestTune

varImp(xgb_tune3_2021, scale = F  ) 
```

#### Fourth Tuning: Gamma  
Next, we again pick the best values from previous step, and now will see whether changing the gamma has any effect on the model fit:
```{r}
tune_grid4 <- expand.grid(
  nrounds = seq(from = 50, to = nrounds, by = 50),
  eta = xgb_tune_2021$bestTune$eta,
  max_depth = xgb_tune2_2021$bestTune$max_depth,
  gamma = c(0, 0.05,0.1, 0.2,0.4, 0.5, 0.7, 0.9, 1.0),
  colsample_bytree = xgb_tune3_2021$bestTune$colsample_bytree,
  min_child_weight = xgb_tune2_2021$bestTune$min_child_weight,
  subsample = xgb_tune3_2021$bestTune$subsample
)

xgb_tune4_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid4,
  method = "xgbTree",
  verbose = TRUE
)

tuneplot(xgb_tune4_2021)

xgb_tune4_2021$bestTune

varImp(xgb_tune4_2021, scale = F  ) 
```

#### Fifth Tuning: Reducing the Learning Rate  
Now, we have tuned the hyperparameters and can start reducing the learning rate to get to the final model:  

```{r}
start_time <- Sys.time()

tune_grid5 <- expand.grid(
  nrounds = seq(from = 100, to = 10000, by = 75),
   eta = c(0.01, 0.015, 0.025,0.035, 0.05,0.75, 0.1),
  max_depth = xgb_tune2_2021$bestTune$max_depth,
  gamma = xgb_tune4_2021$bestTune$gamma,
  colsample_bytree = xgb_tune3_2021$bestTune$colsample_bytree,
  min_child_weight = xgb_tune2_2021$bestTune$min_child_weight,
  subsample = xgb_tune3_2021$bestTune$subsample
)



xgb_tune5_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid5,
  method = "xgbTree",
  verbose = TRUE
)

#tuneplot(xgb_tune5_2021)

end_time <- Sys.time()

end_time - start_time

xgb_tune5_2021$bestTune

varImp(xgb_tune5_2021, scale = F  ) 
```


#### Fitting Final Model

```{r}

(final_grid_2021 <- expand.grid(
  nrounds = xgb_tune5_2021$bestTune$nrounds,
  eta = xgb_tune5_2021$bestTune$eta,
  max_depth = xgb_tune5_2021$bestTune$max_depth,
  gamma = xgb_tune5_2021$bestTune$gamma,
  colsample_bytree = xgb_tune5_2021$bestTune$colsample_bytree,
  min_child_weight = xgb_tune5_2021$bestTune$min_child_weight,
  subsample = xgb_tune5_2021$bestTune$subsample
))

(xgb_model_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = train_control,
  tuneGrid = final_grid_2021,
  method = "xgbTree",
  verbose = TRUE
))

varImp(xgb_model_2021, scale = F  ) 

```

### Model Performance  


#### Checking Model on Test Split Data  

```{r}


y_pred_test <- predict(xgb_model_2021, data.matrix(te_treated_2021))

test_stats= cbind((te_treated_2021$Worth),y_pred_test)

test_statsR2 = cor(test_stats[,1],test_stats[,2])^2

print(test_statsR2)


y_pred_train <- predict(xgb_model_2021, data.matrix(tr_treated_2021))

train_stats = cbind((tr_treated_2021$Worth),y_pred_train)

train_statsR2 = cor(train_stats[,1],train_stats[,2])^2

print(train_statsR2)

#test dataset
x <- select(te_treated_2021, -Worth)
y <- (te_treated_2021$Worth)

(xgb_model_rmse <- ModelMetrics::rmse(y, predict(xgb_model_2021, newdata = x)))

holdout_x <- select(tr_treated_2021, -Worth)
holdout_y <- tr_treated_2021$Worth

(xgb_model_rmse <- ModelMetrics::rmse(holdout_y, predict(xgb_model_2021, newdata = holdout_x)))


```

#### Graphical Representation of Model   


```{r}

ggplot2::ggplot() +
  aes(x = test_stats[,1], y = test_stats[,2]) +
  geom_jitter() +
  xlab("Predicted Values") +
  ylab("Actual Values") +
  ggtitle("Results of Hitting Model on Test Data")+
  theme(plot.title = element_text(hjust = 0.5,size = 22,color ="steel blue"))+
  geom_smooth(method = "lm")
```


# Creating 2022 Projections from Model  {.tabset .tabset-pills} 


## Re-fit model for Important Variables
Now that we have an acceptable model, we can use it to create projections for how well we think players should do in 2022 based on their hitting statistics in 2021. First let's reduce

1. Only keep variables with high enough importance in model  

```{r}


vip(xgb_model_2021, num_features = 30)  # 10 is the default, 30 gives a visual on the top 30 most important features of the model

unscalevi = vi(xgb_model_2021, method="model") #shows the numbers behind the plot

unscalevi$Importance_perc = with(unscalevi,Importance/sum(Importance)) #adds percentages 

unscalevi # importance by variables

variables_to_keep_2021 = subset(unscalevi, Importance_perc > 0.0010) %>% select(Variable) #Keep Variables that explain at least a small amount [0.1%] of the model. This is a low threshold for inclusion ,but you can adjust this

variables_to_keep_2021b = t(variables_to_keep_2021)

variables_to_keep_2022 = colnames(row_to_names(variables_to_keep_2021b,row_number = 1))

tr_treated_2022 = tr_treated_2021 %>%  select(Worth,one_of(variables_to_keep_2022),playerid,starts_with("Team_lev_x_")) #keep modeled important variables along with team indicator variables

te_treated_2022 = te_treated_2021 %>%  select(Worth,one_of(variables_to_keep_2022),playerid,starts_with("Team_lev_x_"))

input_x_2022 = as.matrix(select(tr_treated_2022, -Worth))

input_y_2022 = tr_treated_2022$Worth



```

2. Re-fit model with reduced variable scope  

```{r}


(final_grid_2021 <- expand.grid(
  nrounds = xgb_tune5_2021$bestTune$nrounds,
  eta = xgb_tune5_2021$bestTune$eta,
  max_depth = xgb_tune5_2021$bestTune$max_depth,
  gamma = xgb_tune5_2021$bestTune$gamma,
  colsample_bytree = xgb_tune5_2021$bestTune$colsample_bytree,
  min_child_weight = xgb_tune5_2021$bestTune$min_child_weight,
  subsample = xgb_tune5_2021$bestTune$subsample
))

(xgb_model_2022 <- caret::train(
  x = input_x_2022,
  y = input_y_2022,
  trControl = train_control,
  tuneGrid = final_grid_2021,
  method = "xgbTree",
  verbose = TRUE
))


vip(xgb_model_2022, num_features = 30)

unscalevi24 = vi(xgb_model_2022, method="model")

unscalevi24$Importance_perc = with(unscalevi24,Importance/sum(Importance)) 

unscalevi24

save(xgb_model_2022,file = '2022_Hitting6x6_Model.Rdata')

hitting6x6 = xgb_model_2022

hittinginput6X6 = input_x_2022

#For anything above breaking_IP we need to create projection table by age or age bucket

#write_csv(unscalevi24,"unscalevi24.csv")

```

------------------------------------------------------------------------
# 2022 Projections Full  

First let's prepare a file for predicting based on our model object

```{r}


variableslag6xb= row_to_names(as.data.frame(t(variables_to_keep_2022)),row_number = 1)  %>% select (starts_with("lag"))

variables_nolag6xb = (owmr::remove_prefix(variableslag6xb,"lag" , sep = "_"))

Data_Predict_2022a6xb = total_treated_2021_hitting %>% select (one_of(colnames(variables_nolag6xb)),Season,playerid)

colnames(Data_Predict_2022a6xb) <- paste0("lag_", colnames(Data_Predict_2022a6xb))

Data_Predict_2022b6xb = total_treated_2021_hitting %>% select (one_of(colnames(variables_nolag6xb)))
colnames(Data_Predict_2022b6xb) = colnames(variableslag6xb)

variables_to_keep_2022_nolag6xb = total_treated_2021_hitting %>% select(one_of(variables_to_keep_2022),Season,playerid,starts_with("Team_lev_x_"))%>% select(-one_of(colnames(Data_Predict_2022b6xb)))


Data_predict_20226xb = sqldf(
  "
  select a.*,b.* from
  Data_Predict_2022a6xb a,
  variables_to_keep_2022_nolag6xb b
  on b.playerid = a.lag_playerid
  and b.Season = a.lag_Season
  "
) %>% select(-lag_playerid,lag_Season) %>%
  filter(Season == 2021) %>% 
  select(one_of(variables_to_keep_2022),playerid,starts_with("Team_lev_x_"))



```

------------------------------------------------------------------------

## Create Predictions for Model

### Run Projections on Players who Played in 2021

This is the raw prediction score per IP for each pitcher

```{r}

hitting_predictions6xb = as.data.frame(predict(xgb_model_2022,Data_predict_20226xb))

names(hitting_predictions6xb) = c("Predict_Score")

Data_predict_2022_w_hitting_Predictions6xb = cbind(Data_predict_20226xb,hitting_predictions6xb) %>% select(playerid,Predict_Score)

head(Data_predict_2022_w_hitting_Predictions6xb)

```

------------------------------------------------------------------------


```{r}

Latest_2022_hittingdata_FP = read_csv("FanGraph_Fantasy_Baseball_Hitting.csv")

Latest_2022_hittingdata_FP


```




------------------------------------------------------------------------

```{r, warning = False}


hitting_Data_NonAdj_Projections6xb = sqldf(
  "
  select a.*,b.Predict_Score
  from Latest_2022_hittingdata_FP a 
  left join 
  Data_predict_2022_w_hitting_Predictions6xb b
  on a.playerid = b.playerid
  "
) %>% filter(ADP<370 | is.na(Predict_Score)==F)


hitting_Data_Adj_Projections6xb =
hitting_Data_NonAdj_Projections6xb %>% 
  mutate(
    Avg_PA = 300,
    AdjPredict_Score_raw = ifelse(is.na(Predict_Score),NA,Predict_Score*(PA/Avg_PA)),
    max_predscore= max(AdjPredict_Score_raw,na.rm = T),
    AdjPredict_Score = ifelse (is.na(AdjPredict_Score_raw),NA,AdjPredict_Score_raw *100/max_predscore),
    WAR_rank = order(order(rank(WAR,ties.method = 'average'),decreasing = TRUE)),
    AdjPredict_Score_Rank = order(order(rank(AdjPredict_Score,ties.method = 'average'),decreasing = TRUE))-sum(is.na(AdjPredict_Score)),
        Ranks_Above_ADP = ADP - AdjPredict_Score_Rank
  ) %>% select (Name,ADP,WAR, WAR_rank,AdjPredict_Score ,AdjPredict_Score_Rank,Ranks_Above_ADP)


  

ggplot2::qplot(hitting_Data_Adj_Projections6xb$AdjPredict_Score, main="Predictions") + geom_histogram(colour="black", fill="grey") + theme_bw()


```

------------------------------------------------------------------------

# 2022 Projections Full

## Table of hitting Projections (Players who Didn't Play in 2021 - Recieve an NA)

AdjPredict_Score are normalized to 100

```{r}

tableexport =
hitting_Data_Adj_Projections6xb %>%
  arrange (ADP,WAR) %>% 
  kbl() %>% 
 kable_material(c("striped", "hover","condensed","responsive"),full_width = F,fixed_thead = T)

save_kable(tableexport,file = "hitting6x6.html")

#tableexport



```

This is a better formatted Table

```{r , warning=FALSE}


ft_dt <- hitting_Data_Adj_Projections6xb[1:nrow(hitting_Data_Adj_Projections6xb), 1:ncol(hitting_Data_Adj_Projections6xb)] %>% 
  filter(AdjPredict_Score_Rank>0)%>%  arrange((AdjPredict_Score_Rank))

ft_dt$ADP <- color_tile("white", "red")(ft_dt$ADP)

ft_dt$WAR <- color_bar("lightblue")(ft_dt$WAR)

ft_dt$AdjPredict_Score<- color_bar("lightblue")(ft_dt$AdjPredict_Score)

ft_dt$WAR_Rank <- color_tile("green","orange")(ft_dt$WAR_rank)

ft_dt$Predict_Rank <- color_tile("green","orange")(ft_dt$AdjPredict_Score_Rank) 


ft_dt$Ranks_Above_ADP <- 
  ifelse(
  ft_dt$Ranks_Above_ADP < 0,
  cell_spec(round(ft_dt$Ranks_Above_ADP,2), color = "red", italic = T),
  cell_spec(round(ft_dt$Ranks_Above_ADP,2), color = "green", italic = T)
)


ft_dt2 <- ft_dt[c("Name", "ADP", "WAR", "AdjPredict_Score", "WAR_Rank","Predict_Rank","Ranks_Above_ADP")]



table_export = 
kbl(ft_dt2, escape = F) %>% 
 kable_material(c("striped", "hover","condensed","responsive"),full_width = F,fixed_thead = T) %>%   column_spec(6, width = "3cm") %>%
  add_header_above(c(" ", "Scores" = 3, "Ranks" = 2," "))
save_kable(table_export,file = "Hitting6x6_updated.html")
  
table_export  






```







</html>
