Should we account for team quality in an xG model?

Finnicking around (with an xG model) and finding out

Author

Tony ElHabr

Published

December 29, 2023

Introduction

“Should we account for team quality in an xG model?” From a purely philosophical point of view, my opinion is “no”–I think that xG models should be agnostic to player and team abilities, even if we were to find that accounting for shot-taker or team identity improves the performance of the model.

But I thought it might be fun to entertain the question from a quantitative perspective. If we add features for team strength to an xG model, can we meaningfully improve the model’s overall predictive performance and calibration?

Motivation

This write-up is inspired by Ryan Brill’s recent presentation on fourth-down decision-making in the National Football League (NFL). He points out that expected points (EP) models in the NFL have a selection bias problem–they tend to under-rate the probability of a positive outcome for “good” teams and over-rate such outcomes for “bad” teams.

Expected goals (xG) in soccer also suffer from this bias. Lars Maurath has a great deep-dive looking into expected goals under-estimation for strong teams like Barcelona.

This phenomenon is evident in the English Premier League (EPL) as well. The end-of-season table for the 2022/23 season shows that xGD, i.e. goals (G) minus xG, is very positive for the top 6 teams and very negative for the bottom 6 teams.

Team	Points	G	xG	xGD
Manchester City	89	33	78.6	46.5
Arsenal	84	43	71.9	29.9
Manchester Utd	75	43	67.7	17.3
Newcastle Utd	71	33	72.0	32.4
Liverpool	67	47	72.6	21.7
Brighton	62	53	73.3	23.1
Aston Villa	61	46	50.2	-2.2
Tottenham	60	63	57.1	7.4
Brentford	59	46	56.8	6.8
Fulham	52	53	46.2	-17.6
Crystal Palace	45	49	39.3	-8.8
Chelsea	44	47	49.5	-3.0
Wolves	41	58	36.8	-23.1
West Ham	40	55	49.2	-3.9
Bournemouth	39	71	38.6	-25.3
Nott’ham Forest	38	68	39.3	-24.9
Everton	36	57	45.2	-20.5
Leicester City	34	68	50.6	-12.8
Leeds United	31	78	47.4	-19.8
Southampton	25	73	37.7	-23.3

Lars does an “ex-post” analysis–evaluating the “residual” of expected goals, or xGD–to arrive at his conclusion that team quality does seem to explain large deviations between goals and expected goals for top teams.

[W]hen conditioning on… team quality (Elo ranking) I do not find this bias in either naive or sophisticated models.

I’m curious to see if we can arrive at a related conclusion in a more direct, “ex-ante” fashion. Can we reduce the underestimation of goals for strong teams by directly accounting for team quality in an xG model?

Analysis and Results

Data

I’ll be using event data that I’ve ingested with the {socceraction} package (which I’ve made available here!) for the 2013/14 through 2022/23 EPL seasons. I’ll focus on just “open-play” shots, i.e. shots excluding penalties and those taken from set pieces.

I’ve scraped Elo ratings from ClubElo for my measure of team quality. I’ve chosen Elo because it provides an intuitive, sport-agnostic measure of relative skill. Also, it is calculated independent of the events that take place in a game, so any correlation with measures of shot volume, quality, etc. is only coincidental.¹

Package imports and other setup

## Data retrieval
library(curl)
library(arrow)
library(qs) ## local dev
library(worldfootballR)

## Data manipulation
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)
library(forcats)

## Modeling
library(rsample)
library(recipes)
library(parsnip)
library(workflows)
library(hardhat)

## Model tuning
library(tune)
library(dials)
library(workflowsets)
library(finetune)

## Model diagnostics
library(rlang)
library(yardstick)
library(SHAPforxgboost)

## Plotting
library(ggplot2)
library(sysfonts)
library(showtext)
library(ggtext)
library(htmltools)
library(scales)
library(grid)
library(glue)
library(knitr)

PROJ_DIR <- 'posts/xg-team-quality'

TAG_LABEL <- htmltools::tagList(
  htmltools::tags$span(htmltools::HTML(enc2utf8("&#xf099;")), style = 'font-family:fb'),
  htmltools::tags$span("@TonyElHabr"),
)
SUBTITLE_LABEL <- 'English Premier League, 2012/13 - 2022/23'
PLOT_RESOLUTION <- 300
WHITISH_FOREGROUND_COLOR <- 'white'
COMPLEMENTARY_FOREGROUND_COLOR <- '#cbcbcb' # '#f1f1f1'
BLACKISH_BACKGROUND_COLOR <- '#1c1c1c'
COMPLEMENTARY_BACKGROUND_COLOR <- '#4d4d4d'
FONT <- 'Titillium Web'
sysfonts::font_add_google(FONT, FONT)
## https://github.com/tashapiro/tanya-data-viz/blob/main/chatgpt-lensa/chatgpt-lensa.R for twitter logo
sysfonts::font_add('fb', 'Font Awesome 6 Brands-Regular-400.otf')
showtext::showtext_auto()
showtext::showtext_opts(dpi = PLOT_RESOLUTION)

ggplot2::theme_set(ggplot2::theme_minimal())
ggplot2::theme_update(
  text = ggplot2::element_text(family = FONT),
  title = ggplot2::element_text(size = 20, color = WHITISH_FOREGROUND_COLOR),
  plot.title = ggtext::element_markdown(face = 'bold', size = 20, color = WHITISH_FOREGROUND_COLOR),
  plot.title.position = 'plot',
  plot.subtitle = ggtext::element_markdown(size = 16, color = COMPLEMENTARY_FOREGROUND_COLOR),
  axis.text = ggplot2::element_text(color = WHITISH_FOREGROUND_COLOR, size = 14),
  # axis.title = ggplot2::element_text(size = 14, color = WHITISH_FOREGROUND_COLOR, face = 'bold', hjust = 0.99),
  axis.title.x = ggtext::element_markdown(size = 14, color = WHITISH_FOREGROUND_COLOR, face = 'bold', hjust = 0.99),
  axis.title.y = ggtext::element_markdown(size = 14, color = WHITISH_FOREGROUND_COLOR, face = 'bold', hjust = 0.99),
  axis.line = ggplot2::element_blank(),
  strip.text = ggplot2::element_text(size = 14, color = WHITISH_FOREGROUND_COLOR, face = 'bold', hjust = 0),
  legend.position = 'top',
  legend.text = ggplot2::element_text(size = 12, color = WHITISH_FOREGROUND_COLOR, face = 'plain'),
  legend.title = ggplot2::element_text(size = 12, color = WHITISH_FOREGROUND_COLOR, face = 'bold'),
  panel.grid.major = ggplot2::element_line(color = COMPLEMENTARY_BACKGROUND_COLOR),
  panel.grid.minor = ggplot2::element_line(color = COMPLEMENTARY_BACKGROUND_COLOR),
  panel.grid.minor.x = ggplot2::element_blank(),
  panel.grid.minor.y = ggplot2::element_blank(),
  plot.margin = ggplot2::margin(10, 20, 10, 20),
  plot.background = ggplot2::element_rect(fill = BLACKISH_BACKGROUND_COLOR, color = BLACKISH_BACKGROUND_COLOR),
  plot.caption = ggtext::element_markdown(color = WHITISH_FOREGROUND_COLOR, hjust = 0, size = 10, face = 'plain'),
  plot.caption.position = 'plot',
  plot.tag = ggtext::element_markdown(size = 10, color = WHITISH_FOREGROUND_COLOR, hjust = 1),
  plot.tag.position = c(0.99, 0.01),
  panel.spacing.x = grid::unit(2, 'lines'),
  panel.background = ggplot2::element_rect(fill = BLACKISH_BACKGROUND_COLOR, color = BLACKISH_BACKGROUND_COLOR)
)

Retrieve and wrangle data

read_parquet_from_url <- function(url) {
  load <- curl::curl_fetch_memory(url)
  arrow::read_parquet(load$content)
}

REPO <- 'tonyelhabr/socceraction-streamlined'
read_socceraction_parquet_release <- function(name, tag) {
  url <- sprintf('https://github.com/%s/releases/download/%s/%s.parquet', REPO, tag, name)
  read_parquet_from_url(url)
}

read_socceraction_parquet_releases <- function(name, tag = 'data-processed') {
  purrr::map_dfr(
    2013:2022,
    \(season_start_year) {
      basename <- sprintf('8-%s-%s', season_start_year, name)
      message(basename)
      read_socceraction_parquet_release(basename, tag = tag)
    }
  )
}

read_socceraction_parquet <- function(name, branch = 'main') {
  url <- sprintf('https://github.com/%s/raw/%s/%s.parquet', REPO, branch, name)
  read_parquet_from_url(url)
}

x <- read_socceraction_parquet_releases('x')
y <- read_socceraction_parquet_releases('y')
actions <- read_socceraction_parquet_releases('actions')
games <- read_socceraction_parquet_releases('games') |> 
  dplyr::mutate(
    date = lubridate::date(game_date)
  )
team_elo <- read_socceraction_parquet('data/final/8/2013-2022/clubelo-ratings')

open_play_shots <- games |>
  dplyr::transmute(
    season_id,
    game_id,
    date,
    home_team_id,
    away_team_id
  ) |> 
  dplyr::inner_join(
    x |> 
      dplyr::filter(type_shot_a0 == 1) |> 
      dplyr::select(
        game_id,
        action_id,
        
        ## features
        start_x_a0,
        start_y_a0,
        start_dist_to_goal_a0,
        start_angle_to_goal_a0,
        type_dribble_a1,
        type_pass_a1,
        type_cross_a1,
        type_corner_crossed_a1,
        type_shot_a1,
        type_freekick_crossed_a1,
        bodypart_foot_a0,
        bodypart_head_a0,
        bodypart_other_a0
      ) |> 
      dplyr::mutate(
        dplyr::across(-c(game_id, action_id), as.integer)
      ),
    by = dplyr::join_by(game_id),
    relationship = 'many-to-many'
  ) |> 
  dplyr::inner_join(
    y |> 
      dplyr::transmute(
        game_id, 
        action_id,
        scores = ifelse(scores, 'yes', 'no') |> factor(levels = c('yes', 'no'))
      ),
    by = dplyr::join_by(game_id, action_id)
  ) |> 
  dplyr::inner_join(
    actions |> 
      dplyr::select(
        game_id,
        action_id,
        team_id,
        player_id
      ),
    by = dplyr::join_by(game_id, action_id)
  ) |> 
  dplyr::left_join(
    team_elo |> dplyr::select(date, home_team_id = team_id, home_elo = elo),
    by = dplyr::join_by(date, home_team_id)
  ) |> 
  dplyr::left_join(
    team_elo |> dplyr::select(date, away_team_id = team_id, away_elo = elo),
    by = dplyr::join_by(date, away_team_id)
  ) |> 
  dplyr::transmute(
    date,
    season_id,
    game_id,
    team_id,
    opponent_team_id = ifelse(team_id == home_team_id, away_team_id, home_team_id),
    action_id,
    
    scores,
    
    elo = ifelse(team_id == home_team_id, home_elo, away_elo),
    opponent_elo = ifelse(team_id == home_team_id, away_elo, home_elo),
    elo_diff = elo - opponent_elo,
    
    start_dist_to_goal_a0,
    start_angle_to_goal_a0,
    type_dribble_a1,
    type_pass_a1,
    type_cross_a1,
    type_corner_crossed_a1,
    type_shot_a1,
    type_freekick_crossed_a1,
    bodypart_foot_a0,
    bodypart_head_a0,
    bodypart_other_a0
  )

Split the data for modeling

split <- rsample::make_splits(
  open_play_shots |> dplyr::filter(season_id %in% c(2013L:2019L)),
  open_play_shots |> dplyr::filter(season_id %in% c(2020L:2022L))
)

train <- rsample::training(split)
test <- rsample::testing(split)

It’s worth spotlighting Elo a bit more since it’s the novel feature here. Below is a look at the distribution of pre-match Elo over the course of the entire data set.

To make Elo values feel a bit more tangible, note that:

Man City–arguably the best team in the EPL for the past decade–has sustained an Elo greater than 1850 for the majority of the past decade.
Bottom-table teams often in the relegation battle tend to have Elos less than 1650.

The pre-match difference team Elos follows a normal-ish distribution, with most values falling within the ±300 range.

Model Training

The feature set for our “base” xG model consists of the following:

location of the shot (distance and angle of the shot to the center of the goal mouth)².
type of action leading to the shot.
body part with which the shot was taken.

These features are essentially what all xG models have in common, although the exact implementation differs. Data providers such as Opta also account for information that is not captured in traditional event data, such as the position of the goalkeeper.

For the team-quality-adjusted (or “Elo-augmented”) model, I’ll add two additional features:

the Elo of the team of the shot-taker.
the difference in the Elo of the shot-taking team and the opposing team.

The former is meant to capture the opponent-agnostic quality of a team, while the latter captures the quality of the team relative to their opponent.

Setting up the models

rec_elo <- recipes::recipe(
  scores ~ 
    elo +
    elo_diff +
    start_dist_to_goal_a0 +
    start_angle_to_goal_a0 +
    type_dribble_a1 +
    type_pass_a1 +
    type_cross_a1 +
    type_corner_crossed_a1 +
    type_shot_a1 +
    type_freekick_crossed_a1 +
    bodypart_foot_a0 +
    bodypart_head_a0 +
    bodypart_other_a0,
  data = train
)

rec_base <- rec_elo |> 
  recipes::step_rm(elo, elo_diff)

I’ll be using xgboost for my xG models.³ I’ll choose hyperparameters using an efficient grid search, evaluating models with the Brier skill score (BSS). For the reference Brier score for BSS, I’ll use a dummy model that predicts 12% conversion for all shots.⁴

Brier skill score (BSS)

If this isn’t the first blog post you’ve read of mine, you probably know that I love to use BSS for classification tasks, especially for xG models.

But why BSS? Simply put, Brier scores are known as the best evaluation metric to use for classification tasks where you’re purely interested in probabilities. And BSS goes one step beyond Brier scores, forcing one to contextualize the model evaluation with a reasonable baseline. In the context of xG, BSS helps us directly see whether our fitted model is better than a naive prediction, such as guessing that all shots convert at the observed shot conversion rate.

Keep in mind that a higher BSS is ideal. A perfect model would have a BSS of 1; a model that is no better than a reference model would have a BSS of 0.

Functions for Brier skill score

## See also: probability-calibration
brier_skill_score <- function(data, ...) {
  UseMethod('brier_skill_score')
}

brier_skill_score <- yardstick::new_prob_metric(
  brier_skill_score, 
  direction = 'maximize'
)

bss <- function(
    truth, 
    estimate, 
    ref_estimate, 
    event_level,
    case_weights,
    ...
) {
  
  if (length(estimate) == 1) {
    estimate <- rep(estimate, length(truth))
  }
  
  if (length(ref_estimate) == 1) {
    ref_estimate <- rep(ref_estimate, length(truth))
  }
  
  estimate_brier_score <- brier_class_vec(
    truth = truth,
    estimate = estimate,
    event_level = event_level,
    case_weights = case_weights,
    ...
  )
  
  ref_brier_score <- brier_class_vec(
    truth = truth,
    estimate = ref_estimate,
    event_level = event_level,
    case_weights = case_weights,
    ...
  )
  
  1 - (estimate_brier_score / ref_brier_score)
}

brier_skill_score_estimator_impl <- function(
    truth, 
    estimate, 
    ref_estimate, 
    event_level,
    case_weights
) {
  bss(
    truth = truth,
    estimate = estimate,
    ref_estimate = ref_estimate,
    event_level = event_level,
    case_weights = case_weights
  )
}

brier_skill_score_vec <- function(
    truth, 
    estimate, 
    ref_estimate, 
    na_rm = TRUE, 
    event_level = yardstick:::yardstick_event_level(),
    case_weights = NULL, 
    ...
) {
  
  yardstick:::abort_if_class_pred(truth)
  
  estimator <- yardstick::finalize_estimator(
    truth, 
    metric_class = 'brier_skill_score'
  )
  
  yardstick::check_prob_metric(truth, estimate, case_weights, estimator)
  
  if (na_rm) {
    result <- yardstick::yardstick_remove_missing(truth, estimate, case_weights)
    
    truth <- result$truth
    estimate <- result$estimate
    case_weights <- result$case_weights
  } else if (yardstick::yardstick_any_missing(truth, estimate, case_weights)) {
    return(NA_real_)
  }
  
  brier_skill_score_estimator_impl(
    truth = truth,
    estimate = estimate,
    ref_estimate = ref_estimate,
    event_level = event_level,
    case_weights = case_weights
  )
}

brier_skill_score.data.frame <- function(
    data, 
    truth, 
    ...,
    na_rm = TRUE,
    event_level = yardstick:::yardstick_event_level(),
    case_weights = NULL,
    
    ref_estimate = 0.5,
    name = 'brier_skill_score'
) {
  yardstick::prob_metric_summarizer(
    name = name,
    fn = brier_skill_score_vec,
    data = data,
    truth = !!rlang::enquo(truth),
    ...,
    na_rm = na_rm,
    event_level = event_level,
    case_weights = !!rlang::enquo(case_weights),
    fn_options = list(
      ref_estimate = ref_estimate
    )
  )
}

xg_brier_skill_score <- function(data, ...) {
  UseMethod('xg_brier_skill_score')
}

# REF_ESTIMATE <- open_play_shots |>
#   dplyr::summarize(goal_rate = sum(scores == 'yes') / dplyr::n()) |>
#   dplyr::pull(goal_rate)
REF_ESTIMATE <- 0.12

xg_brier_skill_score.data.frame <- function(...) {
  brier_skill_score(
    ref_estimate = REF_ESTIMATE,
    name = 'xg_brier_skill_score',
    ...
  )
}

xg_brier_skill_score <- yardstick::new_prob_metric(
  xg_brier_skill_score,
  direction = 'maximize'
)

Tuning the xG models

## Useful reference: https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/
TREES <- 500
LEARN_RATE <- 0.01
spec <- parsnip::boost_tree(
  trees = !!TREES,
  learn_rate = !!LEARN_RATE,
  tree_depth = tune::tune(),
  min_n = tune::tune(), 
  loss_reduction = tune::tune(),
  sample_size = tune::tune(), 
  mtry = tune::tune(),
  stop_iter = tune::tune()
) |>
  parsnip::set_engine('xgboost') |> 
  parsnip::set_mode('classification')

grid <- dials::grid_latin_hypercube(
  dials::tree_depth(),
  dials::min_n(range = c(5L, 40L)),
  dials::loss_reduction(),
  sample_size = dials::sample_prop(),
  dials::finalize(dials::mtry(), train),
  dials::stop_iter(range = c(10L, 50L)),
  size = 50
)

wf_sets <- workflowsets::workflow_set(
  preproc = list(
    base = rec_base, 
    elo = rec_elo
  ),
  models = list(
    model = spec
  ),
  cross = FALSE
)

control <- finetune::control_race(
  save_pred = TRUE,
  parallel_over = 'everything',
  save_workflow = TRUE,
  verbose = TRUE
)

set.seed(42)
train_folds <- rsample::vfold_cv(train, strata = scores, v = 5)

tuned_results <- workflowsets::workflow_map(
  wf_sets,
  fn = 'tune_race_anova',
  grid = grid,
  control = control,
  metrics = yardstick::metric_set(xg_brier_skill_score),
  resamples = train_folds,
  seed = 42
)

Choosing best hyper-parameters

MODEL_TYPES <- c(
  'base_model' = 'Base',
  'elo_model' = 'Elo-augmented'
)
select_best_model <- function(tuned_results, model_type) {
  tuned_results |>
    workflowsets::extract_workflow_set_result(model_type) |> 
    tune::select_best(metric = 'xg_brier_skill_score') |>
    dplyr::transmute(
      model_type = MODEL_TYPES[model_type],
      mtry,
      min_n,
      tree_depth, 
      loss_reduction,
      sample_size,
      stop_iter
    )
}

best_base_set <- select_best_model(tuned_results, 'base_model')
best_elo_set <- select_best_model(tuned_results, 'elo_model')

dplyr::bind_rows(
  best_base_set,
  best_elo_set
) |> 
  dplyr::transmute(
    model_type,
    mtry,
    min_n,
    tree_depth, 
    loss_reduction,
    sample_size,
    stop_iter
  ) |> 
  knitr::kable()

Model hyperparameters

For reproducibility, the chosen hyperparameters are as follows.⁵ By coincidence, the same hyper-parameters are chosen. (Only 50 combinations were evaluated.)

model_type	mtry	min_n	tree_depth	loss_reduction	sample_size	stop_iter
Base	7	17	6	0	0.4641502	50
Elo-augmented	7	17	6	0	0.4641502	50

The number of trees and learning_rate were pre-defined to be 500 and 0.01 respectively.

Fit models on the entire training set after choosing the hyperparameters

finalize_tuned_results <- function(tuned_results, model_type) {
  best_set <- select_best_model(tuned_results, model_type)
  tuned_results |>
    hardhat::extract_workflow(model_type) |>
    tune::finalize_workflow(best_base_set) |> 
    tune::last_fit(
      split,
      metrics = yardstick::metric_set(xg_brier_skill_score)
    )
}
last_base_fit <- finalize_tuned_results(tuned_results, 'base_model')
last_elo_fit <- finalize_tuned_results(tuned_results, 'elo_model')

Feature Importance

We should verify to see that the fitted xG models are behaving as expected. One way of doing so is to look at the importance of the features for model predictions.

In bespoke models (like the Elo-augmented model), feature importance can be enlightening, as it tells us which features are contributing most to the predicted outcomes. For the base model, I know that shot distance should be the most important feature (by far), as this is found to be the most important features in other similar “basic” public xG models, such as this one and this one.

I like to use SHAP values for feature importance since they have nice “local” explanations and are found to be consistent in their “global” structure.

SHAP values

SHAP values quantify the impact of each feature in a predictive model on the model’s output. For a binary classification task like ours, SHAP values are expressed on a 0-1 scale.

A SHAP value of 0.9 for a specific feature in a binary classification model doesn’t mean that the feature contributes 90% to the prediction, nor does it directly imply that the predicted probability is 90%. Instead, it indicates that this particular feature contributes a value of 0.9 to the shift from the baseline prediction (usually the average of the training data’s output) to the specific model output for that observation.

Indeed, we find that shot distance is the most important feature with the base xG model.

Now, if we make the same plot and add the aggregate SHAP values for our Elo-augmented xG model, we see that Elo and Elo difference terms are in the middle of the pack in terms of feature importance. That’s pretty interesting. That indicates that these features aren’t super influential, on average.

Further, we can verify that the Elo-augmented model predicts xG in a manner that matches intuition using SHAP dependence plots. We should see that the SHAP values are higher when Elo is higher and when Elo difference is higher.

Indeed, this is generally what we observe, although the relationships aren’t quite monotonic, as I would have expected.⁶

Model Evaluation

So aggregate SHAP values indicate that our two Elo features aren’t playing a big role in contributing to model predictions, although the features do indeed contribute in the way that we’d expect. But SHAP values are just telling us about how and why a model makes certain predictions. What about the performance of the Elo-augmented model compared to the base xG model?

Model Type	Brier Skill Score
Base	0.1058
Elo-augmented	0.1056

Well, at least in terms of BSS, it looks like there’s basically no difference between the models.

But that’s just one, holistic measure. How well-calibrated is the Elo-augmented model compared to the traditional xG model? Does the augmented model demonstrate better performance across the whole range of shot conversion probabilities?

Visually, I wouldn’t say there’s any significant difference between the calibration curves. A plot of calibration given various buckets for Elo difference would lead to the same conclusion–the Elo-augmented model isn’t notably more calibrated.

Discussion

So we’ve seen that there is no clear evidence that accounting for team quality improves an xG model. Sure, it can provide value to an xG model–as shown with the SHAP plots for the Elo-augmented model–but it’s effectively just providing a different kind of information that could be otherwise captured by other traditional xG features.

So naturally one may ask–what about player quality? Or goalkeeper quality?

While I do think those would be a little more useful in terms of making a better xG model, the better question is whether we should be accounting for these kinds of “quality” features at all. say “no”–although I do think it’s fine for NFL EP models–due to the manner in which xG models are typically applied.

Application of xG in soccer vs. EP in American football

Unlike a fourth-down model that can be “actively” used by NFL coaches to make real-time decisions, an xG model in soccer serves a more “passive” role. Soccer players lack the luxury of time that NFL coaches have to calculate xG on the fly when deciding whether to take a shot. Instead, xG feels better suited as a retrospective, descriptive measure of actions that have already occurred.

Further, one can make the argument that incorporating team or player quality into an xG model can make interpretation more difficult, at least in the manner that we typically use xG to contextualize a game or a season. (xGods, please forgive me for looking at single-game xG.) As an extreme example, let’s say that we add an “Is Messi?” feature to an xG model–not unlike what Ryan suggested with Justin Tucker in his presentation–to achieve a slightly more accurate xG model. That’s all fine in theory, but then your game- and season-level xG insights would need to change.

I may look at an Inter Miami game where they “lost” on actual goals, let’s say 2 to 1, and “lost” on xG, let’s say 0.9 to 1.7, and think “Oh, that’s a fair result”. But if Messi took 4 shots and we’re using an expected goals model with the “Is Messi?” feature, the Inter Miami xG might appear to be 2.2, perhaps leading to a different take on the game–“that was Inter Miami’s game, but they failed to convert on some key chances”.

Overall, I’d say that it feels circular to build in “quality” features into a model used to evaluate shot quality, and, downstream, player and team quality, at least in the current paradigm of soccer analytics.

In favor of team and/or player quality features

On the other hand, I do see the argument in favor of identity features in xG models, e.g. a RAPM-esque xG model. Such indicator variables allow us to estimate player and team impact directly, as well as the uncertainty for those individual estimates. Assuming the model estimation procedure is sound, a direct estimate of player or team skill is better than relying on the “residual” of goals minus xG, as is typically done.

Further, I’ve been assuming that looking at xGD will always be the primary way that xG is used to evaluate players. If we can reliably estimate player skill directly with an xG model and can tease out how that effect changes over time, then perhaps we should prefer those kinds of estimates.

Conclusion

Inspired by observed biases in American football’s expected points models, I trained a model to evaluate whether accounting for team strength directly might improve predictiveness and calibration of an expected goals model for soccer. Despite SHAP plots indicating some value in Elo-augmented models, the overall performance metrics and calibration plots showed negligible differences compared to the base xG model.

On the surface, this is a little at odds with Lars Maurath’s ex-post finding that team quality is useful for explaining xGD, which matches my intuition. However, as Lars points out, team quality is highly correlated with shot volume, and team quality may just be a mask for the effect of highly skilled players with superior finishing capabilities. So, perhaps with better feature engineering–e.g. shots per 90 in the prior N matches, weekly wages of the shot-taker or the shot-taking team–I could prove that directly accounting for team strength can improve an xG model.

Given all my questions and speculations about the role of player quality with respect to xG calibration, perhaps it’s not surprising that the most natural area of future research would be to evaluate how measures of player skill might improve an xG model. There is some prior art on this, so it would be wise to identify how one could contribute something novel to that work.

Footnotes

It was also fairly easy to retrieve and is what Lars used to gauge “top” teams in a quantitative way.↩︎
One might get a slightly more performant by adding the x and y coordinates of the shot–to implicitly account for right-footed bias, for example–but I actually prefer not to add those in the model. Such terms can result in slight over-fitting, in the presence of other features that provide information about the location of the shot, such as distance and angle. (This is the classical “bias-variance” trade-off.)↩︎
xgboost is the state-of-the-art framework for tabular machine learning tasks and is used by companies like Opta for their xG models.↩︎
12% is approximately the observed shot conversion rate in the whole data set.↩︎
This post isn’t meant to be so much about the why’s and how’s of model tuning and training, so I’ve spared commentary on the code.↩︎
Of course, I could have forced the xgboost model with the Elo terms to have monotonic behavior for those features. I intentionally didn’t do that here so as to not confirm my own biases–that higher xG should generally correspond with higher Elo.↩︎