WB Project PDO features classification: binary outcome

Supervised ML for text classification

Author

Luisa M. Mimmi

Warning

WORK IN PROGRESS! (Please expect unfinished sections, and unpolished code. Feedback is welcome!)

Set up

# Pckgs -------------------------------------
library(fs) # Cross-Platform File System Operations Based on 'libuv'
library(janitor) # Simple Tools for Examining and Cleaning Dirty Data
library(skimr) # Compact and Flexible Summaries of Data
library(here) # A Simpler Way to Find Your Files
library(paint) # paint data.frames summaries in colour
library(readxl) # Read Excel Files
library(kableExtra) # Construct Complex Table with 'kable' and Pipe Syntax
library(glue) # Interpreted String Literals
#library(tidyverse) # Easily Install and Load the 'Tidyverse'
library(dplyr) # A Grammar of Data Manipulation
library(tidyr) # Tidy Messy Data
library(tibble) # Tibbles: A Modern Version of Data Frames
library(purrr) # Functional Programming Tools
library(lubridate) # Make Dealing with Dates a Little Easier
library(ggplot2) # Create Elegant Data Visualisations Using the Grammar of Graphics
library(stringr) # Simple, Consistent Wrappers for Common String Operations
library(forcats) # Tools for Working with Categorical Variables (Factors)
# ML & Text Mining -------------------------------------
library(tidymodels) # Easily Install and Load the 'Tidymodels' Packages 
library(textrecipes) # Extra 'Recipes' for Text Processing  
library(discrim) # Model Wrappers for Discriminant Analysis
library(tidytext) # Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools 
library(SnowballC) # Snowball Stemmers Based on the C 'libstemmer' UTF-8 Library
library(rvest) # Easily Harvest (Scrape) Web Pages
library(cleanNLP) # A Tidy Data Model for Natural Language Processing
library(themis) # Extra Recipes for Dealing with Unbalanced Classes

set.seed(123) # for reproducibility
# 1) --- Set the font as the default for ggplot2
# Who else? https://datavizf24.classes.andrewheiss.com/example/05-example.html 
lulas_theme <- theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank(),
        # Bold, bigger title
        plot.title = element_text(face = "bold", size = rel(1.6)),
        # Plain, slightly bigger subtitle that is grey
        plot.subtitle = element_text(face = "plain", size = rel(1.4), color = "#A6A6A6"),
        # Italic, smaller, grey caption that is left-aligned
        plot.caption = element_text(face = "italic", size = rel(0.7), 
                                    color = "#A6A6A6", hjust = 0),
        # Bold legend titles
        legend.title = element_text(face = "bold"),
        # Bold, slightly larger facet titles that are left-aligned for the sake of repetition
        strip.text = element_text(face = "bold", size = rel(1.1), hjust = 0),
        # Bold axis titles
        axis.title = element_text(face = "bold"),
        # Change X-axis label size
        axis.text.x = element_text(size = rel(1.4)),   
        # Change Y-axis label size
        axis.text.y = element_text(size = 14),   
        # Add some space above the x-axis title and make it left-aligned
        axis.title.x = element_text(margin = margin(t = 10), hjust = 0),
        # Add some space to the right of the y-axis title and make it top-aligned
        axis.title.y = element_text(margin = margin(r = 10), hjust = 1),
        # Add a light grey background to the facet titles, with no borders
        strip.background = element_rect(fill = "grey90", color = NA),
        # Add a thin grey border around all the plots to tie in the facet titles
        panel.border = element_rect(color = "grey90", fill = NA))

# 2) --- use 
# ggplot + lulas_theme

Load data & functions

# Load Proj train dataset `projs_train2`
projs_train2 <- readRDS(here("data","derived_data", "projs_train2.rds"))  
custom_stop_words_df  <-  readRDS(here("data","derived_data", "custom_stop_words_df.rds")) 
# Load function
source(here("R","f_recap_values.R")) 

SEE https://bookdown.org/f_lennert/text-mining-book/supervisedml.html

PREDICT MISSING FEATURES

What ML models work with text?

Remember that text data is SPARSE!

To predict a missing feature (e.g., sector) based on available features from text data, several supervised machine learning algorithms can be applied. Given that you have a mixture of text and structured data, here are some suitable algorithms:

  • Logistic Regression / Multinomial Logistic Regression: If you’re predicting a categorical variable like “sector”, logistic regression can work well, especially with appropriate feature engineering for text (e.g., converting text data into numeric features using TF-IDF or word embeddings).
  • Lasso regression/classification learns how much of a penalty to put on some features (sometimes penalizing all the way down to zero) so that we can select only some features out of the high-dimensional space of original possible variables (tokens) for the final model.
  • k-Nearest Neighbors (k-NN): k-NN can be useful for text data, especially when you have a mix of structured and unstructured data. It’s a simple algorithm that can work well with text data, but it can be computationally expensive.
  • Decision Trees / Random Forests: These algorithms handle both numeric and categorical data efficiently and can manage missing values quite well. You can input text-based features as well, though you might need to preprocess the text into numeric form (e.g., using embeddings).
  • Naive Bayes: Naive Bayes is a simple and efficient algorithm for text classification. It assumes feature independence, which may not always hold, but it’s often a good baseline, particularly with short texts.
  • Support Vector Machines (SVMs): SVMs are useful when you have high-dimensional data, which is common with text after feature extraction (like TF-IDF). They can perform well with a mix of structured and unstructured data.

Some model parameters can be learned from the data during fitting/training. Some CANNOT 😱. These are hyperparameters, and we estimate them by training lots of models with different hyperparameter values and comparing them.
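As a minimal preview (the full specification is built properly later in this document; the sketch assumes the tidymodels packages loaded in the Set up chunk), the contrast looks like this in parsnip: coefficients are estimated by fit(), while the lasso penalty must be marked with tune() and chosen by comparing candidate values.

# Sketch: parameters vs hyperparameters in a {parsnip} specification
# (preview of the model used later; assumes library(tidymodels) from the Set up chunk)
plain_spec <- parsnip::logistic_reg() %>%   # coefficients (parameters) are learned by fit()
   parsnip::set_engine("glm")

lasso_spec <- parsnip::logistic_reg(
      penalty = tune(),                     # hyperparameter: cannot be learned by fit(), must be tuned
      mixture = 1) %>%
   parsnip::set_mode("classification") %>%
   parsnip::set_engine("glmnet")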

Check missing feature

names(projs_train2)

tot <- sum(!is.na(projs_train2$pdo)) # 4425
sum(!is.na(projs_train2$regionname)) / tot  # 100%
sum(!is.na(projs_train2$countryname)) / tot  # 100%
sum(!is.na(projs_train2$status)) / tot  # 100%
sum(!is.na(projs_train2$lendinginstr)) / tot  # 98% 
sum(!is.na(projs_train2$curr_total_commitment)) / tot  # 100% 

sum(!is.na(projs_train2$ESrisk)) / tot  # 0.092  !!!!!
projs_train2 |> count(ESrisk) # 4 levels+ NA

sum(!is.na(projs_train2$env_cat)) / tot  # 72%
table(projs_train2$env_cat, useNA = "ifany") # 5 levels+ NA
projs_train2 |> count(env_cat) # 5 levels+ NA

sum(!is.na(projs_train2$sector1)) / tot # 99%
projs_train2 |> count(sector1) # 76levels

sum(!is.na(projs_train2$theme1)) / tot # 71% --> 99%
projs_train2 |> count(theme1, useNA = "ifany") # 81 levels
# source function
source(here("R","f_recap_values.R")) 

# check candidate lables for classification 
f_recap_values(projs_train2, c("sector1", "theme1","env_cat","ESrisk")) %>% 
   kable()
skim_variable   total_rows   n_distinct   n_missing   missing_perc
env_cat               4425            6        1195            27%
ESrisk                4425            5        4014          90.7%
sector1               4425           76           5           0.1%
theme1                4425           81           7           0.2%

Identify features for classification

These could be:

  • features derived from raw text (e.g. characters, words, ngrams, etc.),
  • feature vectors (e.g. word embeddings), or
  • meta-linguistic features (e.g. part-of-speech tags, syntactic parses, or semantic features)

How do we use them?

  • Do we use raw token counts?
  • Do we use normalized frequencies?
  • Do we use some type of weighting scheme? ✅
    • yes, we use tf-idf (a weighting scheme that down-weights words that are common across all documents and up-weights words that are unique to a document); a toy example follows this list
  • Do we use some type of dimensionality reduction? ✅
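As a toy illustration of the tf-idf weighting mentioned above (using {tidytext} and a made-up three-document corpus; this is not part of the modeling pipeline below), a word that appears in every document gets a weight of zero:

# Toy tf-idf example: common words are down-weighted, unique words are up-weighted
library(dplyr)
library(tibble)
library(tidytext)

toy <- tibble(doc  = c("d1", "d2", "d3"),
              text = c("improve water supply",
                       "improve road safety",
                       "improve school access"))

toy %>%
   unnest_tokens(word, text) %>%
   count(doc, word) %>%
   bind_tf_idf(word, doc, n) %>%
   arrange(desc(tf_idf))
# "improve" appears in all 3 documents -> idf = log(3/3) = 0 -> tf-idf = 0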

TEXT CLASSIFICATION for Environmental Assessment Category

Francom text Classification (in R) Stanford slide

BINARY OUTCOME

One candidate variable that could be predicted is the env_cat variable (“Environmental Assessment Category”). This is a categorical variable with 7 levels (A, B, C, F, H, M, U), but, to simplify, I converted it into a binary outcome defined as in Table 1.

___ MODEL A) lasso regr ONLY pdo

Lasso regression or classification learns how much of a penalty to put on some features (sometimes penalizing all the way down to zero) so that we can select only some features out of the high-dimensional space of original possible variables (tokens) for the final model.

0) Prep for data split based on outcome [projs_train2]

Recoded env_cat variable

[done in analysis/01b_WB_project_pdo_EDA.qmd]

# projs_train2 <- projs_train2 %>% 
#    # useful for later 
#    # rename(., proj_id = "id") %>%
#    # recode as factors 
#    dplyr::mutate (across (c(status, FY_appr, regionname, countryname, sector1, 
#                             theme1, lendinginstr), as.factor))  %>% 
#    # env risk category 7 levels
#    dplyr::mutate(env_cat_f = forcats::fct_na_value_to_level(
#       env_cat, level = "Missing")) %>% 
#    dplyr::mutate(env_cat_f = forcats::fct_recode(
#       env_cat_f, 
#       "A_high risk" = "A", 
#       "B_med risk" = "B", 
#       "C_low risk" = "C", 
#       "F_fin expos" = "F", 
#       # "Other" = "H", 
#       # "Other" = "M", 
#       "Other" = "U", 
#       "Missing" = "Missing" )) %>% 
#    dplyr::relocate(env_cat_f , .after = env_cat) %>% 
#    # collapse env_cat_f into 2 levels
#    dplyr::mutate(env_cat_f2 = forcats::fct_collapse(
#       env_cat_f, 
#       "High-Med-risk" = c("A_high risk", "B_med risk"),
#       "Low-risk_Othr" = c("C_low risk", "F_fin expos", "Other", "Missing")
#    )) %>% 
#    dplyr::relocate(env_cat_f2, .after = env_cat_f)

# Recap
tabyl(projs_train2, env_cat, show_na = TRUE) # 7 levels
tabyl(projs_train2, env_cat_f, show_na = TRUE) # 6 levels
tabyl(projs_train2, env_cat_f2, show_na = TRUE) # 2 levels
# Show as contingency table env_cat_f and env_cat_f2
env_cat_f_f2_k <- table(projs_train2$env_cat_f, projs_train2$env_cat_f2) %>% 
   kable() %>% 
   kable_styling("striped", full_width = F)

env_cat_f_f2_k
saveRDS(env_cat_f_f2_k, here("analysis", "output", "tables", "env_cat_f_f2_k.rds"))
Table 1: Recoded Environmental Assessment Category

env_cat_f      High-Med-risk   Low-risk_Othr
A_high risk              315               0
B_med risk              1837               0
C_low risk                 0             905
F_fin expos                0             122
Other                      0              51
Missing                    0            1195

Recode sector1 variable

(this I use later as MULTI-CLASS outcome) [done in analysis/01b_WB_project_pdo_EDA.qmd]

# # !!!!! `sector_f` is different from `tok_sector_broad` BECAUSE it is based on `sector1` !!!! 
# projs_train2 <- projs_train2 %>% # sector1 99 levels
#    mutate(sector_f = case_when(
#       str_detect(sector1, regex("water|wastewater|sanitat|Sewer|Irrigat|Drainag", 
#                                 ignore_case = T)) ~ "WAT & SAN",
#       str_detect(sector1, regex("transport|railway|road|airport|waterway|bus|metropolitan|inter-urban|aviation",
#                                 ignore_case = T)) ~ "TRANSPORT",
#       sector1 == "port" ~ "TRANSPORT",
#       str_detect(sector1, regex("urban|housing|inter-urban|peri-urban|waste",
#                                 ignore_case = T)) ~ "URBAN",
#       str_detect(sector1, regex("energ|electri|hydroele|hydropow|renewable|transmis",
#                                 ignore_case = T)) ~ "ENERGY",  # Matches either "energy" or "power"
#       str_detect(sector1, regex("health|hospital|medicine|drugs|epidem|pandem|covid-19|vaccin",
#                                 ignore_case = T)) ~ "HEALTH",
#       str_detect(sector1, regex("educat|school|vocat|teach|univers|student|literacy|training|curricul",
#                                 ignore_case = T)) ~ "EDUCATION",
#       
#       str_detect(sector1, regex("Agricultural|Agro|Fish|Forest|Crop|livestock|agri-business",
#                                 ignore_case = T)) ~ "AGR FOR FISH",
#       str_detect(sector1, regex("Minin|oil|gas|mineral|Extract",
#                                 ignore_case = T)) ~ "MINING OIL&GAS",
#       str_detect(sector1, regex("Social Protec",
#                                 ignore_case = T)) ~ "SOCIAL PROT.",
#       
#       str_detect(sector1, regex("Bank|finan|Investment",
#                                 ignore_case = T)) ~ "FINANCIAL",
#       str_detect(sector1, regex("Information|Communication|ICT|Internet|Technologies",
#                                 ignore_case = T)) ~ "ICT",
#       str_detect(sector1, regex("Tourism|Trade and Services|Manuf|Other Industry|Trade and Services",
#                                 ignore_case = T)) ~ "IND TRADE SERV",
#       str_detect(sector1, regex("Government|Public Admin|Institution|Central Agenc|Sub-national Gov|law|justice|governance",
#                                 ignore_case = T)) ~ "INSTIT. SUPP.",
#       TRUE ~ "Missing")) %>% 
#    relocate(sector_f, .after = sector1)  

# Recap
projs_train2 |> count(sector1) # 76 levels
projs_train2 |> count(sector_f) # 14 levels

1) Split data in train/test [based on env_cat]

Sampling strategy

🟧 Proportional sub-set rsample::initial_split

Based on availability of env_cat_f.

We will use the strata argument to stratify the data by the outcome variable (env_cat_f). This will ensure that the training and validation sets have the same proportion of each outcome class.

# Create a stratified split based on missing vs non-missing env_cat
projs_train2 %>% tabyl(env_cat_f) # 6 levels (incl. Missing)

# Split BUT only "Not Missing" `env_cat_f` 
## --- 0) THIS WILL BE 4 TRAINING & VALIDATION 
env_cat_use <- projs_train2 %>% 
   filter(env_cat_f != "Missing") # 3236 proj 

# SPLIT INTO TRAINING, VALIDATION 
set.seed(123)  # Ensure reproducibility
env_split <- initial_split(env_cat_use, prop = 0.7, # 70% training, 30% testing
                       strata = env_cat_f) # stratify by OUTCOME 

## -- 1) for training (labelled `env_cat_f`)
env_cat_train <- training(env_split)   # 2265 proj
    
## -- 2) for validation (labelled `env_cat_f`)
env_cat_test <- testing(env_split)  # 971 proj
   
# # UNLABELLED PORTION 
## -- 3) for actual test (UNlabelled `env_cat_f`)
env_cat_missing <- projs_train2 %>% 
  filter(env_cat_f == "Missing") # 1167 proj 

# check distribution of `env_cat_f` in training and validation
tabyl(env_cat_train, env_cat_f) |> adorn_totals("row") |> adorn_pct_formatting(digits = 1)# 
tabyl(env_cat_test, env_cat_f)|> adorn_totals("row") |> adorn_pct_formatting(digits = 1)# 
#rm( env_cat_use, env_cat_missing)
# env_cat_train |> count(env_cat_f)

2) Pre-processing and featurization (textrecipes)

{textrecipes} provides a number of step functions for pre-processing text data. These include functions to tokenize (e.g. step_tokenize()), remove stop words (e.g. step_stopwords()), and to derive meta-features (e.g. step_lemma(), step_stem(), etc.) Furthermore, there are functions to engineer features in ways that are particularly relevant to text data, such as feature frequencies and weights (e.g. step_tf(), step_tfidf(), etc.) and token filtering (e.g. step_tokenfilter()).

  • step_tokenize()
  • step_tfidf()
  • smooth_idf = FALSE uses the raw (unsmoothed) inverse document frequency, so terms that appear in many (or all) documents are down-weighted more heavily than they would be if the smoothing term were added (a term present in every document gets a weight of zero); a small numeric illustration follows below

see here
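A quick numeric illustration of what the smoothing term does to the idf. The formulas below follow one common convention (an assumption on my part); see ?textrecipes::step_tfidf for the exact definition used by the recipe step.

# idf with and without smoothing, for a rare, a medium, and an ubiquitous term
N  <- 100            # number of documents
df <- c(1, 50, 100)  # document frequency of three hypothetical terms

idf_raw    <- log(N / df)                  # no smoothing: a term in every document gets idf = 0
idf_smooth <- log((1 + N) / (1 + df)) + 1  # smoothed variant: never reaches 0

round(rbind(idf_raw, idf_smooth), 2)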

Null recipe [env_recipe_zero]

# Create a recipe that only uses the `pdo` text variable
env_recipe_zero <- recipe (formula = env_cat_f2 ~ pdo,
                            data = env_cat_train) %>%
   # tokenize
   step_tokenize(pdo) %>% 
   # tf-idf matrix of term frequencies weighted by inverse document frequency 
   step_tfidf(pdo, smooth_idf = FALSE) 

# Review the recipe   
env_recipe_zero

recipes::bake() takes a trained recipe and applies its operations to a data set to create a design matrix.

# Run the recipe 
env_recipe_zero_desmatrix <-  env_recipe_zero %>% 
   # chooses the parameters for the recipe based on the data
   prep(training = NULL) %>% 
   # applies the recipe to the data already specified in the recipe
   bake(new_data = NULL)

# preview the DESIGN Matrix baked recipe -- TOO SPARSE 
dim(env_recipe_zero_desmatrix)
#[1] 2260 7763

The resulting engineered-features data frame has 2260 observations and 7763 variables!!! For each PDO text, only a small subset of these features actually appears, so most of the cells are filled with zeros. This is what is known as a sparse matrix. Furthermore, the more features we have, the more likely they capture nuances specific to these particular training texts, increasing the risk of overfitting the model.
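A quick way to quantify that sparsity (a sketch; it assumes env_recipe_zero_desmatrix from the chunk above and the default tfidf_pdo_ column prefix):

# Proportion of zero cells in the tf-idf part of the design matrix
tfidf_cols <- dplyr::select(env_recipe_zero_desmatrix, dplyr::starts_with("tfidf_pdo_"))
mean(as.matrix(tfidf_cols) == 0)  # expected to be very close to 1 (i.e. extremely sparse)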

Improved recipe [env_recipe]

Basically, we added step_tokenfilter(pdo, max_tokens = 100) to keep just the 100 most common words (and reduce the number of features).

# -- Rebuild recipe with tokenfilter step
env_recipe <- recipe (formula = env_cat_f2 ~ pdo,
                      data = env_cat_train) %>%
   # tokenize
   step_tokenize(pdo) %>%   
   # !!! filter by frequency of occurrence !!!
   step_tokenfilter(pdo, max_tokens = 100) %>%  
   # tf-idf  creates matrix of weighted term frequencies  
   step_tfidf(pdo, smooth_idf = FALSE)

Re-check the design matrix

# -- Run the recipe 
env_recipe_bake <-  env_recipe %>% 
   # chooses the parameters for the recipe based on the data
   prep(training = NULL) %>% 
   # applies the recipe to the data
   bake(new_data = NULL)

# -- preview the baked recipe
dim(env_recipe_bake)
#[1] 2260 7763--> #[1] 2260  101

# subset check
env_recipe_bake[1:5, 1:10 ]

3) Model specification

Now that the data is ready, the model can be specified. The parsnip package is used for this:

  • model specification,
  • model type mode,
  • the engine, i.e. the software that fits the model (given the type)

Let’s start with a simple logistic regression model to see how well we can classify the texts in the training set with the features we have engineered. We will use the parsnip::logistic_reg() function to specify the logistic regression model. We then select the implementation engine (glmnet, which fits regularized generalized linear models) and the mode of the model (classification).

  • tune() is a placeholder for a range of values for the penalty hyperparameter.
# Create a model specification
env_spec <-
   # lasso regularized model
   parsnip::logistic_reg(
      # non-negative number ~ the total amount of regularization
      penalty = tune(),  # 0 = no penalty, 1 = max
      # number between zero and one (inclusive) 
      mixture = 1 # specifies a pure lasso model,
   ) %>%
   # set the mode of the model
   parsnip::set_mode("classification") %>%
   # set the engine of the model
   parsnip::set_engine("glmnet")

# Preview 
env_spec

4) Model training (workflows)

We build the workflow (a container object that aggregates the information required to fit and predict from a model) by adding the recipe and the model to it.

# Create a workflow
env_wf <- workflows::workflow() %>%
   # preprocessing recipe
   add_recipe(env_recipe) %>%
   # model specifications
   add_model(env_spec)

I haven’t chosen the penalty yet in env_spec!

It can then be fit using fit(), but I need to specify the hyperparameters (e.g. penalty) first.

#env_wf %>% workflows::fit(data = env_cat_train)

5) Model evaluation / tweaking

Different algorithms have different parameters that can be adjusted and that affect model performance. These hyperparameters are distinct from the parameters (e.g. coefficients) learned from the data. Hyperparameter tuning is typically done by fitting the model to the training set under different hyperparameter values and evaluating its performance.

— Penalty tuning [env_grid]

In logistic regression, the penalty hyperparameter is like a control that helps prevent the model from becoming too complex and overfitting to the training data. There are two common types of penalties:

  • L1 (Lasso): Encourages the model to use fewer features by making some of the coefficients exactly zero. This can simplify the model.

  • L2 (Ridge): Tries to keep all the coefficients small but not exactly zero, which can help stabilize the model and avoid overfitting.

  • the logistic regression model using glmnet can be tuned to prevent overfitting by adjusting the penalty and mixture (combination of L1 and L2) hyperparameters; a sketch of ridge and elastic-net variants follows this list

    • In our env_spec model, tune() was a placeholder for a range of values for the penalty hyperparameter.
    • To tune the penalty hyperparameter, we use the grid_regular() function from {dials} to specify a grid of values to try.
  • The package dials contains infrastructure to create and manage values of tuning parameters for the tidymodels packages.
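For reference, the sketch below (not used in this analysis) shows how the same glmnet specification would become ridge or elastic net simply by changing the mixture argument.

# Sketch: ridge and elastic-net variants of the same glmnet specification
ridge_spec <- parsnip::logistic_reg(penalty = tune(), mixture = 0) %>%    # pure L2 (ridge)
   parsnip::set_mode("classification") %>%
   parsnip::set_engine("glmnet")

enet_spec  <- parsnip::logistic_reg(penalty = tune(), mixture = 0.5) %>%  # blend of L1 and L2
   parsnip::set_mode("classification") %>%
   parsnip::set_engine("glmnet")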

# Create a grid of values for the penalty hyperparameter (regular grid of 10 values)
env_grid <- dials::grid_regular(
   penalty(), 
   levels = 10)

# Preview
env_grid 
# penalty values range from 1e-10 (essentially no penalty)
# ... 
# to 1 (the maximum of the default range)

— K-fold cross-validation (to find the optimal penalty) [env_vfold, env_tune]

Now to perform the tuning and choose an optimal value for penalty we need to create a tuning workflow. We use the strategy of resampling (splitting env_cat_train in multiple training/testing sets) called k-fold cross-validation to arrive at the optimal value for the penalty hyperparameter.

set.seed(123)

# Create a resampling object
env_vfold <- rsample::vfold_cv(env_cat_train, v = 10)

# Create a tuning workflow
env_tune <- tune::tune_grid(
  object = env_wf,
  resamples = env_vfold,
  grid = env_grid,
  control = control_grid(save_pred = TRUE),
  metrics = metric_set(roc_auc, accuracy, sensitivity, specificity)
  
)

# preview
env_tune

The env_tune object contains the results of the tuning for each fold. We can see the results of the tuning for each fold by calling the collect_metrics() function on the env_tune object

# Collect the results of the tuning
env_tune_metrics <- tune::collect_metrics(env_tune)
env_tune_pred  <- tune::collect_predictions(env_tune)

# visualize the results
tune::autoplot(env_tune) +
  labs(
    title = "Lasso model performance across regularization penalties",
    subtitle = "Performance metrics can be used to identify the best penalty"
  )
# in roc_auc: many of the penalty values performed similarly, with a drop-off in performance at the higher values

The most common metrics for model performance in classification are:

  1. accuracy (the proportion of correct predictions)
    • here it drops sharply at the higher values of penalty (1e-02), meaning dropping too many features will hurt predictive performance
  2. ROC-AUC (area under the receiver operating characteristic area under the curve, i.e. a single score for how well the model can distinguish between classes) The closer to 1 the more discriminating power the model has.
    • similar, i.e. strong penalties will damage the model’s ability to distinguish between classes.
  3. sensitivity (the proportion of TP that are correctly identified)
    • here it peaks at 1.0 beyond 1e-02, meaning the model is very good at identifying the positive class (High-Med-risk) BUT…
  4. specificity (the proportion of TN that are correctly identified)
    • … at the cost of specificity. In fact, in that region specificity is very poor, suggesting no true negatives are correctly identified.

Conveniently, the show_best() function from {tune} takes a tune_grid object and returns the best performing hyperparameter values.

# Show the best hyperparameter values
tune::show_best(env_tune, metric = "roc_auc") # 0.00599

# Make selection of penalty programmatically
env_best <- select_best(env_tune, metric ="roc_auc")           # 0.00599
env_best_lambda <- env_best$penalty                            # 0.00599

env_best_acc <- select_best(env_tune, metric ="accuracy")      # 0.00599
env_best_sens <- select_best(env_tune, metric ="sensitivity")  # 0.0774

6) Hyperparameter tuning

Update workflow [env_wf_lasso]

Now we can update the model specification and workflow with the best performing hyperparameter value using the previous env_wf workflow and the finalize_workflow() function.

Instead of penalty = tune() like before, now our workflow has finalized values for all arguments.

# Update the workflow/model specification with the best penalty value
env_wf_lasso <- env_wf %>% 
   tune::finalize_workflow(env_best)

# Preview updated workflow object (with defined penalty param 0.00599)
env_wf_lasso

7) Final fit on train set

Fit the final model

# Fit the model to the training data
env_lasso_fit <- env_wf_lasso %>% 
   # fit the model to the training data
   workflows::fit(data = env_cat_train)

Here I see:

  • Df = Number of non-zero coefficients (i.e., selected features) at a given lambda.
  • %Dev = Percentage of null deviance explained (i.e., a measure of model fit – how well the model explains the variation in the response).
  • Lambda = The regularization parameter – higher values imply more shrinkage (simpler models).
    • (here) a lambda as small as 1e-10 means essentially no penalty (i.e., no regularization): the model is almost unpenalized, so it is similar to a standard logistic regression model.

This table shows the model performance across different levels of regularization (lambda). As lambda decreases:

  • The number of selected features (Df) increases.
  • Model fit improves (%Dev goes up), but complexity increases.
  • At the smallest lambda shown (0.001950), 93 features are included, explaining 23.73% of deviance.

You’ll typically select a lambda using cross-validation (e.g., tune_grid() or select_best() in tidymodels). The goal is to balance simplicity and predictive performance – too small a lambda overfits; too large underfits.
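If useful, the Df / %Dev / Lambda table described above can be printed directly from the underlying glmnet object (a sketch; assumes env_lasso_fit from the chunk above):

# Inspect the full regularization path of the fitted lasso model
env_lasso_fit %>% 
   workflows::extract_fit_engine() %>%  # the raw glmnet object inside the workflow
   print()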

Coefficients for the lambda with the best performance

  • using the workflows::extract_fit_parsnip() function, I see only the model coefficients for the best performing lambda value. This function extracts the fitted model object from the workflow, and
  • then I can use the tidy() function to get the coefficients.
# Get the coefficients of the model
env_lasso_fit %>% 
   workflows::extract_fit_parsnip() %>% 
   broom::tidy() %>% 
   dplyr::filter(term != "(Intercept)") %>% 
   dplyr::arrange(desc(abs(estimate))) %>% 
   dplyr::slice(1:20) # top 20

I can see stopwords (like “and”) among the most important features!

— Get predicted class and probabilities (train)

Here I get the predicted probabilities from the env_lasso_fit logistic regression model on the env_cat_train dataset. With the type = "prob" argument in the predict() function, the model returns class probabilities for each of the 2260 observations:

  1. .pred_High-Med-risk = probability of being in the High-Med-risk category
  2. .pred_Low-risk_Othr = probability of being in the Low-risk_Othr category

# Prep data frame containing actual and predicted values
pred_train  <-  stats::predict(env_lasso_fit, 
                               new_data = env_cat_train, 
                               type = "prob") |> # prob **high** and prob **low** for each observ based on model 
   # combine with the original training data (labeled)
   bind_cols(env_cat_train) %>% 
   select(env_cat_f2,  # ACTUAL CLASS of the 1obs
          pred_high_risk = '.pred_High-Med-risk', # PROB 1obs is PREDICTED HIGH-MED-RISK 
          pred_low_risk = '.pred_Low-risk_Othr'   # PROB 1obs is PREDICTED LOW-RISK
   )  # 2260 rows


# convert to long format for plotting
pred_train_long <- pred_train |>
   pivot_longer(cols = c(pred_high_risk, pred_low_risk),
                names_to = "pred_risk_type", values_to = "pred_risk_value") # 4,520 rows

— [FIG] Assessing performance on training set

# Plot the predictions with boxplots and jittered points without duplicate legends
ggplot(pred_train_long, aes(x = env_cat_f2, y = pred_risk_value, fill = pred_risk_type)) +
  geom_boxplot(alpha = 0.4, position = position_dodge(width = 0.8)) +
  #geom_jitter(alpha = 0.6, position = position_dodge(width = 0.8)) +  # Remove width and use position_dodge
   labs(title = "PREDICTED class distribution (y axis) v. ACTUAL class (x axis)",
        subtitle = "Model: Lasso Regression fitted on training data",
       x = "ACTUAL env. class",
       y = "Probabiity of PREDICTED env. class",
       fill = "Risk Type") +  # Set label for fill legend
  theme_minimal() +
  guides(color = "none")  # Suppress the color legend

Calculate performance metrics

train_pred_probs <- predict(env_lasso_fit, new_data = env_cat_train, type = "prob")
train_pred_class <- predict(env_lasso_fit, new_data = env_cat_train, type = "class")

train_results <- bind_cols(env_cat_train, train_pred_probs, train_pred_class) |> 
   select(env_cat_f2,  # ACTUAL CLASS of the 1obs
          pred_high_risk = '.pred_High-Med-risk', # PROB 1obs is PREDICTED HIGH-MED-RISK 
          pred_low_risk = '.pred_Low-risk_Othr',   # PROB 1obs is PREDICTED LOW-RISK
          pred_class = '.pred_class')

# Accuracy
a_train_acc   <- accuracy(train_results, truth = env_cat_f2, estimate = pred_class)
# ROC AUC (needs probabilities + positive class specified)
a_train_roc_auc   <- roc_auc(train_results, truth = env_cat_f2, pred_high_risk)
# Sensitivity and Specificity
a_train_sens   <- sens(train_results, truth = env_cat_f2, estimate = pred_class)
a_train_spec   <- spec(train_results, truth = env_cat_f2, estimate = pred_class)

# confusion matrix
conf_mat(train_results, truth = env_cat_f2, estimate = pred_class) |> 
   autoplot(type = "heatmap") +
   labs(title = "Confusion matrix for Lasso model",
        subtitle = "Training set")

# roc curve
roc_curve(train_results, truth = env_cat_f2, pred_high_risk) |> 
   autoplot() +
   labs(title = "ROC curve for Lasso model (only x = pdo)",
        subtitle = "Training set")

8) Evaluation on test set

— Get predicted class and probabilities (test)

# Predict on the test set
pred_test  <-  stats::predict(env_lasso_fit, 
                      new_data = env_cat_test, 
                      type = "prob") |> # prob **high** and prob **low** for each observ based on model 
   # combine with the original training data (labeled)
   bind_cols(env_cat_test) %>% 
   select(env_cat_f2,  # ACTUAL CLASS of the 1obs
          pred_high_risk = '.pred_High-Med-risk', # PROB 1obs is PREDICTED HIGH-MED-RISK 
          pred_low_risk = '.pred_Low-risk_Othr'   # PROB 1obs is PREDICTED LOW-RISK
          )  # 970 rows

# convert to long format for plotting
pred_test_long <- pred_test |> 
   pivot_longer(cols = c(pred_high_risk, pred_low_risk),
                names_to = "pred_risk_type", values_to = "pred_risk_value") # 1,940 rows

— [FIG] Assessing performance on test set

# Plot the predictions with boxplots and jittered points without duplicate legends
ggplot(pred_test_long, aes(x = env_cat_f2, y = pred_risk_value, fill = pred_risk_type)) +
  geom_boxplot(alpha = 0.4, position = position_dodge(width = 0.8)) +
  #geom_jitter(alpha = 0.6, position = position_dodge(width = 0.8)) +  # Remove width and use position_dodge
   labs(title = "PREDICTED class distribution (y axis) v. ACTUAL class (x axis)",
        subtitle = "Model: Lasso Regression fitted on test data",
       x = "ACTUAL env. class",
       y = "Probabiity of PREDICTED env. class",
       fill = "Risk Type") +  # Set label for fill legend
  theme_minimal() +
  guides(color = "none")  # Suppress the color legend

Calculate performance metrics

Using yardstick package, I can calculate the performance metrics for the model on the test set. The roc_auc() function calculates the area under the ROC curve, which is a measure of how well the model can distinguish between the two classes.

test_pred_probs <- predict(env_lasso_fit, new_data = env_cat_test, type = "prob")
test_pred_class <- predict(env_lasso_fit, new_data = env_cat_test, type = "class")

test_results <- bind_cols(env_cat_test, test_pred_probs, test_pred_class) |> 
   select(env_cat_f2,  # ACTUAL CLASS of the 1obs
          pred_high_risk = '.pred_High-Med-risk', # PROB 1obs is PREDICTED HIGH-MED-RISK 
          pred_low_risk = '.pred_Low-risk_Othr',   # PROB 1obs is PREDICTED LOW-RISK
          pred_class = '.pred_class')

# Accuracy
a_test_acc <- accuracy(test_results, truth = env_cat_f2, estimate = pred_class)
# ROC AUC (needs probabilities + positive class specified)
a_test_roc_auc <- roc_auc(test_results, truth = env_cat_f2, pred_high_risk)
# Sensitivity and Specificity
a_test_sens <- sens(test_results, truth = env_cat_f2, estimate = pred_class)
a_test_spec <- spec(test_results, truth = env_cat_f2, estimate = pred_class)

# confusion matrix
conf_mat(test_results, truth = env_cat_f2, estimate = pred_class) |> 
   autoplot(type = "heatmap") +
   labs(title = "Confusion matrix for Lasso model",
        subtitle = "testing set")

# roc curve
roc_curve(test_results, truth = env_cat_f2, pred_high_risk) |> 
   autoplot() +
   labs(title = "ROC curve for Lasso model (only x = pdo)",
        subtitle = "testing set")

Recap

So, the results of the ‘best model’ in the training and testing set are:

#| tbl-cap: Model performance metrics
#| label: tbl-model-metrics
#| output: true

# table with accuracy, roc_auc, sensitivity, specificity
model_metrics_recap <- tibble(
   metric = c("Accuracy", "ROC AUC", "Sensitivity", "Specificity"),
   train = c(a_train_acc$.estimate, 
             a_train_roc_auc$.estimate, 
             a_train_sens$.estimate, 
             a_train_spec$.estimate      )  ,
   test =  c(a_test_acc$.estimate, 
             a_test_roc_auc$.estimate, 
             a_test_sens$.estimate, 
             a_test_spec$.estimate      )  
   ) |> 
   mutate(across(c(train, test), ~ round(.x, 3))) |>
   kable() |> 
   kable_styling("striped", full_width = F) |> 
   # header merged column 
   kableExtra::add_header_above(c("A) Lasso Logistic model", "datasets" = 2), 
                                bold = TRUE, color = "black", background = "#D9EAD3")  

model_metrics_recap
A) Lasso Logistic model        datasets
metric                     train     test
Accuracy                   0.759    0.759
ROC AUC                    0.812    0.774
Sensitivity                0.910    0.905
Specificity                0.456    0.473

___ MODEL B) lasso w pdo + tf-idf + stopwords

0) Prep for split based on outcome [using projs_train2]

(same)

1) Split data in train/test [based on env_cat]

(same)

2) Pre-processing and featurization (recipes)

To improve supervised learning models, consider:

  1. Engineering the features differently
    • we set a token filter to limit the number of features to 100, which we could adjust (max_tokens); a sketch follows this list
  2. Selecting different (or additional) features
  3. Changing the algorithm
  4. Tuning the hyperparameters differently
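As noted in point 1 above, max_tokens could itself be tuned rather than fixed at 100. A sketch under that assumption (not run in this analysis; it assumes {textrecipes} and the env_cat_train object from the earlier chunks):

# Sketch: tuning max_tokens together with the lasso penalty
env_recipe_tune_tokens <- recipe(env_cat_f2 ~ pdo, data = env_cat_train) %>%
   step_tokenize(pdo) %>%
   step_tokenfilter(pdo, max_tokens = tune()) %>%   # placeholder, like penalty = tune()
   step_tfidf(pdo, smooth_idf = FALSE)

# the tuning grid would then cross penalty() with max_tokens()
env_grid_tokens <- dials::grid_regular(dials::penalty(), dials::max_tokens(), levels = c(10, 3))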
# Create a custom stopword list
stop_vector <- custom_stop_words_df %>%  pull(word)

— Improve recipe [env_stop_recipe]

# ---  Create a recipe with a token filter step that excludes stopwords
# Rebuild recipe with tokenfilter step
env_stop_recipe <- recipes::recipe (
   formula = env_cat_f2 ~ pdo,
   data = env_cat_train) %>%
   # tokenize
   step_tokenize(pdo) %>%   
   # remove CUSTOM stopwords (NEW!)
   step_stopwords(pdo, custom_stopword_source = stop_vector) %>%  
   # filter by frequency of occurrence
   step_tokenfilter(pdo, max_tokens = 100) %>%  
   # tf-idf  creates matrix of weighted term frequencies 
   step_tfidf(pdo, smooth_idf = FALSE)       

# prep and bake the recipe
env_stop_recipe_bake <-  env_stop_recipe %>% 
  prep() %>% 
   bake(new_data = NULL)

# preview the baked recipe
dim(env_stop_recipe_bake)
#[1] 2264 101
env_recipe_bake[1:5, 1:10]
env_stop_recipe_bake[1:5, 1:10]

3) Model specification

The model specification is the same as before (env_spec), but now we will use the new recipe with stopwords removed.

— Model specification [env_spec]

# Create a model specification
env_spec <-
   # generalized linear model for binary outcomes
   parsnip::logistic_reg(
      # A non-negative number representing the total amount of regularization
      penalty = tune(),  # 0 = no penalty, 1 = max
      #A number between zero and one (inclusive)
      mixture = 1 # specifies a pure lasso model,
   ) %>%
   set_engine("glmnet")
                           ##### tune() IS A PLACEHOLDER
# Preview
env_spec

4) Model training (workflows)

We create a new workflow [env_stop_wf] with env_stop_recipe, which is the part of the workflow that changed (it now adds step_stopwords()).

# Create a workflow
env_stop_wf <- workflows::workflow() %>%
   add_recipe(env_stop_recipe) %>%  # NEW RECIPE
   add_model(env_spec)
# Preview
env_stop_wf

5) Model evaluation / tweaking

(same)

— Penalty tuning [env_grid] (same)

# Create a grid of values for the penalty hyperparameter (regular grid of 10 values)
env_grid <- dials::grid_regular(
  penalty(), levels = 10
  )
# Preview
env_grid 

— K-fold cross-val (same, but NEW wf)

Now to perform the tuning and choose an optimal value for penalty we need to create a tuning workflow. We use the strategy of resampling (splitting env_cat_train in multiple training/testing sets) called k-fold cross-validation to arrive at the optimal value for the penalty hyperparameter.

set.seed(123)

# Create a resampling object
env_vfold <- rsample::vfold_cv(env_cat_train, v = 10)

# Create a tuning workflow
env_stop_tune <- tune::tune_grid(
  object = env_stop_wf, # changed ! 
  resamples = env_vfold,
  grid = env_grid,
  control = control_grid(save_pred = TRUE),
  metrics = metric_set(roc_auc, accuracy, sensitivity, specificity)
)
# preview
env_stop_tune

The env_stop_tune object contains the results of the tuning for each fold. We can see the results of the tuning for each fold by calling the collect_metrics() function on the env_stop_tune object

# Collect the results of the tuning
env_stop_tune_metrics <- tune::collect_metrics(env_stop_tune)

# visualize the results
tune::autoplot(env_stop_tune) +
  labs(
    title = "Lasso model performance across regularization penalties",
    subtitle = "Performance metrics can be used to identify the best penalty"
  )
# in roc_auc: many of the penalty values performed similarly, with a drop-off in performance at the higher values

Once again, as the penalty increases, model performance reaches a peak for accuracy, but with a tradeoff between sensitivity and specificity. The roc_auc metric is similar to the previous model, but the accuracy metric is slightly lower.

Conveniently, the tune::show_best() function takes a tune_grid object and returns the best performing hyperparameter values (i.e. PENALTY in lasso logistic).

# Show the best hyperparameter values (BASED ON A CHOSEN METRIC)
show_best(env_stop_tune, metric = "roc_auc")  # 0.00599

# Make selection of penalty programmatically
env_stop_best <- select_best(env_stop_tune, metric ="roc_auc")           # 0.00599
env_stop_best_lambda <- env_stop_best$penalty                            # 0.00599

env_stop_best_acc <- select_best(env_stop_tune, metric ="accuracy")      # 0.00599
env_stop_best_sens <- select_best(env_stop_tune, metric ="sensitivity")  # 0.0774

6) Hyperparameter tuning

Update workflow [env_stop_wf_lasso]

Now we can update the model specification and workflow with the best performing hyperparameter value using the previous env_stop_wf workflow and the finalize_workflow() function.

# Update the model specification
env_stop_wf_lasso <- env_stop_wf %>% 
   tune::finalize_workflow(env_stop_best)

# Preview updated workflow object (with defined penalty param 0.00599)
env_stop_wf_lasso

7) Final fit on training set

— Fit the final model

# Fit the model to the training data
env_stop_lasso_fit <- fit (env_stop_wf_lasso, data = env_cat_train)

Here I see:

  • Df = Number of non-zero coefficients (i.e., selected features) at a given lambda.
  • %Dev = Percentage of null deviance explained (i.e., a measure of model fit – how well the model explains the variation in the response).
  • Lambda = The regularization parameter – higher values imply more shrinkage (simpler models).
    • (here) a lambda as small as 1e-10 means essentially no penalty (i.e., no regularization): the model is almost unpenalized, so it is similar to a standard logistic regression model.

This table shows the model performance across different levels of regularization (lambda). As lambda decreases:

  • The number of selected features (Df) increases.
  • Model fit improves (%Dev goes up), but complexity increases.
  • At the smallest lambda shown (0.001950), 93 features are included, explaining 23.73% of deviance.

You’ll typically select a lambda using cross-validation (e.g., tune_grid() or select_best() in tidymodels). The goal is to balance simplicity and predictive performance – too small a lambda overfits; too large underfits.

— Coefficients for the lambda with the best performance

  • using the workflows::extract_fit_parsnip() function, I see only the model coefficients for the best performing lambda value. This function extracts the fitted model object from the workflow

  • Then, for the penalty we chose, we see which terms contribute the most to the env. category NOT being high risk.

env_stop_lasso_fit |> 
   workflows::extract_fit_parsnip() %>% 
   broom::tidy() %>% 
   dplyr::filter(term != "(Intercept)") %>% 
   dplyr::arrange(desc(abs(estimate))) %>% 
   dplyr::slice(1:20) # top 20

“and” is no longer a top coefficient!

— Get predicted class and probabilities (train)

# Example of a data frame containing actual and predicted values
pred_stop_train  <- stats::predict(env_stop_lasso_fit, 
                           new_data = env_cat_train , 
                           type = "prob")|># prob **high** and prob **low** for each observ based on model 
   bind_cols(env_cat_train) %>% 
   select(env_cat_f2,  pred_high_risk = '.pred_High-Med-risk', pred_low_risk = '.pred_Low-risk_Othr')   # 2260 rows 


pred_stop_train_long <- pred_stop_train |>
   pivot_longer(cols = c(pred_high_risk, pred_low_risk),
                names_to = "risk_type", values_to = "risk_value") # 4,520 rows

— [FIG] Assessing performance [env_stop_lasso_fit] on training set

# Plot the predictions with boxplots and jittered points without duplicate legends
ggplot(pred_stop_train_long, aes(x = env_cat_f2, y = risk_value, fill = risk_type)) +
   geom_boxplot(alpha = 0.4, position = position_dodge(width = 0.8)) +
   #geom_jitter(alpha = 0.6, position = position_dodge(width = 0.8)) +  # Remove width and use position_dodge
   labs(title = "PREDICTED class distribution (y axis) v. ACTUAL class (x axis)",
        subtitle = "Model: Lasso Regression fitted on training data",
        x = "ACTUAL env. class",
        y = "Probabiity of PREDICTED env. class",
        fill = "Risk Type") +  # Set label for fill legend
   theme_minimal() +
   guides(color = "none")  # Suppress the color legend

This model did not improve much (especially on the Low-risk_Othr level prediction)!

Calculate performance metrics

train_stop_pred_probs <- predict(env_stop_lasso_fit, new_data = env_cat_train, type = "prob")
train_stop_pred_class <- predict(env_stop_lasso_fit, new_data = env_cat_train, type = "class")

train_stop_results <- bind_cols(env_cat_train, train_stop_pred_probs, train_stop_pred_class) |> 
   select(env_cat_f2,  # ACTUAL CLASS of the 1obs
          pred_high_risk = '.pred_High-Med-risk', # PROB 1obs is PREDICTED HIGH-MED-RISK 
          pred_low_risk = '.pred_Low-risk_Othr',   # PROB 1obs is PREDICTED LOW-RISK
          pred_class = '.pred_class')

# Accuracy
b_train_acc <- accuracy(train_stop_results, truth = env_cat_f2, estimate = pred_class)
# ROC AUC (needs probabilities + positive class specified)
b_train_roc_auc <- roc_auc(train_stop_results, truth = env_cat_f2, pred_high_risk)
# Sensitivity and Specificity
b_train_sens <- sens(train_stop_results, truth = env_cat_f2, estimate = pred_class)
b_train_spec <- spec(train_stop_results, truth = env_cat_f2, estimate = pred_class)

# confusion matrix
conf_mat(train_stop_results, truth = env_cat_f2, estimate = pred_class) |> 
   autoplot(type = "heatmap") +
   labs(title = "Confusion matrix for Lasso model",
        subtitle = "Training set")

# roc curve
roc_curve(train_stop_results, truth = env_cat_f2, pred_high_risk) |> 
   autoplot() +
   labs(title = "ROC curve for Lasso model B (only x = pdo)",
        subtitle = "Training set")

There are more false positives (low risk predicted to be high risk) than false negatives. (This is a common issue in imbalanced datasets and can be addressed by adjusting the decision threshold of the model.)
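A minimal sketch of what adjusting the decision threshold could look like (an illustration only, using the train_stop_results object from above; the 0.7 cutoff is an arbitrary assumption):

# Require more confidence before predicting "High-Med-risk" (default threshold is 0.5)
threshold <- 0.7

train_stop_results %>%
   mutate(pred_class_thr = factor(
      if_else(pred_high_risk >= threshold, "High-Med-risk", "Low-risk_Othr"),
      levels = levels(env_cat_f2))) %>%
   conf_mat(truth = env_cat_f2, estimate = pred_class_thr)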

8) Evaluation on test set

— Get predicted class and probabilities (test)

# Predict on the test set
pred_stop_test  <- stats::predict(env_stop_lasso_fit, 
                           new_data = env_cat_test , 
                           type = "prob")|># prob **high** and prob **low** for each observ based on model 
   bind_cols(env_cat_test) %>% 
   select(env_cat_f2,  pred_high_risk = '.pred_High-Med-risk', pred_low_risk = '.pred_Low-risk_Othr')   # 970 rows

pred_stop_test_long <- pred_stop_test |>
   pivot_longer(cols = c(pred_high_risk, pred_low_risk),
                names_to = "risk_type", values_to = "risk_value") # 1,940 rows

— [FIG] Assessing performance on test set

# Plot the predictions with boxplots and jittered points without duplicate legends
ggplot(pred_stop_test_long, aes(x = env_cat_f2, y = risk_value, fill = risk_type)) +
   geom_boxplot(alpha = 0.4, position = position_dodge(width = 0.8)) +
   #geom_jitter(alpha = 0.6, position = position_dodge(width = 0.8)) +  # Remove width and use position_dodge
   labs(title = "PREDICTED class distribution (y axis) v. ACTUAL class (x axis)",
        subtitle = "Model: Lasso Regression fitted on test data",
        x = "ACTUAL env. class",
        y = "Probabiity of PREDICTED env. class",
        fill = "Risk Type") +  # Set label for fill legend
   theme_minimal() +
   guides(color = "none")  # Suppress the color legend

Calculate performance metrics

test_stop_pred_probs <- predict(env_stop_lasso_fit, new_data = env_cat_test, type = "prob")
test_stop_pred_class <- predict(env_stop_lasso_fit, new_data = env_cat_test, type = "class")
test_stop_results <- bind_cols(env_cat_test, test_stop_pred_probs, test_stop_pred_class) |> 
   select(env_cat_f2,  # ACTUAL CLASS of the 1obs
          pred_high_risk = '.pred_High-Med-risk', # PROB 1obs is PREDICTED HIGH-MED-RISK 
          pred_low_risk = '.pred_Low-risk_Othr',   # PROB 1obs is PREDICTED LOW-RISK
          pred_class = '.pred_class')

# accuracy
b_test_acc <- accuracy(test_stop_results, truth = env_cat_f2, estimate = pred_class)
# ROC AUC (needs probabilities + positive class specified)
b_test_roc_auc <- roc_auc(test_stop_results, truth = env_cat_f2, pred_high_risk)
# Sensitivity and Specificity
b_test_sens <- sens(test_stop_results, truth = env_cat_f2, estimate = pred_class)
b_test_spec <- spec(test_stop_results, truth = env_cat_f2, estimate = pred_class)

# confusion matrix
conf_mat(test_stop_results, truth = env_cat_f2, estimate = pred_class) |> 
   autoplot(type = "heatmap") +
   labs(title = "Confusion matrix for Lasso model",
        subtitle = "testing set")

# roc curve
roc_curve(test_stop_results, truth = env_cat_f2, pred_high_risk) |> 
   autoplot() +
   labs(title = "ROC curve for Lasso model B (only x = pdo)",
        subtitle = "testing set")

Recap

So, the results of the ‘best model’ in the training and testing set are:

# table with accuracy, roc_auc, sensitivity, specificity
model_stop_metrics_recap <- tibble(
   metric = c("Accuracy", "ROC AUC", "Sensitivity", "Specificity"),
   train = c(b_train_acc$.estimate, 
             b_train_roc_auc$.estimate, 
             b_train_sens$.estimate, 
             b_train_spec$.estimate      )  ,
   test =  c(b_test_acc$.estimate, 
             b_test_roc_auc$.estimate, 
             b_test_sens$.estimate, 
             b_test_spec$.estimate      )  
   ) |> 
   mutate(across(c(train, test), ~ round(.x, 3))) |>
   kable() |> 
   kable_styling("striped", full_width = F) |> 
   # header merged column 
   kableExtra::add_header_above(c("B) Lasso Logistic model", "datasets" = 2), 
                                bold = TRUE, color = "black", background = "#D9EAD3")
#model_metrics_recap
model_stop_metrics_recap
Model performance metrics

B) Lasso Logistic model        datasets
metric                     train     test
Accuracy                   0.766    0.755
ROC AUC                    0.816    0.785
Sensitivity                0.913    0.907
Specificity                0.471    0.457

🟨 Alternative (7) & (8) together

To do this we need to fit the tuned workflow to the training set, which is the actual training phase. We will use the last_fit() function from {tune}.

Let’s use the updated workflow env_stop_wf_lasso

— Fit the best model (to train) and evaluate on the test set

(same results as obtained by doing the fit and evaluation steps separately)

# fit the model to the training set and evaluate on the validation set
env_stop_lasso_final_fit <- last_fit(
   env_stop_wf_lasso, 
   split = env_split)

# Evaluate the model on the validation set (in my case)
collect_metrics(env_stop_lasso_final_fit) |> 
   dplyr::filter(.metric %in% c("accuracy", "roc_auc", 
                                # not available 
                                "sens", "spec"))

# 1 accuracy    binary         0.762 Preprocessor1_Model1
# 2 roc_auc     binary         0.807 Preprocessor1_Model1
# 3 brier_class binary         0.163 Preprocessor1_Model1

The performance metrics are very close to those we achieved on the training set –> good sign that the model is robust as it performs well on both training and test (validation) sets.
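If needed, the fitted workflow can be pulled back out of the last_fit() result (a sketch; extract_workflow() is documented in {tune}), e.g. to later predict the unlabelled projects in env_cat_missing:

# Recover the fitted workflow from the last_fit() object
env_stop_lasso_final_wf <- extract_workflow(env_stop_lasso_final_fit)

# e.g. predicted probabilities for projects with a missing env_cat_f label
# predict(env_stop_lasso_final_wf, new_data = env_cat_missing, type = "prob")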

___ MODEL C) [👍🏻] lasso w pdo + tf-idf + stopwords + 3 fct

0) Prep for split based on outcome [using projs_train2]

(same)

1) Split data in train/test [based on env_cat]

(same)

2) Pre-processing and featurization (recipes)

— Improve recipe [env_FEAT_recipe] (NEW!)

  • using sector_f to include the sector tag but with less dimensions
  • step_dummy() because glmnet (and certain tuning functions in tidymodels) requires numeric predictors, so categorical variables must be converted to dummy variables.
# ---  Create a recipe with a token filter step that excludes stopwords
# Rebuild recipe with tokenfilter step
env_FEAT_recipe <- recipe (env_cat_f2 ~ pdo + sector_f + regionname + FY_appr,
                           data = training(env_split)) %>%
   # tokenize the text
   step_tokenize(pdo) %>%  
   # remove CUSTOM stopwords
   step_stopwords(pdo, custom_stopword_source = stop_vector) %>%  
   # filter by frequency of occurrence
   step_tokenfilter(pdo, max_tokens = 100) %>%  
   # creates tf-idf matrix of weighted term frequencies
   step_tfidf(pdo, smooth_idf = FALSE) %>%
   # assign missing factor values to a new "Unknown" level
   step_unknown(sector_f, new_level = "Unknown sect") %>%
   step_unknown(regionname, new_level = "Unknown reg") %>%
   step_unknown(FY_appr, new_level = "Unknown FY") %>%
   # convert to dummy variables
   step_dummy(sector_f, regionname, FY_appr, one_hot = TRUE) 

check what changed…

# prep and bake the recipe
env_FEAT_recipe_bake <-  env_FEAT_recipe %>% 
  prep() %>% 
   bake(new_data = NULL)

# preview the baked recipe
dim(env_FEAT_recipe_bake)
#[1] 2264 101 --> 2260  149
env_FEAT_recipe_bake[1:5, 1:10]

3) Model specification

(same)

— Model specification [env_spec]

# Create a model specification
env_spec <-
   # generalized linear model for binary outcomes
   parsnip::logistic_reg(
      mode = "classification",
      # A non-negative number representing the total amount of regularization
      penalty = tune(),  # 0 = no penalty, 1 = max
      #A number between zero and one (inclusive)
      mixture = 1 # specifies a pure lasso model,
   ) %>%
   set_engine("glmnet")
                           ##### tune() IS A PLACEHOLDER
# Preview
env_spec

4) Model training (workflows)

— Create workflow [env_FEAT_wf] (NEW!)

env_FEAT_recipe is the part that changed in this workflow: besides the stopword removal, it adds sector_f, regionname, and FY_appr as predictors via step_unknown() and step_dummy().

# Create a workflow
env_FEAT_wf <- workflows::workflow() %>%
   add_recipe(env_FEAT_recipe) %>%  # NEW RECIPE
   add_model(env_spec) # same model
# Preview
env_FEAT_wf

5) Model evaluation / tweaking

— Penalty tuning + folds [env_grid, env_vfold] (same)

# Create a grid of values for the penalty hyperparameter (regular grid of 10 values)
env_grid <- dials::grid_regular(
  penalty(),
  levels = 10)

env_grid

— K-fold cross-val tuning [env_FEAT_tune] (NEW)

[…] To choose an optimal value for penalty we need to create a tuning workflow, that resamples env_cat_train in multiple training/testing sets

set.seed(123)

# Create a resampling object
env_vfold <- rsample::vfold_cv(env_cat_train, v = 10)

# Create a tuning workflow
env_FEAT_tune <- tune::tune_grid(
  object = env_FEAT_wf, # changed ! 
  resamples = env_vfold,
  grid = env_grid,
  control = control_grid(save_pred = TRUE),
  # metrics 
  metrics = metric_set(roc_auc, accuracy, sensitivity, specificity)
)
# preview
env_FEAT_tune

The env_FEAT_tune object contains the results of the tuning for each fold. We can see the results of the tuning for each fold by calling the collect_metrics() function on the env_FEAT_tune object

# Collect the results of the tuning
env_FEAT_tune_metrics <- tune::collect_metrics(env_FEAT_tune)

# visualize the results
tune::autoplot(env_FEAT_tune) +
  labs(
    title = "Lasso model performance across regularization penalties",
    subtitle = "Performance metrics can be used to identify the best penalty"
  )
# in roc_auc: many of the penalty values performed similarly, with a drop-off in performance at the higher values

Conveniently, the tune::show_best() function takes a tune_grid object and returns the best performing hyperparameter values.

# Show the best hyperparameter values
tune::show_best(env_FEAT_tune, metric = "roc_auc")                  # 0.00599 (same )

# Make selection of penalty programmatically
env_FEAT_best <- select_best(env_FEAT_tune, metric ="roc_auc")      # 0.00599 (same)
env_FEAT_best_lambda <- env_FEAT_best$penalty                       # 0.00599 (same)

env_FEAT_best_acc <- select_best(env_FEAT_tune, metric ="accuracy") # 0.00599 (same)
env_FEAT_best_sens <- select_best(env_FEAT_tune, metric ="sensitivity")  # 0.0774

6) Hyperparameter tuning

Update workflow [env_FEAT_wf2]

Now we can update the model specification and workflow with the best performing hyperparameter value using the previous env_FEAT_wf workflow and the finalize_workflow() function.

# Update the model specification
env_FEAT_wf2 <- env_FEAT_wf %>% 
   tune::finalize_workflow(env_FEAT_best)

# Preview updated workflow object (with defined penalty param 0.00599)
env_FEAT_wf2

7) Final fit on training set

Fit the final model

“Take my model workflow (env_FEAT_wf2), train it on the training data (env_cat_train), and save the result as env_FEAT_fit so I can use it to make predictions or evaluate performance.”

# Fit the model to the training data
env_FEAT_fit <- env_FEAT_wf2 |> 
   workflows::fit (data = env_cat_train)

From this object I can extract many components describing the preprocessing and the results of the fitted model:

— Extract …

# Extract the fitted model
env_FEAT_fit |> 
   workflows::extract_fit_parsnip()  

# Extract Preprocessor
env_FEAT_fit |> 
   workflows::extract_preprocessor()
 
# Extract names of the features used in the model
extract_fit_parsnip(env_FEAT_fit)$fit$beta@Dimnames[[1]] # names of the features used in the model

# Extract names of penalty values
extract_fit_parsnip(env_FEAT_fit)$fit$beta@Dimnames[[2]] # Lambda values used in the model
# etc

— Extract Coefficients [enf_fitted_coeff]

Here we access the model coefficients to see which features are most important in the model. For the penalty we chose, we see which terms contribute the most to the env. category NOT being high risk.

env_FEAT_fit %>% 
   #workflows::extract_fit_parsnip() %>% 
   broom::tidy() %>% 
   dplyr::filter(term != "(Intercept)") %>% 
   dplyr::arrange(desc(abs(estimate))) %>% 
   dplyr::slice(1:20) # top 20
  • “sector_f…” appear among the top coefficients!!!

— Get predicted class and probabilities (train)

Here I get the predicted probabilities from the env_FEAT_fit logistic regression model on the env_cat_train dataset. With the type = "prob" argument in the predict() function, the model returns class probabilities for each of the 2260 observations:

  1. .pred_High-Med-risk = probability of being in the High-Med-risk category
  2. .pred_Low-risk_Othr = probability of being in the Low-risk_Othr category

# Fit the model on the training set
env_FEAT_fit <- env_FEAT_wf2 %>% # NEW
   fit(data = training(env_split))

# Example of a data frame containing actual and predicted values
pred_FEAT  <- predict(env_FEAT_fit, 
                           new_data = training(env_split), 
                           type = "prob")|>
   # combine with the original training data (labeled)
   bind_cols(training(env_split)) %>% 
   select(env_cat_f2,  # ACTUAL CLASS of the 1obs
          pred_high_risk = '.pred_High-Med-risk', # PROB 1obs is PREDICTED HIGH-MED-RISK 
          pred_low_risk = '.pred_Low-risk_Othr'   # PROB 1obs is PREDICTED LOW-RISK
   )  # 2260 rows

pred_FEAT_long <- pred_FEAT |> 
   pivot_longer(cols = c(pred_high_risk, pred_low_risk),
                names_to = "risk_type", values_to = "risk_value") # 4,520 rows

— [FIG] Assessing performance [env_FEAT_fit] on training set

Here the predictions are computed on the training portion of the split, i.e. training(env_split).

# Plot the predictions with boxplots and jittered points without duplicate legends
ggplot(pred_FEAT_long, aes(x = env_cat_f2, y = risk_value, fill = risk_type)) +
  geom_boxplot(alpha = 0.4, position = position_dodge(width = 0.8)) +
  #geom_jitter(alpha = 0.6, position = position_dodge(width = 0.8)) +  # Remove width and use position_dodge
  labs(title = "PREDICTED class distribution (y axis) v. ACTUAL class (x axis)",
        subtitle = "Model: Lasso Regression fitted on training data",
       x = "ACTUAL env. class",
       y = "Probabiity of PREDICTED env. class",
       fill = "Risk Type") +  # Set label for fill legend
  theme_minimal() +
  guides(color = "none")  # Suppress the color legend

The LOW-risk predictions also seem improved (at least on average, although with more dispersion).

Calculate performance metrics

train_FEAT_pred_probs <- predict(env_FEAT_fit, new_data = env_cat_train, type = "prob")
train_FEAT_pred_class <- predict(env_FEAT_fit, new_data = env_cat_train, type = "class")

train_FEAT_results <- bind_cols(env_cat_train, train_FEAT_pred_probs, train_FEAT_pred_class) |> 
   select(env_cat_f2,  # ACTUAL CLASS of the 1obs
          pred_high_risk = '.pred_High-Med-risk', # PROB 1obs is PREDICTED HIGH-MED-RISK 
          pred_low_risk = '.pred_Low-risk_Othr',   # PROB 1obs is PREDICTED LOW-RISK
          pred_class = '.pred_class')

# Confusion matrix
c_train_acc <- accuracy(train_FEAT_results, truth = env_cat_f2, estimate = pred_class)
# ROC AUC (needs probabilities + positive class specified)
c_train_roc_auc <- roc_auc(train_FEAT_results, truth = env_cat_f2, pred_high_risk)
# Sensitivity and Specificity
c_train_sens <- sens(train_FEAT_results, truth = env_cat_f2, estimate = pred_class)
c_train_spec <- spec(train_FEAT_results, truth = env_cat_f2, estimate = pred_class)

# confusion matrix
conf_mat(train_FEAT_results, truth = env_cat_f2, estimate = pred_class) |> 
   autoplot(type = "heatmap") +
   labs(title = "Confusion matrix for Lasso model",
        subtitle = "Training set")

# roc curve
roc_curve(train_FEAT_results, truth = env_cat_f2, pred_high_risk) |> 
   autoplot() +
   labs(title = "ROC curve for Lasso model (only x = pdo)",
        subtitle = "Training set")

8) Evaluation on test set

— Get predicted class and probabilities (test)

# Predict on the test set
pred_FEAT_test  <- stats::predict(env_FEAT_fit, 
                           new_data = testing(env_split), 
                           type = "prob")|># prob **high** and prob **low** for each observ based on model 
   bind_cols(testing(env_split)) %>% 
   select(env_cat_f2,  pred_high_risk = '.pred_High-Med-risk', pred_low_risk = '.pred_Low-risk_Othr')   # 970 rows

pred_FEAT_long_test <- pred_FEAT_test |>
   pivot_longer(cols = c(pred_high_risk, pred_low_risk),
                names_to = "risk_type", values_to = "risk_value") # 1,940 rows

— [FIG] Assessing performance on test set

# Plot the predictions with boxplots and jittered points without duplicate legends
ggplot(pred_FEAT_long_test, aes(x = env_cat_f2, y = risk_value, fill = risk_type)) +
  geom_boxplot(alpha = 0.4, position = position_dodge(width = 0.8)) +
  #geom_jitter(alpha = 0.6, position = position_dodge(width = 0.8)) +  # Remove width and use position_dodge
   labs(title = "PREDICTED class distribution (y axis) v. ACTUAL class (x axis)",
        subtitle = "Model: Lasso Regression fitted on test data",
       x = "ACTUAL env. class",
       y = "Probabiity of PREDICTED env. class",
       fill = "Risk Type") +  # Set label for fill legend
  theme_minimal() +
  guides(color = "none")  # Suppress the color legend

Calculate performance metrics

test_FEAT_pred_probs <- predict(env_FEAT_fit, new_data = env_cat_test, type = "prob")
test_FEAT_pred_class <- predict(env_FEAT_fit, new_data = env_cat_test, type = "class")

test_FEAT_results <- bind_cols(env_cat_test, test_FEAT_pred_probs, test_FEAT_pred_class) |> 
   select(env_cat_f2,  # ACTUAL CLASS of the 1obs
          pred_high_risk = '.pred_High-Med-risk', # PROB 1obs is PREDICTED HIGH-MED-RISK 
          pred_low_risk = '.pred_Low-risk_Othr',   # PROB 1obs is PREDICTED LOW-RISK
          pred_class = '.pred_class')

# Confusion matrix
c_test_acc <- accuracy(test_FEAT_results, truth = env_cat_f2, estimate = pred_class)
# ROC AUC (needs probabilities + positive class specified)
c_test_roc_auc <- roc_auc(test_FEAT_results, truth = env_cat_f2, pred_high_risk)
# Sensitivity and Specificity
c_test_sens <- sens(test_FEAT_results, truth = env_cat_f2, estimate = pred_class)
c_test_spec <- spec(test_FEAT_results, truth = env_cat_f2, estimate = pred_class)

# confusion matrix
conf_mat(test_FEAT_results, truth = env_cat_f2, estimate = pred_class) |> 
   autoplot(type = "heatmap") +
   labs(title = "Confusion matrix for Lasso model",
        subtitle = "testing set")

# roc curve
roc_curve(test_FEAT_results, truth = env_cat_f2, pred_high_risk) |> 
   autoplot() +
   labs(title = "ROC curve for Lasso model (only x = pdo)",
        subtitle = "testing set")

Recap

# table with accuracy, roc_auc, sensitivity, specificity
model_FEAT_metrics_recap <- tibble(
   metric = c("Accuracy", "ROC AUC", "Sensitivity", "Specificity"),
   train = c(c_train_acc$.estimate, 
             c_train_roc_auc$.estimate, 
             c_train_sens$.estimate, 
             c_train_spec$.estimate      )  ,
   test =  c(c_test_acc$.estimate, 
             c_test_roc_auc$.estimate, 
             c_test_sens$.estimate, 
             c_test_spec$.estimate      )  
   ) |> 
   mutate(across(c(train, test), ~ round(.x, 3))) |>
   kable() |> 
   kable_styling("striped", full_width = F) |> 
   # header merged column 
   kableExtra::add_header_above(c("C) Lasso Logistic model", "datasets" = 2), 
                                bold = TRUE, color = "black", background = "#D9EAD3")

model_FEAT_metrics_recap
Model performance metrics

C) Lasso Logistic model (datasets: train / test)

metric        train   test
Accuracy      0.792   0.787
ROC AUC       0.859   0.823
Sensitivity   0.902   0.903
Specificity   0.571   0.558

Compared to the previous model, there is a small improvement in false positives (low-risk projects predicted to be high risk), which are fewer than before. The model improved especially in the low-risk category: the probability of being classified as HIGH RISK is now below 50%, and the probability of being classified as LOW RISK is above 50%.

— Final fit with last_fit() and CONFUSION MATRIX on test set

To do this we fit the tuned workflow on the training set (the actual training phase) and evaluate it on the test set in a single step, using the tune::last_fit() function.

Let's use the updated workflow env_FEAT_wf2 (after determining the best model, the final fit on the entire training set is needed, and the result is then evaluated on the test set).

— [data prep]

# After fitting the best model (to the training set) --> evaluate it on the validation set
env_FEAT_final_fit <- tune::last_fit(
   env_FEAT_wf2, 
   split = env_split)

The performance metrics are very close to those we achieved on the training set (actually slightly better), which is a good sign that the model is robust: it performs well on both the training and the test (validation) sets.
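These metrics can be pulled directly from the last_fit() object; a minimal sketch (the name env_FEAT_final_metrics is just illustrative):

# Collect the test-set metrics computed by last_fit()
# (by default accuracy and roc_auc; recent tune versions may also report the Brier score)
env_FEAT_final_metrics <- tune::collect_metrics(env_FEAT_final_fit)
env_FEAT_final_metrics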

— [FIG] Visualize the confusion matrix

# Create a table for the confusion matrix counts
ML_final_fit_cm_p <- env_FEAT_final_fit %>%
  collect_predictions() %>%
  conf_mat(truth = env_cat_f2, estimate = .pred_class) %>%
  autoplot(type = "heatmap") +
  labs(
    title = "Confusion Matrix for Lasso Logistic Regression Model",
    x = "Predicted Class",
    y = "True Class",
    fill = "Count"
  ) +
  scale_fill_gradient(low = "#f2e8ea", high = "#964957") +  # Adjust color gradient for better contrast
  theme_minimal(base_size = 14) +                              # Set a clean theme with larger base text size
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),     # Center and bold the title
    #axis.text.x = element_text(angle = 45, hjust = 1),         # Angle x-axis text for readability
    legend.position = "right"                                  # Place the legend on the right
  )

ML_final_fit_cm_p

f_save_plot_obj <- function(plot_object, plot_obj_name) {
   # Save the plot object
   saveRDS(plot_object, here("analysis", "output", "figures", paste0(plot_obj_name, ".rds")))
}

f_save_plot_obj(ML_final_fit_cm_p, "ML_final_fit_cm_p")
Figure 1: Confusion matrix for the final model on the validation set
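To reuse the saved plot object elsewhere (e.g. in a report), it can be read back with readRDS(); a one-line sketch that mirrors the save path above:

# Read the saved ggplot object back in another document / session
ML_final_fit_cm_p <- readRDS(here("analysis", "output", "figures", "ML_final_fit_cm_p.rds"))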

On the validation set the errors are still imbalanced: 138 false positives versus 63 false negatives.
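The raw counts behind the heatmap can be printed directly with the same conf_mat() call as above, just without autoplot():

# Confusion-matrix counts on the test (validation) set
env_FEAT_final_fit %>%
   collect_predictions() %>%
   conf_mat(truth = env_cat_f2, estimate = .pred_class)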

— [FIG] Visualize the ROC AUC curve

Take the output of last_fit() (env_FEAT_final_fit) and use it to plot the ROC curve.

colnames(env_FEAT_final_fit)
# Extract predictions from the final fit object
# Extract the tibble from the list
env_FEAT_final_fit_pred <-  env_FEAT_final_fit$.predictions[[1]]
str(env_FEAT_final_fit_pred)

# Visualize the ROC curve 
env_FEAT_final_fit_pred %>% 
   roc_curve(truth = env_cat_f2, '.pred_High-Med-risk') %>% 
   autoplot() +
   labs(
      title = "ROC Curve for High-Med Risk Prediction",
      x = "1 - Specificity (False Positive Rate)",
      y = "Sensitivity (True Positive Rate)",
      caption = "logistic regression model on text (stopwords) + features"
   )

9) Interpret the model

— Inspecting what levels of the outcome are most difficult to estimate

# collect the predictions from the final model
env_FEAT_final_fit_feat <- env_FEAT_final_fit %>%
   collect_predictions() %>%
   bind_cols(env_cat_test)  %>%
   rename(env_cat_f2 = 'env_cat_f2...6') %>% 
   select ( -'env_cat_f2...32')

#preview the predictions
glimpse(env_FEAT_final_fit_feat)

I will then select the columns with the actual outcome (env_cat_f2), the predicted outcome (.pred_class), the original env_cat_f level, and the pdo text, and filter the misclassified observations to inspect them separately.

env_FEAT_final_fit_feat %>%
   filter(env_cat_f2 != .pred_class ) %>%  
   select(env_cat_f2, .pred_class,  env_cat_f, pdo, proj_id) 

Let's inspect the original category (env_cat_f) of the projects that are actually env_cat_f2 == 'High-Med-risk' but falsely predicted as .pred_class == 'Low-risk_Othr': not surprisingly, most of them are at the medium risk level, which makes sense.

env_FEAT_final_fit_feat %>%
   filter(env_cat_f2 == 'High-Med-risk' & .pred_class == 'Low-risk_Othr') %>%  
   select(env_cat_f2, .pred_class,  env_cat_f, pdo, proj_id) %>% 
   count(env_cat_f )

levels(env_FEAT_final_fit_feat$env_cat_f2)
#[1] "High-Med-risk" "Low-risk_Othr"

— Inspecting the most important features for predicting the outcome

  • using the extract_fit_parsnip() function from the workflows package to extract the model object from the workflow.
  • the estimates are log-odds: the log of the probability of the “Low-risk_Othr” level divided by the probability of the “High-Med-risk” level.
    • Positive coefficient: indicates an increased likelihood of being in the “Low-risk_Othr” category compared to the “High-Med-risk” category.
    • Negative coefficient: indicates an increased likelihood of being in the “High-Med-risk” category compared to the “Low-risk_Othr” category.
  • the odds ratio (exponentiated coefficient) expresses the same information on the odds scale.
    • Odds ratio > 1: indicates that the predictor increases the likelihood of the “Low-risk_Othr” outcome.
    • Odds ratio < 1: indicates that the predictor decreases the likelihood of “Low-risk_Othr” (i.e. increases the likelihood of “High-Med-risk”).
# Extract the estimate (log-odds)
env_FEAT_final_fit_features <- extract_fit_parsnip(env_FEAT_final_fit) %>% 
   tidy() %>% 
   # Calculate the exponentiated estimate
   mutate(odds = exp(estimate),
          probability = odds / (1 + odds))  

#  tibble: 206 × 5
# term                   estimate penalty  odds probability
# <chr>                     <dbl>   <dbl> <dbl>       <dbl>
# (Intercept)            -1.26    0.00599 0.284      0.221 
# tfidf_pdo_access       -0.00452 0.00599 0.995      0.499 
# tfidf_pdo_activities    0.786   0.00599 2.19       0.687 
  • tfidf_pdo_activities est 0.786 | odds 2.19 | prob 0.687 (associated with a LOW risk)
  • tfidf_pdo_access est -0.004 | odds 0.995 | prob 0.499 (associated with a HIGH risk)
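As a quick arithmetic check of the conversion used above (values taken from the printed coefficients):

# Manual check of the odds / probability conversion for tfidf_pdo_activities
exp(0.786)                     # ~ 2.19  odds of "Low-risk_Othr" vs "High-Med-risk"
exp(0.786) / (1 + exp(0.786))  # ~ 0.687 implied probability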

— Extract the most important features

A quick way to extract the most important features for predicting each outcome is to use the vi() function from {vip}.

  • The vi() function extracts variable importance scores; for a glmnet model these are model-based (the absolute value of the coefficients at the chosen penalty), not permutation importance.
library(vip)

# Extract the most important features
env_FEAT_var_importance <-  extract_fit_parsnip(env_FEAT_final_fit) %>% 
   vip::vi() %>%
   # note: the sign refers to the modeled ("Low-risk_Othr") level, which can feel counterintuitive
   mutate(note = case_when(
       Sign  ==  "POS" ~ "More likely to be in Low-risk_Othr",
       Sign  ==  "NEG" ~ "More likely to be in High-Med-risk",    
      TRUE ~ "NEU"
   )) 

— [FIG] Plot the most important features

# Recode variable and sign 
var_importance_tbl <- env_FEAT_var_importance %>% 
   mutate(Feature =  str_remove(Variable, "tfidf_"),
          EnvRiskOutcome = case_when(
             Sign == "NEG" ~ "High-Med-risk",
             Sign == "POS" ~ "Low-risk_Othr") ) %>% 
   select(Feature, Importance,  EnvRiskOutcome)  

count(var_importance_tbl, EnvRiskOutcome) # check how the features split between the two outcome labels
# Plot the most important features
ML_feature_importance_p <- var_importance_tbl %>%
   slice_max(Importance, n = 50) %>%
   ggplot(aes(x = reorder(Feature, Importance), y = Importance, color = EnvRiskOutcome)) +
   geom_point() +
   coord_flip() +
   facet_wrap(~EnvRiskOutcome, ncol = 2, scales = "free_y") +
   labs(
      title = glue("Most influential features for predicting Environmental Risk"),
      subtitle = "LASSO Logistic Regression model on text + metadata tags \n(Importance = absolute value of the logistic regression coefficients)",
      #caption = "(Importance of each feature calculated as the absolute value of the logistic regression coefficients)",
      x = "",
      y = "",
      fill = ""
   ) +
   lulas_theme + 
   theme(
     plot.title = element_text(hjust = 0.5, face = "bold")
   ) +
   guides(color = "none")

ML_feature_importance_p

The feature importance plot highlights the top 50 predictors of the binary environmental risk category, ranked by their influence in the LASSO logistic regression model. For better readability, the predictors are split according to the risk level they predict. It should be no surprise that words in the PDO text (the variables starting with pdo_*) are the most important predictors given the data. Still, some of the other predictors also matter, such as sector_f_URBAN (left panel) or regionname and sector_f_FINANCIAL (right panel).

Each facet groups features by environmental risk outcome, allowing a comparison of which factors contribute most to each category. Features with higher importance values, like [mention a few key features], play a significant role in predicting environmental risk, offering insights for targeted risk assessment and decision-making.

# show plot
ML_feature_importance_p
# save as rds
f_save_plot_obj(ML_feature_importance_p, "ML_feature_importance_p")

___ MODEL D) [🫤] NB w pdo + tf-idf + stopwords + 3 fct

0) Prep for split based on outcome [using projs_train2]

(same)

1) Split data in train/test [based on env_cat]

(same)

2) Pre-processing and featurization (recipes)

3) Model specification

— Model specification [new!]

Let’s use a naive Bayes model, which is available in the tidymodels package discrim. One of the main advantages is its ability to handle a large number of features, such as those we deal with when using word count methods. Here we have only kept the 1000 most frequent tokens, but we could have kept more tokens and a naive Bayes model would still be able to handle such predictors well. For now, we will limit the model to a moderate number of tokens.
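As a reminder, the 1000-token cap comes from the shared recipe; below is a simplified, text-only sketch of how that cap is typically set with textrecipes (the actual env_FEAT_recipe, defined for Model C and reused here, also includes the metadata features):

# Simplified, text-only sketch of the token cap in the preprocessing recipe
# (illustrative only; the real env_FEAT_recipe is defined in Model C and reused here)
recipes::recipe(env_cat_f2 ~ pdo, data = env_cat_train) %>%
   textrecipes::step_tokenize(pdo) %>%
   textrecipes::step_stopwords(pdo) %>%
   textrecipes::step_tokenfilter(pdo, max_tokens = 1000) %>%  # keep the 1000 most frequent tokens
   textrecipes::step_tfidf(pdo)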

# needed for naive Bayes
library(discrim)

# Create a model specification
env_NB_spec <-
   # naive Bayes classifier for binary outcomes
   parsnip::naive_Bayes(
      # optional tunable parameter 
      #smoothness = tune(), # kernel smoothness (engine default = 1)
   ) %>%
   # Specify the mode of the model
   set_mode("classification") %>%
   # Specify the engine
   set_engine("naivebayes")
 
# Preview
env_NB_spec

4) Model training (workflows)

— Create workflow (NEW!)

New workflow with

  • same preprocessing steps as Model C (env_FEAT_recipe)
  • NEW model specification (env_NB_spec): Naive Bayes model instead of the logistic regression model.
# Create a workflow
env_NB_wf <- workflows::workflow() %>%
   add_recipe(env_FEAT_recipe) %>%  # same RECIPE
   add_model(env_NB_spec) # NEW MODEL!
# Preview
env_NB_wf

5) Model evaluation / tweaking

— Fit the classification model to the training set [env_NB_fit] (NEW!)

# Fit the model to the training set
env_NB_fit <- env_NB_wf %>%
   #add_model(env_NB_spec) %>%
   fit(data = env_cat_train)

— (NO tuning!) Resamples

Let’s use resampling to estimate the performance of the naive Bayes classification model we just fit. We can do this using resampled data sets built from the training set. Let’s create 10-fold cross-validation sets, and use these resampled sets for performance estimates.

— K-fold cross-validation sets

  • [env_vfold, env_NB_tune] (same folds as before, new resampling results)

The env_NB_tune object contains the resampling results for each fold. We can see them by calling the collect_metrics() function on the env_NB_tune object.

set.seed(123)

env_vfold <- rsample::vfold_cv(env_cat_train, v = 10) 

# Fit the model to the resampled folds
env_NB_tune <- fit_resamples(
   object = env_NB_wf, # changed ! 
   resamples = env_vfold,
   control = control_resamples(save_pred = TRUE),
  # metrics 
  metrics = metric_set(roc_auc, accuracy, sensitivity, specificity)
)

# preview
env_NB_tune

We can extract the relevant information using collect_metrics() and collect_predictions().

# Get the average performance metrics across all resampled folds
env_NB_tune_metrics <- collect_metrics(env_NB_tune)

env_NB_tune_metrics

— Visualize the resamples’ performance metrics

Note: autoplot() is designed to work with tuning results, so it no longer works here: the NB model has no tuned penalty parameter.

# !!!!!!!!!!!!! NOT ANYMORE !!!!!!!!!!!!! 
# visualize the results
#autoplot(env_NB_tune)
env_NB_tune_metrics %>%
  ggplot(aes(x = .metric, y = mean, fill = .metric)) +
  geom_col(show.legend = FALSE) +
  #facet_wrap(~.metric, scales = "free_y") +
  labs(
    title = "Naive Bayes cross-validated metrics",
    subtitle = "Obtained as mean from 10-fold cross-validation",
    y = "Mean performance across folds",
    x = NULL
  )
  • Accuracy (overall correct predictions)
  • ROC AUC (discrimination ability)
  • Sensitivity (true positive rate)
  • Specificity (true negative rate)

— Visualize the ROC AUC curve by fold

ROC curve is a way to visualize the trade-off between sensitivity and specificity at different thresholds.

  • each point on the curve represents a different threshold for classifying an observation as positive ("High-Med-risk") or negative.
    • x-axis: 1 – specificity (false positive rate)
    • y-axis: sensitivity (true positive rate)

The area under the curve (AUC) is a single number that summarizes the performance of the model across all thresholds. AUC values range from 0 to 1, with higher values indicating better model performance.

env_NB_tune_predictions <- collect_predictions(env_NB_tune)

# Visualize the ROC curve 
env_NB_tune_predictions %>% 
   # predictions grouped by fold
   group_by(id) %>% 
   # roc_curve() needs class probabilities, not the hard class prediction,
   # so we pass the probability column of the "positive" class
   yardstick::roc_curve(
      truth = env_cat_f2,    # ACTUAL class label
      '.pred_High-Med-risk', # PREDICTED probability of the positive class
      ) %>% 
   autoplot() +  # one curve per fold
   labs(
      title = "ROC Curve for Naive Bayes model",
      x = "1 - Specificity \n(FP Rate)",
      y = "Sensitivity \n(TP Rate)",
      caption = "Naive Bayes model on text + features"
   )

The area under each of these curves is the roc_auc metric we have computed. (If the curve was close to the diagonal line, then the model’s predictions would be no better than random guessing.)
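To confirm, the per-fold AUC values corresponding to these curves can be computed from the same collected predictions; a minimal sketch:

# Per-fold AUC for the naive Bayes resamples
env_NB_tune_predictions %>%
   group_by(id) %>%
   yardstick::roc_auc(truth = env_cat_f2, '.pred_High-Med-risk')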

— [FIG] Visualize the confusion matrix

# Plot the confusion matrix
conf_mat_resampled(env_NB_tune, tidy = FALSE) %>% 
   autoplot(type = "heatmap")

  • Results differ from the logistic model.
  • It does very well on true positives (high risk), but very badly on true negatives (low risk).

6) Hyperparameter tuning

Not done here: we kept the naive Bayes hyperparameters (smoothness, Laplace) at their default values.
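If we wanted to tune them, naive_Bayes() does expose tunable smoothness and Laplace arguments; a hedged sketch (not run, object names are illustrative) of what that would look like with the existing folds:

# Hedged sketch (not run): tuning the kernel smoothness of the naive Bayes model
env_NB_smooth_spec <- parsnip::naive_Bayes(smoothness = tune()) %>%
   set_mode("classification") %>%
   set_engine("naivebayes")

env_NB_smooth_wf <- workflow() %>%
   add_recipe(env_FEAT_recipe) %>%  # same recipe as Model C
   add_model(env_NB_smooth_spec)

# env_NB_smooth_tune <- tune_grid(env_NB_smooth_wf, resamples = env_vfold, grid = 5)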

7) Final fit on training set

Fit the final model

# Fit the model to the training set
env_NB_fit <- env_NB_wf %>% 
   workflows::fit(data = env_cat_train)

— Extract…

# Extract the fitted model as a parsnip model object
nb_model <- env_NB_fit |> 
   workflows::extract_fit_parsnip()
# model
class(nb_model$fit)

# structure of model object
str(nb_model$fit, max.level = 2)

# Extract Preprocessor
env_NB_fit |> 
   workflows::extract_preprocessor()

# Extract stuff on the model
#extract_fit_parsnip(env_NB_fit)$fit$....   

— Extract coefficients

NOTE: Naive Bayes is a probabilistic model, and the {naivebayes} package doesn’t expose the internal details (e.g., class-conditional probabilities) in a tidy-friendly format out of the box.

# # Extract the coefficients
# env_NB_fit %>% 
#    #workflows::extract_fit_parsnip() %>% 
#    broom::tidy() %>% 
#    dplyr::filter(term != "(Intercept)") %>% 
#    dplyr::arrange(desc(abs(estimate))) %>% 
#    dplyr::slice(1:20) # top 10
# a named list of 148 predictors, each storing class-conditional densities for "High-Med-risk" and "Low-risk_Othr".
nb_model$fit$tables
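To inspect one of those class-conditional tables we can index into the list; a minimal sketch (for the numeric tf-idf predictors these are typically per-class means and standard deviations):

# Peek at the class-conditional parameters of a single predictor
names(nb_model$fit$tables)[1:5]  # first few predictor names
nb_model$fit$tables[[1]]         # per-class distribution parameters for that predictor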

— Get predicted class and probabilities (train)

Using env_NB_fit

# where 
# env_NB_fit <- env_NB_wf %>% 
#    workflows::fit(data = env_cat_train)

# Predicted values 
pred_NB_train <- predict(env_NB_fit, 
                           new_data = training(env_split), 
                           type = "class")|>
   # combine with the original training data (labeled)
   bind_cols(training(env_split)) %>% 
   select(proj_id,
          env_cat_f2,  # ACTUAL CLASS of the 1obs
          pred_class = '.pred_class',  
   )  # 2260 rows

# Convert to long 
pred_NB_train_long <- pred_NB_train |> 
   pivot_longer( cols = c(env_cat_f2, pred_class),
                names_to = "actual_OR_predicted", values_to = "class") # 4,520 rows

— 🟨 [FIG] Assessing performance [env_NB_fit] on training set

# plot actual v predicted class 
ggplot(pred_NB_train_long, aes(x = actual_OR_predicted, y = class)) +
  geom_boxplot(aes(fill = actual_OR_predicted), alpha = 0.4) +
  geom_jitter(aes(color = actual_OR_predicted), alpha = 0.6) +
  labs(title = "PREDICTED class distribution (y axis) v. ACTUAL class (x axis)",
       subtitle = "Model: Naive Bayes fitted on training data",
       x = "ACTUAL env. class",
       y = "PREDICTED env. class",
       fill = "Risk Type") +  # Set label for fill legend
  theme_minimal() +
  guides(color = "none")  # Suppress the color legend
Predicted class distribution (y axis) v. Actual class (x axis)

Calculate performance metrics

train_NB_pred_probs <- predict(env_NB_fit, new_data = env_cat_train, type = "prob")
train_NB_pred_class <- predict(env_NB_fit, new_data = env_cat_train, type = "class")

train_NB_results <- bind_cols(env_cat_train, train_NB_pred_probs, train_NB_pred_class) |> 
   select(proj_id,
          env_cat_f2,  # ACTUAL CLASS of the 1obs
          pred_high_risk = '.pred_High-Med-risk', # PROB 1obs is PREDICTED HIGH-MED-RISK 
          pred_low_risk = '.pred_Low-risk_Othr',   # PROB 1obs is PREDICTED LOW-RISK
          pred_class = '.pred_class')

# Accuracy
d_train_acc <- accuracy(train_NB_results, truth = env_cat_f2, estimate = pred_class)
# ROC AUC (needs probabilities + positive class specified)
d_train_roc_auc <- roc_auc(train_NB_results, truth = env_cat_f2, pred_high_risk)
# Sensitivity  
d_train_sens <- sens(train_NB_results, truth = env_cat_f2, estimate = pred_class)
# Specificity
d_train_spec <- spec(train_NB_results, truth = env_cat_f2, estimate = pred_class)

# confusion matrix
conf_mat(train_NB_results, truth = env_cat_f2, estimate = pred_class) |> 
   autoplot(type = "heatmap") +
   labs(title = "Confusion matrix for Naive Bayes model",
        subtitle = "Training set")

# roc curve
roc_curve(train_NB_results, truth = env_cat_f2, pred_high_risk) |> 
   autoplot() +
   labs(title = "ROC curve for Naive Bayes model (only x = pdo)",
        subtitle = "Training set")

8) Evaluation on test set

— Get predicted class and probabilities (test)

# Predict on the test set
pred_NB_test  <- stats::predict(env_NB_fit, 
                           new_data = testing(env_split), 
                           type = "class")|>
   bind_cols(testing(env_split)) %>% 
   select(proj_id,
          env_cat_f2,  # ACTUAL CLASS of the 1obs
          pred_class = '.pred_class',  
   )  # 970 rows

# Convert to long
pred_NB_test_long <- pred_NB_test |> 
   pivot_longer( cols = c(env_cat_f2, pred_class),
                names_to = "actual_OR_predicted", values_to = "class") # 1,940 rows

— [FIG] Assessing performance on test set

# plot actual v predicted class
ggplot(pred_NB_test_long, aes(x = actual_OR_predicted, y = class)) +
  geom_boxplot(aes(fill = actual_OR_predicted), alpha = 0.4) +
  geom_jitter(aes(color = actual_OR_predicted), alpha = 0.6) +
  labs(title = "PREDICTED class distribution (y axis) v. ACTUAL class (x axis)",
       subtitle = "Model: Naive Bayes fitted on test data",
       x = "ACTUAL env. class",
       y = "PREDICTED env. class",
       fill = "Risk Type") +  # Set label for fill legend
  theme_minimal() +
  guides(color = "none")  # Suppress the color legend
Predicted class distribution (y axis) v. Actual class (x axis)

Calculate performance metrics

test_NB_pred_probs <- predict(env_NB_fit, new_data = env_cat_test, type = "prob")
test_NB_pred_class <- predict(env_NB_fit, new_data = env_cat_test, type = "class")

test_NB_results <- bind_cols(env_cat_test, test_NB_pred_probs, test_NB_pred_class) |> 
   select(proj_id,
          env_cat_f2,  # ACTUAL CLASS of the 1obs
          pred_high_risk = '.pred_High-Med-risk', # PROB 1obs is PREDICTED HIGH-MED-RISK 
          pred_low_risk = '.pred_Low-risk_Othr',   # PROB 1obs is PREDICTED LOW-RISK
          pred_class = '.pred_class')

# Accuracy
d_test_acc <- accuracy(test_NB_results, truth = env_cat_f2, estimate = pred_class)
# ROC AUC (needs probabilities + positive class specified)
d_test_roc_auc <- roc_auc(test_NB_results, truth = env_cat_f2, pred_high_risk)
# Sensitivity
d_test_sens <- sens(test_NB_results, truth = env_cat_f2, estimate = pred_class)
# Specificity
d_test_spec <- spec(test_NB_results, truth = env_cat_f2, estimate = pred_class)

# confusion matrix
conf_mat(test_NB_results, truth = env_cat_f2, estimate = pred_class) |> 
   autoplot(type = "heatmap") +
   labs(title = "Confusion matrix for Naive Bayes model",
        subtitle = "testing set")

# roc curve
roc_curve(test_NB_results, truth = env_cat_f2, pred_high_risk) |> 
   autoplot() +
   labs(title = "ROC curve for Naive Bayes model (only x = pdo)",
        subtitle = "testing set")

Recap

# table with accuracy, roc_auc, sensitivity, specificity
model_NB_metrics_recap <- tibble(
   metric = c("Accuracy", "ROC AUC", "Sensitivity", "Specificity"),
   train = c(d_train_acc$.estimate, 
             d_train_roc_auc$.estimate, 
             d_train_sens$.estimate, 
             d_train_spec$.estimate      )  ,
   test =  c(d_test_acc$.estimate, 
             d_test_roc_auc$.estimate, 
             d_test_sens$.estimate, 
             d_test_spec$.estimate      )  
   ) |> 
   mutate(across(c(train, test), ~ round(.x, 3))) |>
   kable() |> 
   kable_styling("striped", full_width = F) |> 
   # header merged column 
   kableExtra::add_header_above(c("D) Naive Bayes model", "datasets" = 2), 
                                bold = TRUE, color = "black", background = "#D9EAD3")

model_NB_metrics_recap
Model performance metrics

D) Naive Bayes model (datasets: train / test)

metric        train   test
Accuracy      0.688   0.671
ROC AUC       0.900   0.735
Sensitivity   0.998   0.988
Specificity   0.065   0.052
model_metrics_recap
model_stop_metrics_recap
model_FEAT_metrics_recap
model_NB_metrics_recap

Compare all 4 models

# table with accuracy, roc_auc, sensitivity, specificity
model_metrics_recap <- tibble(
   metric = c("Accuracy", "ROC AUC", "Sensitivity", "Specificity"),
   test_a =  c(a_test_acc$.estimate, 
               a_test_roc_auc$.estimate, 
               a_test_sens$.estimate, 
               a_test_spec$.estimate      ),
   test_b =  c(b_test_acc$.estimate,
               b_test_roc_auc$.estimate, 
               b_test_sens$.estimate, 
               b_test_spec$.estimate      ),
   test_c =  c(c_test_acc$.estimate, 
               c_test_roc_auc$.estimate, 
               c_test_sens$.estimate, 
               c_test_spec$.estimate      ),
   test_d =  c(d_test_acc$.estimate,
               d_test_roc_auc$.estimate, 
               d_test_sens$.estimate, 
               d_test_spec$.estimate      )
   ) |>
   mutate(across(c(test_a, test_b, test_c, test_d), ~ round(.x, 3))) 
## identify col WITH max per row using cell_spec()
# model_metrics_highl <- c(
#    which.max(model_metrics_recap[1, 2:5]),
#    which.max(model_metrics_recap[2, 2:5]),
#    which.max(model_metrics_recap[3, 2:5]),
#    which.max(model_metrics_recap[4, 2:5])
# ) 


# table with accuracy, roc_auc, sensitivity, specificity
models_metrics_recap_k <-  model_metrics_recap |>
   kable(escape = FALSE) |> 
   kable_styling("striped", full_width = F) |> 
   # header merged column 
   kableExtra::add_header_above(c(
      "Perf metric (test)",
      "A) Lasso Logistic \n(pdo)",  
      "B) Lasso Logistic \n(pdo) + stopwds",
      "C) Lasso Logistic \n(pdo+feat) + stopwds",
      "D) Naive Bayes \n(pdo+feat) + stopwds"), 
      bold = TRUE, color = "black", background = "#D9EAD3")   
saveRDS(models_metrics_recap_k, here("analysis", "output", "tables", "models_metrics_recap_k.rds"))
quarto render analysis/02a_WB_project_pdo_feat_class_envcat.qmd --to html
open ./docs/analysis/02a_WB_project_pdo_feat_class_envcat.html