15  Software for modeling

https://www.tmwr.org/software-modeling

15.1 FUNDAMENTALS FOR MODELING SOFTWARE

https://www.tmwr.org/software-modeling#fundamentals-for-modeling-software

15.1.1 R

15.1.2 Tidyverse

15.2 TYPES OF MODELS

https://www.tmwr.org/software-modeling#model-types

15.2.1 DESCRIPTIVE MODELS

15.2.2 INFERENTIAL MODELS

15.2.3 PREDICTIVE MODELS

15.3 CONNECTIONS BETWEEN TYPES OF MODELS

https://www.tmwr.org/software-modeling#connections-between-types-of-models

If a model has limited fidelity to the data, the inferences generated by the model should be highly suspect. In other words, statistical significance may not be sufficient proof that a model is appropriate.
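
As a minimal illustration (simulated data, not from the book): a straight-line fit to clearly nonlinear data can have a highly significant slope even though residual diagnostics show the model misrepresents the data.

set.seed(123)
x <- seq(0, 10, length.out = 100)
# A trend with a systematic nonlinear component plus noise.
y <- sin(x) + x + rnorm(100, sd = 0.3)

linear_fit <- lm(y ~ x)
summary(linear_fit)$coefficients  # the slope's p-value is tiny ("significant")
plot(x, resid(linear_fit))        # yet the residuals show clear structure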

15.4 SOME TERMINOLOGY

https://www.tmwr.org/software-modeling#model-terminology

  • Regression predicts a numeric outcome.

  • Classification predicts an outcome drawn from an ordered or unordered set of qualitative values. (A short sketch contrasting the two follows this list.)

  • Exploratory data analysis (EDA): Initially there is a back and forth between numerical analysis and data visualization (represented in Figure 1.2) where different discoveries lead to more questions and data analysis side-quests to gain more understanding.

  • Feature engineering: The understanding gained from EDA results in the creation of specific model terms that make it easier to accurately model the observed data. This can include complex methodologies (e.g., PCA) or simpler features (e.g., the ratio of two predictors). Chapter 8 focuses entirely on this important step; a recipes-based sketch follows this list.

  • Model tuning and selection (large circles with alternating segments): A variety of models are generated and their performance is compared. Some models require parameter tuning, in which structural parameters must be specified or optimized. The alternating segments within the circles signify the repeated data splitting used during resampling (see Chapter 10); a one-line resampling sketch also follows this list.

  • Model evaluation: During this phase of model development, we assess the model’s performance metrics, examine residual plots, and conduct other EDA-like analyses to understand how well the models work. In some cases, formal between-model comparisons (Chapter 11) help you understand whether any differences in models are within the experimental noise.
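
To make the regression/classification distinction concrete, here is a minimal sketch using parsnip (the formulas and built-in mtcars data are illustrative assumptions, not the book's example):

library(tidymodels)

# Regression: predict a numeric outcome (mpg).
reg_fit <- linear_reg() %>% 
  set_engine("lm") %>% 
  fit(mpg ~ wt + hp, data = mtcars)

# Classification: predict a qualitative outcome (transmission type).
cars <- dplyr::mutate(mtcars, am = factor(am, labels = c("auto", "manual")))
cls_fit <- logistic_reg() %>% 
  set_engine("glm") %>% 
  fit(am ~ wt + hp, data = cars)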

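A sketch of the feature engineering ideas above with the recipes package; the data frame train_data and all of its column names are hypothetical stand-ins:

library(recipes)

# Hypothetical training data with the column types assumed below.
train_data <- tibble::tibble(
  date = as.Date("2020-01-01") + 0:9,
  rides = rnorm(10, mean = 10),
  pred_a = rnorm(10), pred_b = rnorm(10) + 5,
  station_1 = rnorm(10), station_2 = rnorm(10), station_3 = rnorm(10)
)

rec <- recipe(rides ~ ., data = train_data) %>% 
  # Encode the date as day-of-the-week and year terms.
  step_date(date, features = c("dow", "year")) %>% 
  # A simpler engineered feature: the ratio of two predictors.
  step_ratio(pred_a, denom = denom_vars(pred_b)) %>% 
  # PCA to compress a set of correlated predictors.
  step_pca(starts_with("station_"), num_comp = 2)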
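For the repeated data splitting mentioned in the tuning bullet, a one-line rsample sketch (using the built-in mtcars data purely as a stand-in):

library(rsample)

# Ten-fold cross-validation: ten resampled analysis/assessment splits.
folds <- vfold_cv(mtcars, v = 10)
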
15.5 HOW DOES MODELING FIT INTO THE DATA ANALYSIS PROCESS?

https://www.tmwr.org/software-modeling#model-phases

Figure 15.1 (data-science-model): The data science process.
Figure 15.2 (modeling-process): A schematic for the typical modeling process.

As an example, M. Kuhn and Johnson (2020) use data to model the daily ridership of Chicago’s public train system using predictors such as the date, the previous ridership results, the weather, and other factors. Table 1.1 shows an approximation of these authors’ hypothetical inner monologue when analyzing these data and eventually selecting a model with sufficient performance.

Table 1.1: Hypothetical inner monologue of a model developer.

Thoughts | Activity
The daily ridership values between stations are extremely correlated. | EDA
Weekday and weekend ridership look very different. | EDA
One day in the summer of 2010 has an abnormally large number of riders. | EDA
Which stations had the lowest daily ridership values? | EDA
Dates should at least be encoded as day-of-the-week and year. | Feature Engineering
Maybe PCA could be used on the correlated predictors to make it easier for the models to use them. | Feature Engineering
Hourly weather records should probably be summarized into daily measurements. | Feature Engineering
Let’s start with simple linear regression, K-nearest neighbors, and a boosted decision tree. | Model Fitting
How many neighbors should be used? | Model Tuning
Should we run a lot of boosting iterations or just a few? | Model Tuning
How many neighbors seemed to be optimal for these data? | Model Tuning
Which models have the lowest root mean squared errors? | Model Evaluation
Which days were poorly predicted? | EDA
Variable importance scores indicate that the weather information is not predictive. We’ll drop them from the next set of models. | Model Evaluation
It seems like we should focus on a lot of boosting iterations for that model. | Model Evaluation
We need to encode holiday features to improve predictions on (and around) those dates. | Feature Engineering
Let’s drop KNN from the model list. | Model Evaluation

The following code reads the Chicago ridership data used in this example and condenses it to one entry per station and day:

library(tidyverse)
library(lubridate)

# Read directly from the Chicago data portal, or point `url` at a local copy.
url <- "https://data.cityofchicago.org/api/views/5neh-572f/rows.csv?accessType=DOWNLOAD&bom=true&format=true"
# url <- "data/CTA_-_Ridership_-__L__Station_Entries_-_Daily_Totals.csv"

all_stations <- 
  # Step 1: Read in the data.
  read_csv(url) %>% 
  # Step 2: Keep only the station, date, and rides columns; rename stationname.
  dplyr::select(station = stationname, date, rides) %>% 
  # Step 3: Convert the character date field to a date encoding.
  # Also, put the data in units of 1K rides.
  mutate(date = mdy(date), rides = rides / 1000) %>% 
  # Step 4: Summarize the multiple records for a station-day using the maximum.
  group_by(date, station) %>% 
  summarize(rides = max(rides), .groups = "drop")
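
A quick sanity check of the result (the exact row count depends on when the data are downloaded):

dplyr::glimpse(all_stations)
# Expect three columns: station <chr>, date <date>, rides <dbl>.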