```r
library(tidyverse)
library(lubridate)

url <- "https://data.cityofchicago.org/api/views/5neh-572f/rows.csv?accessType=DOWNLOAD&bom=true&format=true"
url <- "data/CTA_-_Ridership_-__L__Station_Entries_-_Daily_Totals.csv"

all_stations <-
  # Step 1: Read in the data.
  read_csv(url) %>%
  # Step 2: filter columns and rename stationname
  dplyr::select(station = stationname, date, rides) %>%
  # Step 3: Convert the character date field to a date encoding.
  # Also, put the data in units of 1K rides
  mutate(date = mdy(date), rides = rides / 1000) %>%
  # Step 4: Summarize the multiple records using the maximum.
  group_by(date, station) %>%
  summarize(rides = max(rides), .groups = "drop")
```
15 Software for modeling
15.1 FUNDAMENTALS FOR MODELING SOFTWARE
https://www.tmwr.org/software-modeling#fundamentals-for-modeling-software
15.1.1 R
15.1.2 Tidyverse
15.2 TYPES OF MODELS
15.2.1 DESCRIPTIVE MODELS
15.2.2 INFERENTIAL MODELS
15.2.3 PREDICTIVE MODELS
15.3 CONNECTIONS BETWEEN TYPES OF MODELS
https://www.tmwr.org/software-modeling#connections-between-types-of-models
If a model has limited fidelity to the data, the inferences generated by the model should be highly suspect. In other words, statistical significance may not be sufficient proof that a model is appropriate.
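The point above can be illustrated with a small simulated example (my own sketch, not from the book): a predictor can be statistically significant while the model still has very limited fidelity to the data.

```r
# Simulate a weak signal buried in a lot of noise.
set.seed(123)
n <- 1000
x <- rnorm(n)
y <- 0.25 * x + rnorm(n, sd = 2)

fit  <- lm(y ~ x)
summ <- summary(fit)

# The slope is statistically significant (small p-value) ...
summ$coefficients["x", "Pr(>|t|)"]
# ... yet the model explains only a tiny fraction of the variance,
# so inferences drawn from it deserve extra scrutiny.
summ$r.squared
```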
15.4 SOME TERMINOLOGY
Regression: predicts a numeric outcome.

Classification: predicts an outcome that is an ordered or unordered set of qualitative values.

Exploratory data analysis (EDA): Initially there is a back and forth between numerical analysis and data visualization (represented in Figure 1.2) where different discoveries lead to more questions and data analysis side-quests to gain more understanding.

Feature engineering: The understanding gained from EDA results in the creation of specific model terms that make it easier to accurately model the observed data. This can include complex methodologies (e.g., PCA) or simpler features (using the ratio of two predictors). Chapter 8 focuses entirely on this important step.

Model tuning and selection (large circles with alternating segments): A variety of models are generated and their performance is compared. Some models require parameter tuning in which some structural parameters must be specified or optimized. The alternating segments within the circles signify the repeated data splitting used during resampling (see Chapter 10).

Model evaluation: During this phase of model development, we assess the model’s performance metrics, examine residual plots, and conduct other EDA-like analyses to understand how well the models work. In some cases, formal between-model comparisons (Chapter 11) help you understand whether any differences in models are within the experimental noise.
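The regression/classification distinction defined above can be sketched with a toy base-R example (my own illustration on built-in data sets, not from the book):

```r
# Regression: predict a numeric outcome (stopping distance from speed).
reg_fit <- lm(dist ~ speed, data = cars)
predict(reg_fit, newdata = data.frame(speed = 10))  # a numeric prediction

# Classification: predict a qualitative outcome. A logistic regression
# handles a two-class version of the iris species problem.
iris2 <- subset(iris, Species != "setosa")
iris2$Species <- droplevels(iris2$Species)
cls_fit <- glm(Species ~ Petal.Length + Petal.Width,
               data = iris2, family = binomial)
head(predict(cls_fit, type = "response"))  # class probabilities in [0, 1]
```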
15.5 HOW DOES MODELING FIT INTO THE DATA ANALYSIS PROCESS?
As an example, M. Kuhn and Johnson (2020) use data to model the daily ridership of Chicago’s public train system using predictors such as the date, the previous ridership results, the weather, and other factors. Table 1.1 shows an approximation of these authors’ hypothetical inner monologue when analyzing these data and eventually selecting a model with sufficient performance.
| Thoughts | Activity |
|---|---|
| The daily ridership values between stations are extremely correlated. | EDA |
| Weekday and weekend ridership look very different. | EDA |
| One day in the summer of 2010 has an abnormally large number of riders. | EDA |
| Which stations had the lowest daily ridership values? | EDA |
| Dates should at least be encoded as day-of-the-week and year. | Feature Engineering |
| Maybe PCA could be used on the correlated predictors to make it easier for the models to use them. | Feature Engineering |
| Hourly weather records should probably be summarized into daily measurements. | Feature Engineering |
| Let’s start with simple linear regression, K-nearest neighbors, and a boosted decision tree. | Model Fitting |
| How many neighbors should be used? | Model Tuning |
| Should we run a lot of boosting iterations or just a few? | Model Tuning |
| How many neighbors seemed to be optimal for these data? | Model Tuning |
| Which models have the lowest root mean squared errors? | Model Evaluation |
| Which days were poorly predicted? | EDA |
| Variable importance scores indicate that the weather information is not predictive. We’ll drop it from the next set of models. | Model Evaluation |
| It seems like we should focus on a lot of boosting iterations for that model. | Model Evaluation |
| We need to encode holiday features to improve predictions on (and around) those dates. | Feature Engineering |
| Let’s drop KNN from the model list. | Model Evaluation |
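The date-encoding steps from the table (day-of-the-week, year, and holiday features) can be sketched with dplyr and lubridate. This is a minimal illustration on a toy stand-in for the `all_stations` data assembled earlier; the `holiday_dates` vector is a hypothetical placeholder, not the book's actual holiday calendar:

```r
library(dplyr)
library(lubridate)

# Toy stand-in for all_stations; the real data come from the CTA file above.
all_stations <- tibble(
  station = "Clark/Lake",
  date    = as.Date("2010-06-01") + 0:6,  # one week of dates
  rides   = c(10.1, 10.4, 10.2, 10.6, 9.8, 4.1, 3.9)
)

# Hypothetical holiday list; a real analysis would use an actual calendar.
holiday_dates <- as.Date("2010-07-05")

station_features <- all_stations %>%
  mutate(
    dow     = wday(date, label = TRUE),   # day-of-the-week factor
    year    = year(date),
    weekend = wday(date) %in% c(1, 7),    # 1 = Sunday, 7 = Saturday
    holiday = date %in% holiday_dates
  )
```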