Statistics via Sports: Summer Lab 2024
- Taught as part of the Wharton Sports Analytics Summer Research Lab
- Recommended prerequisites: calculus, probability, R coding; matrix algebra helps
Daily format
- 1- to 2-hour lecture
- Hands-on active learning lab where you will analyze a real-world sports dataset
Regression modeling
- Simple linear regression
- Multivariable linear regression
- planned lecture
- note: estimating the coefficients
- lab, data: NBA team-seasons for the four factors, punts
- Example of the research process
- planned lecture
- lab:
- get into groups and start thinking about a research project
- plan to read relevant literature and start with a replication of existing analysis
- finish up the previous labs
- Logistic regression
- planned lecture
- note: logistic regression & gradient descent
- lab, data: field goals, 2023-2024 NCAA men’s basketball game results and team info from Kaggle
- Confounding
- planned lecture
- lab, data: MLB half-innings data for park effects
- Models do what they’re told
Frequentist statistical inference and uncertainty quantification
- Significance and p-values
- planned lecture
- lab, data: diving, TTO (time through the order)
- Normal approximation (CLT) and binomial proportion confidence interval
- The bootstrap
Shrinkage & Bayesian statistics
- Priors & the power of fake data
- Empirical Bayes
- Shrinkage estimation
- planned lecture (needs to be rewritten)
- live lecture
- lab, data: first putt success percentage training data and held-out test data
- Fully Bayesian models
- planned lecture
- live lecture
- lab, data: NFL game-by-game data for Bayesian power rating and home field advantage model
- Regularization & ridge regression
- planned lecture
- lab, data: NBA lineup data; this is in a .rds format that takes significantly less storage than a .csv file. To convert it to a .csv file for use in a non-R language, you'll need to load it into R and then save it as a .csv
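
The .rds to .csv conversion mentioned for the NBA lineup data can be sketched in base R as follows. This is a minimal sketch, assuming the .rds file stores a single data frame. The helper name `rds_to_csv` and the demo file names are hypothetical and not part of the course materials; substitute the path of the actual lab file.

```r
# Convert an .rds file holding a data frame into a .csv file.
# `rds_to_csv` is a hypothetical helper name used only for illustration.
rds_to_csv <- function(rds_path, csv_path) {
  obj <- readRDS(rds_path)                     # load the serialized R object
  stopifnot(is.data.frame(obj))                # assumption: the file stores a data frame
  write.csv(obj, csv_path, row.names = FALSE)  # leave out R's row-name column
  invisible(csv_path)
}

# Demo on a tiny made-up table written to a temporary directory.
demo_rds <- file.path(tempdir(), "demo_lineups.rds")
demo_csv <- file.path(tempdir(), "demo_lineups.csv")
saveRDS(data.frame(lineup_id = 1:2, minutes = c(12.5, 8.0)), demo_rds)
rds_to_csv(demo_rds, demo_csv)
```

Setting `row.names = FALSE` keeps the .csv free of an extra unnamed index column, which tends to make it easier to read from other languages.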
Machine learning
- Bias-variance trade-off
- planned lecture
- lab, data: MLB half-innings data for park effects
- Decision trees
- planned lecture
- lab: just work on research
- Random forests & boosting
- planned lecture
- lab, data: NFL play-by-play data for win probability modeling
Miscellaneous
- Kelly betting
- Clustering (k-means hard clustering)
If I had more time
- Topics included purely as labs that deserve their own lecture
- Selection bias
- Parametric inference
- Topics alluded to in class that deserve their own lecture
- Multicollinearity
- Other topics
- Tracking data
- Causal inference
- Regression (observational studies) is not causation
- Randomized controlled trials
- Miscellaneous
- Soft clustering (GMMs & EM algorithm)
- Rare events
- Multiple hypothesis testing (Bonferroni correction & Benjamini-Hochberg)
- GEV distribution for max race running time