# Statistics via Sports: Summer Lab 2024

- Taught as part of the Wharton Sports Analytics Summer Research Lab
- Recommended prerequisites: calculus, probability, R coding; matrix algebra helps

### Daily format

- 1- to 2-hour lecture
- Hands-on active learning lab where you will analyze a real-world sports dataset

### Regression modeling

- Simple linear regression
- Multivariable linear regression
  - planned lecture
  - note: estimating the coefficients
  - lab, data: NBA team-seasons for the four factors, punts

- Example of the research process
  - planned lecture
  - lab:
    - get into groups and start thinking about a research project
    - plan to read relevant literature and start with a replication of an existing analysis
    - finish up the previous labs

- Logistic regression
  - planned lecture
  - note: logistic regression & gradient descent
  - lab, data: field goals, 2023-2024 NCAA men’s basketball game results and team info from Kaggle

- Confounding
  - planned lecture
  - lab, data: MLB half-innings data for park effects

- Models do what they’re told

### Frequentist statistical inference and uncertainty quantification

- Significance and p-values
  - planned lecture
  - lab, data: diving, TTO (time through the order)

- Normal approximation (CLT) and binomial proportion confidence interval
- The bootstrap
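
As a rough illustration of the last two topics, the sketch below computes a 95% interval for a binomial proportion in two ways: with the normal approximation (the Wald interval) and with a percentile bootstrap. The function names, the resample count, and the 80-for-100 example data are made up for illustration and are not taken from the course labs.

```python
import math
import random
from statistics import NormalDist


def wald_interval(successes, trials, confidence=0.95):
    """Normal-approximation (Wald) interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    std_err = math.sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat - z * std_err, p_hat + z * std_err


def bootstrap_interval(outcomes, confidence=0.95, n_resamples=2000, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the data with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    alpha = 1 - confidence
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical data: 80 successes in 100 attempts.
made_up_outcomes = [1] * 80 + [0] * 20
print(wald_interval(80, 100))
print(bootstrap_interval(made_up_outcomes))
```

With a sample this large and a proportion well away from 0 or 1, the two intervals should roughly agree. The Wald interval is known to behave poorly for small samples or proportions near the boundaries.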

### Shrinkage & Bayesian statistics

- Priors & the power of fake data
- Empirical Bayes
- Shrinkage estimation
  - planned lecture (needs to be rewritten)
  - live lecture
  - lab, data: first putt success percentage training data and held-out test data

- Fully Bayesian models
  - planned lecture
  - live lecture
  - lab, data: NFL game-by-game data for Bayesian power rating and home field advantage model

- Regularization & ridge regression
  - planned lecture
  - lab, data: NBA lineup data; this is in a `.rds` format that takes significantly less storage than a `.csv` file. To convert it to a `.csv` file for use in a non-`R` language, you’ll need to load it into `R` and then save it as a `.csv`


### Machine learning

- Bias-variance trade-off
  - planned lecture
  - lab, data: MLB half-innings data for park effects

- Decision trees
  - planned lecture
  - lab: just work on research

- Random forests & boosting
  - planned lecture
  - lab, data: NFL play-by-play data for win probability modeling

### Miscellaneous

- Kelly betting
- Clustering (K-means hard clustering)
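
For the Kelly betting topic listed above, a minimal sketch of the standard Kelly criterion for a single binary bet: with win probability p and net odds b (profit per unit staked on a win), the growth-optimal fraction of bankroll to wager is f* = (bp - (1 - p)) / b. The function name and the example numbers are illustrative assumptions, not course material.

```python
def kelly_fraction(win_prob, net_odds):
    """Growth-optimal bankroll fraction for a single binary bet.

    win_prob: probability of winning the bet (between 0 and 1).
    net_odds: profit per unit staked if the bet wins (1.0 for even money).
    """
    if not 0 <= win_prob <= 1:
        raise ValueError("win_prob must be between 0 and 1")
    if net_odds <= 0:
        raise ValueError("net_odds must be positive")
    fraction = (net_odds * win_prob - (1 - win_prob)) / net_odds
    # A negative value means the bet has no edge, so stake nothing.
    return max(fraction, 0.0)


# Hypothetical example: a 55% chance to win an even-money bet.
print(kelly_fraction(0.55, 1.0))
```

The formula assumes the win probability is known exactly. In practice it is estimated, which is one reason bettors often stake only a fraction of the Kelly amount.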

### If I had more time

- Topics included purely as labs that deserve their own lecture
  - Selection bias
  - Parametric inference

- Topics alluded to in class that deserve their own lecture
  - Multicollinearity

- Other topics
  - Tracking data
  - Causal inference
    - Regression (observational studies) is not causation
    - Randomized controlled trials

- Miscellaneous
  - Soft clustering (GMMs & EM algorithm)
  - Rare events
  - Multiple hypothesis testing (Bonferroni correction & Benjamini-Hochberg)
  - GEV distribution for max race running time