# Computational Statistics and Data Analysis (MVComp2)

Winter Semester 2023/2024: October 16, 2023 to February 10, 2024

**Lectures**: Tue 11-13; **Exercises**: Fri 14-16

**Lectures**: Philosophenweg 12 / kHS; **Exercises**: INF 308 / HS2

Lecturer: Prof. Dr. Tristan Bereau, Institute for Theoretical Physics, Heidelberg University

6 credit points

## Course description

This lecture introduces basic methods and approaches in computational statistics and data analysis that are of great importance to empirical problems in the natural sciences. It starts with an overview of relevant concepts and theorems in probability theory and statistics and proceeds to more modern approaches, including automatic differentiation and machine learning. Lectures will be accompanied by computational exercises in Python. Students will learn to analyze data sets and interpret the results from a solid, theoretically grounded statistical perspective; devise statistical and machine learning models of experimental situations; infer the parameters of these models from empirical observations; and test hypotheses.

## Prerequisites

- Linear (Matrix) Algebra
- Basic calculus (derivatives & integrals)
- Basic programming skills in Python

## Tentative course outline

- Basic concepts in probability theory
- Random variables; expectations, variances, covariances, and their properties
- Discrete & continuous probability distributions
- Moment-generating functions, central limit theorem, and multivariate distributions
- Statistical models & inference: parameter estimation
- Hypothesis tests: tests, confidence intervals, bootstrap method
- Linear regression: least squares, generalized linear model
- Regularization: Ridge & LASSO regression, MAP estimation
- Nonlinear regression: basis expansions, neural networks
- Classification: k-nearest neighbors, logistic regression, linear discriminant analysis
- Kernel methods: Mercer kernels, Gaussian processes, support vector machines
- Model selection: Jeffreys scale, BIC, bias-variance tradeoff
- Dimensionality reduction: principal component analysis, factor analysis
- Information theory
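
To give a flavor of the computational exercises, here is a minimal sketch of one outline topic, the bootstrap method, used to estimate a confidence interval for a sample mean. It is an illustration only, not course material: the function name `bootstrap_mean_ci`, its parameters, and the synthetic Gaussian data are all assumptions made for this example, and it uses the simple percentile variant of the bootstrap.

```python
import random
import statistics


def bootstrap_mean_ci(data, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the sample mean.

    Hypothetical helper written for this illustration.
    """
    rng = random.Random(seed)
    n = len(data)
    # Resample the data with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    # Read off the empirical alpha/2 and 1 - alpha/2 quantiles.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Synthetic data (assumption): 200 draws from a normal distribution
# with mean 1.0 and standard deviation 2.0.
data_rng = random.Random(42)
sample = [data_rng.gauss(1.0, 2.0) for _ in range(200)]

low, high = bootstrap_mean_ci(sample)
print(f"sample mean      = {statistics.fmean(sample):.3f}")
print(f"95% bootstrap CI = [{low:.3f}, {high:.3f}]")
```

The interval width should be roughly comparable to the normal-theory value of about 2 × 1.96 × σ/√n, which is a useful sanity check when comparing the bootstrap with analytic confidence intervals.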

## Main references

- Wackerly, D., Mendenhall, W., & Scheaffer, R. L. (2014). Mathematical statistics with applications. Cengage Learning.
- Kevin P. Murphy, Probabilistic Machine Learning: An Introduction, MIT Press (2022), https://probml.github.io/pml-book/book1.html
- Kevin P. Murphy, Probabilistic Machine Learning: Advanced Topics, MIT Press (2022), https://probml.github.io/pml-book/book2.html
- Mehta, P., Bukov, M., Wang, C. H., Day, A. G., Richardson, C., Fisher, C. K., & Schwab, D. J. (2019). A high-bias, low-variance introduction to machine learning for physicists. Physics reports, 810, 1-124. https://doi.org/10.1016/j.physrep.2019.03.001
- Luca Amendola, Lecture notes on Statistical Methods. https://www.thphys.uni-heidelberg.de/%7Eamendola/teaching/compstat-hd.pdf

## Attendance

- Lectures
  - Lectures will cover the conceptual and theoretical aspects relevant to the course. Slides used in the lectures will be uploaded to this page after each lecture.
- Exercises
  - These sessions will provide an opportunity to discuss the last lecture and previous exercises, and to work on hands-on problems.

## Assessment

Competence and proficiency will be assessed through:

### Exercises

- To be handed in on a weekly basis.
- Please submit in groups of two or three. Ensure that your submitted document clearly contains the **names** of everyone in your group. Your group is responsible for producing one *original* exercise submission.
- Exercises will typically include one coding problem. Though I recommend Python, you are welcome to use any language you see fit. The code you write will not be graded; only the results will be (e.g., calculations or plots).
- Preferred formats for your submission: LaTeX, Markdown, Jupyter notebook, Quarto, or equivalent. Please always submit one PDF file.
- Exercises are always due the day before the next exercise session, i.e., on Thursdays by 23:59. Exercises are to be uploaded to Physik Übungsgruppen. No extensions will be granted.
- Exercises and solutions are available on this page.

### Exam

One on-site, written exam at the end of the semester.

Date: Friday Feb. 9, 2024, **14:00-16:00**.

Place: INF 308 / HS 2

### Weights

- Exam: 70%
- Exercises: 30%

## Rocket.Chat

We will be using Rocket.Chat as a virtual, public forum for questions and discussion. The system allows for fast responses from the instructor and classmates, and lets students see previous questions and answers. Rather than emailing questions to the instructor, please post them on Rocket.Chat. If you have not already been automatically enrolled, please sign up via https://uebungen.physik.uni-heidelberg.de/chat/group/WS23-1758.

## On large language models

*Section written by OpenAI’s ChatGPT.*

- Supplement, Not Substitute
  - While LLMs can be great supplementary tools, they should not replace your primary learning resources, such as textbooks, lectures, and discussions. Always prioritize understanding the fundamental concepts from your course materials.
- Exercises & Assignments
  - We encourage students to use LLMs for understanding topics, brainstorming ideas, or verifying concepts. However, simply copying answers or relying on the LLM to solve your exercises defeats the purpose of your academic journey. It’s vital to ensure you understand the material and can apply it without external assistance.
- Ethical Considerations
  - Remember, using external sources without proper citation or presenting another’s work as your own is plagiarism. Always cite any assistance or information obtained from LLMs.
- Limitations
  - While LLMs are advanced and can provide a wealth of information, they are not infallible. Always cross-check crucial information from multiple trusted sources.