- course site: https://las.inf.ethz.ch/teaching/introml-s26
- projects: https://project.las.ethz.ch/
^aafc7b
- exam
- multiple choice, fully on paper
- projects
- 3 of 4 projects have to "pass" to be admitted to the exam
- for each project, a 1-minute video explaining the solution has to be uploaded to polybox
- practice classes
- 2 A4 pages (1 sheet) handwritten / >11pt
- calculator
Exercises
-> FS2026_tasks
Projects
Lecture
#timestamp 2026-02-27
#todo completely useless - delete or rewrite
We have: the least-squares objective $\hat{w} = \arg\min_w \|y - Xw\|_2^2$
normal equation: $\hat{w} = (X^\top X)^{-1} X^\top y$
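The normal equation can be sketched in numpy on synthetic data (all names here are illustrative; `np.linalg.solve` is used instead of an explicit inverse for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # design matrix: 100 samples, 3 features
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)   # targets with small noise

# normal equation: solve X^T X w = X^T y (avoids forming the inverse)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # close to w_true
```

With enough samples and little noise, `w_hat` recovers `w_true` almost exactly.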
#timestamp 2026-03-03
Variance might arise not only from noise, but also from having too few samples.
overfitting
#timestamp 2026-03-04
- Bias
  distance between the average prediction and the true function - error from wrong assumptions in the algorithm ("underfitting")
- Variance
  how far individual predictions scatter around the average prediction - error from sensitivity to noise ("overfitting")
-> overfitting: increasing model complexity beyond the point where training error keeps decreasing but test error rises
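The decomposition can be illustrated by simulation: fit many models on fresh noisy samples of the same size and measure how the average prediction deviates from the truth (bias) and how individual fits scatter (variance). The setup below is entirely synthetic (function, noise level, and degrees chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(2 * x)          # ground-truth function (synthetic)
x_eval = np.linspace(0, 1, 50)       # fixed evaluation grid

def fit_predict(degree, n=20):
    """Fit a degree-d polynomial to one fresh noisy sample, predict on the grid."""
    x = rng.uniform(0, 1, n)
    y = f(x) + 0.3 * rng.normal(size=n)
    return np.polyval(np.polyfit(x, y, degree), x_eval)

results = {}
for degree in (1, 3, 12):
    preds = np.stack([fit_predict(degree) for _ in range(200)])
    bias2 = np.mean((preds.mean(axis=0) - f(x_eval)) ** 2)  # (avg - truth)^2
    var = preds.var(axis=0).mean()                          # spread across fits
    results[degree] = (bias2, var)
    print(degree, bias2, var)
```

Degree 1 underfits (high bias), degree 12 on 20 points overfits (high variance); degree 3 sits in between.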
splitting data
- Train/Test split: train on the training split, evaluate the model on held-out test data
- Train/Validation/Test split: use the validation set to tune hyperparameters, use the test set to get an unbiased estimate of generalization error
- K-Fold Cross-Validation (CV): divide the training data into $k$ subsets (folds). The model is trained $k$ times, each time using a different fold as the validation set and the remaining folds for training.
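A minimal k-fold index-splitting sketch in plain numpy (function name and interface are my own; libraries like scikit-learn provide this ready-made):

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) index pairs for k-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)  # near-equal fold sizes; n need not divide by k
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(10, 5))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))  # 8 2 for each of the 5 folds
```

Each sample lands in exactly one validation fold, so the validation errors averaged over folds use every data point once.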
control model complexity
A complex model (e.g. a high-degree polynomial) might make weights large / oscillate. To counteract:
- Smaller degree
- Smaller number of monomials "active" by limiting the $\ell_0$-norm
- Limit the $\ell_1$- or $\ell_2$-norm of the weights (regularization)
Lasso Regression ($\ell_1$ regularization): $\hat{w} = \arg\min_w \|y - Xw\|_2^2 + \lambda \|w\|_1$
Adds a penalty proportional to the absolute value of the coefficients.
- Effect: Induces sparsity, meaning it sets some coefficients to exactly zero, effectively performing feature selection.
Ridge Regression ($\ell_2$ regularization): $\hat{w} = \arg\min_w \|y - Xw\|_2^2 + \lambda \|w\|_2^2$
Adds a penalty proportional to the square of the magnitude of coefficients.
- Effect: Shrinks all coefficients toward zero but rarely makes them exactly zero.
- Analytical Solution: $\hat{w} = (X^\top X + \lambda I)^{-1} X^\top y$.
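The ridge solution sketched in numpy ($\lambda$ chosen arbitrarily for illustration); it also shows the shrinkage effect relative to the unregularized least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

lam = 0.5                      # regularization strength (hyperparameter)
d = X.shape[1]
# analytical ridge solution: w = (X^T X + lam * I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# shrinkage: the ridge weights have smaller norm than the unregularized solution
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(np.linalg.norm(w_ridge), np.linalg.norm(w_ls))
```

For $\lambda \to 0$ the ridge solution recovers the normal-equation solution; larger $\lambda$ shrinks the weights further.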
#todo why does lasso induce sparsity? (short answer: the $\ell_1$ penalty has constant slope $\lambda$ near zero, so small coefficients get pushed exactly to zero; geometrically, the $\ell_1$ ball has corners on the coordinate axes)
| Feature | Ridge (ℓ2) | Lasso (ℓ1) |
|---|---|---|
| Penalty | $\lambda \sum_j w_j^2$ | $\lambda \sum_j \lvert w_j \rvert$ |
| Solution | Closed-form | Numerical optimization |
| Sparsity | No (coefficients shrink toward zero) | Yes (some coefficients become exactly zero) |
| Use Case | Preventing overfitting | Feature selection |
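The sparsity difference can be made concrete with a minimal lasso solver. This is not the course's reference method; it is an ISTA (iterative soft-thresholding) sketch for $\frac{1}{2}\|y - Xw\|_2^2 + \lambda \|w\|_1$, where the soft-threshold step sets small coefficients exactly to zero:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink toward zero, clip at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize 0.5*||y - Xw||^2 + lam*||w||_1 by iterative soft-thresholding."""
    step = 1.0 / np.linalg.eigvalsh(X.T @ X).max()  # 1/L, L = Lipschitz constant
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)                    # gradient of the smooth part
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 6))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0]) + 0.05 * rng.normal(size=80)
w = lasso_ista(X, y, lam=5.0)
print(w)  # coefficients of the irrelevant features end up exactly zero
```

A ridge fit on the same data would leave all six coefficients nonzero; the lasso zeros out the four irrelevant ones.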
#timestamp 2026-03-20

check whether kernel is valid:
- check symmetry
- decompose into known valid kernels (using linearity)
- use closure rule (sum/product/scaling/composition)
- if unsure, evaluate the kernel on a small set of points and check whether the resulting kernel matrix is positive semi-definite (a single counterexample disproves validity)
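The last check can be sketched numerically: build the Gram matrix on a few points and test the necessary conditions (symmetry, eigenvalues ≥ 0). The two example kernels are my own illustrations; this cannot prove validity for all inputs, only refute it:

```python
import numpy as np

def kernel_matrix(k, xs):
    """Gram matrix K[i, j] = k(xs[i], xs[j])."""
    return np.array([[k(a, b) for b in xs] for a in xs])

def looks_valid(K, tol=1e-9):
    """Necessary conditions for a valid kernel: symmetric and PSD."""
    return bool(np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol)

xs = np.linspace(-2.0, 2.0, 7)
rbf = lambda a, b: np.exp(-(a - b) ** 2)   # valid (Gaussian/RBF kernel)
bad = lambda a, b: a - b                   # not even symmetric

print(looks_valid(kernel_matrix(rbf, xs)))  # True
print(looks_valid(kernel_matrix(bad, xs)))  # False
```

A `False` here is a counterexample; a `True` only means no violation was found on these points.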
10. Neural networks
#timestamp 2026-03-24
computation at each layer consists of three parts:
- trainable weights $W^{(l)}$
- biases $b^{(l)}$, which shift our functions (linear -> affine)
- activation functions $\varphi$, which rescale the values and introduce non-linearity -> allows the network to learn complex patterns
- possible activation functions: ReLU, Sigmoid
forward propagation
- initial input (input vector $x$ assigned to 0th layer): $v^{(0)} = x$
- hidden layers can be written as matrix-multiplication (pre-activation value $z^{(l)}$, activation value $v^{(l)}$): $z^{(l)} = W^{(l)} v^{(l-1)} + b^{(l)}$, $v^{(l)} = \varphi(z^{(l)})$
- output layer ($l = L$): linear combination of last hidden layer activations: $f(x) = W^{(L)} v^{(L-1)} + b^{(L)}$
=> for a multilayer perceptron with a single hidden layer: $f(x) = W^{(2)} \varphi(W^{(1)} x + b^{(1)}) + b^{(2)}$
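The single-hidden-layer forward pass, sketched in numpy with random weights (the layer sizes are assumptions for illustration):

```python
import numpy as np

def relu(z):
    """ReLU activation: elementwise max(z, 0)."""
    return np.maximum(z, 0.0)

def mlp_forward(x, W1, b1, W2, b2):
    """f(x) = W2 @ relu(W1 @ x + b1) + b2 for a single hidden layer."""
    z1 = W1 @ x + b1          # pre-activation of the hidden layer
    v1 = relu(z1)             # activation (non-linearity)
    return W2 @ v1 + b2       # output layer: linear combination

rng = np.random.default_rng(3)
x = rng.normal(size=4)                            # input vector (d = 4)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)     # hidden layer: 4 -> 5
W2, b2 = rng.normal(size=(1, 5)), np.zeros(1)     # output layer: 5 -> 1
print(mlp_forward(x, W1, b1, W2, b2))             # output of shape (1,)
```

Stacking more `(W, b, relu)` steps before the final linear layer gives the general deep forward pass from the bullet points above.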
backpropagation:
https://xnought.github.io/backprop-explainer/