- course site: https://las.inf.ethz.ch/teaching/introml-s26
- projects: https://project.las.ethz.ch/
^aafc7b
- exam
- multiple choice, fully on paper
- projects
- 3/4 have to "pass" to be able to do the exam
- 1min video explaining solution for each project needed, uploaded to polybox
- practice classes
- 2 A4 pages (1 sheet) handwritten / >11pt
- calculator
Exercises
-> FS2026_tasks
Projects
Vorlesung
#timestamp 2026-02-27
#todo completely useless - delete or rewrite
We have:
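normal equation: $X^\top X \hat{w} = X^\top y$, i.e. $\hat{w} = (X^\top X)^{-1} X^\top y$ if $X^\top X$ is invertible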
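A minimal sketch of solving the normal equation with NumPy, assuming the usual notation (design matrix `X`, targets `y`, weights `w` with $X^\top X w = X^\top y$). The helper name `fit_least_squares` and the toy data are made up for illustration.

```python
import numpy as np

def fit_least_squares(X, y):
    """Solve the normal equation X^T X w = X^T y for w."""
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical toy data: y = 2*x + 1 exactly; the column of ones models the intercept.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([x, np.ones_like(x)])
y = 2.0 * x + 1.0

w_hat = fit_least_squares(X, y)  # slope and intercept
```

Because the toy data are noise-free, `w_hat` recovers the slope and intercept up to floating-point error.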
#timestamp 2026-03-03
Variance might not be only because of noise, but also because there are not enough samples.
overfitting
#timestamp 2026-03-04
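- Bias: distance between the average prediction $\mathbb{E}[\hat{f}(x)]$ and the true function $f(x)$ - error from wrong assumptions in the algorithm ("underfitting")
- Variance: how far a single fitted model $\hat{f}(x)$ is from the average prediction $\mathbb{E}[\hat{f}(x)]$ - error from sensitivity to noise ("overfitting")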
-> increasing model complexity beyond point where training error
splitting data
- Train/Test split: the model is fitted on the train split and evaluated on the test data
- Train/Validation/Test split: use the validation set to tune hyperparameters, use the test set to get an unbiased estimate of the error
- K-Fold Cross-Validation (CV): Dividing the training data into $k$ subsets (folds). The model is trained $k$ times, each time using a different fold as the validation set and the remaining $k-1$ folds for training.
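The K-fold procedure can be sketched as an index generator. This is a minimal illustration assuming the data are shuffled once before splitting; `k_fold_indices` is a hypothetical helper, not a library function.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs, one per fold."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)   # shuffle once
    folds = np.array_split(perm, k)     # k (nearly) equal-sized folds
    for i in range(k):
        val_idx = folds[i]
        # All remaining folds together form the training set.
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, k=5))
```

Each sample lands in the validation set exactly once. The validation errors of the `k` runs are typically averaged to compare hyperparameter settings.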
control model complexity
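A complex model (e.g. a high-degree polynomial) might make the weights large / oscillate. To counteract:
- Smaller degree
- Smaller number of monomials "active" by limiting the $\ell_0$-norm $\|w\|_0$ (number of non-zero weights)
- Limit the $\ell_1$ norm $\|w\|_1$ or the $\ell_2$ norm $\|w\|_2$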
Lasso Regression ($\ell_1$ regularization)
Adds a penalty proportional to the absolute value of the coefficients: $\min_w \|y - Xw\|_2^2 + \lambda \|w\|_1$
- Effect: Induces sparsity, meaning it sets some coefficients to exactly zero, effectively performing feature selection.
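Ridge Regression ($\ell_2$ regularization)
Adds a penalty proportional to the square of the magnitude of coefficients: $\min_w \|y - Xw\|_2^2 + \lambda \|w\|_2^2$
- Effect: Shrinks all coefficients toward zero but rarely makes them exactly zero.
- Analytical Solution: $\hat{w} = (X^\top X + \lambda I)^{-1} X^\top y$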
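A minimal sketch of the ridge closed form, assuming the objective $\|y - Xw\|_2^2 + \lambda \|w\|_2^2$, whose minimizer is $(X^\top X + \lambda I)^{-1} X^\top y$. The function name `fit_ridge` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge solution (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Hypothetical synthetic regression problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

w_ols = fit_ridge(X, y, lam=0.0)       # lam = 0 reduces to ordinary least squares
w_shrunk = fit_ridge(X, y, lam=100.0)  # large lam shrinks the weights toward zero
```

Comparing `np.linalg.norm(w_ols)` with `np.linalg.norm(w_shrunk)` shows the shrinkage effect: the weights get smaller but stay nonzero.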
#todo why does lasso induce sparsity?
| Feature | Ridge (ℓ2) | Lasso (ℓ1) |
|---|---|---|
| Penalty | $\lambda \sum_j w_j^2$ | $\lambda \sum_j \lvert w_j \rvert$ |
| Solution | Closed-form | Numerical optimization |
| Sparsity | No (coefficients shrink but stay nonzero) | Yes (some coefficients exactly zero) |
| Use Case | Preventing overfitting | Feature selection |
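Since Lasso has no closed form, it needs an iterative solver. The sketch below uses proximal gradient descent (ISTA), which is one possible choice and not necessarily the method used in the course. It assumes the objective $\|y - Xw\|_2^2 + \lambda \|w\|_1$. The soft-thresholding step also hints at where the sparsity comes from: any coordinate whose magnitude falls below the threshold is set to exactly zero. All names and data are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry toward zero by t; entries with |v| <= t become exactly 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fit_lasso_ista(X, y, lam, n_iter=500):
    """Proximal gradient descent for ||y - Xw||^2 + lam * ||w||_1."""
    # Step size 1/L, where L = 2 * sigma_max(X)^2 is the gradient's Lipschitz constant.
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ w - y)                    # gradient of the squared loss
        w = soft_threshold(w - step * grad, step * lam)   # proximal step for the l1 term
    return w

# Hypothetical data in which only features 0 and 3 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=100)

w_lasso = fit_lasso_ista(X, y, lam=40.0)
```

With a sufficiently large `lam`, the irrelevant coordinates of `w_lasso` end up exactly zero, while the relevant ones are shrunk but remain nonzero.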
#timestamp 2026-03-20

check whether kernel is valid:
- check symmetry
- decompose into known valid kernels (using linearity)
- use closure rule (sum/product/scaling/composition)
- if unsure, build a small kernel matrix counterexample
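The counterexample step can be sketched numerically: build the kernel matrix on a few points and check symmetry and positive semi-definiteness via its eigenvalues. Note the asymmetry: a failing matrix disproves validity, while a passing matrix on one sample proves nothing. The helper names and the two example kernels are assumptions for illustration.

```python
import numpy as np

def gram(kernel, X):
    """Kernel matrix K[i, j] = kernel(x_i, x_j)."""
    return np.array([[kernel(a, b) for b in X] for a in X])

def passes_kernel_matrix_check(K, tol=1e-8):
    """True if K is symmetric and positive semi-definite (up to tol)."""
    if not np.allclose(K, K.T):
        return False
    eigvals = np.linalg.eigvalsh(K)   # eigenvalues of a symmetric matrix
    return bool(eigvals.min() >= -tol)

X = np.array([[0.0], [1.0], [2.0]])

def rbf(a, b):
    return np.exp(-np.sum((a - b) ** 2))   # known valid kernel

def neg_sq_dist(a, b):
    return -np.sum((a - b) ** 2)           # symmetric, but not a valid kernel

rbf_ok = passes_kernel_matrix_check(gram(rbf, X))
neg_ok = passes_kernel_matrix_check(gram(neg_sq_dist, X))
```

The matrix for `neg_sq_dist` has zero trace but is not the zero matrix, so it must have a negative eigenvalue. These three points already form a counterexample.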
10. Neural networks
#timestamp 2026-03-24
computation at each layer consists of three parts:
- $W$, trainable weights
- $b$, biases, which shift our functions (linear -> affine)
- $\varphi$, activation functions, which rescale the values and introduce non-linearity -> allows the network to learn complex patterns
- possible activation functions: ReLU, Sigmoid
forward propagation
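- initial input (input vector $x$ assigned to 0th layer): $v^{(0)} = x$
- hidden layers can be written as matrix-multiplication with weights $W^{(l)}$, biases $b^{(l)}$ and activation function $\varphi$ (pre-activation value $z^{(l)}$, activation value $v^{(l)}$): $z^{(l)} = W^{(l)} v^{(l-1)} + b^{(l)}$, $v^{(l)} = \varphi(z^{(l)})$
- output layer ($l = L$): linear combination of last hidden layer activations: $f = W^{(L)} v^{(L-1)} + b^{(L)}$
=> for a multilayer perceptron with a single hidden layer: $f(x) = W^{(2)} \varphi(W^{(1)} x + b^{(1)}) + b^{(2)}$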
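Forward propagation can be sketched in a few lines of NumPy. This assumes the common convention that each hidden layer computes an affine map followed by an activation, and that the output layer is affine only. The weights and the input are made-up numbers for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, weights, biases, activation=relu):
    """Forward pass of a multilayer perceptron with a linear output layer."""
    v = x                                    # 0th layer: the input itself
    for W, b in zip(weights[:-1], biases[:-1]):
        z = W @ v + b                        # pre-activation value
        v = activation(z)                    # activation value
    return weights[-1] @ v + biases[-1]      # output layer: no activation

# Hypothetical network: 2 inputs -> 2 hidden units -> 1 output.
W1 = np.array([[1.0, -1.0], [0.5, 0.5]])
b1 = np.array([0.0, -1.0])
W2 = np.array([[1.0, 2.0]])
b2 = np.array([0.5])

x = np.array([2.0, 1.0])
out = forward(x, [W1, W2], [b1, b2])
# hidden pre-activation: [1.0, 0.5]; after ReLU: [1.0, 0.5]; output: 1*1.0 + 2*0.5 + 0.5 = 2.5
```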
backwards propagation:
https://xnought.github.io/backprop-explainer/
Dimensionality reduction
| Method | What it does | Best used for... |
|---|---|---|
| PCA | Flattens the data like a pancake. | Simple data where things move in straight lines. |
| Kernel PCA | Bends and twists the data before flattening it. | Complex patterns like spirals or circles. |
| Autoencoder | Compresses the data into a "secret code" and learns to unpack it. | Very messy, high-tech data like photos or speech. |
PCA
PCA relies on the Covariance Matrix $\Sigma$
- The Matrix: $\Sigma = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top$ (for centered data).
- The Magic (Eigenvectors): We solve $\Sigma v = \lambda v$. $v$ (Eigenvector) is the direction of the "best angle" (the principal component). $\lambda$ (Eigenvalue) is a number representing the amount of spread (variance) in that direction.
- The Projection: To turn your $d$-dimensional point $x$ into a $k$-dimensional point $z$, you just multiply it by the top $k$ eigenvectors: $z = V_k^\top x$ with $V_k = [v_1, \dots, v_k]$
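The three PCA steps (covariance matrix, eigendecomposition, projection) can be sketched with NumPy. This is a minimal version assuming the data are centered first and the covariance uses the $1/n$ normalization. The function name `pca` and the toy data are illustrative.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the top-k principal components."""
    X_centered = X - X.mean(axis=0)
    cov = X_centered.T @ X_centered / X.shape[0]   # covariance matrix (1/n convention)
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]          # indices of the k largest
    V_k = eigvecs[:, order]                        # columns = principal directions
    return X_centered @ V_k, V_k, eigvals[order]

# Hypothetical data lying close to the line spanned by (1, 2).
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2.0 * t]) + 0.01 * rng.normal(size=(200, 2))

Z, V_k, top_eigvals = pca(X, k=1)
```

Here `V_k[:, 0]` is close to the direction $(1, 2)/\sqrt{5}$ (up to sign), and the variance of `Z` equals the top eigenvalue.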
kernel PCA
In standard PCA, we find eigenvectors of the covariance matrix of the data. But what if we first transformed the data into a very high-dimensional space using a function $\phi(x)$?
- The Kernel Trick: Instead of calculating $\phi(x)$ explicitly, we use a Kernel Function $k(x, x') = \phi(x)^\top \phi(x')$ that calculates the "similarity" between two points directly.
- The Kernel Matrix ($K$): We build a matrix where every entry $K_{ij} = k(x_i, x_j)$ is the similarity between point $i$ and point $j$.
- The Solution: We find the eigenvectors $\alpha^{(j)}$ of $K$ instead of the covariance matrix $\Sigma$.
- The Projection: The new coordinates $z$ for a point $x$ are calculated by checking its similarity to all training points: $z_j = \sum_{i=1}^{n} \alpha^{(j)}_i \, k(x_i, x)$
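A minimal kernel PCA sketch for the training points, assuming an RBF kernel and the standard extra step of centering the kernel matrix in feature space, which the notes above do not mention. The function names, the `gamma` parameter, and the data are assumptions for illustration.

```python
import numpy as np

def rbf_kernel_matrix(A, B, gamma=1.0):
    """K[i, j] = exp(-gamma * ||a_i - b_j||^2)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))   # clip tiny negative round-off

def kernel_pca(X, k, gamma=1.0):
    """Return the k-dimensional kernel PCA coordinates of the training points."""
    n = X.shape[0]
    K = rbf_kernel_matrix(X, X, gamma)
    one_n = np.full((n, n), 1.0 / n)
    # Center the data in feature space without ever computing phi(x).
    K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    eigvals, eigvecs = np.linalg.eigh(K_c)
    order = np.argsort(eigvals)[::-1][:k]
    lam, alpha = eigvals[order], eigvecs[:, order]
    alpha = alpha / np.sqrt(lam)   # so the feature-space directions have unit norm
    return K_c @ alpha             # z_j(x_i) = sum_m alpha_m^(j) * k_c(x_m, x_i)

# Hypothetical data set.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
Z = kernel_pca(X, k=2, gamma=0.5)
```

The resulting coordinates are centered and mutually uncorrelated, as with linear PCA.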
autoencoders
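This is an optimization problem. We have two functions: an Encoder $f_{\text{enc}}$ and a Decoder $f_{\text{dec}}$.
- The Bottleneck: $z = f_{\text{enc}}(x) = \varphi(W_1 x + b_1)$, where $z$ has fewer dimensions than $x$. Here, $\varphi$ is a nonlinear "gate" (like ReLU or Sigmoid) that allows the math to learn curves instead of just straight lines.
- The Loss Function: We want the output $\hat{x} = f_{\text{dec}}(z)$ to be as close to the input $x$ as possible: $\min \frac{1}{n} \sum_{i=1}^{n} \|x_i - \hat{x}_i\|_2^2$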
- The Training: gradient descent
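A tiny NumPy autoencoder trained with plain gradient descent, as a sketch only. The architecture (sigmoid encoder, linear decoder, mean squared reconstruction loss), the hyperparameters, and the toy data are all assumptions, and the gradients are written out by hand rather than taken from a framework.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_autoencoder(X, k, lr=0.05, n_steps=2000, seed=0):
    """Fit encoder z = sigmoid(x W1 + b1) and decoder x_hat = z W2 + b2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1, b1 = 0.1 * rng.normal(size=(d, k)), np.zeros(k)
    W2, b2 = 0.1 * rng.normal(size=(k, d)), np.zeros(d)
    losses = []
    for _ in range(n_steps):
        Z = sigmoid(X @ W1 + b1)        # bottleneck code, shape (n, k)
        X_hat = Z @ W2 + b2             # reconstruction, shape (n, d)
        err = X_hat - X
        losses.append(np.mean(np.sum(err**2, axis=1)))
        # Backpropagate the mean squared reconstruction error.
        d_X_hat = 2.0 * err / n
        d_pre = (d_X_hat @ W2.T) * Z * (1.0 - Z)   # sigmoid'(a) = s(a) * (1 - s(a))
        # Gradient descent step on all parameters.
        W2 -= lr * (Z.T @ d_X_hat)
        b2 -= lr * d_X_hat.sum(axis=0)
        W1 -= lr * (X.T @ d_pre)
        b1 -= lr * d_pre.sum(axis=0)
    return (W1, b1, W2, b2), losses

# Hypothetical 3-D data that really lives on a 1-D line.
rng = np.random.default_rng(1)
t = rng.uniform(-1.0, 1.0, size=100)
X = np.outer(t, np.array([0.5, 1.0, -0.5]))

params, losses = train_autoencoder(X, k=1)
```

Plotting `losses` shows how the reconstruction error evolves. A one-dimensional bottleneck suffices here because the data have one underlying degree of freedom.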