ETH_IntroductionToMachineLearning

links - IntML

Info

exam
- multiple choice, fully on paper
projects
- 3/4 have to "pass" to be able to do the exam
- 1min video explaining solution for each project needed, uploaded to polybox
practice classes
- ETH_IntroductionToMachineLearning_Übungen

Hilfsmittel IntroductionToMachineLearning

#timestamp 2026-02-27

#todo completely useless - delete or rewrite

We have:

\begin{aligned} \hat{w} & = w_{s} + w_{p} \\ w_{s} & \in span (X^{⊤}) = span (X^{⊤} X) \end{aligned}

normal equation (the pseudoinverse picks its minimum-norm solution):

\begin{aligned} X^{⊤} X ω & = X^{⊤} y \\ ω & = (X^{⊤} X)^{†} X^{⊤} y = X^{†} y \end{aligned}
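A minimal NumPy sketch of the pseudoinverse solution, assuming a small random design matrix with fewer samples than features (hypothetical data, so that $X^{⊤} X$ is singular). It checks the normal equation numerically and builds a second solution by adding a null-space component, which fits equally well but has a larger norm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                      # fewer samples than features -> X^T X is singular
X = rng.normal(size=(n, d))      # hypothetical design matrix
y = rng.normal(size=n)           # hypothetical targets

# Minimum-norm solution via the Moore-Penrose pseudoinverse.
w_pinv = np.linalg.pinv(X) @ y

# Same solution written as (X^T X)^† X^T y.
# An explicit cutoff keeps the numerically-zero singular values at zero.
w_normal = np.linalg.pinv(X.T @ X, rcond=1e-10) @ X.T @ y

# Residual of the normal equation X^T X w = X^T y.
normal_eq_residual = np.linalg.norm(X.T @ X @ w_pinv - X.T @ y)

# Adding any null-space component keeps the predictions but increases the norm.
null_proj = np.eye(d) - np.linalg.pinv(X) @ X   # projector onto null(X)
w_other = w_pinv + null_proj @ rng.normal(size=d)
```

`w_pinv` lies in $span (X^{⊤})$; `w_other` has the same predictions `X @ w` but an extra component orthogonal to that span.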

#timestamp 2026-03-03

Variance might not be only because of noise, but also because there are not enough samples.

overfitting $\hat{=}$ too sensitive to noise

#timestamp 2026-03-04

Bias $\sim$ distance between average $\bar{f} = \frac{1}{J} \sum_{j = 1}^{J} {\hat{f}}_{D_{j}}$ and $f^{⋆}$
- error from wrong assumptions in algorithm ("underfitting")
Variance $\sim$ how far ${\hat{f}}_{D_{j}}$ is from $\bar{f}$
- error from sensitivity to noise ("overfitting")
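The two definitions above can be estimated by simulation. A rough sketch, assuming a ground truth $f^{⋆} (x) = \sin (2 π x)$, Gaussian noise and polynomial models (all of these are illustrative choices, not from the lecture). The inputs are kept fixed, so only the noise changes between the $J$ datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_star(x):
    return np.sin(2 * np.pi * x)           # assumed ground-truth function

x_train = np.linspace(0.0, 1.0, 30)        # fixed inputs, only the noise is resampled
x_eval = np.linspace(0.0, 1.0, 200)
J, sigma = 200, 0.3                        # number of datasets D_j, noise std

def bias_variance(degree):
    # Fit one polynomial per noisy dataset D_j and evaluate it on x_eval.
    preds = np.empty((J, x_eval.size))
    for j in range(J):
        y_j = f_star(x_train) + sigma * rng.normal(size=x_train.size)
        coeffs = np.polyfit(x_train, y_j, degree)
        preds[j] = np.polyval(coeffs, x_eval)
    f_bar = preds.mean(axis=0)                        # average model
    bias_sq = np.mean((f_bar - f_star(x_eval)) ** 2)  # squared distance to f_star
    variance = np.mean((preds - f_bar) ** 2)          # spread around the average
    return bias_sq, variance

bias_lo, var_lo = bias_variance(degree=1)   # simple model
bias_hi, var_hi = bias_variance(degree=5)   # more flexible model
```

In this setup the degree-1 model should show the larger bias and the degree-5 model the larger variance.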

-> increasing model complexity beyond the point where training error $= 0$ can lead to a second decrease in generalization error ("double descent", more than just a U-curve)

splitting data

Train/Test split: typically $\frac{9}{10}$ train, $\frac{1}{10}$ test; the model is fit on the training data and evaluated on the test data
Train/Validation/Test split: e.g. $50\% - 25\% - 25\%$; use the validation set to tune hyperparameters, use the test set to get an unbiased estimate of the generalization error
K-Fold Cross-Validation (CV): Dividing the training data into $K$ subsets (folds). The model is trained $K$ times, each time using a different fold as the validation set and the remaining $K - 1$ folds for training.
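A small sketch of K-fold CV with plain NumPy. The helper names `k_fold_indices` and `cv_mse` are made up for illustration, and the polynomial model is just a stand-in for whatever model is being tuned.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)       # shuffle before splitting
    folds = np.array_split(perm, k)         # K nearly equal folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

def cv_mse(x, y, degree, k=5):
    """Average validation MSE of a degree-`degree` polynomial fit over K folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(x), k):
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        pred = np.polyval(coeffs, x[val_idx])
        scores.append(np.mean((pred - y[val_idx]) ** 2))
    return float(np.mean(scores))
```

Usage idea: compute `cv_mse(x, y, degree)` for several degrees and keep the degree with the lowest score; the held-out test set is only touched afterwards.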

control model complexity
Complex models (e.g. high-degree polynomials) can end up with large weights and oscillate between data points. To counteract:

Lasso Regression ( $ℓ_{1}$ Regularization)
Adds a penalty proportional to the absolute value of the coefficients.

{\hat{w}}_{lasso} = \operatorname{argmin}_{w} \| y - Φ w \|_{2}^{2} + λ \| w \|_{1}

Effect: Induces sparsity, meaning it sets some coefficients to exactly zero, effectively performing feature selection.

Ridge Regression ( $ℓ_{2}$ Regularization)
Adds a penalty proportional to the square of the magnitude of coefficients.

{\hat{w}}_{ridge} = \operatorname{argmin}_{w} \| y - Φ w \|_{2}^{2} + λ \| w \|_{2}^{2}

Effect: Shrinks all coefficients toward zero but rarely makes them exactly zero.
Analytical Solution: ${\hat{w}}_{ridge} = (Φ^{⊤} Φ + λ I)^{- 1} Φ^{⊤} y$.
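A short sketch of the closed-form ridge solution, assuming polynomial features on hypothetical 1-D data. The function name `ridge_fit` is illustrative.

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Closed-form ridge solution (Phi^T Phi + lam I)^{-1} Phi^T y."""
    d = Phi.shape[1]
    # Solve the linear system instead of forming the inverse explicitly.
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=20)             # hypothetical 1-D inputs
y = np.sin(3 * x) + 0.1 * rng.normal(size=20)   # hypothetical noisy targets
Phi = np.vander(x, N=10, increasing=True)       # degree-9 polynomial features

w_small = ridge_fit(Phi, y, lam=1e-6)   # almost unregularized
w_large = ridge_fit(Phi, y, lam=1.0)    # noticeably shrunk
```

Comparing `np.linalg.norm(w_small)` and `np.linalg.norm(w_large)` shows the shrinkage: the larger $λ$ gives the smaller weight norm.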

#todo why does lasso induce sparsity?
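One common way to see it (a sketch under the extra assumption of orthonormal features, $Φ^{⊤} Φ = I$, which is not stated in these notes): with $z = Φ^{⊤} y$ the lasso objective decouples per coordinate into $w_{j}^{2} - 2 z_{j} w_{j} + λ | w_{j} |$, whose minimizer is the soft-threshold $sign (z_{j}) \max (| z_{j} | - λ / 2, 0)$. Ridge gives $z_{j} / (1 + λ)$ instead. So lasso sets every coordinate with $| z_{j} | ≤ λ / 2$ exactly to zero, while ridge only rescales.

```python
import numpy as np

def lasso_orthonormal(z, lam):
    """Per-coordinate lasso solution when Phi^T Phi = I: soft-threshold z at lam / 2."""
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)

def ridge_orthonormal(z, lam):
    """Per-coordinate ridge solution when Phi^T Phi = I: uniform shrinkage."""
    return z / (1.0 + lam)

z = np.array([3.0, -0.4, 0.2, -2.0])   # hypothetical values of Phi^T y
w_lasso = lasso_orthonormal(z, lam=1.0)
w_ridge = ridge_orthonormal(z, lam=1.0)
```

Here the two small entries of `z` are zeroed by lasso, while ridge keeps all four coordinates nonzero.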

#timestamp 2026-03-20

Pasted image 20260320143223.png

check whether kernel is valid:
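A kernel is valid if it is symmetric and its Gram matrix is positive semidefinite for every finite set of points. The sketch below only tests this on one random sample, so it can expose an invalid kernel but cannot prove validity. The function names and the two example kernels are illustrative choices.

```python
import numpy as np

def gram_matrix(kernel, X):
    """Gram matrix K[i, j] = kernel(x_i, x_j) for the rows of X."""
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def looks_valid(kernel, X, tol=1e-8):
    """True if the Gram matrix on X is symmetric and PSD up to `tol`."""
    K = gram_matrix(kernel, X)
    if not np.allclose(K, K.T):
        return False
    eigvals = np.linalg.eigvalsh(K)       # eigenvalues of a symmetric matrix
    return bool(eigvals.min() >= -tol)

def rbf(a, b):
    return np.exp(-np.sum((a - b) ** 2) / 2.0)   # Gaussian kernel, known to be valid

def neg_dist(a, b):
    return -np.linalg.norm(a - b)                # symmetric, but not PSD

X = np.random.default_rng(0).normal(size=(20, 3))   # hypothetical sample points
```

`looks_valid(rbf, X)` should pass, while `looks_valid(neg_dist, X)` should fail because that Gram matrix has zero trace and therefore a negative eigenvalue.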

#timestamp 2026-03-24

computation at each layer consists of three parts:

$w$ , trainable weights
$b$ , biases, which shift our functions (linear -> affine)
$ϕ / φ$ , activation functions, which rescale the values and introduce non-linearity -> allow the network to learn complex patterns
- possible activation functions: ReLU, Sigmoid

forward propagation

h^{(0)} = x

hidden layers can be written as matrix-multiplication (pre-activation value $z$ , activation value $h$ ):

\begin{aligned} z^{(l)} & = W^{(l)} h^{(l - 1)} + b^{(l)} & l = 1, \dots, L - 1 \\ h^{(l)} & = ϕ (z^{(l)}) \end{aligned}

f = W^{(L)} h^{(L - 1)} + b^{(L)}

=> for a multilayer perceptron with a single hidden layer:

y = f (\vec{x}, w, θ) = \sum_{j = 1}^{p} w_{j}^{(2)} φ (w_{j}^{(1) ⊤} x + w_{j, 0}^{(1)})
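The forward pass above, as a minimal NumPy sketch. ReLU, the layer sizes and the random weights are assumptions for illustration. The output bias is set to zero so the result can be compared with the single-hidden-layer sum formula.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, weights, biases, phi=relu):
    """Forward pass: h^(0) = x, hidden layers apply phi, the output layer is affine."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = W @ h + b          # pre-activation z^(l)
        h = phi(z)             # activation h^(l)
    return weights[-1] @ h + biases[-1]   # f = W^(L) h^(L-1) + b^(L)

rng = np.random.default_rng(0)
d, p = 4, 6                                       # input dimension, hidden units (hypothetical)
W1, b1 = rng.normal(size=(p, d)), rng.normal(size=p)
W2, b2 = rng.normal(size=(1, p)), np.zeros(1)     # zero output bias to match the sum formula
x = rng.normal(size=d)

f_matrix = forward(x, [W1, W2], [b1, b2])
# Same network written as the explicit sum over hidden units j = 1..p.
f_sum = sum(W2[0, j] * relu(W1[j] @ x + b1[j]) for j in range(p))
```

Row `W1[j]` plays the role of $w_{j}^{(1)}$, `b1[j]` of $w_{j, 0}^{(1)}$ and `W2[0, j]` of $w_{j}^{(2)}$.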

backpropagation:
https://xnought.github.io/backprop-explainer/
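A rough sketch of backpropagation for a single hidden ReLU layer, assuming the squared loss $\frac{1}{2} (f - y)^{2}$ and random weights (both are illustrative choices, not taken from the lecture). The gradients follow the chain rule from the output back towards the input.

```python
import numpy as np

def forward_backward(x, y, W1, b1, W2, b2):
    """Loss and gradients for one hidden ReLU layer with loss 0.5 * (f - y)^2."""
    # Forward pass, keeping intermediate values for the backward pass.
    z1 = W1 @ x + b1
    h1 = np.maximum(z1, 0.0)
    f = W2 @ h1 + b2                    # output, shape (1,)
    loss = 0.5 * np.sum((f - y) ** 2)

    # Backward pass: apply the chain rule from the output back to the input.
    df = f - y                          # dL/df
    dW2 = np.outer(df, h1)              # dL/dW2
    db2 = df                            # dL/db2
    dh1 = W2.T @ df                     # dL/dh1
    dz1 = dh1 * (z1 > 0)                # ReLU derivative: 1 where z1 > 0, else 0
    dW1 = np.outer(dz1, x)              # dL/dW1
    db1 = dz1                           # dL/db1
    return loss, (dW1, db1, dW2, db2)

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 1.0          # hypothetical input and target
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)
loss, (dW1, db1, dW2, db2) = forward_backward(x, y, W1, b1, W2, b2)
```

A useful sanity check is to compare these gradients with finite differences of the loss, perturbing one weight at a time.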