links - IntML
Info

  • exam
    • multiple choice, fully on paper
  • projects
    • 3/4 have to "pass" to be able to do the exam
    • 1min video explaining solution for each project needed, uploaded to polybox
  • practice classes

Permitted aids IntroductionToMachineLearning

  • 2 A4 pages (1 sheet) handwritten / >11pt
  • calculator

Exercises

-> FS2026_tasks

Projects

Lecture

#timestamp 2026-02-27

#todo completely useless - delete or rewrite

We have:

$$\hat{w} = w_s + w_p, \qquad w_s \in \operatorname{span}(X^\top) = \operatorname{span}(X^\top X)$$

normal equation:

$$X^\top X \omega = X^\top y \quad\Rightarrow\quad \omega = (X^\top X)^{-1} X^\top y = X^{+} y$$
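
A minimal NumPy sketch of the normal equation, assuming $X^\top X$ is invertible. The data, `w_true`, and the noise level are made up for illustration; the pseudoinverse route is shown next to it because it also covers the singular case (minimum-norm solution).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))             # n=50 samples, d=3 features (synthetic)
w_true = np.array([1.0, -2.0, 0.5])      # hypothetical ground-truth weights
y = X @ w_true + 0.01 * rng.normal(size=50)

# Solve X^T X w = X^T y directly (requires X^T X to be invertible).
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Pseudoinverse solution; identical here, and still defined if X^T X is singular.
w_pinv = np.linalg.pinv(X) @ y

print(w_normal, w_pinv)
```

Using `np.linalg.solve` instead of forming the inverse explicitly is the usual choice for numerical stability.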

#timestamp 2026-03-03

Variance is not caused by noise alone; it can also come from having too few samples.

overfitting $\hat{=}$ too sensitive to noise

#timestamp 2026-03-04

-> increasing model complexity beyond the point where the training error reaches 0 can lead to a second decrease in generalization error (double descent, more than just the U-curve)

splitting data

control model complexity
Complex model (e.g. high-degree polynomial) might make the weights large / oscillate. To counteract:

  1. Smaller degree m
  2. Smaller number of “active” monomials by limiting the ℓ1-norm
  3. Limit the ℓ2-norm

Lasso Regression (ℓ1 Regularization)
Adds a penalty proportional to the absolute value of the coefficients.

$$\hat{w}_{\text{lasso}} = \arg\min_w \|y - \Phi w\|_2^2 + \lambda \|w\|_1$$

Ridge Regression (ℓ2 Regularization)
Adds a penalty proportional to the square of the magnitude of coefficients.

$$\hat{w}_{\text{ridge}} = \arg\min_w \|y - \Phi w\|_2^2 + \lambda \|w\|_2^2$$
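
Setting the gradient of the ridge objective to zero gives the closed form $(\Phi^\top \Phi + \lambda I)\,w = \Phi^\top y$. A small sketch of that, assuming degree-9 polynomial features on synthetic data (the helper name `ridge_fit`, the target function, and the two values of λ are illustrative choices, not from the lecture):

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Closed-form ridge solution: solve (Phi^T Phi + lam * I) w = Phi^T y."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
Phi = np.vander(x, N=10, increasing=True)      # monomials 1, x, ..., x^9
y = np.sin(3 * x) + 0.1 * rng.normal(size=20)  # synthetic noisy targets

w_small = ridge_fit(Phi, y, lam=1e-6)  # almost unregularized
w_large = ridge_fit(Phi, y, lam=1.0)   # stronger penalty

# A larger lambda shrinks the weight norm.
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```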

#todo why does lasso induce sparsity?

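| Feature | Ridge (ℓ2) | Lasso (ℓ1) |
| --- | --- | --- |
| Penalty | $\lambda \sum_j w_j^2$ | $\lambda \sum_j \lvert w_j \rvert$ |
| Solution | Closed-form | Numerical optimization |
| Sparsity | No (coefficients $\approx 0$) | Yes (coefficients $= 0$) |
| Use case | Preventing overfitting | Feature selection |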
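
To see the sparsity difference numerically, here is a sketch that solves the lasso objective with proximal gradient descent (ISTA) and compares it with closed-form ridge. ISTA is one possible solver chosen for brevity, not necessarily the one from the lecture; the data, λ = 20, and the function names are assumptions. The soft-thresholding step is what sets small coefficients to exactly zero.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink toward 0, clip at exactly 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(Phi, y, lam, n_iter=2000):
    """Minimize ||y - Phi w||_2^2 + lam * ||w||_1 by proximal gradient steps."""
    # Step size 1/L, where L = 2 * sigma_max(Phi)^2 bounds the gradient's Lipschitz constant.
    step = 1.0 / (2.0 * np.linalg.norm(Phi, 2) ** 2)
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * Phi.T @ (Phi @ w - y)
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:2] = [3.0, -2.0]                 # only two relevant features (synthetic)
y = Phi @ w_true + 0.1 * rng.normal(size=100)

w_lasso = lasso_ista(Phi, y, lam=20.0)
w_ridge = np.linalg.solve(Phi.T @ Phi + 20.0 * np.eye(10), Phi.T @ y)

print("nonzero lasso coefficients:", np.count_nonzero(w_lasso))
print("nonzero ridge coefficients:", np.count_nonzero(w_ridge))
```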

#timestamp 2026-03-20

Pasted image 20260320143223.png

check whether kernel is valid:

  1. check symmetry
  2. decompose into known valid kernels (using linearity)
  3. use closure rule (sum/product/scaling/composition)
  4. if unsure, build a small kernel matrix counterexample
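
Steps 1 and 4 can be checked numerically. The sketch below builds a small kernel (Gram) matrix and tests symmetry and positive semidefiniteness; the helper names and the two example kernels are illustrative assumptions. A failed check disproves validity, while a passed check on one point set proves nothing.

```python
import numpy as np

def gram(k, xs):
    """Kernel matrix K[i, j] = k(xs[i], xs[j])."""
    return np.array([[k(a, b) for b in xs] for a in xs])

def is_valid_gram(K, tol=1e-9):
    """True if K is symmetric and positive semidefinite (up to tol)."""
    if not np.allclose(K, K.T):
        return False
    return np.linalg.eigvalsh(K).min() >= -tol

xs = [0.0, 1.0]
rbf = lambda a, b: np.exp(-(a - b) ** 2)   # known valid kernel
neg_dist = lambda a, b: -abs(a - b)        # symmetric, but not a valid kernel

print(is_valid_gram(gram(rbf, xs)))
print(is_valid_gram(gram(neg_dist, xs)))   # K = [[0, -1], [-1, 0]] has eigenvalue -1
```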

10. Neural networks

#timestamp 2026-03-24

computation at each layer consists of three parts:

forward propagation

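$$
\begin{aligned}
h^{(0)} &= x \\
z^{(l)} &= W^{(l)} h^{(l-1)} + b^{(l)}, \qquad l \in \{1, \dots, L-1\} \\
h^{(l)} &= \phi(z^{(l)}) \\
f &= W^{(L)} h^{(L-1)} + b^{(L)}
\end{aligned}
$$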

=> for a multilayer perceptron with a single hidden layer:

$$y = f(x, w, \theta) = \sum_{j=1}^{p} w_j^{(2)} \varphi\left(w_j^{(1)\top} x + w_{j,0}^{(1)}\right)$$
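
The forward pass is a short loop. A sketch assuming `tanh` as the activation and random weights (both arbitrary choices for illustration); with a single hidden layer and zero output bias it reduces to the sum over the p hidden units.

```python
import numpy as np

def forward(x, weights, biases, phi=np.tanh):
    """Forward propagation through an MLP; the last layer is linear."""
    h = x                                      # h^(0) = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = W @ h + b                          # z^(l) = W^(l) h^(l-1) + b^(l)
        h = phi(z)                             # h^(l) = phi(z^(l))
    return weights[-1] @ h + biases[-1]        # f = W^(L) h^(L-1) + b^(L)

rng = np.random.default_rng(0)
d, p = 3, 4                                    # input dim, hidden units
W1, b1 = rng.normal(size=(p, d)), rng.normal(size=p)
W2, b2 = rng.normal(size=(1, p)), np.zeros(1)  # zero output bias
x = rng.normal(size=d)

f = forward(x, [W1, W2], [b1, b2])

# Same value written as the explicit sum over hidden units.
manual = sum(W2[0, j] * np.tanh(W1[j] @ x + b1[j]) for j in range(p))
print(f[0], manual)
```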

backward propagation (backpropagation):
https://xnought.github.io/backprop-explainer/

Dimensionality reduction

| Method | What it does | Best used for... |
| --- | --- | --- |
| PCA | Flattens the data like a pancake. | Simple data where things move in straight lines. |
| Kernel PCA | Bends and twists the data before flattening it. | Complex patterns like spirals or circles. |
| Autoencoder | Compresses the data into a "secret code" and learns to unpack it. | Very messy, high-tech data like photos or speech. |

PCA
PCA relies on the Covariance Matrix Σ. This matrix tells us how every dimension in your data moves in relation to every other dimension.

$$z = W^\top x$$
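
A sketch of PCA via the eigendecomposition of the covariance matrix, assuming centered data and covariance normalized by n (using n-1 instead only rescales the eigenvalues). The columns of `W` are the top-k eigenvectors; the toy data lies close to the line y = 2x.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the top-k eigenvectors of the covariance matrix."""
    Xc = X - X.mean(axis=0)                   # center the data
    Sigma = Xc.T @ Xc / len(X)                # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest
    W = eigvecs[:, order]
    Z = Xc @ W                                # each row is z = W^T x
    return Z, W, eigvals[order]

rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.05 * rng.normal(size=200)])  # synthetic 2-D data

Z, W, var = pca(X, k=1)
print(W[:, 0])   # roughly +-(1, 2)/sqrt(5), the sign is arbitrary
```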

kernel PCA
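In standard PCA, we find the eigenvectors of the data's covariance matrix. But what if we first transformed the data into a crazy high-dimensional space using a function $\phi(x)$? We can't actually compute $\phi(x)$ because it might be infinite-dimensional, so we only ever evaluate inner products through a kernel $k(x_i, x) = \phi(x_i)^\top \phi(x)$.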

$$z = \sum_{i=1}^{n} \alpha_i k(x_i, x)$$
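
A sketch of kernel PCA on the training points, assuming an RBF kernel with a hand-picked `gamma` and a kernel matrix centered in feature space. The coefficients `alphas` are eigenvectors of the centered kernel matrix scaled by $1/\sqrt{\lambda}$; the two-circle data and all names are illustrative.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def kernel_pca(X, k, gamma=1.0):
    n = len(X)
    K = rbf_kernel(X, X, gamma)
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one        # center in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)             # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:k]               # k largest
    alphas = eigvecs[:, idx] / np.sqrt(eigvals[idx])  # unit-norm directions in feature space
    Z = Kc @ alphas                                   # z_j = sum_i alpha_i k_c(x_i, x_j)
    return Z, alphas

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, size=40)
radii = np.repeat([1.0, 3.0], 20)                     # two concentric circles
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

Z, alphas = kernel_pca(X, k=2, gamma=0.5)
print(Z[:3])
```

Projecting a new point would additionally require centering its kernel vector with the training statistics, which is omitted here.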

autoencoders
This is an optimization problem. We have two functions: an Encoder E(x) and a Decoder D(z).

$$\min_{W, b} \|x - D(E(x))\|^2$$
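
A minimal sketch of this optimization, assuming a linear encoder and decoder without biases, trained by plain gradient descent on the mean reconstruction error. The learning rate, iteration count, and synthetic rank-2 data are arbitrary choices; real autoencoders use nonlinear layers and a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5))  # rank-2 data in 5-D
X -= X.mean(axis=0)
X /= X.std()                                             # unit overall scale

d, k = X.shape[1], 2                                     # input dim, code dim
W_enc = 0.1 * rng.normal(size=(k, d))                    # E(x) = W_enc x
W_dec = 0.1 * rng.normal(size=(d, k))                    # D(z) = W_dec z

def loss(W_enc, W_dec):
    """Mean squared reconstruction error ||x - D(E(x))||^2 over the data."""
    R = X @ W_enc.T @ W_dec.T - X
    return np.mean(np.sum(R**2, axis=1))

loss_before = loss(W_enc, W_dec)
lr = 0.01
for _ in range(2000):
    Z = X @ W_enc.T                                      # codes, shape (n, k)
    R = Z @ W_dec.T - X                                  # reconstruction residuals
    grad_dec = 2 * R.T @ Z / len(X)                      # dLoss/dW_dec
    grad_enc = 2 * (R @ W_dec).T @ X / len(X)            # dLoss/dW_enc
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
loss_after = loss(W_enc, W_dec)

print(loss_before, loss_after)
```

With linear maps like these, the learned code spans the same subspace as PCA, which is one way to connect the three methods above.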