Notes for PRML

9 minute read

Published:

finally have time to read PRML, mainly focusing on the fundamental knowledge I am missing and finding the theoretical basis of empirical knowledge

Chapter 1

already starting to forget lots of algorithms

  • what is Huffman coding? greedy algorithm!

nice proof of the maximum-entropy dist.

  • discrete variable: uniform
  • continuous variable (fixed mean and variance): Gaussian
    • and the (differential) entropy can now be negative (quick check below)
    • Exercise 1.34
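For reference, the differential entropy of a univariate Gaussian is

$$H[x] = \frac{1}{2}\left\{1 + \ln(2\pi\sigma^2)\right\},$$

which is negative whenever $\sigma^2 < 1/(2\pi e)$.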

mutual information

  • KL(p(x, y) || p(x) p(y))
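A tiny numeric check (the joint table below is made up) of mutual information as the KL divergence between the joint and the product of the marginals:

```python
import numpy as np

# toy joint distribution p(x, y) over two binary variables (made-up numbers)
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])

p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x), column vector
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y), row vector

# I(x; y) = KL(p(x, y) || p(x) p(y)) = sum_{x,y} p(x, y) ln[p(x, y) / (p(x) p(y))]
mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))
print(mi)   # zero iff x and y are independent
```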

Chapter 2

Beta dist.

  • prior for the Bernoulli dist.’s parameter mu
  • the posterior dist. is obtained by multiplying the beta prior by the likelihood function and normalizing
  • applied to the multinomial case: the Dirichlet dist.
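A minimal sketch of the conjugate update (toy data; a and b are just the usual Beta hyperparameters): multiplying the beta prior by the Bernoulli likelihood simply adds the observed counts to the hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = 0.7
x = rng.random(100) < mu_true       # Bernoulli observations with mean mu_true

a, b = 2.0, 2.0                     # Beta(a, b) prior over mu
m = int(x.sum())                    # number of ones
l = len(x) - m                      # number of zeros

a_post, b_post = a + m, b + l       # posterior is Beta(a + m, b + l)
print(a_post / (a_post + b_post))   # posterior mean, approaches mu_true as data grows
```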

as more data is observed, the posterior mean -> the true mean and the posterior var -> zero (on average)

  • E[theta] = E_D[E[theta | D]]
  • var[theta] = E_D[var[theta | D]] + var_D[E[theta | D]]

Robbins-Monro algorithm

  • find theta s.t. f(theta) = E[z | theta] = 0, online
  • theta_N = theta_{N-1} + a_{N-1} z(theta_{N-1}), where z(theta_{N-1}) is a noisy observation of f
  • apply to MLE
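As a sketch of the MLE connection, sequential estimation of a Gaussian mean is a Robbins-Monro instance with a_N = 1/N (the data stream here is my own toy example):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.0                                   # running estimate of the mean
for n, x in enumerate(rng.normal(3.0, 1.0, size=10_000), start=1):
    # theta_N = theta_{N-1} + a_{N-1} z(theta_{N-1}), with z = x_N - theta and a = 1/N
    theta = theta + (1.0 / n) * (x - theta)
print(theta)                                  # close to the true mean 3.0
```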

periodic variables

  • conventional statistics would depend on the choice of origin
  • use points on the unit circle to get an origin-invariant mean and var (sketch after this list)
  • generalized Gaussian on theta
    • p(theta) ∝ exp{m cos(theta - theta_0)} (the von Mises dist.)
  • where to use it?
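A minimal sketch of the origin-invariant circular statistics (the angles below are toy values): map each angle to a point on the unit circle, average, and read off the angle and length of the resultant.

```python
import numpy as np

theta = np.array([0.1, 0.3, 6.2, 6.1])                 # angles in radians (toy data)
s, c = np.sin(theta).mean(), np.cos(theta).mean()      # average point on the unit circle

mean_dir = np.arctan2(s, c) % (2 * np.pi)              # circular mean, invariant to the origin
circ_var = 1 - np.hypot(s, c)                          # one common definition of circular variance
print(mean_dir, circ_var)
```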

exponential family

  • the MLE of the parameters (e.g. mu) depends on the data only through the sufficient statistics
    • no need to store the whole dataset

noninformative priors

  • technically we need a prior even when we know nothing about the true dist.
  • Gaussian could be a good one

kernel density estimator

  • kind of like attention: each training point is weighted by a kernel evaluated at the query
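A minimal Gaussian kernel density estimator (the bandwidth h and the data are my choices); the attention-like weighting of training points is explicit in the kernel matrix:

```python
import numpy as np

def kde(x_query, x_train, h=0.3):
    """p(x) ~= (1/N) * sum_n N(x | x_n, h^2), evaluated at each query point."""
    d = (x_query[:, None] - x_train[None, :]) / h           # pairwise scaled differences
    w = np.exp(-0.5 * d**2) / (h * np.sqrt(2 * np.pi))      # Gaussian kernel weights
    return w.mean(axis=1)                                   # average over training points

rng = np.random.default_rng(2)
x_train = rng.normal(size=200)
print(kde(np.array([-1.0, 0.0, 1.0]), x_train))
```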

Chapter 3

linear model?

  • linear w.r.t. parameters
    • y(x, w) = w^T phi(x)
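A small sketch of “linear in w, nonlinear in x” with a polynomial basis (degree, data, and noise level are mine):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=50)
t = np.sin(np.pi * x) + 0.1 * rng.normal(size=50)    # noisy targets

Phi = np.vander(x, N=4, increasing=True)             # phi(x) = [1, x, x^2, x^3]
w = np.linalg.pinv(Phi) @ t                          # least-squares fit via the pseudo-inverse
print(w)                                             # y(x, w) = w^T phi(x)
```

The same pseudo-inverse shows up again for least-squares classification in Chapter 4.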

Chapter 4

one-versus-one and one-versus-the-rest

  • what about points in the intersection?

least square

  • pseudo-inverse
  • if each t_n sums to one, then every prediction made will also sum to one
    • but not necessarily a prob. dist. ==> components may be negative!
  • not robust
    • sensitive to outlier data

fisher

  • project to 1D, maximizing the separation between classes relative to the within-class spread
  • in the multi-class case: maximize the between-class cov. while minimizing the within-class cov.
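A two-class sketch of Fisher’s criterion on toy Gaussian clusters: project onto w proportional to S_W^{-1}(m_2 - m_1).

```python
import numpy as np

rng = np.random.default_rng(4)
X1 = rng.normal([0, 0], 1.0, size=(100, 2))     # class 1 (toy data)
X2 = rng.normal([3, 2], 1.0, size=(100, 2))     # class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)    # within-class scatter
w = np.linalg.solve(S_w, m2 - m1)                          # w ∝ S_w^{-1} (m2 - m1)
w /= np.linalg.norm(w)
print(w)    # 1D projection maximizing between-class over within-class spread
```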

revisit perceptron

  • the update rule is basically SGD (on the perceptron criterion)

viewing the sigmoid function as Bayes’ formula ==> interesting!

  • generalize to multiclass cases => softmax
  • apply to class-conditional densities p(x | C_i) = N(x | mu_i, Sigma) with shared covariance
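Writing the “sigmoid as Bayes’ formula” observation out for the two-class case:

$$p(C_1 \mid x) = \frac{p(x \mid C_1)\,p(C_1)}{p(x \mid C_1)\,p(C_1) + p(x \mid C_2)\,p(C_2)} = \sigma(a), \qquad a = \ln\frac{p(x \mid C_1)\,p(C_1)}{p(x \mid C_2)\,p(C_2)};$$

with shared-covariance Gaussian class-conditionals, a is linear in x, and for K > 2 classes the same rearrangement gives the softmax.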

MLE: severe overfitting for linearly separable datasets

  • any separating boundary w^T phi = 0 fits the data, and scaling up ‖w‖ keeps increasing the likelihood ==> w explodes

Newton’s method: iterative reweighted least squares

  • closed-form update at each iteration; no need to worry about computing the Hessian matrix (it is simply Phi^T R Phi)
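A minimal IRLS sketch for two-class logistic regression (data, basis, and iteration count are my choices); each Newton step uses the closed-form Hessian Phi^T R Phi:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(1, 1, size=(50, 2))])
t = np.concatenate([np.zeros(50), np.ones(50)])      # toy two-class data
Phi = np.hstack([np.ones((100, 1)), X])              # bias feature + raw inputs

w = np.zeros(Phi.shape[1])
for _ in range(10):                                  # Newton / IRLS iterations
    y = sigmoid(Phi @ w)
    R = y * (1 - y)                                  # diagonal weights
    grad = Phi.T @ (y - t)
    H = Phi.T @ (Phi * R[:, None]) + 1e-8 * np.eye(Phi.shape[1])  # Phi^T R Phi (+ tiny jitter)
    w = w - np.linalg.solve(H, grad)
print(w)
```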

choice of link function (the inverse of the activation function) simplifies the gradient

Laplace approx.

  • find the mode, and the second-order derivative (Hessian) of the log density there
    • construct a Gaussian centered at the mode, with precision given by the negative Hessian
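A 1D sketch of the recipe (the unnormalized density below is a toy choice): locate the mode numerically and set the Gaussian precision to the negative second derivative of the log density there.

```python
import numpy as np

def log_p(z):
    # unnormalized log density: Gaussian times a steep sigmoid (toy example)
    return -0.5 * z**2 + np.log(1.0 / (1.0 + np.exp(-20 * z - 4)))

z_grid = np.linspace(-5, 5, 100_001)
z0 = z_grid[np.argmax(log_p(z_grid))]            # crude mode search on a grid

h = 1e-4                                         # finite-difference second derivative at the mode
A = -(log_p(z0 + h) - 2 * log_p(z0) + log_p(z0 - h)) / h**2
print(z0, 1.0 / A)                               # Laplace approx: q(z) = N(z | z0, 1/A)
```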

Bayesian information criterion

  • ln p(D) ~= ln p(D | theta_MAP) - (1/2) M ln N
    • find proof online
  • intuitive to assume Gaussian prior for Bayesian approach
    • mean and cov from likelihood functions

Chapter 5

symmetry in weight space

  • quite interesting
  • …but how to use that?

tangent propagation

  • when the input is transformed, keep the output the same as before

inverse problems

  • kind of like Bayesian
  • least square works under Gaussian assumption
    • what if that fails?
    • multimodal assumption (mixture density networks)!

Chapter 6

training data points are kept during the prediction phase

linear model with regularized least-squares loss

  • entirely in terms of kernel function
  • no need to introduce feature vector
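A sketch of this dual form (RBF kernel, lambda, and data are my choices): the fit and the predictions touch only the Gram matrix, never an explicit feature vector.

```python
import numpy as np

def rbf(A, B, gamma=10.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)    # squared distances
    return np.exp(-gamma * d2)

rng = np.random.default_rng(6)
X = rng.uniform(-1, 1, size=(50, 1))
t = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=50)

lam = 1e-2
K = rbf(X, X)                                        # Gram matrix
a = np.linalg.solve(K + lam * np.eye(len(X)), t)     # dual coefficients a = (K + lam I)^{-1} t
X_new = np.array([[0.0], [0.5]])
print(rbf(X_new, X) @ a)                             # prediction = k(x_new)^T a
```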

fisher kernel

  • k(x, x’) = g(theta, x)^T F^{-1} g(theta, x’)
  • F = E_x[g g^T] (the Fisher information matrix)
  • whitening of the Fisher score

kernel regression

Chapter 7

sparse kernel?

  • the kernel function needs to be evaluated only at a subset of the training data at prediction time

SVM as a classifier

  • build a sigmoid layer on top to get a conditional prob.
  • hinge loss (along with the logistic loss) as an approx. to the misclassification error

SVM for regression

RVM

  • Bayesian linear regression model
  • each weight has a zero-mean Gaussian prior with its own precision
    • many weights are driven to zero by the optimization (sparsity)
  • reading paper later: NIPS 2000
    • done

Chapter 8

plate (box) notation for repeated nodes

d-separation

since Chap. 9 and later chapters are all covered in Modern Computational Statistics, no need to take notes.

  • no, I still need to take notes

Chapter 9

introducing latent vars -> the marginal distribution can be expressed in terms of the more tractable joint distribution over observed and latent variables

K-means

  • find an assignment of data points to clusters together with the cluster centers
  • minimize a distortion measure: Euclidean distance or other dissimilarity measures (minimal sketch after this list)
  • helpful to linearly re-scale the data to zero mean and unit standard deviation
  • K-means is used to initialize Gaussian mixtures
  • online K-means: further reading needed!
  • application to image segmentation: cluster the pixel values
    • vector quantization, code-book vectors
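A minimal K-means sketch (Euclidean distortion, centers initialized at random data points; K and the data are mine):

```python
import numpy as np

def kmeans(X, K, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]            # init at random data points
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)                               # assign to nearest center
        centers = np.stack([X[assign == k].mean(axis=0) if np.any(assign == k) else centers[k]
                            for k in range(K)])                  # recompute centers
    return centers, assign

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in ([0, 0], [2, 2], [0, 2])])
centers, assign = kmeans(X, K=3)
print(centers)
```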

mixtures of Gaussians

  • ancestral sampling: sample in topological order
  • avoid singularities: one Gaussian collapses onto a single data point, sigma -> 0, and the likelihood blows up (overfitting)
  • identifiability: K! equivalent solutions available, how to interpret?
  • EM (essentially coordinate ascent), not guaranteed to reach the global maximum

alternative view of EM

  • observed data X, latent variable Z
  • motivation: direct MLE is hard since the sum over the latent Z sits inside the logarithm
    • E step: estimate the latent posterior (using Bayes’ formula), so that the sum moves outside the log
    • M step: then update the parameters by maximizing the expected complete-data LL
  • for MAP problems, max. LL + ln p(theta)
  • also applies to missing observations, if the values are missing at random
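A minimal EM sketch for a two-component 1D Gaussian mixture (toy data and initialization are mine), following the E step / M step split above:

```python
import numpy as np

rng = np.random.default_rng(11)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(1, 1.0, 700)])   # toy 1D data

pi = np.array([0.5, 0.5])            # mixing coefficients
mu = np.array([-1.0, 0.5])           # component means
var = np.array([1.0, 1.0])           # component variances

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

for _ in range(100):
    # E step: responsibilities via Bayes' formula (posterior over the latent component)
    r = pi * normal_pdf(x[:, None], mu, var)
    r /= r.sum(axis=1, keepdims=True)
    # M step: maximize the expected complete-data log-likelihood
    Nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi = Nk / len(x)
print(pi, mu, var)
```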

variational distribution view of EM

  • introduce q(Z): the E step minimizes the KL (w.r.t. q), the M step maximizes L (w.r.t. theta); the decomposition is written out after this list
  • easy if p(X, Z) is in the exponential family, since the logarithm simplifies
  • GEM: increase L instead of max. L
  • ECM: do several constrained optimizations in M step
  • online EM?
    • E step easy, M step just keep the sufficient statistics
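The decomposition behind this view, for reference:

$$\ln p(X \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}\big(q(Z)\,\|\,p(Z \mid X, \theta)\big), \qquad \mathcal{L}(q, \theta) = \sum_Z q(Z) \ln\frac{p(X, Z \mid \theta)}{q(Z)};$$

the E step sets q to the posterior (driving the KL term to zero), and the M step maximizes L with respect to theta.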

Chapter 10

infeasibility of evaluating the posterior distribution exactly

  • and dimensionality and complexity prohibit numerical integration
  • variational distribution!
    • restrict the range of functions over which the optimization is performed
    • rich and flexible family while being tractable

general expression

$\ln q_j^{\ast}(z_j) = \mathbb{E}_{i \neq j}[\ln p(x, z)] + \text{const}$

factorized variational approximations tend to give a too-compact (under-dispersed) approximation to the posterior distribution

  • a consequence of minimizing KL(q || p) rather than KL(p || q)

model comparison

  • directly comparing p(m | X) is hard, and arguably not meaningful
  • consider q(Z, m) = q(Z | m) q(m)
  • optimize each q(Z | m) individually

mixture of Gaussian

  • some conjugate priors: Gaussian-Wishart, gamma
  • variational treatment
    • joint distribution over observed variables, parameters and latent variables
    • factorizes between the latent variables and the parameters
      • induced factorizations: arise from the assumed factorization together with conditional independence (d-separation)
    • apply general expression
  • VI: no singularities, since a penalty arises from the prior distribution
    • thus no overfitting even when a large number of components is used
    • the result reduces to EM given an infinitely broad prior
  • approx. predictive density: the variational posterior replaces the true posterior distribution

choose K

  • the MLE likelihood function increases monotonically with K
  • while VI does not

exponential family

  • difference between parameters and latent variables? the latter are extensive: their number scales with the size of the dataset
  • the prior distribution can be interpreted as a number of effective prior observations
  • VI on a graph: a variable is updated once all variables in its Markov blanket are given -> it receives messages from its parents and children (and co-parents)

local variational methods

  • local: bounds on individual variables or groups of variables, in contrast to a global approximation over all variables
  • i.e., finding bounds over a single variable / a set of variables
  • dual (conjugate) function: refresher on convex optimization
    • bound the logistic sigmoid function by a Gaussian-form (exponential of a quadratic) function
    • which leads to an analytically integrable bound
    • but note this approximation is not a probability distribution; it cannot be normalized and still remain a valid bound
      • EM update for the variational parameters
  • variational parameters (instead of random variables); optimizing them improves the accuracy of the approximation
  • suited for sequential learning
  • EP: refer to textbook and slides
    • refine one approximating factor at a time by moment matching (a property of the exponential family)

Chapter 11

sampling methods

  • find the expectation of some function f(z) w.r.t. a prob. distribution p(z)
    • accuracy does not depend on dimensionality of z
  • standard sampling methods: transform uniformly distributed random numbers using inverse of CDF
  • rejection sampling
    • sample from a proposal distribution q(z), accept with prob. p(z) / (k q(z)), where k q(z) >= p(z) everywhere
    • adaptive rejection sampling: for log-concave p(z)
      • the envelope function is a piecewise exponential distribution
  • importance sampling: E[f] = sum over z of p(z) f(z), approximated with weighted samples from a proposal q (sketch after this list)
    • sample points where p(z) is large
    • importance weights: p(z) / q(z); works when q(z) matches p(z), or at least q(z) should not be small wherever p(z) is large
  • sampling algorithms can approximate the E step in EM
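A small importance-sampling sketch (the target, the proposal, and f are my toy choices): average f under samples from q, weighted by p/q.

```python
import numpy as np

rng = np.random.default_rng(8)

def p(z):                                    # target density: standard normal
    return np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def q(z):                                    # broader proposal density: N(0, 2^2)
    return np.exp(-z**2 / 8) / np.sqrt(8 * np.pi)

z = rng.normal(0, 2, size=100_000)           # samples from the proposal q
w = p(z) / q(z)                              # importance weights
f = z**2                                     # E_p[z^2] = 1
print(np.mean(w * f), np.mean(w))            # estimate of E_p[f]; mean weight should be ~1
```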

MCMC

  • maintain a record of the current state, plus a proposal distribution that depends on the current state (and is simple to sample from)
  • check detailed balance to ensure the target distribution is invariant (minimal sketch below)
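A minimal random-walk Metropolis sketch (symmetric Gaussian proposal, toy 1D two-mode target); the symmetric proposal plus this acceptance rule is what makes detailed balance hold.

```python
import numpy as np

rng = np.random.default_rng(9)

def p_tilde(z):                              # unnormalized target: mixture of two Gaussians
    return np.exp(-0.5 * (z - 2) ** 2) + np.exp(-0.5 * (z + 2) ** 2)

z, samples = 0.0, []
for _ in range(50_000):
    z_prop = z + rng.normal(0.0, 1.0)                            # symmetric random-walk proposal
    if rng.random() < min(1.0, p_tilde(z_prop) / p_tilde(z)):    # Metropolis acceptance
        z = z_prop
    samples.append(z)
print(np.mean(samples))                      # ~0 for this symmetric two-mode target
```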

Gibbs sampling

  • draw one variable at a time, conditioned on the remaining variables
  • requires the Markov chain to be ergodic
  • blocking Gibbs sampling: sample groups of variables jointly, conditioned on the rest

slice sampling

  • adaptive step size
  • sample u uniformly from [0, p(z)], then sample z uniformly from the slice {z : p(z) > u}
    • in practice, choose a bracket [z_min, z_max] around the current z and sample within it

HMC

  • refer to slides

estimate partition function

  • do not understand

Chapter 12

continuous latent variables

continuous dataset

  • all data points lie close to a manifold of lower dimensionality
  • interpret the departures of data points from the manifold as noise
    • a generative view of models

PCA

  • orthogonal projection of data onto a lower dimensional linear space
    • while maximizing the variance of the projected data
  • also minimizes the average projection cost
  • compression of the dataset, or data pre-processing (standardization)
  • for high-dimensional data: the O(D^3) eigendecomposition is computationally infeasible
    • note that X^T X and X X^T share the same nonzero eigenvalues, with eigenvectors related through multiplication by X (sketch below)
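A sketch of that trick (shapes are mine): eigendecompose the N x N matrix (1/N) X X^T instead of the D x D covariance, then map the eigenvectors back through X^T.

```python
import numpy as np

rng = np.random.default_rng(10)
N, D, M = 50, 2000, 5                        # few samples, many dimensions
X = rng.normal(size=(N, D))
X = X - X.mean(axis=0)                       # center the data

evals, V = np.linalg.eigh(X @ X.T / N)       # N x N problem instead of D x D
idx = np.argsort(evals)[::-1][:M]            # M largest eigenvalues
U = X.T @ V[:, idx]                          # map back: u_i ∝ X^T v_i
U /= np.linalg.norm(U, axis=0)               # unit-length principal directions
print(evals[idx])                            # same nonzero eigenvalues as the covariance
```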

probabilistic PCA

  • PCA as MLE solution of a probabilistic latent variable model
  • x = Wz + mu + eps
    • the MLE is W = U_M (L_M - sigma^2 I)^{1/2} R, with R an arbitrary rotation matrix
    • where the columns of U_M are the eigenvectors corresponding to the M largest eigenvalues (the diagonal of L_M)
    • correctly captures the variance of the data along the principal axes, and approximates the variance in all remaining directions with a single average value sigma^2
  • EM for PCA: computationally efficient for large-scale applications
    • can also be implemented in an online form, since only sufficient statistics are involved

kernel PCA, ICA: refer to notes