Notes for PRML
Published:
finally have time to read PRML, mainly focusing on my missing fundamental knowledge and finding the theoretical basis for empirical knowledge
Chapter 1
already starting to forget lots of algorithms
- what is Huffman coding? greedy algorithm!
nice proof on maximum entropy dist.
- discrete: uniform
- continuous (with fixed mean and variance): Gaussian
- and (differential) entropy can be negative now
- Exercise 1.34
mutual information
$I[x, y] = \mathrm{KL}(p(x, y) \,\|\, p(x)p(y))$
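Spelling this out for my future self (standard definitions, nothing beyond what the chapter states):

$$
I[x, y] \equiv \mathrm{KL}\big(p(x, y)\,\|\,p(x)\,p(y)\big)
= -\iint p(x, y)\,\ln\!\left(\frac{p(x)\,p(y)}{p(x, y)}\right)\mathrm{d}x\,\mathrm{d}y
= H[x] - H[x \mid y]
$$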
Chapter 2
Beta dist.
- prior for Bernoulli dist.'s mu
- posterior dist. obtained by multiplying beta prior and likelihood function, and normalizing
- apply to multinomial case: Dirichlet dist.
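A minimal sketch of that conjugate update, assuming a Beta(a, b) prior on mu and a stream of 0/1 observations (all names and numbers below are my own toy choices):

```python
import numpy as np

def beta_bernoulli_update(a, b, observations):
    """Beta(a, b) prior x Bernoulli likelihood -> Beta(a + #ones, b + #zeros) posterior."""
    ones = int(np.sum(observations))
    zeros = len(observations) - ones
    return a + ones, b + zeros

a_post, b_post = beta_bernoulli_update(2.0, 2.0, [1, 0, 1, 1, 0, 1])
print("posterior mean of mu:", a_post / (a_post + b_post))   # moves toward the data as N grows
```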
as more data observed, posterior mean -> true mean and var -> zero
$E[\theta] = E_D[E_\theta[\theta \mid D]]$, $\mathrm{var}[\theta] = E_D[\mathrm{var}_\theta[\theta \mid D]] + \mathrm{var}_D[E_\theta[\theta \mid D]]$
Robbins-Monro algorithm
- find theta s.t. f(theta) = 0 online
- $\theta_N = \theta_{N-1} + a_{N-1} z(\theta_{N-1})$, where $z$ is a noisy observation with $E[z \mid \theta] = f(\theta)$
- apply to MLE
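A minimal sketch of the MLE application: sequentially estimating a Gaussian mean, where the noisy observation z is (proportional to) the derivative of the newest point's log likelihood. The step sizes a_{N-1} = 1/N and the synthetic data stream are my own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = 3.0
theta = 0.0                                    # initial guess for the Gaussian mean

for n, x in enumerate(rng.normal(true_mean, 1.0, size=10_000), start=1):
    z = x - theta                              # noisy observation of f(theta) = true_mean - theta
    theta = theta + (1.0 / n) * z              # Robbins-Monro step with a_{N-1} = 1/N

print(theta)                                   # converges toward true_mean
```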
periodic variables
- statistics depend on the choice of origin
- using points on unit circle to get invariant mean and var
- generalized Gaussian on theta
- $p(\theta \mid \theta_0, m) = \frac{1}{2\pi I_0(m)} \exp\{m \cos(\theta - \theta_0)\}$ (the von Mises distribution)
- where to use it?
exponential family
- from sufficient statistics to mu
- no need to store the whole dataset
noninformative priors
- we need a prior technically while knowing nothing about the true dist.
- Gaussian could be a good one
kernel density estimator
- kind of like attention
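A minimal Gaussian KDE sketch; every training point is kept and the query "attends" to all of them (the bandwidth h and the toy data are my own choices):

```python
import numpy as np

def kde(x_query, data, h=0.3):
    """p(x) = (1/N) * sum_n N(x | x_n, h^2), evaluated at each query point."""
    diffs = (x_query[:, None] - data[None, :]) / h
    kernels = np.exp(-0.5 * diffs**2) / (h * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)                 # average over all N stored points

data = np.random.default_rng(1).normal(size=200)
grid = np.linspace(-4, 4, 9)
print(kde(grid, data))
```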
Chapter 3
linear model?
- linear w.r.t. parameters
- $y(x, w) = w^T \phi(x)$
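A quick sketch of fitting such a model with regularized least squares; the Gaussian basis functions, their centres, and the regularization coefficient are all arbitrary choices of mine:

```python
import numpy as np

def gaussian_basis(x, centres, s=0.5):
    """phi(x): a bias term plus one Gaussian bump per centre."""
    bumps = np.exp(-0.5 * ((x[:, None] - centres[None, :]) / s) ** 2)
    return np.hstack([np.ones((len(x), 1)), bumps])

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=50)

Phi = gaussian_basis(x, centres=np.linspace(0, 1, 9))
lam = 1e-3                                                        # regularization coefficient
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)
print((Phi @ w)[:5])                                              # y(x, w) = w^T phi(x)
```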
Chapter 4
one-versus-one and one-versus-the-rest
- what about points in the intersection?
least squares
- pseudo-inverse
- if t_n sums to one, then all pred. made will sum to one
- but not necessarily a prob. dist. ==> maybe negative!
- not robust
- sensitive to outlier data
Fisher's linear discriminant
- project to 1D, maximize the separation of the projected class means (relative to the within-class variance)
- in the multi-class case: max. between-class cov. while min. within-class cov.
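A two-class sketch of the Fisher direction $w \propto S_W^{-1}(m_2 - m_1)$ (the toy data is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], 0.5, size=(100, 2))            # class 1
X2 = rng.normal([2.0, 1.0], 0.5, size=(100, 2))            # class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)    # within-class scatter matrix
w = np.linalg.solve(S_w, m2 - m1)                          # Fisher direction (up to scale)
print(w / np.linalg.norm(w))
```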
revisit perceptron
- kind of like SGD
view the sigmoid function as arising from Bayes' theorem ==> interesting!
- generalize to multiclass cases => softmax
- apply to $p(x \mid C_i) = \mathcal{N}(x \mid \mu_i, \Sigma)$
MLE: severe overfitting for linearly separable datasets
- any separating $w$ (with $w^T\phi = 0$ on the boundary) can be scaled up so that $\sigma(w^T\phi) \to 0$ or $1$; the likelihood keeps increasing, so $\|w\|$ explodes
Newton's method: iterative reweighted least squares (IRLS)
- the Hessian has a simple closed form ($H = \Phi^T R \Phi$), so no need to worry about computing it (sketch below)
choice of link function (inverse of the activation function) simplifies the gradient
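A minimal IRLS sketch for logistic regression, using the Newton update $w \leftarrow w - (\Phi^T R \Phi)^{-1}\Phi^T(y - t)$; the toy data and iteration count are my own:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
Phi = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 1))])   # bias + one feature
t = (Phi[:, 1] + 0.3 * rng.normal(size=N) > 0).astype(float)  # noisy labels, not separable

w = np.zeros(2)
for _ in range(10):                                           # IRLS iterations
    y = 1.0 / (1.0 + np.exp(-Phi @ w))                        # sigmoid predictions
    R = y * (1.0 - y)                                         # diagonal weighting matrix
    H = Phi.T @ (R[:, None] * Phi)                            # Hessian = Phi^T R Phi
    w = w - np.linalg.solve(H, Phi.T @ (y - t))               # Newton / reweighted LS step
print(w)
```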
Laplace approx.
- find the mode, take the second-order derivative there
- construct a Gaussian with the same mode, and precision given by the negative second derivative (Hessian)
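A numeric 1D sketch of the recipe: locate the mode, take the negative second derivative of the log density there as the precision. The target density (a standard Gaussian times a sigmoid factor) is just an arbitrary example of mine:

```python
import numpy as np

def log_p(z):
    """Unnormalized log density: standard Gaussian times a sigmoid factor."""
    return -0.5 * z**2 - np.log1p(np.exp(-(4 * z + 1)))

grid = np.linspace(-5, 5, 100001)
z0 = grid[np.argmax(log_p(grid))]                              # mode of p(z)
h = 1e-4
A = -(log_p(z0 + h) - 2 * log_p(z0) + log_p(z0 - h)) / h**2    # precision = -d^2/dz^2 log p at mode
print(f"Laplace approx: N(z | {z0:.3f}, variance {1.0 / A:.3f})")
```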
Bayesian information criterion
$\ln p(D) \simeq \ln p(D \mid \theta_{MAP}) - \frac{1}{2} M \ln N$ - find proof online
- intuitive to assume Gaussian prior for Bayesian approach
- mean and cov from likelihood functions
Chapter 5
symmetry in weight space
- quite interesting
- …but how to use that?
tangent propagation
- regularizer: when the input undergoes a small transformation, keep the output the same as before
inverse problems
- kind of like Bayesian
- least square works under Gaussian assumption
- what if that fails?
- multimodal assumption (mixture density networks)!
Chapter 6
training data points kept during the prediction phase
linear model with regularized least squares loss
- entirely in terms of kernel function
- no need to introduce feature vector
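A sketch of that dual formulation: with a Gram matrix K, the prediction is $y(x) = k(x)^T (K + \lambda I)^{-1} t$, never touching an explicit feature vector (the kernel choice and data are mine):

```python
import numpy as np

def rbf_kernel(a, b, gamma=10.0):
    """k(x, x') = exp(-gamma * (x - x')^2) for 1D inputs."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)

K = rbf_kernel(x, x)                                # Gram matrix on the training inputs
a = np.linalg.solve(K + 0.1 * np.eye(len(x)), t)    # dual coefficients (K + lambda*I)^-1 t
x_new = np.array([0.25, 0.5])
print(rbf_kernel(x_new, x) @ a)                     # predictions use kernels only
```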
fisher kernel
- $k(x, x') = g(\theta, x)^T F^{-1} g(\theta, x')$
- $F = \mathbb{E}_x[g(\theta, x)\, g(\theta, x)^T]$
- whitening of the Fisher score
kernel regression
Chapter 7
sparse kernel?
- kernel function evaluated on a subset of training data
SVM as a classifier
- build a sigmoid layer on top to get a conditional prob.
- hinge loss (compared with the logistic loss) as an approx. to the misclassification error
SVM for regression
RVM
- Bayesian linear regression model
- each weight has a zero-mean Gaussian prior with its own precision
- many of those precisions go to infinity during (evidence) optimization, driving the corresponding weights to zero
- reading paper later: NIPS 2000
- done
Chapter 8
box notation of nodes
d-separation
since Chapter 9 and the later chapters are all covered in Modern Computational Statistics, no need to take notes.
- no, I do need to take notes
Chapter 9
introducing latent vars -> marginal distribution can be expressed in terms of more tractable joint distribution of observed and latent variables
K-means
- find an assignment of datapoints to clusters and center of clusters
- min. distortion measure: can be Euclidean or other distance measures
- helpful to do linear re-scaling to zero mean and unit standard deviation
- kmeans used to initialize Gaussian mixture
- online kmeans: further reading needed!
- application on image segmentation: cluster pixel values
- vector quantization, code-book vectors
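A compact K-means sketch alternating assignment and centre re-estimation (K, the number of iterations, and the toy blobs are my own choices):

```python
import numpy as np

def kmeans(X, K, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), K, replace=False)]              # init from random data points
    for _ in range(iters):
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)  # squared Euclidean distortion
        assign = d2.argmin(axis=1)                                 # assign each point to its nearest centre
        for k in range(K):                                         # re-estimate non-empty clusters
            if np.any(assign == k):
                centres[k] = X[assign == k].mean(axis=0)
    return centres, assign

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in ([0, 0], [2, 2], [0, 2])])
centres, assign = kmeans(X, K=3)
print(centres)
```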
mixtures of Gaussians
- ancestral sampling: sample in topological order
- avoid singularities: one Gaussian collapses onto a single point, sigma -> 0, likelihood -> infinity (overfitting)
- identifiability: K! equivalent solutions available, how to interpret?
- EM (essentially coordinate ascent on a bound), not guaranteed to reach the global maximum
alternative view of EM
- observed data X, latent variable Z
- motivation: hard to maximize the log likelihood directly since the sum over latent Z is inside the logarithm
- estimate the posterior over the latent variables (using Bayes' theorem), so that the sum moves outside the log
- then update parameters by max. LL
- for MAP problems, max. LL + ln p(theta)
- also apply to missing observations, if values are missing at random
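A compact EM sketch for a 1D two-component Gaussian mixture: the E step computes responsibilities with Bayes' theorem (the sum now sits outside the log), the M step re-estimates means, variances and mixing coefficients. Initialization and data are my own:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 150), rng.normal(1, 1.0, 250)])  # toy 1D data

pi = np.array([0.5, 0.5])                  # mixing coefficients
mu = np.array([-1.0, 0.5])                 # component means
var = np.array([1.0, 1.0])                 # component variances

def gauss(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

for _ in range(50):
    # E step: responsibilities via Bayes' theorem
    num = pi[None, :] * gauss(x[:, None], mu[None, :], var[None, :])
    r = num / num.sum(axis=1, keepdims=True)
    # M step: maximize the expected complete-data log likelihood
    Nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    var = (r * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nk
    pi = Nk / len(x)

print(pi, mu, var)
```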
variational distribution view of EM
- introduce q(z): the E step minimizes the KL to the true posterior, the M step maximizes the bound L w.r.t. the parameters
- easy if p(X, Z) is of the exponential family, since the logarithm simplifies
- GEM: increase L instead of max. L
- ECM: do several constrained optimizations in M step
- online EM?
- E step easy, M step just keep the sufficient statistics
Chapter 10
infeasibility of evaluating the posterior distribution
- and dimensionality and complexity prohibit numerical integration
- variational distribution!
- restrict the range of functions over which the optimization is performed
- rich and flexible family while being tractable
general expression
$\ln q_j^*(z_j) = \mathbb{E}_{i \neq j}[\ln p(x, z)] + \text{const}$
factorized variational approximation tends to give a posterior approximation that is too compact (variance under-estimated)
- a consequence of minimizing $\mathrm{KL}(q \,\|\, p)$, which penalizes q for placing mass where p is small
model comparison
directly comparing $p(m \mid x)$ is hard (and arguably not meaningful); instead consider $q(z, m) = q(z \mid m)\, q(m)$ and optimize each $q(z \mid m)$ individually
mixture of Gaussian
- some conjugate priors: Gaussian-Wishart, gamma
- variational treatment
- joint distribution over observed variables, parameters and latent variables
- factorizes between latent variables and parameters
- induced factorization: arises from the assumed factorization together with conditional independence: check via d-separation
- apply general expression
- VI: no singularities, in which case penalty arises from prior distribution
- thus no overfitting even if a large number of parameters is used
- result same as EM given infinitely broad prior
- approx. density: variational approximation replace true posterior distribution
choose K
- the MLE likelihood function increases monotonically with K
- while VI does not
exponential family
- difference between parameter and latent variable? the latter are extensive: scale in number with the size of the dataset
- prior distribution as a prior number of observations seen
- VI in a graph: update a variable once all variables in its Markov blanket are given -> it receives messages from its parents, children, and co-parents
local variational methods
- local: bounds over an individual variable or a small group of variables, in contrast to a global bound over the full distribution
- dual function: refresh on convex optimization
- approximation of the logistic sigmoid function: a Gaussian-form lower bound (see the numeric check after this list)
- which leads to an analytically integrable bound
- but note this approximation is not a distribution: it cannot be normalized and still be a valid bound
- EM update for the parameters
- variational parameters (instead of RVs) improve the accuracy of the approximation
- suited for sequential learning
- EP: refer to textbook and slides
- update a single factor at a time by moment matching (a property of the exponential family)
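The sigmoid bound referenced above is the exponential-of-a-quadratic (Gaussian-form) lower bound $\sigma(x) \ge \sigma(\xi)\exp\{(x-\xi)/2 - \lambda(\xi)(x^2-\xi^2)\}$ with $\lambda(\xi) = \tanh(\xi/2)/(4\xi)$; a quick numeric sanity check of my own:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lower_bound(x, xi):
    """sigma(x) >= sigma(xi) * exp((x - xi)/2 - lam(xi)*(x^2 - xi^2))."""
    lam = np.tanh(xi / 2.0) / (4.0 * xi)
    return sigmoid(xi) * np.exp((x - xi) / 2.0 - lam * (x**2 - xi**2))

x = np.linspace(-6, 6, 13)
print(np.all(lower_bound(x, xi=2.5) <= sigmoid(x) + 1e-12))   # True: it really is a lower bound
print(lower_bound(2.5, xi=2.5), sigmoid(2.5))                  # the bound is exact at x = +/- xi
```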
Chapter 11
sampling methods
- find the expectation of some function f(z) w.r.t. a prob. distribution p(z)
- accuracy does not depend on dimensionality of z
- standard sampling methods: transform uniformly distributed random numbers using inverse of CDF
- rejection sampling
- sample from a proposal distribution $q$, accept with prob. $\tilde{p}(z) / (k\, q(z))$
- adaptive rejection sampling: p(z) is log concave
- envelope function being piecewise exponential distribution
- importance sampling: E[f] = sum of p(z)f(z)
- sample points where p(z) is large
- importance weight: p(z) / q(z); works when q(z) matches p(z), or at the very least q(z) should be large wherever p(z) is large (see the sketch after this list)
- sampling algorithms can approximate the E step in EM
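A minimal importance-sampling sketch (target, proposal and test function are my own toy choices): estimate E_p[f] with samples from a broader proposal q, weighting each by p(z)/q(z):

```python
import numpy as np

rng = np.random.default_rng(0)

def p(z):                      # target density: N(1, 0.5^2)
    return np.exp(-0.5 * ((z - 1.0) / 0.5) ** 2) / (0.5 * np.sqrt(2 * np.pi))

def q(z):                      # proposal density: N(0, 2^2), large wherever p is large
    return np.exp(-0.5 * (z / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

z = rng.normal(0.0, 2.0, size=100_000)      # samples from q
w = p(z) / q(z)                              # importance weights
f = z**2                                     # function whose expectation we want
print((w * f).mean())                        # approx E_p[z^2] = 1^2 + 0.5^2 = 1.25
```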
MCMC
- maintain a record of the current state and a proposal distribution that depends on the current state (and is simple to sample from)
- check the detailed balance
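A bare-bones Metropolis sketch with a symmetric Gaussian random-walk proposal (the target and step size are my choices); the acceptance rule $\min(1, \tilde{p}(z')/\tilde{p}(z))$ is what makes detailed balance hold:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(z):                            # unnormalized target: mixture of two Gaussians
    return np.exp(-0.5 * (z + 2) ** 2) + 0.5 * np.exp(-0.5 * (z - 2) ** 2)

z = 0.0
samples = []
for _ in range(50_000):
    z_prop = z + rng.normal(0.0, 1.0)      # symmetric random-walk proposal
    if rng.uniform() < min(1.0, p_tilde(z_prop) / p_tilde(z)):
        z = z_prop                          # accept; otherwise the current state is repeated
    samples.append(z)

print(np.mean(samples[5_000:]))             # mean under the target, after burn-in
```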
Gibbs sampling
- draw one variable conditioned on the remaining variables
- require the distribution to be ergodic
- blocking Gibbs sampling: sample from group of variables
slice sampling
- adaptive step size
- sample u under uniform distribution [0, p(z)], then sample z from {z: p(z) > u}
- choose a segment [z_min, z_max] where for any z in that segment, p(z) > u
HMC
- refer to slides
estimate partition function
- do not understand
Chapter 12
continuous latent variables
continuous dataset
- all data points lie close to a manifold of lower dimensionality
- interpret the departures of datapoints as noise
- a generative view of models
PCA
- orthogonal projection of data onto a lower dimensional linear space
- while maximizing the variance of the projected data
- also minimizes the average projection cost
- compression of the dataset, or data pre-processing (standardization)
- for high-dimensional data: the O(D^3) eigendecomposition is computationally infeasible
- note that the N x N matrix $X X^T$ has the same nonzero eigenvalues as $X^T X$, and its eigenvectors give those of $X^T X$ after multiplying by $X^T$ (sketch below)
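A sketch of that trick for N << D (dimensions and data are my own): eigendecompose the N x N matrix instead of the D x D covariance, then map the eigenvectors back:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 50, 2000, 3                          # few samples, high dimension, M components
X = rng.normal(size=(N, D))
Xc = X - X.mean(axis=0)                        # centre the data

S_small = Xc @ Xc.T / N                        # N x N instead of D x D
eigval, V = np.linalg.eigh(S_small)            # eigenvalues in ascending order
idx = np.argsort(eigval)[::-1][:M]             # take the M largest
U = Xc.T @ V[:, idx]                           # map eigenvectors back to D-dim space
U /= np.linalg.norm(U, axis=0)                 # normalize to unit length

Z = Xc @ U                                      # projections onto the principal subspace
print(U.shape, Z.shape)
```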
probabilistic PCA
- PCA as MLE solution of a probabilistic latent variable model
- x = Wz + mu + eps
- the solution is $W_{ML} = U_M (L_M - \sigma^2 I)^{1/2} R$
- where the columns of $U_M$ are the eigenvectors with the M largest eigenvalues (collected in $L_M$), and R is an arbitrary orthogonal matrix
- correctly captures the variance of the data along the principal axes, and approximates the variance in all remaining directions with a single average value sigma^2
- EM for PCA: computationally efficient for large-scale applications
- can also be implemented in an online form, since only sufficient statistics are needed
kernel PCA, ICA: refer to notes