Self Attention for Computer Vision
Published:
revisit universality in DL
- goal: one component that works for everything, e.g. matrix multiply (Ax), ReLU, residual connections, MLE training
- not universal: preprocessing, data format, normalization
- generalizable, simple, minimal explicit constraints; even small advances have a large impact
- focus: build a universal mixing primitive
- entities with relations -> pixels, sentences, graphs -> attention works for all of them, and is also scalable
- bring attention to vision?
- RNN limitations: computation cannot be parallelized & long-range interactions are limited
- word alignment: align regions in the source and target language, p(a_ij | e_1…n, f_1…n)
- self-attention: every position interacts with all possible pairs -> trivially parallelized; FLOPs are O(length^2 * dim), but with a smaller constant factor
- a single self-attention head averages the contributions from all other nodes -> multi-head attention models different kinds of interactions (sketch below)
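
A minimal sketch of (multi-head) scaled dot-product self-attention in PyTorch; the shapes, weight layout, and toy numbers are my own illustration rather than anything specific from the talk.

```python
# Minimal multi-head self-attention sketch (illustrative shapes and weights).
import torch
import torch.nn.functional as F


def multi_head_self_attention(x, wq, wk, wv, wo, num_heads):
    """x: [batch, length, dim]; wq/wk/wv/wo: [dim, dim] projection weights."""
    b, n, d = x.shape
    head_dim = d // num_heads

    # Project to queries, keys, values and split into heads.
    q = (x @ wq).view(b, n, num_heads, head_dim).transpose(1, 2)  # [b, h, n, hd]
    k = (x @ wk).view(b, n, num_heads, head_dim).transpose(1, 2)
    v = (x @ wv).view(b, n, num_heads, head_dim).transpose(1, 2)

    # Every position attends to every other position: O(n^2 * d) FLOPs.
    logits = q @ k.transpose(-2, -1) / head_dim ** 0.5            # [b, h, n, n]
    weights = F.softmax(logits, dim=-1)
    out = weights @ v                                             # [b, h, n, hd]

    # Merge the heads and apply the output projection.
    out = out.transpose(1, 2).reshape(b, n, d)
    return out @ wo


x = torch.randn(2, 16, 64)
ws = [torch.randn(64, 64) * 0.02 for _ in range(4)]
print(multi_head_self_attention(x, *ws, num_heads=8).shape)  # torch.Size([2, 16, 64])
```
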
- repeating patterns in an image
- e.g. multiple people in the same image
- classic pattern matching: non-local means, bilateral filters
- self-attention as a data-dependent convolution (sketch below)
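
To make the non-local-means analogy concrete, here is a toy "data-dependent convolution": every pixel is replaced by a similarity-weighted average of all pixels, with the weights computed from the image content itself rather than from a fixed, learned kernel. The function and temperature are illustrative assumptions.

```python
# Toy non-local averaging: attention-like weights derived from pixel similarity.
import torch
import torch.nn.functional as F


def non_local_average(img, temperature=0.1):
    """img: [H, W] grayscale image -> [H, W] filtered image."""
    h, w = img.shape
    feats = img.reshape(h * w, 1)                  # per-pixel feature (intensity)
    # Pairwise similarity computed from content, turned into weights via softmax.
    sim = -(feats - feats.T) ** 2 / temperature    # [H*W, H*W]
    weights = F.softmax(sim, dim=-1)
    return (weights @ feats).reshape(h, w)


print(non_local_average(torch.rand(8, 8)).shape)  # torch.Size([8, 8])
```
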
- build an attention-based vision model
- fully attentional model: reuse components designed for vision, and replace the spatial-mixing convolutions with attention
attn vision model design, using classification as an example
- adapt an existing model: add attention to a vision model, or adapt an NLP attention model to vision
- e.g. bring the transformer to vision
- one-hot encoding -> embedding
- first try: treat every pixel as a token and pass them into a transformer
- output is [W*H, feature]; average over positions to get [feature]
- too expensive: quadratic cost w.r.t. input length
- smaller image? loses detail -> learned downsampling of the image, i.e. convolution
- patches: a linear projection to a single vector for each patch (sketch below)
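
A rough ViT-style sketch of this pipeline: patchify, project each patch to a token, run a transformer, then mean-pool for classification. The hyperparameters and the use of nn.TransformerEncoder are placeholder choices, not the exact model from the talk.

```python
# ViT-style classifier sketch: patch tokens -> transformer -> mean pool -> linear head.
import torch
import torch.nn as nn


class TinyViT(nn.Module):
    def __init__(self, image_size=32, patch_size=4, dim=128, depth=4,
                 heads=4, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patchify + linear projection in one strided convolution.
        self.to_tokens = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                               # x: [B, 3, H, W]
        tokens = self.to_tokens(x)                      # [B, dim, H/ps, W/ps]
        tokens = tokens.flatten(2).transpose(1, 2)      # [B, num_patches, dim]
        tokens = self.encoder(tokens + self.pos_embed)
        return self.head(tokens.mean(dim=1))            # average over patches


model = TinyViT()
print(model(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 10])
```

With patch_size=1 this collapses back to the per-pixel-token variant, which is exactly what makes full-resolution inputs too expensive.
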
- what about multiple scales? image -> ResNet, then apply the transformer on high/medium/low-resolution feature maps (sketch below)
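
A sketch of the hybrid idea, assuming a torchvision ResNet-50 backbone: the CNN produces a lower-resolution feature map, and each spatial position of that map becomes a transformer token. The backbone and the stage used are illustrative choices.

```python
# Hybrid sketch: ResNet feature map -> tokens -> transformer encoder.
import torch
import torch.nn as nn
from torchvision.models import resnet50


class HybridBackbone(nn.Module):
    def __init__(self, dim=256, heads=4, depth=2):
        super().__init__()
        cnn = resnet50(weights=None)
        # Keep everything up to the last conv stage: outputs [B, 2048, H/32, W/32].
        self.stem = nn.Sequential(*list(cnn.children())[:-2])
        self.proj = nn.Conv2d(2048, dim, kernel_size=1)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                          # x: [B, 3, H, W]
        feats = self.proj(self.stem(x))            # [B, dim, H/32, W/32]
        tokens = feats.flatten(2).transpose(1, 2)  # [B, (H/32)*(W/32), dim]
        return self.encoder(tokens)


print(HybridBackbone()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 49, 256])
```
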
- attn for large resolutions?
- cheaper attention, like convolution -> local attention: each pixel only interacts with its neighboring pixels, linear scaling w.r.t. the number of pixels
- blocked local attention: a single shared local region for all pixels within the same block; low memory, efficient, and more accurate thanks to a larger local neighborhood
- Swin Transformer: shift the windows across layers, offsetting them by some margin -> eventually, all pixels can interact with all other pixels (sketch below)
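
A simplified sketch of blocked (windowed) local attention with an optional Swin-style cyclic shift. The real Swin Transformer also masks attention between pixels that wrap around the image border after the shift, which is omitted here.

```python
# Windowed local attention with an optional cyclic shift (mask omitted for brevity).
import torch
import torch.nn.functional as F


def window_attention(x, window=4, shift=0):
    """x: [B, H, W, C]; attention is computed only within window x window blocks."""
    b, h, w, c = x.shape
    if shift:
        # Offset the window grid so successive layers see different partitions.
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

    # Partition into non-overlapping windows: [B * num_windows, window*window, C].
    x = x.view(b, h // window, window, w // window, window, c)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, c)

    # Plain self-attention inside each window (queries = keys = values here).
    logits = x @ x.transpose(-2, -1) / c ** 0.5
    out = F.softmax(logits, dim=-1) @ x

    # Undo the window partition (and the shift).
    out = out.view(b, h // window, w // window, window, window, c)
    out = out.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, c)
    if shift:
        out = torch.roll(out, shifts=(shift, shift), dims=(1, 2))
    return out


x = torch.randn(2, 16, 16, 32)
print(window_attention(x).shape, window_attention(x, shift=2).shape)
```
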
- axial attention: attention along the height axis, then along the width axis (sketch below)
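
A sketch of axial attention: full attention along the height axis, then along the width axis, so any two pixels can exchange information within two steps at a cost of roughly O(H*W*(H+W)) instead of O((H*W)^2).

```python
# Axial attention sketch: attend over H, then over W.
import torch
import torch.nn.functional as F


def attend_1d(x):
    """Plain self-attention over the second-to-last dimension; x: [..., N, C]."""
    logits = x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5
    return F.softmax(logits, dim=-1) @ x


def axial_attention(x):
    """x: [B, H, W, C]."""
    # Height axis: treat each column as a sequence of H tokens.
    x = attend_1d(x.transpose(1, 2)).transpose(1, 2)
    # Width axis: each row is already a sequence of W tokens.
    return attend_1d(x)


print(axial_attention(torch.randn(2, 16, 16, 32)).shape)  # torch.Size([2, 16, 16, 32])
```
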
- approximate attention: replace softmax(QK^T)V with a kernelized (QK^T)V, which can be regrouped as Q(K^T V) (sketch below)
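
A sketch of the reordering trick, using an elu+1 feature map in place of the softmax (as in kernelized "linear" attention); this is one common instantiation, shown as an assumption rather than the specific method from the talk.

```python
# Linear attention sketch: phi(Q) (phi(K)^T V), computed right-to-left.
import torch
import torch.nn.functional as F


def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: [B, N, D]."""
    q = F.elu(q) + 1                                 # positive feature map phi(Q)
    k = F.elu(k) + 1                                 # phi(K)
    kv = k.transpose(-2, -1) @ v                     # [B, D, D]: O(N * D^2), not O(N^2 * D)
    norm = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps   # [B, N, 1]
    return (q @ kv) / norm


q, k, v = (torch.randn(2, 1024, 64) for _ in range(3))
print(linear_attention(q, k, v).shape)  # torch.Size([2, 1024, 64])
```
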
- positional encoding? absolute positions give no translational equivariance
- relative positions, w.r.t. the query pixel (sketch below)
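
A sketch of a learned 2-D relative position bias added to the attention logits inside a window, following the common Swin-style recipe; the table size and indexing here are my assumptions, not details given in the talk.

```python
# Window attention with a learned relative position bias on the logits.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelPosWindowAttention(nn.Module):
    def __init__(self, window=4):
        super().__init__()
        self.window = window
        # One learnable bias per possible (dy, dx) offset: (2M-1) x (2M-1) entries.
        self.bias_table = nn.Parameter(torch.zeros((2 * window - 1) ** 2))

        # Precompute which table entry each (query, key) pair inside the window uses.
        coords = torch.stack(torch.meshgrid(torch.arange(window),
                                            torch.arange(window), indexing="ij"))
        coords = coords.flatten(1)                        # [2, M*M]
        rel = coords[:, :, None] - coords[:, None, :]     # [2, M*M, M*M] offsets
        rel = rel + (window - 1)                          # shift offsets to be >= 0
        self.register_buffer("index", rel[0] * (2 * window - 1) + rel[1])

    def forward(self, x):                                 # x: [B, M*M, C]
        logits = x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5
        logits = logits + self.bias_table[self.index]     # bias depends on relative offset
        return F.softmax(logits, dim=-1) @ x


attn = RelPosWindowAttention(window=4)
print(attn(torch.randn(2, 16, 32)).shape)  # torch.Size([2, 16, 32])
```
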
survey of SA in CV
- DETR: details omitted, this paper rather puts me off btw…
- visualize the attention patterns
- semantic segmentation: SETR
- patch level attn also works
- DINO: self-training, converge with a past version of itself, for self-supervised learning (sketch below)
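
A compressed sketch of the DINO recipe: a student is trained to match the centered, sharpened output of a teacher whose weights are an exponential moving average of the student's own past weights. The networks, the single pair of views, and all hyperparameters are placeholders.

```python
# DINO-style self-distillation sketch (single pair of views, toy networks).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Linear(128, 64), nn.GELU(), nn.Linear(64, 32))
teacher = copy.deepcopy(student)            # teacher starts as a copy, updated by EMA
for p in teacher.parameters():
    p.requires_grad_(False)

center = torch.zeros(32)                    # running center of teacher outputs
opt = torch.optim.SGD(student.parameters(), lr=0.1)


def dino_step(view1, view2, tau_s=0.1, tau_t=0.04, ema=0.996):
    global center
    s_out = student(view1)                  # student sees one augmented view
    with torch.no_grad():
        t_out = teacher(view2)              # teacher sees another view
        t_prob = F.softmax((t_out - center) / tau_t, dim=-1)   # center + sharpen

    loss = -(t_prob * F.log_softmax(s_out / tau_s, dim=-1)).sum(-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

    with torch.no_grad():
        # Teacher weights track the student; the center tracks teacher outputs.
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(ema).add_(ps, alpha=1 - ema)
        center = 0.9 * center + 0.1 * t_out.mean(dim=0)
    return loss.item()


print(dino_step(torch.randn(8, 128), torch.randn(8, 128)))
```
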
- CLIP: contrastive pretraining, matching texts with images (sketch below)
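
A sketch of CLIP-style contrastive pretraining with stand-in encoders: image and text embeddings are normalized, a cosine-similarity matrix is built for the batch, and a symmetric cross-entropy pulls each image toward its own caption.

```python
# CLIP-style contrastive loss sketch (encoders are stand-ins).
import torch
import torch.nn as nn
import torch.nn.functional as F

image_encoder = nn.Linear(512, 256)         # stand-in for an image transformer
text_encoder = nn.Linear(300, 256)          # stand-in for a text transformer


def clip_loss(images, texts, temperature=0.07):
    img = F.normalize(image_encoder(images), dim=-1)   # [B, D]
    txt = F.normalize(text_encoder(texts), dim=-1)     # [B, D]
    logits = img @ txt.T / temperature                 # [B, B] similarity matrix
    targets = torch.arange(len(images))                # i-th image matches i-th text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2


print(clip_loss(torch.randn(8, 512), torch.randn(8, 300)))
```
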