Self Attention for Computer Vision
Published:
revisit universality in DL
- goal: one component that works for everything, e.g. matrix multiply (Ax), ReLU, residual connections, MLE training
- not universal: preprocessing, data format, normalization
- generalizable, simple, minimal explicit constraints; even small advances have a large impact
- focus: build a universal mixing primitive
- entities with relations -> pixels, sentences, graphs -> attention works for all of them, and is also scalable
- bring attention to vision?
- RNN limitations: computation cannot be parallelized & long-range interactions are limited
- word alignment: align regions in the source and target language, p(a_ij | e_1…n, f_1…n)
- self-attention: every position interacts with all possible pairs -> trivially parallelized; FLOPs are O(length^2 * dim), but with a smaller constant factor
- a single self-attention head averages the contributions from all other nodes -> multi-head attention models different kinds of interactions (sketch below)
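
A minimal sketch of (multi-head) scaled dot-product self-attention in PyTorch; the shapes, weight layout, and toy numbers are my own illustration rather than anything specific from the talk.

```python
# Minimal multi-head self-attention sketch (illustrative shapes and weights).
import torch
import torch.nn.functional as F


def multi_head_self_attention(x, wq, wk, wv, wo, num_heads):
    """x: [batch, length, dim]; wq/wk/wv/wo: [dim, dim] projection weights."""
    b, n, d = x.shape
    head_dim = d // num_heads

    # Project to queries, keys, values and split into heads.
    q = (x @ wq).view(b, n, num_heads, head_dim).transpose(1, 2)  # [b, h, n, hd]
    k = (x @ wk).view(b, n, num_heads, head_dim).transpose(1, 2)
    v = (x @ wv).view(b, n, num_heads, head_dim).transpose(1, 2)

    # Every position attends to every other position: O(n^2 * d) FLOPs.
    logits = q @ k.transpose(-2, -1) / head_dim ** 0.5            # [b, h, n, n]
    weights = F.softmax(logits, dim=-1)
    out = weights @ v                                             # [b, h, n, hd]

    # Merge the heads and apply the output projection.
    out = out.transpose(1, 2).reshape(b, n, d)
    return out @ wo


x = torch.randn(2, 16, 64)
ws = [torch.randn(64, 64) * 0.02 for _ in range(4)]
print(multi_head_self_attention(x, *ws, num_heads=8).shape)  # torch.Size([2, 16, 64])
```
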
- repeating patterns in an image
- e.g. multiple people in the same image
- classic pattern matching: non-local means, bilateral filters
- self-attention as a data-dependent convolution (sketch below)
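
To make the non-local-means analogy concrete, here is a toy "data-dependent convolution": every pixel is replaced by a similarity-weighted average of all pixels, with the weights computed from the image content itself rather than from a fixed, learned kernel. The function and temperature are illustrative assumptions.

```python
# Toy non-local averaging: attention-like weights derived from pixel similarity.
import torch
import torch.nn.functional as F


def non_local_average(img, temperature=0.1):
    """img: [H, W] grayscale image -> [H, W] filtered image."""
    h, w = img.shape
    feats = img.reshape(h * w, 1)                  # per-pixel feature (intensity)
    # Pairwise similarity computed from content, turned into weights via softmax.
    sim = -(feats - feats.T) ** 2 / temperature    # [H*W, H*W]
    weights = F.softmax(sim, dim=-1)
    return (weights @ feats).reshape(h, w)


print(non_local_average(torch.rand(8, 8)).shape)  # torch.Size([8, 8])
```
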
- build an attention-based vision model
- fully attentional model: reuse components designed for vision, and replace the spatial-mixing convolutions with attention
attn vision model design, using classification as an example
- adapt an existing model: add attention to a vision model, or adapt an NLP attention model to vision
- e.g. bring the transformer to vision
- one-hot encoding -> embedding
- first try: treat every pixel as a token and pass them into a transformer
- output is [W*H, feature]; average over positions to get [feature]
- too expensive: quadratic cost w.r.t. input length
- smaller image? loses detail -> learned downsampling of the image, i.e. convolution
- patches: a linear projection to a single vector for each patch (sketch below)
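
A rough ViT-style sketch of this pipeline: patchify, project each patch to a token, run a transformer, then mean-pool for classification. The hyperparameters and the use of nn.TransformerEncoder are placeholder choices, not the exact model from the talk.

```python
# ViT-style classifier sketch: patch tokens -> transformer -> mean pool -> linear head.
import torch
import torch.nn as nn


class TinyViT(nn.Module):
    def __init__(self, image_size=32, patch_size=4, dim=128, depth=4,
                 heads=4, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patchify + linear projection in one strided convolution.
        self.to_tokens = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                               # x: [B, 3, H, W]
        tokens = self.to_tokens(x)                      # [B, dim, H/ps, W/ps]
        tokens = tokens.flatten(2).transpose(1, 2)      # [B, num_patches, dim]
        tokens = self.encoder(tokens + self.pos_embed)
        return self.head(tokens.mean(dim=1))            # average over patches


model = TinyViT()
print(model(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 10])
```

With patch_size=1 this collapses back to the per-pixel-token variant, which is exactly what makes full-resolution inputs too expensive.
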
- what about multiple scales? image -> ResNet, then apply the transformer on high/medium/low-resolution feature maps (sketch below)
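
A sketch of the hybrid idea, assuming a torchvision ResNet-50 backbone: the CNN produces a lower-resolution feature map, and each spatial position of that map becomes a transformer token. The backbone and the stage used are illustrative choices.

```python
# Hybrid sketch: ResNet feature map -> tokens -> transformer encoder.
import torch
import torch.nn as nn
from torchvision.models import resnet50


class HybridBackbone(nn.Module):
    def __init__(self, dim=256, heads=4, depth=2):
        super().__init__()
        cnn = resnet50(weights=None)
        # Keep everything up to the last conv stage: outputs [B, 2048, H/32, W/32].
        self.stem = nn.Sequential(*list(cnn.children())[:-2])
        self.proj = nn.Conv2d(2048, dim, kernel_size=1)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                          # x: [B, 3, H, W]
        feats = self.proj(self.stem(x))            # [B, dim, H/32, W/32]
        tokens = feats.flatten(2).transpose(1, 2)  # [B, (H/32)*(W/32), dim]
        return self.encoder(tokens)


print(HybridBackbone()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 49, 256])
```
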
- attn for large resolutions?
- cheaper attention, like convolution -> local attention: each pixel only interacts with its neighboring pixels, linear scaling w.r.t. the number of pixels
- blocked local attention: a single shared local region for all pixels within the same block; low memory, efficient, and more accurate thanks to a larger local neighborhood
- Swin Transformer: shift the windows across layers, offsetting them by some margin -> eventually, all pixels can interact with all other pixels (sketch below)
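
A simplified sketch of blocked (windowed) local attention with an optional Swin-style cyclic shift. The real Swin Transformer also masks attention between pixels that wrap around the image border after the shift, which is omitted here.

```python
# Windowed local attention with an optional cyclic shift (mask omitted for brevity).
import torch
import torch.nn.functional as F


def window_attention(x, window=4, shift=0):
    """x: [B, H, W, C]; attention is computed only within window x window blocks."""
    b, h, w, c = x.shape
    if shift:
        # Offset the window grid so successive layers see different partitions.
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

    # Partition into non-overlapping windows: [B * num_windows, window*window, C].
    x = x.view(b, h // window, window, w // window, window, c)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, c)

    # Plain self-attention inside each window (queries = keys = values here).
    logits = x @ x.transpose(-2, -1) / c ** 0.5
    out = F.softmax(logits, dim=-1) @ x

    # Undo the window partition (and the shift).
    out = out.view(b, h // window, w // window, window, window, c)
    out = out.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, c)
    if shift:
        out = torch.roll(out, shifts=(shift, shift), dims=(1, 2))
    return out


x = torch.randn(2, 16, 16, 32)
print(window_attention(x).shape, window_attention(x, shift=2).shape)
```
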
- axial attention: attention along the height axis, then along the width axis (sketch below)
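
A sketch of axial attention: full attention along the height axis, then along the width axis, so any two pixels can exchange information within two steps at a cost of roughly O(H*W*(H+W)) instead of O((H*W)^2).

```python
# Axial attention sketch: attend over H, then over W.
import torch
import torch.nn.functional as F


def attend_1d(x):
    """Plain self-attention over the second-to-last dimension; x: [..., N, C]."""
    logits = x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5
    return F.softmax(logits, dim=-1) @ x


def axial_attention(x):
    """x: [B, H, W, C]."""
    # Height axis: treat each column as a sequence of H tokens.
    x = attend_1d(x.transpose(1, 2)).transpose(1, 2)
    # Width axis: each row is already a sequence of W tokens.
    return attend_1d(x)


print(axial_attention(torch.randn(2, 16, 16, 32)).shape)  # torch.Size([2, 16, 16, 32])
```
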
- approximate attention: replace softmax(QK^T)V with a kernelized (QK^T)V, which can be regrouped as Q(K^T V) (sketch below)
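
A sketch of the reordering trick, using an elu+1 feature map in place of the softmax (as in kernelized "linear" attention); this is one common instantiation, shown as an assumption rather than the specific method from the talk.

```python
# Linear attention sketch: phi(Q) (phi(K)^T V), computed right-to-left.
import torch
import torch.nn.functional as F


def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: [B, N, D]."""
    q = F.elu(q) + 1                                 # positive feature map phi(Q)
    k = F.elu(k) + 1                                 # phi(K)
    kv = k.transpose(-2, -1) @ v                     # [B, D, D]: O(N * D^2), not O(N^2 * D)
    norm = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps   # [B, N, 1]
    return (q @ kv) / norm


q, k, v = (torch.randn(2, 1024, 64) for _ in range(3))
print(linear_attention(q, k, v).shape)  # torch.Size([2, 1024, 64])
```
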
- positional encoding? absolute positions give no translational equivariance
- relative positions, w.r.t. the query pixel (sketch below)
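
A sketch of a learned 2-D relative position bias added to the attention logits inside a window, following the common Swin-style recipe; the table size and indexing here are my assumptions, not details given in the talk.

```python
# Window attention with a learned relative position bias on the logits.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelPosWindowAttention(nn.Module):
    def __init__(self, window=4):
        super().__init__()
        self.window = window
        # One learnable bias per possible (dy, dx) offset: (2M-1) x (2M-1) entries.
        self.bias_table = nn.Parameter(torch.zeros((2 * window - 1) ** 2))

        # Precompute which table entry each (query, key) pair inside the window uses.
        coords = torch.stack(torch.meshgrid(torch.arange(window),
                                            torch.arange(window), indexing="ij"))
        coords = coords.flatten(1)                        # [2, M*M]
        rel = coords[:, :, None] - coords[:, None, :]     # [2, M*M, M*M] offsets
        rel = rel + (window - 1)                          # shift offsets to be >= 0
        self.register_buffer("index", rel[0] * (2 * window - 1) + rel[1])

    def forward(self, x):                                 # x: [B, M*M, C]
        logits = x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5
        logits = logits + self.bias_table[self.index]     # bias depends on relative offset
        return F.softmax(logits, dim=-1) @ x


attn = RelPosWindowAttention(window=4)
print(attn(torch.randn(2, 16, 32)).shape)  # torch.Size([2, 16, 32])
```
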
survey of SA in CV
- DETR: details omitted, this paper rather puts me off btw…
- visualize the attention patterns
- semantic segmentation: SETR
- patch level attn also works
- DINO: self-training, converge with a past version of itself, for self-supervised learning (sketch below)
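
A compressed sketch of the DINO recipe: a student is trained to match the centered, sharpened output of a teacher whose weights are an exponential moving average of the student's own past weights. The networks, the single pair of views, and all hyperparameters are placeholders.

```python
# DINO-style self-distillation sketch (single pair of views, toy networks).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Linear(128, 64), nn.GELU(), nn.Linear(64, 32))
teacher = copy.deepcopy(student)            # teacher starts as a copy, updated by EMA
for p in teacher.parameters():
    p.requires_grad_(False)

center = torch.zeros(32)                    # running center of teacher outputs
opt = torch.optim.SGD(student.parameters(), lr=0.1)


def dino_step(view1, view2, tau_s=0.1, tau_t=0.04, ema=0.996):
    global center
    s_out = student(view1)                  # student sees one augmented view
    with torch.no_grad():
        t_out = teacher(view2)              # teacher sees another view
        t_prob = F.softmax((t_out - center) / tau_t, dim=-1)   # center + sharpen

    loss = -(t_prob * F.log_softmax(s_out / tau_s, dim=-1)).sum(-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

    with torch.no_grad():
        # Teacher weights track the student; the center tracks teacher outputs.
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(ema).add_(ps, alpha=1 - ema)
        center = 0.9 * center + 0.1 * t_out.mean(dim=0)
    return loss.item()


print(dino_step(torch.randn(8, 128), torch.randn(8, 128)))
```
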
- CLIP: contrastive pretraining, matching texts with images (sketch below)
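
A sketch of CLIP-style contrastive pretraining with stand-in encoders: image and text embeddings are normalized, a cosine-similarity matrix is built for the batch, and a symmetric cross-entropy pulls each image toward its own caption.

```python
# CLIP-style contrastive loss sketch (encoders are stand-ins).
import torch
import torch.nn as nn
import torch.nn.functional as F

image_encoder = nn.Linear(512, 256)         # stand-in for an image transformer
text_encoder = nn.Linear(300, 256)          # stand-in for a text transformer


def clip_loss(images, texts, temperature=0.07):
    img = F.normalize(image_encoder(images), dim=-1)   # [B, D]
    txt = F.normalize(text_encoder(texts), dim=-1)     # [B, D]
    logits = img @ txt.T / temperature                 # [B, B] similarity matrix
    targets = torch.arange(len(images))                # i-th image matches i-th text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2


print(clip_loss(torch.randn(8, 512), torch.randn(8, 300)))
```
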