Self Attention for Computer Vision


revisit universality in DL

  • goal: one thing works for all, e.g. matrix multiply (Ax), ReLU, residual connections, MLE
    • not universal: preprocessing, formats, normalization
  • generalizable, simple, minimal explicit constraints, large impact from even small advances
  • focus: build a universal mixing primitive
    • entities with relations -> pixels, sentences, graphs -> attn works for all of them, and is scalable
  • bring attention to vision?
    • RNN limitations: computation cannot be parallelized across time steps & long-range interactions are limited
    • word alignment: align regions in the src and tgt language, p(a_ij | e_1…n, f_1…n)
      • self attention: interaction with all possible pairs -> trivially parallelized, FLOPs O(length^2 * dim) but with a small constant factor (see the sketch after this list)
  • a single SA head averages the influence of all other nodes -> multi-head attn models different kinds of interactions
  • repeating patterns in an image
    • e.g. multiple people in the same image
    • pattern matching: non-local means, bilateral filters
    • self attn as a data-dependent convolution
  • build an attn vision model
    • fully attentional model: reuse vision-designed components, and replace the spatial-mixing conv with attn
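
To make the all-pairs interaction, the O(length^2 * dim) cost, and the multi-head idea concrete, here is a minimal PyTorch sketch (the function name is mine, and the identity Q/K/V projections are a simplification; a real layer learns separate linear projections):

```python
import torch
import torch.nn.functional as F

def multi_head_self_attention(x, num_heads):
    """Every token attends to every other token: the (N, N) score matrix
    makes compute O(length^2 * dim), but it is all matmuls, so trivially
    parallel. Splitting dim into heads lets each head model a different
    kind of interaction."""
    B, N, D = x.shape                            # batch, tokens, feature dim
    d = D // num_heads
    # a real layer derives Q, K, V from learned linear projections of x;
    # identity projections keep the sketch short
    q = k = v = x.reshape(B, N, num_heads, d).transpose(1, 2)   # (B, H, N, d)
    scores = q @ k.transpose(-2, -1) / d**0.5    # (B, H, N, N): all pairs
    out = F.softmax(scores, dim=-1) @ v          # weighted average per head
    return out.transpose(1, 2).reshape(B, N, D)

x = torch.randn(2, 16, 64)                       # tokens: pixels, words, nodes
print(multi_head_self_attention(x, num_heads=8).shape)   # torch.Size([2, 16, 64])
```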

attn vision model design, with classification as an example

  • adapt an existing model: add attn to a vision model, or adapt an attn model from NLP to vision
    • e.g. transformer to vision
      • one-hot encoding -> embedding
      • first try: treat every pixel as a token and pass them into a transformer
        • output [W*H, feature], average to [feature]
        • too expensive: quadratic cost w.r.t. input length
        • smaller image? loss of detail -> learned downsample of the image, i.e. conv
        • patches, lin proj to a single vector for each patch (see the patch-embedding sketch after this list)
          • what about multiple scales? image -> ResNet, then apply a transformer at a high/medium/low feature level
      • attn for large resolutions?
        • cheaper attn, conv-like -> local attn: each pixel only interacts with neighboring pixels, linear scaling w.r.t. the num of pixels
        • blocked local attn: a single shared local region for all pixels within the same block; low mem, efficient, and more accurate -> affords a larger local neighborhood (see the window-attention sketch after this list)
        • swin transformer: shift windows across layers, offsetting them by some margin -> eventually, all pixels can interact with all other pixels
        • axial attention: attn along the height axis, then attn along the width axis (sketched after this list)
        • approx attn: replace softmax(QK^T)V with a kernelized form that reassociates (QK^T)V as Q(K^T V), linear in the num of tokens (sketched after this list)
    • positional encoding? absolute encodings have no translational equivariance
      • relative positions w.r.t. the query pixel (see the relative-bias sketch after this list)
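
For the patch route, a minimal sketch assuming ViT-style 16x16 patches and a 768-dim embedding (the PatchEmbed name and the hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

# A 224x224 image gives 50,176 pixel tokens; attention over them needs
# ~50k^2 pairwise scores. 16x16 patches cut this to 14 * 14 = 196 tokens.
class PatchEmbed(nn.Module):
    def __init__(self, patch=16, in_ch=3, dim=768):
        super().__init__()
        # conv with kernel = stride = patch size is exactly a linear
        # projection of each non-overlapping patch
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, img):                      # (B, 3, H, W)
        x = self.proj(img)                       # (B, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)      # (B, num_patches, dim)

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)   # torch.Size([1, 196, 768])
```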
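
A sketch of blocked local (window) attention with an optional Swin-style shift; real Swin additionally masks pairs that the cyclic shift wraps across image borders, which is omitted here:

```python
import torch
import torch.nn.functional as F

def window_attention(x, window=4, shift=0):
    """Blocked local attention: tile the feature map into window x window
    blocks and attend only within each block, so cost grows linearly with
    the number of pixels. shift > 0 offsets the grid between layers
    (Swin-style) so stacked layers connect neighboring windows."""
    B, H, W, D = x.shape
    if shift:
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    # tile into non-overlapping windows -> (B * num_windows, window*window, D)
    x = x.reshape(B, H // window, window, W // window, window, D)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, D)
    scores = x @ x.transpose(-2, -1) / D**0.5    # pairs within a window only
    out = F.softmax(scores, dim=-1) @ x
    # undo the tiling (and the shift)
    out = out.reshape(B, H // window, W // window, window, window, D)
    out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, D)
    if shift:
        out = torch.roll(out, shifts=(shift, shift), dims=(1, 2))
    return out

y = window_attention(torch.randn(1, 8, 8, 32), window=4, shift=2)
print(y.shape)   # torch.Size([1, 8, 8, 32])
```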
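
Axial attention under the same simplifications (no learned projections): each pixel mixes with the H pixels in its column, then the W pixels in its row, cutting cost from O((H*W)^2) to O(H*W*(H+W)) while two stacked passes still connect every pixel to every other:

```python
import torch
import torch.nn.functional as F

def attend(x):
    """Plain attention over the token axis of (batch, tokens, dim)."""
    scores = x @ x.transpose(-2, -1) / x.shape[-1]**0.5
    return F.softmax(scores, dim=-1) @ x

def axial_attention(x):
    B, H, W, D = x.shape
    x = attend(x.permute(0, 2, 1, 3).reshape(B * W, H, D))   # along height
    x = x.reshape(B, W, H, D).permute(0, 2, 1, 3)
    x = attend(x.reshape(B * H, W, D))                       # along width
    return x.reshape(B, H, W, D)

print(axial_attention(torch.randn(1, 8, 8, 32)).shape)   # torch.Size([1, 8, 8, 32])
```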
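
And the reassociation trick: once softmax is replaced by a kernel feature map phi (elu + 1 is the choice from the "linear transformer" line of work; other kernels exist), (QK^T)V can be computed as Q(K^T V), which never forms the N x N matrix:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    """softmax(QK^T)V approximated by phi(Q) (phi(K)^T V): reassociating
    the matmuls drops the N x N score matrix, so cost is O(N * d^2)
    instead of O(N^2 * d)."""
    q, k = F.elu(q) + 1, F.elu(k) + 1            # positive feature maps
    kv = k.transpose(-2, -1) @ v                 # (B, d, d), no N x N matrix
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)   # per-query normalizer
    return (q @ kv) / z

q = k = v = torch.randn(2, 1024, 64)             # 1024 tokens stay cheap
print(linear_attention(q, k, v).shape)           # torch.Size([2, 1024, 64])
```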
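
A learned relative-position bias, sketched for a 1D sequence (the RelativeBias1D name is mine; 2D variants index the bias by both row and column offsets):

```python
import torch
import torch.nn as nn

class RelativeBias1D(nn.Module):
    """Learned relative-position bias for length-n sequences: the score
    between query i and key j is shifted by a bias indexed by (j - i),
    so the learned pattern is the same wherever the query sits."""
    def __init__(self, n):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(2 * n - 1))   # offsets -(n-1)..(n-1)
        idx = torch.arange(n)
        self.register_buffer("offset", idx[None, :] - idx[:, None] + n - 1)

    def forward(self, scores):                   # scores: (..., n, n)
        return scores + self.bias[self.offset]

bias = RelativeBias1D(16)
print(bias(torch.randn(2, 16, 16)).shape)        # torch.Size([2, 16, 16])
```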

survey of SA in CV

  • DETR: details omitted, this paper rather puts me off btw…
    • visualize the attn patterns
  • semantic segmentation: SETR
    • patch-level attn also works
  • DINO: self-training; the student converges toward its past self (an EMA teacher) for self-supervised learning
  • CLIP: contrastive pretraining, matching texts with images (see the loss sketch below)
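
A sketch of the CLIP-style contrastive objective, assuming the image and text encoders are given and using a fixed temperature in place of CLIP's learned one:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """For a batch of matched (image, text) pairs, each image should score
    highest against its own caption and vice versa: symmetric
    cross-entropy over the cosine-similarity matrix."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (B, B) similarities
    targets = torch.arange(len(logits))            # matched pair = diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```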