Contrastive Learning for Gaze Estimation

NOTE: we are aware of similar concurrent work from CVPR 2022. While our intuition originates more from the practical side, we share many ideas in common. Interested readers can refer to their paper for more details.

This post presents my semester project on Contrastive Learning for Gaze Estimation. Self-supervised methods have demonstrated promising performance on various computer vision tasks, but their core building block, the InfoNCE loss, was originally proposed in the context of classification. It does not take the ground-truth gaze into account: images with similar gazes are still penalized as negatives. In addition, training with only one positive pair per anchor does not offer enough information to learn a sufficiently compact latent space, since the ground-truth gazes cover only a portion of all possible directions. To address both issues, we exploit all possible positive pairs by considering images with similar gazes, so that the learned feature space is gaze-aware.
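As a reference point, here is a minimal PyTorch sketch of the standard InfoNCE setup (assuming one augmented positive per anchor); it makes the issue visible: every off-diagonal sample in the batch is treated as a negative, no matter how close its gaze is to the anchor's.

```python
import torch
import torch.nn.functional as F

def info_nce(z_anchor, z_positive, temperature=0.1):
    """Standard InfoNCE: one positive per anchor; all other batch
    samples are treated as negatives regardless of their gaze."""
    z_anchor = F.normalize(z_anchor, dim=1)            # (B, D)
    z_positive = F.normalize(z_positive, dim=1)        # (B, D)
    logits = z_anchor @ z_positive.t() / temperature   # (B, B) similarities
    labels = torch.arange(z_anchor.size(0), device=z_anchor.device)
    # Diagonal entries are the positives; every off-diagonal entry is a
    # negative, even when that image has almost the same gaze direction.
    return F.cross_entropy(logits, labels)
```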

Our loss functions (poseNCEsim & poseNCEsim+), together with infoNCE and poseNCE for comparison, are shown in the figure below.

The weight in "weighted positive/negative pair" comes from the relative gaze difference between the two images, following poseNCE.
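To make the idea concrete, below is a minimal PyTorch sketch of such a gaze-weighted loss. The exponential weighting of pairs by their gaze distance and the bandwidth `sigma` are illustrative choices, not the exact formulation of poseNCEsim/poseNCEsim+ shown in the figure.

```python
import torch
import torch.nn.functional as F

def gaze_weighted_nce(z, gaze, temperature=0.1, sigma=0.1):
    """Sketch of a gaze-aware NCE loss: every other image in the batch acts
    as a soft positive, weighted by how close its ground-truth gaze is to
    the anchor's (weighting scheme is illustrative)."""
    z = F.normalize(z, dim=1)                       # (B, D) embeddings
    logits = z @ z.t() / temperature                # (B, B) pairwise similarities
    eye = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(eye, -1e9)          # exclude self-pairs

    # Relative gaze difference between all pairs (e.g. distance between
    # pitch/yaw vectors), turned into soft positive weights.
    weights = torch.exp(-torch.cdist(gaze, gaze) / sigma)
    weights = weights.masked_fill(eye, 0.0)
    weights = weights / weights.sum(dim=1, keepdim=True).clamp(min=1e-8)

    # Soft cross-entropy: each anchor is pulled toward gaze-similar samples.
    log_prob = F.log_softmax(logits, dim=1)
    return -(weights * log_prob).sum(dim=1).mean()
```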

Experimentally, our method outperforms all other loss functions and learns a more compact latent space.

Here we compare our loss to a supervised baseline, a data-augmentation baseline, and other variants of the NCE loss.

Qualitatively, our learned latent space is able to retrieve images across identities and head poses.
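For readers curious how such retrievals are produced, here is a brief sketch of nearest-neighbour lookup in the learned latent space (function and variable names are illustrative, not part of our codebase):

```python
import torch
import torch.nn.functional as F

def retrieve_neighbors(query_feat, gallery_feats, k=5):
    """Return indices of the k gallery images closest to the query in the
    learned latent space, using cosine similarity."""
    q = F.normalize(query_feat, dim=-1)        # (D,)
    g = F.normalize(gallery_feats, dim=-1)     # (N, D)
    sims = g @ q                               # cosine similarity to the query
    return torch.topk(sims, k).indices         # k nearest neighbours
```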