Revisiting Locally Supervised Learning: an Alternative to End-to-end Training
Why does end-to-end (E2E) training outperform locally supervised learning?
- hypothesis: the input x carries two kinds of information: a part related to the target y, and a part r that is irrelevant to the target
- locally supervised learning discards all of r and also part of the information related to y; later layers cannot recover what has been lost.
- how to verify: estimate mutual information (only an approximation) and measure the linear separability of intermediate features
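- InfoPro loss: maximize I(h, x) so that local modules preserve information about the input, while discarding the task-irrelevant part r; the tractable surrogate used in practice maximizes both I(h, x) and I(h, y)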
- both terms are intractable and have to be approximated; the gap between the approximations and the true values is negligible
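
The hypothesis and the mutual-information check above can be illustrated on a toy discrete problem. This is only a sketch under stated assumptions, not the paper's estimator: the variables are small integers, `mutual_information` is a hypothetical helper that computes the plug-in estimate from enumerated samples, and the "short-sighted" feature `h = y // 2` is a made-up stand-in for a local module that drops r together with part of the information about y.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(pairs):
    """Plug-in estimate of I(A; B) in bits from a list of (a, b) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    return sum(
        (c / n) * log2(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )


# Toy data (assumption): every (y, r) combination appears once,
# so y and r are independent and uniform over {0, 1, 2, 3}.
samples = list(product(range(4), range(4)))

# The input x keeps both the task-relevant part y and the nuisance r.
i_x_y = mutual_information([((y, r), y) for y, r in samples])

# A short-sighted feature: drops r entirely and merges labels pairwise.
i_h_y = mutual_information([(y // 2, y) for y, r in samples])
i_h_r = mutual_information([(y // 2, r) for y, r in samples])
i_h_x = mutual_information([(y // 2, (y, r)) for y, r in samples])

print(f"I(x; y) = {i_x_y:.2f} bits")  # 2.00: x contains everything about y
print(f"I(h; y) = {i_h_y:.2f} bits")  # 1.00: half of the label information is gone
print(f"I(h; r) = {i_h_r:.2f} bits")  # 0.00: the nuisance r is fully discarded
print(f"I(h; x) = {i_h_x:.2f} bits")  # 1.00: little information about x is kept
```

No function of h can bring I(h; y) back up to I(x; y), which is the sense in which later layers cannot recover the lost information. On real features the variables are continuous and high-dimensional, so this plug-in count is not usable and the quantities have to be approximated as the notes say.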
