GRADIENT DESCENT PROVABLY OPTIMIZES OVER-PARAMETERIZED NEURAL NETWORKS

Published:

this work: for an over-parameterized two-layer fully connected (FC) network with ReLU activation, gradient descent (GD) reaches a global optimum of the training loss at a linear rate

  • observation: networks reach zero training error even on random labels
  • proof…
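
To make the claim concrete, here is a minimal sketch (not the paper's code) of plain GD on a wide two-layer ReLU network fitting random ±1 labels. Everything specific is an assumption chosen for illustration: the function name `train_two_layer_relu`, the sizes (`n`, `d`, `m`), the step size `lr`, the seed, the quadratic loss, unit-norm inputs, and the choice to train only the first layer while the output weights stay fixed at random signs. It uses only the standard library so it runs as-is.

```python
import math
import random


def train_two_layer_relu(n=5, d=8, m=256, lr=0.2, steps=200, seed=0):
    """GD on f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x) with quadratic loss.

    Returns the list of training losses, one per iteration (length steps + 1).
    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)

    # n random inputs, normalized to unit length.
    X = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        X.append([c / norm for c in v])

    # Random labels: there is no signal to learn, only to memorize.
    y = [rng.choice([-1.0, 1.0]) for _ in range(n)]

    # First layer: Gaussian init, trained. Output layer: fixed random signs.
    W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    a = [rng.choice([-1.0, 1.0]) for _ in range(m)]
    scale = 1.0 / math.sqrt(m)

    losses = []
    for _ in range(steps + 1):
        # Forward pass: pre[i][r] = w_r . x_i
        pre = [[sum(w[k] * x[k] for k in range(d)) for w in W] for x in X]
        preds = [
            scale * sum(a[r] * max(pre[i][r], 0.0) for r in range(m))
            for i in range(n)
        ]
        resid = [preds[i] - y[i] for i in range(n)]
        losses.append(0.5 * sum(e * e for e in resid))

        # Gradient step on W:
        # dL/dw_r = scale * a_r * sum_i resid_i * 1[w_r . x_i > 0] * x_i
        for r in range(m):
            for i in range(n):
                if pre[i][r] > 0.0:
                    coef = lr * scale * a[r] * resid[i]
                    for k in range(d):
                        W[r][k] -= coef * X[i][k]
    return losses


if __name__ == "__main__":
    losses = train_two_layer_relu()
    # A roughly constant ratio between consecutive losses indicates a linear rate.
    for k in range(0, len(losses), 40):
        print(k, losses[k])
```

With the hidden width `m` much larger than the number of samples `n`, the printed loss should shrink by a roughly constant factor per step, which is what "linear rate" means here; shrinking `m` toward `n` is a quick way to see where that behaviour starts to degrade.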