Quantization is an effective approach to accelerating deep neural networks by restricting their weights and activations to low precision. However, the training objective (loss function) then becomes discontinuous, so the standard gradient either vanishes almost everywhere or does not exist. We discuss a notion of coarse gradient (also known as the straight-through estimator) that acts on smooth proxies of the discontinuous functions and, with proper design, leads to descent of the training loss as well as satisfactory generalization accuracy. We perform convergence analysis on simplified models and experiments on image classification, some in conjunction with a feature-affinity assisted multi-level knowledge distillation that extracts an efficient student network from a larger teacher network on label-free data.
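As an illustration only (not code from the talk), here is a minimal PyTorch sketch of a coarse-gradient (straight-through) quantizer: the forward pass applies a hard sign quantization, while the backward pass uses the gradient of a clipped-identity (hard-tanh) proxy in place of the true, almost-everywhere-zero derivative. The class and variable names are hypothetical.

```python
import torch

class CoarseGradientQuant(torch.autograd.Function):
    """Binary quantizer trained with a coarse (straight-through) gradient."""

    @staticmethod
    def forward(ctx, x):
        # Hard quantization: discontinuous, with zero gradient almost everywhere.
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Coarse gradient: derivative of a clipped-identity proxy,
        # i.e. pass the gradient through only where |x| <= 1.
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)


# Usage: quantized forward pass, yet a usable gradient signal for training.
x = torch.randn(4, requires_grad=True)
y = CoarseGradientQuant.apply(x)
y.sum().backward()
print(x.grad)  # nonzero only where |x| <= 1
```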
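Likewise, a sketch of one plausible form of a feature-affinity assisted, label-free distillation objective, assuming intermediate feature maps from matched levels of the teacher and student are available. The function names, the temperature `T`, and the weight `alpha` are illustrative assumptions, not the exact losses used in the work.

```python
import torch
import torch.nn.functional as F

def feature_affinity_loss(student_feat, teacher_feat):
    """Match pairwise sample affinities of student and teacher feature maps.

    Inputs are (batch, channels, H, W) tensors from corresponding levels;
    channel counts may differ, since only batch-wise affinity matrices are compared.
    """
    def affinity(feat):
        f = feat.flatten(start_dim=1)   # (batch, C*H*W)
        f = F.normalize(f, dim=1)       # unit-norm rows
        return f @ f.t()                # (batch, batch) cosine affinities

    return F.mse_loss(affinity(student_feat), affinity(teacher_feat))

def distillation_loss(student_logits, teacher_logits, feat_pairs, T=4.0, alpha=1.0):
    """Label-free multi-level distillation: soft-logit KL + feature-affinity terms."""
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    fa = sum(feature_affinity_loss(s, t) for s, t in feat_pairs)
    return kd + alpha * fa
```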
Note:
Video: https://drive.google.com/file/d/1XjCkao0qH2bX5u16CxI2hAMHCvRW6pkd/view?u...