Overview

Quantization is an effective technique for significantly reducing the memory and computational requirements of large models by representing parameters with lower-precision data types.

PyTorch offers an easy way to convert a model to half precision.

We also have to make sure the input data is half precision, since the converted model expects float16 tensors.
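
As a minimal sketch (the model, shapes, and device handling here are illustrative, not from the original), converting both the model and its inputs might look like this:

```python
import torch
import torch.nn as nn

# Half precision is primarily intended for GPUs; CPU support for
# float16 ops varies by PyTorch version.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Illustrative model; any nn.Module converts the same way.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# .half() casts all floating-point parameters and buffers to float16.
model = model.half().to(device)

# The input must match the model's dtype and device.
x = torch.randn(32, 128, dtype=torch.float16, device=device)
output = model(x)
```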

Now, most articles on this topic revolve around post-training quantization. PyTorch implements the following techniques:

  1. dynamic quantization (weights quantized with activations read/stored in floating point and quantized for compute); see the sketch after this list

  2. static quantization (weights quantized, activations quantized, calibration required post training)

  3. static quantization aware training (weights quantized, activations quantized, quantization numerics modeled during training)
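
Dynamic quantization is the easiest entry point, since it requires no calibration data. A minimal sketch, assuming an eager-mode float32 model built from nn.Linear layers (the model and shapes are illustrative):

```python
import torch
import torch.nn as nn

# Illustrative float32 model with Linear layers, which dynamic
# quantization supports out of the box.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

# Quantize the Linear weights to int8; activations stay in floating
# point and are quantized on the fly at compute time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Inference uses ordinary float32 inputs.
x = torch.randn(32, 128)
output = quantized_model(x)
```

Static quantization and quantization-aware training follow a longer prepare/calibrate (or train)/convert workflow, which the PyTorch quantization docs cover in detail.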