Publications

Towards Cheaper Inference in Deep Networks with Lower Bit-Width Accumulators

Published in WANT@NeurIPS, 2023

The majority of research on the quantization of Deep Neural Networks (DNNs) focuses on reducing the precision of tensors visible to high-level frameworks (e.g., weights, activations, and gradients). However, current hardware still relies on high-precision core operations, most significantly the accumulation of products. This high-precision accumulation is gradually becoming the main computational bottleneck, because so far the use of low-precision accumulators has led to significant degradation in performance. In this work, we present a simple method to train and fine-tune high-end DNNs that allows, for the first time, the use of cheaper, lower bit-width accumulators with no significant degradation in accuracy. Lastly, we show that as the accumulation precision is decreased further, fine-grained gradient approximations can improve DNN accuracy.
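
As a rough illustration of why accumulation precision matters, the sketch below (not the paper's training method) computes an int8 dot product while clamping every partial sum to a signed accumulator of a given bit-width, so the result can be compared against a wide accumulator. The function name and the `acc_bits` parameter are introduced here for illustration only.

```python
# Minimal sketch: simulate a dot product whose partial sums are clamped to a
# narrow signed accumulator, to show how low-precision accumulation can
# saturate. This is an illustration, not the paper's training recipe.
import numpy as np

def saturating_dot(x_q: np.ndarray, w_q: np.ndarray, acc_bits: int = 16) -> int:
    """Dot product of int8 vectors with a saturating signed accumulator."""
    lo, hi = -(2 ** (acc_bits - 1)), 2 ** (acc_bits - 1) - 1
    acc = 0
    for xi, wi in zip(x_q.astype(np.int64), w_q.astype(np.int64)):
        acc = int(np.clip(acc + xi * wi, lo, hi))  # clamp every partial sum
    return acc

rng = np.random.default_rng(0)
x = rng.integers(-128, 128, size=256, dtype=np.int8)
w = rng.integers(-128, 128, size=256, dtype=np.int8)
print("wide accumulator  :", int(x.astype(np.int64) @ w.astype(np.int64)))
print("narrow accumulator:", saturating_dot(x, w, acc_bits=16))
```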

Recommended citation: Yaniv B., Itay H., Daniel S. (2023). "Towards Cheaper Inference in Deep Networks with Lower Bit-Width Accumulators." WANT@NeurIPS. [link to paper](https://openreview.net/forum?id=wMbe8fVjgf)

Minimum Variance Unbiased N:M Sparsity for the Neural Gradients

Published in The Eleventh International Conference on Learning Representations (ICLR), 2023

In deep learning, fine-grained N:M sparsity reduces the data footprint and bandwidth of a General Matrix Multiply (GEMM) by up to 2x, and doubles throughput by skipping the computation of zero values. So far, it has mainly been used to prune weights to accelerate the forward and backward phases. We examine how this method can also be used for the neural gradients (i.e., loss gradients with respect to the intermediate neural-layer outputs). To this end, we first establish a tensor-level optimality criterion. Previous works aimed to minimize the mean-square error (MSE) of each pruned block. We show that while MSE minimization works well for pruning the weights and activations, it catastrophically fails for the neural gradients. Instead, we show that accurate pruning of the neural gradients requires an unbiased minimum-variance pruning mask. We design such specialized masks, and find that in most cases 1:2 sparsity is sufficient for training, and 2:4 sparsity is usually enough when it is not. Further, we suggest combining several such methods together to potentially speed up training even more. A reference implementation is supplied in the supplementary material.
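
To make the contrast with MSE-based pruning concrete, here is a minimal sketch of the 1:2 case in PyTorch: each block of two gradient entries keeps one entry, sampled with probability proportional to its magnitude, and the survivor is rescaled so the block is unbiased in expectation. This only illustrates the unbiased minimum-variance idea; the paper's actual masks (and the 2:4 case) are specified in the reference implementation.

```python
# Sketch of unbiased 1:2 pruning: per block of two, keep one entry sampled
# proportionally to its magnitude and rescale it by 1/p, so E[output] equals
# the original block (unbiased), with the sampling chosen to reduce variance.
import torch

def prune_1_of_2_unbiased(g: torch.Tensor) -> torch.Tensor:
    blocks = g.reshape(-1, 2)
    mag = blocks.abs() + 1e-12                   # avoid an all-zero block
    p = mag / mag.sum(dim=1, keepdim=True)       # keep-probability proportional to magnitude
    keep = torch.multinomial(p, num_samples=1)   # sample the surviving index per block
    kept_val = blocks.gather(1, keep)
    kept_p = p.gather(1, keep)
    out = torch.zeros_like(blocks)
    out.scatter_(1, keep, kept_val / kept_p)     # rescale so the estimate is unbiased
    return out.reshape(g.shape)

grad = torch.randn(4, 8)                         # stand-in for a neural gradient tensor
print(prune_1_of_2_unbiased(grad))
```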

Recommended citation: Brian C., Itay H., Ron B., Daniel S. (2023). "Minimum Variance Unbiased N:M Sparsity for the Neural Gradients." ICLR 2023. [link to paper](https://openreview.net/pdf?id=vuD2xEtxZcj)

Accurate Post Training Quantization With Small Calibration Sets

Published in International Conference on Machine Learning, 2021

Lately, post-training quantization methods have gained considerable attention, as they are simple to use and require only a small unlabeled calibration set. This small dataset cannot be used to fine-tune the model without significant over-fitting. Instead, these methods only use the calibration set to set the activations' dynamic ranges. However, such methods have always resulted in significant accuracy degradation when used below 8 bits (except on small datasets). Here we aim to break the 8-bit barrier. To this end, we minimize the quantization errors of each layer or block separately by optimizing its parameters over the calibration set. We empirically demonstrate that this approach is: (1) much less susceptible to over-fitting than standard fine-tuning approaches, and can be used even on a very small calibration set; and (2) more powerful than previous methods, which only set the activations' dynamic ranges. We suggest two flavors of our method, parallel and sequential, aimed at fixed and flexible bit-width allocation, respectively. For the latter, we demonstrate how to optimally allocate the bit-width of each layer, while constraining accuracy degradation or model compression, by proposing a novel integer-programming formulation. Finally, we suggest model global-statistics tuning to correct biases introduced during quantization. Together, these methods yield state-of-the-art results for both vision and text models. For instance, on ResNet50 we obtain less than 1% accuracy degradation with 4-bit weights and activations in all layers except the first and last. The suggested methods are two orders of magnitude faster than the traditional Quantization-Aware Training approach used for lower-than-8-bit quantization. Our code is open-sourced at [https://github.com/itayhubara/CalibTIP](https://github.com/itayhubara/CalibTIP).
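
As a rough illustration of optimizing a layer over the calibration set, the sketch below grid-searches a weight-quantization scale that minimizes a linear layer's output MSE on a tiny calibration batch. It is a simplification: the full method also optimizes the quantized parameters themselves, allocates bit-widths with the integer-programming formulation, and tunes global statistics. Names such as `calibrate_layer_scale` and `calib_x` are placeholders, not part of the released code.

```python
# Simplified sketch of per-layer calibration: pick the weight-quantization
# scale that minimizes the layer's output MSE on a small calibration batch.
import torch

def quantize(w: torch.Tensor, scale: float, bits: int = 4) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

def calibrate_layer_scale(layer: torch.nn.Linear, calib_x: torch.Tensor, bits: int = 4) -> float:
    """Grid-search the scale that minimizes output MSE on the calibration batch."""
    ref = layer(calib_x)                                   # full-precision reference output
    base = layer.weight.abs().max().item() / (2 ** (bits - 1) - 1)
    best_scale, best_err = base, float("inf")
    for f in torch.linspace(0.5, 1.2, steps=50):           # candidate scales around the max-based one
        scale = base * f.item()
        w_q = quantize(layer.weight.data, scale, bits)
        out = torch.nn.functional.linear(calib_x, w_q, layer.bias)
        err = (out - ref).pow(2).mean().item()
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale

layer = torch.nn.Linear(64, 64)
calib_x = torch.randn(32, 64)                              # small unlabeled calibration batch
print("chosen scale:", calibrate_layer_scale(layer, calib_x))
```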

Recommended citation: Itay H., Yury N., Yair H., Ron B., Daniel S. (2021). "Accurate Post Training Quantization With Small Calibration Sets." ICML 2021. [link to paper](https://proceedings.mlr.press/v139/hubara21a.html)