Publications

Block Sparse Flash Attention

Published on arXiv, 2025

We present Block-Sparse FlashAttention (BSFA), a drop-in replacement that addresses the quadratic complexity bottleneck of attention, accelerating long-context inference while preserving model quality.

Recommended citation: Daniel Ohayon, Itay Lamprecht, Itay Hubara, Israel Cohen, Daniel Soudry, Noam Elata. (2025). "Block Sparse Flash Attention." arXiv preprint arXiv:2512.07011. https://arxiv.org/pdf/2512.07011

Foldable SuperNets: Scalable Merging of Transformers with Different Initializations and Tasks

Published in Transactions on Machine Learning Research (TMLR), 2025

We propose Foldable SuperNet (FoldSN), a novel method for merging multiple Transformer models, trained on different tasks and from different initializations, into a single, scalable SuperNet. This approach enables dynamic resource allocation and efficient multi-task inference.

Recommended citation: Edan Kinderman, Itay Hubara, Haggai Maron, Daniel Soudry. (2025). "Foldable SuperNets: Scalable Merging of Transformers with Different Initializations and Tasks." TMLR 2025. https://openreview.net/pdf?id=6FqwLestHv

Towards Cheaper Inference in Deep Networks with Lower Bit-Width Accumulators

Published in ICLR, 2024

We present a method to train and fine-tune high-end DNNs to utilize cheaper, low-bit accumulators with no significant degradation in accuracy, addressing the computational bottleneck of high-precision accumulation.

Recommended citation: Yaniv Blumenfeld, Itay Hubara, Daniel Soudry. (2024). "Towards Cheaper Inference in Deep Networks with Lower Bit-Width Accumulators." ICLR 2024. https://openreview.net/pdf?id=wMbe8fVjgf

Minimum Variance Unbiased N:M Sparsity for the Neural Gradients

Published in ICLR, 2023

We examine how N:M sparsity can be used for neural gradients. We show that, unlike weights and activations, gradients require an unbiased, minimum-variance pruning mask. We design such masks and show that 1:2 or 2:4 sparsity works well.
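To make the unbiasedness requirement concrete, here is a minimal sketch of one stochastic 1:2 pruning rule: in each pair, one element is kept with probability proportional to its magnitude and rescaled so that the expected output equals the input. The function name `prune_1_of_2` and the exact choice of probabilities and rescaling are illustrative assumptions, not a statement of the paper's construction.

```python
import random


def prune_1_of_2(a1, a2, rng=random):
    """Zero one element of the pair (a1, a2) so that the result is 1:2 sparse.

    Assumed rule (illustrative): keep slot i with probability
    |a_i| / (|a1| + |a2|) and rescale the kept value to magnitude
    |a1| + |a2|. The expectation of the output is then exactly (a1, a2).
    """
    total = abs(a1) + abs(a2)
    if total == 0.0:
        return 0.0, 0.0
    if rng.random() < abs(a1) / total:
        return (total if a1 > 0 else -total), 0.0
    return 0.0, (total if a2 > 0 else -total)


def empirical_mean(a1, a2, trials, rng):
    """Average many pruned samples to check unbiasedness numerically."""
    s1 = s2 = 0.0
    for _ in range(trials):
        p1, p2 = prune_1_of_2(a1, a2, rng)
        s1 += p1
        s2 += p2
    return s1 / trials, s2 / trials
```

Calling `empirical_mean(0.3, -0.1, 20000, random.Random(0))` should return values close to the original pair, while every individual sample has exactly one nonzero entry.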

Recommended citation: Brian Chmiel, Itay Hubara, Ron Banner, Daniel Soudry. (2023). "Minimum Variance Unbiased N:M Sparsity for the Neural Gradients." ICLR 2023. https://openreview.net/pdf?id=vuD2xEtxZcj

Accelerated Sparse Neural Training: A Provable and Efficient Method to Find N:M Transposable Masks

Published in NeurIPS, 2021

We suggest a novel transposable fine-grained sparsity mask for N:M sparsity, allowing acceleration of both forward and backward passes. We formulate finding the optimal mask as a min-cost flow problem.
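As an illustration of what "transposable" means here, the sketch below treats a mask on an M x M block as transposable N:M when every row and every column keeps exactly N entries, so the same mask stays N:M after transposition. It finds the best such mask for a single small block by exhaustive search, which stands in for the min-cost flow formulation and is only practical for tiny blocks. The objective (maximize the retained absolute magnitude) and the function names are assumptions made for this example.

```python
from itertools import combinations, product


def is_transposable(mask, n):
    """True if every row and every column of the 0/1 mask has exactly n ones."""
    rows_ok = all(sum(row) == n for row in mask)
    cols_ok = all(sum(col) == n for col in zip(*mask))
    return rows_ok and cols_ok


def best_transposable_mask(block, n=2):
    """Exhaustively search one M x M block for the transposable n:M mask
    that retains the largest total absolute value (assumed objective)."""
    m = len(block)
    row_patterns = list(combinations(range(m), n))  # column indices kept per row
    best_score, best_mask = -1.0, None
    for choice in product(row_patterns, repeat=m):
        col_counts = [0] * m
        for cols in choice:
            for c in cols:
                col_counts[c] += 1
        if any(k != n for k in col_counts):
            continue  # column constraint violated: mask is not transposable
        score = sum(abs(block[r][c]) for r, cols in enumerate(choice) for c in cols)
        if score > best_score:
            best_score = score
            best_mask = [[1 if c in cols else 0 for c in range(m)] for cols in choice]
    return best_mask, best_score
```

For a 4 x 4 block with 2:4 sparsity this enumerates 1,296 row-pattern combinations, which is fine for a demonstration but does not scale to larger blocks.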

Recommended citation: Itay Hubara, Brian Chmiel, Moshe Island, Ron Banner, Joseph Naor, Daniel Soudry. (2021). "Accelerated Sparse Neural Training: A Provable and Efficient Method to Find N:M Transposable Masks." NeurIPS 2021. https://proceedings.neurips.cc/paper/2021/file/b0490b85e92b64dbb5db76bf8fca6a82-Paper.pdf

Accurate Post Training Quantization With Small Calibration Sets

Published in ICML, 2021

We minimize quantization errors of each layer by optimizing parameters over a small calibration set, breaking the 8-bit barrier for post-training quantization without significant overfitting.

Recommended citation: Itay Hubara, Yury Nahshan, Yair Hanani, Ron Banner, Daniel Soudry. (2021). "Accurate Post Training Quantization With Small Calibration Sets." ICML 2021. http://proceedings.mlr.press/v139/hubara21a/hubara21a.pdf

The Knowledge Within: Methods for Data-Free Model Compression

Published in CVPR, 2020

We propose methods for compressing models without access to the original training data, generating synthetic data that matches the statistics of the original dataset.

Recommended citation: Matan Haroush, Itay Hubara, Elad Hoffer, Daniel Soudry. (2020). "The Knowledge Within: Methods for Data-Free Model Compression." CVPR 2020. https://openaccess.thecvf.com/content_CVPR_2020/papers/Haroush_The_Knowledge_Within_Methods_for_Data-Free_Model_Compression_CVPR_2020_paper.pdf

Augment Your Batch: Improving Generalization Through Instance Repetition

Published in CVPR, 2020

We propose to repeat instances within a batch with different data augmentations. This simple modification consistently improves generalization.
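The idea can be sketched in a few lines: each sample in a batch is replaced by several copies, each passed through an independently drawn augmentation, with the label unchanged. The helper names `augment_batch` and `random_shift`, and the toy cyclic-shift augmentation, are hypothetical and chosen only to keep the example self-contained.

```python
import random


def random_shift(x, rng, max_shift=2):
    """Toy augmentation (assumed): cyclically shift a list by a random offset."""
    k = rng.randint(-max_shift, max_shift)
    return x[k:] + x[:k]


def augment_batch(batch, augment, repeats, rng):
    """Return a batch holding `repeats` independently augmented copies
    of every (sample, label) pair; labels are copied unchanged."""
    out = []
    for x, y in batch:
        for _ in range(repeats):
            out.append((augment(x, rng), y))
    return out


batch = [([1, 2, 3, 4], 0), ([5, 6, 7, 8], 1)]
bigger = augment_batch(batch, random_shift, repeats=3, rng=random.Random(0))
```

Here `bigger` holds six samples: three augmented views of each original instance, grouped by instance.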

Recommended citation: Elad Hoffer, Tal Ben-Nun, Itay Hubara, Niv Giladi, Torsten Hoefler, Daniel Soudry. (2020). "Augment Your Batch: Improving Generalization Through Instance Repetition." CVPR 2020. https://openaccess.thecvf.com/content_CVPR_2020/papers/Hoffer_Augment_Your_Batch_Improving_Generalization_Through_Instance_Repetition_CVPR_2020_paper.pdf

Binarized Neural Networks

Published in NeurIPS, 2016

We introduce Binarized Neural Networks (BNNs), neural networks with binary weights and activations at run-time. At training time, the binary weights and activations are used for computing the parameter gradients. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which is expected to substantially improve power efficiency.
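A small sketch of how bit-wise operations can stand in for arithmetic: once values are binarized to +1 or -1 and packed into machine words, a dot product reduces to an XOR followed by a bit count. The function names and the bit-packing layout below are assumptions for illustration, and the pure-Python integers only model what hardware would do on fixed-width words.

```python
def binarize(values):
    """Deterministic sign binarization: +1 for v >= 0, -1 otherwise."""
    return [1 if v >= 0 else -1 for v in values]


def pack_bits(signs):
    """Pack a +1/-1 vector into an int; bit i is set when signs[i] == +1."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def binary_dot(word_a, word_b, n):
    """Dot product of two packed +1/-1 vectors of length n.

    Positions where the bits differ contribute -1 and matching positions
    contribute +1, so the result is n - 2 * popcount(a XOR b).
    """
    disagreements = bin(word_a ^ word_b).count("1")
    return n - 2 * disagreements


a = binarize([0.5, -1.2, 0.1, -0.3, 2.0])
b = binarize([-0.7, -0.4, 0.9, 0.2, 1.1])
fast = binary_dot(pack_bits(a), pack_bits(b), len(a))
slow = sum(x * y for x, y in zip(a, b))  # reference result using multiplications
```

Both `fast` and `slow` compute the same value; the packed version uses no multiplications at all.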

Recommended citation: Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, Yoshua Bengio. (2016). "Binarized Neural Networks." NeurIPS 2016. https://proceedings.neurips.cc/paper/2016/file/d8330f857a17c53d217014ee776bfd50-Paper.pdf