Efficiently Dispatching Flash Attention For Partially Filled Attention Masks

Safety- and Efficiency- aligned Learning Conference Paper 2024

Jonas Geiping

Research Group Leader

Transformers are widely used across various applications, many of which yield sparse or partially filled attention matrices. Examples include attention masks designed to reduce the quadratic complexity of attention, sequence packing techniques, and recent innovations like tree masking for fast validation in MEDUSA. Despite the inherent sparsity in these matrices, the state-of-the-art algorithm Flash Attention still processes them with quadratic complexity as though they were dense. In this paper, we introduce Binary Block Masking, a highly efficient modification that enhances Flash Attention by making it mask-aware. We further propose two optimizations: one tailored for masks with contiguous non-zero patterns and another for extremely sparse masks. Our experiments on attention masks derived from real-world scenarios demonstrate up to a 9x runtime improvement. The implementation will be publicly released to foster further research and application.
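The abstract does not spell out how Binary Block Masking is computed, so the following is only a minimal sketch of one plausible reading: the attention mask is divided into tiles, and a coarse binary grid records which tiles contain at least one non-zero entry, so that a tiled attention kernel could skip the all-zero tiles. The function names `binary_block_mask` and `packed_sequence_mask`, the 4x4 tile size, and the sequence lengths are assumptions made for illustration and are not taken from the paper.

```python
def binary_block_mask(mask, block_rows, block_cols):
    """Coarse 0/1 grid: entry [i][j] is 1 if tile (i, j) of `mask` has any non-zero entry.

    Illustrative assumption, not the paper's implementation.
    """
    n_rows = len(mask)
    n_cols = len(mask[0]) if n_rows else 0
    grid = []
    for r0 in range(0, n_rows, block_rows):
        row = []
        for c0 in range(0, n_cols, block_cols):
            # A tile is "active" as soon as one of its entries is non-zero.
            tile_has_entry = any(
                mask[r][c]
                for r in range(r0, min(r0 + block_rows, n_rows))
                for c in range(c0, min(c0 + block_cols, n_cols))
            )
            row.append(1 if tile_has_entry else 0)
        grid.append(row)
    return grid


def packed_sequence_mask(lengths):
    """Block-diagonal mask for sequence packing: tokens attend only within their own sequence."""
    total = sum(lengths)
    mask = [[0] * total for _ in range(total)]
    start = 0
    for n in lengths:
        for r in range(start, start + n):
            for c in range(start, start + n):
                mask[r][c] = 1
        start += n
    return mask


# Three packed sequences (hypothetical lengths) in a 16-token window.
mask = packed_sequence_mask([4, 4, 8])
blocks = binary_block_mask(mask, 4, 4)

active = sum(map(sum, blocks))              # tiles a kernel would still need to visit
total_tiles = len(blocks) * len(blocks[0])  # tiles a dense kernel would visit
print(f"{active} of {total_tiles} tiles contain non-zero mask entries")
```

In this toy case only the tiles along the block diagonal are active, which illustrates why a partially filled mask leaves room for skipping work. The runtime figures reported in the abstract come from the authors' experiments and cannot be reproduced with this sketch.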

Author(s):	Sharma, Agniv and Geiping, Jonas
Book Title:	ENLSP NeurIPS Workshop 2024
Year:	2024
Month:	December
Day:	14
Publisher:	ENLSP NeurIPS Workshop 2024

Bibtex Type:	Conference Paper (inproceedings)

Event Name:	ENLSP NeurIPS Workshop 2024
Event Place:	Vancouver, Canada
State:	Published
URL:	https://neurips2024-enlsp.github.io/accepted_papers.html

BibTeX

@inproceedings{sharma_2024,
  title = {Efficiently Dispatching Flash Attention For Partially Filled Attention Masks},
  booktitle = {ENLSP NeurIPS Workshop 2024},
  abstract = {Transformers are widely used across various applications, many of which yield sparse or partially filled attention matrices. Examples include attention masks designed to reduce the quadratic complexity of attention, sequence packing techniques, and recent innovations like tree masking for fast validation in MEDUSA. Despite the inherent sparsity in these matrices, the state-of-the-art algorithm Flash Attention still processes them with quadratic complexity as though they were dense. In this paper, we introduce Binary Block Masking, a highly efficient modification that enhances Flash Attention by making it mask-aware. We further propose two optimizations: one tailored for masks with contiguous non-zero patterns and another for extremely sparse masks. Our experiments on attention masks derived from real-world scenarios demonstrate up to a 9x runtime improvement. The implementation will be publicly released to foster further research and application.},
  publisher = {ENLSP NeurIPS Workshop 2024},
  month = dec,
  year = {2024},
  slug = {sharma_2024},
  author = {Sharma, Agniv and Geiping, Jonas},
  url = {https://neurips2024-enlsp.github.io/accepted_papers.html},
  month_numeric = {12}
}
