awesome-fast-attention

A curated list of efficient attention modules (last update: Tue, 04 Aug 2020 05:36:27 +0000)

Efficient Attention
Articles

Efficient Attention

Paper (citations)	Implementation	Complexity	AutoRegressive	Main Idea
Generating Wikipedia by Summarizing Long Sequences (208)	memory-compressed-attention	$\mathcal{O}({b}\cdot\frac{N}{b}\cdot\frac{N}{{b}\cdot{k}}\cdot{D})$	❌	compresses key and value + blocked attention
CBAM: Convolutional Block Attention Module (677)	attention-module	$\mathcal{O}(({N}\cdot{D}+\frac{{D}^2}{r})+({N}\cdot{D}\cdot{k}^2))$	❌	combines the SE attention with a per-pixel (local) weight
CCNet: Criss-Cross Attention for Semantic Segmentation (149)	CCNet	$\mathcal{O}({N}\cdot({H}+{W})\cdot{D})$	❌	each pixel attends to its row and column simultaneously
Efficient Attention: Attention with Linear Complexities (2)	efficient-attention	$\mathcal{O}({N}\cdot{D}^2)$	❌	Softmax(Q)(Softmax(K^T)V)
Star-Transformer (24)	fastNLP	$\mathcal{O}({N}\cdot{D})$	❌	uses a relay (global) node and attends to/from that node
Generating Long Sequences with Sparse Transformers (139)	torch-blocksparse	$\mathcal{O}({N}\cdot\sqrt{N}\cdot{D})$	✔️	sparse block-based attention
GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond (96)	GCNet	$\mathcal{O}({N}\cdot{D}^2)$	❌	squeeze and excitation with an attention pooling (instead of a GAP)
SCRAM: Spatially Coherent Randomized Attention Maps (1)	-	$\mathcal{O}({N}\cdot\log({N})\cdot{D})$	✔️	uses PatchMatch to find close keys
Interlaced Sparse Self-Attention for Semantic Segmentation (13)	IN_PAPER	$\mathcal{O}({N}\cdot{D}^2+{N}\cdot\sqrt{N}\cdot{D})$	✔️	combination of short-range and then long-range (dilated) attention
Permutohedral Attention Module for Efficient Non-Local Neural Networks (2)	Permutohedral_attention_module	$\mathcal{O}({N}\cdot{D}^2)$	❌	uses the permutohedral lattice approximation algorithm to approximate the attention output
Large Memory Layers with Product Keys (28)	XLM	$\mathcal{O}({Q}\cdot({K}+{k}^2)\cdot{D})$	✔️	searches for nearest-neighbor keys
Expectation-Maximization Attention Networks for Semantic Segmentation (38)	EMANet	$\mathcal{O}({N}\cdot{k}\cdot{D})$	❌	applies expectation maximization to cluster keys into k clusters
Compressive Transformers for Long-Range Sequence Modelling (20)	compressive-transformer-pytorch	$\mathcal{O}({N}^2\cdot{D})$	✔️	compresses distant tokens instead of just applying stop_grad() to them; a more efficient version of Transformer-XL
BP-Transformer: Modelling Long-Range Context via Binary Partitioning (8)	BPT	$\mathcal{O}({N}\cdot{k}\cdot\log(\frac{N}{k})\cdot{D})$	✔️	attends to distant tokens coarsely and attends to close tokens in a more fine-grained manner
Axial Attention in Multidimensional Transformers (5)	axial-attention	$\mathcal{O}({N}\cdot({H}+{W})\cdot{D})$	✔️	applies attention on each axis separately
Reformer: The Efficient Transformer (69)	trax	$\mathcal{O}({N}\cdot\log({N})\cdot{D}^2)$	✔️	uses LSH to find close keys
Transformer on a Diet (2)	transformer-on-diet	$\mathcal{O}({N}\cdot{k}\cdot{D})$	✔️	dilated transformer, like WaveNet
Sparse Sinkhorn Attention (4)	sinkhorn-transformer	$\mathcal{O}(\frac{{N}^2}{n_b}+{n_b}^2)$	✔️	uses a cost matrix to limit attention between buckets
SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection (1)	-	$\mathcal{O}({N}\cdot{k}\cdot{D})$	✔️	learns the q, k connections, i.e. dynamically creates a sparse attention matrix
Efficient Content-Based Sparse Attention with Routing Transformers (11)	routing-transformer	$\mathcal{O}({N}\cdot\sqrt{N}\cdot{D})$	✔️	computes attention with same-cluster tokens (computed by online k-means)
Longformer: The Long-Document Transformer (15)	longformer	$\mathcal{O}({N}\cdot({k}+{g})\cdot{D})$	✔️	global + blocked attention
Neural Architecture Search for Lightweight Non-Local Networks (2)	AutoNL	$\mathcal{O}((\frac{H}{h}\cdot\frac{W}{w})\cdot(\frac{D}{k})^2)$	❌	computes Q(KV) and also downsamples q, k, v in both spatial and channel dimensions
ETC: Encoding Long and Structured Data in Transformers (2)	-	$\mathcal{O}(({N}\cdot{g}+{g}^2+{N}\cdot{k})\cdot{D})$	❌	combines global attention (star transformer with multiple global tokens) with local attention
Multi-scale Transformer Language Models (1)	IN_PAPER	$\mathcal{O}({N}^2\cdot{D})$	✔️	UNet-like; its retina attention is close to BP-Transformer
Synthesizer: Rethinking Self-Attention in Transformer Models (5)	-	$\mathcal{O}({N}^2\cdot{D})$	✔️	does not compute pairwise interactions
Jukebox: A Generative Model for Music (9)	jukebox	$\mathcal{O}({N}\cdot\sqrt{N}\cdot{D})$	✔️	better attention patterns from Sparse Transformer
GMAT: Global Memory Augmentation for Transformers (0)	gmat	$\mathcal{O}({m}\cdot({N}+{m})\cdot{D})$	❌	adds global tokens
Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers (0)	google-research	$\mathcal{O}({N}\cdot{D}^2\cdot\log({D}))$	✔️	calculates an unbiased stochastic approximation of the attention matrix
Hand-crafted Attention is All You Need? A Study of Attention on Self-supervised Audio Transformer (0)	-	$\mathcal{O}({N}^2\cdot{D})$	✔️	does not compute pairwise interactions and uses fixed mask patterns
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (1)	fast-transformers	$\mathcal{O}({N}\cdot{D}^2)$	✔️	uses phi(q)(phi(k)v) and also improves the sequential sampling step
Linformer: Self-Attention with Linear Complexity (3)	linformer-pytorch	$\mathcal{O}({N}\cdot{k}\cdot{D})$	❌	projects keys and values from N×D to k×D
Real-time Semantic Segmentation with Fast Attention (0)	-	$\mathcal{O}({N}\cdot{D}^2)$	❌	l2_norm(q)(l2_norm(k)v)
Fast Transformers with Clustered Attention (0)	fast-transformers	$\mathcal{O}({N}\cdot{k}\cdot{D})$	❌	groups queries together with LSH
Big Bird: Transformers for Longer Sequences (0)	-	$\mathcal{O}(({g}^2+{N}\cdot({k}+{g}+{r}))\cdot{D})$	❌	ETC with random connections
Tensor Low-Rank Reconstruction for Semantic Segmentation (N/A)	-	$\mathcal{O}(({D}\cdot{H}\cdot{W}+{D}^2+{H}^2+{W}^2)\cdot{r})$	❌	decomposes the full attention tensor into rank-one tensors (CP decomposition)
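
Several rows above with $\mathcal{O}({N}\cdot{D}^2)$ complexity share one trick: drop the softmax over the N x N score matrix, apply a feature map to queries and keys separately, and multiply keys with values first. The sketch below illustrates that reordering in plain Python. It is not taken from any of the listed implementations. The helper names (`matmul`, `transpose`, `phi`, `rand_matrix`, `quadratic_attention`, `linear_attention`) are made up for this example, and the feature map `elu(x) + 1` is an assumption, since the table only writes `phi`. It covers the non-causal, single-head case only.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def phi(m):
    """Assumed feature map elu(x) + 1, applied elementwise (always positive)."""
    return [[x + 1.0 if x > 0 else math.exp(x) for x in row] for row in m]

def rand_matrix(rows, cols):
    """Random Gaussian matrix, used only to build toy inputs."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def quadratic_attention(q, k, v):
    """Builds the full N x N score matrix first: O(N^2 * D)."""
    scores = matmul(phi(q), transpose(phi(k)))             # N x N
    normed = [[s / sum(row) for s in row] for row in scores]
    return matmul(normed, v)                               # N x D

def linear_attention(q, k, v):
    """Multiplies phi(K)^T V first, so no N x N matrix appears: O(N * D^2)."""
    fq, fk = phi(q), phi(k)
    kv = matmul(transpose(fk), v)                          # D x D summary of keys and values
    k_sum = [sum(col) for col in zip(*fk)]                 # length-D normaliser
    out = matmul(fq, kv)                                   # N x D
    norms = [sum(a * b for a, b in zip(row, k_sum)) for row in fq]
    return [[x / z for x in row] for row, z in zip(out, norms)]

# Toy inputs: N tokens with D features each.
random.seed(0)
N, D = 6, 4
q, k, v = rand_matrix(N, D), rand_matrix(N, D), rand_matrix(N, D)

full = quadratic_attention(q, k, v)
fast = linear_attention(q, k, v)

# Largest elementwise difference between the two orderings.
max_diff = max(abs(x - y) for ra, rb in zip(full, fast) for x, y in zip(ra, rb))
```

Both functions compute the same output up to floating-point error, because matrix multiplication is associative. Only the order of operations changes, and with it the cost. This equivalence holds for kernel-style attention like the one sketched here; it does not reproduce standard softmax attention.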