awesome-fast-attention
A curated list of efficient attention modules (last update: Tue, 04 Aug 2020 05:36:27 +0000)
Efficient Attention
Paper (citations) | Implementation | Complexity | AutoRegressive | Main Idea |
---|---|---|---|---|
Generating Wikipedia by Summarizing Long Sequences (208) | memory-compressed-attention | | | compresses key and value + blocked attention |
CBAM: Convolutional Block Attention Module (677) | attention-module | | | combines the SE attention with a per-pixel (local) weight |
CCNet: Criss-Cross Attention for Semantic Segmentation (149) | CCNet | | | each pixel attends to its row and column simultaneously |
Efficient Attention: Attention with Linear Complexities (2) | efficient-attention | | | `Softmax(Q)*(Softmax(K^T)*V)` |
Star-Transformer (24) | fastNLP | | | uses a relay (global) node and attends to/from that node |
Generating Long Sequences with Sparse Transformers (139) | torch-blocksparse | | | sparse block-based attention |
GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond (96) | GCNet | | | squeeze and excitation with an attention pooling (instead of a GAP) |
SCRAM: Spatially Coherent Randomized Attention Maps (1) | - | | | uses PatchMatch to find close keys |
Interlaced Sparse Self-Attention for Semantic Segmentation (13) | IN_PAPER | | | combination of a short-length and then a long-range (dilated) attention |
Permutohedral Attention Module for Efficient Non-Local Neural Networks (2) | Permutohedral_attention_module | | ❌ | uses the permutohedral lattice approximation algorithm to approximate the attention output |
Large Memory Layers with Product Keys (28) | XLM | | ✔️ | searches for nearest-neighbor keys |
Expectation-Maximization Attention Networks for Semantic Segmentation (38) | EMANet | | | applies expectation maximization to cluster keys into k clusters |
Compressive Transformers for Long-Range Sequence Modelling (20) | compressive-transformer-pytorch | | ✔️ | compresses distant tokens instead of just `stop_grad()`-ing them; a more efficient version of Transformer-XL |
BP-Transformer: Modelling Long-Range Context via Binary Partitioning (8) | BPT | | | attends to distant tokens coarsely and to close tokens in a more fine-grained manner |
Axial Attention in Multidimensional Transformers (5) | axial-attention | | ✔️ | applies attention on each axis separately |
Reformer: The Efficient Transformer (69) | trax | | | uses LSH to find close keys |
Transformer on a Diet (2) | transformer-on-diet | | ✔️ | dilated transformer, like WaveNet |
Sparse Sinkhorn Attention (4) | sinkhorn-transformer | | | uses a cost matrix to limit attention between buckets |
SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection (1) | - | | | learns the q, k connections, i.e. dynamically creates a sparse attention matrix |
Efficient Content-Based Sparse Attention with Routing Transformers (11) | routing-transformer | | | computes attention with same-cluster tokens (computed by online k-means) |
Longformer: The Long-Document Transformer (15) | longformer | | | global + blocked attention |
Neural Architecture Search for Lightweight Non-Local Networks (2) | AutoNL | | | computes `Q(KV)` and also downsamples q, k, v in both spatial and channel dimensions |
ETC: Encoding Long and Structured Data in Transformers (2) | - | | | combines global attention (Star-Transformer with multiple global tokens) with local attention |
Multi-scale Transformer Language Models (1) | IN_PAPER | | | UNet-like + retina attention, which is close to BP-Transformer |
Synthesizer: Rethinking Self-Attention in Transformer Models (5) | - | | | does not compute pairwise interactions |
Jukebox: A Generative Model for Music (9) | jukebox | | | better attention patterns from Sparse Transformer |
GMAT: Global Memory Augmentation for Transformers (0) | gmat | | | adds global tokens |
Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers (0) | google-research | | | calculates an unbiased stochastic approximation of the attention matrix |
Hand-crafted Attention is All You Need? A Study of Attention on Self-supervised Audio Transformer (0) | - | | ✔️ | does not compute pairwise interactions and uses fixed mask patterns |
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (1) | fast-transformers | | ✔️ | uses `phi(q)(phi(k)v)` and also improves the sequential sampling step |
Linformer: Self-Attention with Linear Complexity (3) | linformer-pytorch | | | projects key and value from nd to kd |
Real-time Semantic Segmentation with Fast Attention (0) | - | | | `l2_norm(q)*(l2_norm(k)*v)` |
Fast Transformers with Clustered Attention (0) | fast-transformers | | | groups queries together with LSH |
Big Bird: Transformers for Longer Sequences (0) | - | | | ETC with random connections |
Tensor Low-Rank Reconstruction for Semantic Segmentation (N/A) | - | | ❌ | decomposes the full attention tensor into rank-one tensors (CP decomposition) |
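
Several entries in the table (Efficient Attention, Transformers are RNNs, Real-time Semantic Segmentation with Fast Attention) rely on the same reordering trick: normalize the queries and keys separately, then multiply keys with values first so that the n×n attention matrix is never formed. The snippet below is a minimal NumPy sketch of the `Softmax(Q)*(Softmax(K^T)*V)` form only. It is not taken from any of the listed implementations; the function names, the single-head unbatched shapes, and the choice of softmax axes (features for queries, positions for keys) are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def linear_attention(q, k, v):
    """Illustrative sketch. q, k: (n, d_k); v: (n, d_v). Returns (n, d_v)."""
    q_n = softmax(q, axis=1)  # each query normalized over its features
    k_n = softmax(k, axis=0)  # each key feature normalized over positions
    context = k_n.T @ v       # (d_k, d_v): cost O(n * d_k * d_v)
    return q_n @ context      # (n, d_v): the (n, n) matrix is never built


def quadratic_reference(q, k, v):
    """Same result, but materializes the (n, n) matrix for comparison."""
    q_n = softmax(q, axis=1)
    k_n = softmax(k, axis=0)
    attn = q_n @ k_n.T        # (n, n): cost O(n^2 * d_k)
    return attn @ v


rng = np.random.default_rng(0)
n, d_k, d_v = 128, 16, 32
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))
v = rng.standard_normal((n, d_v))

out_fast = linear_attention(q, k, v)
out_ref = quadratic_reference(q, k, v)
print(out_fast.shape, np.allclose(out_fast, out_ref))
```

Both functions compute the same product; only the order of the matrix multiplications differs. With these softmax axes, each row of the implicit attention matrix still sums to 1, although it is not identical to standard `softmax(QK^T)V` attention.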