mtanghu / leap

LEAP: Linear Explainable Attention in Parallel for causal language modeling with O(1) path length, and O(1) inference

License: Creative Commons Zero v1.0 Universal

Python 2.47% Jupyter Notebook 97.53%
linear-attention pytorch transformers attention-mechanism deep-learning local-attention softmax additive-attention dot-product-attention parallel

leap's People

Contributors

mtanghu

leap's Issues

Constant memory LEAP

Transformers are RNNs (link: https://arxiv.org/pdf/2006.16236.pdf) describes constant-memory gradient computation, which should be entirely realizable for this project (the math is reasonably similar in structure, at least).

Currently, memory usage does seem to scale with sequence length, though this may just be because larger inputs need more memory to store all the embeddings. To that extent, this change shouldn't help the training that is done in parallel.

This will be important for infinite context in the RNN formulation though! (see #14)
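
For reference, here is a minimal sketch (assumed names, not the repo's actual code) of the recurrent formulation from the linked paper that makes O(1)-memory inference possible; constant-memory gradients would additionally require a custom backward pass that recomputes these states instead of storing them, as the paper describes.

```python
import torch
import torch.nn.functional as F

def linear_attention_step(state, q_t, k_t, v_t):
    """One recurrent step of linear attention ("Transformers are RNNs" style).

    `state` holds a running (d_k, d_v) sum of key-value outer products and a
    (d_k,) normalizer, so memory stays constant with respect to sequence length.
    """
    q_t, k_t = F.elu(q_t) + 1, F.elu(k_t) + 1      # positive feature map
    s, z = state
    s = s + torch.outer(k_t, v_t)                  # accumulate k_i v_i^T
    z = z + k_t                                    # accumulate keys for the denominator
    out = (q_t @ s) / (q_t @ z + 1e-6)             # attention output for this timestep
    return (s, z), out

# usage: state = (torch.zeros(d_k, d_v), torch.zeros(d_k)), then iterate over timesteps
```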

Latest embeddings techniques (Infinite Context Length?)

Rotary embeddings, key-query embeddings, or maybe an adapted ALiBi embedding all seem worth implementing to improve the performance of LEAP. Ideally this would be implemented as an option that can be passed in the config, like BERT's position_embedding_type (link below); maybe even the same code could be used!

This should also enable Infinite Context Length, though constant memory gradient calculation (see #12) would be needed to really allow for it.

https://huggingface.co/docs/transformers/v4.21.1/en/model_doc/bert#transformers.BertConfig
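
As a concrete shape for that option, a minimal sketch (the LEAPConfig name and option values are hypothetical) mirroring how BertConfig exposes position_embedding_type:

```python
from transformers import PretrainedConfig

class LEAPConfig(PretrainedConfig):
    """Hypothetical config sketch exposing the embedding choice as an option."""
    model_type = "leap"

    def __init__(self, position_embedding_type="absolute", **kwargs):
        super().__init__(**kwargs)
        # e.g. "absolute", "rotary", or "alibi" -- the model's __init__ would
        # dispatch on this string when building its embedding/attention layers
        self.position_embedding_type = position_embedding_type
```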

Equations not formatted correctly

It seems that no matter how many attempts are made to fix the equation rendering, the equations still do not work on all browsers. Currently they render about right in Firefox, but not in Chrome.

Possible path forward:

  • Don't show equations on front page, put them only inside the src folders
  • Get rid of setup.py warning about equations not rendering
  • Use LaTeX to render the equations/annotations and just put a screenshot in the README

RNN formulation

The math for an RNN-based LEAP is already shown. It just needs to be integrated with the Hugging Face generate API.

Also, if relative embeddings are used (see #14) and constant memory gradient calculation is implemented (see #12), this would enable truly infinite context.
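
Until the generate API integration is done, a minimal greedy-decoding sketch of the RNN-style inference loop (the leap_step callable and its state are hypothetical stand-ins for the recurrent LEAP step):

```python
import torch

@torch.no_grad()
def greedy_generate(model, leap_step, input_ids, max_new_tokens=20):
    """Carry a fixed-size recurrent state instead of a growing KV cache (O(1) inference memory)."""
    state = None
    for t in range(input_ids.shape[1]):               # prime the state on the prompt
        logits, state = leap_step(model, input_ids[:, t], state)
    generated = input_ids
    for _ in range(max_new_tokens):
        next_token = logits.argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=1)
        logits, state = leap_step(model, next_token.squeeze(1), state)
    return generated
```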

Rescaling Attention values experiments/different implementations

It would be good to experiment with rescaling attention values, including with full attention, to see whether the attention values are what causes training instability (the current theory is that large language models can easily be pushed into local minima that are hard to break out of when the attention values get too big before being softmaxed). A possible experiment: use large dimensions and, to save memory, only a single head and a single layer of multi-head attention (with feed-forward layers after it), comically small sequence lengths (around 4), and small batch sizes, with no dropout so there is no regularization. Consider also using only a subset of wikitext-2. Show how larger dimensions make the effect worse, and measure the pre-softmax values to see if they get larger over time.

It would also be good to see whether a different implementation of rescaling for LEAP could work better, since it is still oddly unstable given how strongly scaled it should be.
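
For the "measure the pre-softmax values" part, a small helper sketch (hypothetical name; assumes the per-head query/key tensors are exposed) that could be logged over the course of training:

```python
import torch

def presoftmax_stats(q, k, scale=None):
    """Summarize pre-softmax dot-product magnitudes.

    q, k: (batch, heads, seq, dim) query and key tensors for one layer.
    """
    scale = scale if scale is not None else q.shape[-1] ** -0.5
    logits = torch.matmul(q, k.transpose(-2, -1)) * scale
    return {"mean_abs": logits.abs().mean().item(),
            "max_abs": logits.abs().max().item()}
```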

Current querying doesn't make much sense

The current querying system of LEAP seems quite inflexible. Take, for example, the Winograd task of knowing what a pronoun refers to. For the pronoun token information to make it to the prediction token, the query vector for that prediction token would need to match the focus vector, thus requiring the prediction to already "know" the pronoun a priori.

Using a sigmoid gate would make sense (to replace the querying), though I'm concerned about explainability, and whether the sigmoid will saturate at large scales.

A different tempting option would be to create focus-weighted keys and focus-weighted values which would then be queried. This would require a total of 5 linear projections though...

I think one of the Fs can be shared though, i.e. $\text{LEAP} = Q \cdot \text{w-Focus}(F, K, K) * \text{w-Focus}(F, K, V)$ -- is this too complicated though?
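
A heavily simplified sketch of one reading of the shared-F idea (an assumption about what w-Focus would look like, with the focus interaction collapsed to a per-token scalar score; not the repo's implementation):

```python
import torch

def w_focus(f_scores, x):
    """Causal focus-weighted cumulative average of x.

    f_scores: (batch, seq, 1) per-token focus scores (the shared F),
    x: (batch, seq, dim) keys or values to be focus-weighted.
    """
    w = torch.exp(f_scores)   # the real code would need a numerically stable cumulative softmax
    num = torch.cumsum(w * x, dim=1)
    den = torch.cumsum(w, dim=1) + 1e-6
    return num / den

def leap_sketch(q, k, v, f_scores):
    """One reading of LEAP = Q . w-Focus(F, K, K) * w-Focus(F, K, V) with a shared F."""
    focused_k = w_focus(f_scores, k)
    focused_v = w_focus(f_scores, v)
    gate = (q * focused_k).sum(dim=-1, keepdim=True)   # query match against focused keys
    return gate * focused_v
```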

Move development section to issues or projects

It is suboptimal to have to update the README often to reflect development changes. It would be better to move these to issues or projects, and instead create a "possibilities" section which still notes the different possibilities for development.

It would also help to have contribution guidelines.

Improving Documentation

  • docstrings for all classes
  • type annotations for variables/parameters of functions (see the sketch after this list)
  • specific documentation for specific uses, like just using LEAP, LEAP text generation, or the specific outputs of the LEAP transformer
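
A minimal example of the docstring and typing style these bullets ask for (a hypothetical helper, not taken from the repo):

```python
from typing import Optional

import torch
from torch import Tensor

def causal_mask(seq_len: int, device: Optional[torch.device] = None) -> Tensor:
    """Build a causal attention mask.

    Args:
        seq_len: Length of the sequence to mask.
        device: Optional device for the returned tensor.

    Returns:
        A (seq_len, seq_len) boolean tensor that is True where attention from
        position i to a future position j > i should be blocked.
    """
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=device), diagonal=1)
```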

Finish project renovation with new QKV formulation

Still working on the following changes to finish:

  • Redo fastformer testing
  • Note in the original README how the code is annotated and should be easy to understand
  • Try replacing gating with head-wise cosine similarity? -- I think this may have better backward gradient flow and increase explainability (over simple gating). I suppose then you would have 2 "key" vectors -- think about renaming these so the math makes sense
  • Make a dedicated LEAP attention block that people can use, and document this in the new README when you have time (I suppose for now it will be decoder only, but you can easily make it bidirectional by reversing the token order and running it through again)
  • Don't have separate linear layers to generate qkv or qffv; just have one where the output is chunked, and write a comment about it (see the sketch after this list)
  • Make a short section on STRONG scaling for both READMEs
  • Write a new README with the new math and the benefit/development sections from the old README, also mentioning the legacy fastformerLM that is still in the repo. Make sure to note that for now the focus is on masked attention for decoders.
  • Put in a section about development/contributing, making sure to tell people about pip install -e .
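
For the chunked projection item, a minimal sketch (hypothetical module name) of one linear layer whose output is split into q, k, v:

```python
import torch
from torch import nn

class ChunkedQKV(nn.Module):
    """One projection instead of three separate nn.Linear layers."""

    def __init__(self, d_model: int):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)

    def forward(self, x: torch.Tensor):
        # single matmul, then split the last dimension into the three pieces
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        return q, k, v
```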

Running experiments/measurement on the Rescaled Dot-Product

The README currently claims that it keeps attention sparse and improves training stability (there is already a config option to turn it on and off and to play with the rescaling constant). It would be good to show data for this, as well as to try implementing it with normal transformers (say GPT-2), where I think it should actually mildly improve scaling performance since the dot products won't get pushed into low-gradient areas of the softmax.

I think these experiments would go well in a folder named "Experiments" in the repo, or something along those lines.
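
For the GPT-2 ablation, one plausible form of the rescaling (an assumption about the mechanism, not necessarily the exact formula in the README):

```python
import torch

def rescaled_dot_product(q, k, rescale=4.0, eps=1e-6):
    """Normalize q and k so the pre-softmax logits are bounded by `rescale`,
    keeping them out of the saturated, low-gradient region of the softmax."""
    q = q / (q.norm(dim=-1, keepdim=True) + eps)
    k = k / (k.norm(dim=-1, keepdim=True) + eps)
    return rescale * torch.matmul(q, k.transpose(-2, -1))  # logits in [-rescale, rescale]
```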

Explainability support

Some original ideas will likely be needed, but in general this should be achievable simply by measuring the dot-product similarities of F1 and F2 (and also softmaxing them) to see which tokens the model is paying attention to, then factoring in the querying aspect.

A good entry point would be to reproduce a variation of the pairwise attention diagram found in the appendix of Attention Is All You Need.
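
A minimal sketch of that entry point (assumes access to the F1/F2 projections for a single head and a single sequence):

```python
import torch
import matplotlib.pyplot as plt

def plot_focus_attention(f1, f2, tokens):
    """Heatmap of softmaxed pairwise F1/F2 similarities.

    f1, f2: (seq_len, dim) tensors; tokens: list of token strings.
    """
    scores = torch.softmax(f1 @ f2.T, dim=-1).detach().cpu()
    plt.imshow(scores, cmap="viridis")
    plt.xticks(range(len(tokens)), tokens, rotation=90)
    plt.yticks(range(len(tokens)), tokens)
    plt.colorbar(label="attention weight")
    plt.show()
```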

Encoder formulation of LEAP

In general the decoder formulation (which is already done) should be harder than the encoder formulation. One can always just perform LEAP in both directions (like a bidirectional RNN), though I'm curious if there is a more efficient implementation.
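
A sketch of the "run it in both directions" idea (the leap_decoder callable is a hypothetical stand-in for a causal LEAP block):

```python
import torch

def bidirectional_leap(leap_decoder, x):
    """Apply a causal LEAP block left-to-right and right-to-left, then combine."""
    forward_out = leap_decoder(x)
    backward_out = leap_decoder(torch.flip(x, dims=[1]))  # reverse along the sequence dimension
    backward_out = torch.flip(backward_out, dims=[1])     # re-align to the original order
    return forward_out + backward_out                     # or concatenate, like a bidirectional RNN
```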

Theory to show LEAP is Turing Complete?

An attention mechanism is considered Turing Complete if it can model any kind of sequence computation. Global LEAP with $O(1)$ path length and multiple heads should be enough to easily prove that LEAP can replicate any computation within Turing Completeness simply by performing the same steps as a Turing Machine (using similar ideas and assumptions as Pérez, J., Marinković, J., & Barceló, P. (2019), like arbitrary precision, infinite recursive steps, and hard attention). Then the local/windowed attention will just allow for more parallel computation if only local computations are needed.

If this can be shown, it may be of less importance to perform one-to-one comparisons with GPT2, as there would be theory to back up the expressiveness of the architecture/attention mechanism.

Can't seem to get all inline equations to render in README

Some inline equations render and some don't. I've found that two inline equations in the same sentence do NOT render. I've also found that delimiting with bullet points can occasionally help. Still can't seem to crack it, though.
