Comments (9)
I compared my code line by line with your code as well as @johnma2006's code, looking at all three files side by side, but found no meaningful differences so far, except for the delta inverse-softplus initialization, which only your code performs. I am a bit stuck. Please advise.
The delta initialization is important.
from mamba.
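For reference, the inverse-softplus delta initialization discussed above can be sketched in a few lines of pure Python. This is a minimal illustration of the idea, not the repository's exact code; the names `dt_min`/`dt_max` and the log-uniform sampling of the target timestep follow the description in the Mamba paper:

```python
import math
import random

def softplus(x):
    return math.log1p(math.exp(x))

def inv_softplus(y):
    # y = softplus(x)  =>  x = log(exp(y) - 1) = y + log(1 - exp(-y)),
    # written with expm1 for numerical stability at small y.
    return y + math.log(-math.expm1(-y))

# Sample a target dt log-uniformly in [dt_min, dt_max], then store its
# softplus preimage as the bias, so that softplus(bias) == dt at init.
dt_min, dt_max = 1e-3, 1e-1
dt = math.exp(random.uniform(math.log(dt_min), math.log(dt_max)))
bias = inv_softplus(dt)
assert abs(softplus(bias) - dt) < 1e-9
```

The point of the round trip is that the model applies `softplus` to the bias at every forward pass, so initializing the bias with the *inverse* softplus is what puts the effective delta in the intended range.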
Have you plugged in a standard Transformer first? It seems more likely that there's something wrong with the training pipeline than with any particular model.
It looks like you reimplemented the model from scratch, so this is beyond the scope of our ability to help. Perhaps check line by line that your implementation matches ours?
Hi, here is a suggestion to check the correctness of your implementation:
- Load an instance of your implementation and the official implementation side-by-side.
- Transfer the official instance's weights into your instance.
- Make sure the forward pass is identical. If not, drill down into each submodule to see where the diffs are coming from.
Good luck!
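The checklist above might look like this in PyTorch. This is a hedged sketch: `ref` and `mine` are toy `nn.Sequential` stand-ins rather than the actual Mamba classes, but the `load_state_dict` / `allclose` / forward-hook pattern carries over unchanged:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins: `ref` plays the role of the official model and `mine`
# the reimplementation; a real check would instantiate both Mamba classes.
ref = nn.Sequential(nn.Linear(8, 16), nn.SiLU(), nn.Linear(16, 8))
mine = nn.Sequential(nn.Linear(8, 16), nn.SiLU(), nn.Linear(16, 8))

# Transfer the official instance's weights into yours.
mine.load_state_dict(ref.state_dict())

# Check that the forward passes agree on the same input.
ref.eval()
mine.eval()
x = torch.randn(2, 8)
with torch.no_grad():
    outputs_match = torch.allclose(ref(x), mine(x), atol=1e-6)

# If they do not, register forward hooks and compare submodule by
# submodule to find where the first divergence appears.
acts = {}
def capture(name):
    def hook(module, inputs, output):
        acts[name] = output.detach()
    return hook

for name, module in ref.named_modules():
    if name:  # skip the root module itself
        module.register_forward_hook(capture(name))
with torch.no_grad():
    ref(x)  # acts now holds every intermediate activation of `ref`
```

Running the same hook pass on your model and diffing `acts` entry by entry usually pinpoints the first layer whose output diverges.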
Comment on the initialization and parameterization: they are super important, in the sense that without a suitable initialization and parameterization, learning long-term memory with SSMs can be unstable and thus difficult (https://arxiv.org/abs/2311.14495).
Thanks for the comments.
I had already incorporated the proper delta initialization into my Mamba code, but it has not helped with the training-loss convergence issue yet.
I need to think about this from other angles. 👀
@radarFudan : I noticed that StableSSM tries to constrain the growth rate of the gradient by constraining the eigenvalues. This approach seems to complement what clip_grad_norm() does. I will give StableSSM a go in my implementation and post further updates here. Thanks!!
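To illustrate why constraining eigenvalues helps independently of gradient clipping, here is a small NumPy sketch. It uses the S4/Mamba-style log reparameterization of a diagonal state matrix as a stand-in for StableSSM's more general scheme; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameterize the diagonal state matrix as A = -exp(A_log), so every
# continuous-time eigenvalue is strictly negative no matter what values
# the optimizer pushes A_log toward.
A_log = rng.normal(size=16)
A = -np.exp(A_log)          # all eigenvalues < 0
dt = 0.01
A_bar = np.exp(dt * A)      # ZOH-discretized eigenvalues, per channel

# Each discretized eigenvalue lies in (0, 1), so the recurrent state,
# and hence the gradient backpropagated through time, contracts instead
# of growing geometrically with sequence length.
assert np.all(A_bar > 0.0) and np.all(A_bar < 1.0)
```

Gradient clipping only caps the norm after an explosion has happened; a constrained parameterization like this prevents the recurrence itself from being expansive, which is why the two techniques are complementary.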
The stable SSM initializations may or may not help, we've never tried them. But I think the theory doesn't apply directly to the selective SSM setting. I don't think there should be anything particular that you need to do here, so either there's an issue in the implementation or somehow Mamba interacts with your data weirdly, which would be interesting.
- Have you checked that your mamba function returns the same outputs as ours, as @johnma2006 suggested?
- Is there any reason you can't directly call the model from this repository? Is the purpose of your model expository or for research?
I plugged in a small BERT model and that training works fine, so I am not really sure what else is missing from my Mamba architecture module.
Please advise.