
Comments (60)

MarcusLoppe avatar MarcusLoppe commented on May 28, 2024 2

@MarcusLoppe wait, those chairs are generated from the transformer? if you increase the temperature, can you get a chair that is dissimilar to the ones found in the dataset?

Okay, here is the result of stepping the temperature from 0.1 to 1.0 in increments of 0.1.
Since it was trained on only one model it doesn't know any other shapes, so the output just gets messed up :)

https://file.io/o1JDGVd6X2mc

[screenshot: temperature sweep results]
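
A minimal sketch of that kind of temperature sweep, assuming a trained MeshTransformer whose generate method accepts a temperature keyword (the exact signature may vary between versions of meshgpt-pytorch):

import torch

# sample one mesh per temperature, 0.1 through 1.0, for side-by-side comparison;
# `transformer` is assumed to be an already-trained MeshTransformer
samples = {}
for step in range(1, 11):
    temperature = round(0.1 * step, 1)
    with torch.no_grad():
        samples[temperature] = transformer.generate(temperature = temperature)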


lucidrains avatar lucidrains commented on May 28, 2024 2

@MarcusLoppe you made my day


lucidrains avatar lucidrains commented on May 28, 2024 1

@fire yeah, don't worry about it, just show the autoencoder works as in the paper without caveat, and I can figure out the attention portion. I know all there is to know about it


MarcusLoppe avatar MarcusLoppe commented on May 28, 2024 1

@MarcusLoppe wait, those chairs are generated from the transformer?

Yes, below are the training loss curves. I trained for 10 epochs (a 200-example dataset), but it seems like 3-5 epochs would work as well. I used a 1e-3 learning rate for the encoder and 1e-2 for the transformer, since it's training on only one shape.

Encoder: [screenshot of loss curve]

Transformer: [screenshot of loss curve]
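
A minimal sketch of that two-stage recipe, assuming the autoencoder and transformer forward passes each return a scalar loss (as the trainers in this repo did at the time, though exact keyword arguments may differ) and that dataloader yields batched (vertices, faces) pairs of the single chair:

from torch.optim import Adam

# `dataloader`, `autoencoder` and `transformer` are assumed to be constructed elsewhere
def fit(model, lr, epochs, step_fn):
    # one optimizer per stage: 1e-3 for the autoencoder, 1e-2 for the transformer
    opt = Adam(model.parameters(), lr = lr)
    for _ in range(epochs):
        for vertices, faces in dataloader:
            loss = step_fn(vertices, faces)
            loss.backward()
            opt.step()
            opt.zero_grad()

# stage 1: the autoencoder learns the face tokens; stage 2: the transformer learns to predict them
fit(autoencoder, 1e-3, 10, lambda v, f: autoencoder(vertices = v, faces = f))
fit(transformer, 1e-2, 10, lambda v, f: transformer(vertices = v, faces = f))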


MarcusLoppe avatar MarcusLoppe commented on May 28, 2024 1

@MarcusLoppe did you leave this flag on btw? you may have inadvertently proved out a twist to a latest quantization research if so

I left everything at the defaults, but I created my own trainer, so I didn't use the warm-up.
Not sure; I've tested it on versions 0.1.1 & 0.1.12.


MarcusLoppe avatar MarcusLoppe commented on May 28, 2024 1

@MarcusLoppe ah amazing. you proved out residual LFQ without knowing it. i can probably skip out on the stochastic sampling temperature annealing logic. that complexity is not needed anymore

thank you thank you

Turning off use_residual_lfq makes it pretty bad.
If you want to test it, try out my notebook. As a warning, it's very ugly and written in a debug style, just something I smashed together quickly.
Use the generate_spheres function if you'd like to test something fast, since it's only 80 faces.
https://file.io/y1mpUmYSJctm

use_residual_lfq = False: [screenshot]


MarcusLoppe avatar MarcusLoppe commented on May 28, 2024 1

@MarcusLoppe you may not know it, but collect a few more nice lines and you got yourself a short arxiv paper. you can ask chatgpt to fill in the boring expose

there is no published work on residual LFQ

I'll give you the honour since I have no idea what it is :)

Just to be sure, since I used other versions to train the encoder in the previous screenshot, I tried it again with the latest version with use_residual_lfq = True, and it showed significant improvements as expected. :)

use_residual_lfq = True: [screenshot]
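
For anyone reproducing this: the flag is a MeshAutoencoder constructor argument in the versions discussed here. A sketch based on the settings used elsewhere in this thread (assuming the argument names haven't changed in newer releases):

from meshgpt_pytorch import MeshAutoencoder

autoencoder = MeshAutoencoder(
    dim = 512,
    encoder_depth = 6,
    decoder_depth = 6,
    num_discrete_coors = 128,
    use_residual_lfq = True   # False falls back to the non-LFQ residual quantizer
)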


MarcusLoppe avatar MarcusLoppe commented on May 28, 2024 1

@MarcusLoppe ok, it is done! 6x shorter sequence length. i haven't fixed the kv cache yet, so inference will be slow, but let me know if you can still overfit to your repeated chair dataset (and you can try some larger meshes too)

Very nice, the RAM usage has dropped quite a lot; I can even train the transformer with a batch size of 16, which only reaches a total of 14GB.

The memory usage didn't move when I trained with batch size 1; it moved up from 2699MB to 4291MB when training with batch size 4 (240 faces).
So a very successful implementation! :)

I managed to generate the chair again successfully, but I haven't had any success with the text conditioning yet. It seems to slow down training quite a lot. Maybe implement caching for the text conditioner's outputs?

I have to say that it takes a lot of VRAM to run the trainer/transformer; it's very compute intensive.

Yup, reduce the batch size if you are running out of memory. But it's much better today than yesterday.
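
Going back to the caching suggestion above: a rough sketch of memoizing the conditioner outputs, where embed_texts is a stand-in name for whatever call actually produces the text embeddings (the real conditioner API may differ):

# cache keyed on the raw caption strings, so repeated captions are only encoded once
text_embed_cache = {}

def cached_text_embeds(conditioner, texts):
    missing = [t for t in texts if t not in text_embed_cache]
    if missing:
        embeds = conditioner.embed_texts(missing)   # assumed API name
        for text, embed in zip(missing, embeds):
            text_embed_cache[text] = embed.detach().cpu()
    return [text_embed_cache[t] for t in texts]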


MarcusLoppe avatar MarcusLoppe commented on May 28, 2024 1

@MarcusLoppe ok, it is done! 6x shorter sequence length. i haven't fixed the kv cache yet, so inference will be slow, but let me know if you can still overfit to your repeated chair dataset (and you can try some larger meshes too)

If I try to train using a 4k-triangle model, the encoder runs out of memory pretty fast:
OutOfMemoryError: CUDA out of memory. Tried to allocate 61.65 GiB......

If I train using a model with 466 vertices and 852 faces:
The encoder uses 4.2 GB @ batch size 1.
The transformer uses 5 GB @ batch size 1.

This means the new efficiency for the transformer is: 5112 tokens / 5 GB ≈ 1022 tokens ≈ 170 triangles per 1 GB.
That's roughly a 4x increase in efficiency per 1 GB of VRAM.
Not quite the 6x, but I'm guessing there is some other factor that increases the memory requirement.
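
Spelling out the arithmetic above (a trivial sketch; 6 tokens per face, and ~252 tokens/GB is the pre-change efficiency quoted elsewhere in this thread):

faces = 852
tokens = faces * 6                       # 5112 tokens for the whole mesh
tokens_per_gb = tokens / 5               # transformer peaked at 5 GB -> ~1022 tokens/GB
triangles_per_gb = tokens_per_gb / 6     # ~170 triangles/GB

old_tokens_per_gb = 252                  # pre-change figure quoted in this thread
print(tokens_per_gb / old_tokens_per_gb) # ~4.06, i.e. roughly a 4x improvement, not 6x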


MarcusLoppe avatar MarcusLoppe commented on May 28, 2024 1

@MarcusLoppe oh strange, it runs for me

nvm some indexing error, gimme 10

Success :)

Now it's at 1.7 GB @ batch size 1.
Should I set linear_attention to True?


lucidrains avatar lucidrains commented on May 28, 2024 1

yeah, this is my first time using a graph convolution, so forgive the bugs 🙏


MarcusLoppe avatar MarcusLoppe commented on May 28, 2024 1

yeah, this is my first time using a graph convolution, so forgive the bugs 🙏

Unforgivable 👎
😄

The inference time is quite bad, it hovers around 10 iter/s, but that isn't a priority at the moment.

[screenshot]


lucidrains avatar lucidrains commented on May 28, 2024 1

yup will fix by weekend, it is tricky with hierarchical transformers


MarcusLoppe avatar MarcusLoppe commented on May 28, 2024 1

yes!

Great work with the optimizations, it's very nice to be able to have a batch size of 1 :D

Do you think adding some dropout in the resnet will improve the performance?

Setting linear_attention to True worsens the performance, both speed and accuracy.
Setting it to True gets it stuck at a loss of 0.3, while turning it off lets the loss go below 0.2.

linear_attention = False: [screenshot]

linear_attention = True: [screenshot]


MarcusLoppe avatar MarcusLoppe commented on May 28, 2024 1

thanks for sharing the linear attention results

I thought about it and decided to update where they are placed. If that doesn't work, we can just go with local attention

Some more tests; each epoch is 2000 steps/examples.
It seems like when I increased the dataset size, the old linear_attention placement almost caught up.
Weirdly enough, linear_attn_depth = 0 has the best results.

[screenshot: loss comparison]


lucidrains avatar lucidrains commented on May 28, 2024

@MarcusLoppe once people start seeing some shapes being generated by the attention network, i can start applying my expertise here. mainly i will bring in RQ transformer (which will incidentally allow one to increase D. i think they kept it at 2 because of issues you see). can also bring in reversible networks


lucidrains avatar lucidrains commented on May 28, 2024

@MarcusLoppe are you using flash attention?


lucidrains avatar lucidrains commented on May 28, 2024

@MarcusLoppe if you show me that the attention network is able to generate a novel shape, then i'll start on this issue. as they say in software, "make it work, make it right, make it fast, in that order"


fire avatar fire commented on May 28, 2024

Another idea is some sort of multiply token, where you can use the last triangle to affect the future triangle instead of just continuing.

This is like emoji encoding


MarcusLoppe avatar MarcusLoppe commented on May 28, 2024

@MarcusLoppe once people start seeing some shapes being generated by the attention network, i can start applying my expertise here. mainly i will bring in RQ transformer (which will incidentally allow one to increase D. i think they kept it at 2 because of issues you see). can also bring in reversible networks

Alright, sounds great :) Do you think that will have such a huge effect? It needs at least a 10x increase if you want to train on 4000-face models, more if you want to train with a higher batch size. (4000 faces * 6 = 24,000 tokens / 10 GB = 2,400 tokens/GB; current efficiency: 252 tokens/GB)

@MarcusLoppe are you using flash attention?

Yup.

@MarcusLoppe if you show me that the attention network is able to generate a novel shape, then i'll start on this issue. as they say in software, "make it work, make it right, make it fast, in that order"

Well I did, but without the text conditioner. I trained the encoder & transformer on a 240-face chair using 1000 steps (200 examples x 5 epochs) and was able to generate a visually identical chair.
Here are the generated mesh & the ground truth:
https://file.io/VCuXAwfJ4zDc


MarcusLoppe avatar MarcusLoppe commented on May 28, 2024

Here is the comparison; the left side is the generated mesh and the right side is the ground truth.
The only difference I see is that the corners of the generated one aren't rounded.

[screenshot]

[screenshot]


lucidrains avatar lucidrains commented on May 28, 2024

@MarcusLoppe wait, those chairs are generated from the transformer? if you increase the temperature, can you get a chair that is dissimilar to the ones found in the dataset?


 avatar commented on May 28, 2024

@MarcusLoppe No way, that is an amazing result. How much training time did it take and on which dataset?


lucidrains avatar lucidrains commented on May 28, 2024

amazing. i'll start work on the reversible network + other efficient transformer tricks tomorrow


MarcusLoppe avatar MarcusLoppe commented on May 28, 2024

amazing. i'll start work on the reversible network + other efficient transformer tricks tomorrow

Yep, it seems pretty nice, but don't get your hopes up since it's only the same chair repeated 200 times in the dataset; at least it's somewhat of a proof of concept, though.
I'll try modifying the temperature.

@MarcusLoppe just curious, but are you an academic, independent researcher, startup founder? you got this working quite quickly!

Unemployed :D But it wasn't too hard to test it out.

Btw I got the error below with the latest version when importing the MeshAutoencoder.

     21 class DatasetFromTransforms(Dataset):
     22     @beartype
     23     def __init__(
     24         self,
     25         folder: str,
---> 26         transforms: Dict[str, Callable[Path, Tuple[Vertices, Faces]]]

TypeError: Expected a list of types, an ellipsis, ParamSpec, or Concatenate. Got <class 'pathlib.Path'>
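
For reference, that is the standard typing/beartype complaint when Callable is given a bare type instead of a list of parameter types; the likely fix (a guess at what the upstream patch would look like, not a quote of it) is to wrap the argument types in a list:

from pathlib import Path
from typing import Callable, Dict, Tuple

# typing requires Callable[[arg_types...], return_type];
# Callable[Path, ...] raises the TypeError above, Callable[[Path], ...] does not.
# 'Vertices' and 'Faces' stand in (as forward references) for the repo's tensor type aliases.
Transforms = Dict[str, Callable[[Path], Tuple['Vertices', 'Faces']]]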


lucidrains avatar lucidrains commented on May 28, 2024

@MarcusLoppe overfitting to a small dataset is always the first step in "make it work" for deep learning. so this is good news

you mean funemployed 😄 you are in the right place if you are trying to break into ML

oops, let me fix


fire avatar fire commented on May 28, 2024

mmm. So does translating, rotating and scaling affect the results? The paper mentions they use that to get more data.


lucidrains avatar lucidrains commented on May 28, 2024

@MarcusLoppe ok, let me know if that type error was fixed


lucidrains avatar lucidrains commented on May 28, 2024

@fire yea, it is just standard data augmentations. you do this for any modality you train with


lucidrains avatar lucidrains commented on May 28, 2024

@MarcusLoppe did you leave this flag on btw? you may have inadvertently proved out a twist to a latest quantization research if so


lucidrains avatar lucidrains commented on May 28, 2024

@MarcusLoppe ah amazing. you proved out residual LFQ without knowing it. i can probably skip out on the stochastic sampling temperature annealing logic. that complexity is not needed anymore

thank you thank you


lucidrains avatar lucidrains commented on May 28, 2024

ok, i'll step on the gas pedal a bit starting tomorrow


lucidrains avatar lucidrains commented on May 28, 2024

894fod


fire avatar fire commented on May 28, 2024

Would be interested to help write a paper if there are any novel results we discover here (https://github.com/lucidrains/meshgpt-pytorch).


MarcusLoppe avatar MarcusLoppe commented on May 28, 2024

ok, i'll step on the gas pedal a bit starting tomorrow

Do you think the ideas you have in mind will make it efficient enough to train on 3D models with 4k triangles?
Most 3D models have at least 4k-12k triangles; I've been struggling to find low-poly models.

If I set the batch size to 1 and train on a 4k-triangle model, it will require 94GB of VRAM.
If I want to train using only 10 GB of VRAM, the memory requirements need to be lowered by about 10x for a 4k-triangle model.


fire avatar fire commented on May 28, 2024

I think using bitpacking techniques is possible since the positional coordinates have 1/128 quantization and 12k triangles is [less than] 2^16 vertices (65536).

I'll try out some math estimates.
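
A back-of-the-envelope version of that estimate (this only bounds the raw packed geometry; as the replies below point out, the attention memory over the token sequence is the actual bottleneck):

# raw storage estimate for a 12k-triangle mesh under the quantization above:
# each coordinate fits in 7 bits (1/128 grid), each vertex index in 16 bits (< 2^16)
triangles = 12_000
vertices = triangles * 3                      # pessimistic upper bound, no vertex sharing
coord_bits = 7 * 3 * vertices                 # 3 coordinates per vertex at 7 bits each
index_bits = 16 * 3 * triangles               # 3 vertex indices per triangle
total_kib = (coord_bits + index_bits) / 8 / 1024
print(f"~{total_kib:.0f} KiB of packed geometry")   # ~163 KiB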


lucidrains avatar lucidrains commented on May 28, 2024

@MarcusLoppe Yeah I can make it work. I'm an expert in this arena

however, you should make sure flash attention is working properly as a first step. what type of GPU do you have?


MarcusLoppe avatar MarcusLoppe commented on May 28, 2024

I think using bitpacking techniques is possible since the positional coordinate has 1/128 quantization and 12k triangles is [less than] 2^16 vertices (65536).

I'll try out some math estimates.

I might be misunderstanding, but the issue lies in the token sequence, since it uses 6 tokens per face; the positional coordinates don't take much space.

[screenshot]

@MarcusLoppe Yeah I can make it work. I'm an expert in this arena

however, you should make sure flash attention is working properly as a first step. what type of GPU do you have?

I'm using Kaggle's free P100 GPU (16GB).
I turned flash attention off and the loss got quite a bit worse, so it does do something.

[screenshot]
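
One way to sanity-check whether the fused flash kernel is actually available on a given GPU, as a sketch using the PyTorch 2 SDPA backend toggles (this API has since been deprecated in favour of torch.nn.attention; note the P100 is a Pascal card, so the flash kernel generally isn't available there and PyTorch falls back to another backend):

import torch
import torch.nn.functional as F

# requires a CUDA device; flash attention only runs in half precision
q = k = v = torch.randn(1, 8, 128, 64, device = "cuda", dtype = torch.float16)

# force the flash backend only; if the GPU/kernel doesn't support it,
# this raises instead of silently falling back to math/mem-efficient attention
try:
    with torch.backends.cuda.sdp_kernel(enable_flash = True, enable_math = False, enable_mem_efficient = False):
        F.scaled_dot_product_attention(q, k, v)
    print("flash attention kernel available")
except RuntimeError as e:
    print("flash attention not available:", e)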


fire avatar fire commented on May 28, 2024

Do you know if we can add an extension token, i.e. a special kind of token that can be used to extend the functionality of a base token? For example, encoding the most common face sequences as one token instead of 6n tokens.

Wikipedia:

A dictionary coder, also sometimes known as a substitution coder, is a class of lossless data compression algorithms which operate by searching for matches between the text to be compressed and a set of strings contained in a data structure (called the 'dictionary') maintained by the encoder. When the encoder finds such a match, it substitutes a reference to the string's position in the data structure.

Grammar-based codes or Grammar-based compression are compression algorithms based on the idea of constructing a context-free grammar (CFG) for the string to be compressed.
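
To make the analogy concrete, a toy sketch of dictionary substitution over a face-token sequence; the dictionary entries and token values here are made up, and this is orthogonal to how meshgpt-pytorch actually tokenizes meshes:

# toy dictionary coder: replace frequent 6-token face patterns with single
# "extension" symbols drawn from a reserved id range
DICT_BASE = 10_000   # made-up reserved range for dictionary symbols

def compress(tokens, dictionary):
    out, i = [], 0
    while i < len(tokens):
        chunk = tuple(tokens[i:i + 6])
        if chunk in dictionary:
            out.append(DICT_BASE + dictionary[chunk])   # one symbol instead of six
            i += 6
        else:
            out.append(tokens[i])
            i += 1
    return out

# usage: the dictionary maps the most common 6-token faces to small ids
dictionary = {(1, 4, 4, 2, 9, 9): 0}
print(compress([1, 4, 4, 2, 9, 9, 7, 3], dictionary))   # -> [10000, 7, 3]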


lucidrains avatar lucidrains commented on May 28, 2024

@fire don't worry about it

more fruitful would be if you focused on the data portion, ie functions for converting all formats into the tensors needed for training, or augmentation


fire avatar fire commented on May 28, 2024

I now support ".glb", ".gltf", ".ply", ".obj", ".stl" in the MeshDataset of the Github pull request #6.


fire avatar fire commented on May 28, 2024

I wasn't able to get all shapes to match numerically and in sorted order. I don't think trying to solve the many-importers problem will improve meshgpt, but for data augmentation one approach is to do uniform scaling, rotation of a chair where the four legs stay on the floor, and translation that keeps it on the floor. Like axis-locked augmentation; see the sketch below.
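
A minimal sketch of that kind of axis-locked augmentation, assuming vertices is an (N, 3) tensor with y up and the object resting on the floor at y = 0; the scale and translation ranges are arbitrary choices:

import math
import torch

def augment_on_floor(vertices, scale_range = (0.75, 1.25), max_shift = 0.5):
    # uniform scale keeps proportions; with the mesh resting at y = 0,
    # scaling about the origin keeps the feet on the floor
    scale = float(torch.empty(1).uniform_(*scale_range))
    v = vertices * scale

    # yaw rotation around the vertical (y) axis only, so nothing tips over
    theta = float(torch.rand(1)) * 2 * math.pi
    c, s = math.cos(theta), math.sin(theta)
    rot_y = torch.tensor([[c, 0.0, s],
                          [0.0, 1.0, 0.0],
                          [-s, 0.0, c]], dtype = vertices.dtype)
    v = v @ rot_y.T

    # translate in the ground plane (x, z) only
    shift = torch.zeros(3, dtype = vertices.dtype)
    shift[0] = float(torch.empty(1).uniform_(-max_shift, max_shift))
    shift[2] = float(torch.empty(1).uniform_(-max_shift, max_shift))
    return v + shift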


lucidrains avatar lucidrains commented on May 28, 2024

ok, the RQ transformer design has crystallized during my sleep. i think i can build it this morning, and bring the sequence length down by 6x (for starters). the hardest part of the whole thing is maintaining two kv caches for the hierarchical transformers


 avatar commented on May 28, 2024

@MarcusLoppe Can you help me out on this, as to what might be the issue that I am getting?
I set the transformer temperature to 0.1. I use a window mesh with sizes:
torch.Size([286, 3])
torch.Size([200, 3])
[screenshot]

My Encoder Loss: [screenshot]

Transformer Loss: [screenshot]


lucidrains avatar lucidrains commented on May 28, 2024

@MarcusLoppe ok, it is done! 6x shorter sequence length. i haven't fixed the kv cache yet, so inference will be slow, but let me know if you can still overfit to your repeated chair dataset (and you can try some larger meshes too)


MarcusLoppe avatar MarcusLoppe commented on May 28, 2024

@MarcusLoppe ok, it is done! 6x shorter sequence length. i haven't fixed the kv cache yet, so inference will be slow, but let me know if you can still overfit to your repeated chair dataset (and you can try some larger meshes too)

Awesome, I'll do some tests. But is it possible to also optimize the encoder? It takes 5GB for a 240-face sequence; that's less than the transformer, but still a limitation if you want to go for longer sequences.


lucidrains avatar lucidrains commented on May 28, 2024

@MarcusLoppe the autoencoder doesn't have full attention, so memory should scale linearly. I can bring in some tricks there later if needed


MarcusLoppe avatar MarcusLoppe commented on May 28, 2024

@MarcusLoppe Can you help me out on this, as to what might be the issue that I am getting? I set the transformer temperature to 0.1. I use a window mesh with sizes: torch.Size([286, 3]) torch.Size([200, 3]) [screenshot]

My Encoder Loss: [screenshot]

Transformer Loss: [screenshot]

I've noticed it's very sensitive. I had to set the encoder learning rate to 1e-3 and the transformer's to 1e-2, and train exactly 10 epochs with 200 examples per epoch.

Sometimes when I trained the transformer for only 7 epochs it messed up the generation, since the loss was still too high, e.g. 0.04 vs 0.003.

I hope this will resolve itself when it trains on a larger dataset, since it can generalize better.


 avatar commented on May 28, 2024

I have to say that it takes a lot of VRAM to run the trainer/transformer; it's very compute intensive.


lucidrains avatar lucidrains commented on May 28, 2024

success! out with doggo but will be back later


lucidrains avatar lucidrains commented on May 28, 2024

@MarcusLoppe interesting re: autoencoder

want to try the latest version? turned off the linear attention


lucidrains avatar lucidrains commented on May 28, 2024

@MarcusLoppe autoencoder should be even more efficient in the latest version


MarcusLoppe avatar MarcusLoppe commented on May 28, 2024

@MarcusLoppe autoencoder should be even more efficient in the latest version

Did you test it? It's stuck at the start of training; it won't train with either my train function or the forward function.


lucidrains avatar lucidrains commented on May 28, 2024

@MarcusLoppe oh strange, it runs for me

nvm some indexing error, gimme 10


lucidrains avatar lucidrains commented on May 28, 2024

@MarcusLoppe

runs for me now with the script below - also found a bug where it was auto-deriving more face edges than there are, which may explain why the memory was a bit high!

import torch

from meshgpt_pytorch import (
    MeshAutoencoder,
    MeshTransformer,
    MeshAutoencoderTrainer,
    MeshTransformerTrainer,
    DatasetFromTransforms
)

from meshgpt_pytorch.data import (
    derive_face_edges_from_faces
)

# autoencoder

autoencoder = MeshAutoencoder(
    dim = 512,
    encoder_depth = 6,
    decoder_depth = 6,
    num_discrete_coors = 128,
    linear_attention = True
)

# mock dataset

from torch.utils.data import Dataset

class MockDataset(Dataset):
    def __init__(self):
        pass

    def __len__(self):
        return 100

    def __getitem__(self, idx):
        from random import randrange
        return torch.randn(randrange(10, 20), 3), torch.randint(0, 10, (randrange(4, 8), 3))

trainer = MeshAutoencoderTrainer(
    autoencoder,
    dataset = MockDataset(),
    batch_size = 2,
    grad_accum_every = 2,
    num_train_steps = 10,
    checkpoint_every = 5,
    accelerator_kwargs = dict(
        cpu = True
    )
)

trainer()


lucidrains avatar lucidrains commented on May 28, 2024

yes!


lucidrains avatar lucidrains commented on May 28, 2024

thanks for sharing the linear attention results

I thought about it and decided to update where they are placed. If that doesn't work, we can just go with local attention


lucidrains avatar lucidrains commented on May 28, 2024

@MarcusLoppe thank you

decided to switch it to local attention, the conservative choice


lucidrains avatar lucidrains commented on May 28, 2024

Do you think adding some dropout in the resnet will improve the performance?

yup, it could help for the autoencoder in general, should be available and customizable!
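
For context, dropout in a ResNet-style block usually just means a Dropout layer inside the residual branch; a generic sketch, not the actual block used in this repo:

import torch.nn as nn

class ResBlockWithDropout(nn.Module):
    # generic 1D residual block; dropout sits inside the residual branch
    def __init__(self, dim, dropout = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, dim, 3, padding = 1),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Conv1d(dim, dim, 3, padding = 1)
        )

    def forward(self, x):
        return x + self.net(x)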


lucidrains avatar lucidrains commented on May 28, 2024

main issue should be resolved

