Comments (3)
I think the only dataset I've ever run for that many epochs is the toy dataset, but I will look into this. Is this a very small dataset, or are you using large regularization settings? If you aren't using `--early_stopping`, you may want to try that to prevent long training times.
I'm not 100% sure whether that is an issue with GPU memory itself or maybe with the logging of GPU memory stats to wandb...
from spacetimeformer.
Python is refcounted (see https://discuss.pytorch.org/t/cuda-out-of-memory-on-the-8th-epoch/67288). This is not an issue with GPU memory: in that case the Python process would crash entirely with an OOM error, which is often written to /var/log/syslog. In other words, you wouldn't get the `returned non-zero exit status 255` error message.
This is triggered by the `log_gpu_memory` flag passed to the PyTorch Lightning trainer in train.py. That flag is deprecated; if you still want to log the metrics, try switching to `DeviceStatsMonitor`. See https://pytorch-lightning.readthedocs.io/en/stable/extensions/generated/pytorch_lightning.callbacks.DeviceStatsMonitor.html?highlight=DeviceStatsMonitor#pytorch_lightning.callbacks.DeviceStatsMonitor.
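For reference, a minimal sketch of what that swap might look like. The helper name `build_trainer` is hypothetical and is not taken from train.py; it only assumes a PyTorch Lightning version that ships `DeviceStatsMonitor` in `pytorch_lightning.callbacks`.

```python
def build_trainer(**trainer_kwargs):
    """Build a Trainer that logs device stats through a callback.

    Hypothetical helper for illustration; not part of spacetimeformer.
    """
    # Imported lazily so this module can be loaded without Lightning installed.
    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import DeviceStatsMonitor

    # Keep any callbacks the caller already passed, then add the monitor.
    callbacks = list(trainer_kwargs.pop("callbacks", None) or [])
    callbacks.append(DeviceStatsMonitor())

    # Note that log_gpu_memory is not passed to the Trainer here.
    return pl.Trainer(callbacks=callbacks, **trainer_kwargs)
```

As far as I know, the callback writes through the trainer's logger, so it needs a logger (wandb, TensorBoard, etc.) to be configured.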
Another option is to just take out that flag and use another bash window to periodically run an `nvidia-smi` query yourself. I couldn't find specific examples of why `nvidia-smi` would trigger that error, but it's likely not associated with GPU OOM.
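If you'd rather script the manual check than retype it, something like the following could work. It is only a sketch: the function names are made up for illustration, and it assumes `nvidia-smi` is on the PATH and reports plain integer MiB values (some devices report `[N/A]`, which this does not handle).

```python
import subprocess
import time

# Ask nvidia-smi for one CSV row per GPU: index, memory used, memory total (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse nvidia-smi CSV rows into (index, used_mib, total_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append((index, used, total))
    return rows


def read_gpu_memory():
    """Run the query once and return the parsed rows."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


def poll(interval_s=10.0):
    """Print memory usage for every GPU every interval_s seconds until interrupted."""
    while True:
        for index, used, total in read_gpu_memory():
            print(f"gpu {index}: {used} / {total} MiB")
        time.sleep(interval_s)
```

Running `poll()` in a separate terminal while training gives roughly the same information without touching the trainer flags.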
from spacetimeformer.
@jhillhouse92 Thank you, that's about what I figured. I'll just remove that option in the public version, it's only in there to help me pick hparams for each dataset. The best model is usually the biggest one that fits in memory. As long as the model doesn't crash with a clear OOM error on the first backward pass it's fine for most situations... not a super important thing to log.
from spacetimeformer.