Comments (9)
@joeyhng: try model:zeroGradParameters() instead of grad_params:zero().
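To make the suggestion concrete, here is a minimal single-device sketch of a training loop that clears gradients through the model rather than through the flattened gradient tensor. The tiny nn.Linear network, the random data, and the learning rate are assumptions for illustration only; they are not taken from the original post. On one device both calls should clear the same storage once getParameters() has been called; the suggestion above is that they may behave differently when the model is wrapped in nn.DataParallelTable.

```lua
require 'torch'
require 'nn'

-- Toy model and data (assumed for illustration, not from the original post).
local model = nn.Sequential():add(nn.Linear(4, 2))
local criterion = nn.MSECriterion()
local params, grad_params = model:getParameters()

local input = torch.randn(3, 4)
local target = torch.randn(3, 2)
local losses = {}

for iteration = 1, 3 do
  -- Clear gradients via the model instead of grad_params:zero().
  model:zeroGradParameters()

  local output = model:forward(input)
  local loss = criterion:forward(output, target)
  model:backward(input, criterion:backward(output, target))

  -- Plain SGD step with an assumed learning rate of 0.01.
  params:add(-0.01, grad_params)

  losses[iteration] = loss
  print(string.format('iteration %d: loss=%f', iteration, loss))
end
```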
from torchnet.
I may be wrong here, but it appears you're adding the network only to the second GPU, not to the first one. Can you try something like:
if nGPU > 1 then
  local singlemodel = model
  -- Assign to the existing `model` (no `local`), so the wrapper stays visible after this block.
  model = nn.DataParallelTable(1, true, false)
  for i = 1, nGPU do
    cutorch.setDevice(i)
    -- Put one replica of the network on GPU i.
    model:add(singlemodel:clone():cuda(), i)
  end
  cutorch.setDevice(1)
end
I think my multi-GPU setup should be correct. I'm just following fb.resnet.torch:
https://github.com/facebook/fb.resnet.torch/blob/master/models/init.lua#L81
I tried running the code example in your post, but I cannot reproduce the issue you are seeing. Output:
Number of parameters: 11184650
iteration 1: loss=2.352886
iteration 2: loss=2.275253
iteration 3: loss=2.204911
iteration 4: loss=2.140517
iteration 5: loss=2.081086
Closing this for lack of a repro or response.
Have you solved the problem? I also get NaN when training with multiple GPUs.
Thanks
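For anyone trying to pin down the iteration at which the loss turns into NaN, a small guard like the one below may help. It is a plain-Lua sketch; is_nan and check_loss are hypothetical helper names, not part of torchnet or Torch. It relies only on the fact that NaN is the one number that is not equal to itself.

```lua
-- Hypothetical helper (not part of torchnet): true when x is NaN.
local function is_nan(x)
  return x ~= x
end

-- Hypothetical helper: raise an error as soon as the loss is NaN,
-- otherwise return the loss unchanged.
local function check_loss(iteration, loss)
  if is_nan(loss) then
    error(string.format('loss became NaN at iteration %d', iteration))
  end
  return loss
end
```

Calling check_loss(iteration, loss) right after the criterion's forward pass stops training at the first bad iteration instead of letting NaN spread into the parameters.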
That seems to be the reason. It's fixed for me now. Thanks very much!!
it also works for me! Thanks!