Comments (23)
@knazeri It was because of self.iteration = data['iteration'] in models.py. It sets the iteration to 2000000, which is equal to MAX_ITERS. I changed MAX_ITERS in the config file. Sorry for my mistake.
from edge-connect.
@Yaqiongchai Using the pre-trained model is always preferred over training from scratch. Training the network using the pre-trained model is as easy as copying the weights in your checkpoints folder!
I'm still not sure why your model does not go beyond 1000 iterations. Did you set MAX_ITERS to a value larger than 1000? How big is your dataset?
Thanks for your help! @knazeri
I just copied your pre-trained model into the /checkpoints/ folder in the hope that the training code would pick it up and continue training. However, I encounter the same "Epoch 1" problem again. No matter how I change the batch size or num_workers, it won't work, and the model is not modified at all. I am wondering what I can do at this point. I am still at stage 1.
Does it mean the model stops training (freezes) after the first epoch? Or it actually ends with a message "End training"?
It actually ends with the message "End training".
Here's a screenshot (same as the previous issue).
@Yaqiongchai I see your model starts training and then finishes right away. I believe there should be a minor problem with your dataset path. That means the following for loop is never executed:
edge-connect/src/edge_connect.py
Line 95 in 97c28c6
You can make sure this is the case by printing the number of images in the training set. Copy this line of code to the beginning of the train method:
print(len(self.train_dataset))
If it prints out zero, then you might want to double check your file list and/or dataset path!
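The same check can also be run outside the training script by counting the entries in the file list directly. This is only a minimal sketch: it assumes the .flist file is plain text with one image path per line, and `count_flist_entries` is a hypothetical helper, not part of edge-connect.

```python
import os
import tempfile


def count_flist_entries(flist_path):
    """Return (total, missing): listed paths and how many do not exist on disk.

    Hypothetical helper; assumes one image path per line in the file list.
    """
    if not os.path.isfile(flist_path):
        return 0, 0
    with open(flist_path) as f:
        paths = [line.strip() for line in f if line.strip()]
    missing = sum(1 for p in paths if not os.path.isfile(p))
    return len(paths), missing


# Demo on a temporary file list with one real and one missing image path.
with tempfile.TemporaryDirectory() as tmp:
    image_path = os.path.join(tmp, "a.png")
    open(image_path, "wb").close()
    flist_path = os.path.join(tmp, "train.flist")
    with open(flist_path, "w") as f:
        f.write(image_path + "\n")
        f.write(os.path.join(tmp, "missing.png") + "\n")
    demo_total, demo_missing = count_flist_entries(flist_path)
    print(demo_total, demo_missing)
```

A total of zero, or a large number of missing paths, would point at the file list or dataset path rather than at the training loop.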
Here's what I added:
And here's what it reported:
I don't think the problem is the dataset path or file list, because when I rm all the *.pth files in the checkpoints folder, the program runs without a problem for around 50 epochs, stops at 999 iterations (end of training), and saves the weights. The problem with this training run (999 iterations) is that it does not save the generator and discriminator separately; it only saves one .dat file. I guess this is because the number of training iterations is too small and, given the small learning rate, the system is undertrained?
@Yaqiongchai It does not save the model because in your configuration SAVE_INTERVAL is set to 1000! That means training stops (after 999 iterations) before having the chance to save the model. Change SAVE_INTERVAL to a smaller value and you get your model saved!
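To illustrate why nothing is saved, here is a small sketch. It assumes a checkpoint is written whenever the iteration count is a multiple of SAVE_INTERVAL, which is an assumption about the training loop rather than a copy of it; `checkpoint_iterations` is a hypothetical helper.

```python
def checkpoint_iterations(max_iters, save_interval):
    """List the iterations at which a checkpoint would be written.

    Assumes a save fires when iteration % save_interval == 0, and that
    save_interval == 0 means "never", as the config comments describe.
    """
    if save_interval <= 0:
        return []
    return [i for i in range(1, max_iters + 1) if i % save_interval == 0]


# With MAX_ITERS 999 and SAVE_INTERVAL 1000, no iteration qualifies.
print(checkpoint_iterations(999, 1000))       # []
# With SAVE_INTERVAL 10, the last save lands on iteration 990.
print(checkpoint_iterations(999, 10)[-1])     # 990
```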
@knazeri
Thanks for your advice! It's a great observation. One problem solved. What do you think about the training ending at 1 epoch? 8 is just the batch size, though.
@Yaqiongchai No matter how I calculate it, it shouldn't be 1 epoch! Your snapshot shows that the size of the dataset is 72 while the batch size is 8, which means 9 iterations per epoch. 999 iterations therefore make 111 epochs. Am I missing a point?
Can you please post your exact dataset size and the full contents of the config.yml file here?
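The epoch arithmetic above can be written out as a short worked example, using only the numbers from this thread (dataset size 72, batch size 8, 999 iterations):

```python
import math

dataset_size = 72   # images reported in the snapshot
batch_size = 8      # BATCH_SIZE in config.yml
max_iters = 999     # iterations at which training stopped

# One epoch is one pass over the dataset, so it takes ceil(72 / 8) batches.
iters_per_epoch = math.ceil(dataset_size / batch_size)
epochs = max_iters // iters_per_epoch

print(iters_per_epoch)  # 9
print(epochs)           # 111
```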
`MODE: 1 # 1: train, 2: test, 3: eval
MODEL: 1 # 1: edge model, 2: inpaint model, 3: edge-inpaint model, 4: joint model
MASK: 3 # 1: random block, 2: half, 3: external, 4: (external, random block), 5: (external, random block, half)
EDGE: 1 # 1: canny, 2: external
NMS: 1 # 0: no non-max-suppression, 1: applies non-max-suppression on the external edges by multiplying by Canny
SEED: 10 # random seed
GPU: [0] # list of gpu ids
DEBUG: 0 # turns on debugging mode
VERBOSE: 0 # turns on verbose mode in the output console
TRAIN_FLIST: ./datasets/m2d_train.flist
VAL_FLIST: ./datasets/places2_val.flist
TEST_FLIST: ./datasets/places2_test.flist
TRAIN_EDGE_FLIST: ./datasets/m2d_train.flist
VAL_EDGE_FLIST: ./datasets/places2_edges_val.flist
TEST_EDGE_FLIST: ./datasets/places2_edges_test.flist
TRAIN_MASK_FLIST: ./datasets/masks2nd_train.flist
VAL_MASK_FLIST: ./datasets/masks_val.flist
TEST_MASK_FLIST: ./datasets/masks_test.flist
LR: 0.0001 # learning rate
D2G_LR: 0.1 # discriminator/generator learning rate ratio
BETA1: 0.0 # adam optimizer beta1
BETA2: 0.9 # adam optimizer beta2
BATCH_SIZE: 8 # input batch size for training
INPUT_SIZE: 256 # input image size for training 0 for original size
SIGMA: 2 # standard deviation of the Gaussian filter used in Canny edge detector (0: random, -1: no edge)
MAX_ITERS: 3999 # maximum number of iterations to train the model
EDGE_THRESHOLD: 0.5 # edge detection threshold
L1_LOSS_WEIGHT: 1 # l1 loss weight
FM_LOSS_WEIGHT: 10 # feature-matching loss weight
STYLE_LOSS_WEIGHT: 250 # style loss weight
CONTENT_LOSS_WEIGHT: 1 # perceptual loss weight
INPAINT_ADV_LOSS_WEIGHT: 0.01 # adversarial loss weight
GAN_LOSS: nsgan # nsgan | lsgan | hinge
GAN_POOL_SIZE: 0 # fake images pool size
SAVE_INTERVAL: 10 # how many iterations to wait before saving model (0: never)
SAMPLE_INTERVAL: 10 # how many iterations to wait before sampling (0: never)
SAMPLE_SIZE: 12 # number of images to sample
EVAL_INTERVAL: 0 # how many iterations to wait before model evaluation (0: never)
LOG_INTERVAL: 10 # how many iterations to wait before logging training status (0: never)
`
Here's my config.yml file.
I have 72 images, all 256x256, and 72 masks, also 256x256.
@knazeri The size of the image is 256 by 256, same as mask file.
I have the same problem: training ends after the first epoch. This happens with all 3 datasets, which I checked with print(len(self.train_dataset)); each contains thousands of images.
Does it matter that the datasets are on another drive?
@Yaqiongchai The problem is with your validation set path! You also need to provide a validation set path using VAL_FLIST and VAL_MASK_FLIST. These flags are set to default and there was an infinite loop with a sampler that caused the model to stop! I have fixed the code to prevent the infinite loop, but you should also include a validation set path.
Also, two minor issues in your configuration: your values for CONTENT_LOSS_WEIGHT and INPAINT_ADV_LOSS_WEIGHT are not what we trained our models with. They should be 0.1 and 0.1 respectively!
@aryan461 I guess you might have had the same problem! Let me know if this also resolves your issue!
@aryan461
Could you be more specific? Do we need to change anything in models.py?
@Yaqiongchai I don't think @aryan461's issue applies to yours. He was using the pre-trained model, which was already trained to 2,000,000 iterations. Your problem was not having a valid validation set path in your configuration file.
However, even if you decide not to have a validation set, I have fixed the code so that it would not freeze. You just need to pull the source!
@knazeri
I am training now with modified VAL_FLIST, and of course data in the list. It seems to get stuck at the first epoch and would not move on. I also set CONTENT_LOSS_WEIGHT: 0.1 and INPAINT_ADV_LOSS_WEIGHT: 0.1 as you mentioned above.
Also, as long as I set MASK: 3 and EDGE: 1, then TRAIN_EDGE_FLIST, VAL_EDGE_FLIST, and TEST_EDGE_FLIST would not matter, right? I am trying this both by training on just my data and by picking up the pre-trained model that I downloaded from your Google Drive. It does not seem to run smoothly.
Lastly, I'd like to add one line to tell me that the code is picking up the previously trained model and gonna continue to train, in models.py:
if torch.cuda.is_available():
    data = torch.load(self.gen_weights_path)
else:
    data = torch.load(self.gen_weights_path, map_location=lambda storage, loc: storage)
print(self.gen_weights_path)
print(self.dis_weights_path)
self.generator.load_state_dict(data['generator'])
self.iteration = data['iteration']
Would it be the correct way to do it?
Sorry to throw so many questions at you.
Hey Kamyar,
I fixed the validation dataset, loss weights, and iterations; however, I still see training end at "Training epoch: 1". And when I check ls -trl in my checkpoints folder, the *.pth files were not updated. I guess it can successfully pick up the generator and discriminator, but cannot continue training.
The good news is that when I rm the *.pth files in the checkpoints folder, it trains smoothly and ends exactly at the 111th epoch, as you calculated for me before (kudos!). That being said, I can have my own model, but I am still looking for a way to use your pre-trained model.
`---------- 2019-03-13 20:31:12 ---------
Wed Mar 13 20:31:13 PDT 2019
Now start training on stage 1: inpaint model training
Loading EdgeModel generator...
iteration number is: 2000000
Loading EdgeModel discriminator...
Model configurations:
MODE: 1 # 1: train, 2: test, 3: eval
MODEL: 1 # 1: edge model, 2: inpaint model, 3: edge-inpaint model, 4: joint model
MASK: 3 # 1: random block, 2: half, 3: external, 4: (external, random block), 5: (external, random block, half)
EDGE: 1 # 1: canny, 2: external
NMS: 1 # 0: no non-max-suppression, 1: applies non-max-suppression on the external edges by multiplying by Canny
SEED: 10 # random seed
GPU: [0] # list of gpu ids
DEBUG: 0 # turns on debugging mode
VERBOSE: 0 # turns on verbose mode in the output console
TRAIN_FLIST: ./datasets/m2d_train.flist
VAL_FLIST: ./datasets/m2d_validate.flist
TEST_FLIST: ./datasets/m2d_test.flist
TRAIN_EDGE_FLIST: ./datasets/m2d.flist
VAL_EDGE_FLIST: ./datasets/m2d.flist
TEST_EDGE_FLIST: ./datasets/places2_edges_test.flist
TRAIN_MASK_FLIST: ./datasets/masks2nd_train.flist
VAL_MASK_FLIST: ./datasets/m2d_test_mask6.flist
TEST_MASK_FLIST: ./datasets/m2d_test_mask.flist
LR: 0.0001 # learning rate
D2G_LR: 0.1 # discriminator/generator learning rate ratio
BETA1: 0.0 # adam optimizer beta1
BETA2: 0.9 # adam optimizer beta2
BATCH_SIZE: 8 # input batch size for training
INPUT_SIZE: 256 # input image size for training 0 for original size
SIGMA: 2 # standard deviation of the Gaussian filter used in Canny edge detector (0: random, -1: no edge)
MAX_ITERS: 999 # maximum number of iterations to train the model
EDGE_THRESHOLD: 0.5 # edge detection threshold
L1_LOSS_WEIGHT: 1 # l1 loss weight
FM_LOSS_WEIGHT: 10 # feature-matching loss weight
STYLE_LOSS_WEIGHT: 250 # style loss weight
CONTENT_LOSS_WEIGHT: 0.1 # perceptual loss weight
INPAINT_ADV_LOSS_WEIGHT: 0.1 # adversarial loss weight
GAN_LOSS: nsgan # nsgan | lsgan | hinge
GAN_POOL_SIZE: 0 # fake images pool size
SAVE_INTERVAL: 10 # how many iterations to wait before saving model (0: never)
SAMPLE_INTERVAL: 100 # how many iterations to wait before sampling (0: never)
SAMPLE_SIZE: 6 # number of images to sample
EVAL_INTERVAL: 0 # how many iterations to wait before model evaluation (0: never)
LOG_INTERVAL: 10 # how many iterations to wait before logging training status (0: never)
start training...
72
Training epoch: 1
8
8
End training....
code done`
@Yaqiongchai OK, now you have the same problem that other people mentioned. Since our model was trained for 2,000,000 iterations, you need to specify a MAX_ITERS larger than 2,000,000 if you wish to continue training with the pre-trained weights. Based on your configuration, the model stops training when the number of iterations is larger than 999!
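The effect of the restored iteration counter can be sketched as follows. This is a simplified model, not the project's training loop: it assumes the stop condition is checked after each step, and `steps_before_stop` is a hypothetical helper.

```python
def steps_before_stop(start_iteration, max_iters):
    """Count training steps taken before the stop condition fires.

    Assumes the counter is incremented once per step and training ends
    as soon as the counter reaches or exceeds max_iters.
    """
    iteration = start_iteration
    steps = 0
    while True:
        iteration += 1
        steps += 1
        if iteration >= max_iters:
            break
    return steps


# Fresh run: counter starts at 0, so all 999 iterations are trained.
print(steps_before_stop(0, 999))                 # 999
# Resuming from the pre-trained weights: counter starts at 2,000,000,
# which already exceeds MAX_ITERS = 999, so training ends after one step.
print(steps_before_stop(2_000_000, 999))         # 1
# Raising MAX_ITERS above 2,000,000 leaves room to keep training.
print(steps_before_stop(2_000_000, 2_000_100))   # 100
```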
"Based on your configuration, the model stops training when the number of iterations is larger than 999." Yes, the model stops training when the number of iterations is larger than 999; that is for the case where we don't use your pre-trained model.
On the other hand, if I'd like to continue training with the pre-trained weights, I'll need to set MAX_ITERS larger than 2,000,000, am I right?
@Yaqiongchai Yes!
Dear Yaqiong
How did you solve the problem of the training process ending instantly? Thank you very much!
Best wishes!
Knazeri suggested a few things to try. Along the way we found that the pre-trained model is available to download, so we used it and continued training on our dataset. Because of that, we did not try to solve the problem you asked about.
The quickest way to check is to see whether your train/validation/mask datasets are set correctly. You can print them out in model.py to double check. As a newbie I was able to train from scratch, but training still ended at 2000 iterations. I did not get to solve this problem.