Comments (23)

aryan461 commented on September 7, 2024

@knazeri it was because of self.iteration = data['iteration'] in models.py.
It makes the iteration = 2000000 which is equal to MAX_ITERS. I changed the MAX_ITERS in config file. Sorry for my mistake.

from edge-connect.

knazeri commented on September 7, 2024

@Yaqiongchai Using the pre-trained model is always preferred over training from scratch. Training the network using the pre-trained model is as easy as copying the weights in your checkpoints folder!
I'm still not sure why your model does not go beyond 1000 iterations. Did you set MAX_ITERS to a value larger than 1000? How big is your dataset?

Yaqiongchai commented on September 7, 2024

Thanks for your help! @knazeri
I just copied your pre-trained model in /checkpoints/ folder in the hope that training model can pick it up and continue training. However, I encounter the same Epoch1 problem again. No matter how I change batch-size, or num_workers, it won't work and the model was not modified at all. I am wondering what I can do at this point. I am still at stage 1.

knazeri commented on September 7, 2024

Does it mean the model stops training (freezes) after the first epoch? Or it actually ends with a message "End training"?

Yaqiongchai commented on September 7, 2024

It actually ends with the message "End training".
Here's a screenshot (same as in the previous issue):
[screenshot]

knazeri commented on September 7, 2024

@Yaqiongchai I see your model starts training and then finishes right away. I believe there is a minor problem with your dataset path, which would mean the following for loop is never executed:

for items in train_loader:

You can check whether this is the case by printing the number of images in the training set. Copy this line of code to the beginning of the train method:

print(len(self.train_dataset))

If it prints zero, double-check your file list and/or dataset path!
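
The same check can be run outside the training code by counting the entries in the file list directly. This is only a sketch: it assumes the .flist file is plain text with one image path per line, and `summarize_flist` is a hypothetical helper, not something that ships with edge-connect.

```python
import os
import tempfile


def summarize_flist(flist_path):
    """Return (total_entries, missing_paths) for a plain-text file list.

    Assumption: one image path per line; blank lines are ignored.
    """
    if not os.path.isfile(flist_path):
        # A wrong file-list path is reported as an empty dataset.
        return 0, []
    with open(flist_path, encoding="utf-8") as handle:
        entries = [line.strip() for line in handle if line.strip()]
    missing = [path for path in entries if not os.path.exists(path)]
    return len(entries), missing


# Example: a file list with one existing image and one stale entry.
with tempfile.TemporaryDirectory() as tmp:
    image = os.path.join(tmp, "img_0.png")
    open(image, "wb").close()
    flist = os.path.join(tmp, "train.flist")
    with open(flist, "w", encoding="utf-8") as handle:
        handle.write(image + "\n" + os.path.join(tmp, "gone.png") + "\n")
    total, missing = summarize_flist(flist)
    print(total, len(missing))
```

A total of zero, or a long list of missing paths, would point at the file list or dataset path rather than at the training loop.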

Yaqiongchai commented on September 7, 2024

Here's what I added:
[screenshot]
And here's what it reported:
[screenshot]

I don't think the problem is the dataset path or file list. When I removed all the *.pth files from the checkpoints folder, the program ran without a problem for around 50 epochs, stopped at 999 iterations (end of training), and saved the weights. The problem with this 999-iteration run is that it does not save the generator and discriminator separately; it only saves one .dat file. Could it be that, with so few training iterations and such a small learning rate, the model is undertrained?

knazeri commented on September 7, 2024

@Yaqiongchai It does not save the model because in your configuration the SAVE_INTERVAL is set to 1000! That means training stops (after 999 iterations) before having the chance to save the model. Change the value of SAVE_INTERVAL to a smaller value and you get your model saved!
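
To see why no checkpoint is written, the interaction between MAX_ITERS and SAVE_INTERVAL can be modelled in a few lines. This is a simplified sketch, not the project's training loop: it assumes the counter is incremented once per batch, that training stops as soon as the counter reaches MAX_ITERS, and that a save happens only when the counter is a multiple of SAVE_INTERVAL. `save_iterations` is a hypothetical name.

```python
def save_iterations(max_iters, save_interval, start_iteration=0):
    """List the iteration numbers at which a checkpoint would be saved.

    Assumptions: the stop check runs before the save check, and a
    save_interval of 0 means "never save".
    """
    saved = []
    iteration = start_iteration
    while True:
        iteration += 1
        if iteration >= max_iters:
            # Training ends here, before the save check is reached.
            break
        if save_interval and iteration % save_interval == 0:
            saved.append(iteration)
    return saved


print(save_iterations(999, 1000))      # no save points at all
print(save_iterations(999, 10)[-3:])   # last three save points
```

Under these assumptions, an interval of 1000 with a 999-iteration budget never reaches a save point, while an interval of 10 saves regularly up to iteration 990.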

Yaqiongchai commented on September 7, 2024

@knazeri
Thanks for your advice! That's a great observation. One problem solved. What do you think about the training ending at 1 epoch? The 8 is just the batch size, though.

knazeri commented on September 7, 2024

@Yaqiongchai No matter how I calculate it, it shouldn't be 1 epoch! Your snapshot shows that the size of the dataset is 72 while the batch size is 8, which means 9 iterations per epoch. 999 iterations make 111 epochs. Am I missing something?
Can you please post your exact dataset size, and all the contents in the config.yml file here?
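
The arithmetic above can be reproduced with a short sketch. The helper names are hypothetical, and the sketch assumes a partially filled last batch would still count as one iteration (with 72 images and a batch size of 8 the division is exact, so this assumption does not affect the result).

```python
import math


def iterations_per_epoch(dataset_size, batch_size):
    # Assumption: a partially filled last batch still counts as one iteration.
    return math.ceil(dataset_size / batch_size)


def epochs_for(max_iters, dataset_size, batch_size):
    # Number of epochs covered by a fixed iteration budget.
    return max_iters / iterations_per_epoch(dataset_size, batch_size)


print(iterations_per_epoch(72, 8))  # 9
print(epochs_for(999, 72, 8))       # 111.0
```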

Yaqiongchai commented on September 7, 2024

`MODE: 1 # 1: train, 2: test, 3: eval
MODEL: 1 # 1: edge model, 2: inpaint model, 3: edge-inpaint model, 4: joint model
MASK: 3 # 1: random block, 2: half, 3: external, 4: (external, random block), 5: (external, random block, half)
EDGE: 1 # 1: canny, 2: external
NMS: 1 # 0: no non-max-suppression, 1: applies non-max-suppression on the external edges by multiplying by Canny
SEED: 10 # random seed
GPU: [0] # list of gpu ids
DEBUG: 0 # turns on debugging mode
VERBOSE: 0 # turns on verbose mode in the output console

TRAIN_FLIST: ./datasets/m2d_train.flist
VAL_FLIST: ./datasets/places2_val.flist
TEST_FLIST: ./datasets/places2_test.flist

TRAIN_EDGE_FLIST: ./datasets/m2d_train.flist
VAL_EDGE_FLIST: ./datasets/places2_edges_val.flist
TEST_EDGE_FLIST: ./datasets/places2_edges_test.flist

TRAIN_MASK_FLIST: ./datasets/masks2nd_train.flist
VAL_MASK_FLIST: ./datasets/masks_val.flist
TEST_MASK_FLIST: ./datasets/masks_test.flist

LR: 0.0001 # learning rate
D2G_LR: 0.1 # discriminator/generator learning rate ratio
BETA1: 0.0 # adam optimizer beta1
BETA2: 0.9 # adam optimizer beta2
BATCH_SIZE: 8 # input batch size for training
INPUT_SIZE: 256 # input image size for training 0 for original size
SIGMA: 2 # standard deviation of the Gaussian filter used in Canny edge detector (0: random, -1: no edge)
MAX_ITERS: 3999 # maximum number of iterations to train the model

EDGE_THRESHOLD: 0.5 # edge detection threshold
L1_LOSS_WEIGHT: 1 # l1 loss weight
FM_LOSS_WEIGHT: 10 # feature-matching loss weight
STYLE_LOSS_WEIGHT: 250 # style loss weight
CONTENT_LOSS_WEIGHT: 1 # perceptual loss weight
INPAINT_ADV_LOSS_WEIGHT: 0.01 # adversarial loss weight

GAN_LOSS: nsgan # nsgan | lsgan | hinge
GAN_POOL_SIZE: 0 # fake images pool size

SAVE_INTERVAL: 10 # how many iterations to wait before saving model (0: never)
SAMPLE_INTERVAL: 10 # how many iterations to wait before sampling (0: never)
SAMPLE_SIZE: 12 # number of images to sample
EVAL_INTERVAL: 0 # how many iterations to wait before model evaluation (0: never)
LOG_INTERVAL: 10 # how many iterations to wait before logging training status (0: never)
`

Here's my config.yml file.
I have 72 images, all 256×256, and 72 masks, also 256×256.

Yaqiongchai commented on September 7, 2024

@knazeri The size of the image is 256 by 256, same as mask file.

aryan461 commented on September 7, 2024

I have the same problem: training ends after the first training epoch. This happens for all 3 datasets, which I checked with print(len(self.train_dataset)); they contain thousands of images.

Does it matter that datasets are in another drive?

knazeri commented on September 7, 2024

@Yaqiongchai The problem is with your validation set path! You need to also provide a validation set path using VAL_FLIST and VAL_MASK_FLIST. These flags are set to default and there was an infinite loop with a sampler that caused the model to stop! I have fixed the code to prevent the infinite loop, but you should also include a validation set path.
Also, two minor issues in your configuration: your values for CONTENT_LOSS_WEIGHT and INPAINT_ADV_LOSS_WEIGHT are not what we trained our models with. They should be 0.1 and 0.1 respectively!

@aryan461 I guess you might have had the same problem! Let me know if this also resolves your issue!
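
A pre-flight check along these lines can catch a default or stale file-list path before training starts. It is a sketch under assumptions: the configuration is taken to be already loaded into a plain dict, the choice of which keys to check is a reading of the comment above rather than a rule from the code base, and `missing_flists` is a hypothetical helper.

```python
import os
import tempfile

# Assumption: these are the file lists a training run with an external mask needs.
REQUIRED_FLIST_KEYS = ("TRAIN_FLIST", "TRAIN_MASK_FLIST", "VAL_FLIST", "VAL_MASK_FLIST")


def missing_flists(config, keys=REQUIRED_FLIST_KEYS):
    """Return the config keys whose path is unset or does not exist on disk."""
    problems = []
    for key in keys:
        path = config.get(key)
        if not path or not os.path.exists(path):
            problems.append(key)
    return problems


# Example: training lists exist, validation lists still point at missing files.
with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "m2d_train.flist")
    open(existing, "w").close()
    config = {
        "TRAIN_FLIST": existing,
        "TRAIN_MASK_FLIST": existing,
        "VAL_FLIST": os.path.join(tmp, "places2_val.flist"),
        "VAL_MASK_FLIST": os.path.join(tmp, "masks_val.flist"),
    }
    problems = missing_flists(config)
    print(problems)
```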

Yaqiongchai commented on September 7, 2024

> @knazeri it was because of self.iteration = data['iteration'] in models.py.
> It makes the iteration = 2000000 which is equal to MAX_ITERS. I changed the MAX_ITERS in config file. Sorry for my mistake.

@aryan461
Could you be more specific? Do we need to change anything in models.py?

knazeri commented on September 7, 2024

@Yaqiongchai I don't think @aryan461's issue applies to yours. He was using the pre-trained model, which was already trained to 2,000,000 iterations. Your problem was not having a valid validation set path in your configuration file.
However, even if you decide not to have a validation set, I have fixed the code so that it would not freeze. You just need to pull the source!

Yaqiongchai commented on September 7, 2024

@knazeri
I am training now with modified VAL_FLIST, and of course data in the list. It seems to get stuck at the first epoch and would not move on. I also set CONTENT_LOSS_WEIGHT: 0.1 and INPAINT_ADV_LOSS_WEIGHT: 0.1 as you mentioned above.

Also, as long as I set MASK: 3 and EDGE: 1, TRAIN_EDGE_FLIST, VAL_EDGE_FLIST, and TEST_EDGE_FLIST would not matter, right? I am trying it out on both just training my data, and pick up the pre_trained model that I downloaded from your google drive. It does not seem to run smoothly.

Lastly, I'd like to add one line to tell me that the code is picking up the previously trained model and gonna continue to train, in models.py:
if torch.cuda.is_available():
    data = torch.load(self.gen_weights_path)
else:
    data = torch.load(self.gen_weights_path, map_location=lambda storage, loc: storage)
print(self.gen_weights_path)
print(self.dis_weights_path)
self.generator.load_state_dict(data['generator'])
self.iteration = data['iteration']

Would it be the correct way to do it?
Sorry to throw so many questions at you.

Yaqiongchai commented on September 7, 2024

Hey kamyar,

I fixed the validation dataset, loss weights, and iterations, but training still ends right after "Training epoch: 1". When I check with ls -trl in my checkpoints folder, the *.pth files were not updated. I guess it can successfully pick up the generator and discriminator, but cannot continue training.

The good news is that when I rm the *.pth files in the checkpoints folder, it trains smoothly and ends exactly at the 111th epoch, as you calculated for me before (kudos!). That being said, I can have my own model, but I am still looking for a way to use your pre-trained model.

`---------- 2019-03-13 20:31:12 ---------

Wed Mar 13 20:31:13 PDT 2019
Now start training on stage 1: inpaint model training
Loading EdgeModel generator...
iteration number is: 2000000
Loading EdgeModel discriminator...
Model configurations:

MODE: 1 # 1: train, 2: test, 3: eval
MODEL: 1 # 1: edge model, 2: inpaint model, 3: edge-inpaint model, 4: joint model
MASK: 3 # 1: random block, 2: half, 3: external, 4: (external, random block), 5: (external, random block, half)
EDGE: 1 # 1: canny, 2: external
NMS: 1 # 0: no non-max-suppression, 1: applies non-max-suppression on the external edges by multiplying by Canny
SEED: 10 # random seed
GPU: [0] # list of gpu ids
DEBUG: 0 # turns on debugging mode
VERBOSE: 0 # turns on verbose mode in the output console

TRAIN_FLIST: ./datasets/m2d_train.flist
VAL_FLIST: ./datasets/m2d_validate.flist
TEST_FLIST: ./datasets/m2d_test.flist

TRAIN_EDGE_FLIST: ./datasets/m2d.flist
VAL_EDGE_FLIST: ./datasets/m2d.flist
TEST_EDGE_FLIST: ./datasets/places2_edges_test.flist

TRAIN_MASK_FLIST: ./datasets/masks2nd_train.flist
VAL_MASK_FLIST: ./datasets/m2d_test_mask6.flist
TEST_MASK_FLIST: ./datasets/m2d_test_mask.flist

LR: 0.0001 # learning rate
D2G_LR: 0.1 # discriminator/generator learning rate ratio
BETA1: 0.0 # adam optimizer beta1
BETA2: 0.9 # adam optimizer beta2
BATCH_SIZE: 8 # input batch size for training
INPUT_SIZE: 256 # input image size for training 0 for original size
SIGMA: 2 # standard deviation of the Gaussian filter used in Canny edge detector (0: random, -1: no edge)
MAX_ITERS: 999 # maximum number of iterations to train the model

EDGE_THRESHOLD: 0.5 # edge detection threshold
L1_LOSS_WEIGHT: 1 # l1 loss weight
FM_LOSS_WEIGHT: 10 # feature-matching loss weight
STYLE_LOSS_WEIGHT: 250 # style loss weight
CONTENT_LOSS_WEIGHT: 0.1 # perceptual loss weight
INPAINT_ADV_LOSS_WEIGHT: 0.1 # adversarial loss weight

GAN_LOSS: nsgan # nsgan | lsgan | hinge
GAN_POOL_SIZE: 0 # fake images pool size

SAVE_INTERVAL: 10 # how many iterations to wait before saving model (0: never)
SAMPLE_INTERVAL: 100 # how many iterations to wait before sampling (0: never)
SAMPLE_SIZE: 6 # number of images to sample
EVAL_INTERVAL: 0 # how many iterations to wait before model evaluation (0: never)
LOG_INTERVAL: 10 # how many iterations to wait before logging training status (0: never)


start training...

72

Training epoch: 1
8
8

End training....
code done`

knazeri commented on September 7, 2024

@Yaqiongchai Ok, now you have the same problem as other people mentioned. Since our model is trained with 2,000,000 iterations, you need to specify a MAX_ITERS larger than 2,000,000 if you wish to continue training with the pre-trained weights. Based on your configuration, the model stops training when the number of iterations is larger than 999!
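
The relationship between the checkpoint's stored iteration count and MAX_ITERS can be sketched as follows. This is a rough model that assumes the iteration counter is restored from the checkpoint and that training stops once the counter reaches MAX_ITERS; both helper names are hypothetical.

```python
def iterations_left(checkpoint_iteration, max_iters):
    """Approximate number of iterations a resumed run can still perform."""
    return max(0, max_iters - checkpoint_iteration)


def max_iters_for_resume(checkpoint_iteration, extra_iterations):
    """MAX_ITERS value that leaves room for extra_iterations of fine-tuning."""
    return checkpoint_iteration + extra_iterations


pretrained_iteration = 2_000_000  # iteration count reported when the weights load

print(iterations_left(pretrained_iteration, 999))            # 0
print(max_iters_for_resume(pretrained_iteration, 10_000))    # 2010000
```

With MAX_ITERS: 999 the resumed run has no budget left, which matches the log above ending right after "Training epoch: 1". The 10,000 extra iterations are only an illustrative figure.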

Yaqiongchai commented on September 7, 2024

"Based on your configuration, the model stops training when the number of iterations is larger than 999." Yes, the model stops training when the number of iterations is larger than 999; that is the case when we don't use your pre-trained model.
On the other hand, if I'd like to continue training with the pre-trained weights, I'll need to set MAX_ITERS larger than 2,000,000, am I right?

knazeri commented on September 7, 2024

@Yaqiongchai Yes!

Hgit007 commented on September 7, 2024

Dear Yaqiong

How did you solve the problem of the training process ending instantly? Thank you very much!

Best wishes!

Yaqiongchai commented on September 7, 2024

@Hgit007

Knazeri suggested a few things to try. Along the way we found that the pre-trained model is available for download, so we used it and continued training on our own dataset. Because of that, we did not try to solve the problem you are asking about.

The quickest way to check is to see whether your train/validate/mask datasets are set correctly. You can print them out in model.py to double-check. As a newbie I was able to train from scratch, but training still ended at 2000 iterations. I did not get to solve this problem.
