Comments (10)

glenn-jocher avatar glenn-jocher commented on July 17, 2024

@AlainPilon hello,

Thank you for reaching out and providing detailed context about your issue. It seems like you're experiencing unexpected behavior when continuing training from a pre-existing model. Here are a few points to consider that might help address the issue:

  1. Reproducible Example: To better understand and diagnose the problem, it would be very helpful if you could provide a minimum reproducible example of your training script. This will allow us to replicate the issue on our end and offer more precise guidance. You can find more information on how to create a reproducible example here.

  2. Resume Training: When you want to continue training from where you left off, using the resume option is generally recommended. This ensures that not only the model weights but also the optimizer state and learning rate scheduler are restored. If you haven't tried this yet, you might want to give it a shot:

    yolo train resume model=path/to/your/last.pt

  3. Learning Rate and Epochs: When adding new data, it might be beneficial to adjust the learning rate and the number of epochs. A lower learning rate can help the model fine-tune more effectively on the new data without "forgetting" what it has already learned. Additionally, training for more epochs might be necessary to see the full benefit of the new data.

  4. Data Quality and Distribution: Ensure that the new images are well-distributed across the classes and do not introduce any bias. Sometimes, even high-quality images can skew the training if they are not representative of the overall dataset.

  5. Latest Versions: Please verify that you are using the latest versions of the Ultralytics YOLO packages. Sometimes, updates include important bug fixes and performance improvements that could resolve your issue.

If you can provide the additional details mentioned above, we can further assist you in troubleshooting this issue. Thank you for your patience and cooperation!

from ultralytics.

AlainPilon avatar AlainPilon commented on July 17, 2024

I added the resume command and it seems to work when looking at the output (top1_acc is at 64% right off the start).

But strangely, the output when starting the training says resume=False:

New https://pypi.org/project/ultralytics/8.2.42 available 😃 Update with 'pip install -U ultralytics'
Ultralytics YOLOv8.0.81 🚀 Python-3.8.10 torch-2.0.1+cu117 CUDA:0 (Tesla T4, 15102MiB)
yolo/engine/trainer: task=classify, mode=train, model=/home/ubuntu/s3pictures/5_class_v1_assembly_1/5_class_v1_assembly_1.pt, data=/home/ubuntu/s3pictures/5_class_v1_assembly_1, epochs=40, patience=50, batch=16, imgsz=640, save=True, save_period=-1, cache=False, device=0, workers=8, project=None, name=None, exist_ok=False, pretrained=False, optimizer=SGD, verbose=True, seed=0, deterministic=True, single_cls=False, image_weights=False, rect=False, cos_lr=False, close_mosaic=0, resume=False, amp=True, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, show=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, vid_stride=1, line_thickness=3, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, boxes=True, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0, cfg=None, v5loader=False, tracker=botsort.yaml, save_dir=runs/classify/train10

Regarding the learning rate, my understanding was that if I keep training on the previous dataset with the new data added, I should not have to decrease it: the model won't "forget" the original images, since it trains on them again at every epoch.

Regarding the number of epochs, I set a high value (40), save every 2 epochs, and will check afterward for overfitting.

I will let the training complete and report back. Thanks.

Y-T-G avatar Y-T-G commented on July 17, 2024

You have to disable warmup by adding warmup_epochs=0 and use a lower learning rate by adding optimizer="SGD", lr0=0.001
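To illustrate why warmup matters when starting from already-trained weights, here is a minimal sketch of a linear warmup ramp. It assumes, in line with the `warmup_epochs=3.0` and `warmup_bias_lr=0.1` values in the trainer log above, that weight learning rates ramp from 0 up to `lr0` while bias learning rates start at `warmup_bias_lr` and move toward `lr0`. The function name `warmup_lr` and the step count are made up for illustration; this is not the trainer's actual code.

```python
def warmup_lr(step: int, warmup_steps: int, start_lr: float, target_lr: float) -> float:
    """Linearly interpolate from start_lr to target_lr over warmup_steps."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return target_lr  # warmup disabled or already finished
    frac = step / warmup_steps
    return start_lr + frac * (target_lr - start_lr)


lr0 = 0.001            # lower fine-tuning rate suggested above
warmup_bias_lr = 0.1   # default value shown in the trainer log
steps = 300            # hypothetical: 3 warmup epochs x 100 batches

for step in (0, 150, 300):
    weight_lr = warmup_lr(step, steps, 0.0, lr0)
    bias_lr = warmup_lr(step, steps, warmup_bias_lr, lr0)
    print(f"step {step:>3}: weight lr={weight_lr:.4f}  bias lr={bias_lr:.4f}")
```

Under these assumptions the bias learning rate starts 100 times higher than `lr0`, which can disturb weights that are already trained. Passing `warmup_steps=0` in the sketch makes both rates start at `lr0`.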

AlainPilon avatar AlainPilon commented on July 17, 2024

@Y-T-G I used this:

yolo classify train resume model='/home/ubuntu/s3pictures/5_class_v1_initial_training/5_class_v1_1.pt' data='/home/ubuntu/s3pictures/5_class_v1_initial_training'  epochs=40 imgsz=640  device=0 save_period=2 warmup_epochs=0 optimizer="SGD" lr0=0.001

But the yolo/engine/trainer output shows lr0=0.01 instead of 0.001. Is this normal?

Y-T-G avatar Y-T-G commented on July 17, 2024

Try specifying SGD without quotes

AlainPilon avatar AlainPilon commented on July 17, 2024

That does not change anything.

Probably unrelated, but save_period=2 also does not seem to have any effect: the training only saves best.pt and last.pt.

Y-T-G avatar Y-T-G commented on July 17, 2024

The first line of the logs should say which parameters were actually used for the run. Post the beginning of the logs, before training starts.
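A quick way to do this comparison is to parse the `yolo/engine/trainer:` line into a dictionary and diff it against the arguments passed on the command line. The sketch below assumes the line keeps the `key=value, key=value` layout seen in the logs in this thread; `parse_trainer_line` and `diff_args` are hypothetical helper names, not part of Ultralytics.

```python
def parse_trainer_line(line: str) -> dict:
    """Turn 'yolo/engine/trainer: a=1, b=2' into {'a': '1', 'b': '2'}."""
    _, _, body = line.partition("trainer:")
    pairs = (item.split("=", 1) for item in body.split(", ") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def diff_args(requested: dict, effective: dict) -> dict:
    """Return {name: (requested, effective)} for every value that differs."""
    return {
        name: (value, effective.get(name))
        for name, value in requested.items()
        if effective.get(name) != value
    }


# Shortened, illustrative trainer line; all values are kept as strings.
log_line = "yolo/engine/trainer: task=classify, epochs=40, save_period=-1, lr0=0.01, warmup_epochs=3.0"
requested = {"epochs": "10", "save_period": "1", "lr0": "0.001", "warmup_epochs": "0"}

for name, (asked, used) in diff_args(requested, parse_trainer_line(log_line)).items():
    print(f"{name}: requested {asked}, trainer used {used}")
```

Values are compared as plain strings, so `0` and `0.0` would be reported as different; converting to numbers first would be a reasonable refinement.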

AlainPilon avatar AlainPilon commented on July 17, 2024

My command:

yolo classify train resume model='/home/ubuntu/s3pictures/5_class_v1_initial_training/5_class_v1_1.pt' data='/home/ubuntu/s3pictures/5_class_v1_initial_training'  epochs=10 imgsz=640  device=0 save_period=1 warmup_epochs=0 optimizer=SGD lr0=0.001

log output:

yolo/engine/trainer: task=classify, mode=train, model=/home/ubuntu/s3pictures/5_class_v1_initial_training/5_class_v1_1.pt, data=/home/ubuntu/s3pictures/5_class_v1_initial_training, epochs=40, patience=50, batch=16, imgsz=640, save=True, save_period=-1, cache=False, device=0, workers=8, project=None, name=None, exist_ok=False, pretrained=False, optimizer=SGD, verbose=True, seed=0, deterministic=True, single_cls=False, image_weights=False, rect=False, cos_lr=False, close_mosaic=0, resume=False, amp=True, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, show=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, vid_stride=1, line_thickness=3, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, boxes=True, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0, cfg=None, v5loader=False, tracker=botsort.yaml, save_dir=runs/classify/train10
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
AMP: checks passed ✅
optimizer: SGD(lr=0.01) with parameter groups 50 weight(decay=0.0), 51 weight(decay=0.0005), 51 bias
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/classify/train10
Starting training for 40 epochs...

Notice that I asked for 10 epochs and the script wants to train for 40. My initial training was for 40 epochs; is it taking the value from there?

AlainPilon avatar AlainPilon commented on July 17, 2024

Found the issue!

I upgraded ultralytics, and then the command crashed because my initial training had already reached epoch 40, which is incompatible with the resume command.

I removed resume from the command and everything works as expected.

Moral of the story: resume should only be used when the initial training has not been completed.
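One way to apply this rule before launching a run is to inspect the checkpoint first. The sketch below assumes the checkpoint is a dictionary with an `epoch` entry that is negative once training has finished and an `optimizer` entry that is emptied at the same time. Both conventions are assumptions about how checkpoints are finalized and may differ between versions; `can_resume` and `suggested_mode` are hypothetical helpers.

```python
from typing import Any, Mapping


def can_resume(ckpt: Mapping[str, Any]) -> bool:
    """Guess whether a checkpoint dict still holds resumable training state.

    Assumed convention (not verified against any specific version):
    a finished run stores epoch=-1 and drops the optimizer state.
    """
    epoch = ckpt.get("epoch", -1)
    has_optimizer = ckpt.get("optimizer") is not None
    return epoch is not None and epoch >= 0 and has_optimizer


def suggested_mode(ckpt: Mapping[str, Any]) -> str:
    """Return 'resume' for an interrupted run, 'fine-tune' otherwise."""
    return "resume" if can_resume(ckpt) else "fine-tune"


# Hypothetical checkpoint contents; a real one would first have to be
# loaded from the .pt file.
interrupted = {"epoch": 17, "optimizer": {"state": {}, "param_groups": []}}
finished = {"epoch": -1, "optimizer": None}

print(suggested_mode(interrupted))
print(suggested_mode(finished))
```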

glenn-jocher avatar glenn-jocher commented on July 17, 2024

Hello @AlainPilon,

Thank you for the update and for sharing your findings! It's great to hear that you were able to identify the issue and resolve it. Indeed, the resume command is designed to continue training from an interrupted state, and it can cause conflicts if the initial training has already completed.

For future reference, if you want to fine-tune or continue training on a model that has already completed its initial training, you can simply load the pre-trained model without using resume and start a new training session. This approach ensures that you can build upon the learned weights without the constraints of the previous training session's state.

Here's an example command for fine-tuning:

yolo classify train model='/home/ubuntu/s3pictures/5_class_v1_initial_training/5_class_v1_1.pt' data='/home/ubuntu/s3pictures/5_class_v1_initial_training' epochs=10 imgsz=640 device=0 save_period=1 warmup_epochs=0 optimizer=SGD lr0=0.001

If you encounter any further issues or have additional questions, feel free to reach out. We're here to help!

Best regards and happy training! 🚀
