
vega's Introduction

Vega

English | 中文


Vega ver1.8.5 released

  • Bugs fixed:

    • Fixed a bug where SP-NAS cluster training fails.
    • Fixed bugs such as model copy failures in safe mode.

Introduction

Vega is an AutoML algorithm toolchain developed by Noah's Ark Laboratory. Its main features are as follows:

  1. Full pipeline capabilities: The AutoML capabilities cover key functions such as Hyperparameter Optimization, Data Augmentation, Network Architecture Search (NAS), Model Compression, and Fully Train. These functions are highly decoupled and can be configured as required to construct a complete pipeline.
  2. Industry-leading AutoML algorithms: Provides industry-leading algorithms (Benchmark) developed by Noah's Ark Laboratory and a Model Zoo for downloading state-of-the-art (SOTA) models.
  3. Fine-grained network search space: The network search space can be freely defined, and rich network architecture parameters are provided for use in the search space. Network architecture parameters and model training hyperparameters can be searched at the same time, and the search space can be applied to PyTorch, TensorFlow, and MindSpore.
  4. High-concurrency neural network training capability: Provides high-performance trainers to accelerate model training and evaluation.
  5. Multi-backend support: PyTorch (GPU and Ascend 910), TensorFlow (GPU and Ascend 910), MindSpore (Ascend 910).
  6. Ascend platform: Search and training on the Ascend 910 and model evaluation on the Ascend 310.

Algorithm List

Category | Algorithm | Description | Reference
NAS | CARS: Continuous Evolution for Efficient Neural Architecture Search | Structure search method for multi-objective efficient neural networks based on continuous evolution | ref
NAS | ModularNAS: Towards Modularized and Reusable Neural Architecture Search | A code library for various neural architecture search methods, including weight sharing and network morphism | ref
NAS | MF-ASC | Multi-fidelity neural architecture search with co-kriging | ref
NAS | NAGO: Neural Architecture Generator Optimization | A hierarchical graph-based neural architecture search space | ref
NAS | SR-EA | An automatic network architecture search method for super-resolution | ref
NAS | ESR-EA: Efficient Residual Dense Block Search for Image Super-resolution | Multi-objective image super-resolution based on network architecture search | ref
NAS | Adelaide-EA: SEGMENTATION-Adelaide-EA-NAS | Network architecture search algorithm for image segmentation | ref
NAS | SP-NAS: Serial-to-Parallel Backbone Search for Object Detection | Serial-to-parallel backbone search, an efficient search algorithm for object detection and semantic segmentation trunk network architectures | ref
NAS | SM-NAS: Structural-to-Modular NAS | Two-stage object detection architecture search algorithm | Coming soon
NAS | Auto-Lane: CurveLane-NAS | An end-to-end framework search algorithm for lane lines | ref
NAS | AutoFIS | An automatic feature selection algorithm for recommender system scenarios | ref
NAS | AutoGroup | An algorithm that automatically learns feature interactions for recommender system scenarios | ref
Model Compression | Quant-EA: Quantization based on Evolutionary Algorithm | Automatic mixed-bit quantization algorithm that uses an evolutionary strategy to quantize each layer of a CNN | ref
Model Compression | Prune-EA | Automatic channel pruning algorithm using evolutionary strategies | ref
HPO | ASHA: Asynchronous Successive Halving Algorithm | Dynamic continuous halving algorithm | ref
HPO | BOHB: Hyperband with Bayesian Optimization | Hyperband with Bayesian optimization | ref
HPO | BOSS: Bayesian Optimization via Sub-Sampling | A universal hyperparameter optimization algorithm based on the Bayesian optimization framework, for resource-constrained hyperparameter search | ref
Data Augmentation | PBA: Population Based Augmentation: Efficient Learning of Augmentation Policy Schedules | Data augmentation based on PBT optimization | ref
Data Augmentation | CycleSR: Unsupervised Image Super-Resolution with an Indirect Supervised Path | Unsupervised style-transfer algorithm for low-level vision problems | ref
Fully Train | Beyond Dropout: Feature Map Distortion to Regularize Deep Neural Networks | Neural network training (regularization) based on disturbance of feature maps | ref
Fully Train | Circumventing Outliers of AutoAugment with Knowledge Distillation | Joint knowledge distillation and data augmentation for training high-performance classification models; achieves 85.8% top-1 accuracy on ImageNet-1k | Coming soon

Installation

Run the following command to install Vega:

pip3 install --user --upgrade noah-vega

Usage

Run the vega command to launch a Vega application. For example, the following command runs the CARS algorithm:

vega ./examples/nas/cars/cars.yml

The cars.yml file contains definitions such as pipeline, search algorithm, search space, and training parameters. Vega provides more than 40 examples for reference: Examples, Example Guide, and Configuration Guide.
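Vega can also be launched from a Python script. The following is a minimal sketch, assuming the noah-vega package is installed; it uses the vega.run() entry point that appears in the issue tracebacks below, although the exact API may differ between releases:

    import vega

    # Run the pipeline described by a YAML configuration file.
    # The path below is the CARS example shipped with the repository.
    vega.run("./examples/nas/cars/cars.yml")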

Security mode applies to scenarios with high communication-security requirements. Complete the security configuration before running the following command:

vega ./examples/nas/cars/cars.yml -s

Reference

Reader | Reference
User | Install Guide, Deployment Guide, Configuration Guide, Security Configuration, Examples, Evaluate Service
Developer | Development Reference, Quick Start Guide, Dataset Guide, Algorithm Development Guide

FAQ

For common problems and exception handling, please refer to FAQ.

Citation

@misc{wang2020vega,
      title={VEGA: Towards an End-to-End Configurable AutoML Pipeline},
      author={Bochao Wang and Hang Xu and Jiajin Zhang and Chen Chen and Xiaozhi Fang and Ning Kang and Lanqing Hong and Wei Zhang and Yong Li and Zhicheng Liu and Zhenguo Li and Wenzhi Liu and Tong Zhang},
      year={2020},
      eprint={2011.01507},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Cooperation and Contribution

You are welcome to use Vega. If you have any questions or suggestions, need help, want to fix bugs, contribute new algorithms, or improve the documentation, please submit an issue in the community. We will reply and communicate with you in a timely manner.

vega's People

Contributors

chenboability, cndylan, dawncc, dependabot[bot], dynasty666, emilie1001, fmsnew, forestkey, hasanirtiza, idiomaticrefactoring, lzc06, marsggbo, mengzhibin, qixiuai, shaido987, shuharold, sisrfeng, this-50m, xinyuecai2016, yaohan404, zhangjiajin


vega's Issues

ERROR Illegal alpha.

When I run run_example.py, the output is:
ERROR Illegal alpha.
Then I looked at the source:

    idx = torch.argmax(alpha[start:end, :], dim=1)
    cnt = 0
    if torch.nonzero(idx).size(0) > 2:
        logger.error("Illegal alpha.")

The shape of torch.nonzero(idx) is torch.Size([5, 1]), torch.Size([4, 1]) or torch.Size([3, 1]).
If you want to limit the number of connections, you can use:

    if sum(alpha[start:end, :]) > 2:

thanks
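For context, here is a minimal runnable sketch (shapes assumed for illustration) of what the check above actually counts:

    import torch

    # alpha holds one row of operation weights per edge (shapes assumed here).
    alpha = torch.rand(5, 8)
    idx = torch.argmax(alpha, dim=1)   # shape [5]: one selected op per row
    # torch.nonzero(idx) returns an [N, 1] tensor of the positions where
    # idx != 0, so size(0) counts rows whose argmax is not operation 0 --
    # which matches the torch.Size([5, 1]) / [4, 1] / [3, 1] shapes reported.
    print(torch.nonzero(idx).size())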

Is there no code for guided mutation in ESR-EA, or did I miss it?

Hi, thanks for your awesome work, but I doubt whether the implementation of the EA mutation is correct. The paper says that we should acquire the block credits during the model evaluation procedure, which can be used to guide the mutation to accelerate the search and find better architectures. I find that Vega's implementation is a general mutation. Could you help me? Thanks!

Train Pipeline

I want to run the training pipeline. Can you provide the file /data/2019_mdc_lane/c00523047/mass_storage/culane/CULane/dataset.py?

PRUNE_EA parallel_search error

When I set parallel_search: True in prune.yml, I get this error:

Traceback (most recent call last):
File "", line 1, in
File "/wn/vega/zeus/trainer_base.py", line 153, in train_process
self._train_loop()
File "/wn/vega/zeus/trainer_base.py", line 279, in _train_loop
self.callbacks.before_train()
File "/wn/vega/zeus/trainer/callbacks/callback_list.py", line 139, in before_train
callback.before_train(logs)
File "/wn/vega/vega/algorithms/compression/prune_ea/prune_trainer_callback.py", line 61, in before_train
self.latency_count = calc_forward_latency(self.trainer.model, count_input, sess_config)
File "/wn/vega/zeus/metrics/forward_latency.py", line 30, in calc_forward_latency
step_cfg = UserConfig().data.get("nas")
AttributeError: 'NoneType' object has no attribute 'get'

It all works fine when I set parallel_search: False and run the prune algorithm demo. What's wrong with parallel_search?

[Bug] "Not found serial results"

Training SP-NAS with the default configuration, spnas.yml (the one you provide), breaks in the second phase when it starts training nas2. It complains that it cannot find the file total_list_s.csv. I think the problem is in the variable remote_output_path; with the default settings, the code expects the file to be in the folder nas2, which has not yet been created. Instead, the file is at the following location:

tasks/0726.021123.846/output/total_list_s.csv

whereas the current code (with default params) searches for the file at the following location:

tasks/0726.021123.846/output/nas2/total_list_s.csv

and below is the tail of the error message:

2020-07-26 04:56:45.680 INFO Start pipeline step: [nas2]
vega-0.9.1-py3.7.egg/vega/core/pipeline/pipeline.py", line 58, in run
    PipeStep().do()
 vega/algorithms/nas/sp_nas/spnas_pipe_step.py", line 27, in __init__
    super().__init__()
vega/core/pipeline/nas_pipe_step.py", line 28, in __init__
    self.generator = Generator()
vega/core/pipeline/generator.py", line 25, in __init__
    self.search_alg = SearchAlgorithm(self.search_space)

vega-0.9.1-py3.7.egg/vega/algorithms/nas/sp_nas/sp_nas.py", line 50, in __init__
    ), "Not found serial results!"
AssertionError: Not found serial results!

Fully train of pba

Hi, I have run PBA and obtained some augmentation policies, but I do not know how to fully train the model with the found policy. Could you help me?

NameError: name 'IMAGENET_DEFAULT_MEAN' is not defined

I followed the quick start example and got the following error. Can anyone help?

File "quickstart.py", line 60, in
vega.run("./my.yml")
File "/home/lchen/.conda/envs/clpython/lib/python3.7/site-packages/vega/core/run.py", line 34, in run
_init_env(cfg_path)
File "/home/lchen/.conda/envs/clpython/lib/python3.7/site-packages/vega/core/run.py", line 62, in _init_env
set_backend(General.backend, General.device_category)
File "/home/lchen/.conda/envs/clpython/lib/python3.7/site-packages/vega/core/backend_register.py", line 64, in set_backend
register_pytorch()
File "/home/lchen/.conda/envs/clpython/lib/python3.7/site-packages/vega/core/backend_register.py", line 20, in register_pytorch
import vega.core.trainer.timm_trainer_callback
File "/home/lchen/.conda/envs/clpython/lib/python3.7/site-packages/vega/core/trainer/timm_trainer_callback.py", line 60, in
mean=IMAGENET_DEFAULT_MEAN,
NameError: name 'IMAGENET_DEFAULT_MEAN' is not defined
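For reference, IMAGENET_DEFAULT_MEAN is a normalization constant provided by the timm package; a hedged workaround (not an official fix) is to make sure timm is installed and that the constant is imported where it is used:

    # Workaround sketch, not an official fix: these constants come from timm.
    from timm.data.constants import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD

    print(IMAGENET_DEFAULT_MEAN)  # (0.485, 0.456, 0.406)
    print(IMAGENET_DEFAULT_STD)   # (0.229, 0.224, 0.225)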

Question on "auto_lane_pointlane_codec.py"

Hello! I'm wondering, when the result of (self.points_per_line / self.feature_height) is not an integer, will this line have problems? (By default it is 72/18 = 4.)

center_y = y_list[int(self.points_per_line / self.feature_height) * (self.feature_height - 1 - h)]

should it be center_y = y_list[int((self.points_per_line / self.feature_height) * (self.feature_height - 1 - h))] ?
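The difference can be checked numerically; here is a small sketch with assumed values:

    # With the default 72 / 18 = 4 both forms agree, but with a non-integer
    # ratio they differ, because int() truncates before the multiplication
    # in the current code.
    points_per_line, feature_height, h = 72, 16, 3   # assumed values, ratio 4.5
    current = int(points_per_line / feature_height) * (feature_height - 1 - h)
    suggested = int((points_per_line / feature_height) * (feature_height - 1 - h))
    print(current, suggested)   # 48 vs 54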

where is the dataset.py?

Hello, I get this error when I try to train the auto-lane model with the CULane dataset. There is no dataset.py in the CULane dataset, so where can I get the dataset.py file? Many thanks!
Traceback (most recent call last):
File "", line 1, in
File "/home/haha/.local/lib/python3.7/site-packages/vega/core/trainer/trainer.py", line 152, in train_process
self.build(model=self.model, hps=self.hps, load_ckpt_flag=self.load_ckpt_flag)
File "/home/haha/.local/lib/python3.7/site-packages/vega/core/trainer/trainer.py", line 189, in build
mode='train', loader=train_loader)
File "/home/haha/.local/lib/python3.7/site-packages/vega/core/trainer/trainer.py", line 360, in _init_dataloader
dataset = dataset_cls(mode=mode)
File "/home/haha/.local/lib/python3.7/site-packages/vega/datasets/pytorch/auto_lane_datasets.py", line 97, in init
train=load_module(self.args.dataset_file).create_train_subset(),
File "/home/haha/.local/lib/python3.7/site-packages/vega/datasets/pytorch/common/auto_lane_utils.py", line 214, in load_module
spec.loader.exec_module(mod)
File "", line 724, in exec_module
File "", line 859, in get_code
File "", line 916, in get_data
FileNotFoundError: [Errno 2] No such file or directory: '/cache/dataset/CULane/dataset.py'
2020-09-08 15:47:20.949 INFO {'code': 'r34_48_1-1111-1-22112111111111111111+012-122', 'method': 'random'}

The performance of CARS from example

Hi, thanks for the great work!
I am curious about the output I get from the CARS algorithm in the examples.
I got 86.488 as the best top-1 validation accuracy after running the command given in the README.
Should the accuracy be higher, or do I need to modify cars.yml for better performance?

Potential bug in spnet.py ?

I have seemingly installed vega correctly and I can import it as well.
I have correctly placed the pre-trained models and dataset in the respective folders (cache/models and cache/datasets folder).

I am trying to run the following command inside the examples folder:
python run_example.py nas/sp_nas/spnas.yml

The code breaks with the following error:

    vega/algorithms/nas/sp_nas/spnet/spnet.py, line 636, in __init__
        assert max(out_indices) < num_stages
    TypeError: '<' not supported between instances of 'str' and 'int'

The error message makes it clear that there is a string on one side and an int on the other. When I print both out_indices and num_stages, I see that out_indices is {'__tuple__': True, 'items': [0, 1, 2, 3]} and max(out_indices) simply returns 'items'.
Is it a bug, or is there an issue with my Python or some other library version?
I am using Python 3.7.7
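The reported behaviour can be reproduced in plain Python; a minimal sketch:

    # out_indices was deserialized as a dict instead of a tuple, so max()
    # iterates over the dict's string keys and returns 'items'; comparing that
    # string with the integer num_stages then raises the TypeError shown above.
    out_indices = {'__tuple__': True, 'items': [0, 1, 2, 3]}
    num_stages = 4
    print(max(out_indices))                 # 'items'
    assert max(out_indices) < num_stages    # TypeError: '<' not supported ...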

How to reproduce results of the paper? Please provide config files for the models in the model zoo

I am trying to use the ECP model (spnet_checkpoint_ecp.pth) provided in the model zoo to reproduce the results of the paper.
When I try to execute test.py in spnet/tools/, it complains about a mismatch between the keys of the models.

Without the config files (such as faster_rcnn_r50_fpn_1x.py) I cannot do anything.
I have also tried to use the file you mention in issue #14; the code breaks with an error about a missing attribute, keep_all_stages.

Suggestion: I think it would be great if you could provide a small example of running your pre-trained models.
Thanks

SP-NAS Config file for EuroCity Persons model

I am trying to reproduce numbers for EuroCity Persons using your pre-trained model in the zoo. I have downloaded the ECP model from the zoo; however, the corresponding config file is missing. Can you please provide it?

How to test SP-Nas after training ?

I am sorry, but it is puzzling how to test SP-NAS after training finishes. My problem is that I have trained SP-Net in the examples folder using the following command:
python run_example.py ./nas/sp_nas/spnas.yml

The code uses the config file you provide, /nas/sp_nas/faster_rcnn_r50_fpn_1x.py. It trains fine; during training it evaluates on the validation set, everything looks good, and the mAP is reasonable.
I did not change anything in the code or the .yml file; I basically cloned the repository, set up the dataset paths and the pre-trained model, and that's it.

However, after training finishes I am trying to run the saved model in the folder:
examples/tasks/0719.042952.773/output/2/1112-1112-11111-21-1-11.pth

Using the command
python test.py vega/examples/nas/sp_nas/faster_rcnn_r50_fpn_1x.py --checkpoint examples/tasks/0719.042952.773/output/2/1112-1112-11111-21-1-11.pth --out res.pkl
but it gives basically 0 mAP, which I know is wrong. Do I need to change something in the config file faster_rcnn_r50_fpn_1x.py, or how else can I run the test? Could you please elaborate?

Question about generating the ground truth

When two lane lines are so close that they fall in the same grid, which lane line will the grid respond to?

Another question is about the adaptive score masking:
what do uxf and uyf mean?

Where to find/download the vega-0.9.1-py3-none-any.whl?

Probably because I am new to Vega, I cannot find the wheel vega-0.9.1-py3-none-any.whl. The installation page says to download the vega-0.9.1-py3-none-any.whl file from the release directory.
Where is the release directory, if someone can kindly point it out?
I know we can build from source, but I would like to see the wheel.

step_cfg = UserConfig().data.get("nas")

When I run quant_ea.yaml, it shows that:
Traceback (most recent call last):
File "", line 1, in
File "/root/.local/lib/python3.6/site-packages/zeus/trainer_base.py", line 153, in train_process
self._train_loop()
File "/root/.local/lib/python3.6/site-packages/zeus/trainer_base.py", line 279, in _train_loop
self.callbacks.before_train()
File "/root/.local/lib/python3.6/site-packages/zeus/trainer/callbacks/callback_list.py", line 139, in before_train
callback.before_train(logs)
File "/root/.local/lib/python3.6/site-packages/vega/algorithms/compression/quant_ea/quant_trainer_callback.py", line 62, in before_train
self.latency_count = calc_forward_latency(model, count_input, sess_config)
File "/root/.local/lib/python3.6/site-packages/zeus/metrics/forward_latency.py", line 31, in calc_forward_latency
step_cfg = UserConfig().data.get("nas")
AttributeError: 'NoneType' object has no attribute 'get'

AutoLaneHead forward error

The error is caused by the superclass Module's forward function; during training, AutoLaneHead's forward_train function is not called.

Issues regarding the computation of FLOPS with thop

In model_statistics.py, the FLOPs count is computed with the third-party package thop. In the thop GitHub repo, it is explained that the output of profile is actually MACs instead of FLOPs.

It is even more puzzling with this line of code: self.gflops, self.kparams = flops_count * 1600 * 1e-9, params_count * 1e-3

Multiplying by 1e-9 converts the count to GMACs, but why multiply by 1600?
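For context, here is a minimal sketch of how thop is typically used and of the common MACs-to-FLOPs convention; the additional factor of 1600 in Vega's code is not explained by this and is exactly what the question is about:

    import torch
    import torchvision.models as models
    from thop import profile

    model = models.resnet18()
    dummy_input = torch.randn(1, 3, 224, 224)
    # thop's profile() reports multiply-accumulate operations (MACs), not FLOPs.
    macs, params = profile(model, inputs=(dummy_input,))
    # A common convention is FLOPs ~= 2 * MACs (one multiply plus one add).
    print(macs * 1e-9, "GMACs;", 2 * macs * 1e-9, "GFLOPs (approx.)")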

RuntimeError: Dataset not found or corrupted. You can use download=True to download it

2020-08-28 23:18:40.631 ERROR Failed to run pipeline.
Traceback (most recent call last):
File "/usr/local/python3.7/lib/python3.7/site-packages/vega/core/pipeline/pipeline.py", line 52, in run
PipeStep().do()
File "/usr/local/python3.7/lib/python3.7/site-packages/vega/core/pipeline/nas_pipe_step.py", line 43, in do
self._dispatch_trainer(res)
File "/usr/local/python3.7/lib/python3.7/site-packages/vega/core/pipeline/nas_pipe_step.py", line 73, in _dispatch_trainer
self.master.run(trainer)
File "/usr/local/python3.7/lib/python3.7/site-packages/vega/core/scheduler/local_master.py", line 42, in run
worker.train_process()
File "/usr/local/python3.7/lib/python3.7/site-packages/vega/core/trainer/trainer.py", line 152, in train_process
self.build(model=self.model, hps=self.hps, load_ckpt_flag=self.load_ckpt_flag)
File "/usr/local/python3.7/lib/python3.7/site-packages/vega/core/trainer/trainer.py", line 189, in build
mode='train', loader=train_loader)
File "/usr/local/python3.7/lib/python3.7/site-packages/vega/core/trainer/trainer.py", line 360, in _init_dataloader
dataset = dataset_cls(mode=mode)
File "/usr/local/python3.7/lib/python3.7/site-packages/vega/datasets/pytorch/cifar10.py", line 41, in init
transform=Compose(self.transforms.transform), download=self.args.download)
File "/root/.local/lib/python3.7/site-packages/torchvision/datasets/cifar.py", line 67, in init
raise RuntimeError('Dataset not found or corrupted.' +
RuntimeError: Dataset not found or corrupted. You can use download=True to download it
2020-08-28 23:18:40.631 ERROR None

I tried to use the CARS algorithm to search on the CIFAR-10 dataset. The dataset path is set correctly, but the preceding error occurs. Why?
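For what it is worth, the final frames of the traceback come from torchvision rather than Vega; here is a minimal sketch of the underlying call (path assumed), which either finds the extracted cifar-10-batches-py folder under root or downloads it when download=True:

    import torchvision
    import torchvision.transforms as transforms

    dataset = torchvision.datasets.CIFAR10(
        root="/cache/datasets/cifar10",   # assumed local path
        train=True,
        transform=transforms.ToTensor(),
        download=True,                    # fetch the archive if it is missing
    )
    print(len(dataset))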

Why VEGA?

Thanks for open-sourcing and continuously maintaining VEGA. Why the name VEGA? Is there a story behind it? :)

Can the user set worker_path in the config?

Every time you run pipeline.py, the worker_path used to save the model and TensorBoard logs is randomly generated, such as 0126.091415.578 or 0126.090933.510, which is very inconvenient. Can a fixed worker_path be set in the yml file?
In addition, within one training task I found that some parameters are set in the yml file and some are set in a config .py file, which is quite inconsistent. As a big company, I think you should standardize the code framework. Thanks.

Some code is missing in quick_start.md

In vega/docs/cn/developer/quick_start.md:

    @NetworkFactory.register(NetTypes.CUSTOM)

        def __init__(self, desc):
            super(SimpleCnn, self).__init__()

it may be missing this code:

    class SimpleCnn(nn.Module):
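Here is a minimal sketch of how the fragment probably fits together, with the class line the issue suggests; the Vega decorator is left as a comment because its import path is not shown here, and the network body is a placeholder:

    import torch.nn as nn

    # @NetworkFactory.register(NetTypes.CUSTOM)   # decorator from quick_start.md
    class SimpleCnn(nn.Module):                   # the line the issue says is missing
        def __init__(self, desc):
            super(SimpleCnn, self).__init__()
            self.desc = desc                      # placeholder body (assumed)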

Inference for auto lane

Could you tell me how to run inference.py to run inference with the auto-lane model?
In model_zoo.md, only the model and desc files are provided, not the inference code.
I tried to use vega/model_zoo/inference.py but don't know how to set the data_type and data_path.

If you can provide documentation for auto-lane inference, it would help a lot.

[URL fail] Missing URLs for pre-trained model sources

I found that we cannot download the pre-trained models that Vega provides on this page: https://github.com/huawei-noah/vega/blob/master/docs/cn/user/examples.md.
Can you fix the URL links of the pre-trained models or provide other ones?

run_cluster_horovod_train.sh: No such file or directory

When I try to solve this problem for the esr_ea algorithm in the way described in #84, the error shows: "/root/.local/lib/python3.6/site-packages/vega/core/pipeline/horovod/run_cluster_horovod_train.sh: No such file or directory". Could you tell me how to deal with it?

How to run auto_lane

I want to run auto_lane according to the user guide but get errors. The configuration files have changed, but the recommended command has not.
How do I configure the right files for lane detection?
How do I register the class names before using them?
How can I study the CurveLane-NAS algorithm step by step?
For example,
how should the yaml file in a given directory be configured, and how should the weights between the background and the lane lines be set? (0.4 vs 1.0 may not fit some models; in my experiments with SCNN, the mIoU is only about 0.105 when the background weight is set to 0.4 and the lane-line weights are set to 1.0, so it is not possible to use an mIoU of 0.5 to decide whether the corresponding lane line exists.
Please give the detailed parameters you used when training and testing on the CurveLanes dataset.)
.........

I have read the user guide carefully, but I do not know how to handle the questions above.
Please address these problems or update the corresponding guide.

yml file of the conda environment ?

Is it possible for you to share the yml file of your conda environment? I am running into some version issues with pip.

Failed to save desc

I cannot find where this log is saved and don't know how to analyze it.
Can you help me?

Failed to save desc, file=/home/mengzhibin/vega/tasks/1105.182606.151/workers/nas/4/desc_4.json, desc={'detector': {'name': 'AutoLaneDetector', 'modules': ['backbone', 'neck', 'head'], 'num_class': 2, 'method': 'random', 'code': 'x50(2x24d)_48_112111-211112-1-1+122-022', 'backbone': {'name': 'ResNeXtVariantDet', 'arch': '112111-211112-1-1', 'base_depth': 50, 'base_channel': 48, 'groups': 2, 'base_width': 24, 'num_stages': 4, 'strides': (1, 2, 2, 2), 'dilations': (1, 1, 1, 1), 'out_indices': (0, 1, 2, 3), 'frozen_stages': -1, 'zero_init_residual': False, 'norm_cfg': {'type': 'BN', 'requires_grad': True}, 'conv_cfg': {'type': 'Conv'}, 'out_channels': [384, 1536, 1536, 1536], 'style': 'pytorch'}, 'neck': {'arch_code': '122-022', 'name': 'FeatureFusionModule', 'in_channels': [384, 1536, 1536, 1536]}, 'head': {'base_channel': 1792, 'num_classes': 2, 'up_points': 73, 'down_points': 72, 'name': 'AutoLaneHead'}, 'limits': {'GFlops': 1}}, 'modules': ['detector']}, msg=local variable 'value' referenced before assignment Failed to save performance, file=/home/mengzhibin/vega/tasks/1105.182606.151/workers/nas/4/performance_4.json, desc={'LaneMetric': 0.0}, msg=local variable 'value' referenced before assignment

[Bug] Problem in SP-NAS (fullytrain) ERROR Failed to load records from model folder.

Hi,

I am trying to train the full pipeline [nas1, nas2, fullytrain] of SP-NAS. I did not change anything except one line in spnas.yml, that is, I changed:

pipeline: [nas1] to pipeline: [nas1, nas2, fullytrain]

It trains fine for nas1 and nas2. However, the code then breaks, complaining that it cannot find records. This is the error trail.
Can you suggest a quick fix?

2020-09-24 10:08:23.81 INFO performance save to vega/examples/tasks/0924.025954.103/workers/nas2/11/performance
2020-09-24 10:08:24.275 INFO Latest checkpoint save to vega/examples/tasks/0924.025954.103/output/11
2020-09-24 10:08:24.276 INFO update generator, step name: nas2, worker id: 11
2020-09-24 10:08:24.277 INFO SpNas.update(), performance file=vega/examples/tasks/0924.025954.103/workers/nas2/11/performance/performance.pkl
2020-09-24 10:08:24.321 INFO Start pipeline step: [fullytrain]
2020-09-24 10:08:24.322 INFO init FullyTrainPipeStep...
2020-09-24 10:08:24.322 INFO FullyTrainPipeStep started...
2020-09-24 10:08:24.324 ERROR Failed to load records from model folder, folder=vega/examples/tasks/0924.025954.103/output/nas2
2020-09-24 10:08:24.324 WARNING Failed to dump records, report is emplty.

The output/nas2 folder is never created by the code.

The losses of auto lane are not converging on the CurveLanes dataset

I tested the loss terms in vega/search_space/networks/pytorch/detectors/auto_lane_detector.py:

    image = input
    loc_targets = kwargs['gt_loc']
    cls_targets = kwargs['gt_cls']

    feat = self.extract_feat(image)
    predict = self.head(feat)

    loc_preds = predict['predict_loc']
    cls_preds = predict['predict_cls']
    cls_targets = cls_targets[..., 1].view(-1)
    pmask = cls_targets > 0
    nmask = ~ pmask
    fpmask = pmask.float()
    fnmask = nmask.float()
    cls_preds = cls_preds.view(-1, cls_preds.shape[-1])
    loc_preds = loc_preds.view(-1, loc_preds.shape[-1])
    loc_targets = loc_targets.view(-1, loc_targets.shape[-1])
    total_postive_num = torch.sum(fpmask)
    total_negative_num = torch.sum(fnmask)  # Number of negative entries to select
    negative_num = torch.clamp(total_postive_num * self.NEGATIVE_RATIO, max=total_negative_num, min=1).int()
    positive_num = torch.clamp(total_postive_num, min=1).int()
    # cls loss begin
    bg_fg_predict = F.log_softmax(cls_preds, dim=-1)
    fg_predict = bg_fg_predict[..., 1]
    bg_predict = bg_fg_predict[..., 0]
    max_hard_pred = find_k_th_small_in_a_tensor(bg_predict[nmask].detach(), negative_num)
    fnmask_ohem = (bg_predict <= max_hard_pred).float() * nmask.float()
    total_cross_pos = -torch.sum(self.ALPHA * fg_predict * fpmask)
    total_cross_neg = -torch.sum(self.ALPHA * bg_predict * fnmask_ohem)
    # class loss end
    # regression loss begin
    length_weighted_mask = torch.ones_like(loc_targets)
    length_weighted_mask[..., self.LANE_POINTS_NUM_DOWN] = 10
    valid_lines_mask = pmask.unsqueeze(-1).expand_as(loc_targets)
    valid_points_mask = (loc_targets != 0)
    unified_mask = length_weighted_mask.float() * valid_lines_mask.float() * valid_points_mask.float()
    smooth_huber = huber_fun(loc_preds - loc_targets) * unified_mask
    loc_smooth_l1_loss = torch.sum(smooth_huber, -1)
    point_num_per_gt_anchor = torch.sum(valid_points_mask.float(), -1).clamp(min=1)
    total_loc = torch.sum(loc_smooth_l1_loss / point_num_per_gt_anchor)
    # regression loss end
    total_cross_pos = total_cross_pos / positive_num
    total_cross_neg = total_cross_neg / positive_num
    total_loc = total_loc / positive_num

on the CurveLanes dataset, using the provided optimizer parameters, such as lr=0.02, weight_decay=1e-4, momentum=0.9, etc.,
and built a model structure just like the one in the readme file "https://github.com/huaweinoah/vega/blob/master/docs/en/algorithms/auto_lane.md".

However the loss values are pretty large:
loss_pos = 36.0+
loss_neg = 10.+
loss_loc = 100.0+

and they did not converge after 12 epochs, which is mentioned in this paper: https://arxiv.org/abs/2007.12147

Are the loss terms in auto_lane_detector.py wrong, or am I missing some important steps?
