
models' Introduction

MindSpore Logo

Welcome to the Model Zoo for MindSpore

The MindSpore models repository provides implementations of classic SOTA models across different task domains, together with end-to-end solutions. Its purpose is to make it easier for MindSpore users to carry out research and product development with MindSpore.

To help developers enjoy the benefits of the MindSpore framework, we will continue to add typical networks and related pre-trained models. If you have requests for the model zoo, please file an issue on Gitee or MindSpore; we will consider it in a timely manner.

Directory Description
official • A collection of SOTA models implemented with the latest MindSpore API
• Maintained by the MindSpore team
research • A collection of research models implemented by researchers and institutions
• Maintained by researchers and institutions
community • A list of GitHub/Gitee repositories of toolkits/models powered by MindSpore, with the supported MindSpore versions noted in each README
• Model files are not necessarily provided

WHAT IS NEW

  • We have refactored the classic SOTA models, modularizing data processing, model definition and creation, the training process, and other common components with the newly launched MindSpore CV/NLP/Audio/Yolo/OCR series toolboxes. link.

  • Older models were implemented with the original MindSpore API, along with tricks to speed up model training.

Disclaimers

MindSpore only provides scripts that download and preprocess public datasets. We do not own these datasets and are not responsible for their quality or maintenance. Please make sure you have permission to use each dataset under its license. Models trained on these datasets are for non-commercial research and educational purposes only.

To dataset owners: if you do not want your dataset included in MindSpore, or wish to update it in any way, we will remove or update all public content upon request. Please contact us through a GitHub/Gitee issue. Your understanding of, and contribution to, this community are greatly appreciated.

MindSpore is Apache 2.0 licensed. Please see the LICENSE file.

License

Apache License 2.0

FAQ

For more information about the MindSpore framework, please refer to the FAQ.

  • Q: How do I resolve out-of-memory errors, such as Failed to alloc memory pool memory, when using a model directly under "models"?

    A: The typical reasons for insufficient memory when directly using models under "models" are differences in operating mode (PYNATIVE_MODE), operating environment configuration, and license control (AI-TOKEN).

    • PYNATIVE_MODE usually uses more memory than GRAPH_MODE, especially in training graphs that require back-propagation. There are two ways to mitigate this. Method 1: try a smaller batch size. Method 2: add context.set_context(mempool_block_size="XXGB"), where the current maximum effective value of "XX" is "31". Combining methods 1 and 2 works even better; see the sketch after this list.
    • The operating environment can also cause similar problems due to different configurations of NPU cores, memory, etc.
    • Different tiers of license control (AI-TOKEN) incur different memory overhead during execution. You can also try a smaller batch size here.
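
    A minimal sketch of combining the two methods; context.set_context is the standard MindSpore API, while the batch size below is an illustrative assumption, not a recommendation:

        import mindspore.context as context

        # Method 2: enlarge the memory pool block size before building the
        # network; "31GB" is the current maximum effective value noted above.
        context.set_context(mode=context.GRAPH_MODE, device_target="Ascend",
                            mempool_block_size="31GB")

        # Method 1: also reduce the batch size when batching the dataset,
        # e.g. (batch_size=8 is an illustrative value):
        # dataset = dataset.batch(batch_size=8, drop_remainder=True)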
  • Q: How do I resolve errors about unsupported interfaces in some network operations, such as cannot import?

    A: Please check your MindSpore version and the branch from which you fetched the model zoo scripts. Some model scripts on the latest branch use new interfaces from the latest version of MindSpore.

  • Q: What is the RANK_TABLE_FILE mentioned in many models?

    A: RANK_TABLE_FILE is the configuration file of the Ascend cluster used when running distributed training. For more information, refer to the generator hccl_tools and the Parallel Distributed Training Example; a sketch of the file's structure follows.
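
    For reference, a minimal sketch of the structure for a single server with two Ascend devices. All IP addresses and IDs below are illustrative assumptions; hccl_tools generates the real file for your environment:

        import json

        # Illustrative rank table (version 1.0 format): 2 devices, 1 server.
        rank_table = {
            "version": "1.0",
            "server_count": "1",
            "server_list": [
                {
                    "server_id": "10.0.0.1",  # assumption: host IP
                    "device": [
                        {"device_id": "0", "device_ip": "192.1.27.6", "rank_id": "0"},
                        {"device_id": "1", "device_ip": "192.1.27.7", "rank_id": "1"},
                    ],
                    "host_nic_ip": "reserve",
                }
            ],
            "status": "completed",
        }

        with open("rank_table_2pcs.json", "w") as f:
            json.dump(rank_table, f, indent=4)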

  • Q: How do I run the scripts on a Windows system?

    A: Most of the start-up scripts are written in bash, which usually cannot be run directly on Windows. You can try starting Python directly without the bash scripts. If you really need the bash start-up scripts, we suggest the following methods to get a bash environment on Windows (see also the sketch after this list):

    1. Use a virtual machine or a Docker container with a Linux system, and run the scripts inside it.
    2. Use WSL: turn on the Windows Subsystem for Linux to obtain a Linux system that can run the bash scripts.
    3. Use bash tools for Windows, such as Cygwin or Git Bash.
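
    A minimal sketch of starting Python directly: replicate what a typical start-up bash script does by exporting its environment variables and invoking the training entry point. The DEVICE_ID value and the train.py arguments are illustrative assumptions; read the bash script of your chosen model for the exact values:

        import os
        import subprocess
        import sys

        # Set the environment variables the bash script would export.
        os.environ["DEVICE_ID"] = "0"

        # Invoke the training entry point with the same arguments the
        # bash script would pass (the flag here is an assumption).
        subprocess.run([sys.executable, "train.py", "--device_target=GPU"],
                       check=True)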
  • Q: How do I resolve the compile error pointing to gflags when running inference on Ascend 310, with errors such as undefined reference to 'google::FlagRegisterer::FlagRegisterer'?

    A: Please check the versions of GCC and gflags. You can refer to GCC and gflags to install them. You need to ensure that the components used are ABI compatible; for more information, please refer to _GLIBCXX_USE_CXX11_ABI.

  • Q: How do I resolve the error when loading a dataset in MindRecord format on macOS, such as Invalid file, failed to open files for reading mindrecord files.?

    A: Please check the system limits with ulimit -a. If the number of open file descriptors is 256 (the default), use ulimit -n 1024 to raise it to 1024 (or larger). Then check whether the file is damaged or has been modified.
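
    If you prefer to raise the limit from inside the Python process instead of the shell, a minimal sketch using the standard resource module (1024 is the same illustrative target as above):

        import resource

        # Query the current soft/hard limits on open file descriptors and
        # raise the soft limit to 1024, capped by the hard limit.
        soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        if soft < 1024:
            resource.setrlimit(resource.RLIMIT_NOFILE, (min(1024, hard), hard))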

  • Q: What should I do if I can't reach the reported accuracy when training with several servers instead of a single server?

    A: Most of the models have only been trained on a single server with at most 8 devices. Because the batch_size used in MindSpore represents only the batch size of a single GPU/NPU, the global_batch_size increases when training with multiple servers. A different global_batch_size requires different hyperparameters, including learning_rate, so you have to re-tune these hyperparameters when training with multiple servers.
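
    A common starting point (a widely used heuristic, not a rule from this repository) is to scale the learning rate linearly with the global batch size; all numbers below are illustrative assumptions:

        # Linear learning-rate scaling: multiply the single-server learning
        # rate by the ratio of the new to the old global batch size.
        base_lr = 0.1                 # learning rate tuned on a single server
        base_global_batch = 8 * 32    # 8 devices x per-device batch_size 32
        new_global_batch = 64 * 32    # 64 devices x per-device batch_size 32

        scaled_lr = base_lr * new_global_batch / base_global_batch  # 0.8
        print(scaled_lr)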


models' Issues

How to load the model's parameters during prediction when training used data parallelism plus optimizer parallelism? (PanGu-Alpha)

Task Description

How to load the PanGu-Alpha model's parameters during prediction when training used data parallelism plus optimizer parallelism?

Task Goal

The MindSpore tutorial and course give several instructions on how to use a distributed model for training and prediction (model loading), but those instructions only cover data parallelism and automatic parallelism. Following them, there is only one generated checkpoint file, so loading the model during prediction is straightforward. However, I cannot find any instructions on how to load the model if it was trained with data parallelism plus optimizer parallelism. In that case each card generates its own checkpoint file, and I am not sure which one should be loaded during prediction. For example, if I use 64 cards to train the model and want to use 1 card or 8 cards for prediction, there are multiple checkpoint files; which one should I select?
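
One possible direction (a sketch under stated assumptions, not an official answer from the maintainers) is MindSpore's load_distributed_checkpoint, which merges per-card sliced checkpoints according to the parallel strategy saved during training. All paths below are illustrative, and build_pangu_alpha is a hypothetical helper standing in for the real network construction:

    import mindspore as ms

    # Sketch: load 64 per-card checkpoints into a network for prediction,
    # merging slices according to the training strategy file saved during
    # training. Paths and the network builder are assumptions.
    net = build_pangu_alpha()  # hypothetical helper

    ckpt_files = ["./ckpt/rank_{}/pangu.ckpt".format(i) for i in range(64)]
    ms.load_distributed_checkpoint(
        network=net,
        checkpoint_filenames=ckpt_files,
        train_strategy_filename="./train_strategy.ckpt",
        predict_strategy=None,  # None implies standalone (single-card) prediction
    )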

Question about Figure 4

Hello, author. Could you please provide the code for Figure 4? Thank you very much.

Any update on the training code?

Thanks for your great work.

I have been trying to reproduce your work [Semi-Supervised Domain Adaptation based on Dual-level Domain Mixing for Semantic Segmentation (DDM)], but it seems that I'm missing a few important parts.
Is there any plan to provide the training code and procedure?

RuntimeError: For 'Reshape', the size of 'input_x': {3456} is not equal to the size of the first output: {5760}

I used the dataset you provided, but I can't train. How can I solve this problem?

root@0563a279aa9b:/data# DEVICE_ID=0 python train.py
Start time : 2022-09-22 08:07:09

infos : {'dataset_path': './dataset/', 'backbone_pretrained': './src/model/res2net_pretrained.ckpt', 'dataset_train': 'PASCAL_SBD', 'datasets_val': ['GrabCut', 'Berkeley'], 'epochs': 33, 'train_only_epochs': 32, 'val_robot_interval': 1, 'lr': 0.007, 'batch_size': 8, 'max_num': 0, 'size': (384, 384), 'device': 'CPU', 'num_workers': 4, 'itis_pro': 0.7, 'max_point_num': 20, 'record_point_num': 5, 'pred_tsh': 0.5, 'miou_target': [0.9, 0.9], 'resume': None, 'snapshot_path': './snapshot'}

Traceback (most recent call last):
  File "train.py", line 35, in <module>
    mine = Trainer(p)
  File "/data/src/trainer.py", line 111, in __init__
    size=p["size"][0], backbone_pretrained=p["backbone_pretrained"]
  File "/data/src/model/fcanet.py", line 295, in __init__
    resnet.load_pretrained_model(backbone_pretrained)
  File "/data/src/model/res2net.py", line 267, in load_pretrained_model
    tmp[:, :3, :, :] = parameter_dict["conv1_0.weight"]
  File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/common/tensor.py", line 344, in __setitem__
    out = tensor_operator_registry.get('__setitem__')(self, index, value)
  File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py", line 67, in _tensor_setitem
    return tensor_setitem_by_tuple(self, index, value)
  File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py", line 803, in tensor_setitem_by_tuple
    return tensor_setitem_by_tuple_with_tensor(self, index, value)
  File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py", line 956, in tensor_setitem_by_tuple_with_tensor
    tuple_index, value, idx_advanced = remove_expanded_dims(tuple_index, F.shape(data), value)
  File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py", line 1156, in remove_expanded_dims
    value = F.reshape(value, value_shape)
  File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/ops/function/array_func.py", line 857, in reshape
    return reshape_(input_x, input_shape)
  File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 294, in __call__
    return _run_op(self, self.name, args)
  File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/common/api.py", line 98, in wrapper
    results = fn(*arg, **kwargs)
  File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 748, in _run_op
    output = real_run_op(obj, op_name, args)
RuntimeError: For 'Reshape', the size of 'input_x': {3456} is not equal to the size of the first output: {5760}


  • C++ Call Stack: (For framework developers)

mindspore/ccsrc/plugin/device/cpu/kernel/memcpy_cpu_kernel.cc:37 Launch
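
A diagnostic sketch, not a confirmed fix: the mismatched element counts (3456 vs. 5760) mean the pretrained conv1_0.weight and the target slice disagree in shape, so printing both shapes at the failing line in res2net.py is a reasonable first step. The names below mirror the traceback:

    # In load_pretrained_model, before the failing assignment:
    weight = parameter_dict["conv1_0.weight"]
    print("checkpoint conv1_0.weight shape:", weight.shape)
    print("target slice shape:", tmp[:, :3, :, :].shape)
    # If the shapes differ, the checkpoint does not match the network
    # definition (e.g. a different backbone variant or channel count).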

Run-Time and Memory Measurement

(Regarding eppmvsnet)
Hi,
I am trying to measure the runtime and memory usage of a set of methods, as Table 3 in your paper shows, but I didn't get the same numbers. Could you provide more details on how you measured them? Thanks!

cv/FDA-BNN missing files

Thanks for the awesome work, but some files are missing in cv/FDA-BNN, such as the trainer and config files.

Are there any plans to upload these files?

Question about the code implementation of Eq. 11 in Section 3.3 of the paper

Hello, author. Eq. 11 in Section 3.3 of the original paper contains two fully connected layers, and the second one also has a skip connection. However, the code implementation differs from the paper: lines 252-261 of the construct member function of the AutoDisModel class in autodis.py implement the AutoDis embedding, and for Eq. 11 they use only one fully connected layer. May I ask the reason for this?
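
For reference, a minimal sketch of the structure the issue describes for Eq. 11 (two fully connected layers, the second wrapped in a skip connection). This is a reading of the paper as quoted in the issue, not the AutoDis authors' code, and all layer names and shapes are hypothetical:

    import mindspore.nn as nn

    class Eq11Sketch(nn.Cell):
        """Two dense layers; the second has a skip connection."""
        def __init__(self, dim):
            super().__init__()
            self.dense1 = nn.Dense(dim, dim)
            self.dense2 = nn.Dense(dim, dim)

        def construct(self, x):
            h = self.dense1(x)
            return self.dense2(h) + h  # skip connection around the second layer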
