Neural Architecture Optimization

License: GNU General Public License v3.0

Python 97.76% Shell 2.24%

nao's Introduction

Neural Architecture Optimization

This is the code for the paper Neural Architecture Optimization.

Authors: Renqian Luo*, Fei Tian*, Tao Qin, En-Hong Chen, Tie-Yan Liu. *=equal contribution

NEW:

For a PyTorch implementation of the CNN part, please visit NAO_pytorch, which includes the results on ImageNet.

License

The code and models in this repo are released under the GNU GPLv3 license.

Citation

If you find this work helpful in your research, please use the following BibTeX entry to cite our paper.

@inproceedings{NAO,
  title={Neural Architecture Optimization},
  author={Renqian Luo and Fei Tian and Tao Qin and En-Hong Chen and Tie-Yan Liu},
  booktitle={Advances in neural information processing systems},
  year={2018}
}

This is not an official Microsoft product.

Requirements and Dependencies

Tensorflow >= 1.4.0

Pytorch == 0.3.1

CIFAR-10

With Weight Sharing

To Search Architectures

To search the CNN architectures for CIFAR-10 with weight sharing, please refer to:

Script	Data	GPU	Search Time
./NAO-WS/cnn/train_search.sh	`Google Drive` `Baidu Pan`	1 V100	7.5 hours

cd NAO-WS/cnn
bash train_search.sh

Once the search is done, the final pool of architectures will be in models/child/arch_pool. You can pick the top-5 architectures and train each of them with train_final.sh, passing the architecture in through the fixed_arc argument.
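The selection step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes the pool file lists one architecture per line with the best candidates first, which this README does not confirm, and read_top_k is a hypothetical helper that is not part of the repository.

```python
def read_top_k(pool_path, k=5):
    """Return the first k non-empty lines of an architecture pool file.

    Assumption: one architecture per line, best candidates first.
    """
    archs = []
    with open(pool_path) as f:
        for line in f:
            if len(archs) >= k:
                break
            line = line.strip()
            if line:
                archs.append(line)
    return archs
```

Each returned string could then be supplied as the fixed_arc value for one run of train_final.sh.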

To obtain the best architecture, we performed a grid search over the hyper-parameters of the top-5 discovered architectures.

To Train Discovered Architectures

To train a fixed CNN architecture, for example, our best architecture discovered, please refer to:

Script	GPU	Time	Model Checkpoint	Parameter Size	Error Rate
./NAO-WS/cnn/train_final.sh	1 P40	42 hours	`Google Drive` `Baidu Pan`	2.5M	3.50

and run:

cd NAO-WS/cnn
bash train_final.sh

If you want to run it with cutout, add --child_cutout_size=16 to the script.
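For readers unfamiliar with cutout, the following is a minimal NumPy sketch of the usual formulation: a square patch of the given size, centred at a random pixel, is zeroed out. It is not taken from this repository; the function name cutout and its signature are assumptions, and the repository's own handling of child_cutout_size may differ in detail.

```python
import numpy as np

def cutout(image, size=16, rng=None):
    """Zero a size x size square centred at a random pixel of an HWC image.

    The square is clipped at the image border, so the masked area can be
    smaller than size x size near the edges.
    """
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    cy, cx = int(rng.integers(h)), int(rng.integers(w))
    y0, y1 = max(cy - size // 2, 0), min(cy + size // 2, h)
    x0, x1 = max(cx - size // 2, 0), min(cx + size // 2, w)
    out = image.copy()          # leave the input untouched
    out[y0:y1, x0:x1] = 0
    return out
```

On a 32x32 CIFAR-10 image with size=16, the mask covers at most a quarter of the pixels.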

To Directly Evaluate an Architecture

To directly evaluate an architecture, for example, our best architecture discovered, please download the checkpoint above, move all the files to the NAO-WS/cnn/models folder and run:

cd NAO-WS/cnn
bash test_final.sh    #This should give you an accuracy of 96.50% (error rate of 3.50%) without cutout

Without Weight Sharing

To Search Architectures

Please refer to details in ./NAO/README.md

To Train Discovered Architectures

Please download the data from Google Drive or Baidu Pan.

You can train the best architecture discovered (shown in Fig. 1 in the Appendix of the paper) using:

Dataset	Script	GPU	Time	Checkpoint	Error Rate (Test)
CIFAR-10	./NAO/cnn/train_cifar10_final.sh	2 P40	5 days	`Google Drive` `Baidu Pan`	2.10%
CIFAR-100	./NAO/cnn/train_cifar100_final.sh	2 P40	5 days	`Google Drive` `Baidu Pan`	14.80%
Imagenet	refer to `NAO_pytorch`	4 P40	6 days	TBD	25.70%

by running:

cd NAO/cnn
bash train_cifar10_final.sh
bash train_cifar100_final.sh

To Directly Evaluate an Architecture

To directly evaluate an architecture, for example, our best architecture discovered, please download the checkpoint above, move all the files to NAO/cnn/models/cifar10 or NAO/cnn/models/cifar100, and run:

cd NAO/cnn
bash test_cifar10.sh     #This should give you an accuracy of 97.94% (error rate of 2.06%)
bash test_cifar100.sh    #This should give you an accuracy of 85.20% (error rate of 14.80%)

PTB

To Search Architectures

To search the RNN architectures for PTB with weight sharing, please refer to:

Script	GPU	Search Time
./NAO-WS/rnn/train_search.sh	1 V100	8 hours

cd NAO-WS/rnn
bash train_search.sh

Once the search is done, the final pool of architectures will be in models/child/arch_pool. You can pick the top-10 architectures and train each of them with train_final.sh, passing the architecture in through the arch argument.

To Train Discovered Architectures

To train a fixed RNN architecture, for example, our best architecture discovered, please refer to:

Script	Model Checkpoint	GPU	Time	PPL (Test)
./NAO-WS/rnn/train_final.sh	`Google Drive` `Baidu Pan`	1 V100	4 days	56.80

cd NAO-WS/rnn
bash train_final.sh   #This should give you a test ppl of 56.66 at the end of training

To Directly Evaluate an Architecture

To directly evaluate an architecture, for example, our best architecture discovered, please download the checkpoint above, move all the files to the NAO-WS/rnn/models folder and run:

cd NAO-WS/rnn
bash test_final.sh    #This should give you a test ppl of 56.66

Without Weight Sharing

To Search Architectures

Please refer to details in NAO/README.md

To Train Discovered Architectures

You can train the best architecture discovered (shown in Fig. 2 in the Appendix of the paper) using:

Dataset	Script	GPU	Time	Checkpoint	PPL (Test)
PTB	./NAO/rnn/train_ptb_final.sh	1 V100	4 days	`Google Drive` `Baidu Pan`	56.02
WikiText-2	./NAO/rnn/train_wt2_final.sh	1 V100	4 days	`Google Drive` `Baidu Pan`	67.10

To Directly Evaluate an Architecture

To directly evaluate an architecture, for example, our best architecture discovered, please download the checkpoint above, move all the files to NAO/rnn/models/ptb or NAO/rnn/models/wt2, and run:

cd NAO/rnn
bash test_ptb.sh    #This should give you a test ppl of 56.02
bash test_wt2.sh    #This should give you a test ppl of 67.10

Acknowledgements

We thank Hieu Pham for the discussion on some details of the ENAS implementation, and Hanxiao Liu for the code base of the language modeling task in DARTS. We furthermore thank the anonymous reviewers for their constructive comments.

nao's Issues

Error: Key basicdecoderstep/dense/kernel not found in checkpoint

I am trying to run the CNN code without weight sharing. Training runs fine, but this error shows up repeatedly during prediction.
I would appreciate any help.

test_final.sh no child/T_i

I want to try the best architecture.

I use the models from Google Drive

and run:
cd NAO-WS/cnn
bash train_final.sh

log:
NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from
the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Key child/T_i not found in checkpoint
[[node save/RestoreV2 (defined at test.py:89) ]]

I can't reproduce the architecture search process following the default configuration in this code.

Hi @renqianluo ,

Recently, I have been trying to reproduce your work. I used the code and reproduced the test process. It's great! But I cannot reproduce the architecture search process with the default configuration to obtain an architecture or performance similar to NAO_WS. Can you give me some advice on reproducing it?

Is it possible to search a network in parallel with weight sharing?

The papers using weight-sharing (WS) methods (ENAS, NAO) run their experiments on a single GPU. Are there problems with parallel training?

Training final does not work

Hello,

Thank you for open-sourcing the code. Final training does not work with NAO-WS/rnn/train_final.sh. It seems there are a few bugs.

Thanks

AttributeError: type object 'scipy.interpolate.interpnd.array' has no attribute '__reduce_cython__'

When I ran your code, I got an AttributeError. Is it caused by the wrong version of scipy? Which version do I need? Sincerely yours.

Request for search data

Hi,

First of all, thank you very much for open sourcing your code, it has been very helpful and I will definitely make sure to cite it in my work.
I was wondering if it would be possible for you to release the 1000x evaluations obtained during the search process for CIFAR10? I am particularly curious about the performance of NAOnet during the search process to have an idea about the gap between performance during search vs. final training performance.

Thanks,
Felipe

Is parameter N in train_cifar10_search.sh the same as N in paper?

Hey, I'm still new to this area, but it confuses me that the parameter N in train_cifar10_search.sh is multiplied by 3 when generating the model. However, in the paper, it seems that N is the total number of normal convolution cells used to construct the final model. So I'm wondering whether these two Ns are the same, or whether N in the code equals 3 * N in the paper?

InvalidArgumentError (see above for traceback): Default AvgPoolingOp only supports NHWC.

I followed your Requirment and Dependency section (Tensorflow == 1.4.0, Pytorch == 0.3.1), but I got the following error:

InvalidArgumentError (see above for traceback): Default AvgPoolingOp only supports NHWC.
[[Node: child_1/layer_0/cell_4/x_conv/average_pooling2d/AvgPool = AvgPoolT=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 3, 3], padding="SAME", strides=[1, 1, 1, 1], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Is my configuration wrong?

I also tried tensorflow-gpu, and it does not work either.
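The error above reports an average-pooling node with data_format="NCHW" placed on a CPU device, where only NHWC is supported. As a layout illustration only (NumPy rather than TensorFlow, with hypothetical helper names), the two formats differ by an axis permutation:

```python
import numpy as np

def nchw_to_nhwc(x):
    """Move the channel axis last: (N, C, H, W) -> (N, H, W, C)."""
    return np.transpose(x, (0, 2, 3, 1))

def nhwc_to_nchw(x):
    """Move the channel axis second: (N, H, W, C) -> (N, C, H, W)."""
    return np.transpose(x, (0, 3, 1, 2))

batch = np.arange(2 * 3 * 8 * 8, dtype=np.float32).reshape(2, 3, 8, 8)
print(nchw_to_nhwc(batch).shape)  # (2, 8, 8, 3)
```

Whether the repository exposes a switch for the data format, or whether the graph simply needs to run on a GPU, is not stated in this issue.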

What does reduction cell stand for?

Awesome idea and awesome implementation! However, I have a question: what does the reduction cell stand for? For a CNN architecture, according to my understanding, a convolutional cell alone should be enough?

How to use multiple GPU to search?

Confusion about computing the gradient of predictor?

In 'NAO/cnn/encoder/encoder.py', the gradient of the predictor is computed as follows:
new_arch_outputs = self.encoder_outputs - self.params['predict_lambda'] * grads_on_outputs

However, I think the gradient step should be computed as described in the paper:
$h_{t}^{'} = h_{t} + \eta \frac{\partial f}{\partial h_{t}}, e_{x^{'}}=\{h_{1}^{'},\dots,h_{T}^{'}\}$

Is there an error in how the code computes the gradient of the predictor?
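One way to reason about the sign question is that a step of gradient ascent on a score f is the same update as a step of gradient descent on the loss g = -f. The toy sketch below shows that equivalence numerically. It assumes nothing about what the repository's predictor actually outputs (an accuracy to maximize or an error to minimize); the quadratic score, target and eta are illustrative values only.

```python
import numpy as np

target = np.array([1.0, -2.0, 0.5])  # hypothetical optimum of the toy score

def grad_score(h):
    # f(h) = -||h - target||^2, higher is better
    return -2.0 * (h - target)

def grad_loss(h):
    # g(h) = -f(h) = ||h - target||^2, lower is better
    return 2.0 * (h - target)

h = np.zeros(3)
eta = 0.1

h_ascent = h + eta * grad_score(h)   # h' = h + eta * df/dh, as in the paper
h_descent = h - eta * grad_loss(h)   # h' = h - eta * dg/dh

print(np.allclose(h_ascent, h_descent))  # True
```

So a minus sign in the code matches the plus sign in the paper if the quantity being differentiated is a value to be minimized; if it is a value to be maximized, the two would disagree.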

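A question about the predictor on the continuous space

Hello! I am very interested in the part of your work that uses NAO to process audio data, but I have not figured out how the predictor works. What are the labels of the training data? And how are the three txt files in the PTB dataset provided with your code used? I would appreciate your help. Thank you very much!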

Doesn't work for train_final.sh

When I use Python 2.7, Pytorch == 0.3.1, Tensorflow == 1.12.0, I get the following error:

TypeError: load() got an unexpected keyword argument 'encoding'
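This message is what Python 2.7 raises when a load function is called with the Python 3 only encoding keyword. Assuming the failing call is pickle.load (the issue does not include the traceback, so this is a guess), a version-tolerant loader could look like the sketch below; load_pickle_compat is a hypothetical name, not a function in this repository.

```python
import pickle
import sys

def load_pickle_compat(path):
    """Load a pickle file under both Python 2 and Python 3.

    Python 3's pickle.load accepts an encoding argument, which is needed to
    read byte strings pickled by Python 2; Python 2's pickle.load does not.
    """
    with open(path, "rb") as f:
        if sys.version_info[0] >= 3:
            return pickle.load(f, encoding="latin1")
        return pickle.load(f)
```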

what is “genotypes” in model.py and utils.py

I want to run "PTB Without Weight Sharing" but I found that the project is missing the file "genotypes.py". It's probably not a package because I cannot find it on https://pypi.org/project

ResourceExhaustedError: OOM when executing NAO-WS/cnn/train_search.sh

Hi,

I have cloned the code and run it without modification in a python3.6 virtual environment with the following packages installed:

Package Version

absl-py 0.6.1
astor 0.7.1
gast 0.2.0
grpcio 1.17.1
Markdown 3.0.1
numpy 1.15.4
Pillow 5.3.0
pip 18.1
pkg-resources 0.0.0
protobuf 3.6.1
PyYAML 3.13
setuptools 39.1.0
six 1.12.0
tensorboard 1.9.0
tensorflow-gpu 1.9.0
termcolor 1.1.0
torch 0.3.1
torchvision 0.2.1
Werkzeug 0.14.1
wheel 0.32.3

However, an out-of-memory exception occurs when executing NAO-WS/cnn/train_search.sh on a GTX-1080 with 8GB memory.

Could you point out how to fix the issue?

Error message:

lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

...

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[5,160,80,8,8] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: child_1/layer_6/cell_3/y/stack = Pack[N=5, T=DT_FLOAT, axis=0, _device="/job:localhost/replica:0/task:0/device:GPU:0"](child_1/layer_6/cell_3/y/conv_3x3/stack_1/FusedBatchNorm, child_1/layer_6/cell_3/y/conv_5x5/stack_1/FusedBatchNorm, child_1/layer_6/cell_3/y/avg_pool/average_pooling2d/AvgPool, child_1/layer_6/cell_3/y/max_pool/max_pooling2d/MaxPool, child_1/layer_6/cell_3/y/strided_slice_2)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[Node: child_2/gradients/concat_12/_22493 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_42968_child_2/gradients/concat_12", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
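As a quick sanity check on the log above, the size of the tensor that failed to allocate can be computed from its shape. The sketch assumes 4 bytes per element, since the reported type float (DT_FLOAT) is 32-bit; tensor_bytes is a hypothetical helper.

```python
from functools import reduce
from operator import mul

def tensor_bytes(shape, bytes_per_element=4):
    """Bytes needed for a dense tensor of the given shape."""
    return reduce(mul, shape, 1) * bytes_per_element

size = tensor_bytes([5, 160, 80, 8, 8])
print(size)                  # 16384000
print(size / float(2 ** 20)) # 15.625 (MiB)
```

A single allocation of roughly 15.6 MiB is small next to 8 GB, which suggests the device memory was already nearly exhausted by earlier allocations when this request was made.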

train_search.sh 'Caused by op 'child_1/stem_conv/Conv2D''

tensorflow 1.13.1
pytorch 0.4.1

I am running the code:
cd NAO-WS/cnn
bash train_final.sh
log:
`2019-06-19 07:42:32.452404: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-06-19 07:42:32.479993: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node child_1/stem_conv/Conv2D}}]]
[[{{node child_2/gradients/concat_14}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "train_search.py", line 382, in <module>
tf.app.run(argv=[sys.argv[0]] + unparsed)
File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "train_search.py", line 376, in main
train()
File "train_search.py", line 214, in train
child_epoch = child_train(child_params)
File "/NAO/NAO-WS/cnn/model_search.py", line 1142, in train
loss, lr, gn, tr_acc, _ = sess.run(run_ops)
File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 676, in run
run_metadata=run_metadata)
File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
raise six.reraise(*original_exc_info)
File "/opt/conda/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
return self._sess.run(*args, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1327, in run
run_metadata=run_metadata)
File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1091, in run
return self._sess.run(*args, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node child_1/stem_conv/Conv2D (defined at /NAO/NAO-WS/cnn/model_search.py:529) ]]
[[node child_2/gradients/concat_14 (defined at /NAO/NAO-WS/cnn/utils.py:62) ]]

Caused by op 'child_1/stem_conv/Conv2D', defined at:
File "train_search.py", line 382, in <module>
tf.app.run(argv=[sys.argv[0]] + unparsed)
File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "train_search.py", line 376, in main
train()
File "train_search.py", line 214, in train
child_epoch = child_train(child_params)
File "/NAO/NAO-WS/cnn/model_search.py", line 1118, in train
child_ops = get_ops(images, labels, params)
File "/NAO/NAO-WS/cnn/model_search.py", line 1097, in get_ops
child_model.connect_controller(params['arch_pool'], params['arch_pool_prob'])
File "/NAO/NAO-WS/cnn/model_search.py", line 1059, in connect_controller
self._build_train()
File "/NAO/NAO-WS/cnn/model_search.py", line 980, in _build_train
logits = self._model(self.x_train, is_training=True, reuse=tf.AUTO_REUSE)
File "/NAO/NAO-WS/cnn/model_search.py", line 529, in _model
images, w, [1, 1, 1, 1], "SAME", data_format=self.data_format)
File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1026, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node child_1/stem_conv/Conv2D (defined at /NAO/NAO-WS/cnn/model_search.py:529) ]]
[[node child_2/gradients/concat_14 (defined at /NAO/NAO-WS/cnn/utils.py:62) ]]
`

Add simpler MLP example?

eval_once function bug

I used your NAO-WS code and found a small bug in the source code linked below. The total_exp variable is not correct because the last batch is often smaller than self.eval_batch_size.
https://github.com/renqianluo/NAO/blob/master/NAO-WS/cnn/model.py#L379-L390
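The effect described in this issue can be illustrated without TensorFlow. The sketch below is not the repository's eval_once; both functions and the per-example correctness flags are hypothetical. It contrasts counting examples as num_batches * batch_size with counting the examples actually seen.

```python
def accuracy_fixed_batch(batches, batch_size):
    """Assumes every batch is full; undercounts accuracy if the last is short."""
    correct = sum(sum(flags) for flags in batches)
    return correct / float(len(batches) * batch_size)

def accuracy_actual_count(batches):
    """Divides by the number of examples actually evaluated."""
    correct = sum(sum(flags) for flags in batches)
    total = sum(len(flags) for flags in batches)
    return correct / float(total)

# Ten examples evaluated with batch size 4: the last batch holds only two.
batches = [[1, 1, 1, 0], [1, 0, 1, 1], [1, 1]]
print(accuracy_fixed_batch(batches, 4))  # 8 / 12
print(accuracy_actual_count(batches))    # 8 / 10 = 0.8
```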

Maybe there is a mistake in func "_enas_cell" ?

In the func "_enas_cell" of /NAO-WS/cnn/model_search.py, "num_possible_inputs = curr_cell + 1". When curr_cell=0, prev_cell can be 0 or 1 and num_possible_inputs=curr_cell+1=1, so the shape of "w" will be (1, avg_pool_c * out_filters), and the code "w[prev_cell]" may cause an "index out of range" error.
I also see that in the func "_enas_conv", num_possible_inputs = curr_cell + 2. The other parts of these two functions are similar. So I wonder whether "curr_cell+1" should be "curr_cell+2"? Thank you!
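The indexing concern can be reproduced in isolation with NumPy. This is a sketch of the argument in the issue, not of the repository code: the premise that prev_cell can be 0 or 1 when curr_cell is 0 comes from the issue itself, the width of 4 is arbitrary, and TensorFlow's indexing may behave differently from NumPy's.

```python
import numpy as np

def can_index(num_possible_inputs, prev_cell, width=4):
    """True if row prev_cell exists in a (num_possible_inputs, width) array."""
    w = np.zeros((num_possible_inputs, width))
    try:
        w[prev_cell]
        return True
    except IndexError:
        return False

curr_cell = 0
for prev_cell in (0, 1):
    print(prev_cell,
          can_index(curr_cell + 1, prev_cell),   # sizing used in _enas_cell
          can_index(curr_cell + 2, prev_cell))   # sizing used in _enas_conv
```

With curr_cell + 1 rows, index 1 is out of range, while curr_cell + 2 rows accommodate both values.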
