
tensorpack's Introduction

Tensorpack

Tensorpack is a neural network training interface based on graph-mode TensorFlow.


Features:

It's Yet Another TF high-level API, with the following highlights:

  1. Focus on training speed.
  • Speed comes for free with Tensorpack -- it uses TensorFlow in an efficient way with no extra overhead. On common CNNs, it runs training 1.2~5x faster than the equivalent Keras code. Your training can probably get faster if written with Tensorpack.

  • A scalable data-parallel multi-GPU / distributed training strategy is available off the shelf. See tensorpack/benchmarks for more benchmarks.

  2. Squeeze the best data loading performance of Python with tensorpack.dataflow.
  • Symbolic programming (e.g. tf.data) does not offer the data processing flexibility needed in research. Tensorpack squeezes the most performance out of pure Python with various autoparallelization strategies.
  3. Focus on reproducible and flexible research:
  4. It's not a model wrapper.
  • There are too many symbolic function wrappers already. Tensorpack includes only a few common layers. You can use any TF symbolic functions inside Tensorpack, including tf.layers/Keras/slim/tflearn/tensorlayer/....

See tutorials and documentation to learn more about these features.

Examples:

We refuse toy examples. Instead of showing tiny CNNs trained on MNIST/Cifar10, we provide training scripts that reproduce well-known papers.

We refuse low-quality implementations. Unlike most open source repos which only implement papers, Tensorpack examples faithfully reproduce papers, demonstrating its flexibility for actual research.

Vision:

Reinforcement Learning:

Speech / NLP:

Install:

Dependencies:

  • Python 3.3+.
  • Python bindings for OpenCV. (Optional, but required by a lot of features)
  • TensorFlow ≥ 1.5
    • TF is not required if you only want to use tensorpack.dataflow alone as a data processing library
    • When using TF2, tensorpack uses its TF1 compatibility mode. Note that a few examples in the repo are not yet migrated to support TF2.
pip install --upgrade git+https://github.com/tensorpack/tensorpack.git
# or add `--user` to install to user's local directories

Please note that tensorpack is not yet stable. If you use tensorpack in your code, remember to pin the exact version of tensorpack you use in your dependencies.

Citing Tensorpack:

If you use Tensorpack in your research or wish to refer to the examples, please cite with:

@misc{wu2016tensorpack,
  title={Tensorpack},
  author={Wu, Yuxin and others},
  howpublished={\url{https://github.com/tensorpack/}},
  year={2016}
}

tensorpack's People

Contributors

aprlirainkun, armandmcqueen, bluerythem, bzamecnik, cykustcc, dan-anghel, dev-hjyoo, dongzhuoyao, eddiepierce, eliberis, eyaler, janpf, jasonhang, jimmycai91, julienc91, maciejjaskowski, mek-yt, patwie, philippwerner, ppwwyyxx, skylion007, sunskyf, tals, thuzhf, vfdev-5, wangg12, yg320, ymy513, yselivonchyk, zsc

tensorpack's Issues

Could you teach me how to know the test accuracy?

Thank you for reading my issue.
How can I find out the test accuracy?
I can see the validation error, but I don't know how to test on new data that was not used in the training phase.

I am trying to use DoReFa-Net.

Quantize scheme for FPGA

This paper uses round((2^k-1)x)/(2^k-1) to quantize x, which may not be suitable for representation on an FPGA. For example, if k=2, the quantized values are 0, 0.33, 0.67, 1, whereas an FPGA can only represent 0, 0.25, 0.5, 0.75 unless it uses a lookup table.
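
To make the comparison concrete, here is a minimal sketch that lists the levels produced by the quoted formula next to a power-of-two grid. The function names are hypothetical and are not part of tensorpack; exact fractions are used only to avoid rounding noise in the printout.

```python
from fractions import Fraction


def quantize_k(x, k):
    """Quantize x in [0, 1] with the quoted rule: round((2^k - 1) * x) / (2^k - 1)."""
    n = 2 ** k - 1
    return Fraction(round(n * x), n)


def uniform_levels(k):
    """All values the quoted rule can produce: i / (2^k - 1) for i = 0 .. 2^k - 1."""
    n = 2 ** k - 1
    return [Fraction(i, n) for i in range(n + 1)]


def power_of_two_levels(k):
    """The fixed-point grid mentioned in the post: i / 2^k for i = 0 .. 2^k - 1."""
    n = 2 ** k
    return [Fraction(i, n) for i in range(n)]


print([str(v) for v in uniform_levels(2)])       # thirds
print([str(v) for v in power_of_two_levels(2)])  # quarters
```

For k=2 the two grids share only the value 0, which is the mismatch the post describes.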

Equation (1) from DoReFa Net - XNOR-Bitcount equivalent for bitwise dot product

Hello, I'm hoping you could help me understand Equation (1) from the DoReFa-Net paper. You say that

the following equivalence computes the dot product of two bit vectors x and y:
Σ_i x_i y_i = bitcount(xnor(x_i, y_i)), x_i, y_i ∈ {0, 1} ∀ i

If we evaluate with two bit-vectors, a := {1, 1, 0} and b := {0, 1, 0} this equivalence does not seem to hold.

a := 1 1 0
b := 0 1 0
let c := xnor(a, b) == 0 1 1
bitcount(c) = 2
bit-wise dot product of (a, b) == 1 * 0 + 1 * 1 + 0 * 0 = 1
2 != 1

What am I misunderstanding? Thank you.
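
The counterexample can be checked mechanically. The sketch below is only an illustration of two standard identities, not a statement about what the paper intends: for bits in {0, 1} the dot product equals bitcount(and(a, b)), while the xnor form matches the dot product only when each bit b is read as the signed value 2b - 1, in which case the dot product is 2 * bitcount(xnor(a, b)) - n. All helper names are hypothetical.

```python
def bitcount(bits):
    """Number of ones in a bit vector."""
    return sum(bits)


def xnor(a, b):
    """Element-wise XNOR of two bit vectors."""
    return [1 - (x ^ y) for x, y in zip(a, b)]


def bit_and(a, b):
    """Element-wise AND of two bit vectors."""
    return [x & y for x, y in zip(a, b)]


def dot(a, b):
    """Ordinary dot product."""
    return sum(x * y for x, y in zip(a, b))


def to_signed(bits):
    """Read each bit b as the value 2b - 1, i.e. 0 -> -1 and 1 -> +1."""
    return [2 * x - 1 for x in bits]


a = [1, 1, 0]
b = [0, 1, 0]
n = len(a)

# With values in {0, 1}, AND (not XNOR) reproduces the dot product.
and_count = bitcount(bit_and(a, b))
xnor_count = bitcount(xnor(a, b))
plain_dot = dot(a, b)

# With values in {-1, +1}, XNOR reproduces the dot product after an affine fix-up.
signed_dot = dot(to_signed(a), to_signed(b))
signed_from_xnor = 2 * xnor_count - n

print(and_count, xnor_count, plain_dot, signed_dot, signed_from_xnor)
```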

Exception gym.error.Error: Error('env has been garbage collected.

After running MsPacman-v0 for about an hour without problems, I get this error:
$ python run-atari.py --load MsPacman-v0.tfmodel --env MsPacman-v0

....
('Total:', 7500.0)
('Total:', 6770.0)
('Total:', 6970.0)
('Total:', 6110.0)
Exception gym.error.Error: Error('env has been garbage collected. To keep using a monitor, you must keep around a reference to the env object. (HINT: try assigning the env to a variable in your code.)',) in <bound method AtariEnv.__del__ of <gym.envs.atari.atari_env.AtariEnv object at 0x11c966450>> ignored
Exception gym.error.Error: Error('env has been garbage collected. To keep using a monitor, you must keep around a reference to the env object. (HINT: try assigning the env to a variable in your code.)',) in <bound method Monitor.__del__ of <gym.monitoring.monitor.Monitor object at 0x11c966410>> ignored

Training a model in Env Pong-v0 from scratch makes no progress

Hi, ppwwyyxx,

I am trying to use your code to learn A3C. Could you tell me how long it takes to train a model from scratch?

The command I use:
./train-atari.py --env Pong-v0 --gpu 0

I use one Tesla K40 and an Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz. After 30 minutes of training, it still shows the following:

[1112 22:35:24 @concurrency.py:24] Starting EnqueueThread
[1112 22:35:24 @base.py:130] Start training with global_step=0
  0%|                                                               |0/6000[00:00<?,?it/s]

Is there anything wrong? Thanks a lot

Python 2

@ppwwyyxx awesome work, and thanks for sharing it. Do you know which parts really require Python 2, or how nontrivial it would be to adapt it to Python 3?

Examples show no training progress

I set up tensorpack and I can run all the examples, but the training shows no progress at all. For CIFAR-10 and MNIST, the training and validation errors don't change at all over multiple epochs.

I use TensorFlow 0.9 and a GTX 1080. A hint of any kind would be helpful. I like the architecture of tensorpack and I would like to use it.

Quantize the weights to ternary (-1, 0, +1)?

I would like to quantize the weights to ternary values (-1, 0, +1), which could increase the diversity of the convolution kernels without requiring many more logic resources on an FPGA. I wrote it like this:

E = tf.stop_gradient(tf.reduce_mean(tf.abs(x)))
clip_x = tf.clip_by_value(x/E, -2.0, 2.0) / 2.0
with G.gradient_override_map({"Floor": "Identity"}):
    return tf.round(clip_x) * E

However, it does not seem to work well. Is there a good approach to implementing this?
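
As a way to inspect what the snippet above computes, here is a plain-Python sketch of its forward pass only. It mirrors the same arithmetic (scale by E = mean(|x|), clip to [-2, 2], halve, round, rescale) and does not model the gradient override. The function name `ternarize` is hypothetical, and Python's `round` is assumed to behave like `tf.round` on the values used here.

```python
def ternarize(weights):
    """Forward pass only: map each weight to one of {-E, 0, +E}, with E = mean(|w|)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0:
        return [0.0] * len(weights)
    out = []
    for w in weights:
        # Same steps as the TensorFlow snippet: x / E, clip to [-2, 2], divide by 2.
        clipped = max(-2.0, min(2.0, w / scale)) / 2.0
        # Rounding a value in [-1, 1] yields -1, 0, or +1.
        out.append(round(clipped) * scale)
    return out


weights = [0.9, -0.1, 0.2, -1.2]
ternary = ternarize(weights)
signs = [(t > 0) - (t < 0) for t in ternary]
print(ternary, signs)
```

One thing this makes visible: under this scheme every weight whose magnitude is below roughly E is mapped to 0, so the fraction of zeroed weights depends directly on the choice of threshold.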

A3C batch modification

I wonder what kind of modification to the original A3C algorithm you have made in your Batch A3C variant? Could you describe it in pseudo code?

resnet on imagenet

Not an issue per se, but more of a question: could you provide ResNet configs (34, 50, 101, 1001, etc.) for training on the ImageNet dataset?

Thanks

Before and After Step Design

Hi,
Could you please allow access to "before_step" and "after_step"? Sometimes it is crucial to:

  1. run something more often than once per epoch (or just at a different frequency)
  2. run something explicitly BEFORE train_op (like some tensor statistics)
  3. define timers etc. more easily (it is crucial to benchmark the times of train_op, other_run_ops, loading data, etc.)

Equation (3) From DoReFa July v2 Paper - only defining a single fixed-point int, not a sequence?

Hello again, thanks for the updated paper - equation 1 now makes sense to me. Could you please help me understand your intent with equation 3?

You say the following:
[Image: equation 3 from the DoReFa v2 paper]

However, it does not seem like x and y are sequences of integers, it seems like they are single integers. For example, x is some M-bitwidth fixed-point integer, and y is some K-bitwidth fixed point integer. The reason being, if they were sequences of multiple M-bitwidth (and K-bitwidth) integers, say p many M-bitwidth integers and q many K-bitwidth integers, then the bitwise dot product would need to iterate over these sequences.

Currently, the bitwise dot product as defined only executes MK many summations. There is no Σ for the p many M-bitwidth integers, nor for the q many K-bitwidth integers.

Also, you define x and y as summations of bits to varying powers of 2 (beginning part of the quote above). It seems like they could only represent a single fixed-point integer, not a sequence of fixed-point integers.

Please let me know if I am misunderstanding something, thank you.

EDIT: Two more points:
(1) I think we would also need the constraint that p == q, i.e. that the two sequences are of equivalent lengths, if they are to represent the dot product of two vectors.

(2) The bitcount operation inside the summation seems irrelevant. The bitcount of a bitwise and will be 1 iff the bitwise and evaluates to 1, and 0 otherwise. In other words, it is redundant with the and operation.

I feel I may be misunderstanding what your notation means. Please help me clarify, thank you.
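
One possible reading of equation 3, offered as an assumption rather than as the authors' stated intent, is that x and y are sequences, c_m(x) is the bit vector holding bit m of every element of x, and bitcount sums over the positions of the sequence. Under that reading the sum over elements is hidden inside bitcount, which is why only M*K terms appear. A minimal sketch with hypothetical helper names:

```python
def bit_plane(values, m):
    """c_m(x) under the assumed reading: bit m of every element, as one bit vector."""
    return [(v >> m) & 1 for v in values]


def bitcount(bits):
    """Number of ones in a bit vector; this is where the sum over elements happens."""
    return sum(bits)


def bitwise_dot(x, y, M, K):
    """Sum over m < M and k < K of 2^(m+k) * bitcount(and(c_m(x), c_k(y)))."""
    total = 0
    for m in range(M):
        cm = bit_plane(x, m)
        for k in range(K):
            ck = bit_plane(y, k)
            anded = [a & b for a, b in zip(cm, ck)]
            total += (1 << (m + k)) * bitcount(anded)
    return total


x = [3, 0, 2, 1]  # four 2-bit unsigned integers (M = 2)
y = [5, 7, 1, 4]  # four 3-bit unsigned integers (K = 3)

reference = sum(a * b for a, b in zip(x, y))
print(bitwise_dot(x, y, 2, 3), reference)
```

Under this reading both sequences must have the same length, which matches point (1) of the edit above, and bitcount is not redundant because its argument is a vector rather than a single bit.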

Quantizing Gradients - Meaning of max0() operator in DoReFa v2 paper?

Thank you for your help so far.

(1) In section 2.5 on quantizing gradients you use an operator called max0 but do not define it. I did not find a definition in the XNOR or BNN papers either. What does this operator do? How is it different from the regular max() operator?

(2) Second, you say that dr / (2 max0(|dr|)) + 1/2 is an affine transform to map the gradient into [0,1], but it seems like in your code you apply an additional step to manually clip the values. Why do you need this additional step?

Code: https://github.com/ppwwyyxx/tensorpack/blob/master/examples/DoReFa-Net/dorefa.py

def grad_fg(op, x):
    rank = x.get_shape().ndims
    assert rank is not None
    maxx = tf.reduce_max(tf.abs(x), list(range(1,rank)), keep_dims=True)
    x = x / maxx
    n = float(2**bitG-1)
    x = x * 0.5 + 0.5 + tf.random_uniform(
            tf.shape(x), minval=-0.5/n, maxval=0.5/n)
    x = tf.clip_by_value(x, 0.0, 1.0) # this is the extra step not in the paper
    x = quantize(x, bitG) - 0.5
    return x * maxx * 2

(3) I am also having trouble understanding this line, could you please explain? - maxx = tf.reduce_max(tf.abs(x), list(range(1,rank)), keep_dims=True).

It seems like list(range(1,rank)) is somehow related to your statement that "Here dr = ∂c/∂r is the back-propagated gradient of the output r of some layer, and the maximum is taken over all axis of the gradient tensor dr except for the mini-batch axis (therefore each instance in a mini-batch will have its own scaling factor)", but I do not understand this sentence either. Thank you for your help!
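
The following plain-Python sketch illustrates what the quoted code does for questions (2) and (3); it does not settle what max0 means in the paper. Reducing over axes 1..rank-1 leaves one maximum per mini-batch instance, and the uniform noise added after the affine map can push a value slightly outside [0, 1], which is a plausible reason for the clip. The helper names and the sample gradient are made up for illustration.

```python
def flatten(t):
    """Flatten an arbitrarily nested list of numbers."""
    if isinstance(t, (int, float)):
        return [t]
    out = []
    for item in t:
        out.extend(flatten(item))
    return out


def per_instance_max_abs(batch):
    """Max of |v| over every axis except the first (mini-batch) axis."""
    return [max(abs(v) for v in flatten(instance)) for instance in batch]


# A toy gradient of shape (2, 2, 2): two instances, each a 2x2 block.
grad = [
    [[0.5, -2.0], [1.0, 0.25]],     # instance 0
    [[-0.1, 0.05], [0.02, -0.08]],  # instance 1
]
maxx = per_instance_max_abs(grad)  # one scaling factor per instance

# After x / maxx the largest element of an instance is +1 or -1, so x * 0.5 + 0.5
# can be exactly 1.0 (or 0.0). Adding noise in [-0.5/n, 0.5/n] can then overshoot.
bitG = 6
n = float(2 ** bitG - 1)
noisy_upper = (1.0 * 0.5 + 0.5) + 0.5 / n
noisy_lower = (-1.0 * 0.5 + 0.5) - 0.5 / n
print(maxx, noisy_upper, noisy_lower)
```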

Multi Task Learning

Does support for multi task learning exist? If not, which way would fit this architecture? I am talking about a setup where a new task is chosen for each minibatch pass.

Thank you for any help you can provide.

I am making DoReFa Net for cifar image set.

Thank you for reading my issue.
I am now building a DoReFa-Net for the CIFAR image dataset.
I would like to use ".BatchNorm()" in the source code, but
an error occurred.
The error is this:

Traceback (most recent call last):
File "cifar-dorefa.py", line 197, in <module>
SimpleTrainer(config).train()
File "/home/tomohiro/github/tensorpack/tensorpack/train/trainer.py", line 84, in train
self.main_loop()
File "/home/tomohiro/github/tensorpack/tensorpack/train/base.py", line 108, in main_loop
callbacks.setup_graph(self) # TODO use weakref instead?
File "/home/tomohiro/github/tensorpack/tensorpack/callbacks/base.py", line 52, in setup_graph
self._setup_graph()
File "/home/tomohiro/github/tensorpack/tensorpack/callbacks/group.py", line 126, in _setup_graph
cb.setup_graph(self.trainer)
File "/home/tomohiro/github/tensorpack/tensorpack/callbacks/base.py", line 52, in setup_graph
self._setup_graph()
File "/home/tomohiro/github/tensorpack/tensorpack/callbacks/inference.py", line 88, in _setup_graph
input_names, self.output_tensors)
File "/home/tomohiro/github/tensorpack/tensorpack/train/trainer.py", line 117, in get_predict_func
return self.predictor_factory.get_predictor(input_names, output_names, 0)
File "/home/tomohiro/github/tensorpack/tensorpack/train/trainer.py", line 42, in get_predictor
self._build_predict_tower()
File "/home/tomohiro/github/tensorpack/tensorpack/train/trainer.py", line 55, in _build_predict_tower
self.model, self.towers, prefix=self.PREFIX)
File "/home/tomohiro/github/tensorpack/tensorpack/predict/base.py", line 112, in build_multi_tower_prediction_graph
model._build_graph(input_vars, False)
File "cifar-dorefa.py", line 76, in _build_graph
.BatchNorm('bn2')
File "/home/tomohiro/github/tensorpack/tensorpack/models/__init__.py", line 53, in f
ret = layer(name, self._t, *args, **kwargs)
File "/home/tomohiro/github/tensorpack/tensorpack/models/_common.py", line 54, in wrapped_func
outputs = func(*args, **actual_args)
File "/home/tomohiro/github/tensorpack/tensorpack/models/batch_norm.py", line 70, in BatchNorm
assert not use_local_stat
AssertionError

And my program is this.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
# File: cifar-convnet.py
# Author: Yuxin Wu <[email protected]>
import tensorflow as tf
import argparse
import numpy as np
import os

from tensorpack import *
import tensorpack.tfutils.symbolic_functions as symbf
from tensorpack.tfutils.summary import *
from dorefa import get_dorefa
"""
A small convnet model for Cifar10 or Cifar100 dataset.

Cifar10:
    90% validation accuracy after 40k step.
    91% accuracy after 80k step.
    19.3 step/s on Tesla M40

Not a good model for Cifar100, just for demonstration.
"""
BITW = 1
BITA = 2
BITG = 6
BATCH_SIZE = 32
class Model(ModelDesc):
    def __init__(self, cifar_classnum):
        super(Model, self).__init__()
        self.cifar_classnum = cifar_classnum

    def _get_input_vars(self):
        return [InputVar(tf.float32, [None, 30, 30, 3], 'input'),
                InputVar(tf.int32, [None], 'label')]

    def _build_graph(self, input_vars, is_training):

        image, label = input_vars
        image = image / 4.0     # just to make range smaller

        fw, fa, fg = get_dorefa(BITW, BITA, BITG)
        # monkey-patch tf.get_variable to apply fw
        old_get_variable = tf.get_variable
        def new_get_variable(name, shape=None, **kwargs):
            v = old_get_variable(name, shape, **kwargs)
            # don't binarize first and last layer
            if name != 'W' or 'conv0' in v.op.name or 'fct' in v.op.name:
                return v
            else:
                logger.info("Binarizing weight {}".format(v.op.name))
                return fw(v)
        tf.get_variable = new_get_variable

        def nonlin(x):
            if BITA == 32:
                return tf.nn.relu(x)    # still use relu for 32bit cases
            return tf.clip_by_value(x, 0.0, 1.0)

        def activate(x):
            return fa(nonlin(x))
        def cabs(x):
            return tf.minimum(1.0, tf.abs(x), name='cabs')

        keep_prob = tf.constant(0.5 if is_training else 1.0)

        if is_training:
            tf.image_summary("train_image", image, 10)

        with argscope(BatchNorm, decay=0.9, epsilon=1e-4), \
                argscope(FullyConnected, use_bias=False, nl=tf.identity), \
                argscope(Conv2D, nl=BNReLU(is_training), use_bias=False, kernel_shape=3):
            logits = LinearWrap(image) \
                    .Conv2D('conv1.1', out_channel=64)\
                    .Conv2D('conv1.2', out_channel=64) \
                    .BatchNorm('bn2')\
                    .apply(fg)\
                    .apply(activate)\
                    .MaxPooling('pool1', 3, stride=2, padding='SAME') \
                    .apply(activate)\
                    .Conv2D('conv2.1', out_channel=128)\
                    .apply(activate)\
                    .Conv2D('conv2.2', out_channel=128)\
                    .MaxPooling('pool2', 3, stride=2, padding='SAME') \
                    .apply(activate)\
                    .Conv2D('conv3.1', out_channel=128, padding='VALID') \
                    .apply(fg)\
                    .BatchNorm('bn1')\
                    .apply(activate)\
                    .Conv2D('conv3.2', out_channel=128, padding='VALID') \
                    .apply(activate)\
                    .FullyConnected('fc0', 1024 + 512,
                           b_init=tf.constant_initializer(0.1)) \
                    .tf.nn.dropout(keep_prob) \
                    .FullyConnected('fc1', 512,
                           b_init=tf.constant_initializer(0.1)) \
                    .apply(cabs)\
                    .FullyConnected('linear', out_dim=self.cifar_classnum, nl=tf.identity)()

        cost = tf.nn.sparse_softmax_cross_entropy_with_logits(logits, label)
        cost = tf.reduce_mean(cost, name='cross_entropy_loss')
        tf.get_variable = old_get_variable
        prob = tf.nn.softmax(logits, name='output')

        # compute the number of failed samples, for ClassificationError to use at test time
        wrong = symbf.prediction_incorrect(logits, label)
        nr_wrong = tf.reduce_sum(wrong, name='wrong')
        # monitor training error
        add_moving_summary(tf.reduce_mean(wrong, name='train_error'))

        # weight decay on all W of fc layers
        wd_cost = tf.mul(0.004,
                         regularize_cost('fc.*/W', tf.nn.l2_loss),
                         name='regularize_loss')
        add_moving_summary(cost, wd_cost)

        add_param_summary([('.*/W', ['histogram'])])   # monitor W
        self.cost = tf.add_n([cost, wd_cost], name='cost')

def get_data(train_or_test, cifar_classnum):
    isTrain = train_or_test == 'train'
    if cifar_classnum == 10:
        ds = dataset.Cifar10(train_or_test)
    else:
        ds = dataset.Cifar100(train_or_test)
    if isTrain:
        augmentors = [
            imgaug.RandomCrop((30, 30)),
            imgaug.Flip(horiz=True),
            imgaug.Brightness(63),
            imgaug.Contrast((0.2,1.8)),
            imgaug.GaussianDeform(
                [(0.2, 0.2), (0.2, 0.8), (0.8,0.8), (0.8,0.2)],
                (30,30), 0.2, 3),
            imgaug.MeanVarianceNormalize(all_channel=True)
        ]
    else:
        augmentors = [
            imgaug.CenterCrop((30, 30)),
            imgaug.MeanVarianceNormalize(all_channel=True)
        ]
    ds = AugmentImageComponent(ds, augmentors)
    ds = BatchData(ds, 128, remainder=not isTrain)
    if isTrain:
        ds = PrefetchData(ds, 3, 2)
    return ds
def get_config(cifar_classnum):
    logger.auto_set_dir()

    # prepare dataset
    dataset_train = get_data('train', cifar_classnum)
    step_per_epoch = dataset_train.size()
    dataset_test = get_data('test', cifar_classnum)

    sess_config = get_default_sess_config(0.5)

    nr_gpu = get_nr_gpu()
    lr = tf.train.exponential_decay(
        learning_rate=1e-2,
        global_step=get_global_step_var(),
        decay_steps=step_per_epoch * (30 if nr_gpu == 1 else 20),
        decay_rate=0.5, staircase=True, name='learning_rate')
    tf.scalar_summary('learning_rate', lr)

    return TrainConfig(
        dataset=dataset_train,
        optimizer=tf.train.AdamOptimizer(lr, epsilon=1e-3),
        callbacks=Callbacks([
            StatPrinter(),
            ModelSaver(),
            InferenceRunner(dataset_train, ClassificationError())  # change this to dataset_test
        ]),
        session_config=sess_config,
        model=Model(cifar_classnum),
        step_per_epoch=step_per_epoch,
        max_epoch=250,
    )

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--gpu', help='comma separated list of GPU(s) to use.') # nargs='*' in multi mode
    parser.add_argument('--load', help='load model')
    parser.add_argument('--classnum', help='10 for cifar10 or 100 for cifar100',
                        type=int, default=10)
    args = parser.parse_args()

    if args.gpu:
        os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu
    else:
        os.environ['CUDA_VISIBLE_DEVICES'] = '0'

    with tf.Graph().as_default():
        config = get_config(args.classnum)
        if args.load:
            config.session_init = SaverRestore(args.load)
        if args.gpu:
            config.nr_tower = len(args.gpu.split(','))
        #QueueInputTrainer(config).train()
        SimpleTrainer(config).train()

I changed the with statement to "with argscope(BatchNorm, decay=0.9, epsilon=1e-4, use_local_stat=is_training), ", and another error occurred.

Traceback (most recent call last):
  File "cifar-dorefa.py", line 204, in <module>
    QueueInputTrainer(config).train()
  File "/home/tomohiro/github/tensorpack/tensorpack/train/trainer.py", line 222, in train
    grads = self._single_tower_grad()
  File "/home/tomohiro/github/tensorpack/tensorpack/train/trainer.py", line 204, in _single_tower_grad
    self.model.build_graph(self.dequed_inputs, True)
  File "/home/tomohiro/github/tensorpack/tensorpack/models/model_desc.py", line 60, in build_graph
    self._build_graph(model_inputs, is_training)
  File "cifar-dorefa.py", line 79, in _build_graph
    .Conv2D('conv1.1', out_channel=64)\
  File "/home/tomohiro/github/tensorpack/tensorpack/models/__init__.py", line 53, in f
    ret = layer(name, self._t, *args, **kwargs)
  File "/home/tomohiro/github/tensorpack/tensorpack/models/_common.py", line 54, in wrapped_func
    outputs = func(*args, **actual_args)
  File "/home/tomohiro/github/tensorpack/tensorpack/models/conv2d.py", line 62, in Conv2D
    return nl(tf.nn.bias_add(conv, b) if use_bias else conv, name='output')
  File "/home/tomohiro/github/tensorpack/tensorpack/models/nonlin.py", line 74, in BNReLU
    x = BatchNorm('bn', x, is_training, **kwargs)
  File "/home/tomohiro/github/tensorpack/tensorpack/models/_common.py", line 54, in wrapped_func
    outputs = func(*args, **actual_args)
TypeError: BatchNorm() got multiple values for keyword argument 'use_local_stat'

Sorry for the very long post, but once my code is complete, I can contribute it to your project.

Image Augmentors

  • Saturation / Hue
  • Gaussian noise
  • Salt-Pepper noise
  • Rotation
  • Perspective

Error when enabling the float64 in train and inference

I want to enable float64 for training and inference. I only changed the input type from float32 to float64, but I got the following error. What is wrong? I checked the TF documentation, and it should support float64.

Input 'filter' of 'Conv2D' Op has type float32 that does not match type float64 of argument 'input'

Error on custom gym-env with train-atari.py

I created my own gym environment and am trying to use your code on it. The state is an 84x84x3 image, and the actions are discrete. Everything runs fine up until it starts the graph, and then I get the following error.

File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 894, in _run
% (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (2, 84, 84, 16) for Tensor u'state:0', which has shape '(?, 84, 84, 12)'

Have you seen anything like this in your development, or do you have any clue as to what is going wrong?

Thanks so much. This is a really fantastic package.

Half VGG

Hi guys!
At CVPR 2016 I saw your demo of a real-time network that segments people and runs on mobile. Are you planning to open-source the model and code for it?
Thank you

Using bitwise convolutions with negative weights?

Could not reopen the issue, please see here for more context: #27 (comment)

The sign-vs-unsign problem is more relevant in FPGA. But as we are only doing summation, unsign numbers should be fine.

How do you implement this bitwise dot product kernel (equation 3 in section 2.1, DoReFa v2 paper) for negative weights?

The quantize_k function defined in Section 2.2 as Equation 5 outputs a number r_o ∈ [0, 1]. The affine transform on F_w^k(r_i) in Equation 9 takes the output of a quantize_k function, multiplies it by 2, and subtracts 1: F_w^k(r_i) = 2 * quantize_k(stuff) - 1.

Thus F_w^k(r_i) ∈ [-1, 1].

However, the procedure you define in Equation 3 only works for unsigned values. If some values x_i in the sequence x or y_i in the sequence y are negative, then their contribution to the dot product is a subtraction, not an addition, so the simple bitcount(and()) operation no longer suffices.

How did you change the bitwise dot product procedure to account for negative weights?

One possibility:

  1. Add an additional sign bit to all M-bit fixed-point integers x_i ∈ x and all K-bit fixed-point integers y_i ∈ y.
  2. This bit is 1 if the number x_i is negative, and 0 if x_i is positive (likewise for the y_i), but does not count as a place-value bit for multiplication.
  3. Let bitwise_and(m, k) = and(c_m(x), c_k(y)), ∀(m, k), ignoring the sign bits.
  4. Let bitwise_sign = xor(sign bit of x_i, sign bit of y_i). This gives us the sign of the product of x_i and y_i.
  5. ∀ bitwise_and(m, k) ∀(m, k), note that bitwise_sign(x_i, y_i) is a vector giving the sign for each element in bitwise_and(m, k).
  6. For each pair of vectors ( bitwise_and(m, k), bitwise_sign(x_i, y_i) ) ∀(m,k), drop all members of bitwise_and(m, k) and their corresponding signs in bitwise_sign(x_i, y_i) where bitwise_and(m, k) == 0. This leaves us with the cases where the bitwise multiplication produced a 1, along with their signs.
  7. For each pair of vectors ( bitwise_and(m, k), bitwise_sign(x_i, y_i) ) ∀(m,k), compute bitcount[ bitwise_sign(x_i, y_i) ] to get the total number of negatives for the (m*k) place-value. The total number of positives is given by len(bitwise_sign(x_i, y_i)) - bitcount[ bitwise_sign(x_i, y_i) ].
  8. Use the negative and positive accumulations in step 7 to get the signed contribution to the dot product.
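
A different approach from the sign-bit scheme listed above, shown here only as an algebraic observation and not as what the authors implemented, is to fold the affine transform into the result. If each weight is w_i = 2 * q_i / n - 1 with an unsigned code q_i in 0..n and n = 2^k - 1, then n * Σ x_i w_i = 2 * Σ x_i q_i - n * Σ x_i, so the unsigned kernel can be reused and the sign handled by one subtraction. All names below are hypothetical, and `unsigned_dot` merely stands in for the bitcount(and(...)) kernel.

```python
def unsigned_dot(x, q):
    """Stand-in for the unsigned bitwise kernel: dot product of non-negative integers."""
    return sum(a * b for a, b in zip(x, q))


def signed_dot_via_offset(x, q, k):
    """Return n * sum_i x_i * w_i, where w_i = 2 * q_i / n - 1 and n = 2^k - 1."""
    n = 2 ** k - 1
    return 2 * unsigned_dot(x, q) - n * sum(x)


k = 2
n = 2 ** k - 1
x = [3, 0, 2, 1]  # unsigned activation codes
q = [0, 3, 1, 2]  # unsigned weight codes in 0..n

# Direct computation with the signed weights, scaled by n to stay in integers.
w_scaled = [2 * qi - n for qi in q]
direct = sum(a * b for a, b in zip(x, w_scaled))

print(signed_dot_via_offset(x, q, k), direct)
```

The extra term n * Σ x_i depends only on the activations, so it can be computed once per input and shared across all output channels.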

Multi-Task Learning

In reference to #29.

I am also interested in implementing a multi-task learning model using tensorpack - similar to the "alternating training" example in https://jg8610.github.io/Multi-Task/. In this example, there are 2 different datasets for 2 tasks, and we want to train a model which uses 1 shared layer, and 1 task-specific layer for each task. While the example using plain tensorflow is clear, I am not sure what the best approach is using tensorpack.

If we generate two different DataFlow objects such that each generates different data for each task, how would you use the SyncMultiGPUTrainer with two different DataFlow objects? Since this trainer uses the QueueInputTrainer is it possible to be enqueuing two different DataFlow objects?

Also, for the task specific layers, would it better to use separate layers with separate cost functions (as in the example) or a single layer with selectable weights? How would you tell the trainer to select the appropriate weights during training?

Any help / advice is appreciated.

DoReFa Classification Error

When running the DoReFa alexnet-126.py classification example on a single image, I encounter the following error:

File "./alexnet-dorefa.py", line 304, in <module>
    run_image(Model(), ParamRestore(np.load(args.load).item()), args.run)
...
tensorflow.python.framework.errors.InvalidArgumentError: AttrValue must not have reference type value of float_ref
     for attr 'dtype'
    ; NodeDef: Placeholder = Placeholder[dtype=DT_FLOAT_REF, shape=[], _device="/device:CPU:0"](); Op<name=Placeholder; signature= -> output:dtype; attr=dtype:type; attr=shape:shape,default=[]>

I am running Python 2.7, and this issue occurs regardless of execution on CPU or GPU. Any idea what could be causing this?

Gradient Update Step - DoReFa v2 Paper

Thank you for your help. I have three questions about the update procedure for updating weights.

  1. When you initialize the weights on the first run of the neural net, do you initialize to low-bitwidth samples from the normal distribution, or do you initialize to full-precision values?
  2. What bit-width does the update step for the weights use? In the algorithm on page 6 it looks like W_k^{t+1} = Update(W_k, g_{W_k}, η) is operating on full-precision weights. Why not use the quantized weights instead?
  3. How do you calculate ∂W_k^b / ∂W_k? This should be the partial derivative of the quantized weights with respect to the full-precision weights, but I do not know how you calculate that.

Thank you.
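
For context on questions 2 and 3, here is a toy sketch of a straight-through estimator (STE), a common way to train through a rounding step. It assumes that the derivative of the quantizer is simply treated as 1, ignores the tanh rescaling that the paper applies to weights, and uses a made-up one-weight loss; it is not taken from the tensorpack code. All names are hypothetical.

```python
def quantize_k(r, k):
    """Quantize r in [0, 1] to k bits: round((2^k - 1) * r) / (2^k - 1)."""
    n = 2 ** k - 1
    return round(n * r) / n


def sgd_step_with_ste(w_full, grad_wrt_quantized, lr):
    """Update the full-precision weight, treating d(quantized)/d(full) as 1 (the STE)."""
    grad_wrt_full = grad_wrt_quantized * 1.0
    return w_full - lr * grad_wrt_full


k = 2
lr = 0.01
w = 0.40  # full-precision weight kept between steps

for _ in range(3):
    w_b = quantize_k(w, k)     # the forward pass uses the quantized weight
    grad = 2.0 * (w_b - 1.0)   # gradient of the toy loss (w_b - 1)^2 w.r.t. w_b
    w = sgd_step_with_ste(w, grad, lr)

# If the same small update were applied to the quantized value and re-quantized,
# it would round straight back and the progress would be lost.
lost = quantize_k(quantize_k(0.40, k) + 0.0133, k)
print(w, lost)
```

In this toy run the full-precision weight moves from 0.40 to about 0.44 while the quantized value stays at 1/3, which illustrates why small updates are usually accumulated in full precision.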

Is there a bug in imgaug

Hi, I have recently been using your great tensorpack, but I think there may be a bug in the imgaug part.
The code below just flips a dataflow, and the result is rather confusing. Perhaps you can try it and see the result?

%matplotlib inline
import matplotlib.pyplot as plt
from tensorpack import *
from tensorpack.tfutils.symbolic_functions import *
from tensorpack.tfutils.summary import *

# show the first image of the unmodified dataset
cifar10 = dataset.Cifar10('train', shuffle=False)
cifar10.reset_state()
for img_label in cifar10.get_data():
    plt.figure()
    plt.imshow(img_label[0])
    break

# show the first image after a horizontal flip applied with probability 1.0
flip_cifar10 = AugmentImageComponent(cifar10, [imgaug.Flip(horiz=True, prob=1.0),])
flip_cifar10.reset_state()
for img_label in flip_cifar10.get_data():
    plt.figure()
    plt.imshow(img_label[0])
    break

I tried the pretrained AlexNet model, but it fails.

I tried to run:
./alexnet-dorefa.py --load alexnet-126.npy --run a.jpg --dorefa 1,2,6

but a traceback occurred.

Traceback (most recent call last):
File "./alexnet-dorefa.py", line 305, in <module>
run_image(Model(), ParamRestore(np.load(args.load, encoding='latin1').item()), args.run)
File "./alexnet-dorefa.py", line 256, in run_image
meta = dataset.ILSVRCMeta()
File "/home/sounansu/work/tensorflow/tensorpack/tensorpack/dataflow/dataset/ilsvrc.py", line 34, in __init__
self.caffepb = get_caffe_pb()
File "/home/sounansu/work/tensorflow/tensorpack/tensorpack/utils/loadcaffe.py", line 83, in get_caffe_pb
proto_path = download(CAFFE_PROTO_URL, dir)
File "/home/sounansu/work/tensorflow/tensorpack/tensorpack/utils/fs.py", line 39, in download
logger.error("Failed to download {}".format(url))
NameError: global name 'logger' is not defined

Please advise!
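
The NameError hides the underlying problem: the traceback shows that the download of the Caffe proto file failed, and the error path then referenced a `logger` name that was not in scope. A minimal sketch of the usual remedy, defining a module-level logger, is shown below. The helper name `report_download_failure` is hypothetical and is not part of tensorpack.

```python
import logging

# Module-level logger; the NameError above means no such name was in scope.
logger = logging.getLogger(__name__)

def report_download_failure(url):
    """Hypothetical helper: log a failed download and return the message."""
    message = "Failed to download {}".format(url)
    logger.error(message)
    return message
```

With the logger defined, the run would still fail, but with the real "Failed to download" message instead of a NameError.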

Error with only 1 gpu

I got an error when I tried to train a model with only 1 gpu, not sure if the error goes away with 2 or not.

Here is the command line
python2.7 train-atari.py --env Breakout-v0 --gpu 0
Here is the error message.
Traceback (most recent call last):
  File "train-atari.py", line 258, in <module>
    AsyncMultiGPUTrainer(config, predict_tower=predict_tower).train()
TypeError: Can't instantiate abstract class AsyncMultiGPUTrainer with abstract methods get_predict_func

Thanks for any help with this. It looks like a great package. I was able to get run_atari.py to run and got good output movies.
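
This TypeError is what Python raises when a class inherits an abstract method and never overrides it, so the number of GPUs is probably not the cause. The sketch below reproduces the mechanism with hypothetical classes (written for Python 3, not tensorpack's actual trainers).

```python
from abc import ABC, abstractmethod

class Trainer(ABC):
    """Hypothetical base class with one abstract method."""
    @abstractmethod
    def get_predict_func(self, input_names, output_names):
        ...

class IncompleteTrainer(Trainer):
    """Does not override get_predict_func, so it stays abstract."""

class CompleteTrainer(Trainer):
    def get_predict_func(self, input_names, output_names):
        # Placeholder predictor; a real trainer would build one from the model.
        return lambda *args: None

try:
    IncompleteTrainer()
    instantiated = True
except TypeError:
    instantiated = False   # same kind of error as in the traceback above

trainer = CompleteTrainer()  # works once the abstract method is implemented
```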

protoc error

Which version of protoc is required by tensorpack/tensorpack/utils/loadcaffe.py line 119?
I have 2.6.1 and I get an error saying
"caffe proto compilation failed! Did you install protoc?"
AssertionError: caffe proto compilation failed! Did you install protoc?

I am on Ubuntu 16.04, running CUDA 8 and CUDNN 5.

pooling layers

Hi,

Do you have any idea if removing pooling layers would affect the training results and convergence rate for reinforcement learning on Atari games?

Thanks!

Bug in inference get_output_tensors method

  1. _get_output_vars method is not defined, should be _get_output_tensors.
  2. InferenceRunner should call "vc.get_output_tensors()" as opposed to "vc._get_output_tensors()" in line 94.

Rookie Here, Could Someone Please Explain How to Run This? I'm Easy. All Modules are loaded

It's not you, tensorpack. It's me.
I'm 100% ready to go with everything installed; I just don't know how to code what needs to be coded here.
I've installed gym. The full version.
I have every program and every module set to go in my virtualenv, but I'm not sure what to write in run-atari.py and train-atari.py. I'd like to see this play Breakout, so I have the Breakout-v0.tfmodel file in my folder with the modules. There are also different algorithms for the games, like this one...
https://gym.openai.com/evaluations/eval_L55gczPrQJamMGihq9tzA
I'd appreciate if someone could tell me where this code goes as well.

I've run SpaceInvaders-v0 from a tutorial, the repo of which is here...
https://github.com/llSourcell/Game-AI
If there's an easy way to swap SpaceInvaders with another game and algorithm in this setup, that works for me too.

I've installed additional modules so just to make it clear, if someone could spell out, in specific code what needs to happen to deploy...
run-atari.py and train-atari.py, and have them play this file... Breakout-v0.tfmodel
and where this algorithm goes... https://gym.openai.com/evaluations/eval_L55gczPrQJamMGihq9tzA

Thank You!

GPU memory cost

Could you tell me the GPU memory cost and batch size used in your resnet-101 and resnet-152 training?

DoReFa accuracy

Question on accuracy of DoReFa Alexnet on Imagenet dataset.

With "--dorefa 1,2,6" I am getting train-error-top1: 0.51935, val-error-top1: 0.30192 and train-error-top5: 0.26953

This is better than the top-1 single-crop validation error of 51% mentioned in the comments in alexnet-dorefa.py. Are the numbers I am seeing expected, or am I getting garbage? The above numbers are at 48 epochs on a machine with 2 Titan X Pascal GPUs.

Thanks.

Image segmentation code available?

Hello,

Thank you very much for sharing your code, very impressive! I am now using resnet for an implementation and am wondering if you already have image segmentation code or examples. If you do, would you let me know where to find them, please?

I appreciate your help in advance!

Super class init method not called for some image augmentors

Several image augmentors (RandomCrop, RandomCropRandomShape ...) do not call the superclass __init__ method, so self.rng is not set until reset is explicitly called.
This makes standalone usage infeasible: it fails with an error that self.rng is not defined.
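
The pattern described above can be sketched with hypothetical classes. The names `Augmentor`, `BrokenCrop`, `FixedCrop`, and `reset_state` are assumptions for illustration and do not reproduce tensorpack's actual implementation.

```python
import numpy as np

class Augmentor:
    """Hypothetical base class: the constructor sets up self.rng."""
    def __init__(self):
        self.reset_state()

    def reset_state(self):
        self.rng = np.random.RandomState()

class BrokenCrop(Augmentor):
    def __init__(self, size):
        self.size = size        # forgets super().__init__(), so no self.rng yet

class FixedCrop(Augmentor):
    def __init__(self, size):
        super().__init__()      # self.rng exists right after construction
        self.size = size
```

`BrokenCrop(4).rng` raises AttributeError until `reset_state()` is called by hand, while `FixedCrop(4).rng` is usable immediately.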

The numerical difference between tensorflow and numpy

I wrote my own conv and fc layers in numpy to compare data against my FPGA implementation. However, I found that the results from TensorFlow in your framework and from my numpy implementation differ slightly. For example, for the fc layer, given the same input, the two outputs differ by a constant ratio of 1.0165, and I don't know where it comes from. I guess it may come from tf.reduce_mean, which may differ slightly from numpy's mean function. How can I get the E value of each layer at inference time so that I can check it? What other reasons do you think could cause this difference? Thanks.
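
One way to narrow this down is to check whether the mismatch really is a single constant factor. float32 rounding error is on the order of 1e-7 relative, so a uniform ratio of 1.0165 may point to a missing scale factor rather than to a difference between the two mean functions. The helper below is a hypothetical sketch for that check, not part of tensorpack.

```python
import numpy as np

def describe_mismatch(reference, candidate):
    """Return (mean ratio, spread of ratio) over nonzero reference entries."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    mask = reference != 0
    ratio = candidate[mask] / reference[mask]
    return ratio.mean(), ratio.std()

# Synthetic example: the candidate is the reference times a constant.
ref = np.array([0.5, -1.25, 2.0, 3.75])
mean_ratio, spread = describe_mismatch(ref, ref * 1.0165)
```

A spread near zero means every element is off by the same factor, which suggests looking for a scalar that one implementation applies and the other does not.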

About the configuration of alexnet-dorefa

I see that your alexnet-dorefa network configuration differs slightly from the standard alexnet / ZF, especially in the first two layers. What were your considerations?

Different learning rate per layer

For tensorflow you can use multiple optimizers to achieve different learning rates per layer.
Is there support for this in tensorpack or do I have to write a custom trainer?
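
An alternative to multiple optimizers is to scale gradients per variable before they are applied: with plain SGD, multiplying a variable's gradient by c has the same effect as multiplying its learning rate by c (this equivalence does not hold for adaptive optimizers such as Adam). The sketch below shows the idea in NumPy. `scale_gradients` and the variable names are hypothetical and are not tensorpack API.

```python
import re
import numpy as np

def scale_gradients(grads_and_names, multipliers):
    """Scale each gradient by the first multiplier whose regex matches its name."""
    scaled = []
    for grad, name in grads_and_names:
        factor = 1.0                      # default: leave the gradient unchanged
        for pattern, mult in multipliers:
            if re.match(pattern, name):
                factor = mult
                break
        scaled.append((grad * factor, name))
    return scaled

grads = [(np.ones(2), "conv0/W"), (np.ones(2), "fc1/W")]
# Train conv0 with one tenth of the base learning rate.
scaled = scale_gradients(grads, [(r"conv0/.*", 0.1)])
```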

Run cifar-10 residual net?

[25 17:32:54 [email protected]:tensorpack] Found cifar10 data in /home/eli/Downloads/tensorpack-master/tensorpack/dataflow/dataset/cifar10_data.
[25 17:32:55 [email protected]:tensorpack] Found cifar10 data in /home/eli/Downloads/tensorpack-master/tensorpack/dataflow/dataset/cifar10_data.
Traceback (most recent call last):
  File "cifar10-resnet.py", line 196, in <module>
    QueueInputTrainer(config).train()
  File "/home/eli/Downloads/tensorpack-master/tensorpack/train/trainer.py", line 185, in train
    grads = self._single_tower_grad()
  File "/home/eli/Downloads/tensorpack-master/tensorpack/train/trainer.py", line 134, in _single_tower_grad
    cost_var = self.model.get_cost(model_inputs, is_training=True)
  File "/home/eli/Downloads/tensorpack-master/tensorpack/models/model_desc.py", line 53, in get_cost
    return self._get_cost(input_vars, is_training)
  File "cifar10-resnet.py", line 83, in _get_cost
    l = conv('conv0', image, 16, 1)
  File "cifar10-resnet.py", line 52, in conv
    W_init=tf.random_normal_initializer(stddev=np.sqrt(2.0/9/channel)))
  File "", line 2, in Conv2D
TypeError: wrapper() takes exactly 1 argument (11 given)
[25 17:32:56 [email protected]:tensorpack] Prefetch process exiting...
[25 17:32:56 [email protected]:tensorpack] Prefetch process exited.
