vowpalwabbit / vowpal_wabbit


Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.

Home Page: https://vowpalwabbit.org

License: Other


vowpal_wabbit's Introduction

Vowpal Wabbit


This is the Vowpal Wabbit fast online learning code.

Why Vowpal Wabbit?

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning. There is a specific focus on reinforcement learning, with several contextual bandit algorithms implemented, and the system's online nature lends itself well to these problems. Vowpal Wabbit is a destination for implementing and maturing state-of-the-art algorithms with performance in mind.

  • Input Format. The input format for the learning algorithm is substantially more flexible than might be expected. Examples can have features consisting of free form text, which is interpreted in a bag-of-words way. There can even be multiple sets of free form text in different namespaces.
  • Speed. The learning algorithm is fast -- similar to the few other online algorithm implementations out there. There are several optimization algorithms available with the baseline being sparse gradient descent (GD) on a loss function.
  • Scalability. This is not the same as fast. Instead, the important characteristic here is that the memory footprint of the program is bounded independent of data. This means the training set is not loaded into main memory before learning starts. In addition, the size of the set of features is bounded independent of the amount of training data using the hashing trick.
  • Feature Interaction. Subsets of features can be internally paired so that the algorithm is linear in the cross-product of the subsets. This is useful for ranking problems. The alternative of explicitly expanding the features before feeding them into the learning algorithm can be both computation and space intensive, depending on how it's handled.
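The bounded-memory property from the hashing trick can be sketched in a few lines of Python. This is an illustration of the idea only, not VW's implementation: crc32 stands in for VW's actual hash function, and the number of weight bits is chosen to match VW's default of 18.

```python
from zlib import crc32

B = 18                  # number of weight bits, like vw's default -b 18
TABLE_SIZE = 1 << B     # 2**18 weight slots, fixed up front

def hashed_index(feature: str) -> int:
    """Map an arbitrary feature name into one of TABLE_SIZE slots."""
    return crc32(feature.encode()) % TABLE_SIZE

# The weight vector is bounded regardless of vocabulary or data size.
weights = [0.0] * TABLE_SIZE

def predict(bag_of_words):
    """Linear prediction over a free-form-text example."""
    return sum(weights[hashed_index(f)] for f in bag_of_words)
```

Because every feature, no matter how rare or novel, lands in one of the 2^18 slots, memory use does not grow with the training set.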

Visit the wiki to learn more.

Getting Started

For the most up-to-date instructions for getting started on Windows, macOS, or Linux, please see the wiki.

vowpal_wabbit's People

Contributors

aartibagul, alekh, arielf, ataymano, bassmang, danmelamed, eisber, elevated-jenkins, hal3, jackgerrits, johnlangford, kaiweichang, lalo, lhoang29, lokitoth, martinpopel, nicknussbaum, olgavrou, peterychang, petricek, pierce1987, pmineiro, rajan-chari, rajanchari, sharatsc, sidsen, stross, trufanov-nok, tzukuoh, wfenchel


vowpal_wabbit's Issues

--raw_predictions doesn't work with --ect and --cb, but works with --oaa

Looks like --raw_predictions works with binary classification and multi-class with --oaa, but not with --ect (and contextual bandit raw_predictions is also broken).

Here are the details:

train.dat:

2 | a c
1 | b d
3 | a b c
1 | b c
3 | a d

test.dat:

2 | a c d
1 | c d

Commands that produce a non-empty raw.txt:

vw -d train.dat --oaa 3 -f cb.model
vw -t -d test.dat -i cb.model --raw_predictions raw.txt

Commands that produce an empty raw.txt:

vw -d train.dat --ect 3 -f cb.model
vw -t -d test.dat -i cb.model --raw_predictions raw.txt

hadoop operation is impossible if there are more blocks in the input data than map slots

Hadoop operation using spanning_tree as described in cluster/README_cluster is impossible if the number of training data blocks is larger than the number of map slots on the cluster.
In my case, there are 784 files in the training data and only 120 map slots.
This means there will be 784 map jobs, but the cluster only allows 120 of them to run at the same time.
Since a vw map job is allowed to terminate only when all map jobs have connected to spanning_tree, the cluster hangs: the finished vw jobs wait for kid_count from spanning_tree, and spanning_tree waits for all map jobs to report in.
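The failure condition can be stated as a one-line check. This is a hypothetical helper to illustrate the barrier semantics, not part of VW:

```python
def spanning_tree_can_complete(num_input_blocks: int, map_slots: int) -> bool:
    """spanning_tree acts as a barrier: every map task must connect before
    any task may finish.  If the cluster cannot schedule all tasks at once,
    the running tasks wait forever for peers that will never start."""
    return num_input_blocks <= map_slots

# The situation from this report: 784 input blocks but only 120 map slots,
# so the barrier can never be satisfied and the job hangs.
```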

wap & invert_hash => Segmentation fault

$ echo '1:0 2:1 | a b' | vw --wap 3 --invert_hash file.txt
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = 
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
Segmentation fault (core dumped)

Without --invert_hash it works.

holdout got broken

In the example below, the constant weight is learned correctly, but holdout reports 0 loss for some reason.

To replicate:

echo "\
1 |
-1 |
1 |
1 |
-1 |
1 |
1 |
-1 |
1 |
1 |
-1 |
1 |
" | vowpalwabbit/vw --passes 100 -k --cache_file bug.cache 
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = bug.cache
Reading datafile = 
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
1.000000   1.000000            1         1.0   1.0000   0.0000        1
1.470775   1.941550            2         2.0  -1.0000   0.3934        1
1.103478   0.736181            4         4.0   1.0000   0.2565        1
1.202135   1.300793            8         8.0  -1.0000   0.4195        1
0.000000   0.000000           16        16.0   1.0000   0.2857        1 h
0.000000   0.000000           32        32.0  -1.0000   0.3600        1 h

finished run
number of examples per pass = 12
passes used = 4
weighted example sum = 49
weighted label sum = 13
average loss = 0 h
best constant = 0.288889
total feature number = 49

LDA hangs on OSX

I attempted to run LDA; vw starts, prints some output with little explanation, and then seems to stall indefinitely with 0 CPU utilization.

The command I ran was vw --lda 100 --lda_alpha 0.01 --lda_rho 0.01 --lda_D 2250 --minibatch 1 --power_t 0.9 --initial_t 1 -b 16 --cache_file vw.ap.cache --passes 10 -p predictions.ap.dat --invert_hash topics.ap.dat ap.dat

and I also tried

vw --lda 100 --lda_alpha 0.01 --lda_rho 0.01 --lda_D 2250 --minibatch 1 --power_t 0.9 --initial_t 1 -b 16 --cache_file vw.ap.cache --passes 10 -p predictions.ap.dat ap.dat

With the same result.

The output is:

Num weight bits = 16
learning rate = 0.5
initial_t = 1
power_t = 0.9
decay_learning_rate = 1
predictions = predictions.ap.dat
can't open: vw.ap.cache, error = No such file or directory
creating cache_file = vw.ap.cache
Reading datafile = ap.dat
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
14.712118  14.712118           1         1.0  unknown   0.0000      180
15.060098  15.408078           2         2.0  unknown   0.0000      164
15.391964  15.723830           4         4.0  unknown   0.0000      193
16.130727  16.869490           8         8.0  unknown   0.0000       44
17.644160  19.157593          16        16.0  unknown   0.0000      144
17.501729  17.359298          32        32.0  unknown   0.0000      160
16.360737  15.219745          64        64.0  unknown   0.0000       70
14.964551  13.568365         128       128.0  unknown   0.0000      109
13.682254  12.399957         256       256.0  unknown   0.0000       79
12.650178  11.618102         512       512.0  unknown   0.0000      179
11.870831  11.091484        1024      1024.0  unknown   0.0000      113

I built against the boost libraries from macports.

If I attempt to run the arguments in the order presented in the LDA tutorial, a segmentation fault occurs. In general, any incorrect argument causes a segmentation fault.

I can email my data file if desired. I attempted to attach it, but evidently only images are supported as attachments.

Some OS/Machine info:

  Software  OS X 10.9.2 (13C64)
  Model Name:   MacBook Air
  Model Identifier: MacBookAir5,2

heisenbug making travis randomly fail

Opening a separate issue for this so we have it documented in one place.

Once in a while, out of the blue, we get a Travis build failure in 'make test'.

A typical difference in reference stderr vs actual looks like this:

--- train-sets/ref/0002a.stderr 2014-03-25 16:39:55.699762013 -0700
+++ stderr.tmp  2014-04-11 15:27:33.978773705 -0700
@@ -28,8 +28,7 @@
 0.000837   0.000906        65536     65536.0   0.5043   0.5774      197

 finished run
-number of examples per pass = 74746
-passes used = 1
+number of examples = 74746
 weighted example sum = 69521
 weighted label sum = 35113.3
 average loss = 0.000841171

This difference depends on the value of all->current_pass in main.cc.

It looks as if there's a race condition between two threads in updating (vs. checking) this variable, so sometimes we end up with it being 0 and other times with it being > 0 at the point of the check.

Here's a little script that you may find handy to reproduce the problem. It keeps trying until the problem happens:

#!/bin/bash

case "$@" in
    [1-9]*) TestNos=$1 ;;
    *) TestNos=5 ;;
esac

case `pwd` in
    */test) : ;;
    *) cd test ;;
esac

OutLog=/tmp/vw.heisen.bug

counter=1

while true; do
    echo "=== Run# $counter" 1>&2
    ./RunTests -d -fe -E 0.001 $TestNos >$OutLog 2>&1

    case $? in
        0)  : -- keep trying ;;
        *)  echo
            echo "=== failed after $counter attempts."
            # Need to auto-find the relevant stderr
            diff_files=`grep FAILED $OutLog | perl -ne '
                /\(([^)]+)\)[^(]+\(([^)]+)/ && print "$1 $2\n"'`
            diff -u $diff_files
            exit 1
            ;;
    esac
    counter=$(($counter + 1))
done

Regularization per namespace

It is currently possible to apply L1 and L2 regularization to all parameters. However, the physical dimension can be different for each namespace. For instance, if I have a person's age, some other characteristics, and his genotype, I don't want to put the same regularization on his characteristics as on his genotype. I may want to put a strong L1 regularization on the genotype and no regularization on the age.

It would be nice to have an option like '--l1 namespace value' where we could specify the regularization parameter for a specific namespace.
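The proposed behavior could be sketched as per-namespace soft-thresholding, the proximal step used for L1. This is only an illustration of the idea with made-up namespace names and lambda values, not VW code or a real VW option:

```python
def soft_threshold(w, lam):
    """Proximal step for L1: shrink w toward zero by lam, clamping at zero."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

# Hypothetical per-namespace regularization strengths.
l1_by_namespace = {"genotype": 100.0, "characteristics": 0.01, "age": 0.0}

def regularize(weights_by_namespace):
    """Apply each namespace's own L1 shrinkage to its weights."""
    return {
        ns: [soft_threshold(w, l1_by_namespace.get(ns, 0.0)) for w in ws]
        for ns, ws in weights_by_namespace.items()
    }
```

With this setup, a strong lambda drives most genotype weights to exactly zero while leaving the age weight untouched.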

audit and progress feature counts broken for --lrq

The mini example below demonstrates two issues:

  1. audit outputs just k (=2) latent features although there are in fact 2 x k (=4) latent features (left and right). Also, the features are all named the same.
  2. The standard progress output then reports just 3 features (it probably does not include the latent features at all, whereas -q and matrix factorization include the expanded feature set).
[vpetricek@hadoop0000 benchmark]$ echo "|u 1 |a 2" | vw --lrq ua2 --audit
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
creating low rank quadratic features for pairs: ua2
using no cache
Reading datafile =
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.000000
        u^1:60292:1:0@0 a^2:108240:1:0@0        lrq^u^a:108241:0:0@0    lrq^u^a:108242:0:0@0    Constant:202096:1:0@0
0.000000   0.000000            1         1.0  unknown   0.0000        3

finished run
number of examples per pass = 1
passes used = 1
weighted example sum = 1
weighted label sum = 0
average loss = 0
best constant = nan
total feature number = 3
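For context, a rank-k low-rank quadratic interaction learns k left and k right latent weights per crossed namespace pair, which is why 2k latent features are expected above, not k. A sketch of the prediction contribution (an illustration of the math only, not VW internals; argument names are made up):

```python
def lrq_term(x_u, left, x_a, right):
    """Rank-k quadratic term: sum over factors f of (x_u . l_f) * (x_a . r_f).
    left[f] / right[f] are the f-th latent weight vectors for each side."""
    k = len(left)  # rank; right must have the same length
    return sum(
        sum(x * w for x, w in zip(x_u, left[f]))
        * sum(x * w for x, w in zip(x_a, right[f]))
        for f in range(k)
    )
```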

make gd_mf_weights failing on Mac

The command make gd_mf_weights in the library directory fails for me on OSX 10.8.3 (output below). (I have been able to build and run vw, however.) I am on the latest version (706f883) of the repo. There is probably something obvious I am doing wrong - I don't have much experience building C++.

Output of make gd_mf_weights below:

g++     gd_mf_weights.cc   -o gd_mf_weights
Undefined symbols for architecture x86_64:
  "VW::initialize(std::basic_string<char, std::char_traits<char>, std::allocator<char> >)", referenced from:
      _main in ccusci3L.o
  "VW::read_example(vw&, char*)", referenced from:
      _main in ccusci3L.o
  "VW::finish_example(vw&, example*)", referenced from:
      _main in ccusci3L.o
  "VW::finish(vw&)", referenced from:
      _main in ccusci3L.o
  "boost::program_options::to_internal(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)", referenced from:
      std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > boost::program_options::to_internal<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&)in ccusci3L.o
  "boost::program_options::variables_map::variables_map()", referenced from:
      _main in ccusci3L.o
  "boost::program_options::options_description::add_options()", referenced from:
      _main in ccusci3L.o
  "boost::program_options::options_description::m_default_line_length", referenced from:
      _main in ccusci3L.o
  "boost::program_options::options_description::options_description(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, unsigned int)", referenced from:
      _main in ccusci3L.o
  "boost::program_options::options_description_easy_init::operator()(char const*, boost::program_options::value_semantic const*, char const*)", referenced from:
      _main in ccusci3L.o
  "boost::program_options::options_description_easy_init::operator()(char const*, char const*)", referenced from:
      _main in ccusci3L.o
  "boost::program_options::arg", referenced from:
      boost::program_options::typed_value<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, char>::name() constin ccusci3L.o
  "boost::program_options::store(boost::program_options::basic_parsed_options<char> const&, boost::program_options::variables_map&, bool)", referenced from:
      _main in ccusci3L.o
  "boost::program_options::detail::cmdline::set_additional_parser(boost::function1<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&>)", referenced from:
      boost::program_options::basic_command_line_parser<char>::extra_parser(boost::function1<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&>)in ccusci3L.o
  "boost::program_options::detail::cmdline::set_options_description(boost::program_options::options_description const&)", referenced from:
      boost::program_options::basic_command_line_parser<char>::options(boost::program_options::options_description const&)in ccusci3L.o
  "boost::program_options::detail::cmdline::get_canonical_option_prefix()", referenced from:
      boost::program_options::basic_command_line_parser<char>::run()  in ccusci3L.o
  "boost::program_options::detail::cmdline::run()", referenced from:
      boost::program_options::basic_command_line_parser<char>::run()  in ccusci3L.o
  "boost::program_options::detail::cmdline::style(int)", referenced from:
      boost::program_options::basic_command_line_parser<char>::style(int)in ccusci3L.o
  "boost::program_options::detail::cmdline::cmdline(std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&)", referenced from:
      boost::program_options::basic_command_line_parser<char>::basic_command_line_parser(int, char const* const*)in ccusci3L.o
  "boost::program_options::notify(boost::program_options::variables_map&)", referenced from:
      _main in ccusci3L.o
  "boost::program_options::validate(boost::any&, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> >*, int)", referenced from:
      boost::program_options::typed_value<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, char>::xparse(boost::any&, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) constin ccusci3L.o
  "boost::program_options::operator<<(std::basic_ostream<char, std::char_traits<char> >&, boost::program_options::options_description const&)", referenced from:
      _main in ccusci3L.o
  "boost::program_options::value_semantic_codecvt_helper<char>::parse(boost::any&, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, bool) const", referenced from:
      vtable for boost::program_options::typed_value<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, char>in ccusci3L.o
  "typeinfo for boost::program_options::value_semantic_codecvt_helper<char>", referenced from:
      typeinfo for boost::program_options::typed_value<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, char>in ccusci3L.o
  "vtable for boost::program_options::variables_map", referenced from:
      boost::program_options::variables_map::~variables_map()in ccusci3L.o
  NOTE: a missing vtable usually means the first non-inline virtual member function has no definition.
  "vtable for boost::program_options::value_semantic_codecvt_helper<char>", referenced from:
      boost::program_options::value_semantic_codecvt_helper<char>::value_semantic_codecvt_helper()in ccusci3L.o
      boost::program_options::value_semantic_codecvt_helper<char>::~value_semantic_codecvt_helper()in ccusci3L.o
  NOTE: a missing vtable usually means the first non-inline virtual member function has no definition.
ld: symbol(s) not found for architecture x86_64
collect2: ld returned 1 exit status
make: *** [gd_mf_weights] Error 1

No error messages issued when Boost is absent from system

I think it would be useful to have a clear, specific error (or warning) message from the vw Makefile when Boost is not detected on the system. The program appears to compile and link just fine in the absence of the Boost library, but the resulting vw executable cannot parse any command line arguments; instead, an error is issued whenever arguments are used:
$ vw --help
vw: unknown option help

Also note that I am using gcc on Red Hat Linux.

Here's what I did to resolve this:

  1. Download and compile Boost following step 5 (Prepare to Use a Boost Library Binary, i.e. using bootstrap.sh and b2 toolset=gcc), since program_options requires this, as described here: http://www.boost.org/doc/libs/1_55_0/more/getting_started/unix-variants.html#get-boost
  2. Update the BOOST_INCLUDE and BOOST_LIBRARY variables in the primary vw Makefile to point to wherever you installed the Boost include/ and lib/ directories
  3. make vw as otherwise described

segfault on "-q"

I have just two namespaces, "domain" and "keyword", and when I pass "-q dk" on the command line, I get a segfault.

Best Constant vs Constant Feature

This is more of an expected-usage question than a code issue, but I was wondering if you had some insight into how the "best constant" relates to the constant feature calculated by VW.

Specifically, I'm using squared loss on a data set and, though all of my values fall between 0.0 and 1.0 and the best constant is right around what I'd expect, the constant feature (along with many others) ends up being significantly negative. When predicting on new data instances, a prediction of 0.0 is returned an overwhelming percentage of the time, and if I compute the values myself by summing the weights at the hashed indices, the result is almost always negative.

I was wondering if you had any insight into why this might be happening.

Thanks,
Jeff

Validator doesn't work with input format example

The data format validator output does not agree with the "Example with a tag" from the input format page. If that example is cut and pasted into the validator, the importance weight "1.0" is not read. However, if a space is put after the word zebra and before the pipe "|", the weight is correctly read. As a result of this discrepancy, it is unclear whether a space after zebra is needed.

Test24 fails under OpenVZ

On an openvz-amd64 machine, Test #24 (Actives simulation) gives slightly different stderr output (and actually a better loss) than expected:

RunTests: test 24: stdout OK
 --- diff -N --minimal --side-by-side --suppress-common-lines --ignore-all-space --strip-trailing-cr -W 150 train-sets/ref/active-simulation.t24.stderr stderr.tmp
1.005870   1.244963            2       502.2  -1.0000   0.1158      10    |     1.005869   1.244949            2       502.2  -1.0000   0.1158      10
1.007812   1.170191            3       508.2  -1.0000   0.0818       5    |     1.007810   1.170181            3       508.2  -1.0000   0.0817       5
 ...
average loss = 0.240509                                                   |     average loss = 0.235951
best constant = -0.136917                                                 |     best constant = -0.157285
total queries = 889                                                       |     total queries = 884
RunTests: test 24: FAILED: ref(train-sets/ref/active-simulation.t24.stderr) != stderr(stderr.tmp)

On 32-bit machines, the test passes. On a 64-bit machine with uname -r 2.6.32-13-pve, it passes as well (even if I use the vw binary compiled on the 2.6.32-5-openvz-amd64 machine).

This is a low-priority issue with no real error. Feel free to close it if no one is interested in finding/fixing the cause.

Allow timeouts on daemon connections

Currently, VW doesn't seem to set any sort of timeout on socket connections. This can lead to issues, particularly in situations where many short-lived connections are created. Hung connections can quickly accumulate, blocking all of the worker processes. (The server could also benefit from a select() loop on connections, but that is a bigger design change.)
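In the meantime, client-side hangs can at least be bounded by setting a socket timeout. A minimal sketch using the standard Python socket module; the query_daemon helper and its one-line request/response framing are assumptions for illustration, not a VW API:

```python
import socket

def query_daemon(host, port, example_line, timeout_seconds=5.0):
    """Send one example line to a daemon and read back one response line,
    giving up after timeout_seconds instead of hanging forever."""
    with socket.create_connection((host, port), timeout=timeout_seconds) as sock:
        sock.settimeout(timeout_seconds)  # also bounds send/recv, not just connect
        sock.sendall(example_line.encode() + b"\n")
        with sock.makefile("r") as reader:
            return reader.readline().strip()
```

If the daemon stalls, the call raises socket.timeout rather than blocking a worker indefinitely.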

Example using DAgger

Thanks for continuing with this project, and focusing some more on documentation.

Could you please provide an example using DAgger?

Also on the other examples, I think it would be easier if you provided a small sample of the training data to show what is going into the system.

syntax error building

I am building on a Mac and I think I have all of the tools listed, but it has been a long time since I did this type of work, so apologies if I've missed something.

./autogen.sh
make
In searn.cc, around line 710, there is an assert(priv->...) which seems like it should be srn.priv->...

I changed it locally and it compiles.

I am afraid to just fix it.

display progress of file loading in vw-varinfo

It would be very useful if the vw-varinfo script displayed progress as it reads the input data file to collect namespace information.

This would be particularly useful for large datasets.

latest `vw` fails test 59 due to missing training set file

While running make test on the latest jl master, I get a test failure:

RunTests: test 59: '/usr/bin/timeout 20 ../vowpalwabbit/vw -d train-sets/argmax_data -k -c --passes 20 --search_rollout oracle --search_alpha 1e-8 --search_task argmax --search 2 --holdout_off' failed (exitcode=1)

The reason is that test/train-sets/argmax_data is missing (running the same command as test 59 standalone from the top level, I get):

Reading datafile = test/train-sets/argmax_data
can't open: test/train-sets/argmax_data, error = No such file or directory
vw: std::exception

Hal: can you check this in?

Thanks!

Average loss of 1 (and incorrect labels) reported in --binary mode

Perhaps the easiest way to explain this is to show output with and without the --binary option, with everything else held fixed. My input lines look like:

-1 |Text hello this is an example
1 |Text and another one

Without --binary everything behaves itself:

$ vw -kcd examples.txt --loss_function logistic
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
creating cache_file = examples.txt.cache
Reading datafile = examples.txt
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.641010   0.641010            3         3.0  -1.0000  -0.1864        4
0.577645   0.514281            6         6.0  -1.0000  -0.5663        7
0.476604   0.355354           11        11.0  -1.0000  -2.1518       35
0.405201   0.333797           22        22.0  -1.0000  -2.0671       21
0.307740   0.210278           44        44.0  -1.0000  -1.3955       16
0.213920   0.117918           87        87.0  -1.0000  -2.8665       11
0.146646   0.079372          174       174.0  -1.0000  -1.7852        3
0.098552   0.050458          348       348.0  -1.0000  -3.5060        6
0.078723   0.058894          696       696.0  -1.0000  -4.2158       13
0.055447   0.032171         1392      1392.0  -1.0000  -4.3897        5
0.053205   0.050964         2784      2784.0  -1.0000  -8.2964       17
0.046466   0.039726         5568      5568.0  -1.0000  -3.8190        4
0.036872   0.027276        11135     11135.0  -1.0000  -3.5071        2
0.035412   0.033953        22269     22269.0  -1.0000  -8.4659       15

finished run
number of examples per pass = 39442
passes used = 1
weighted example sum = 39442
weighted label sum = -38960
average loss = 0.0311541
best constant = -0.987779
total feature number = 504260

But with --binary I see an average loss of exactly 1 reported in each log line and in the final output. I also see a weighted label sum of exactly 0 (which is incorrect), and a "current label" of 3212836864 (= 0xbf800000, the bit pattern of -1.0f printed as an integer) for all examples, which is also incorrect; the labels are all 1 or -1:

$ vw -kcd examples.txt --loss_function logistic --binary
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
creating cache_file = examples.txt.cache
Reading datafile = examples.txt
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
1.000000   1.000000          3      3.0   3212836864       -1        4
1.000000   1.000000          6      6.0   3212836864       -1        7
1.000000   1.000000         11     11.0   3212836864       -1       35
1.000000   1.000000         22     22.0   3212836864       -1       21
1.000000   1.000000         44     44.0   3212836864       -1       16
1.000000   1.000000         87     87.0   3212836864       -1       11
1.000000   1.000000        174    174.0   3212836864       -1        3
1.000000   1.000000        348    348.0   3212836864       -1        6
1.000000   1.000000        696    696.0   3212836864       -1       13
1.000000   1.000000       1392   1392.0   3212836864       -1        5
1.000000   1.000000       2784   2784.0   3212836864       -1       17
1.000000   1.000000       5568   5568.0   3212836864       -1        4
1.000000   1.000000      11135  11135.0   3212836864       -1        2
1.000000   1.000000      22269  22269.0   3212836864       -1       15

finished run
number of examples per pass = 39442
passes used = 1
weighted example sum = 39442
weighted label sum = 0
average loss = 1
best constant = 0
total feature number = 504260

I'm using the latest master from github. I get similar behaviour using other loss functions (hinge, squared), with multiple passes and in bfgs mode. I also tried labels of 0 and 1 which didn't help.

Test 46 fails on OSX 10.9.2

Hello,

I've just downloaded VW for the first time. It looks like an amazing project.

I have no idea if I've found a bug or if I've done something wrong. I'm using clang and boost 1.55.0 from homebrew on a 2.26 GHz Core 2 Duo MBP with 8GB of RAM and OSX 10.9.2. I cloned the master branch, and the last revision is from 2 May, hash 4b431bf.

Make test gives the following output:

192-168-1-8:vowpal_wabbit acooper$ make test
<snip>
RunTests: test 45: stderr OK
RunTests: test 45: predict OK
RunTests: test 46: stderr OK
Use of uninitialized value $line2 in concatenation (.) or string at ./RunTests line 379, <$sdiff> line 1.
Use of uninitialized value $line2 in split at ./RunTests line 383, <$sdiff> line 1.
--- diff -u --minimal train-sets/ref/sequence_data.nonldf.test-beam20.predict sequence_data.predict
--- train-sets/ref/sequence_data.nonldf.test-beam20.predict 2014-05-04 12:11:55.000000000 +1000
+++ sequence_data.predict   2014-05-04 13:00:41.000000000 +1000
@@ -13,8 +13,8 @@
 1.76761    5 4 3 1 2 
 1.76761    5 4 2 1 2 
 1.76948    5 3 2 1 3 
-1.76948    5 4 3 1 3 
 1.76948    5 4 2 1 3 
+1.76948    5 4 3 1 3 
 1.77888    5 3 2 1 4 
 1.77889    5 4 3 1 4 
 1.77889    5 4 2 1 4 
RunTests: test 46: FAILED: ref(train-sets/ref/sequence_data.nonldf.test-beam20.predict) != predict(sequence_data.predict)
make: *** [test] Error 1

Regularization not working in most recent git version?

Hi,

I am quite new to vw, so excuse me if this is a silly question, but it seems to me that regularization (at least L1) is not working in the current version of the code. Even though STDERR reports usage of L1 ('using l1 regularization = 100'), the result of training is exactly (!) the same as without L1 regularization -- which seems unlikely for l1_lambda = 100.

/toolbox/vowpal_wabbit/vw --sgd --noconstant -c -f out_l1 --l1 100 --passes 1 --readable_model out_l1.readable -k < x0
using l1 regularization = 100
Num weight bits = 18
learning rate = 10
initial_t = 1
power_t = 0.5
final_regressor = out_l1
creating cache_file = .cache
Reading datafile =
num sources = 1
average since example example current current current
loss last counter weight label predict features
4.000000 4.000000 1 1.0 2.0000 0.0000 104
...
1.118318 0.962480 524288 524288.0 2.0000 2.0007 5

finished run
number of examples per pass = 807449
passes used = 1
weighted example sum = 807449
weighted label sum = 1.6647e+06
average loss = 1.01294
best constant = 2.06168
total feature number = 22483110

Here's the run without l1:

vw --sgd --noconstant -c -f out --passes 1 --readable_model out.readable -k < x0
Num weight bits = 18
learning rate = 10
initial_t = 1
power_t = 0.5
final_regressor = out
creating cache_file = .cache
Reading datafile =
num sources = 1
average since example example current current current
loss last counter weight label predict features
4.000000 4.000000 1 1.0 2.0000 0.0000 104
...
1.118318 0.962480 524288 524288.0 2.0000 2.0007 5

finished run
number of examples per pass = 807449
passes used = 1
weighted example sum = 807449
weighted label sum = 1.6647e+06
average loss = 1.01294
best constant = 2.06168
total feature number = 22483110

The output models are exactly the same; diff produces no output:

diff out.readable out_l1.readable

However, with an older version from one of my colleagues, vw behaves as I would expect:

old_vw --sgd --noconstant -c -f out_l1 --l1 100 --passes 1 --readable_model out_l1.readable -k < x0
using l1 regularization = 100
final_regressor = out_l1
Num weight bits = 18
learning rate = 10
initial_t = 1
power_t = 0.5
creating cache_file = .cache
Reading datafile =
num sources = 1
average since example example current current current
loss last counter weight label predict features
4.000000 4.000000 1 1.0 2.0000 0.0000 104
...
4.248032 4.324745 524288 524288.0 2.0000 0.0000 5

finished run
number of examples per pass = 807449
passes used = 1
weighted example sum = 807449
weighted label sum = 1.6647e+06
average loss = 4.30842
best constant = 2.06168
total feature number = 22483110

Here's the version from the old binary:

old_vw --version
7.4.0

spanning_tree as a non-daemon process

By default, spanning_tree runs in daemon mode. Although the pid stored in the pid file can be used to terminate the spanning_tree process, there are still management issues when vw is run at scale: I have observed cases where failures leave unmanaged spanning_tree daemons behind. I have a monitoring process to properly terminate the daemon in such cases, but the monitoring process also sometimes fails, resulting in runaway daemons in our cluster.

It would be great to add a --nondaemon option to the spanning_tree process so that the parent process (or the node manager) can properly control it.
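A minimal sketch of the idea, assuming a hypothetical `run_in_foreground` helper (the flag name comes from the proposal above; everything else is illustrative, not the current spanning_tree code):

```cpp
#include <cstring>

// Hypothetical helper (not in the current code): true when the
// proposed --nondaemon flag is present on the command line.
bool run_in_foreground(int argc, const char* argv[]) {
  for (int i = 1; i < argc; i++)
    if (std::strcmp(argv[i], "--nondaemon") == 0) return true;
  return false;
}

// Startup would then daemonize only when the flag is absent, e.g.:
//   if (!run_in_foreground(argc, argv) && daemon(1, 1) != 0) exit(1);
```

With the flag set, the process stays in the foreground and the node manager can supervise it directly.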

Potential Build Issue

I don't know if this is an actual bug or if it's just because I'm a newbie.

I've heard awesome things about Vowpal Wabbit, so I'm trying to install it on my Vista platform using Cygwin.

I have boost, zlib, and g++ all installed in Cygwin.

I followed this procedure:

  1. Cloned from this site: git://github.com/JohnLangford/vowpal_wabbit.git
  2. cd vowpal_wabbit
  3. execute make

After executing make, I get the following build report:

$ make
which: no clang++ in (/usr/local/bin:/usr/bin:/cygdrive/c/)
cd vowpalwabbit; make -j 8 things
make[1]: Entering directory '/cygdrive/c/vowpal_wabbit/vowpalwabbit'
echo #define PACKAGE_VERSION "`grep AC_INIT ../configure.ac | cut -d '[' -f 3 | cut -d ']' -f 1`" > config.h
g++ -MM topk.cc > topk.d
g++ -MM cbify.cc > cbify.d
g++ -MM bs.cc > bs.d
g++ -MM nn.cc > nn.d
g++ -MM sender.cc > sender.d
g++ -MM loss_functions.cc > loss_functions.d
g++ -MM parser.cc > parser.d
g++ -MM example.cc > example.d
g++ -MM print.cc > print.d
g++ -MM noop.cc > noop.d
g++ -MM bfgs.cc > bfgs.d
g++ -MM mf.cc > mf.d
g++ -MM gd_mf.cc > gd_mf.d
g++ -MM lda_core.cc > lda_core.d
g++ -MM learner.cc > learner.d
g++ -MM gd.cc > gd.d
g++ -MM accumulate.cc > accumulate.d
g++ -MM parse_args.cc > parse_args.d
g++ -MM network.cc > network.d
g++ -MM scorer.cc > scorer.d
g++ -MM parse_example.cc > parse_example.d
g++ -MM searn_sequencetask.cc > searn_sequencetask.d
g++ -MM searn.cc > searn.d
g++ -MM wap.cc > wap.d
g++ -MM cb_algs.cc > cb_algs.d
g++ -MM cb.cc > cb.d
g++ -MM csoaa.cc > csoaa.d
g++ -MM cost_sensitive.cc > cost_sensitive.d
g++ -MM lrq.cc > lrq.d
g++ -MM binary.cc > binary.d
g++ -MM autolink.cc > autolink.d
g++ -MM ect.cc > ect.d
g++ -MM oaa.cc > oaa.d
g++ -MM multiclass.cc > multiclass.d
g++ -MM simple_label.cc > simple_label.d
g++ -MM rand48.cc > rand48.d
g++ -MM cache.cc > cache.d
g++ -MM unique_sort.cc > unique_sort.d
g++ -MM parse_primitives.cc > parse_primitives.d
g++ -MM parse_regressor.cc > parse_regressor.d
g++ -MM io_buf.cc > io_buf.d
g++ -MM global_data.cc > global_data.d
g++ -MM memory.cc > memory.d
g++ -MM hash.cc > hash.d
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c main.cc -o main.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c hash.cc -o hash.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c memory.cc -o memory.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c global_data.cc -o global_data.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c io_buf.cc -o io_buf.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c parse_regressor.cc -o parse_regressor.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c parse_primitives.cc -o parse_primitives.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c unique_sort.cc -o unique_sort.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c cache.cc -o cache.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c rand48.cc -o rand48.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c simple_label.cc -o simple_label.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c multiclass.cc -o multiclass.o
rand48.cc:11:14: warning: use of C++0x long long integer constant [-Wlong-long]
uint64_t a = 0xeece66d5deece66dULL;
^
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c oaa.cc -o oaa.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c ect.cc -o ect.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c autolink.cc -o autolink.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c binary.cc -o binary.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c lrq.cc -o lrq.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c cost_sensitive.cc -o cost_sensitive.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c csoaa.cc -o csoaa.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c cb.cc -o cb.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c cb_algs.cc -o cb_algs.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c wap.cc -o wap.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c searn.cc -o searn.o
lrq.cc: In instantiation of ‘void LRQ::predict_or_learn(LRQ::LRQstate&, LEARNER::learner&, example&) [with bool is_learn = true]’:
lrq.cc:239:53: required from here
lrq.cc:136:76: warning: narrowing conversion of ‘left’ from ‘unsigned char’ to ‘char’ inside { } is ill-formed in C++11 [-Wnarrowing]
char subname[4] = { left, '^', right, '\0' };
^
lrq.cc:136:76: warning: narrowing conversion of ‘right’ from ‘unsigned char’ to ‘char’ inside { } is ill-formed in C++11 [-Wnarrowing]
lrq.cc: In instantiation of ‘void LRQ::predict_or_learn(LRQ::LRQstate&, LEARNER::learner&, example&) [with bool is_learn = false]’:
lrq.cc:240:56: required from here
lrq.cc:136:76: warning: narrowing conversion of ‘left’ from ‘unsigned char’ to ‘char’ inside { } is ill-formed in C++11 [-Wnarrowing]
lrq.cc:136:76: warning: narrowing conversion of ‘right’ from ‘unsigned char’ to ‘char’ inside { } is ill-formed in C++11 [-Wnarrowing]
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c searn_sequencetask.cc -o searn_sequencetask.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c parse_example.cc -o parse_example.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c scorer.cc -o scorer.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c network.cc -o network.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c parse_args.cc -o parse_args.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c accumulate.cc -o accumulate.o
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c gd.cc -o gd.o
searn.cc:283:31: warning: integer constant is too large for ‘long’ type [-Wlong-long]
(ceil( log10((float)10000000000+1) ))) + 1; // max action id
^
searn.cc:1864:5: warning: integer constant is too large for ‘long’ type [-Wlong-long]
size_t neighbor_constant = 8349204823;
^
searn.cc: In function ‘void Searn::print_update(vw&, Searn::searn&)’:
searn.cc:1847:43: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 3 has type ‘size_t {aka unsigned int}’ [-Wformat=]
fprintf(stderr, " %15lusec", num_sec);
^
searn.cc:1847:43: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 3 has type ‘size_t {aka unsigned int}’ [-Wformat=]
searn.cc: In function ‘void Searn::add_neighbor_features(Searn::searn&)’:
searn.cc:1864:32: warning: large integer implicitly truncated to unsigned type [-Woverflow]
size_t neighbor_constant = 8349204823;
^
g++ -march=native -Wall -pedantic -O3 -fomit-frame-pointer -fno-strict-aliasing -ffast-math -D_FILE_OFFSET_BITS=64 -DNDEBUG -I /usr/include -c learner.cc -o learner.o
searn_sequencetask.cc: In function ‘void SequenceSpanTask::structured_predict(Searn::searn&, std::vector<example*>)’:
searn_sequencetask.cc:244:91: error: call of overloaded ‘predict(example*&, uint32_t, v_array<uint32_t>*&)’ is ambiguous
last_prediction = srn.predict(ec[i], MULTICLASS::get_example_label(ec[i]), y_allowed);
^
searn_sequencetask.cc:244:91: note: candidates are:
In file included from searn_sequencetask.h:9:0,
from searn_sequencetask.cc:6:
searn.h:47:14: note: uint32_t Searn::searn::predict(example*, uint32_t, v_array<uint32_t>*)
uint32_t predict(example* ec, uint32_t one_ystar, v_array<uint32_t>* yallowed=NULL); // if there is a single oracle action
^
In file included from searn_sequencetask.h:9:0,
from searn_sequencetask.cc:6:
searn.h:51:14: note: uint32_t Searn::searn::predict(example*, size_t, v_array<uint32_t>*, v_array<uint32_t>*)
uint32_t predict(example* ecs, size_t ec_len, v_array<uint32_t>* ystar, v_array<uint32_t>* yallowed=NULL); // if there is a single oracle action
^
Makefile:28: recipe for target 'searn_sequencetask.o' failed
make[1]: *** [searn_sequencetask.o] Error 1
make[1]: *** Waiting for unfinished jobs....
make[1]: Leaving directory '/cygdrive/c/vowpal_wabbit/vowpalwabbit'
Makefile:75: recipe for target 'vw' failed
make: *** [vw] Error 2

save_resume makes results worse

When you want to train N subsequent models on N data sets, you must use the --save_resume flag in the first N-1 trainings, but you SHOULD NOT use it in the last (N-th) training if you want to get the same results as when training on all N data sets concatenated. John Langford confirmed that "this looks bugly".

I attach an example with N=2.
Not using --save_resume at all makes the final test loss (0.223781) only slightly different from the baseline (0.225275).
However, using --save_resume in both trainings makes the final test loss much worse (0.267824).

### prepare data train=set00,set01 test=set02
cd vowpal_wabbit/test/train-sets
split -dl 4000 rcv1_small.dat set

### train on concatenated training sets
cat set00 set01 | vw --loss_function=logistic -f final.model
cat set02 | vw --loss_function=logistic -i final.model -t
# average loss = 0.225275

### train separately, first with --save_resume, second without
cat set00 | vw --loss_function=logistic -f A.model --save_resume
cat set01 | vw --loss_function=logistic -i A.model -f final.model
cat set02 | vw --loss_function=logistic -i final.model -t
# average loss = 0.225275

### train separately, both models with --save_resume
cat set00 | vw --loss_function=logistic -f A.model --save_resume
cat set01 | vw --loss_function=logistic -i A.model --save_resume -f final.model
cat set02 | vw --loss_function=logistic -i final.model -t
# average loss = 0.267824

### train separately, without --save_resume
cat set00 | vw --loss_function=logistic -f A.model
cat set01 | vw --loss_function=logistic -i A.model -f final.model
cat set02 | vw --loss_function=logistic -i final.model -t
# average loss = 0.223781

Can Vowpal Wabbit handle a data size of ~90 GB?

We have extracted features from search engine query log data, and the feature file (in Vowpal Wabbit's input format) amounts to 90.5 GB. The reason for this huge size is necessary redundancy in our feature construction. Vowpal Wabbit claims to be able to handle terabytes of data in a matter of a few hours, and VW uses a hash function which takes almost no RAM. But when we run logistic regression using VW on our data, it uses up all of the RAM within a few minutes and then stalls. This is the command we use:

vw -d train_output --power_t 1 --cache_file train.cache -f data.model --compressed --loss_function logistic --adaptive --invariant --l2 0.8e-8 --invert_hash train.model

train_output is the input file we want to train VW on, and train.model is the expected model obtained after training.

Any help is welcome!

Segmentation fault with csoaa_ldf

echo '1:0 |f
2:1 |f a

shared | a
1:1 |f
2:0 |f b
' | vw --csoaa_ldf=m

results in a segmentation fault, without any warning about invalid input.
The problem here is probably the example with no features (just the namespace f).
If vw is not able to handle such examples, it should exit with an indicative error message.

It took me several hours to find this minimal test case. At one point I had 945 example blocks in the training data, and removing any one of them eliminated the segfault. However, only one example was actually defective (one line without features), and I was not able to locate it. (The cause of the defect was that I had forgotten to delete pipe symbols from feature names, so one line accidentally appeared as if it had many namespaces and no features.)

Even in this minimal test case, you can eliminate the segfault (without correcting its real cause) by:

  • adding a feature to the first line, or
  • adding a feature to the penultimate line, or
  • removing the "shared" line, or
  • swapping the last two lines, or
  • adding a new (valid) example-block before/after the first one.
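For anyone hunting a similar defect in a large file, here is a rough sketch of a per-line check (the function name and parsing rules are mine, not vw's; it only approximates vw's input grammar):

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Rough heuristic, not vw's real parser: does the text after the last
// '|' contain any feature token?  A namespace name, if present, is
// glued directly to the '|' (e.g. "|f"), so "1:0 |f" has no features
// while "shared | a" has the feature "a".
bool has_features_after_namespace(const std::string& line) {
  size_t bar = line.rfind('|');
  if (bar == std::string::npos) return false;  // no namespace marker at all
  std::istringstream rest(line.substr(bar + 1));
  std::string tok;
  bool named_ns =
      bar + 1 < line.size() && !std::isspace((unsigned char)line[bar + 1]);
  if (named_ns) rest >> tok;              // skip the namespace name
  return static_cast<bool>(rest >> tok);  // any remaining token is a feature
}
```

Running this over every line of the training data would have flagged the single defective example-block directly instead of requiring bisection by hand.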

spanning_tree scalability issue when accepting new connections

When running vw at scale, we observed cases where a vw worker cannot connect to the spanning_tree server in the all_reduce_init function. The problem seems to be that the spanning_tree server performs connection acceptance as well as connection initialization (the initial conversation between the vw worker and the spanning_tree process) in a single thread. The spanning_tree server is therefore not accepting new connections while it is busy initializing other connections. This becomes an issue when many thousands of workers try to connect to the spanning_tree at the same time.

The proper solution would be to fix the spanning_tree connection acceptance to be more scalable. Alternatively, the vw worker could retry if the connection fails.

As a workaround, I put a random delay at the start of all_reduce_init to spread the connection-establishment load evenly over time. I tried a couple of delays, and the one below worked for me:

srand(node);
int range = total / 50 + 1; // e.g. 1500 nodes -> 300s
int stime = (rand() % range) + 1;
cerr << "sleep for " << stime << " out of " << range << endl;
sleep(stime);
cerr << "endsleep" << endl;

The delay is relatively high, but it is unavoidable until there is a proper fix for the spanning_tree scalability issue.
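Wrapped as a helper (the name is mine, not vw's), the workaround always yields a delay in [1, total/50 + 1], and the same node id always gets the same delay:

```cpp
#include <cstdlib>

// Per-node random startup delay (sketch of the workaround above):
// seeding with the node id makes the jitter deterministic per node
// while spreading different nodes across the [1, range] window.
int jitter_seconds(int node, int total) {
  std::srand(node);
  int range = total / 50 + 1;       // e.g. total = 1500 gives a 31-second window
  return (std::rand() % range) + 1; // uniform in [1, range]
}
```

Note that with total = 1500, total/50 + 1 is 31, so the divisor may need tuning if a larger window (such as the ~300 s mentioned in the original comment) is intended.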

Failing tests (17, 24, 31, 32, 42, 46, 50, 54, 58) with Cygwin build (head and latest release)

Here are my make test results with -d:

$ ./RunTests -d -f -E 0.001 ../vowpalwabbit/vw ../vowpalwabbit/vw
testing cygwin Testing vw: ../vowpalwabbit/vw
Testing lda: ../vowpalwabbit/vw
RunTests: '-D' to see any diff output
RunTests: '-o' to force overwrite references
RunTests: '-e' to abort on first failure
RunTests: test 1: minor (<0.001) precision differences ignored
RunTests: test 1: stderr OK
RunTests: test 2: stderr OK
RunTests: test 2: predict OK
RunTests: test 3: stderr OK
RunTests: test 4: stdout OK
RunTests: test 4: stderr OK
RunTests: test 5: minor (<0.001) precision differences ignored
RunTests: test 5: stderr OK
RunTests: test 6: minor (<0.001) precision differences ignored
RunTests: test 6: stderr OK
RunTests: test 6: minor (<0.001) precision differences ignored
RunTests: test 6: predict OK
RunTests: test 7: stderr OK
RunTests: test 8: stderr OK
RunTests: test 8: minor (<0.001) precision differences ignored
RunTests: test 8: predict OK
RunTests: test 9: stderr OK
RunTests: test 9: predict OK
RunTests: test 10: stderr OK
RunTests: test 10: predict OK
RunTests: test 11: stderr OK
RunTests: test 12: stderr OK
RunTests: test 13: stderr OK
RunTests: test 14: stdout OK
RunTests: test 14: stderr OK
RunTests: test 15: stdout OK
RunTests: test 15: stderr OK
RunTests: test 16: stdout OK
RunTests: test 16: minor (<0.001) precision differences ignored
RunTests: test 16: stderr OK
--- c:/cygwin/bin/diff.exe -u --minimal train-sets/ref/wiki1K.stderr stderr.tmp
--- train-sets/ref/wiki1K.stderr 2014-06-07 10:19:30.457299200 -0700
+++ stderr.tmp 2014-06-07 10:55:49.469939700 -0700
@@ -7,21 +7,21 @@
num sources = 1
average since example example current current current
loss last counter weight label predict features
-10.149301 10.149301 1 1.0 unknown 0.0000 732
-10.369812 10.590324 2 2.0 unknown 0.0000 27
-10.325923 10.282033 4 4.0 unknown 0.0000 53
-10.401762 10.477602 8 8.0 unknown 0.0000 60
-10.356291 10.310820 16 16.0 unknown 0.0000 26
-10.472940 10.589588 32 32.0 unknown 0.0000 125
-10.474844 10.476749 64 64.0 unknown 0.0000 313
-10.425304 10.375763 128 128.0 unknown 0.0000 50
-10.005548 9.585792 256 256.0 unknown 0.0000 33
-9.331692 8.657836 512 512.0 unknown 0.0000 26
+10.149613 10.149613 1 1.0 unknown 0.0000 732
+10.369892 10.590171 2 2.0 unknown 0.0000 27
+10.325892 10.281891 4 4.0 unknown 0.0000 53
+10.401685 10.477478 8 8.0 unknown 0.0000 60
+10.356175 10.310665 16 16.0 unknown 0.0000 26
+10.472894 10.589612 32 32.0 unknown 0.0000 125
+10.474811 10.476727 64 64.0 unknown 0.0000 313
+10.425250 10.375689 128 128.0 unknown 0.0000 50
+9.620644 8.816037 256 256.0 unknown 0.0000 33
+8.965084 8.309524 512 512.0 unknown 0.0000 26

finished run
number of examples = 1000
weighted example sum = 1000
weighted label sum = 0
-average loss = 8.87286
-best constant = -nan
+average loss = 8.61022
+best constant = nan
total feature number = 86919
RunTests: test 17: FAILED: ref(train-sets/ref/wiki1K.stderr) != stderr(stderr.tmp)
RunTests: test 18: stderr OK
RunTests: test 19: stderr OK
RunTests: test 20: stderr OK
RunTests: test 20: predict OK
RunTests: test 21: minor (<0.001) precision differences ignored
RunTests: test 21: stderr OK
RunTests: test 22: stdout OK
RunTests: test 22: minor (<0.001) precision differences ignored
RunTests: test 22: stderr OK
RunTests: test 23: stdout OK
RunTests: test 23: minor (<0.001) precision differences ignored
RunTests: test 23: stderr OK
--- c:/cygwin/bin/diff.exe -u --minimal train-sets/ref/active-simulation.t24.stderr stderr.tmp
--- train-sets/ref/active-simulation.t24.stderr 2014-06-07 10:19:29.248492900 -0700
+++ stderr.tmp 2014-06-07 10:56:09.757455700 -0700
@@ -8,29 +8,29 @@
average since example example current current current
loss last counter weight label predict features
1.000000 1.000000 1 490.2 1.0000 0.0000 50
-0.948717 0.769927 69 630.8 -1.0000 -0.2203 87
-0.858263 0.497263 158 788.8 -1.0000 -0.0597 271
-0.735962 0.333673 222 1028.6 -1.0000 -0.5260 210
-0.681838 0.467658 357 1288.6 -1.0000 0.1321 33
-0.605799 0.319710 569 1631.1 1.0000 0.4440 29
-0.532594 0.240221 802 2039.5 unknown 0.2668 94
-0.481641 0.277913 1168 2549.6 unknown -0.4865 49
-0.413761 0.145727 1492 3195.2 -1.0000 -0.5541 62
-0.362012 0.155045 1957 3994.1 -1.0000 -0.1249 57
-0.348557 0.294790 2420 4993.7 unknown 1.0000 27
-0.326618 0.244195 3208 6322.9 -1.0000 -0.5406 32
-0.292655 0.157249 3824 7908.8 -1.0000 -0.4031 155
-0.307975 0.367171 4921 9955.6 1.0000 -0.6614 42
-0.272426 0.130283 6425 12445.4 unknown 1.0000 83
-0.254926 0.184944 8367 15557.5 unknown -0.9977 193
+0.948708 0.769842 69 630.7 -1.0000 -0.2203 87
+0.858257 0.497185 158 788.8 -1.0000 -0.0597 271
+0.736019 0.333786 222 1028.5 -1.0000 -0.5258 210
+0.681825 0.467234 357 1288.2 -1.0000 0.1319 33
+0.605749 0.319566 569 1630.6 1.0000 0.4441 29
+0.532547 0.240214 802 2038.9 unknown 0.2668 94
+0.481565 0.277754 1168 2549.0 unknown -0.4865 49
+0.413670 0.145694 1492 3194.8 -1.0000 -0.5545 62
+0.361942 0.155087 1957 3993.7 -1.0000 -0.1249 57
+0.348583 0.295150 2419 4992.2 unknown 0.2537 48
+0.326559 0.243901 3208 6322.4 -1.0000 -0.5406 32
+0.292596 0.157204 3824 7908.4 -1.0000 -0.4030 155
+0.307897 0.367010 4921 9955.4 1.0000 -0.6612 42
+0.269469 0.117621 6374 12474.8 -1.0000 -0.8160 38
+0.247154 0.158283 8269 15607.2 -1.0000 -0.2450 34

finished run
number of examples per pass = 10000
passes used = 1
-weighted example sum = 17822.7
-weighted label sum = -1553.71
-average loss = 0.2416
-best constant = -0.181495
+weighted example sum = 17652.2
+weighted label sum = -1387.78
+average loss = 0.239165
+best constant = -0.165277
total feature number = 779394
-total queries = 803
+total queries = 815

RunTests: test 24: FAILED: ref(train-sets/ref/active-simulation.t24.stderr) != stderr(stderr.tmp)
RunTests: test 25: minor (<0.001) precision differences ignored
RunTests: test 25: stderr OK
RunTests: test 25: minor (<0.001) precision differences ignored
RunTests: test 25: predict OK
RunTests: test 26: minor (<0.001) precision differences ignored
RunTests: test 26: stderr OK
RunTests: test 26: minor (<0.001) precision differences ignored
RunTests: test 26: predict OK
RunTests: test 27: stderr OK
RunTests: test 27: minor (<0.001) precision differences ignored
RunTests: test 27: predict OK
RunTests: test 28: stderr OK
RunTests: test 28: minor (<0.001) precision differences ignored
RunTests: test 28: predict OK
RunTests: test 29: minor (<0.001) precision differences ignored
RunTests: test 29: stderr OK
RunTests: test 30: minor (<0.001) precision differences ignored
RunTests: test 30: stderr OK
--- c:/cygwin/bin/diff.exe -u --minimal train-sets/ref/remask.stderr stderr.tmp
--- train-sets/ref/remask.stderr 2014-06-07 10:19:29.945958000 -0700
+++ stderr.tmp 2014-06-07 10:59:08.335425400 -0700
@@ -8,21 +8,21 @@
num sources = 1
average since example example current current current
loss last counter weight label predict features
-0.217147 0.217147 1 1.0 1.0000 0.5340 51
-0.286438 0.355730 2 2.0 0.0000 0.5964 104
-0.184439 0.082439 4 4.0 0.0000 0.2333 135
-0.163583 0.142727 8 8.0 0.0000 0.1131 146
-0.154976 0.146369 16 16.0 1.0000 0.5777 24
-0.175539 0.196103 32 32.0 0.0000 0.1863 32
-0.187968 0.200398 64 64.0 0.0000 0.0000 61
-0.166674 0.145379 128 128.0 1.0000 0.8292 106
+0.217245 0.217245 1 1.0 1.0000 0.5339 51
+0.286434 0.355623 2 2.0 0.0000 0.5963 104
+0.184443 0.082453 4 4.0 0.0000 0.2332 135
+0.163595 0.142747 8 8.0 0.0000 0.1133 146
+0.155013 0.146431 16 16.0 1.0000 0.5775 24
+0.175604 0.196195 32 32.0 0.0000 0.1864 32
+0.188473 0.201341 64 64.0 0.0000 0.0000 61
+0.167958 0.147443 128 128.0 1.0000 0.8168 106

finished run
number of examples per pass = 200
passes used = 1
weighted example sum = 200
weighted label sum = 91
-average loss = 0.135049
+average loss = 0.137744
best constant = 0.455
best constant's loss = 0.247975
total feature number = 15482
RunTests: test 31: FAILED: ref(train-sets/ref/remask.stderr) != stderr(stderr.tmp)
--- c:/cygwin/bin/diff.exe -u --minimal train-sets/ref/remask.final.stderr stderr.tmp
--- train-sets/ref/remask.final.stderr 2014-06-07 10:19:29.940954700 -0700
+++ stderr.tmp 2014-06-07 10:59:08.590594500 -0700
@@ -8,20 +8,20 @@
average since example example current current current
loss last counter weight label predict features
0.000000 0.000000 1 1.0 1.0000 1.0000 51
-0.191596 0.383191 2 2.0 0.0000 0.6190 104
-0.095798 0.000000 4 4.0 0.0000 0.0000 135
-0.091403 0.087007 8 8.0 0.0000 0.0000 146
-0.075219 0.059035 16 16.0 1.0000 1.0000 24
-0.063804 0.052389 32 32.0 0.0000 0.0000 32
-0.081903 0.100002 64 64.0 0.0000 0.0000 61
-0.081818 0.081734 128 128.0 1.0000 1.0000 106
+0.168834 0.337668 2 2.0 0.0000 0.5811 104
+0.084417 0.000000 4 4.0 0.0000 0.0000 135
+0.083558 0.082699 8 8.0 0.0000 0.0000 146
+0.073237 0.062917 16 16.0 1.0000 1.0000 24
+0.063764 0.054291 32 32.0 0.0000 0.0066 32
+0.082466 0.101168 64 64.0 0.0000 0.0000 61
+0.083072 0.083678 128 128.0 1.0000 1.0000 106

finished run
number of examples per pass = 200
passes used = 1
weighted example sum = 200
weighted label sum = 91
-average loss = 0.0706821
+average loss = 0.0737064
best constant = 0.455
best constant's loss = 0.247975
total feature number = 15482
RunTests: test 32: FAILED: ref(train-sets/ref/remask.final.stderr) != stderr(stderr.tmp)
RunTests: test 33: minor (<0.001) precision differences ignored
RunTests: test 33: stderr OK
RunTests: test 34: minor (<0.001) precision differences ignored
RunTests: test 34: stderr OK
RunTests: test 34: minor (<0.001) precision differences ignored
RunTests: test 34: predict OK
RunTests: test 35: minor (<0.001) precision differences ignored
RunTests: test 35: stderr OK
RunTests: test 36: minor (<0.001) precision differences ignored
RunTests: test 36: stderr OK
RunTests: test 37: minor (<0.001) precision differences ignored
RunTests: test 37: stderr OK
RunTests: test 38: minor (<0.001) precision differences ignored
RunTests: test 38: stderr OK
RunTests: test 39: minor (<0.001) precision differences ignored
RunTests: test 39: stderr OK
RunTests: test 40: minor (<0.001) precision differences ignored
RunTests: test 40: stderr OK
RunTests: test 41: minor (<0.001) precision differences ignored
RunTests: test 41: stderr OK
--- c:/cygwin/bin/diff.exe -u --minimal train-sets/ref/lda-2pass-hang.stderr stderr.tmp
--- train-sets/ref/lda-2pass-hang.stderr 2014-06-07 10:19:29.770840900 -0700
+++ stderr.tmp 2014-06-07 10:59:15.127949200 -0700
@@ -8,21 +8,21 @@
num sources = 1
average since example example current current current
loss last counter weight label predict features
-12.797082 12.797082 1 1.0 unknown 0.0000 201
-12.934175 13.071269 2 2.0 unknown 0.0000 220
-13.475964 14.017752 4 4.0 unknown 0.0000 136
-14.728280 15.980597 8 8.0 unknown 0.0000 371
-15.885340 17.042400 16 16.0 unknown 0.0000 138
-17.174329 18.463318 32 32.0 unknown 0.0000 276
-17.150571 17.126814 64 64.0 unknown 0.0000 55
-16.497889 15.845206 128 128.0 unknown 0.0000 131
-15.940465 15.383042 256 256.0 unknown 0.0000 433
-15.306914 14.673363 512 512.0 unknown 0.0000 61
+12.796932 12.796932 1 1.0 unknown 0.0000 201
+12.904143 13.011354 2 2.0 unknown 0.0000 220
+12.981576 13.059008 4 4.0 unknown 0.0000 136
+12.921413 12.861250 8 8.0 unknown 0.0000 371
+12.610071 12.298730 16 16.0 unknown 0.0000 138
+12.427485 12.244900 32 32.0 unknown 0.0000 276
+12.062268 11.697051 64 64.0 unknown 0.0000 55
+11.726858 11.391447 128 128.0 unknown 0.0000 131
+11.544197 11.361536 256 256.0 unknown 0.0000 433
+11.339482 11.134766 512 512.0 unknown 0.0000 61

finished run
number of examples = 1000
weighted example sum = 1000
weighted label sum = 0
-average loss = 14.3325
-best constant = -nan
+average loss = 11.0719
+best constant = nan
total feature number = 193156
RunTests: test 42: FAILED: ref(train-sets/ref/lda-2pass-hang.stderr) != stderr(stderr.tmp)
RunTests: test 43: stderr OK
RunTests: test 44: stderr OK
RunTests: test 44: predict OK
RunTests: test 45: stderr OK
RunTests: test 45: predict OK
RunTests: test 46: stderr OK
RunTests: test 46: sequence_data.predict: no data. Can't compare
--- c:/cygwin/bin/diff.exe -u --minimal train-sets/ref/sequence_data.nonldf.test-beam20.predict sequence_data.predict
--- train-sets/ref/sequence_data.nonldf.test-beam20.predict 2014-06-07 10:19:30.324210000 -0700
+++ sequence_data.predict 2014-06-07 10:59:15.974513200 -0700
@@ -13,8 +13,8 @@
1.76761 5 4 3 1 2
1.76761 5 4 2 1 2
1.76948 5 3 2 1 3
-1.76948 5 4 3 1 3
1.76948 5 4 2 1 3
+1.76948 5 4 3 1 3
1.77888 5 3 2 1 4
1.77889 5 4 3 1 4
1.77889 5 4 2 1 4
RunTests: test 46: FAILED: ref(train-sets/ref/sequence_data.nonldf.test-beam20.predict) != predict(sequence_data.predict)
RunTests: test 47: stderr OK
RunTests: test 48: stderr OK
RunTests: test 48: predict OK
RunTests: test 49: stderr OK
RunTests: test 49: predict OK
--- c:/cygwin/bin/diff.exe -u --minimal train-sets/ref/sequence_data.ldf.test-beam20.stderr stderr.tmp
--- train-sets/ref/sequence_data.ldf.test-beam20.stderr 2014-06-07 10:19:30.147091600 -0700
+++ stderr.tmp 2014-06-07 10:59:16.933152300 -0700
@@ -11,13 +11,13 @@
loss last counter weight label predict features
average since sequence example current label current predicted current cur cur predic. examples
loss last counter weight sequence prefix sequence prefix features pass pol made gener.
-3.000000 3.000000 1 1.000000 [5 4 3 2 1 ] [5 4 3 2 1 ] 0 0 0 1155 0
+4.000000 4.000000 1 1.000000 [5 4 3 2 1 ] [5 4 3 2 1 ] 0 0 0 1155 0

finished run
number of examples per pass = 1
passes used = 1
weighted example sum = 1
weighted label sum = 0
-average loss = 3
+average loss = 4
best constant = -inf
total feature number = 0
RunTests: test 50: FAILED: ref(train-sets/ref/sequence_data.ldf.test-beam20.stderr) != stderr(stderr.tmp)
RunTests: test 50: sequence_data.predict: no data. Can't compare
--- c:/cygwin/bin/diff.exe -u --minimal train-sets/ref/sequence_data.ldf.test-beam20.predict sequence_data.predict
--- train-sets/ref/sequence_data.ldf.test-beam20.predict 2014-06-07 10:19:30.142088700 -0700
+++ sequence_data.predict 2014-06-07 10:59:16.930150700 -0700
@@ -1,6 +1,6 @@
1.78814e-07 5 4 3 2 1
-1 5 4 3 2 4
1 5 4 3 2 5
+1 5 4 3 2 4
1.00594 5 4 3 2 3
1.0319 5 4 3 5 4
1.0332 5 4 3 4 3
@@ -9,13 +9,13 @@
1.1154 5 4 5 4 3
1.13135 5 4 4 3 2
1.24821 5 5 4 3 2
-1.71179 5 4 2 1 2
1.71179 5 3 2 1 2
+1.71179 5 4 2 1 2
1.71179 5 4 3 1 2
1.71438 5 4 2 1 3
1.71438 5 3 2 1 3
1.71438 5 4 3 1 3
1.72888 5 2 1 2 1
1.73146 5 2 1 3 2
-1.73258 5 4 2 1 5
+1.73258 5 3 2 1 5

RunTests: test 50: FAILED: ref(train-sets/ref/sequence_data.ldf.test-beam20.predict) != predict(sequence_data.predict)
RunTests: test 51: stderr OK
RunTests: test 52: stderr OK
RunTests: test 52: predict OK
RunTests: test 53: stderr OK
RunTests: test 53: predict OK
RunTests: test 54: minor (<0.001) precision differences ignored
RunTests: test 54: stderr OK
--- c:/cygwin/bin/diff.exe -u --minimal train-sets/ref/sequencespan_data.nonldf.test-beam20.predict sequencespan_data.predict
--- train-sets/ref/sequencespan_data.nonldf.test-beam20.predict 2014-06-07 10:19:30.396257100 -0700
+++ sequencespan_data.predict 2014-06-07 10:59:17.894792900 -0700
@@ -1,21 +1,21 @@
--0.781838 2 6 1 6 2 1 6 4 5 4 5 1 4 6 6
--0.781838 2 6 1 6 2 1 6 4 5 4 5 1 4 6 6
--0.781838 2 6 1 6 2 1 6 4 5 4 5 1 4 6 7
+-0.781838 2 6 1 6 2 1 6 4 5 4 5 1 6 7 6
-0.781838 2 6 1 6 2 1 6 4 5 4 5 1 4 6 7
--0.781838 2 6 1 6 2 1 6 4 5 4 5 1 6 6 6
--0.781838 2 6 1 6 2 1 6 4 5 4 5 1 6 6 6
-0.781838 2 6 1 6 2 1 6 4 5 4 5 1 6 6 7
-0.781838 2 6 1 6 2 1 6 4 5 4 5 1 6 6 7
+-0.781838 2 6 1 6 2 1 6 4 5 4 5 1 4 6 7
+-0.781838 2 6 1 6 2 1 6 4 5 4 5 1 4 6 6
+-0.781838 2 6 1 6 2 1 6 4 5 4 5 1 4 6 6
-0.781838 2 6 1 6 2 1 6 4 5 4 5 1 6 7 7
-0.781838 2 6 1 6 2 1 6 4 5 4 5 1 6 7 7
+-0.781838 2 6 1 6 2 1 6 4 5 4 5 1 6 6 6
+-0.781838 2 6 1 6 2 1 6 4 5 4 5 1 6 6 6
-0.781838 2 6 1 6 2 1 6 4 5 4 5 1 6 7 6
--0.781838 2 6 1 6 2 1 6 4 5 4 5 1 6 7 6
--0.781811 2 6 1 6 2 6 6 4 5 4 5 1 4 6 6
--0.781811 2 6 1 6 2 6 6 4 5 4 5 1 4 6 6
+-0.781811 2 6 1 6 2 6 7 4 5 4 5 1 6 6 7
-0.781811 2 6 1 6 2 6 6 4 5 4 5 1 4 6 7
-0.781811 2 6 1 6 2 6 6 4 5 4 5 1 4 6 7
--0.781811 2 6 1 6 2 6 6 4 5 4 5 1 6 6 6
--0.781811 2 6 1 6 2 6 6 4 5 4 5 1 6 6 6
--0.781811 2 6 1 6 2 6 6 4 5 4 5 1 6 6 7
--0.781811 2 6 1 6 2 6 6 4 5 4 5 1 6 6 7
+-0.781811 2 6 1 6 2 6 6 4 5 4 5 1 6 7 7
+-0.781811 2 6 1 6 2 6 6 4 5 4 5 1 6 7 7
+-0.781811 2 6 1 6 2 6 7 4 5 4 5 1 6 6 7
+-0.781811 2 6 1 6 2 6 7 4 5 4 5 1 6 6 6
+-0.781811 2 6 1 6 2 6 7 4 5 4 5 1 6 6 6

RunTests: test 54: FAILED: ref(train-sets/ref/sequencespan_data.nonldf.test-beam20.predict) != predict(sequencespan_data.predict)
RunTests: test 55: stderr OK
RunTests: test 56: stderr OK
RunTests: test 56: predict OK
RunTests: test 57: stderr OK
RunTests: test 57: predict OK
RunTests: test 58: stderr OK
RunTests: test 58: sequencespan_data.predict: no data. Can't compare
--- c:/cygwin/bin/diff.exe -u --minimal train-sets/ref/sequencespan_data.nonldf-bilou.test-beam20.predict sequencespan_data.predict
--- train-sets/ref/sequencespan_data.nonldf-bilou.test-beam20.predict 2014-06-07 10:19:30.361234200 -0700
+++ sequencespan_data.predict 2014-06-07 10:59:18.975512900 -0700
@@ -8,10 +8,10 @@
0.638117 2 1 1 2 2 1 6 7 7 7 7 1 6 4 2
0.641751 2 1 1 2 2 1 6 7 7 7 7 1 6 6 7
0.665272 2 1 1 2 2 1 6 7 7 7 7 1 6 4 4
-0.665499 2 1 1 2 2 1 6 7 7 7 7 1 6 2 3
-0.665499 2 1 1 2 2 1 6 7 7 7 7 1 6 2 3
0.665499 2 1 1 2 2 1 6 7 7 7 7 1 6 4 5
0.665499 2 1 1 2 2 1 6 7 7 7 7 1 6 4 5
+0.665499 2 1 1 2 2 1 6 7 7 7 7 1 6 2 3
+0.665499 2 1 1 2 2 1 6 7 7 7 7 1 6 2 3
0.668673 2 1 1 2 2 1 6 7 7 7 7 1 6 4 6
0.706717 2 1 6 2 2 1 6 7 7 7 7 1 6 4 1
0.733527 2 1 1 2 2 1 6 7 7 7 7 2 3 3 3
RunTests: test 58: FAILED: ref(train-sets/ref/sequencespan_data.nonldf-bilou.test-beam20.predict) != predict(sequencespan_data.predict)
RunTests: test 59: stderr OK
RunTests: test 60: minor (<0.001) precision differences ignored
RunTests: test 60: stderr OK

Failing Test 54

Standard build on Ubuntu 14.04. All tests pass except for test 54, and the error message is posted below.

--- diff -u --minimal train-sets/ref/sequencespan_data.nonldf.test-beam20.stderr stderr.tmp
--- train-sets/ref/sequencespan_data.nonldf.test-beam20.stderr  2014-06-29 01:31:31.810181273 -0700
+++ stderr.tmp  2014-06-29 02:26:17.542293760 -0700
@@ -12,13 +12,13 @@
 loss       last          counter      weight    label  predict features
 average    since      sequence         example            current label      current predicted  current   cur   cur         predic.        examples
 loss       last        counter          weight          sequence prefix        sequence prefix features  pass   pol            made          gener.
-10.000000  10.000000         1        1.000000   [2 1 1 2 2 1 6 7 7 ..] [2 6 1 6 2 1 6 4 5 ..]        0     0     0            1192               0
+11.000000  11.000000         1        1.000000   [2 1 1 2 2 1 6 7 7 ..] [2 6 1 6 2 1 6 4 5 ..]        0     0     0            1182               0

 finished run
 number of examples per pass = 1
 passes used = 1
 weighted example sum = 1
 weighted label sum = 0
-average loss = 10
+average loss = 11
 best constant = -inf
 total feature number = 0
RunTests: test 54: FAILED: ref(train-sets/ref/sequencespan_data.nonldf.test-beam20.stderr) != stderr(stderr.tmp)
make[1]: *** [test] Error 1
make[1]: Leaving directory `/home/user/vowpal_wabbit'
make: *** [test] Error 2

--invert_hash does not work with --oaa

With older VW (7.4.0), the following code

    echo '1 | a b c
    3 | a d e' | vw --oaa 3 --invert_hash model.inv
    cat  model.inv

prints a nice inverted-hash model with lines like ^a:108232:0.046370. With the newest VW this no longer works (no features are printed).

unclosed connection to spanning_tree process

After the SGD phase finishes, I see that the number of open file descriptors of the spanning_tree process doubles. My guess is that the vw workers do not properly close their connections to spanning_tree after the SGD phase and open new ones for the BFGS phase.

Normally this is not a big deal, but some servers limit the number of open file descriptors, which would prevent vw from functioning when run at scale.
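One hedged way to confirm a descriptor leak like this (my sketch, assuming a Linux box with /proc; the spanning_tree pid and the sampling points are illustrative, not part of the report above):

```python
import os

def open_fd_count(pid="self"):
    """Count the open file descriptors of a process via /proc (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))

# Illustrative usage: sample the spanning_tree process around the phase switch,
# e.g. before = open_fd_count(spanning_tree_pid), then again once BFGS starts;
# a leak is suspected if the count roughly doubles and never drops back.
if __name__ == "__main__":
    print(open_fd_count())  # descriptor count of this script's own process
```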

make test fails on linux x64

make test
vw running test-suite...
(cd test && ./RunTests -d -fe -E 0.001 ../vowpalwabbit/vw ../vowpalwabbit/vw)
testing linux Testing vw: ../vowpalwabbit/vw
Testing lda: ../vowpalwabbit/vw
RunTests: '-D' to see any diff output
RunTests: '-o' to force overwrite references
RunTests: test 1: stderr OK
RunTests: test 2: stderr OK
RunTests: test 2: predict OK
RunTests: test 3: stderr OK
RunTests: test 4: stdout OK
RunTests: test 4: stderr OK
RunTests: test 5: minor (<0.001) precision differences ignored
RunTests: test 5: stderr OK
RunTests: test 6: stderr OK
RunTests: test 6: predict OK
RunTests: test 7: stderr OK
RunTests: test 8: stderr OK
RunTests: test 8: minor (<0.001) precision differences ignored
RunTests: test 8: predict OK
RunTests: test 9: stderr OK
RunTests: test 9: predict OK
RunTests: test 10: stderr OK
RunTests: test 10: predict OK
RunTests: test 11: stderr OK
RunTests: test 12: stderr OK
RunTests: test 13: stderr OK
RunTests: test 14: stdout OK
RunTests: test 14: stderr OK
RunTests: test 15: stdout OK
RunTests: test 15: stderr OK
RunTests: test 16: stdout OK
RunTests: test 16: stderr OK
RunTests: test 17: minor (<0.001) precision differences ignored
RunTests: test 17: stderr OK
RunTests: test 18: stderr OK
RunTests: test 19: stderr OK
RunTests: test 20: stderr OK
RunTests: test 20: predict OK
RunTests: test 21: minor (<0.001) precision differences ignored
RunTests: test 21: stderr OK
RunTests: test 22: stdout OK
RunTests: test 22: stderr OK
RunTests: test 23: stdout OK
RunTests: test 23: stderr OK
--- diff -N --minimal --suppress-common-lines --ignore-all-space --strip-trailing-cr --side-by-side -W 160 train-sets/ref/active-simulation.t24.stderr stderr.tmp
0.923209 0.617612 83 613.4 1.0000 0.0494 46 | 0.923206 0.617608 83 613.4 1.0000 0.0494 46
0.843993 0.528494 169 767.4 unknown -0.0771 87 | 0.843990 0.528514 169 767.4 unknown -0.0771 87
0.772938 0.491751 261 961.3 1.0000 0.1576 13 | 0.772931 0.491736 261 961.3 1.0000 0.1576 13
0.688758 0.352542 430 1201.9 1.0000 -0.0663 18 | 0.688754 0.352555 430 1202.0 1.0000 -0.0663 18
0.614280 0.322631 569 1508.9 1.0000 0.3900 29 | 0.614280 0.322626 569 1508.9 1.0000 0.3900 29
0.540212 0.262029 796 1910.6 -1.0000 -0.5583 46 | 0.540212 0.262017 796 1910.6 -1.0000 -0.5583 46
0.469826 0.188841 1190 2389.2 unknown 0.3206 44 | 0.469827 0.188854 1190 2389.3 unknown 0.3206 44
0.430273 0.272207 1491 2987.1 -1.0000 0.1254 70 | 0.430278 0.272222 1491 2987.1 -1.0000 0.1254 70
0.376527 0.161817 1901 3734.8 unknown -0.4625 48 | 0.376532 0.161816 1901 3734.8 unknown -0.4625 48
0.370952 0.348673 2127 4669.3 unknown -0.4482 179 | 0.370957 0.348679 2127 4669.4 unknown -0.4482 179
0.359658 0.314489 2813 5836.9 unknown -0.2033 35 | 0.359667 0.314513 2813 5836.9 unknown -0.2033 35
0.328361 0.203584 3462 7300.9 -1.0000 -1.0000 151 | 0.328371 0.203595 3462 7300.8 -1.0000 -1.0000 151
0.296159 0.170245 4340 9168.1 -1.0000 -0.5148 70 | 0.296166 0.170237 4340 9168.0 -1.0000 -0.5148 70
0.279052 0.210640 5887 11460.5 unknown 0.5904 40 | 0.277338 0.202040 5933 11460.4 unknown 0.5531 81
0.255498 0.163368 6996 14390.5 1.0000 0.5009 49 | 0.257099 0.176171 7195 14326.4 unknown 0.2642 138
0.240619 0.181113 9174 17988.9 unknown -0.7411 57 | 0.240224 0.172730 9786 17908.3 unknown -0.3206 47
weighted example sum = 19090.5 | weighted example sum = 18393.3
weighted label sum = -1355 | weighted label sum = -1447.61
average loss = 0.240509 | average loss = 0.235951
best constant = -0.136917 | best constant = -0.157285
total queries = 889 | total queries = 884
RunTests: test 24: FAILED: ref(train-sets/ref/active-simulation.t24.stderr) != stderr(stderr.tmp)
make: *** [test] Error 1

uname -a
Linux xxxxx 3.2.0-58-generic #88-Ubuntu SMP Tue Dec 3 17:37:58 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

LDA seems to return more words than the dictionary

Hi,

I ran through the wiki1k example and it worked fine. I don't know the original words, so I made up my own test case and ran LDA. I am using vw 7.6.1.

I have 1,469,029 examples over a total dictionary of 735,338 words. Here is a verification and my LDA run:

$ cat tweets.vw | tr ' ' '\n' | cut -d ":" -f1 | sort -un | tail -2
735337
735338
$ wc -l tweets.vw
 1469029 tweets.vw

$ vw tweets.vw --lda 100 --lda_alpha 0.1 --lda_rho 0.1 --lda_D 1469029 --minibatch 256 --power_t 0.5 --initial_t 1 -b 20 --cache_file tweets.vwcache --passes 1 -p predictions.dat --readable_model topics.dat
Num weight bits = 20
learning rate = 0.5
initial_t = 1
power_t = 0.5
predictions = predictions.dat
can't open: tweets.vwcache, error = No such file or directory
creating cache_file = tweets.vwcache
Reading datafile = tweets.vw
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
14.897726  14.897726           1         1.0  unknown   0.0000        8
14.720736  14.543746           2         2.0  unknown   0.0000       18
14.788717  14.856698           4         4.0  unknown   0.0000        9
14.740150  14.691583           8         8.0  unknown   0.0000       10
14.765978  14.791806          16        16.0  unknown   0.0000       11
14.737989  14.709999          32        32.0  unknown   0.0000       22
14.800730  14.863472          64        64.0  unknown   0.0000        6
14.785085  14.769440         128       128.0  unknown   0.0000        9
14.806227  14.827369         256       256.0  unknown   0.0000        3
12.989683  11.173139         512       512.0  unknown   0.0000       14
11.575379  10.161075        1024      1024.0  unknown   0.0000        7
10.541325  9.507270         2048      2048.0  unknown   0.0000       14
9.820894   9.100464         4096      4096.0  unknown   0.0000       16
9.434462   9.048030         8192      8192.0  unknown   0.0000        4
9.340609   9.246755        16384     16384.0  unknown   0.0000        9
9.106786   8.872963        32768     32768.0  unknown   0.0000       17
9.136901   9.167015        65536     65536.0  unknown   0.0000       22
9.114424   9.091946       131072    131072.0  unknown   0.0000        4
8.946457   8.778491       262144    262144.0  unknown   0.0000        6
8.855298   8.764139       524288    524288.0  unknown   0.0000        7
8.562221   8.269143      1048576   1048576.0  unknown   0.0000        3

finished run
number of examples = 1469029
weighted example sum = 1.46903e+06
weighted label sum = 0
average loss = 8.6193
best constant = 1
total feature number = 15938974

$ wc -l topics.dat
 1048587 topics.dat

topics.dat has 11 header lines, so how did I end up with 1,048,576 words rather than 735,338? The input file is formatted like so:

$ head -2 tweets.vw
| 1:2 2:2 3:2 4:2 5:2 6:2 7:2 8:2
| 6:2 9:2 10:2 11:2 12:2 13:2 14:3 15:2 16:2 17:2 18:2 19:2 20:2 21:2 22:2 23:2 24:2 25:2
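One thing worth checking (my assumption, not something the output above states): with -b 20 the readable model may be dumping one row per hash bucket rather than one row per dictionary word, and 2^20 buckets matches the observed row count exactly:

```python
bits = 20                  # the -b 20 setting from the vw command line above
hash_buckets = 2 ** bits   # size of the weight table with 20 hash bits
rows = 1048587 - 11        # lines in topics.dat minus the 11 header lines

print(hash_buckets, rows)  # both 1048576
# The row count matches the hash space, not the 735,338-word dictionary.
assert rows == hash_buckets
```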

Thanks

simplest classification example

I use Vowpal Wabbit 7.3 for 10-class MNIST classification, but can't get any reasonable results.

My usage of vw:

./vw -d mnist_data/mnist.train --oaa 10 -f mnist_data/mnist.model
./vw -i mnist_data/mnist.model -t mnist_data/mnist.train -p mnist_data/mnist.res

I use 70k digits for training and then test on the same data, but the prediction is always "10".

How can I tune VW properly for MNIST?
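One thing to double-check before tuning (an assumption about the data, since the training file isn't shown): vw's --oaa 10 expects class labels in 1..10 on each line of the `label | features` input format. A hypothetical converter from 0-based MNIST labels (the `p` feature names are made up for illustration):

```python
def to_vw_multiclass(label, pixels):
    """Format one MNIST digit for vw --oaa: labels must be 1..k, not 0..k-1."""
    feats = " ".join(f"p{i}:{v}" for i, v in enumerate(pixels) if v)  # skip zero pixels
    return f"{label + 1} | {feats}"

# Tiny made-up example: digit '0' (label 0 -> vw label 1) with two lit pixels.
line = to_vw_multiclass(0, [0, 128, 0, 255])
print(line)  # -> 1 | p1:128 p3:255
```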

make test on mac: some tests fail, is everything ok?

I tried make test both with and without ./configure; some tests pass while some still fail.
Here are some of the failed tests. I would be very happy to know whether everything is OK. Thanks a lot.

RunTests: test 38: minor (<0.001) precision differences ignored
RunTests: test 38: stderr OK
RunTests: test 39: minor (<0.001) precision differences ignored
RunTests: test 39: stderr OK
RunTests: test 40: stderr OK
RunTests: test 41: minor (<0.001) precision differences ignored
RunTests: test 41: stderr OK
RunTests: test 42: minor (<0.001) precision differences ignored
RunTests: test 42: stderr OK
RunTests: test 43: stderr OK
RunTests: test 44: stderr OK
RunTests: test 44: predict OK
RunTests: test 45: stderr OK
RunTests: test 45: predict OK
RunTests: test 46: stderr OK
Use of uninitialized value $line2 in concatenation (.) or string at ./RunTests line 379, <$sdiff> line 1.
Use of uninitialized value $line2 in split at ./RunTests line 383, <$sdiff> line 1.
RunTests: test 46: FAILED: ref(train-sets/ref/sequence_data.nonldf.test-beam20.predict) != predict(sequence_data.predict)
RunTests: test 47: stderr OK
RunTests: test 48: stderr OK
RunTests: test 48: predict OK
RunTests: test 49: stderr OK
RunTests: test 49: predict OK
RunTests: test 50: FAILED: ref(train-sets/ref/sequence_data.ldf.test-beam20.stderr) != stderr(stderr.tmp)
Use of uninitialized value $line2 in concatenation (.) or string at ./RunTests line 379, <$sdiff> line 1.
Use of uninitialized value $line2 in split at ./RunTests line 383, <$sdiff> line 1.
RunTests: test 50: FAILED: ref(train-sets/ref/sequence_data.ldf.test-beam20.predict) != predict(sequence_data.predict)
RunTests: test 51: stderr OK
RunTests: test 52: stderr OK
RunTests: test 52: predict OK
RunTests: test 53: stderr OK
RunTests: test 53: predict OK
RunTests: test 54: minor (<0.001) precision differences ignored
RunTests: test 54: stderr OK
RunTests: test 54: FAILED: ref(train-sets/ref/sequencespan_data.nonldf.test-beam20.predict) != predict(sequencespan_data.predict)
RunTests: test 55: stderr OK
RunTests: test 56: stderr OK
RunTests: test 56: predict OK
RunTests: test 57: stderr OK
RunTests: test 57: predict OK
RunTests: test 58: stderr OK
Use of uninitialized value $line2 in concatenation (.) or string at ./RunTests line 379, <$sdiff> line 1.
Use of uninitialized value $line2 in split at ./RunTests line 383, <$sdiff> line 1.
RunTests: test 58: FAILED: ref(train-sets/ref/sequencespan_data.nonldf-bilou.test-beam20.predict) != predict(sequencespan_data.predict)
RunTests: test 59: stderr OK

LDA readable output

My readable output is like this:

Version 7.3.2
Min label:0.000000
Max label:1.000000
bits:13
0 pairs: 
0 triples: 
rank:0
lda:50
0 ngram: 
0 skip: 
options:
0 0.010000 0.010000 0.010000 0.014960 0.010000 0.010000 0.010000 0.010000 0.010000 0.010000 0.010000 0.010000 0.010000 0.055702 0.010000 0.014960 0.010000 0.010000 0.010000 0.010000 0.010000 0.010000 0.010097 0.042937 0.031609 0.010000 0.010000 0.010000 0.010000 0.010000 0.010000 0.010000 0.010000 0.010000 0.010000 0.010000 0.010000 0.082389 0.010000 0.010000 0.010000 0.010000 0.010000 0.023432 0.010000 0.010000 0.010000 0.010000 0.014960 0.014960 

This is unlike the expected topic-per-line, word-per-column layout shown in lda.pdf.

There are many more lines than topics, but far fewer than words.

Is this a bug?
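A hedged reading of the dump above (my interpretation, not a confirmed answer): with lda:50, each numbered line looks like one row of the weight table (one hash bucket) carrying 50 per-topic values, so the line count tracks 2^bits rather than the topic or word count. A quick sanity check on a synthetic row of that shape:

```python
bits, topics = 13, 50  # from the header above: bits:13, lda:50

# Synthetic row shaped like the dump: a bucket index followed by 50 topic weights.
row = "0 " + " ".join(["0.010000"] * topics)
tokens = row.split()

assert len(tokens) - 1 == topics  # 50 values per line, one per topic
print(2 ** bits)                  # -> 8192 rows expected: more than 50 topics, fewer than the vocabulary
```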

A possible bug with the new "multiple_occurrences" warning

I have a train.dat with the following contents:

1:2:0.4 | a c
3:0.5:0.2 | b d
4:1.2:0.5 | a b c
2:1:0.3 | b c
3:1.5:0.7 | a d

I run these commands:

vw --cb 4 -d train.dat -f cb.model
vw -i cb.model -f cb.model -d train.dat
vw -i cb.model

The first two commands run fine. The third one causes VW to exit with "vw: multiple_occurrences":

$ vw -i cb.model
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
vw: multiple_occurrences

Is it because -f is being specified twice? If -f is not specified in the second command, cb.model is not updated. Similarly, does --save_resume need to be specified again each time the model is updated, or is it a flag that is just set once?

Thanks
