Winning entry to the Kaggle taxi competition
Hello, thanks for sharing your amazing work.
I'm trying to run your code, but I can't figure out this error when running data/make_valid_cut.py test_times_0:
/h/r/taxi # ❯❯❯ data/make_valid_cut.py test_times_0
Number of cuts: 5
Traceback (most recent call last):
File "data/make_valid_cut.py", line 78, in <module>
make_valid(sys.argv[1], outpath)
File "data/make_valid_cut.py", line 24, in make_valid
for line in taxi_it('train'):
File "/home/root/taxi/data/hdf5.py", line 63, in taxi_it
dataset = TaxiDataset(which_set, filename)
File "/home/root/taxi/data/hdf5.py", line 16, in __init__
super(TaxiDataset, self).__init__(self.data_path, (which_set,), **kwargs)
File "/usr/local/lib/python2.7/dist-packages/fuel/datasets/hdf5.py", line 146, in __init__
"{}.".format(self.available_splits))
ValueError: '('train',)' split is not provided by this dataset. Available splits are (u'test', u'unique_taxi_id', u'train', u'stands', u'unique_origin_call').
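The confusing part of this error is that u'train' is listed among the available splits, so the split itself exists; the mismatch usually comes from the installed fuel version expecting a different split-metadata layout than the one csv_to_hdf5.py wrote. One way to debug is to inspect what the HDF5 file actually stores with h5py. The snippet below builds a toy file so it is self-contained; the attribute name and path are illustrative only (Fuel's real split metadata lives in an attribute named 'split' with its own structured format):

```python
import os
import tempfile

import h5py

# Build a toy HDF5 file with a string attribute, then read it back -- the
# same open-and-inspect pattern can be used on data.hdf5 to see which splits
# it really declares. This only mimics the general idea, not Fuel's exact
# metadata layout.
path = os.path.join(tempfile.mkdtemp(), 'toy.hdf5')
with h5py.File(path, 'w') as f:
    f.attrs['split_names'] = 'train,test'

with h5py.File(path, 'r') as f:
    raw = f.attrs['split_names']
if isinstance(raw, bytes):  # older h5py may return bytes
    raw = raw.decode('utf-8')
names = raw.split(',')
print(names)
```

If the file does declare the split you are asking for, the problem is almost certainly a fuel version mismatch rather than a data problem.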
Thank you for sharing your work!
When I run train.py, this error appears:
ImportError: ('The following error happened while compiling the node', DeepCopyOp(TensorConstant{0.0}), '\n', '/home/lab508/.theano/compiledir_Linux-4.10--generic-x86_64-with-debian-stretch-sid-x86_64-2.7.13-64/tmpaxNtgT/407a281f6283a0ff044720a43f4d2c3f.so: undefined symbol: _ZdlPvm', '[DeepCopyOp(TensorConstant{0.0})]')
Hi
First of all, I want to thank you for all your work and for sharing it with us.
I have one question: I followed all the instructions you wrote, but when I run cluster_arrival.py the code produces 3455 clusters instead of the 3392 in your document. Did you do any preparation of the data before running that script?
Thank you for your attention, and sorry for my English.
Regards
When using the RNN, how did you train on the trajectory prefixes for all the data?
Did you put all the trajectories in one huge matrix and train on them at the same time, or train on them one by one?
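For what it's worth, the paper's feed-forward models treat each prefix of a trajectory as a separate training example, with the full trip's final point as the target. A hypothetical sketch of that expansion (not the repository's actual pipeline):

```python
def prefix_examples(trajectory):
    """Expand one trajectory into (prefix, destination) training pairs.

    trajectory: list of (lat, lon) points; the destination is the last point.
    Every prefix of length >= 1 becomes one training example. Purely
    illustrative -- the real code cuts prefixes inside its data stream.
    """
    destination = trajectory[-1]
    return [(trajectory[:i], destination)
            for i in range(1, len(trajectory) + 1)]

pairs = prefix_examples([(41.15, -8.61), (41.16, -8.62), (41.17, -8.63)])
print(len(pairs))  # 3
```

With this framing there is no need for one huge matrix of whole trajectories: prefixes are just streamed as ordinary minibatch rows.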
Hi, I got an error when running the prepare.sh script in my taxi repo. I downloaded the dataset with the Kaggle API per the instructions at https://github.com/Kaggle/kaggle-api, and it appears to download successfully. As is, the file was named train.zip; I renamed it to train.csv.zip, which the prepare.sh script expects. When running the script again, I get the error below. If anyone has encountered this problem, please comment.
md5sum train.csv.zip... 68e8e939fbd1e1880b1617295d5046f0 failed
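prepare.sh compares the archive's md5sum against the checksum of the original Kaggle release, so a re-download or a re-zipped archive can fail the check even when the CSV inside is intact. You can compute the same checksum yourself; this is a plain hashlib equivalent of the `md5sum` command:

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

# Sanity check: md5 of the empty byte string.
print(hashlib.md5(b'').hexdigest())  # d41d8cd98f00b204e9800998ecf8427e
```

If the zip's checksum differs but the extracted train.csv matches the checksum the script expects, the data itself is likely fine and only the archive wrapper changed.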
@ejls @Alexis211 I had a quick look at the code. A few remarks on the dataset:
Two things:
On time, it seems the classes contribute nothing, except a kind of prior through a non-uniform initialization.
On the destination, I may have a bug in my implementation, but it fails completely.
Implementation OK; it is not very efficient because SQLite is not optimized for this kind of query, but on large models the cost remains entirely negligible.
The validity of the procedure is not yet verified (a model is currently training).
Hi, my question is in the title; can you help me? Thanks!
First, thanks for sharing your amazing work!
I have a question and hope to get a reply from you.
“we chose to consider only the first k points and last k points of the trajectory prefix, which gives us a total of 2k points, or 4k numerical values for our input vector. For the winning model we took k = 5.”
In this case, the last of the last k points is the destination/target, so the neural net will learn a strong relationship between the target and that last point (two features: latitude and longitude), regardless of the other features. Right?
And in the test dataset, the last point is not the real destination, because the trip is not finished. So how can you get good results? I just can't understand this; could you explain it to me?
Looking forward to your reply; thanks in advance.
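As I read the quoted passage, the 2k points are taken from the trajectory *prefix*, and in training the prefix is a cut of the trip that generally ends before the destination, so the target is not among the inputs. A hypothetical sketch of building the 4k-value input vector (the pad-by-repetition scheme for short prefixes is my assumption, not necessarily what the repo does):

```python
def input_vector(prefix, k=5):
    """Build the 4k numerical inputs from the first k and last k prefix points.

    prefix: list of (lat, lon) points seen so far (not the full trip).
    Short prefixes are padded by repeating their points so we always have
    at least k of them; this padding scheme is illustrative only.
    """
    if len(prefix) < k:
        prefix = (prefix * k)[:k]
    first_k = prefix[:k]
    last_k = prefix[-k:]
    # flatten 2k (lat, lon) pairs into 4k numbers
    return [coord for point in first_k + last_k for coord in point]

v = input_vector([(41.15, -8.61), (41.16, -8.62)], k=2)
print(len(v))  # 8 == 4k for k=2
```

Since train and test inputs are built the same way from an unfinished prefix, the network never relies on seeing the true destination among its features.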
When I try to run csv_to_hdf5.py with its parameters, I get this:
AttributeError: 'module' object has no attribute 'stands_size'
What is `import data`? Is it a Python library, or is some folder missing?
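As far as I can tell from the tracebacks elsewhere in this thread, `data` is not a PyPI library: it is the data/ package of this repository itself (data/hdf5.py, data/transformers.py, ...). The scripts assume they are launched from the repository root so that package is importable; a minimal sketch of making sure of that:

```python
import os
import sys

# Run the scripts from the repository root, or put the repo root on sys.path
# yourself. Here we assume the current working directory is the taxi checkout
# (a hypothetical location -- adjust to your own clone).
repo_root = os.path.abspath('.')
if repo_root not in sys.path:
    sys.path.insert(0, repo_root)
print(repo_root in sys.path)  # True
```

If `import data` resolves but attributes like `stands_size` are missing, Python may be picking up a different module named `data` earlier on sys.path than the repo's package.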
Before training            0.00   0.00%
  DataStreamMonitoring     0.00   0.00%
  Printing                 0.00   0.00%
  Other                    0.00   0.00%
Initialization             0.82   1.69%
Training                  47.90  98.31%
  Before epoch             0.44   0.91%
    DataStreamMonitoring   0.44   0.91%
    Printing               0.00   0.00%
    Other                  0.00   0.00%
  Epoch                   47.46  97.40%
    Read data             41.36  84.89%
    Before batch           0.24   0.50%
      DataStreamMonitoring 0.08   0.17%
      Printing             0.13   0.26%
      Other                0.03   0.07%
    Train                  4.78   9.82%
    After batch            0.95   1.94%
      DataStreamMonitoring 0.78   1.59%
      Printing             0.14   0.28%
      Other                0.03   0.07%
    Other                  0.12   0.26%
  Other                    0.00   0.00%
After training             0.00   0.00%
  DataStreamMonitoring     0.00   0.00%
  Printing                 0.00   0.00%
  Other                    0.00   0.00%
For each epoch, 85% of the time is spent reading the dataset and creating the batches...
I think Etienne's HDF5 will help a bit, but we will need to move batch creation into separate processes.
$eight: taxi: ./prepare.sh
This script will prepare the data.
You should run it from inside the repository.
You should set the TAXI_PATH variable to where the data downloaded from kaggle is.
Three data files are needed: train.csv.zip, test.csv.zip and metaData_taxistandsID_name_GPSlocation.csv.zip. They can be found at the following url: https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i/data
h5py... version 2.5.0 (we used version 2.5.0)
theano... version 0.7.0 (we used version 0.7.0.dev)
fuel... version 0.0.1 (we used version 0.0.1)
blocks... version 0.0.1 (we used version 0.0.1)
sklearn... version 0.16.1 (we used version 0.16.1)
TAXI_PATH is set to kaggle-data
md5sum train.csv.zip... 87a1b75adfde321dc163160b495964e8 ok
md5sum test.csv.zip... 47133bf7349cb80cc668fa56af8ce743 ok
md5sum metaData_taxistandsID_name_GPSlocation.csv.zip... fecec7286191af868ce8fb208f5c7643 ok
unziping train.csv.zip... Archive: kaggle-data/train.csv.zip
inflating: kaggle-data/train.csv
ok
md5sum train.csv... 68cc499ac4937a3079ebf69e69e73971 ok
unziping test.csv.zip... Archive: kaggle-data/test.csv.zip
inflating: kaggle-data/test.csv
ok
md5sum test.csv... f2ceffde9d98e3c49046c7d998308e71 ok
unziping metaData_taxistandsID_name_GPSlocation.csv.zip... Archive: kaggle-data/metaData_taxistandsID_name_GPSlocation.csv.zip
inflating: kaggle-data/metaData_taxistandsID_name_GPSlocation.csv
inflating: kaggle-data/__MACOSX/._metaData_taxistandsID_name_GPSlocation.csv
ok
patching error in metadata csv... ok
md5sum metaData_taxistandsID_name_GPSlocation.csv... 724805b0b1385eb3efc02e8bdfe9c1df ok
This might take some time
read train: begin
read train: 10000 done
......
read train: 1710000 done
read train: writing
read train: end
First origin_call not present in training set: 57106
read test: begin
read test: writing
read test: end
This might take some time
initialization... ok
cutting... Number of cuts: 5
Traceback (most recent call last):
File "data/make_valid_cut.py", line 78, in
make_valid(sys.argv[1], outpath)
File "data/make_valid_cut.py", line 24, in make_valid
for line in taxi_it('train'):
File "/Users/eight/repos/taxi/data/hdf5.py", line 63, in taxi_it
dataset = TaxiDataset(which_set, filename)
File "/Users/eight/repos/taxi/data/hdf5.py", line 16, in init
super(TaxiDataset, self).init(self.data_path, (which_set,), **kwargs)
File "/usr/local/lib/python2.7/site-packages/fuel/datasets/hdf5.py", line 146, in init
"{}.".format(self.available_splits))
ValueError: '('train',)' split is not provided by this dataset. Available splits are (u'test', u'unique_taxi_id', u'train', u'stands', u'unique_origin_call').
ok
This might take some time
generating... Traceback (most recent call last):
File "data_analysis/cluster_arrival.py", line 13, in
from data.transformers import add_destination
File "/Users/eight/repos/taxi/data/transformers.py", line 9, in
from fuel.transformers import Batch, Mapping, SortMapping, Transformer, Unpack, FilterSources
ImportError: cannot import name FilterSources
ok
mkdir model_data... mkdir: model_data: File exists
ok
mkdir output... mkdir: output: File exists
ok
The data was successfully prepared
To train the winning model on gpu, you can now run the following command:
THEANO_FLAGS=floatX=float32,device=gpu,optimizer=fast_run python2 train.py dest_mlp_tgtcls_1_cswdtx_alexandre
$eight: taxi: THEANO_FLAGS=floatX=float32,device=gpu,optimizer=fast_run python2 train.py dest_mlp_tgtcls_1_cswdtx_alexandre
Using gpu device 0: GeForce GT 650M
Traceback (most recent call last):
File "train.py", line 44, in
config = importlib.import_module('.%s' % model_name, 'config')
File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/init.py", line 37, in import_module
import(name)
File "/Users/eight/repos/taxi/config/dest_mlp_tgtcls_1_cswdtx_alexandre.py", line 8, in
from model.dest_mlp_tgtcls import Model, Stream
File "/Users/eight/repos/taxi/model/dest_mlp_tgtcls.py", line 7, in
from model.mlp import FFMLP, Stream
File "/Users/eight/repos/taxi/model/mlp.py", line 13, in
from data import transformers
File "/Users/eight/repos/taxi/data/transformers.py", line 9, in
from fuel.transformers import Batch, Mapping, SortMapping, Transformer, Unpack, FilterSources
ImportError: cannot import name FilterSources
$eight: taxi:
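FilterSources does not exist in the old fuel release that the version check above accepted, so the usual fix is to upgrade fuel (or pin the exact commit the authors used). Conceptually the transformer just drops unwanted sources from each batch; a standalone sketch of that idea, assuming batches are dicts keyed by source name (fuel's real class wraps a data stream instead):

```python
def filter_sources(batch, wanted):
    """Keep only the named sources of a batch.

    A self-contained illustration of what fuel.transformers.FilterSources
    does; not fuel's actual implementation or API.
    """
    return {name: value for name, value in batch.items() if name in wanted}

batch = {'latitude': [41.15], 'longitude': [-8.61], 'taxi_id': [7]}
print(sorted(filter_sources(batch, ('latitude', 'longitude'))))
# ['latitude', 'longitude']
```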
First, thank you for this open-source code; nice work.
"the client called the taxi by phone, then we have a client ID. If the client called the taxi at a taxi stand, then we have a taxi stand ID. Otherwise we have no client identification"
So the client ID and the taxi stand ID seem to be mutually exclusive.
How is this handled?
Also, if there is no client identification, what do you do then?
I guess you use some kind of average embedding value as the input to the MLP?
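A common way to handle missing or mutually exclusive IDs in embedding-based models (my assumption about the general technique, not a claim about this repo's exact code) is to reserve a dedicated row of the embedding table, e.g. index 0, for "unknown", so that trips without a client ID get a learned fallback embedding rather than an average. A hypothetical sketch:

```python
UNKNOWN = 0  # reserved embedding row for "no client identification"

def origin_call_index(origin_call, known_ids):
    """Map a raw origin_call value to an embedding-table row.

    known_ids: dict raw_id -> row index (rows 1..N). Missing or unseen IDs
    fall back to the shared UNKNOWN row. Purely illustrative names/values.
    """
    if origin_call is None:
        return UNKNOWN
    return known_ids.get(origin_call, UNKNOWN)

known = {57105: 1, 20000542: 2}
print(origin_call_index(None, known), origin_call_index(57105, known))  # 0 1
```

The same trick covers test-time IDs that never appeared in training (like the "First origin_call not present in training set" boundary the preparation log reports).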
We see differences of 20% in the time error between validation and test.
When running the train.py code, I got lots of FutureWarnings like the following. This might be due to a numpy update since the version this project was based on.
python2.7/site-packages/numpy/core/numeric.py:301: FutureWarning: in the future, full((4,), -8.5676041) will return an array of dtype('float32')
I tracked down where numpy.full was called and found it in data/transformers.py. I recommend adding dtype=numpy.float32 to those numpy.full calls to avoid showing the warnings during training.
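For reference, this is the shape of the suggested change (the exact call sites are in data/transformers.py; the fill value below is just the one from the warning):

```python
import numpy

# Without an explicit dtype, older numpy inferred float64 here and warned
# that future versions would instead preserve the fill value's dtype.
# Passing dtype explicitly silences the FutureWarning and pins the result
# to the float32 the Theano code expects.
a = numpy.full((4,), -8.5676041, dtype=numpy.float32)
print(a.dtype)  # float32
```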