Winning entry to the Kaggle taxi competition
Hello, thanks for sharing your amazing work.
I'm trying to run your code, but I can't figure out this error when running data/make_valid_cut.py test_times_0:
/h/r/taxi # ❯❯❯ data/make_valid_cut.py test_times_0
Number of cuts: 5
Traceback (most recent call last):
File "data/make_valid_cut.py", line 78, in <module>
make_valid(sys.argv[1], outpath)
File "data/make_valid_cut.py", line 24, in make_valid
for line in taxi_it('train'):
File "/home/root/taxi/data/hdf5.py", line 63, in taxi_it
dataset = TaxiDataset(which_set, filename)
File "/home/root/taxi/data/hdf5.py", line 16, in __init__
super(TaxiDataset, self).__init__(self.data_path, (which_set,), **kwargs)
File "/usr/local/lib/python2.7/dist-packages/fuel/datasets/hdf5.py", line 146, in __init__
"{}.".format(self.available_splits))
ValueError: '('train',)' split is not provided by this dataset. Available splits are (u'test', u'unique_taxi_id', u'train', u'stands', u'unique_origin_call').
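The confusing part of this error is that u'train' is listed among the available splits, so the split itself exists; the mismatch usually comes from the installed fuel version expecting a different split-metadata layout than the one csv_to_hdf5.py wrote. One way to debug is to inspect what the HDF5 file actually stores with h5py. The snippet below builds a toy file so it is self-contained; the attribute name and path are illustrative only (Fuel's real split metadata lives in an attribute named 'split' with its own structured format):

```python
import os
import tempfile

import h5py

# Build a toy HDF5 file with a string attribute, then read it back -- the
# same open-and-inspect pattern can be used on data.hdf5 to see which splits
# it really declares. This only mimics the general idea, not Fuel's exact
# metadata layout.
path = os.path.join(tempfile.mkdtemp(), 'toy.hdf5')
with h5py.File(path, 'w') as f:
    f.attrs['split_names'] = 'train,test'

with h5py.File(path, 'r') as f:
    raw = f.attrs['split_names']
if isinstance(raw, bytes):  # older h5py may return bytes
    raw = raw.decode('utf-8')
names = raw.split(',')
print(names)
```

If the file does declare the split you are asking for, the problem is almost certainly a fuel version mismatch rather than a data problem.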
Thank you for sharing your work!
When I run train.py, this error appears:
ImportError: ('The following error happened while compiling the node', DeepCopyOp(TensorConstant{0.0}), '\n', '/home/lab508/.theano/compiledir_Linux-4.10--generic-x86_64-with-debian-stretch-sid-x86_64-2.7.13-64/tmpaxNtgT/407a281f6283a0ff044720a43f4d2c3f.so: undefined symbol: _ZdlPvm', '[DeepCopyOp(TensorConstant{0.0})]')
Hi
First of all, I want to thank you for all your work and for sharing it with us.
I have one question: I followed all the instructions you wrote, but when I run cluster_arrival.py the code produces 3455 clusters instead of the 3392 in your document. Did you do any preparation of the data before running that script?
Thank you for your attention, and sorry for my English.
Regards
When using the RNN, how did you train on the trajectory prefixes for all the data?
Did you put all the trajectories in one huge matrix and train on them at the same time, or train on them one by one?
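For what it's worth, the paper's feed-forward models treat each prefix of a trajectory as a separate training example, with the full trip's final point as the target. A hypothetical sketch of that expansion (not the repository's actual pipeline):

```python
def prefix_examples(trajectory):
    """Expand one trajectory into (prefix, destination) training pairs.

    trajectory: list of (lat, lon) points; the destination is the last point.
    Every prefix of length >= 1 becomes one training example. Purely
    illustrative -- the real code cuts prefixes inside its data stream.
    """
    destination = trajectory[-1]
    return [(trajectory[:i], destination)
            for i in range(1, len(trajectory) + 1)]

pairs = prefix_examples([(41.15, -8.61), (41.16, -8.62), (41.17, -8.63)])
print(len(pairs))  # 3
```

With this framing there is no need for one huge matrix of whole trajectories: prefixes are just streamed as ordinary minibatch rows.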
Hi, I got an error when running the prepare.sh script in my taxi repo. I downloaded the dataset with the Kaggle API per the instructions at https://github.com/Kaggle/kaggle-api, and it appears to download successfully. As is, the file was named train.zip; I renamed it to train.csv.zip, which the prepare.sh script expects. When running the script again, I get the error below. If anyone has encountered this problem, please comment.
md5sum train.csv.zip... 68e8e939fbd1e1880b1617295d5046f0 failed
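prepare.sh compares the archive's md5sum against the checksum of the original Kaggle release, so a re-download or a re-zipped archive can fail the check even when the CSV inside is intact. You can compute the same checksum yourself; this is a plain hashlib equivalent of the `md5sum` command:

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

# Sanity check: md5 of the empty byte string.
print(hashlib.md5(b'').hexdigest())  # d41d8cd98f00b204e9800998ecf8427e
```

If the zip's checksum differs but the extracted train.csv matches the checksum the script expects, the data itself is likely fine and only the archive wrapper changed.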
@ejls @Alexis211 I had a quick look at the code. A few remarks on the dataset:
Two things:
On time, it seems the classes contribute nothing, except a kind of prior through a non-uniform initialization.
On the destination, I may have a bug in my implementation, but it fails completely.
Implementation OK; it is not very efficient because SQLite is not optimized for this kind of query, but on large models the cost remains entirely negligible.
The validity of the procedure is not yet verified (a model is currently training).
Hi, my question is in the title; can you help me? Thanks!
First, thanks for sharing your amazing work!
I have a question and hope to get a reply from you.
“we chose to consider only the first k points and last k points of the trajectory prefix, which gives us a total of 2k points, or 4k numerical values for our input vector. For the winning model we took k = 5.”
In this case, the last of the last k points is the destination/target, so the neural net will learn a strong relationship between the target and that last point (two features: latitude and longitude), regardless of the other features. Right?
And in the test dataset, the last point is not the real destination, because the trip is not finished. So how can you get good results? I just can't understand this; could you explain it to me?
Looking forward to your reply; thanks in advance.
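As I read the quoted passage, the 2k points are taken from the trajectory *prefix*, and in training the prefix is a cut of the trip that generally ends before the destination, so the target is not among the inputs. A hypothetical sketch of building the 4k-value input vector (the pad-by-repetition scheme for short prefixes is my assumption, not necessarily what the repo does):

```python
def input_vector(prefix, k=5):
    """Build the 4k numerical inputs from the first k and last k prefix points.

    prefix: list of (lat, lon) points seen so far (not the full trip).
    Short prefixes are padded by repeating their points so we always have
    at least k of them; this padding scheme is illustrative only.
    """
    if len(prefix) < k:
        prefix = (prefix * k)[:k]
    first_k = prefix[:k]
    last_k = prefix[-k:]
    # flatten 2k (lat, lon) pairs into 4k numbers
    return [coord for point in first_k + last_k for coord in point]

v = input_vector([(41.15, -8.61), (41.16, -8.62)], k=2)
print(len(v))  # 8 == 4k for k=2
```

Since train and test inputs are built the same way from an unfinished prefix, the network never relies on seeing the true destination among its features.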
When I try to run csv_to_hdf5.py with its parameters, I get this:
AttributeError: 'module' object has no attribute 'stands_size'
What is `import data`? Is it a Python library, or is some folder missing?
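As far as I can tell from the tracebacks elsewhere in this thread, `data` is not a PyPI library: it is the data/ package of this repository itself (data/hdf5.py, data/transformers.py, ...). The scripts assume they are launched from the repository root so that package is importable; a minimal sketch of making sure of that:

```python
import os
import sys

# Run the scripts from the repository root, or put the repo root on sys.path
# yourself. Here we assume the current working directory is the taxi checkout
# (a hypothetical location -- adjust to your own clone).
repo_root = os.path.abspath('.')
if repo_root not in sys.path:
    sys.path.insert(0, repo_root)
print(repo_root in sys.path)  # True
```

If `import data` resolves but attributes like `stands_size` are missing, Python may be picking up a different module named `data` earlier on sys.path than the repo's package.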
Before training            0.00   0.00%
  DataStreamMonitoring     0.00   0.00%
  Printing                 0.00   0.00%
  Other                    0.00   0.00%
Initialization             0.82   1.69%
Training                  47.90  98.31%
  Before epoch             0.44   0.91%
    DataStreamMonitoring   0.44   0.91%
    Printing               0.00   0.00%
    Other                  0.00   0.00%
  Epoch                   47.46  97.40%
    Read data             41.36  84.89%
    Before batch           0.24   0.50%
      DataStreamMonitoring 0.08   0.17%
      Printing             0.13   0.26%
      Other                0.03   0.07%
    Train                  4.78   9.82%
    After batch            0.95   1.94%
      DataStreamMonitoring 0.78   1.59%
      Printing             0.14   0.28%
      Other                0.03   0.07%
    Other                  0.12   0.26%
  Other                    0.00   0.00%
After training             0.00   0.00%
  DataStreamMonitoring     0.00   0.00%
  Printing                 0.00   0.00%
  Other                    0.00   0.00%
For each epoch, 85% of the time is spent reading the dataset and creating the batches...
I think Etienne's HDF5 will help a bit, but we will need to move batch creation into separate processes.
$eight: taxi: ./prepare.sh
This script will prepare the data.
You should run it from inside the repository.
You should set the TAXI_PATH variable to where the data downloaded from kaggle is.
Three data files are needed: train.csv.zip, test.csv.zip and metaData_taxistandsID_name_GPSlocation.csv.zip. They can be found at the following url: https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i/data
h5py... version 2.5.0 (we used version 2.5.0)
theano... version 0.7.0 (we used version 0.7.0.dev)
fuel... version 0.0.1 (we used version 0.0.1)
blocks... version 0.0.1 (we used version 0.0.1)
sklearn... version 0.16.1 (we used version 0.16.1)
TAXI_PATH is set to kaggle-data
md5sum train.csv.zip... 87a1b75adfde321dc163160b495964e8 ok
md5sum test.csv.zip... 47133bf7349cb80cc668fa56af8ce743 ok
md5sum metaData_taxistandsID_name_GPSlocation.csv.zip... fecec7286191af868ce8fb208f5c7643 ok
unziping train.csv.zip... Archive: kaggle-data/train.csv.zip
inflating: kaggle-data/train.csv
ok
md5sum train.csv... 68cc499ac4937a3079ebf69e69e73971 ok
unziping test.csv.zip... Archive: kaggle-data/test.csv.zip
inflating: kaggle-data/test.csv
ok
md5sum test.csv... f2ceffde9d98e3c49046c7d998308e71 ok
unziping metaData_taxistandsID_name_GPSlocation.csv.zip... Archive: kaggle-data/metaData_taxistandsID_name_GPSlocation.csv.zip
inflating: kaggle-data/metaData_taxistandsID_name_GPSlocation.csv
inflating: kaggle-data/__MACOSX/._metaData_taxistandsID_name_GPSlocation.csv
ok
patching error in metadata csv... ok
md5sum metaData_taxistandsID_name_GPSlocation.csv... 724805b0b1385eb3efc02e8bdfe9c1df ok
This might take some time
read train: begin
read train: 10000 done
......
read train: 1710000 done
read train: writing
read train: end
First origin_call not present in training set: 57106
read test: begin
read test: writing
read test: end
This might take some time
initialization... ok
cutting... Number of cuts: 5
Traceback (most recent call last):
File "data/make_valid_cut.py", line 78, in
make_valid(sys.argv[1], outpath)
File "data/make_valid_cut.py", line 24, in make_valid
for line in taxi_it('train'):
File "/Users/eight/repos/taxi/data/hdf5.py", line 63, in taxi_it
dataset = TaxiDataset(which_set, filename)
File "/Users/eight/repos/taxi/data/hdf5.py", line 16, in init
super(TaxiDataset, self).init(self.data_path, (which_set,), **kwargs)
File "/usr/local/lib/python2.7/site-packages/fuel/datasets/hdf5.py", line 146, in init
"{}.".format(self.available_splits))
ValueError: '('train',)' split is not provided by this dataset. Available splits are (u'test', u'unique_taxi_id', u'train', u'stands', u'unique_origin_call').
ok
This might take some time
generating... Traceback (most recent call last):
File "data_analysis/cluster_arrival.py", line 13, in
from data.transformers import add_destination
File "/Users/eight/repos/taxi/data/transformers.py", line 9, in
from fuel.transformers import Batch, Mapping, SortMapping, Transformer, Unpack, FilterSources
ImportError: cannot import name FilterSources
ok
mkdir model_data... mkdir: model_data: File exists
ok
mkdir output... mkdir: output: File exists
ok
The data was successfully prepared
To train the winning model on gpu, you can now run the following command:
THEANO_FLAGS=floatX=float32,device=gpu,optimizer=fast_run python2 train.py dest_mlp_tgtcls_1_cswdtx_alexandre
$eight: taxi: THEANO_FLAGS=floatX=float32,device=gpu,optimizer=fast_run python2 train.py dest_mlp_tgtcls_1_cswdtx_alexandre
Using gpu device 0: GeForce GT 650M
Traceback (most recent call last):
File "train.py", line 44, in
config = importlib.import_module('.%s' % model_name, 'config')
File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/init.py", line 37, in import_module
import(name)
File "/Users/eight/repos/taxi/config/dest_mlp_tgtcls_1_cswdtx_alexandre.py", line 8, in
from model.dest_mlp_tgtcls import Model, Stream
File "/Users/eight/repos/taxi/model/dest_mlp_tgtcls.py", line 7, in
from model.mlp import FFMLP, Stream
File "/Users/eight/repos/taxi/model/mlp.py", line 13, in
from data import transformers
File "/Users/eight/repos/taxi/data/transformers.py", line 9, in
from fuel.transformers import Batch, Mapping, SortMapping, Transformer, Unpack, FilterSources
ImportError: cannot import name FilterSources
$eight: taxi:
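FilterSources does not exist in the old fuel release that the version check above accepted, so the usual fix is to upgrade fuel (or pin the exact commit the authors used). Conceptually the transformer just drops unwanted sources from each batch; a standalone sketch of that idea, assuming batches are dicts keyed by source name (fuel's real class wraps a data stream instead):

```python
def filter_sources(batch, wanted):
    """Keep only the named sources of a batch.

    A self-contained illustration of what fuel.transformers.FilterSources
    does; not fuel's actual implementation or API.
    """
    return {name: value for name, value in batch.items() if name in wanted}

batch = {'latitude': [41.15], 'longitude': [-8.61], 'taxi_id': [7]}
print(sorted(filter_sources(batch, ('latitude', 'longitude'))))
# ['latitude', 'longitude']
```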
First, thank you for this open-source code; nice work.
"the client called the taxi by phone, then we have a client ID. If the client called the taxi at a taxi stand, then we have a taxi stand ID. Otherwise we have no client identification"
So the client ID and the taxi stand ID seem to be mutually exclusive.
How is this handled?
Also, if there is no client identification, what do you do then?
I guess you use some kind of average embedding value as the input to the MLP?
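A common way to handle missing or mutually exclusive IDs in embedding-based models (my assumption about the general technique, not a claim about this repo's exact code) is to reserve a dedicated row of the embedding table, e.g. index 0, for "unknown", so that trips without a client ID get a learned fallback embedding rather than an average. A hypothetical sketch:

```python
UNKNOWN = 0  # reserved embedding row for "no client identification"

def origin_call_index(origin_call, known_ids):
    """Map a raw origin_call value to an embedding-table row.

    known_ids: dict raw_id -> row index (rows 1..N). Missing or unseen IDs
    fall back to the shared UNKNOWN row. Purely illustrative names/values.
    """
    if origin_call is None:
        return UNKNOWN
    return known_ids.get(origin_call, UNKNOWN)

known = {57105: 1, 20000542: 2}
print(origin_call_index(None, known), origin_call_index(57105, known))  # 0 1
```

The same trick covers test-time IDs that never appeared in training (like the "First origin_call not present in training set" boundary the preparation log reports).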
We see differences of 20% in the time error between validation and test.
When running the train.py code, I got lots of FutureWarnings like the following. This might be due to a numpy update since the version this project was based on.
python2.7/site-packages/numpy/core/numeric.py:301: FutureWarning: in the future, full((4,), -8.5676041) will return an array of dtype('float32')
I tracked down where numpy.full was called and found it in data/transformers.py. I recommend adding dtype=numpy.float32 to those numpy.full calls to avoid showing the warnings during training.
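For reference, this is the shape of the suggested change (the exact call sites are in data/transformers.py; the fill value below is just the one from the warning):

```python
import numpy

# Without an explicit dtype, older numpy inferred float64 here and warned
# that future versions would instead preserve the fill value's dtype.
# Passing dtype explicitly silences the FutureWarning and pins the result
# to the float32 the Theano code expects.
a = numpy.full((4,), -8.5676041, dtype=numpy.float32)
print(a.dtype)  # float32
```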