
taxi's People

Contributors

adbrebs, alexis211, ejls


taxi's Issues

Error in data preparation

Hello, thanks for sharing your amazing work.
I'm trying to run your code, but I can't figure out this error when running data/make_valid_cut.py test_times_0:

/h/r/taxi # ❯❯❯ data/make_valid_cut.py test_times_0
Number of cuts: 5
Traceback (most recent call last):
  File "data/make_valid_cut.py", line 78, in <module>
    make_valid(sys.argv[1], outpath)
  File "data/make_valid_cut.py", line 24, in make_valid
    for line in taxi_it('train'):
  File "/home/root/taxi/data/hdf5.py", line 63, in taxi_it
    dataset = TaxiDataset(which_set, filename)
  File "/home/root/taxi/data/hdf5.py", line 16, in __init__
    super(TaxiDataset, self).__init__(self.data_path, (which_set,), **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/fuel/datasets/hdf5.py", line 146, in __init__
    "{}.".format(self.available_splits))
ValueError: '('train',)' split is not provided by this dataset. Available splits are (u'test', u'unique_taxi_id', u'train', u'stands', u'unique_origin_call').
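
Editor's note: the quoting in the error ('('train',)' shown as a single split name) suggests the installed Fuel does not match the version the repository was written against, so the (which_set,) tuple gets treated as one split. A minimal debugging sketch, assuming Fuel's standard HDF5 layout ('data.hdf5' is a placeholder for the file TaxiDataset opens), to list the splits the generated file actually provides:

    import h5py
    # Fuel stores its split table in the 'split' attribute of the HDF5 file
    with h5py.File('data.hdf5', 'r') as f:
        print(f.attrs['split'])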

undefined symbol: _ZdlPvm

Thank you for sharing your work!

When I run train.py, this error appears:

ImportError: ('The following error happened while compiling the node', DeepCopyOp(TensorConstant{0.0}), '\n', '/home/lab508/.theano/compiledir_Linux-4.10--generic-x86_64-with-debian-stretch-sid-x86_64-2.7.13-64/tmpaxNtgT/407a281f6283a0ff044720a43f4d2c3f.so: undefined symbol: _ZdlPvm', '[DeepCopyOp(TensorConstant{0.0})]')
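
Editor's note, not from the original report: _ZdlPvm demangles to operator delete(void*, unsigned long), the C++14 sized-deallocation symbol, so the cached .so was most likely built by a different g++/libstdc++ pair than the one loaded at runtime. A hedged first step is to purge Theano's compilation cache (theano-cache purge) and check which compiler Theano is using:

    import theano
    print(theano.config.cxx)         # the C++ compiler Theano will invoke
    print(theano.config.compiledir)  # cache directory to delete if it is stale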

3455 instead of 3392 clusters in cluster_arrival.py

Hi

First of all, I want to say thank you for all your work and for sharing it with us.

I have one question: I followed all the instructions you wrote, but when I run cluster_arrival.py it produces 3455 clusters instead of the 3392 mentioned in your document. Did you do any additional preparation of the data before running that script?

Thank you for your attention, and sorry for my English.

Regards
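
Editor's note: one plausible explanation (a guess, not verified against the repo) is that the destination clustering is sensitive to the sklearn version and to the mean-shift parameters, so small library differences can shift the cluster count. A minimal sketch of the kind of clustering the paper describes, with assumed file name and parameter values:

    import numpy
    from sklearn.cluster import MeanShift

    # destinations.npy is a hypothetical dump of all training (lat, lon) endpoints
    dests = numpy.load('destinations.npy')
    ms = MeanShift(bandwidth=0.001, bin_seeding=True, min_bin_freq=5)  # assumed values
    ms.fit(dests)
    print(len(ms.cluster_centers_))  # the count varies with version and parameters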

Question about input

When using the RNN, how did you train on the trajectory prefixes for all the data?
Did you put all the trajectories into one huge matrix and train on them at the same time, or did you train on them one by one?
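
Editor's note: for reference, a hedged sketch of prefix-based training data generation (not the repository's actual code): each completed trajectory can yield many training examples by sampling a cut point, with the true endpoint as the target.

    import random

    def sample_prefix(trajectory):
        # trajectory: list of (lat, lon) points of a completed ride (>= 2 points)
        n = random.randint(1, len(trajectory) - 1)  # random cut, keep >= 1 point
        return trajectory[:n], trajectory[-1]       # (input prefix, destination label)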

md5sum on train.csv.zip error

Hi, I get an error when running the prepare.sh script in my taxi repo. I downloaded the dataset with the Kaggle API per the instructions at https://github.com/Kaggle/kaggle-api, and it appears to download successfully. As downloaded, the file was named train.zip; I renamed it to train.csv.zip, which the prepare.sh script expects. When running the script again, I get the error below. If anyone has encountered this problem, please comment.

md5sum train.csv.zip... 68e8e939fbd1e1880b1617295d5046f0 failed
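
Editor's note: if the mismatch persists, it may simply be that Kaggle has re-packaged the archive since the checksums were recorded. A quick way to compute the checksum yourself (plain hashlib):

    import hashlib

    md5 = hashlib.md5()
    with open('train.csv.zip', 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):  # read in 1 MiB chunks
            md5.update(chunk)
    print(md5.hexdigest())  # compare with the value prepare.sh expects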

Dataset

@ejls @Alexis211 I had a quick look at the code. A few remarks on the dataset:

  • wouldn't it be more practical (and potentially more efficient, in particular if we want to do in-memory processing) to build a proper HDF5 file and use Fuel's H5PYDataset?
  • in the current implementation, I believe the Select(Transformer) class is not necessary, since Fuel already implements this functionality. In the constructor arguments of Fuel's base Dataset class, you just need to specify the sources you want to select, and then call https://github.com/bartvm/fuel/blob/master/fuel/datasets/base.py#L161 in get_data, as sketched below.
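
Editor's note: a hedged illustration of that built-in mechanism, based on Fuel's Dataset API; the class, source names, and dummy values below are placeholders, not repository code.

    from fuel.datasets.base import Dataset

    class ExampleDataset(Dataset):
        # placeholder dataset: three sources backed by constant dummy values
        provides_sources = ('latitude', 'longitude', 'taxi_id')

        def get_data(self, state=None, request=None):
            data = ([41.15], [-8.61], [7])        # one dummy value per provided source
            return self.filter_sources(data)      # keeps only the requested `sources`

    ds = ExampleDataset(sources=('latitude', 'longitude'))
    print(ds.get_data())  # -> ([41.15], [-8.61])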

TODO

  • RNN
  • test different embeddings
  • try the metric
  • memory network
  • try bigger models, to the point of overfitting
  • train one model per path length

Output layer

Two things:

  • remove the mean
  • remove the classes

On the time prediction, it seems that the classes contribute nothing, except for a kind of prior through a non-uniform initialization.

On the destination, I may have a bug in my implementation, but it breaks down completely.
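
Editor's note: for reference in this discussion, a hedged sketch of the class-based output layer in question, assuming it matches the paper's description (a softmax over destination clusters followed by a weighted mean of the cluster centroids):

    import theano.tensor as tt

    def tgtcls_output(hidden, W, b, centroids):
        # centroids: (n_clusters, 2) matrix of (lat, lon) cluster centres
        p = tt.nnet.softmax(tt.dot(hidden, W) + b)  # (batch, n_clusters) class weights
        return tt.dot(p, centroids)                 # prediction = weighted centroid mean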

[TODO] Cuts for generating the training data

Implementation OK; it is not very efficient, since SQLite is not optimized for this kind of query, but for large models the cost remains entirely negligible.

The validity of the procedure has not been verified yet (a model is currently training).

The last of the last k points is the destination; using the same values as both features and labels, is this rational?

First, thanks for sharing your amazing work!
I have a question and hope to get a reply from you.
“we chose to consider only the first k points and last k points of the trajectory prefix, which gives us a total of 2k points, or 4k numerical values for our input vector. For the winning model we took k = 5.”
In this case, the last of the last 5 points is the destination/target, so the neural net will learn to establish a strong relationship between the target and that last point (2 features: latitude and longitude), regardless of the other features. Right?
And in the test dataset, the last point is not the real destination, because the trip is not finished. Why do you still get good results? I just can't understand this; could you explain it to me?

Looking forward to your reply; thanks in advance.
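
Editor's note: one hedged reading of the quoted construction (mine, not the authors'): at training time the prefix is a random cut of the full ride, so its last point is generally not the destination; only the label is the true endpoint. Building the 4k-value input from a prefix, padding short prefixes by repetition (an assumption, not verified code), could look like:

    def input_vector(prefix, k=5):
        # prefix: list of (lat, lon) points seen so far (a cut, not the full ride)
        first = (prefix[:k] + [prefix[-1]] * k)[:k]    # pad by repeating the last point
        last = ([prefix[0]] * k + prefix[-k:])[-k:]    # pad by repeating the first point
        return [c for point in first + last for c in point]  # 2k points = 4k values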

import data

When I try to run csv_to_hdf5.py with its parameters, I get this:

AttributeError: 'module' object has no attribute 'stands_size'

whats "import data" ? Is it a python lib or some floder missing

Batch processing

@Alexis211 @ejls

Section                      Time     % of total
Before training              0.00     0.00%
  DataStreamMonitoring       0.00     0.00%
  Printing                   0.00     0.00%
  Other                      0.00     0.00%
Initialization               0.82     1.69%
Training                    47.90    98.31%
  Before epoch               0.44     0.91%
    DataStreamMonitoring     0.44     0.91%
    Printing                 0.00     0.00%
    Other                    0.00     0.00%
  Epoch                     47.46    97.40%
    Read data               41.36    84.89%
    Before batch             0.24     0.50%
      DataStreamMonitoring   0.08     0.17%
      Printing               0.13     0.26%
      Other                  0.03     0.07%
    Train                    4.78     9.82%
    After batch              0.95     1.94%
      DataStreamMonitoring   0.78     1.59%
      Printing               0.14     0.28%
      Other                  0.03     0.07%
    Other                    0.12     0.26%
  Other                      0.00     0.00%
After training               0.00     0.00%
  DataStreamMonitoring       0.00     0.00%
  Printing                   0.00     0.00%
  Other                      0.00     0.00%

For each epoch, 85% of the time is spent reading the dataset and building the batches...
I think Etienne's HDF5 file will help a bit, but we will also need to move batch creation into separate processes, e.g. along the lines of the sketch below.
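
Editor's note: a generic sketch of moving batch creation into a separate process (plain multiprocessing, as an illustration only; make_batches and train_on are hypothetical helpers, not repository code):

    from multiprocessing import Process, Queue

    def producer(queue):
        for batch in make_batches():   # make_batches: hypothetical batch generator
            queue.put(batch)
        queue.put(None)                # sentinel: no more batches

    queue = Queue(maxsize=16)          # bounded, so the producer cannot run far ahead
    worker = Process(target=producer, args=(queue,))
    worker.daemon = True
    worker.start()

    batch = queue.get()
    while batch is not None:
        train_on(batch)                # train_on: hypothetical training step
        batch = queue.get()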

FilterSources error

$eight: taxi: ./prepare.sh
This script will prepare the data.
You should run it from inside the repository.
You should set the TAXI_PATH variable to where the data downloaded from kaggle is.
Three data files are needed: train.csv.zip, test.csv.zip and metaData_taxistandsID_name_GPSlocation.csv.zip. They can be found at the following url: https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i/data

Checking dependencies

h5py... version 2.5.0 (we used version 2.5.0)
theano... version 0.7.0 (we used version 0.7.0.dev)
fuel... version 0.0.1 (we used version 0.0.1)
blocks... version 0.0.1 (we used version 0.0.1)
sklearn... version 0.16.1 (we used version 0.16.1)

Checking data

TAXI_PATH is set to kaggle-data
md5sum train.csv.zip... 87a1b75adfde321dc163160b495964e8 ok
md5sum test.csv.zip... 47133bf7349cb80cc668fa56af8ce743 ok
md5sum metaData_taxistandsID_name_GPSlocation.csv.zip... fecec7286191af868ce8fb208f5c7643 ok

Extracting data

unziping train.csv.zip... Archive: kaggle-data/train.csv.zip
inflating: kaggle-data/train.csv
ok
md5sum train.csv... 68cc499ac4937a3079ebf69e69e73971 ok
unziping test.csv.zip... Archive: kaggle-data/test.csv.zip
inflating: kaggle-data/test.csv
ok
md5sum test.csv... f2ceffde9d98e3c49046c7d998308e71 ok
unziping metaData_taxistandsID_name_GPSlocation.csv.zip... Archive: kaggle-data/metaData_taxistandsID_name_GPSlocation.csv.zip
inflating: kaggle-data/metaData_taxistandsID_name_GPSlocation.csv
inflating: kaggle-data/__MACOSX/._metaData_taxistandsID_name_GPSlocation.csv
ok
patching error in metadata csv... ok
md5sum metaData_taxistandsID_name_GPSlocation.csv... 724805b0b1385eb3efc02e8bdfe9c1df ok

Conversion of training set to HDF5

This might take some time
read train: begin
read train: 10000 done
......
read train: 1710000 done
read train: writing
read train: end
First origin_call not present in training set: 57106
read test: begin
read test: writing
read test: end

Generation of validation set

This might take some time
initialization... ok
cutting... Number of cuts: 5
Traceback (most recent call last):
File "data/make_valid_cut.py", line 78, in
make_valid(sys.argv[1], outpath)
File "data/make_valid_cut.py", line 24, in make_valid
for line in taxi_it('train'):
File "/Users/eight/repos/taxi/data/hdf5.py", line 63, in taxi_it
dataset = TaxiDataset(which_set, filename)
File "/Users/eight/repos/taxi/data/hdf5.py", line 16, in init
super(TaxiDataset, self).init(self.data_path, (which_set,), **kwargs)
File "/usr/local/lib/python2.7/site-packages/fuel/datasets/hdf5.py", line 146, in init
"{}.".format(self.available_splits))
ValueError: '('train',)' split is not provided by this dataset. Available splits are (u'test', u'unique_taxi_id', u'train', u'stands', u'unique_origin_call').
ok

Generation of destination cluster

This might take some time
generating... Traceback (most recent call last):
File "data_analysis/cluster_arrival.py", line 13, in
from data.transformers import add_destination
File "/Users/eight/repos/taxi/data/transformers.py", line 9, in
from fuel.transformers import Batch, Mapping, SortMapping, Transformer, Unpack, FilterSources
ImportError: cannot import name FilterSources
ok

Creating output folders

mkdir model_data... mkdir: model_data: File exists
ok
mkdir output... mkdir: output: File exists
ok

The data was successfully prepared
To train the winning model on gpu, you can now run the following command:
THEANO_FLAGS=floatX=float32,device=gpu,optimizer=fast_run python2 train.py dest_mlp_tgtcls_1_cswdtx_alexandre

$eight: taxi: THEANO_FLAGS=floatX=float32,device=gpu,optimizer=fast_run python2 train.py dest_mlp_tgtcls_1_cswdtx_alexandre
Using gpu device 0: GeForce GT 650M
Traceback (most recent call last):
File "train.py", line 44, in
config = importlib.import_module('.%s' % model_name, 'config')
File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/init.py", line 37, in import_module
import(name)
File "/Users/eight/repos/taxi/config/dest_mlp_tgtcls_1_cswdtx_alexandre.py", line 8, in
from model.dest_mlp_tgtcls import Model, Stream
File "/Users/eight/repos/taxi/model/dest_mlp_tgtcls.py", line 7, in
from model.mlp import FFMLP, Stream
File "/Users/eight/repos/taxi/model/mlp.py", line 13, in
from data import transformers
File "/Users/eight/repos/taxi/data/transformers.py", line 9, in
from fuel.transformers import Batch, Mapping, SortMapping, Transformer, Unpack, FilterSources
ImportError: cannot import name FilterSources
$eight: taxi:
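
Editor's note: FilterSources is present in later Fuel releases but missing from older ones, so the installed Fuel here is likely too old or mismatched; upgrading Fuel is the usual fix. A quick check of what the installed copy exports:

    import fuel
    import fuel.transformers

    print(fuel.__version__)
    print(hasattr(fuel.transformers, 'FilterSources'))  # False -> missing in this Fuel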

How is the missing/mutually exclusive data handled (client ID or taxi stand ID)?

First, thank you for this open-source code, and nice work.

"the client called the taxi by phone, then we have a client ID. If the client called the taxi at a taxi stand, then we have a taxi stand ID. Otherwise we have no client identification"
So the client ID and the taxi stand ID seem to be mutually exclusive.
How is this handled?
Also, if there is no client identification at all, what happens then?
I would guess some kind of average embedding value is used as the input to the MLP?
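
Editor's note: a hedged guess at the usual handling of this pattern (not confirmed from the repo): reserve id 0 as an "unknown" token so each trip always carries one valid identifier per field, and embed it like any other id.

    UNKNOWN = 0  # assumed sentinel id for "no identification"

    def encode_ids(origin_call, origin_stand):
        # at most one of the two is present per trip; the absent one maps to UNKNOWN
        call_id = origin_call if origin_call is not None else UNKNOWN
        stand_id = origin_stand if origin_stand is not None else UNKNOWN
        return call_id, stand_id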

Validation set

We see differences of around 20% in the time error between validation and test.

Lots of FutureWarning

When running the train.py code, I get lots of FutureWarnings like the following. This might be due to a numpy update since the version this project was based on.

python2.7/site-packages/numpy/core/numeric.py:301: FutureWarning: in the future, full((4,), -8.5676041) will return an array of dtype('float32')

I tracked down where numpy.full is called and found it in data/transformers.py.

I recommend adding dtype=numpy.float32 to those numpy.full calls to avoid showing these warnings during training, as spelled out below.
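
Editor's note: the suggested fix, spelled out (plain numpy; the shape and fill value are taken from the warning above):

    import numpy

    # explicit dtype keeps float32 throughout and silences the FutureWarning
    filler = numpy.full((4,), -8.5676041, dtype=numpy.float32)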
