deep_learning_for_tabular_data's Introduction

Deep learning for tabular data

Deep learning can also be used for predictions based on tabular data, the kind of data you most commonly find in databases and tables. The presentation part of this workshop discusses how such an approach works and how it stays competitive with more popular machine learning algorithms such as gradient boosting. The workshop itself demonstrates how to achieve good results using TensorFlow and its high-level API, Keras, integrated with more classical approaches based on Scikit-learn and Pandas.

Workshop code on Colab:

Open in Colab

Follow the tutorial on Youtube (GDG Venezia 2019)

https://www.youtube.com/watch?v=nQgUt_uADSE&t=1533s

deep_learning_for_tabular_data's People

Contributors

lmassaron avatar

deep_learning_for_tabular_data's Issues

error "Passing list-likes to .loc or [] with any missing labels is no longer supported."

I ran your code on my own data, with a regression model. The CatBoost part works fine, but the deep learning part fails with the following error:

KeyError Traceback (most recent call last)
in
52 shuffle=True)
53
---> 54 history = model.fit(train_batch,
55 # validation_data=(tb.transform(X.iloc[test_idx]), y[test_idx]),
56 validation_data=test_batch,

~/.virtualenvs/tf24/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
1048 training_utils.RespectCompiledTrainableState(self):
1049 # Creates a tf.data.Dataset and handles batch and epoch iteration.
-> 1050 data_handler = data_adapter.DataHandler(
1051 x=x,
1052 y=y,

~/.virtualenvs/tf24/lib/python3.8/site-packages/tensorflow/python/keras/engine/data_adapter.py in __init__(self, x, y, sample_weight, batch_size, steps_per_epoch, initial_epoch, epochs, shuffle, class_weight, max_queue_size, workers, use_multiprocessing, model, steps_per_execution)
1098
1099 adapter_cls = select_data_adapter(x, y)
-> 1100 self._adapter = adapter_cls(
1101 x,
1102 y,

~/.virtualenvs/tf24/lib/python3.8/site-packages/tensorflow/python/keras/engine/data_adapter.py in __init__(self, x, y, sample_weights, shuffle, workers, use_multiprocessing, max_queue_size, model, **kwargs)
900 self._keras_sequence = x
901 self._enqueuer = None
--> 902 super(KerasSequenceAdapter, self).__init__(
903 x,
904 shuffle=False, # Shuffle is handed in the _make_callable override.

~/.virtualenvs/tf24/lib/python3.8/site-packages/tensorflow/python/keras/engine/data_adapter.py in __init__(self, x, y, sample_weights, workers, use_multiprocessing, max_queue_size, model, **kwargs)
777 # Since we have to know the dtype of the python generator when we build the
778 # dataset, we have to look at a batch to infer the structure.
--> 779 peek, x = self._peek_and_restore(x)
780 peek = self._standardize_batch(peek)
781 peek = _process_tensorlike(peek)

~/.virtualenvs/tf24/lib/python3.8/site-packages/tensorflow/python/keras/engine/data_adapter.py in _peek_and_restore(x)
911 @staticmethod
912 def _peek_and_restore(x):
--> 913 return x[0], x
914
915 def _handle_multiprocessing(self, x, workers, use_multiprocessing,

~/projects/ifp85/tabular.py in __getitem__(self, index)
348 def __getitem__(self, index):
349 indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
--> 350 samples, labels = self.__data_generation(indexes)
351 return samples, labels
352

~/projects/ifp85/tabular.py in __data_generation(self, selection)
342 return dct, self.y[selection]
343 else:
--> 344 return self.tbt.transform(self.X.iloc[selection, :]), self.y[selection]
345 else:
346 return self.X.iloc[selection, :], self.y[selection]

~/.virtualenvs/tf24/lib/python3.8/site-packages/pandas/core/series.py in __getitem__(self, key)
904 return self._get_values(key)
905
--> 906 return self._get_with(key)
907
908 def _get_with(self, key):

~/.virtualenvs/tf24/lib/python3.8/site-packages/pandas/core/series.py in _get_with(self, key)
939 # (i.e. self.iloc) or label-based (i.e. self.loc)
940 if not self.index._should_fallback_to_positional():
--> 941 return self.loc[key]
942 else:
943 return self.iloc[key]

~/.virtualenvs/tf24/lib/python3.8/site-packages/pandas/core/indexing.py in __getitem__(self, key)
877
878 maybe_callable = com.apply_if_callable(key, self.obj)
--> 879 return self._getitem_axis(maybe_callable, axis=axis)
880
881 def _is_scalar_access(self, key: Tuple):

~/.virtualenvs/tf24/lib/python3.8/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
1097 raise ValueError("Cannot index with multidimensional key")
1098
-> 1099 return self._getitem_iterable(key, axis=axis)
1100
1101 # nested tuple slicing

~/.virtualenvs/tf24/lib/python3.8/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
1035
1036 # A collection of keys
-> 1037 keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
1038 return self.obj._reindex_with_indexers(
1039 {axis: [keyarr, indexer]}, copy=True, allow_dups=True

~/.virtualenvs/tf24/lib/python3.8/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
1252 keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
1253
-> 1254 self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
1255 return keyarr, indexer
1256

~/.virtualenvs/tf24/lib/python3.8/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
1313
1314 with option_context("display.max_seq_items", 10, "display.width", 80):
-> 1315 raise KeyError(
1316 "Passing list-likes to .loc or [] with any missing labels "
1317 "is no longer supported. "

KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Int64Index([ 963, 26089, 37285, 32796, 21419,\n ...\n 7514, 35430, 5619, 9022, 40319],\n dtype='int64', length=253). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"

I don't know how to solve this.
By the way, I don't fully understand the meaning of the variables sizes and categorical_levels:

tb = TabularTransformer(numeric = numeric_variables,
                        ordinal = [],
                        lowcat = [],
                        highcat = categorical_variables)

tb.fit(X.iloc[train_idx])
sizes = tb.shape(X.iloc[train_idx])
categorical_levels = dict(zip(categorical_variables, sizes[1:]))
print(f"Input array sizes: {sizes}")
print(f"Categorical levels: {categorical_levels}\n")

Thank you very much!
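
A note on the KeyError above: the last frames show self.y[selection] in tabular.py ending up in Series.loc, while X is indexed with .iloc. This suggests that y is a pandas Series whose index is not the default 0..n-1 range (for example after rows were filtered out), so the positional batch indices are looked up as labels. The following is a minimal sketch of that behaviour and of two possible workarounds, assuming this is indeed the cause; the toy data is invented for illustration and is not from the workshop.

```python
import numpy as np
import pandas as pd

# Toy target whose index labels are NOT 0..n-1, e.g. after dropping rows.
y = pd.Series([10.0, 20.0, 30.0, 40.0], index=[5, 6, 8, 9])

# Positional batch indices, as a batch generator would produce them.
selection = np.array([0, 2, 3])

# y[selection] treats the integers as labels; 0, 2 and 3 are not in the
# index, so pandas raises KeyError.
try:
    y[selection]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# Workaround 1: index by position explicitly.
by_position = y.iloc[selection].to_numpy()

# Workaround 2: convert the target to a NumPy array before batching,
# so that plain [] indexing is always positional.
y_array = y.to_numpy()
by_array = y_array[selection]
```

With the workshop code this would probably mean passing y.values (or y.reset_index(drop=True)) to the batch generator; whether that is appropriate depends on how X and y were built.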

Feature importance

How would you go about finding the feature importance for the DNN model?
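
One model-agnostic option is permutation importance: shuffle one feature at a time and measure how much the error grows. The sketch below is not part of the workshop code. The function name, the toy predictor and the data are all hypothetical, and it assumes the model can be wrapped in a predict callable that takes a 2-D NumPy array.

```python
import numpy as np

def permutation_importance(predict, X, y, metric, n_repeats=5, seed=0):
    """Mean increase in error when each column of X is shuffled.

    predict: callable mapping an array of shape (n, d) to predictions (n,)
    metric:  callable (y_true, y_pred) -> error, lower is better
    """
    rng = np.random.default_rng(seed)
    baseline = metric(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffle one column, breaking its link with the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(metric(y, predict(X_perm)) - baseline)
        importances[j] = np.mean(increases)
    return importances

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

# Stand-in for a trained network: only the first column matters.
def toy_predict(X):
    return 3.0 * X[:, 0]

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0]
scores = permutation_importance(toy_predict, X_demo, y_demo, mse)
```

For the workshop's DNN, predict would presumably wrap tb.transform followed by model.predict, with the shuffling applied to the raw DataFrame columns on held-out data.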

High GPU Memory-Usage but zero volatile gpu-util

Fri Jan 29 23:29:47 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38       Driver Version: 455.38       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:01:00.0 Off |                  N/A |
| 32%   32C    P8    23W / 350W |  23081MiB / 24245MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     27737      C   ...rtualenvs/tf24/bin/python    23079MiB |
+-----------------------------------------------------------------------------+

I checked that my GPU is available. I think the GPU is always waiting for the CPU to process data. Do you know how to improve GPU utilization? When I tried your example, GPU-Util was 0% most of the time and only occasionally reached 20%.
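
Two observations may help here. First, TensorFlow by default reserves almost all GPU memory when it starts, so the 23081MiB figure does not by itself indicate load. Second, the traceback in the first issue shows the batch generator calling tbt.transform for every batch, which is CPU work in pandas. If that is the bottleneck, transforming the data once up front leaves only cheap array slicing per batch. The sketch below illustrates the idea with a hypothetical PrecomputedBatches class and a stand-in transform function; neither is part of the workshop code, and it assumes the transform returns a single array.

```python
import numpy as np

class PrecomputedBatches:
    """Hypothetical batch provider that transforms the data only once.

    It mimics the __len__/__getitem__ interface of a Keras Sequence but
    does not inherit from it, so this sketch runs without TensorFlow.
    """

    def __init__(self, X, y, transform, batch_size=256, seed=0):
        # Run the expensive transformation a single time, up front.
        self.X = np.asarray(transform(X))
        self.y = np.asarray(y)
        self.batch_size = batch_size
        self.rng = np.random.default_rng(seed)
        self.indexes = np.arange(len(self.y))
        self.on_epoch_end()

    def __len__(self):
        # Number of batches per epoch, counting a final partial batch.
        return int(np.ceil(len(self.y) / self.batch_size))

    def __getitem__(self, index):
        start = index * self.batch_size
        sel = self.indexes[start:start + self.batch_size]
        # Per-batch work is now plain positional slicing.
        return self.X[sel], self.y[sel]

    def on_epoch_end(self):
        # Reshuffle the row order between epochs.
        self.rng.shuffle(self.indexes)

calls = []

def toy_transform(X):
    # Stand-in for an expensive preprocessing step.
    calls.append(1)
    return X * 2.0

X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10, dtype=float)
batches = PrecomputedBatches(X_demo, y_demo, toy_transform, batch_size=4)
first_X, first_y = batches[0]
```

The fit signature shown in the tracebacks also lists workers, use_multiprocessing and max_queue_size, which let Keras prepare batches from a Sequence in parallel, and a larger batch size gives the GPU more work per step; both may be worth trying. The workshop's transformer appears to return several arrays (one per categorical input), in which case the same precomputation would apply to each of them.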

errors when training deep learning model ('list' object has no attribute 'keys')

AttributeError Traceback (most recent call last)
in
43 shuffle=True)
44
---> 45 history = model.fit_generator(train_batch,
46 validation_data=(tb.transform(X.iloc[test_idx]), y[test_idx]),
47 epochs=30,

~/.virtualenvs/tf24/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
1845 'will be removed in a future version. '
1846 'Please use Model.fit, which supports generators.')
-> 1847 return self.fit(
1848 generator,
1849 steps_per_epoch=steps_per_epoch,

~/.virtualenvs/tf24/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
1048 training_utils.RespectCompiledTrainableState(self):
1049 # Creates a tf.data.Dataset and handles batch and epoch iteration.
-> 1050 data_handler = data_adapter.DataHandler(
1051 x=x,
1052 y=y,

~/.virtualenvs/tf24/lib/python3.8/site-packages/tensorflow/python/keras/engine/data_adapter.py in __init__(self, x, y, sample_weight, batch_size, steps_per_epoch, initial_epoch, epochs, shuffle, class_weight, max_queue_size, workers, use_multiprocessing, model, steps_per_execution)
1115 dataset = self._adapter.get_dataset()
1116 if class_weight:
-> 1117 dataset = dataset.map(_make_class_weight_map_fn(class_weight))
1118 self._inferred_steps = self._infer_steps(steps_per_epoch, dataset)
1119

~/.virtualenvs/tf24/lib/python3.8/site-packages/tensorflow/python/keras/engine/data_adapter.py in _make_class_weight_map_fn(class_weight)
1276 weighting.
1277 """
-> 1278 class_ids = list(sorted(class_weight.keys()))
1279 expected_class_ids = list(range(len(class_ids)))
1280 if class_ids != expected_class_ids:

AttributeError: 'list' object has no attribute 'keys'
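
The last frame calls class_weight.keys(), so Keras in this TensorFlow version expects class_weight to be a dict mapping class index to weight, and a list triggers exactly this AttributeError. Below is a minimal sketch of the conversion, assuming the weights were given as a list ordered by class index; the values are made up for illustration.

```python
# Hypothetical weights in list form, ordered by class index.
class_weight_list = [1.0, 4.5]

# Keras expects a mapping {class_index: weight} with ids 0..n-1.
class_weight = dict(enumerate(class_weight_list))

# Same check as in the traceback: the ids must be exactly 0, 1, ..., n-1.
class_ids = sorted(class_weight.keys())
ids_are_valid = class_ids == list(range(len(class_ids)))
```

The dict would then be passed as class_weight=class_weight to model.fit. For a regression target class weights have no meaning, so dropping the argument may be the simpler fix.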
