alteryx / automated-manual-comparison
Automated vs Manual Feature Engineering Comparison. Implemented using Featuretools.
License: BSD 3-Clause "New" or "Revised" License
Hi,
thanks for your article. Automated feature engineering is very promising.
I am running the Loan Repayment script right now to compare it with my own engineered features,
and I am very curious about the results.
What hardware is recommended to compute the result in one day (as mentioned in the article)?
Elapsed: 18:50:30 | Remaining: 22358:53:57 | Progress: 0%| | Calculated: 3/3563 chunks
The ft.py script uses one job by default; any value other than 1 crashes it.
I am using an r4.2xlarge AWS EC2 instance, but with one job it cannot utilize more than one core.
Even with all eight cores, it would still take weeks.
Can you recommend some specs to speed this up?
Best regards
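For what it's worth, the progress line itself lets you estimate the single-core runtime; a quick back-of-the-envelope sketch using only the numbers from the output above:

```python
# 3 of 3563 chunks finished after 18:50:30 elapsed (from the progress line)
elapsed_s = 18 * 3600 + 50 * 60 + 30           # 67,830 s elapsed
chunks_done, chunks_total = 3, 3563
seconds_per_chunk = elapsed_s / chunks_done     # ~22,610 s per chunk
remaining_days = seconds_per_chunk * (chunks_total - chunks_done) / 86400
print(f"~{remaining_days:.0f} days left on one core, "
      f"~{remaining_days / 8:.0f} days if split evenly over 8 cores")
# → ~932 days on one core, ~116 days over 8 cores
```

This agrees with the 22358:53:57 remaining shown by the progress bar, so even perfect 8-core scaling would not get this run down to a day; making each chunk cheaper (fewer features or a smaller chunk_size) matters as much as adding cores.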
When I ran your notebook Automated Engine Life.ipynb,
cell [in] 5 gave error messages like this:
tornado.application - ERROR - Exception in Future after timeout
Traceback (most recent call last):
File "/home/xuzhang/anaconda3/envs/featuretools/lib/python3.6/site-packages/tornado/gen.py", line 970, in error_callback
future.result()
concurrent.futures._base.CancelledError
distributed.comm.tcp - WARNING - Closing dangling stream in
Any advice? Thanks
On the Automated Loan Repayment page, the following code:
app_types = {}

# Handle the Boolean variables:
for col in app:
    if (app[col].nunique() == 2) and (app[col].dtype == float):
        app_types[col] = vtypes.Boolean

# Remove the `TARGET`
del app_types['TARGET']

print('There are {} Boolean variables in the application data.'.format(len(app_types)))
gives the wrong count: the result should be 0, not 32.
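For anyone trying to reproduce the discrepancy, the check can be exercised on a toy frame. A minimal, self-contained sketch (the column names are made up, and a plain string stands in for vtypes.Boolean to avoid the Featuretools dependency) showing that any float column with exactly two distinct values passes the test:

```python
import pandas as pd

# Toy stand-in for the application dataframe (hypothetical columns)
app = pd.DataFrame({
    'TARGET':       [0.0, 1.0, 0.0],    # two unique float values
    'FLAG_OWN_CAR': [1.0, 0.0, 1.0],    # 0/1 flag encoded as float, so it matches
    'AMT_INCOME':   [100.0, 250.5, 90.0]
})

app_types = {}
for col in app:
    # A float column with exactly two distinct values is treated as Boolean;
    # note that np.dtype('float64') == float evaluates to True in pandas
    if app[col].nunique() == 2 and app[col].dtype == float:
        app_types[col] = 'Boolean'

del app_types['TARGET']  # the label should not be typed as a feature
print(len(app_types))  # → 1
```

So the count reported depends entirely on how the two-valued flag columns are encoded in the CSV: if they load as floats they are counted, if they load as ints or objects they are not.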
After alteryx/featuretools#783 is merged, we can update the notebook titled Feature Matrix with Dask EntitySet.ipynb
to use the improved API. This notebook is currently located in the add-dask-notebook branch.
Dear @WillKoehrsen,
Thanks for your amazing article and I really appreciate this work.
But I'm confused about something at the end of the Metrics section, where you said: "One customer, 8 different labels. It seems like it might be difficult to predict this customer's spending given her fluctuating total spending! We'll have to see if Featuretools is up to the task."
I don't understand the 8 different labels you mention here. Is this a binarized label issue?
Could you please explain it?
Thank you very much!!
Allen
I have 47 GB of RAM and 24 CPUs; how many partitions should I choose for this dataset?
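A rough rule of thumb, not an official recommendation: size partitions so that each one fits comfortably in per-worker memory, and use at least as many partitions as workers so every CPU stays busy. A sketch with these numbers (the 24 GB in-memory dataset size is an assumption you would replace with your own measurement):

```python
ram_gb, n_cpus = 47, 24
mem_per_worker_gb = ram_gb / n_cpus           # ~1.96 GB available per worker
dataset_gb = 24                                # hypothetical in-memory size of the data
target_partition_gb = mem_per_worker_gb / 4    # headroom for the feature matrix itself
n_partitions = max(n_cpus, round(dataset_gb / target_partition_gb))
print(n_partitions)  # → 49
```

Feature calculation can use several times the size of the raw partition, hence the factor-of-4 headroom; if workers are killed for memory, increase the partition count further.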
I ran the notebook Featuretools on Dask.ipynb
on my local machine, but something went wrong when b.compute() ran.
10 feature matrices had been generated when the error happened.
Here is the error info:
tornado.application - ERROR - Exception in callback <bound method BokehTornado._keep_alive of <bokeh.server.tornado.BokehTornado object at 0x7f9488d69d68>>
Traceback (most recent call last):
File "/home/lili/anaconda3/lib/python3.6/site-packages/tornado/ioloop.py", line 1208, in _run
self._next_timeout = self.io_loop.time()
File "/home/lili/anaconda3/lib/python3.6/site-packages/bokeh/server/tornado.py", line 514, in _keep_alive
c.send_ping()
File "/home/lili/anaconda3/lib/python3.6/site-packages/bokeh/server/connection.py", line 46, in send_ping
self._socket.ping(codecs.encode(str(self._ping_count), "utf-8"))
File "/home/lili/anaconda3/lib/python3.6/site-packages/tornado/websocket.py", line 367, in ping
self.ws_connection.write_ping(data)
File "/home/lili/anaconda3/lib/python3.6/site-packages/tornado/websocket.py", line 882, in write_ping
self._write_frame(True, 0x9, data)
File "/home/lili/anaconda3/lib/python3.6/site-packages/tornado/websocket.py", line 846, in _write_frame
return self.stream.write(frame)
File "/home/lili/anaconda3/lib/python3.6/site-packages/tornado/iostream.py", line 525, in write
future = self._set_read_callback(callback)
File "/home/lili/anaconda3/lib/python3.6/site-packages/tornado/iostream.py", line 1058, in _check_closed
size = 128 * 1024
tornado.iostream.StreamClosedError: Stream is closed
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-34-82e469b60feb> in <module>()
1 overall_start = timer()
----> 2 b.compute()
3 overall_end = timer()
4
5 print(f"Total Time Elapsed: {round(overall_end - overall_start, 2)} seconds.")
~/anaconda3/lib/python3.6/site-packages/dask/base.py in compute(self, **kwargs)
154 dask.base.compute
155 """
--> 156 (result,) = compute(self, traverse=False, **kwargs)
157 return result
158
~/anaconda3/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
393 keys = [x.__dask_keys__() for x in collections]
394 postcomputes = [x.__dask_postcompute__() for x in collections]
--> 395 results = schedule(dsk, keys, **kwargs)
396 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
397
~/anaconda3/lib/python3.6/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, **kwargs)
2198 try:
2199 results = self.gather(packed, asynchronous=asynchronous,
-> 2200 direct=direct)
2201 finally:
2202 for f in futures.values():
~/anaconda3/lib/python3.6/site-packages/distributed/client.py in gather(self, futures, errors, maxsize, direct, asynchronous)
1567 return self.sync(self._gather, futures, errors=errors,
1568 direct=direct, local_worker=local_worker,
-> 1569 asynchronous=asynchronous)
1570
1571 @gen.coroutine
~/anaconda3/lib/python3.6/site-packages/distributed/client.py in sync(self, func, *args, **kwargs)
643 return future
644 else:
--> 645 return sync(self.loop, func, *args, **kwargs)
646
647 def __repr__(self):
~/anaconda3/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
275 e.wait(10)
276 if error[0]:
--> 277 six.reraise(*error[0])
278 else:
279 return result[0]
~/anaconda3/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
691 if value.__traceback__ is not tb:
692 raise value.with_traceback(tb)
--> 693 raise value
694 finally:
695 value = None
~/anaconda3/lib/python3.6/site-packages/distributed/utils.py in f()
260 if timeout is not None:
261 future = gen.with_timeout(timedelta(seconds=timeout), future)
--> 262 result[0] = yield future
263 except Exception as exc:
264 error[0] = sys.exc_info()
~/anaconda3/lib/python3.6/site-packages/tornado/gen.py in run(self)
1097
1098 def set_result(self, key, result):
-> 1099 """Sets the result for ``key`` and attempts to resume the generator."""
1100 self.results[key] = result
1101 if self.yield_point is not None and self.yield_point.is_ready():
~/anaconda3/lib/python3.6/site-packages/tornado/gen.py in run(self)
1105 except:
1106 future_set_exc_info(self.future, sys.exc_info())
-> 1107 self.yield_point = None
1108 self.run()
1109
~/anaconda3/lib/python3.6/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
1443 six.reraise(type(exception),
1444 exception,
-> 1445 traceback)
1446 if errors == 'skip':
1447 bad_keys.add(key)
~/anaconda3/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
690 value = tp()
691 if value.__traceback__ is not tb:
--> 692 raise value.with_traceback(tb)
693 raise value
694 finally:
~/anaconda3/lib/python3.6/site-packages/dask/bag/core.py in reify()
1547 def reify(seq):
1548 if isinstance(seq, Iterator):
-> 1549 seq = list(seq)
1550 if seq and isinstance(seq[0], Iterator):
1551 seq = list(map(list, seq))
~/anaconda3/lib/python3.6/site-packages/dask/bag/core.py in map_chunk()
1707 else:
1708 for a in zip(*args):
-> 1709 yield f(*a)
1710
1711 # Check that all iterators are fully exhausted
<ipython-input-25-75ac088d04b8> in feature_matrix_from_entityset()
11 n_jobs = 1,
12 verbose = True,
---> 13 chunk_size = es['app'].df.shape[0])
14
15 feature_matrix.to_csv('data/fm/p%d_fm.csv' % es_dict['num'], index = True)
~/anaconda3/lib/python3.6/site-packages/featuretools/computational_backends/calculate_feature_matrix.py in calculate_feature_matrix()
256 cutoff_df_time_var=cutoff_df_time_var,
257 target_time=target_time,
--> 258 pass_columns=pass_columns)
259
260 feature_matrix = pd.concat(feature_matrix)
~/anaconda3/lib/python3.6/site-packages/featuretools/computational_backends/calculate_feature_matrix.py in linear_calculate_chunks()
518 cutoff_df_time_var,
519 target_time, pass_columns,
--> 520 backend=backend)
521 feature_matrix.append(_feature_matrix)
522 # Do a manual garbage collection in case objects from calculate_chunk
~/anaconda3/lib/python3.6/site-packages/featuretools/computational_backends/calculate_feature_matrix.py in calculate_chunk()
340 ids,
341 precalculated_features=precalculated_features,
--> 342 training_window=window)
343
344 id_name = _feature_matrix.index.name
~/anaconda3/lib/python3.6/site-packages/featuretools/computational_backends/utils.py in wrapped()
32 def wrapped(*args, **kwargs):
33 if save_progress is None:
---> 34 r = method(*args, **kwargs)
35 else:
36 time = args[0].to_pydatetime()
~/anaconda3/lib/python3.6/site-packages/featuretools/computational_backends/calculate_feature_matrix.py in calc_results()
314 precalculated_features=precalculated_features,
315 ignored=all_approx_feature_set,
--> 316 profile=profile)
317 return matrix
318
~/anaconda3/lib/python3.6/site-packages/featuretools/computational_backends/pandas_backend.py in calculate_all_features()
194
195 handler = self._feature_type_handler(test_feature)
--> 196 result_frame = handler(group, input_frames)
197
198 output_frames_type = self.feature_tree.output_frames_type(test_feature)
~/anaconda3/lib/python3.6/site-packages/featuretools/computational_backends/pandas_backend.py in _calculate_agg_features()
421 funcname = func
422 if callable(func):
--> 423 funcname = func.__name__
424
425 to_agg[variable_id].append(func)
AttributeError: 'functools.partial' object has no attribute '__name__'
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://127.0.0.1:50460 remote=tcp://127.0.0.1:45867>
If I have a dataset with only one table, how do I use Featuretools? If all the features are continuous numeric data, is Featuretools still useful? Many thanks.
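With a single table there are no relationships to aggregate across, so deep feature synthesis can only apply transform primitives (or you can split the table into two entities with normalize_entity to enable aggregations). The kind of transform features it would build can be sketched in plain pandas (column names here are hypothetical):

```python
import numpy as np
import pandas as pd

# One table, all continuous numeric columns (hypothetical names)
df = pd.DataFrame({'income': [100.0, 250.0, 90.0],
                   'debt':   [40.0, 50.0, 45.0]})

# Row-wise transform features, analogous to Featuretools
# transform primitives such as divide and log
df['debt_to_income'] = df['debt'] / df['income']
df['log_income'] = np.log(df['income'])
print(df.columns.tolist())
# → ['income', 'debt', 'debt_to_income', 'log_income']
```

So yes, it can still be useful on continuous single-table data, but the gain over writing these transforms by hand is smaller than in the multi-table case, where stacked aggregations are where Featuretools shines.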