h2oai / wave-h2o-automl
Wave App for H2O AutoML
Home Page: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
License: Apache License 2.0
This might be useful for the user to save.
This might be an H2O bug actually...
Currently we allow the user to optionally upload a test file, but we really always want to have a test set for the Explain functions.
If the user only uploads a training csv, then we should automatically create a test set that will be used in the Explain functions. We should allow the user to adjust the fraction of their training data that will be used for the test set.
Related to this: #14
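One way to sketch the auto-created test set (a minimal pandas version, assuming the uploaded training data is already in a DataFrame; the function name and the 20% default are hypothetical):

```python
import pandas as pd

def split_train_test(train_df: pd.DataFrame, test_fraction: float = 0.2, seed: int = 1):
    """Hold out a user-adjustable fraction of the training data for the Explain functions."""
    test = train_df.sample(frac=test_fraction, random_state=seed)
    train = train_df.drop(test.index)
    return train.reset_index(drop=True), test.reset_index(drop=True)
```

The `test_fraction` argument is where the user-facing slider/field would plug in.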
Change max_models and max_runtime_secs to text fields.

There should be a better response than the app going blank when the user clicks on Model Explain too soon. The same thing happens when you click on AutoML Viz -> Variable Explain. Maybe it could just say "No models trained yet".
We might consider changing the version requirement in requirements.txt to <3.36.0.2 (or putting a lower bound on it), so that users have more flexibility in using the app with an existing/installed version of h2o on their machine.
If we want people to be able to use earlier versions of h2o (e.g. below 3.32.1.1), then we need to abstract some of the leaderboard code a bit more. The "algo" column was only added in 3.32.1.1, for example, and the other extended leaderboard columns were added in an earlier version (3.28.0.1). If we add the learning curve, that was likewise only added in a specific recent version of h2o, etc.
If the classification toggle is kept on (the default) and you use a regression dataset, it will throw an error on the backend because it looks for max_per_class_error on the leader model, which is not there. So it trains the whole AutoML run and only then throws an error at the end, which is bad.
Ideally the app would automatically set the toggle once the user selects a real-valued target column, but right now it doesn't work that way.
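A cheap guard (a sketch, assuming the leaderboard has been converted to pandas; the helper name is hypothetical) would be to check for the column before touching the leader model, instead of failing after training:

```python
import pandas as pd

def leader_metric(lb: pd.DataFrame, metric: str = "max_per_class_error"):
    """Return the leader's metric if the leaderboard has it; None otherwise
    (e.g. a regression leaderboard has no max_per_class_error column)."""
    if metric not in lb.columns:
        return None
    return lb.iloc[0][metric]
```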
Update to the same list that we have here: https://github.com/h2oai/wave-h2o-automl/blob/main/about.md
Features:
The metrics columns are too wide. https://wave.h2o.ai/docs/api/ui#table_column
Also add "Task Type" as output in the text above the Leaderboard.
In the case of many columns this is not ideal, so a picker will be better. https://wave.h2o.ai/docs/widgets/form/picker/#basic-picker
To avoid the computational delay, let's start the explain tasks immediately after the training finishes using Background Tasks. Then we can store the data/images in the app so they will be ready to serve by the time the user clicks on the Explain tabs. We will just store the default model images for Model Explain and generate the other ones on-demand.
Useful blog: https://medium.com/@unusualcode/background-jobs-in-wave-or-how-not-to-kill-your-ui-ae1fed95693a
Right now, we are passing a default of 60 mins, which is causing the progress bar to be confused about its progress. When the user sets max_models but not max_runtime_secs, then max_runtime_secs should be set to None.
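The rule above can be sketched as a tiny helper (the function name and the 3600s default are hypothetical, standing in for the app's current 60-minute default):

```python
def resolve_runtime(max_models, max_runtime_secs, default_secs=3600):
    """If the user caps the model count but leaves runtime blank, let AutoML
    run unbounded (None) instead of inheriting the 60-minute default."""
    if max_models is not None and max_runtime_secs is None:
        return None
    return max_runtime_secs if max_runtime_secs is not None else default_secs
```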
I think this used to work...? Here's an error when I tried to load a CSV (replicated with multiple CSVs).
I can upload the file, and it appears in the training set dropdown. I can select it, but when I click Train we get this error:
Unhandled exception
Traceback (most recent call last):
File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.8/site-packages/h2o_wave/server.py", line 320, in _process
await self._handle(q)
File "./src/app.py", line 915, in serve
await train_menu(q)
File "./src/app.py", line 283, in train_menu
q.app.train_df = pd.read_csv(local_path)
File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 610, in read_csv
return _read(filepath_or_buffer, kwds)
File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 462, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 819, in __init__
self._engine = self._make_engine(self.engine)
File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 1050, in _make_engine
return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 1867, in __init__
self._open_handles(src, kwds)
File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 1362, in _open_handles
self.handles = get_handle(
File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.8/site-packages/pandas/io/common.py", line 642, in get_handle
handle = open(
FileNotFoundError: [Errno 2] No such file or directory: '/_f/43b1c3b5-fefe-4b8c-835a-ecacd3e6e632/higgs_10k.csv'
Currently it will bring up a similar "model explain" (original) page within the Leaderboard tab. Let's just redirect to our Model Explain tab, with the requested model selected.
It's very small by default; use FIGSIZE[0] x2.
There should be a way to see the hyperparameters of the trained models. This could be another tab in the Model Explain interface.
Currently, every time you view the AutoML Viz or Model Explain tab, all the data/images are regenerated from scratch. Let's cache this in the app and serve it from the cached copy rather than regenerating each time.
For each plot, we can store the plot in the app, e.g. q.app.varimp_heat_plot and q.app.mc_plot. Example, something like this:

# Model Correlation Heatmap (1)
try:
    train = h2o.H2OFrame(q.app.train_df)
    y = q.app.target
    if q.app.is_classification:
        train[y] = train[y].asfactor()
    if q.app.mc_plot is None:
        q.app.mc_plot = q.app.aml.model_correlation_heatmap(frame=train, figsize=(FIGSIZE[0], FIGSIZE[0]))
    q.page['plot21'] = ui.image_card(
        box='charts_left',
        title="Model Correlation Heatmap Plot",
        type="png",
        image=get_image_from_matplotlib(q.app.mc_plot),
    )
except Exception:
    pass  # the original snippet's try block had no except; handle/log as appropriate
If you import a CSV with no header, it will use the first row as the header instead of creating a dummy one. Not sure why, since if you do this in h2o.import_file() it works correctly, so it seems like the generic Wave data loader is not working correctly.
Example:
import h2o
h2o.init()
train = h2o.import_file("https://github.com/h2oai/h2o-tutorials/raw/0bd643cddc850eb8692f1e3ff7d8211e4168c7d2/tutorials/data/higgs_10k.csv")
train.columns # proper column names (C1, C2...)
In the app, it will show columns with names that are "numeric":
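As a possible workaround on the app side (not a fix for the Wave loader itself), the stdlib csv.Sniffer heuristic can decide whether to pass header=None to pandas and generate h2o-style C1, C2, ... names; the function name is hypothetical:

```python
import csv
import io

import pandas as pd

def read_csv_with_header_detection(text: str) -> pd.DataFrame:
    # csv.Sniffer guesses whether the first row is a header by comparing
    # its values against the types seen in the remaining rows.
    if csv.Sniffer().has_header(text):
        return pd.read_csv(io.StringIO(text))
    df = pd.read_csv(io.StringIO(text), header=None)
    df.columns = [f"C{i + 1}" for i in range(df.shape[1])]  # h2o-style names
    return df
```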
These will be displayed in the H2O AI Cloud App Store, so they need to be updated.
Also update the screenshot in the README.
Apply .round(5) to all the columns in the leaderboard. For now, we are using h2o 3.32, so there's only a subset of the full leaderboard columns right now (we can update later).

Maybe not for real applications, but a lot of demos will use data frames, so a scoring function that takes a data frame and returns one would be nice. This is a rough draft which needs to be cleaned up, but putting it here as a placeholder for the ModelOps Utilities:
import json

import pandas as pd

def get_predictions(rows: pd.DataFrame):
    # handle nulls
    rows = rows.where(pd.notnull(rows), "")
    # every value needs to be a string
    vals = [[str(x) for x in row] for row in rows.values.tolist()]
    # build the payload in the expected format; json.dumps emits the
    # double quotes that MLOps needs
    payload = json.dumps({"fields": rows.columns.tolist(), "rows": vals})
    # use the utility function
    dict_preds = mlops_get_score('https://model.wave.h2o.ai/f2659e88-cbad-4ae0-baf0-e25daef42461/model/score',
                                 payload)
    # turn the returned dict into a dataframe
    preds = pd.DataFrame(data=dict_preds['score'], columns=dict_preds['fields'])
    # join with original data; assumption is the row order never changes
    return pd.concat([rows, preds], axis=1)
Let's add a very small dataset that is a regression problem. Maybe the wine dataset? (6497 rows x 13 columns). We can do an 80/20 split to create train and test csvs and upload them to the repo.
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/explain.html#explain-models
Currently we have credit card which is ~24k and is binary classification. This will speed up demos and also wine quality is nice since we use it in all our other explainability demos.
To split and export file: https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/h2o.html?highlight=export#h2o.export_file
import h2o
h2o.init()
# Import wine quality dataset
f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
df = h2o.import_file(f)
# Split into train & test
splits = df.split_frame(ratios = [0.8], seed = 1)
train = splits[0]
test = splits[1]
h2o.export_file(train, path="wine_quality_train.csv")
h2o.export_file(test, path="wine_quality_test.csv")
The files that will be uploaded are: wine_quality_train.csv and wine_quality_test.csv
It seems useful to have some basic visualizations about the training data (e.g. shape, distribution of response, etc). These visualizations should probably be automatically generated when the user uploads a CSV file.
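A sketch of the numbers such a view could surface right after upload (pure pandas; the helper name and the exact fields shown are assumptions, and the actual app would render these as Wave cards/plots):

```python
import pandas as pd

def training_data_summary(df: pd.DataFrame, target: str) -> dict:
    """Basic facts to display after a CSV upload: shape, missingness,
    and the distribution of the response column."""
    summary = {
        "rows": df.shape[0],
        "cols": df.shape[1],
        "missing": int(df.isna().sum().sum()),
    }
    if pd.api.types.is_numeric_dtype(df[target]):
        summary["response"] = df[target].describe().to_dict()   # numeric: summary stats
    else:
        summary["response"] = df[target].value_counts().to_dict()  # categorical: class counts
    return summary
```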
Currently the test_file is stored as test_df but then not used anywhere. We should consider what we want to do with a test file (such as passing it on to the explain functions).
Currently they are drop-down, which poses an issue for datasets with a large number of columns.
Please go through: https://h2oai.atlassian.net/wiki/spaces/PROD/pages/2986049638/Wave+App+Checklist and make any changes as needed, thanks!
Use updated version from app.toml file.
This was part of the original functionality and it seems useful to keep around.
There could be a case where the user does not care about any of the explanations and wants to use all of their data for training. So we should allow them to do that.
We would expect that clicking on the Stacked Ensemble would bring up model visualizations for the SE, but instead it just redirects to the top base model. I think we can create some SE-specific visualizations here that would be more helpful.
Will be used in the H2O AI Appstore.
This is an option in AutoML, so we should also expose it here.
If you realize you did something wrong or want to quit, let's make a button to do that gracefully.
We should add the Pareto front. To start, let's just display the default (x,y) axis, which is prediction time on x axis and model performance on y axis. In the future we can consider making this more interactive.
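For the non-interactive starting point, the front itself can be computed directly from a (pandas) leaderboard; the column names below (predict_time_per_row_ms, rmse) are assumptions about the extended leaderboard, and lower is taken to be better on both axes:

```python
import pandas as pd

def pareto_front(lb: pd.DataFrame, time_col: str = "predict_time_per_row_ms",
                 perf_col: str = "rmse") -> pd.DataFrame:
    """Models for which no other model is both faster and better."""
    df = lb.sort_values(time_col).reset_index(drop=True)
    best = float("inf")
    keep = []
    for _, row in df.iterrows():
        if row[perf_col] < best:  # strictly improves on every faster model
            best = row[perf_col]
            keep.append(True)
        else:
            keep.append(False)
    return df[keep].reset_index(drop=True)
```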
There's a classification button, which is set to classification by default. However, that makes it easy to train a classification model when you really want a regression model. If the response is real-valued and you try to do classification, it currently just breaks wave (e.g. wine quality data using "alcohol" as response). So we need a better error.
Another issue is when you have an integer valued column as the response which should be numeric/regression (e.g. wine quality using "quality" as response), it will still train a classification model when asked to. You don't realize that until after you read the column headers in the leaderboard.
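A heuristic the app could use to pre-set the toggle (and warn on integer-valued responses like wine "quality") might look like this; the function name and the 20-class cutoff are assumptions, not anything AutoML itself does:

```python
import pandas as pd

def infer_task(target: pd.Series, max_classes: int = 20) -> str:
    """Guess classification vs regression from the selected response column."""
    if not pd.api.types.is_numeric_dtype(target):
        return "classification"
    if pd.api.types.is_float_dtype(target):
        return "regression"
    # Integer-valued responses are ambiguous (e.g. wine "quality");
    # fall back to a cardinality heuristic and let the user override.
    return "classification" if target.nunique() <= max_classes else "regression"
```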
The current training interface is pretty minimal. I think we should expose all AutoML parameters, but hide most of them in the "Expert" settings section. We can expose another set as "Advanced". Here's the full list of AutoML params: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html#automl-interface
Basic Data Parameters:
Advanced Data Parameters:
Basic Training Parameters:
Advanced:
Expert:
If you use wine dataset with "quality" and let it train a classification (multiclass model), the PD plot is broken. That's because we are missing the part where we set a reference class, so we need to update the code to input a reference class.
For multiclass, we should add a second picker box with the list of classes, so that you can choose what class you view the PD plot for.
Current version is: 3.36.1.2
Let's add two new menu items (tabs) for the H2O Explain output.
AutoML Explain tab (only group/AutoML level plots)
We have stored the AutoML object in q.app.aml, and we can re-use the aml object from there to generate the AutoML plots.
Model Explain tab:
First section is model only:
Second section is row-related plots (for this same model):
Currently we are going to use the plots directly from matplotlib, so that the users can download the images easily and it looks the same as R/Py. Also it's easier to re-use rather than try to re-create some of the more complex plots. However, we may re-consider this approach (if we do, we will re-use the plotting code to generate plots that the user can download with the click of a button).
Some of the explain functions require a test set. Right now they are re-using the train set (which is not good #14). We should not force the user to provide a test set in case they don't care about using the explain functionality, but we might want to add a checkbox (selected by default) that says "Automatically create a test set (used in some explainability features)". If the user selects a test set from the drop down, then that will be used instead. However, if the user wants to use train only, then they can unselect the checkbox and then we can remove the plots that require a test set.
From support ticket
The app isn't fitting the screen on an Apple M1 MacBook. I can't see the ends. It's stretched and there is no option to fix it.
App Details
ID: bec14f76-cc10-4886-823c-d631c6492867
NAME: H2O AutoML
VERSION: 0.3.0
It's fine the first time, but if you go away and come back, it breaks the app.
Traceback (most recent call last):
  File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.10/site-packages/h2o_wave/server.py", line 341, in _process
    await self._handle(q)
  File "/Users/me/h2oai/github/wave-h2o-automl/./src/app.py", line 1112, in serve
    elif not await handle_on(q):
  File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.10/site-packages/h2o_wave/routing.py", line 179, in handle_on
    if await _match_predicate(predicate, func, arity, q, arg_value):
  File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.10/site-packages/h2o_wave/routing.py", line 129, in _match_predicate
    await _invoke_handler(func, arity, q, arg)
  File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.10/site-packages/h2o_wave/routing.py", line 119, in _invoke_handler
    await func(q, arg)
  File "/Users/me/h2oai/github/wave-h2o-automl/./src/app.py", line 910, in aml_varimp
    ui.picker(name='column_pd', label='Select Column', choices=choices, max_choices = 1, values = [q.app.pd_col]),
  File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.10/site-packages/h2o_wave/ui.py", line 1741, in picker
    return Component(picker=Picker(
  File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.10/site-packages/h2o_wave/types.py", line 4471, in __init__
    _guard_vector('Picker.values', values, (str,), False, True, False)
  File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.10/site-packages/h2o_wave/types.py", line 50, in _guard_vector
    _guard_scalar(f'{name} element', value, types, False, non_empty, False)
  File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.10/site-packages/h2o_wave/types.py", line 37, in _guard_scalar
    raise ValueError(f'{name}: want one of {types}, got {type(value)}')
ValueError: Picker.values element: want one of (<class 'str'>,), got <class 'NoneType'>
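The failure is values=[q.app.pd_col] being passed when q.app.pd_col is still None on a revisit. A minimal guard (the helper name is hypothetical) would normalize the stored selection before building the picker:

```python
def picker_values(selected):
    """ui.picker requires a list of str; an unset selection must
    become an empty list, not [None]."""
    return [] if selected is None else [str(selected)]
```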
It's too hard to read at this font size...
Change all fonts: https://stackoverflow.com/questions/3899980/how-to-change-the-font-size-on-a-matplotlib-plot
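Per the linked answer, one global way to do this (a sketch; the exact sizes are placeholders to tune) is to bump matplotlib's rcParams before any explain plot is drawn:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, matching server-side rendering
import matplotlib.pyplot as plt

# Raise the global defaults; every subsequent plot inherits these.
plt.rcParams.update({
    "font.size": 14,
    "axes.titlesize": 16,
    "axes.labelsize": 14,
})
```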