nteract / papermill
Parameterize, execute, and analyze notebooks
Home Page: http://papermill.readthedocs.io/en/latest/
License: BSD 3-Clause "New" or "Revised" License
I was half expecting the values in the cell tagged as "parameters" to work as default values. This would be convenient for things like random seeds.
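A minimal sketch of the behavior being asked for, assuming the tagged cell keeps its assignments as defaults:

# cell tagged "parameters" -- these values would act as defaults
seed = 42
alpha = 0.1

# cell papermill injects when run with, e.g., -p alpha 0.5; seed keeps its default
alpha = 0.5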
Provide plain functions that let you run a parameterized notebook directly on a Dask client.

import papermill
from dask.distributed import Client

client = Client()
futures = []
for param1 in range(20):
    # hypothetical call shape: papermill would handle the output path internally
    future = client.submit(papermill.execute_notebook, notebook, param1=param1)
    # Future<NotebookNode>
    futures.append(future)
summary = client.submit(summarize_notebooks, futures)  # summarize_notebooks: proposed helper
df = summary.result()
# DataFrame<PapermillNotebookSummary>
Mostly a note for me to fix the issue, but we're building a universal wheel while Python 2 has a stricter requirements section for ipython. This means we're pushing that stricter requirement to Python 3 installs that use the wheel.
When a user doesn't specify a parameters-tagged cell, it would be nice if papermill defaulted to some sane logical setting; a sketch follows below. Treating the beginning of the notebook as the parameters cell seems reasonable and would make adoption of existing notebooks with naive inputs quicker.
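A rough sketch of such a fallback, assuming an nbformat NotebookNode as input:

def find_parameters_cell(nb):
    # prefer an explicitly tagged cell
    for idx, cell in enumerate(nb.cells):
        if "parameters" in cell.metadata.get("tags", []):
            return idx
    # otherwise fall back to the first code cell in the notebook
    for idx, cell in enumerate(nb.cells):
        if cell.cell_type == "code":
            return idx
    return None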
Inputs with escaped or wrapped double quotes inside (-p foo '{"bar":"baz"}') cause notebook execution to fail with syntax errors. The example above results in foo = "{"bar":"baz"}" in the notebook, which isn't valid Python with the wrapping quotes.
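Until the CLI handles this, one workaround is to pass structured values through the Python API, where no string re-quoting happens (a sketch, assuming the 0.x execute_notebook signature):

import papermill as pm

pm.execute_notebook(
    'input.ipynb',
    'output.ipynb',
    parameters={'foo': {'bar': 'baz'}},  # dict survives translation intact
)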
Make sure we work across Python 2 and Python 3.
I was going through the example "Displaying Plots and Images Saved by Other Notebooks". However, when I tried to display plots in another notebook, it failed.
I found that the cell that has the plot has the keys [u'output_type', u'data', u'metadata'], and the data field actually has the plot, but the metadata is an empty dictionary. It seems to me the metadata is not correctly set up for the cell, but I am not sure what happened.
My IPython version is 5.3.0, Python is 2.7, and matplotlib is 1.5.1.
We are currently monkey patching the preprocessor in the execute.py module. We should instead be subclassing the Preprocessor class as intended.
https://github.com/nteract/papermill/blob/master/papermill/execute.py#L22
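A minimal sketch of the subclassing approach, assuming nbconvert's ExecutePreprocessor as the base (names here are illustrative, not the final design):

from nbconvert.preprocessors import ExecutePreprocessor

class PapermillExecutePreprocessor(ExecutePreprocessor):
    """Override preprocess rather than patching it onto nbconvert at import time."""
    def preprocess(self, nb, resources):
        # papermill-specific bookkeeping could happen around the parent call
        nb, resources = super(PapermillExecutePreprocessor, self).preprocess(nb, resources)
        return nb, resources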
Hey all!
I'd love to make a menu option in nteract to denote a parameter cell in a single click within the nteract web app. As for design, for the moment, to keep this simple with our current setup, I'd just be adding it to our menu.
When it is a parameterized cell, we can show a special border on the top or otherwise indicate that it's a parameter cell. I haven't thought much about the design here other than that we want some way for users to see it visually.
It's easy enough for me to mark it with a tag under the covers; however, it would be really nice to just set it in the cell's metadata (or even using the metadata.name attribute):
{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "papermill": {
          "parameter_cell": true
        }
      },
      "outputs": [],
      "source": [
        "x = 3\n",
        "y = 3"
      ]
    }
  ],
  "metadata": {},
  "nbformat": 4,
  "nbformat_minor": 2
}
What do you all think? Would papermill be open to looking for this metadata in addition to the current tagging mechanism?
Maybe .ipynb and .json are the only reasonable choices here? Maybe just a warning?
I'm using papermill to parameterize some notebooks which I'm then exporting as HTML reports using nbconvert. But I've run into issues with nbconvert erroring out if I use it after running pm.execute_notebook(). Specifically, this line is the cause, since it hijacks nbconvert's Preprocessor.preprocess method: https://github.com/nteract/papermill/blob/master/papermill/execute.py#L141
I ended up getting it to work by saving the original nbconvert preprocess method and then reassigning it after I've used papermill. But, I was wondering, is there a better way?
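For reference, a sketch of that workaround, assuming papermill applies the patch when it is imported:

# grab the original before importing papermill (assumption: the patch happens at import)
from nbconvert.preprocessors import Preprocessor
_original_preprocess = Preprocessor.preprocess

import papermill as pm
pm.execute_notebook('input.ipynb', 'output.ipynb')

# restore so subsequent nbconvert exports behave normally
Preprocessor.preprocess = _original_preprocess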
We can do it and anyone can help.
I've tagged this as Hacktoberfest friendly, feel free to ask questions.
Must have knowledge:
Optional (you can learn it):
It's OK if people want to create issues for each of the untested areas and then go ahead and tackle them.
Either as part of this repo or a new repo called something like "papermill-example-workflow", create a regularly running job on CircleCI that will run a notebook via papermill and post the result somewhere. It should be a nice way to demonstrate how to use papermill while also operating as a good functional test.
CircleCI will need to be enabled for the repo for this to work.
Hi,
I have been using papermill and nbconvert to parameterize and convert Jupyter notebooks. It worked really well until recently, when I started getting the following messages:
/home/florathecat/anaconda3/lib/python3.6/site-packages/nbconvert/filters/datatypefilter.py:41: UserWarning: Your element with mimetype(s) dict_keys(['application/papermill.record+json']) is not able to be represented.
mimetypes=output.keys())
The notebooks are still parameterized and converted fine; it is just that the error message bugs me. I have tried updating conda, Python, and nbconvert, and reinstalling papermill. None of these seem to work. I'd appreciate it if somebody can help.
Yun
Follow on to the work from #74 to more clearly identify the original cell that was tagged as a parameters cell and the inserted cell with passed parameters.
On a notebook that has been executed, it would be helpful if the cell run duration were stored with the cell output. Ideally this could be visualized as well, perhaps something like:
In [5]:
1:42
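One possible shape for this (purely illustrative field names, stored per cell alongside the existing outputs):

cell_metadata = {
    "papermill": {
        "start_time": "2018-01-05T12:00:00.000Z",
        "end_time": "2018-01-05T12:01:42.000Z",
        "duration": 102.0,  # seconds; rendered as 1:42 above
    }
}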
Versions above 0.11 seem to present an error when running the record() function.
As in the example in the README file:
"""notebook.ipynb"""
import papermill as pm
pm.record("hello", "world")
pm.record("number", 123)
pm.record("some_list", [1, 3, 5])
pm.record("some_dict", {"a": 1, "b": 2})`
gives:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-20-447d49aec58c> in <module>()
2 import papermill as pm
3
----> 4 pm.record("hello", "world")
5 pm.record("number", 123)
6 pm.record("some_list", [1, 3, 5])
~\Anaconda3\envs\mestrado\lib\site-packages\papermill\api.py in record(name, value)
33 # IPython.display.display takes a tuple of objects as first parameter
34 # `http://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html#IPython.display.display`
---> 35 ip_display(({RECORD_OUTPUT_TYPE: {name: value}},), raw=True)
36
37
~\Anaconda3\envs\mestrado\lib\site-packages\IPython\core\display.py in display(include, exclude, metadata, transient, display_id, *objs, **kwargs)
293 for obj in objs:
294 if raw:
--> 295 publish_display_data(data=obj, metadata=metadata, **kwargs)
296 else:
297 format_dict, md_dict = format(obj, include=include, exclude=exclude)
~\Anaconda3\envs\mestrado\lib\site-packages\IPython\core\display.py in publish_display_data(data, metadata, source, transient, **kwargs)
118 data=data,
119 metadata=metadata,
--> 120 **kwargs
121 )
122
~\Anaconda3\envs\mestrado\lib\site-packages\ipykernel\zmqshell.py in publish(self, data, metadata, source, transient, update)
115 if transient is None:
116 transient = {}
--> 117 self._validate_data(data, metadata)
118 content = {}
119 content['data'] = encode_images(data)
~\Anaconda3\envs\mestrado\lib\site-packages\IPython\core\displaypub.py in _validate_data(self, data, metadata)
48
49 if not isinstance(data, dict):
---> 50 raise TypeError('data must be a dict, got: %r' % data)
51 if metadata is not None:
52 if not isinstance(metadata, dict):
TypeError: data must be a dict, got: {'application/papermill.record+json': {'hello': 'world'}}
It'd be useful to have an additional tag which tells papermill to skip particular cells. This would let cells with lots of prints, long outputs, or graph outputs that aren't needed on every run remain visible in the notebook without running on each papermill execution.
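A rough sketch of what the execution side could do, assuming a hypothetical "skip" tag and an nbformat notebook:

import nbformat

nb = nbformat.read('input.ipynb', as_version=4)
# drop cells carrying the hypothetical "skip" tag before handing off for execution
nb.cells = [
    cell for cell in nb.cells
    if 'skip' not in cell.metadata.get('tags', [])
]
nbformat.write(nb, 'prepared.ipynb')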
Today papermill hangs when kernels get OOM-killed instead of returning with an error status code within some reasonable timeframe.
Steps to reproduce:
We'd like to see papermill be able to instrument/request memory usage. This is currently in progress with jupyter/jupyter#264.
Ideally, we'd stick the metrics within the metadata per cell in a similar way to how we do timing (duration, start time, end time, etc.).
Mostly a note to self, as I'll be traveling for a few weeks and would like to check this out: https://github.com/hz-inova/run_jnb
For example, the target notebook, test.ipynb:
import pandas as pd
import numpy as np
dates = pd.date_range('20130101', periods=200)
df = pd.DataFrame(np.random.randn(200,4), index=dates, columns=list('ABCD'))
df
The code running it:
import papermill as pm
pm.execute_notebook(
notebook='test.ipynb',
output='test_out.ipynb',
)
When I open the output, it displays incorrectly. However, if I use df.head(20), there is no such problem.
With the following papermill run,
$ papermill -p x 1 -p y 70 template.ipynb out.ipynb
/usr/local/lib/python2.7/site-packages/jupyter_client/connect.py:157: RuntimeWarning: Failed to set sticky bit on u'/var/folders/kd/cylz4mhs1_9cpsjh0_c_gzfr0000gn/T': [Errno 1] Operation not permitted: '/var/folders/kd/cylz4mhs1_9cpsjh0_c_gzfr0000gn/T'
RuntimeWarning,
it does write out the out.ipynb file, though the RuntimeWarning made me think that it completely failed. This raises a few things for me: should there be a --verbose / -v mode to show progress at the CLI?

Just noticed while writing a blog post linking to you! ;)
Can papermill be installed in a Docker container with Python 2.7? My Docker container does not have a connection to the internet. I have successfully installed the dependencies (botocore, boto3, tqdm, click, s3transfer). When I try to install using pip install --user papermill-0.12.3-py3-none-any.whl I get the message papermill-0.12.3-py3-none-any.whl is not a supported wheel on this platform. Any ideas?
I tried to record a numpy array but that fails somewhere in a step to clean the JSON. I think the problem is that there is no builtin way to convert a numpy array to JSON.
What is the recommendation for outputting numpy arrays? For example, I could imagine that "arrays tend to be too big, so store them somewhere else and record the path to it" is one option. An alternative would be to add a step to pickle/serialize types that can't be stored as JSON in record before dumping them into the notebook.
I'd be up for implementing the latter option but wanted to discuss ideas first; a sketch of both workarounds follows.
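For what it's worth, both options can be approximated today from user code (a sketch, assuming the 0.x record API):

import numpy as np
import papermill as pm

arr = np.arange(6).reshape(2, 3)

# option A: make the value JSON-serializable before recording it
pm.record('my_array', arr.tolist())

# option B: persist the array elsewhere and record only the path
np.save('my_array.npy', arr)
pm.record('my_array_path', 'my_array.npy')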
Traceback
$ ipython
Python 3.6.1 (default, Apr 4 2017, 09:40:21)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import papermill
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-1-790e320f196c> in <module>()
----> 1 import papermill
~/code/src/github.com/nteract/papermill/papermill/__init__.py in <module>()
5 del get_versions
6
----> 7 from execute import (
8 execute_notebook,
9 set_environment_variable_names,
ModuleNotFoundError: No module named 'execute'
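The traceback points at an absolute import inside the package; the likely fix (my reading, not verified against the source) is a relative import in papermill/__init__.py:

from .execute import execute_notebook, set_environment_variable_names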
In papermill/papermill/execute.py (lines 157 to 163 at 7b717e7), it makes the assumption that only one cell will have any parameters being defined in the notebook and raises an error if this is not the case.
We have name as a cell-level metadata field in the nbformat spec that specifically requires values to be unique across the notebook. If you want to maintain this uniqueness qualifier, it would be cleaner to require setting the name on a cell rather than a tag.
If the only reason not to do this is that the UI for setting a name vs. tags is more inconvenient, that suggests we have a frontend UI issue. In that case, using tags is just a compromise made for the current state of front-ends and should probably be deprecated as soon as possible in favour of setting name properly.
From a user:
It would be cool if papermill had an option to hide code cell input when generating an output notebook. This could be very helpful when generating a nightly report and not wanting to see the code that generates the report.
Simple enough. nteract uses metadata.inputHidden as a boolean value to indicate if a cell's input is hidden.
I suppose a flag like --hide-inputs or something? Maybe if we wanted to be more opinionated about the naming we'd call it --report-mode or --mode=report?
Thus far convergence on flag name is --report-mode.
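A sketch of what the flag could do to the output notebook, using the nteract metadata convention mentioned above:

import nbformat

nb = nbformat.read('output.ipynb', as_version=4)
for cell in nb.cells:
    if cell.cell_type == 'code':
        cell.metadata['inputHidden'] = True  # nteract-style hint to hide inputs
nbformat.write(nb, 'report.ipynb')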
We should add a changelog file and include how to update / automate its population in RELEASE.md.
Right now all arguments passed as -p foo 2 are created as floats in the notebook. When using parameters to specify shapes of numpy arrays and the like, you end up having to convert them to integers first. We could use the type of the parameter's existing value as a guide and convert arguments to that type.
So a "parameters" cell with foo = 2.123 and papermill ... -p foo 2 would result in foo = 2.0, but a cell with foo = 2 would result in foo = 2.
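A sketch of that coercion rule, keyed off the default's type (hypothetical helper, not papermill API):

def coerce(raw, default):
    """Cast the CLI string using the type of the default from the parameters cell."""
    if default is None:
        return raw
    if isinstance(default, bool):  # check bool before int: bool subclasses int
        return raw.lower() in ('1', 'true', 'yes')
    return type(default)(raw)

coerce('2', 2.123)  # -> 2.0
coerce('2', 2)      # -> 2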
The kernels today usually capture stdout and stderr messages directly and buffer them into the cell JSON contents. But if one is running papermill and the kernel dies (e.g. OOM, kill -9), the active messages get lost. I'm trying to figure out the best approaches for capturing these logs in these events.
There is an attempt to explore capturing pipes on the kernel process via:
ipython/ipykernel#315
It would be handy if we had a dry-run mode which parameterized a notebook and saved it to the output path without actually executing the cells. This would enable preparing notebooks for execution elsewhere or with alternative kernels in an upstream process. It should be relatively simple to add.
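A sketch of how this might surface in the API (the flag name is hypothetical; nothing here is implemented yet):

import papermill as pm

pm.execute_notebook(
    'input.ipynb',
    'prepared.ipynb',
    parameters={'alpha': 0.5},
    prepare_only=True,  # inject parameters, skip kernel execution
)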
I love the run_notebook option and this is probably the main use case, but sometimes I'd like to be able to start an interactive notebook server with the parameters predetermined (or better, have a service to which I don't have direct access, like Binder, do so), à la
papermill save --input input.ipynb --parameter a='hello world' --output output.ipynb
jupyter notebook.ipynb
or better
papermill serve input.ipynb --parameter a='hello world'
Would others see this as useful as well?
As I see in the docs, papermill requires "Parameterizing a Notebook":
"To parameterize your notebook, designate a cell with the tag parameters. Papermill looks for the parameters cell and replaces those values with the parameters passed in at execution time."
However, I can't find a place to add a tag to a cell in the original Jupyter Notebook. Is it only available in nteract?
// Parameters
val RUN_TS = 1528866180240
==========================================
Name: Compile Error
Message: <console>:5: error: integer number too large
val RUN_TS = 1528866180240
^
<console>:6: error: ';' expected but 'val' found.
val RUN_DATE_MINUS_1 = 20180612
^
StackTrace:
It should instead populate as val RUN_TS = 1528866180240L.
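A sketch of what a Scala-aware translation step could do for integer parameters (function name hypothetical):

def translate_scala_int(value):
    """Append an 'L' suffix when the value overflows Scala's 32-bit Int."""
    if not (-2**31 <= value < 2**31):
        return '{}L'.format(value)
    return str(value)

translate_scala_int(1528866180240)  # -> '1528866180240L'
translate_scala_int(20180612)       # -> '20180612'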
Let's start using pytest with future tests since they are simpler to write and maintain. This should result in better code coverage over time.
Passing an HTTP route as the notebook input results in an io.read error from trying to use the local handler. Instead it should check for http-prefixed URIs, as sketched below.
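A sketch of scheme-based dispatch (the handler names here are placeholders, not papermill's actual registry):

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

def pick_handler(path):
    """Route reads/writes by URI scheme instead of assuming local files."""
    scheme = urlparse(path).scheme
    if scheme in ('http', 'https'):
        return 'http_handler'
    if scheme == 's3':
        return 's3_handler'
    return 'local_handler'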
When trying to execute a notebook I get the following error:
TraitError: The 'timeout' trait of an ExecutePreprocessor instance must be an integer, but a value of None <type 'NoneType'> was specified.
Code:
pm.execute_notebook(
'TEST.ipynb',
'TEST-out.ipynb',
parameters = dict(test_param='3333')
)
Notebook: TEST.zip
Full stacktrace
Using:
papermill: 0.12.1
Jupyter-notebook: 5.3.1
Python 2.7.11 |Anaconda 2.2.0 (32-bit)| (default, Mar 4 2016, 15:18:41)
I could really use some of the recent changes in my current tasks. Any objections or particular PRs people want in this release? Was going to maybe wait for #100.
Looks like a dependency broke somewhere. Discovered it in #88, where I thought I had messed up, but a clean master now recreates the issue for me. I have an older virtualenv where the tests pass while a fresh env fails. The error is with traitlets, but that version didn't change, so it's another package interacting with it. If it doesn't resolve itself beforehand I'll look more deeply at it on Monday.
Broken pip list
$ pip freeze -l
ansiwrap==0.8.3
attrs==17.4.0
backports-abc==0.5
backports.shutil-get-terminal-size==1.0.0
bleach==2.1.2
boto3==1.5.14
botocore==1.8.28
certifi==2017.11.5
chardet==3.0.4
click==6.7
codecov==2.0.13
configparser==3.5.0
coverage==4.4.2
decorator==4.2.0
docutils==0.14
entrypoints==0.2.3
enum34==1.1.6
funcsigs==1.0.2
functools32==3.2.3.post2
futures==3.2.0
html5lib==1.0.1
idna==2.6
ipykernel==4.7.0
ipython==5.5.0
ipython-genutils==0.2.0
ipywidgets==7.1.0
Jinja2==2.10
jmespath==0.9.3
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.2.1
jupyter-console==5.2.0
jupyter-core==4.4.0
MarkupSafe==1.0
mistune==0.8.3
mock==2.0.0
nbconvert==5.3.1
nbformat==4.4.0
notebook==5.2.2
numpy==1.14.0
pandas==0.22.0
pandocfilters==1.4.2
pathlib2==2.3.0
pbr==3.1.1
pexpect==4.3.1
pickleshare==0.7.4
pluggy==0.6.0
prompt-toolkit==1.0.15
ptyprocess==0.5.2
py==1.5.2
Pygments==2.2.0
pytest==3.3.2
pytest-cov==2.5.1
python-dateutil==2.6.1
pytz==2017.3
PyYAML==3.12
pyzmq==16.0.3
qtconsole==4.3.1
requests==2.18.4
s3transfer==0.1.12
scandir==1.6
simplegeneric==0.8.1
singledispatch==3.4.0.3
six==1.11.0
terminado==0.8.1
testpath==0.3.1
textwrap3==0.9.1
tornado==4.5.3
tqdm==4.19.5
traitlets==4.3.2
urllib3==1.22
wcwidth==0.1.7
webencodings==0.5.1
widgetsnbextension==3.1.0
Working pip list
$ pip freeze -l
ansiwrap==0.8.3
astroid==1.5.3
attrs==17.3.0
backports-abc==0.5
backports.functools-lru-cache==1.4
backports.shutil-get-terminal-size==1.0.0
bleach==2.1.1
boto3==1.4.7
botocore==1.7.44
certifi==2017.11.5
chardet==3.0.4
click==6.7
codecov==2.0.10
configparser==3.5.0
coverage==4.4.2
decorator==4.1.2
docutils==0.14
entrypoints==0.2.3
enum34==1.1.6
funcsigs==1.0.2
functools32==3.2.3.post2
future==0.16.0
futures==3.1.1
html5lib==1.0b10
idna==2.6
ipykernel==4.6.1
ipython==5.5.0
ipython-genutils==0.2.0
ipywidgets==7.0.4
isort==4.2.15
Jinja2==2.10
jmespath==0.9.3
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.1.0
jupyter-console==5.2.0
jupyter-core==4.4.0
lazy-object-proxy==1.3.1
MarkupSafe==1.0
mccabe==0.6.1
mistune==0.8.1
mock==2.0.0
nbconvert==5.3.1
nbformat==4.4.0
notebook==5.2.1
numpy==1.13.3
pandas==0.21.0
pandocfilters==1.4.2
papermill==0.11.5
pathlib2==2.3.0
pbr==3.1.1
pexpect==4.3.0
pickleshare==0.7.4
pkginfo==1.4.1
pluggy==0.6.0
prompt-toolkit==1.0.15
ptyprocess==0.5.2
py==1.5.2
Pygments==2.2.0
pylint==1.7.4
pytest==3.3.1
pytest-cov==2.5.1
python-dateutil==2.6.1
pytz==2017.3
PyYAML==3.12
pyzmq==16.0.3
qtconsole==4.3.1
requests==2.18.4
requests-toolbelt==0.8.0
s3transfer==0.1.11
scandir==1.6
simplegeneric==0.8.1
singledispatch==3.4.0.3
six==1.11.0
terminado==0.7
testpath==0.3.1
textwrap3==0.9.1
tornado==4.5.2
tqdm==4.19.4
traitlets==4.3.2
twine==1.9.1
urllib3==1.22
wcwidth==0.1.7
webencodings==0.5.1
widgetsnbextension==3.0.7
wrapt==1.10.11
Instead of having to have a special directory to read from, I'd like to be able to read in a collection of notebooks like this:
pm.read_notebooks('out-*.ipynb')
I'm assuming this could probably be done with the glob module.
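That intuition seems right; a sketch under the assumption that pm.read_notebook handles single files:

import glob
import papermill as pm

notebooks = [pm.read_notebook(path) for path in sorted(glob.glob('out-*.ipynb'))]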
Passing objects representing dicts is painful in papermill right now. You have to pass each value individually and catch the corresponding names inside the notebook. But when you have a large group of parameters all together (in say a json object) it'd be very convenient to specify those associated values on the input side but not inside the notebook.
Specifically, a syntax along the lines of -p foo.bar.baz value could be passed to the command line, resulting in the parameters cell getting a dict of the form foo = {"bar": {"baz": "value"}}. For repeated paths the dict could be enriched, such that a second param of the form -p foo.bar.baz2 value2 would combine to form foo = {"bar": {"baz": "value", "baz2": "value2"}}. These parameters wouldn't need to share prefix paths, so -p foo.bar2 baz3 would augment the top foo dict instead of the nested foo -> bar dict.
This would enable passing dynamic or many-valued parameters that belong together as individual human-readable inputs, and getting a clean hierarchy of dicts on the output.
Given the merge behavior of each assignment, it could also be used to merge with existing dict variables: provide foo = {'default': 'values'} beforehand and have augmentation via the command-line pass-through.
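A sketch of the merge logic the CLI would need (hypothetical helper):

def assign_path(params, dotted_key, value):
    """Build nested dicts from a dotted parameter name, merging repeated prefixes."""
    keys = dotted_key.split('.')
    node = params
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return params

params = {}
assign_path(params, 'foo.bar.baz', 'value')
assign_path(params, 'foo.bar.baz2', 'value2')
assign_path(params, 'foo.bar2', 'baz3')
# params == {'foo': {'bar': {'baz': 'value', 'baz2': 'value2'}, 'bar2': 'baz3'}}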
From #110 (review), I wanted to keep the issue open for discussing improvements to our S3 code and testing.
Michel:
Ideally, we would want to use a library to mock boto: either moto (https://github.com/spulec/moto) or placebo (https://github.com/garnaat/placebo) is a good choice. There may be others. Idk. Any preferences?
However, you/we may want to ask yourselves/ourselves whether such a module (s3.py) should really exist. It probably makes sense to use a library such as s3fs (https://github.com/dask/s3fs). This is what pandas uses for S3, and given that this project already has a dependency on pandas, it will not add an "exotic" dependency. Not to say that tests should not be written, but if we were to spend time on it we may as well refactor the code and use s3fs.
Matt:
I generally agree with that approach. I've used https://github.com/jubos/fake-s3 in the past with a before-hook launch, but it adds Ruby as a dependency to tests, so I wouldn't recommend it here.
I haven't used s3fs before but your argument sounds solid. There may be a case to be made for adding a minimal test here without too much refactoring and then doing a bigger PR with the swap-over. I'd leave that judgement call to you, but I can probably help with s3fs changes/testing later on as well.
I get a notebook validation failure:
Notebook validation failed: {u'TEST_All': 7.2} is not valid under any of the given schemas:
{
"TEST_All": 7.2
}
The code is the most basic possible:
import papermill as pm
pm.record("TEST_All",7.2)
Using:
papermill: 0.12.1
Jupyter-notebook: 5.3.1
Python 2.7.11 |Anaconda 2.2.0 (32-bit)| (default, Mar 4 2016, 15:18:41)
Any help is appreciated. Attached notebook: TEST.zip.
There's a bit of inconsistency in how Python 2 is being used alongside Python 3. After test coverage is at ~80%, we should consider refactoring which libraries are used to best support Python 2 and Python 3, so that the code base essentially handles both with a minimum of "if 2" or "if 3" branches. Focus on selecting the best practices for migration from 2 to 3.
Papermill overwrites the contents of the parameters-tagged cell, which is non-intuitive and easy to mess up. Instead, it would be more reasonable if the tagged cell's values were treated as defaults and papermill simply appended its contents to the cell definition. Repeated names would still be overwritten, and non-papermill execution could enumerate sane values without breaking papermill execution as it works today. But it would allow for easier assignment of defaults without adding a new cell. Today all the notebooks I've seen for papermill leave a blank cell for parameters, which has to be explained to each person who sees a papermill notebook as opposed to a normal notebook, when there shouldn't be so much of a difference.
A guide for new contributors would help with sprints & all kinds of new contributors.
When not in development, an install attempts to read requirements-dev.txt (when, of course, it shouldn't be there). This currently fails all papermill installs from pip: setup.py, lines 26 to 27 at 4e950d0.
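A sketch of a guard for setup.py (assuming the dev requirements file is simply absent from the sdist):

import os

def read_reqs(fname):
    """Read a requirements file only if it actually ships with the package."""
    if not os.path.exists(fname):
        return []
    with open(fname) as f:
        return [line.strip() for line in f if line.strip() and not line.startswith('#')]

dev_reqs = read_reqs('requirements-dev.txt')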