Notebooks Academy: Write Production-Ready Code From Jupyter [DRAFT]
The course teaches how to use Jupyter to develop maintainable and production-ready code.
Please comment on this issue with your feedback! Are we missing any topics?
Lessons
- Why? The prototype, then refactor problem
- Writing clean notebooks
- Version control
- Hidden state
- Modularization
- Refactoring legacy pipelines
- Building data pipelines
- Integration testing
- Debugging
- Running pipelines in the cloud
- Notebook meta-analysis
- Using SQL in Jupyter
- Deployment
Format
13 video lessons, 20-30 minutes each.
I'm thinking of making this a project-based course, so by the end of it, students have a pipeline up and running (this dataset looks interesting).
Pre-requisites
- Experience working with standard open-source tools: Jupyter, pandas, and scikit-learn
Syllabus
1. Why? The prototype, then refactor problem
Introduction to the problem: developing projects in a single notebook causes a lot of trouble, since such notebooks are hard to maintain, test, and review. However, by following some best practices and with the help of some open-source tools, we can implement a workflow that allows us to go from Jupyter to production instantly.
Related material
2. Writing clean notebooks
This lesson shows best practices for writing clean notebooks (it takes most of its content from the blog post).
Related material
3. Version control
It's challenging to version-control Jupyter notebooks because the .ipynb format is JSON. This lesson shows how to change the underlying format to .py and still interact with those files as notebooks.
Notes
- Show other alternatives such as nbdime and the Jupyterlab-git plugin
- Discuss jupytext's pairing feature to store the output in a separate file
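As a minimal sketch, the conversion could be done with jupytext's Python API (assuming jupytext is installed; file names are illustrative):

```python
import jupytext

# Read the notebook and write it back as a .py file in the "percent" format,
# which represents cells as "# %%" comments and diffs cleanly in git.
nb = jupytext.read("analysis.ipynb")
jupytext.write(nb, "analysis.py", fmt="py:percent")
```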
Related material
4. Hidden state
Since notebooks are developed interactively, excessive editing often leads to broken notebooks. This lesson introduces notebook smoke testing: we execute the notebooks with a sample of the data on each git push using papermill. It also shows how to set up GitHub Actions.
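A minimal sketch of the smoke test step with papermill (the notebook and its sample_frac parameter are illustrative; the notebook would declare it in a parameters cell):

```python
import papermill as pm

# Execute the notebook with a small data sample so the smoke test runs fast;
# papermill injects the parameters below into the notebook's parameters cell.
pm.execute_notebook(
    "analysis.ipynb",
    "output/analysis.ipynb",
    parameters={"sample_frac": 0.01},
)
```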
Related material
5. Modularization
Modularizing code is critical to developing maintainable and testable software. This lesson shows how to create a package to modularize our work, define functions in Python modules, and unit test those functions using pytest.
Notes
- Show how IPython auto-reloading works
- Cover pytest's basic features: fixtures, parametrization, testing exceptions
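For instance, a minimal sketch of a function extracted into a module plus its unit tests (module and function names are illustrative):

```python
# my_project/features.py
def normalize(values):
    """Scale values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("cannot normalize a constant sequence")
    return [(v - lo) / (hi - lo) for v in values]


# tests/test_features.py
import pytest
from my_project.features import normalize

@pytest.mark.parametrize("values, expected", [
    ([0, 5, 10], [0.0, 0.5, 1.0]),
    ([2, 4], [0.0, 1.0]),
])
def test_normalize(values, expected):
    assert normalize(values) == expected

def test_normalize_constant_raises():
    with pytest.raises(ValueError):
        normalize([3, 3, 3])
```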
Related material
6. Refactoring legacy pipelines
A lot of existing pipelines live in notebooks. This lesson shows how to refactor a monolithic notebook-based project into a data pipeline.
Related material
7. Building data pipelines
Long notebooks are hard to manage because they involve many variables and a lot of code. Breaking down our analysis into multiple steps allows us to collaborate better and test our notebooks.
Notes
- Why is structure important?
- Mention advantages of building a data pipeline: can do integration testing, run tasks in parallel
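A tool-agnostic sketch of the idea (file names are illustrative): each step reads its inputs from disk and writes its outputs (its products), so steps can be developed, run, and tested independently:

```python
import pandas as pd

def load(product="output/raw.parquet"):
    # First step: read the raw data and persist it as this step's product.
    df = pd.read_csv("data/input.csv")
    df.to_parquet(product)

def clean(upstream="output/raw.parquet", product="output/clean.parquet"):
    # Second step: consume the previous step's product and write a new one.
    pd.read_parquet(upstream).dropna().to_parquet(product)
```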
Related material
8. Integration testing
Garbage in, garbage out. Testing for data quality at each stage of our pipeline ensures that we meet a minimum level of data quality. This lesson shows how to do integration testing after executing each notebook.
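A minimal sketch of such a check, run after the corresponding notebook executes (the path and column names are illustrative):

```python
import pandas as pd

def test_clean_output():
    # Validate the product of the "clean" step before downstream tasks use it.
    df = pd.read_parquet("output/clean.parquet")
    assert not df.empty
    assert not df["customer_id"].duplicated().any()
    assert df["age"].between(0, 120).all()
```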
Related material
9. Debugging
Debugging data pipelines is challenging; however, having a robust unit and integration test suite helps us debug more effectively. This lesson shows how to debug data pipelines by conducting root cause analysis using pytest and the Python debugger.
Notes
- Show how to debug failing tests
- Cover Jupyter's visual debugger
- Show how to use ipdb
- Debugging with IPython.embed()
- Debugging code with breakpoints
- The debuglater feature in Ploomber
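As a minimal sketch, Python's built-in breakpoint() pauses execution at the failing spot; IPython.embed() works similarly but opens a full IPython shell instead (the function below is illustrative):

```python
def transform(df):
    if df.isna().any().any():
        # Pause here and inspect local variables interactively; set
        # PYTHONBREAKPOINT=ipdb.set_trace to get ipdb instead of pdb.
        breakpoint()
    return df.dropna()
```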
Related material
10. Running pipelines in the cloud
When working with large datasets, we may want to run our pipeline in the cloud. This lesson shows how to use Ploomber to run a pipeline on AWS and Kubernetes and retrieve the results.
11. Notebook meta-analysis
The .ipynb format is self-contained: it stores both code and outputs, and outputs can be anything from text and tables to images. This lesson shows how to analyze the contents of a Jupyter notebook, extracting its outputs to evaluate and compare model experiments.
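A minimal sketch using nbformat to extract the outputs of an executed notebook (the path is illustrative):

```python
import nbformat

nb = nbformat.read("output/train.ipynb", as_version=4)

# Collect the outputs of every code cell (e.g., printed metrics) so we can
# compare runs without opening each notebook by hand.
for cell in nb.cells:
    for output in cell.get("outputs", []):
        if output["output_type"] == "stream":
            print(output["text"])
        elif "data" in output:
            print(output["data"].get("text/plain", ""))
```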
Related material
12. Using SQL in Jupyter
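A minimal sketch of running SQL from a notebook cell, using the jupysql extension with an in-memory DuckDB database (one option among several; assumes jupysql and duckdb-engine are installed):

```python
# Run inside a Jupyter cell: load the SQL magic, connect to an in-memory
# DuckDB database, and query it without leaving the notebook.
%load_ext sql
%sql duckdb://
%sql SELECT 42 AS answer
```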
Related material
13. Deployment
This lesson shows how to generate a deployment artifact using a previously trained model to serve predictions.
Notes
- Show how to use Ploomber's pipeline composition capabilities to create a serving pipeline
- Discuss the importance of dependency locking
- Generating a source distribution
- When to use Docker (and when not to)
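A minimal sketch of the serving side, assuming the training pipeline saved a pickled scikit-learn model (the path and interface are illustrative):

```python
import pickle

# Load the artifact produced by the training pipeline.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

def predict(features):
    """Return a prediction for a single observation (a list of feature values)."""
    return model.predict([features])[0]
```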
Related material
Basic materials
I may record these additional short lessons to cover the basics of dependency management and virtual environments.
Lessons to be considered
- Profiling notebooks (memory, CPU, GPU)
- Report generation (nbconvert, quarto)
- Dashboards (Voilà)
- Technical blogging (Jupyblog)
Optional lessons
Additional lessons I may record.