Clean code in Jupyter notebooks

This post is inspired by a video of a talk from the 2017 PyData conference in Berlin. Here I focus on several of its main points.

Notebook structure

☝

How big should a notebook file be?

Hypothesis — Data — Interpretation

☝

Keep your notebooks small!

(4-10 cells each)

How?

I found this part particularly surprising, because my previous notebooks accompanying research papers have been huge. But after watching the speaker's talk, I accepted this viewpoint.

Example: a fat notebook is split into several files in one directory.

Cache and images go into separate folders.

☝

Use shared libraries.

Typical structure of the ipynb file.

Imports

Get Data

Transform Data

Modelling

Visualisation

Making sense of the data
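The six stages above can be sketched as one small notebook, shown here as a plain script with one comment per cell. Everything in it is an assumption made for illustration: the inline CSV, the `fit_line` helper and the text-based "plot" are hypothetical stand-ins for a real data source, model and plotting library.

```python
# Cell 1: Imports
import csv
import io
import statistics

# Cell 2: Get data (hypothetical inline CSV standing in for a real source)
RAW_CSV = """x,y
1,2.1
2,3.9
3,6.2
4,7.8
"""
rows = list(csv.DictReader(io.StringIO(RAW_CSV)))

# Cell 3: Transform data (strings to floats)
xs = [float(row["x"]) for row in rows]
ys = [float(row["y"]) for row in rows]


# Cell 4: Modelling (the model stays visible in the notebook)
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(xs, ys)

# Cell 5: Visualisation (a text bar chart keeps the sketch dependency-free)
for x, y in zip(xs, ys):
    print(f"x={x:.0f} " + "#" * round(y))

# Cell 6: Making sense of the data
print(f"y grows by about {slope:.2f} per unit of x")
```

With six short cells this stays inside the 4-10 cell budget mentioned earlier.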

☝

Don't hide the model source code inside a module.

Because a reader wants to understand how the model works.

This advice is aimed more at data engineers. Remember that most other function definitions should still be moved into a module.

☝

Duplication is better than wrong abstraction.

Example: adding boolean parameters to a function signature to cover different cases is bad because it makes the code hard to read.
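A minimal sketch of what this can look like. The function names and the data are hypothetical and not taken from the talk; the point is only to contrast a flag-driven function with two small, slightly repetitive ones.

```python
# Wrong abstraction: boolean flags pile up and call sites become cryptic.
def summarise(values, normalise=False, drop_negative=False):
    if drop_negative:
        values = [v for v in values if v >= 0]
    total = sum(values)
    if normalise:
        return total / len(values)
    return total


# Some duplication, but each function name says what it does.
def total_of(values):
    return sum(values)


def mean_of_non_negative(values):
    kept = [v for v in values if v >= 0]
    return sum(kept) / len(kept)


data = [4, -2, 6]
flagged = summarise(data, True, True)  # what do the two Trues mean?
named = mean_of_non_negative(data)     # readable without opening the function
```

Both calls compute the same value, but only the second one can be understood at the call site.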

My own commentary: I would absolutely avoid duplication in my code because it creates room for future errors. Deduplication (which is very close to refactoring) should be done with care and thought. Most likely, the distinct parts of the code can be isolated and moved into separate functions. Refactoring is a hard topic in general. Check out this website.

☝

How big should one cell be?

One "idea — execution — output" triplet per cell.

Stability of your code

☝

Why write tests directly in notebooks?

You encounter functions in notebooks, so it is useful for the reader to see what these functions can do. Tests are examples of function usage.
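As a sketch of the idea, here is a hypothetical helper followed by a cell of plain `assert` statements. The `normalise` function is an assumption made up for this illustration; the asserts both check it and show the reader how it is meant to be called.

```python
def normalise(values):
    """Scale values linearly so that they span the range [0, 1]."""
    low, high = min(values), max(values)
    if low == high:
        # A constant input has no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# The next cell doubles as documentation: each assert is a usage example.
assert normalise([2, 4, 6]) == [0.0, 0.5, 1.0]
assert normalise([-1, 1]) == [0.0, 1.0]
assert normalise([5, 5]) == [0.0, 0.0]
```

If one of the asserts fails, "Restart and run all" stops at that cell, so a broken helper is noticed immediately.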

☝

Restart and run all.

Make sure that at any point you can re-run all the code from scratch with the same results.

But in many cases it takes a lot of time to load and manipulate the data.

Re-running the whole notebook should be fast.

So you should be using some caching mechanism.

You definitely need to invest in a caching mechanism!
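One very simple way to do this is to pickle the result of a slow step into a cache folder and read it back on later runs. The sketch below is an assumption of mine, not the mechanism from the talk: the `cached` helper, the `load_data` function and the cache key are all hypothetical, and there is no cache invalidation beyond deleting the file.

```python
import pickle
import tempfile
import time
from pathlib import Path


def cached(name, compute, cache_dir=Path("cache")):
    """Return the pickled result stored under `name`, computing it on a miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{name}.pkl"
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []  # records how often the slow step really runs


def load_data():
    """Hypothetical stand-in for a slow loading step."""
    calls.append(1)
    time.sleep(0.1)
    return [x * x for x in range(5)]


# A notebook would use a fixed folder such as "cache"; a temporary
# directory is used here only to keep the sketch side-effect free.
demo_dir = tempfile.mkdtemp()
first = cached("squares", load_data, cache_dir=demo_dir)   # computes and stores
second = cached("squares", load_data, cache_dir=demo_dir)  # reads from disk
```

Deleting the cache folder forces a full recomputation, which is a useful check that the notebook still runs from scratch.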

Anyway, clean up after yourself. You will have prototyping cells and output cells, so make sure that these two don't interfere with one another.

Cannot agree more. My next post is dedicated to exploring one of the very efficient caching mechanisms that can be used straight out of the box in IPython.

☝

Using a simple UI, you can turn a Notebook into a Product.

Other remarks

🤔

Version control in ipynb is not a completely solved problem :(