Usage
databooks is a tool designed to make life easier for Jupyter notebook users,
especially when it comes to sharing and versioning notebooks. Jupyter notebooks are
JSON files that carry extra metadata, which is useful for Jupyter but unnecessary for
many users. When you commit a notebook you also commit all of that metadata, which can
cause issues down the line, such as git conflicts. This is where databooks comes in.
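To make the JSON point concrete, the sketch below builds a minimal notebook by hand and picks out the fields that can differ between two copies of a notebook even when the code is identical. The notebook content is made up for illustration; the field names are assumed to follow the nbformat v4 layout.

```python
import json

# A hand-written, minimal notebook (illustrative content, nbformat v4 layout).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {"kernelspec": {"name": "python3", "display_name": "Python 3"}},
    "cells": [
        {
            "cell_type": "code",
            "id": "a1b2c3",
            "execution_count": 7,
            "metadata": {"tags": ["setup"]},
            "source": "print('hello')",
            "outputs": [],
        }
    ],
}

# Serialized form: this text is what gets committed as the .ipynb file.
raw = json.dumps(notebook, indent=1)
cell = json.loads(raw)["cells"][0]

# Fields that can differ between two copies even when the source is identical.
volatile = {key: cell[key] for key in ("id", "execution_count", "metadata")}
print(volatile)
```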
The package currently has 4 main features, exposed as CLI commands:
- databooks meta: to remove unnecessary notebook metadata that can cause git conflicts
- databooks fix: to fix conflicts after they've occurred, by parsing the versions of the conflicting file and computing their difference in a Jupyter-friendly way, so that you can manually resolve them in the Jupyter interface
- databooks assert: to assert that the notebook metadata conforms to desired values, e.g. ensuring that the notebook has sequential execution counts, tags, etc.
- databooks show: to show a rich representation of the notebooks in the terminal
databooks meta
The only thing you need to pass is a path. We have sensible defaults to do the rest.
databooks meta path/to/notebooks
With that, for each notebook in the path, by default:
- It will remove execution counts
- It won't remove cell outputs
- It will remove metadata from all cells (such as cell tags or ids)
- It will remove all metadata from your notebook (including kernel information)
- It won't overwrite files for you
Nonetheless, the tool is highly configurable. You could choose to remove cell outputs by
passing --rm-outs. Or if there is some metadata you'd like to keep, such as cell tags,
you can do so by passing --cell-meta-keep tags. Also, if you do want to save the clean
notebook you can either pass a prefix (--prefix ...) or a suffix (--suffix ...) that
will be added before writing the file, or you can simply overwrite the source file
(--overwrite).
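As a rough illustration of what this cleaning amounts to, here is a minimal sketch that operates on a notebook loaded as a plain dict. It is not how databooks is implemented: the function `clean_notebook` and its parameters `rm_outs` and `cell_meta_keep` are hypothetical names chosen to mirror the CLI options described above.

```python
import copy


def clean_notebook(nb, rm_outs=False, cell_meta_keep=()):
    """Return a cleaned copy of a notebook dict; the input is left untouched."""
    clean = copy.deepcopy(nb)
    clean["metadata"] = {}  # drop notebook-level metadata (kernel info, etc.)
    for cell in clean["cells"]:
        # Keep only the cell metadata keys that were explicitly requested.
        cell["metadata"] = {
            key: value
            for key, value in cell.get("metadata", {}).items()
            if key in cell_meta_keep
        }
        if cell["cell_type"] == "code":
            cell["execution_count"] = None  # remove execution counts
            if rm_outs:
                cell["outputs"] = []  # only drop outputs when asked to
    return clean


# Illustrative notebook, made up for this example.
nb = {
    "metadata": {"kernelspec": {"name": "python3"}},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {"tags": ["plot"], "collapsed": True},
            "source": "1 + 1",
            "outputs": [{"output_type": "execute_result"}],
        }
    ],
}

cleaned = clean_notebook(nb, cell_meta_keep=("tags",))
print(cleaned["cells"][0]["metadata"])  # {'tags': ['plot']}
```

Returning a copy instead of mutating the input mirrors the "won't overwrite files for you" default: nothing changes unless you explicitly write the result back.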
databooks fix
In databooks meta we try to avoid git conflicts. In databooks fix we fix conflicts after
they have occurred. Similar to databooks meta ..., the only required argument here
is a path.
databooks fix path/to/notebooks
For each notebook in the path that has git conflicts:
- It will keep the metadata from the notebook in HEAD
- For the conflicting cells, it will wrap some special cells around the differences, like in normal git conflicts
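The idea of wrapping special cells around the differences can be sketched with the standard library's `difflib`. This is only an illustration under stated assumptions: the helper names (`cell_key`, `marker`, `wrap_conflicts`) and the marker text are hypothetical, and the actual cells that databooks inserts may look different.

```python
import difflib
import json


def cell_key(cell):
    """Hashable fingerprint of a cell dict, so difflib can compare cells."""
    return json.dumps(cell, sort_keys=True)


def marker(text):
    """Hypothetical marker cell; the real marker content may differ."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def wrap_conflicts(head_cells, base_cells):
    """Merge two cell lists, surrounding differing runs with marker cells."""
    matcher = difflib.SequenceMatcher(
        a=[cell_key(c) for c in head_cells],
        b=[cell_key(c) for c in base_cells],
        autojunk=False,
    )
    merged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(head_cells[i1:i2])  # identical in both versions
        else:
            merged.append(marker("<<<<<<< HEAD"))
            merged.extend(head_cells[i1:i2])
            merged.append(marker("======="))
            merged.extend(base_cells[j1:j2])
            merged.append(marker(">>>>>>> BASE"))
    return merged


shared = {"cell_type": "code", "metadata": {}, "source": "import pandas as pd"}
ours = {"cell_type": "code", "metadata": {}, "source": "df = pd.read_csv('a.csv')"}
theirs = {"cell_type": "code", "metadata": {}, "source": "df = pd.read_csv('b.csv')"}

merged = wrap_conflicts([shared, ours], [shared, theirs])
for cell in merged:
    print(cell["source"])
```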
Similarly to databooks meta, the default behavior can be changed by passing a
configuration pyproject.toml file or specifying the CLI arguments. You could, for
instance, keep the metadata from the notebook in BASE (as opposed to HEAD). If you
know you only care about the notebook cells in HEAD or BASE, then you could pass
--cells-head or --no-cells-head and not worry about fixing conflicted cells in Jupyter.
You can also pass a special --cell-fields-ignore parameter, which removes the given
cell fields from both versions of the conflicting notebook before comparing them. This
matters because, depending on your Jupyter version, each cell may have an id field that
is unique to that cell. In that case all cells would be considered different, even when
they have the same source and outputs, because their ids differ. By removing id and
execution_count (we'll do this by default) we only compare the actual code and outputs
to determine whether the cells have changed.
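A small sketch shows why ignoring these fields matters. The helper `comparable` is a hypothetical name used for illustration and the cells are plain dicts with made-up content; databooks itself works with its own notebook models.

```python
def comparable(cell, ignore=("id", "execution_count")):
    """Copy of a cell dict without the fields that should not affect equality."""
    return {key: value for key, value in cell.items() if key not in ignore}


# Same code and outputs, but different id and execution count.
head_cell = {
    "cell_type": "code",
    "id": "aaa111",
    "execution_count": 2,
    "source": "x = 1",
    "outputs": [],
}
base_cell = {
    "cell_type": "code",
    "id": "bbb222",
    "execution_count": 5,
    "source": "x = 1",
    "outputs": [],
}

print(head_cell == base_cell)  # False
print(comparable(head_cell) == comparable(base_cell))  # True
```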
Note
If a notebook with conflicts (thus not valid JSON/Jupyter) is committed to the repo,
databooks fix will not consider the file as something to fix - same behavior as git.
Fun fact
"Fix" may be a misnomer: the "broken" JSON in the notebook is not actually fixed - instead we compare the versions of the notebook that caused the conflict.
databooks assert
In databooks meta, we remove unwanted metadata. But sometimes we may still want some
metadata (such as cell tags), or more than that, we may want the metadata to have
certain values. This is where databooks assert comes in. We can use this command to
ensure that the metadata is present and has the desired values.
databooks assert is akin to (and inspired by) Python's assert.
Therefore, the user must pass a path and a string with the expression to be evaluated
for each notebook.
databooks assert path/to/notebooks --expr "python expression to assert on notebooks"
This can be used, for example, to make sure that the notebook cells were executed in
order. Or that we have markdown cells, or to set a maximum number of cells for each
notebook.
Evidently, there are some limitations to the expressions that a user can pass.
Variables in scope:
All the variables available in your assert expressions are subclasses of Pydantic
models. Therefore, you can use these
models as regular Python objects (e.g. to access the cell types, one could write
[cell.cell_type for cell in nb.cells]). For convenience, here is a list of the
currently supported variables that can be used in assert expressions:
- nb: Jupyter notebook found in path
- raw_cells: notebook "raw" cells
- md_cells: notebook markdown cells
- code_cells: notebook code cells
- exec_cells: executed notebook code cells
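To give a feel for how these variables relate to each other, the sketch below uses small stand-in dataclasses (`Cell` and `Notebook` are hypothetical, not the databooks models) and derives the other variables from `nb`. Treating "executed" as "has an execution count" is an assumption made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cell:
    """Stand-in for a notebook cell (not the databooks model)."""
    cell_type: str
    source: str = ""
    execution_count: Optional[int] = None


@dataclass
class Notebook:
    """Stand-in for a notebook (not the databooks model)."""
    cells: List[Cell] = field(default_factory=list)


nb = Notebook(
    cells=[
        Cell("markdown", "# Title"),
        Cell("code", "x = 1", execution_count=1),
        Cell("code", "x + 1"),  # never executed
        Cell("raw", "plain text"),
    ]
)

raw_cells = [c for c in nb.cells if c.cell_type == "raw"]
md_cells = [c for c in nb.cells if c.cell_type == "markdown"]
code_cells = [c for c in nb.cells if c.cell_type == "code"]
# Assumption: an executed cell is a code cell with an execution count.
exec_cells = [c for c in code_cells if c.execution_count is not None]

print([cell.cell_type for cell in nb.cells])  # ['markdown', 'code', 'code', 'raw']
print(len(code_cells), len(exec_cells))  # 2 1
```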
Built-in functions:
These limitations are designed to allow anyone to use databooks assert safely. This is
because we use Python's built-in eval,
and eval is really dangerous.
To mitigate that (and for your safety), we parse the string and only allow a
limited set of operations. Check out our tests
to see what is and isn't allowed, and see the source
to see how that happens!
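The general technique of parsing an expression and rejecting anything outside an allowlist can be sketched with Python's `ast` module. This is a simplified illustration and not the databooks implementation: `safe_eval`, `ALLOWED_NODES` and the particular set of permitted nodes are all assumptions made for this example.

```python
import ast

# Assumption: a deliberately small allowlist of syntax nodes for this sketch.
ALLOWED_NODES = (
    ast.Expression, ast.Call, ast.Name, ast.Load, ast.Store, ast.Constant,
    ast.Attribute, ast.Compare, ast.Eq, ast.Lt, ast.Gt, ast.BinOp, ast.Add,
    ast.List, ast.ListComp, ast.GeneratorExp, ast.comprehension,
)


def safe_eval(expr, variables, allowed_builtins):
    """Evaluate `expr` only if every syntax node and name is allowlisted."""
    tree = ast.parse(expr, mode="eval")
    # Names bound inside comprehensions (e.g. `c` in `for c in cells`).
    local_names = {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
    }
    known = set(variables) | set(allowed_builtins) | local_names
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"Disallowed syntax: {type(node).__name__}")
        if isinstance(node, ast.Name) and node.id not in known:
            raise ValueError(f"Unknown name: {node.id}")
        if isinstance(node, ast.Attribute) and node.attr.startswith("_"):
            raise ValueError(f"Private attribute access: {node.attr}")
    code = compile(tree, "<expr>", "eval")
    # A single namespace keeps names visible inside comprehensions;
    # emptying __builtins__ hides everything that was not allowlisted.
    namespace = {"__builtins__": {}, **allowed_builtins, **variables}
    return eval(code, namespace)


print(safe_eval("len(cells) < 64", {"cells": [1, 2, 3]}, {"len": len}))  # True

try:
    safe_eval("__import__('os')", {}, {})
except ValueError as error:
    print("rejected:", error)
```

Validating the parsed tree before calling `eval` means a disallowed expression is rejected without any part of it being executed.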
To avoid repetitive typing, you can also configure the tool so that you can
omit the source string. An even more powerful method is to combine it with pre-commit
or CI/CD. Check out the rest of the "Usage" section for more info!
Recipes
Writing out the expressions to assert can be repetitive and tedious, and it can be hard to work out how to express an assertion about notebooks. With that in mind, we also include "user recipes". These recipes store useful expressions to check, serving both as shorthand for those expressions and as inspiration for your own recipes! Feel free to submit a PR with your recipe, or open an issue if you're having trouble coming up with a recipe for your goal.
has-tags
- Description: Assert that there is at least one cell with tags.
- Source: any(getattr(cell.metadata, 'tags', []) for cell in nb.cells)
has-tags-code
- Description: Assert that there is at least one code cell with tags.
- Source: any(getattr(cell.metadata, 'tags', []) for cell in code_cells)
max-cells
- Description: Assert that there are fewer than 64 cells in the notebook.
- Source: len(nb.cells) < 64
no-empty-code
- Description: Assert that there are no empty code cells in the notebook.
- Source: all(cell.source for cell in code_cells)
seq-exec
- Description: Assert that the executed code cells were executed sequentially (similar effect to when you 'restart kernel and run all cells').
- Source: [c.execution_count for c in exec_cells] == list(range(1, len(exec_cells) + 1))
seq-increase
- Description: Assert that the executed code cells were executed in increasing order.
- Source: [c.execution_count for c in exec_cells] == sorted([c.execution_count for c in exec_cells])
startswith-md
- Description: Assert that the first cell in notebook is a markdown cell.
- Source: nb.cells[0].cell_type == 'markdown'
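The difference between seq-exec and seq-increase is easy to miss, so here is a small sketch that evaluates both recipe expressions in plain Python. The cells are stand-ins built with `SimpleNamespace` (the helper `code_cell` is hypothetical), not real databooks objects.

```python
from types import SimpleNamespace


def code_cell(count):
    """Stand-in for an executed code cell with the given execution count."""
    return SimpleNamespace(cell_type="code", execution_count=count)


# Executed in increasing order, but with a gap (e.g. a cell was run and deleted).
exec_cells = [code_cell(1), code_cell(3), code_cell(4)]

counts = [c.execution_count for c in exec_cells]

# seq-exec: counts must be exactly 1, 2, 3, ...
seq_exec = counts == list(range(1, len(exec_cells) + 1))
# seq-increase: counts only need to be in increasing order.
seq_increase = counts == sorted(counts)

print(seq_exec, seq_increase)  # False True
```

A notebook can therefore pass seq-increase while failing seq-exec, which makes seq-exec the stricter of the two checks.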
Tip
If your use case is more complex and cannot be translated into a single expression,
you can always download databooks and use it as a part of your script!
databooks show
Sometimes we may want to quickly visualize the notebook. However, it can be a bit cumbersome to start the Jupyter server, navigate to the file that we'd like to inspect and open it. Moreover, by opening the file in Jupyter we may already modify the notebook metadata.
This is where databooks show comes in. Simply specify the
path(s) to the notebook(s) to visualize them in the terminal. You can optionally pass
pager to open a scrollable window that won't actually print the notebook contents in
the terminal.