Clearing Output Data in Jupyter Notebooks using a Bash Script Hook
In my previous article, I described why you may need to clear output data in your Jupyter notebooks. At that time I was working on a pre-sale project for AI Superior, and we needed a quick solution. That is why I used the Python-based pre-commit framework to create a pipeline that clears output data. However, this approach requires installing an additional Python package on your system, which may not always be possible. So I decided back then that I would eventually implement the same approach as a pure Bash script. Recently, I found some spare time and dug deeper into this topic. As a result, I developed a git pre-commit hook that clears Jupyter output cells, and wrote this article to describe it. If you prefer the ‘show me the code’ approach and do not want to read the article, you can find the final script here.
Testbed
As in the previous article, I use the same testbed to exemplify my approach. Using this link, you can download a test dataset and a proof-of-concept notebook.
Alternatively, you can create these files manually. Create a file dataset/private_dataset.csv and copy-paste the following data:
private_dataset.csv
first_column,second_column
0,10
1,11
2,12
3,13
4,14
5,15
6,16
7,17
8,18
9,19
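If you prefer not to copy-paste, the same file can be generated from the command line. This is a minimal sketch that assumes you run it from the root of the project directory; it simply reproduces the ten rows listed above.

```shell
# Create the dataset directory if it does not exist yet
mkdir -p dataset

# Write the header and the ten rows shown above (second column = first + 10)
{
    echo "first_column,second_column"
    for i in {0..9}; do
        echo "${i},$((i + 10))"
    done
} > dataset/private_dataset.csv
```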
After that, create a simple Jupyter notebook to analyze this dataset. For instance, you can use the following one as a first draft:
Data Analysis Notebook
import pandas as pd
df = pd.read_csv('dataset/private_dataset.csv')
df.head()
| | first_column | second_column |
|---|---|---|
| 0 | 0 | 10 |
| 1 | 1 | 11 |
| 2 | 2 | 12 |
| 3 | 3 | 13 |
| 4 | 4 | 14 |
As before, the .gitignore file looks as follows:
.gitignore
# ignoring auxiliary directories and files
.ipynb_checkpoints
.venv
.directory
# ignoring data in the dataset directory
dataset/
If you cloned an existing repository, open the directory with the git hook scripts (.git/hooks). If you are starting a new repository, initialize it first:
$ git init
Now, we are ready to write a hook to clear Jupyter notebook output cells.
Solution
Go to the .git/hooks directory, find the pre-commit.sample script, copy it, and rename the copy to pre-commit. Now, open it and replace its content with the following:
#!/bin/bash
#
# This pre-commit hook clears output cells in Jupyter notebooks.
# setting bash strict mode
set -o errexit
set -o pipefail
set -o nounset
IFS=$'\n\t'
# functions
function elementIn () {
    local elem="$1"   # Save first argument in a variable
    shift             # Shift all arguments to the left (original $1 gets lost)
    local arr=("$@")  # Rebuild the array with rest of arguments
    # -F compares the file name as a fixed string, not as a regular expression
    if printf '%s\n' "${arr[@]}" | grep -q -F --line-regexp -- "${elem}"; then
        return 0
    fi
    return 1
}
# what commit we should compare against (initial or HEAD)
if git rev-parse --verify HEAD >/dev/null 2>&1
then
    against=HEAD
else
    # Initial commit: diff against an empty tree object
    against=$(git hash-object -t tree /dev/null)
fi
IPYNB_FILES=()
while IFS='' read -r line; do
    IPYNB_FILES+=( "$line" )
# finding all staged *.ipynb files added, copied, modified or renamed since last commit
# (the pattern is quoted so that the shell does not strip the backslash before grep sees it)
done < <(git diff-index --name-only --cached --diff-filter=ACMR "${against}" -- | grep -i '\.ipynb$')
for FILE in "${IPYNB_FILES[@]}"; do
    echo "Processing file: '$FILE'"
    # you may need to provide the full path to 'jupyter' executable
    # if it is not in the path
    jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace "$FILE"
    # echo "Current EXIT_CODE value: $?"
done
MODIFIED_FILES=()
while IFS='' read -r line; do
    MODIFIED_FILES+=( "$line" )
# list all modified files
done < <(git ls-files --modified --exclude-standard)
AMOUNT=0
for mfile in "${MODIFIED_FILES[@]}"; do
    if elementIn "$mfile" "${IPYNB_FILES[@]}"; then
        echo "'$mfile' has been modified by pre-commit hook!"
        AMOUNT=$((AMOUNT+1))
    fi
done
if [[ $AMOUNT -eq 0 ]]; then
    echo "No ipynb files were modified!"
    exit 0
else
    echo "Pre-commit hook modified $AMOUNT ipynb files!"
    exit 1
fi
Now, when you commit, this hook script looks for the staged *.ipynb files and clears their output cells.
This script requires some explanation. First, to discover which files have changed, it determines which commit (stored in the against variable) should be used as a reference. This is usually HEAD, unless this is the first commit in the repository, in which case the script compares against an empty tree object. Then, the script computes the difference (git diff-index) between the staging area (--cached) and the snapshot identified by against. We select the names (--name-only) of all files ending with .ipynb (grep -i '\.ipynb$') that have been added, copied, modified, or renamed (--diff-filter=ACMR). Each such file is processed with the jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace command, which clears all output cells in the notebook.
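To see the file selection step in isolation, you can try it in a throwaway repository. The sketch below is only an illustration: the temporary directory and the file names analysis.ipynb and notes_ipynb are invented for the example, and the files are empty placeholders rather than real notebooks. Since the repository has no commits yet, the reference falls back to the empty tree object.

```shell
# Throwaway repository; all file names here are hypothetical
repo=$(mktemp -d)
cd "$repo" || exit 1
git init --quiet

# One notebook-like file and one decoy whose name merely ends in "ipynb"
printf '{}\n' > analysis.ipynb
printf 'notes\n' > notes_ipynb
git add analysis.ipynb notes_ipynb

# No commits yet, so compare against the empty tree object
if git rev-parse --verify HEAD >/dev/null 2>&1; then
    against=HEAD
else
    against=$(git hash-object -t tree /dev/null)
fi

# Staged files that were added, copied, modified or renamed,
# filtered by a literal ".ipynb" suffix (the quotes keep the backslash intact)
staged_notebooks=$(git diff-index --name-only --cached --diff-filter=ACMR "${against}" -- | grep -i '\.ipynb$')
echo "Staged notebooks: ${staged_notebooks}"
```

Only analysis.ipynb should be selected here. If the pattern were passed to grep without quotes, the shell would drop the backslash, the dot would match any character, and the decoy notes_ipynb would be selected as well.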
Then, we list (git ls-files) all modified (--modified) files, excluding the ignored ones (--exclude-standard). If a modified file is among those processed by the jupyter nbconvert command, the file did contain uncleared output cells. If such files are found, we exit with exit code 1. This interrupts the commit, giving us a chance to add the modified files to the staging area.
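One detail worth checking in a membership test built on grep --line-regexp is whether the file name is treated as a regular expression or as a fixed string. The sketch below compares the two; the helper names in_list_regex and in_list_fixed and the file names are hypothetical and exist only for this illustration.

```shell
# Hypothetical helper: the first argument is used as a regular expression
in_list_regex() {
    local elem="$1"
    shift
    printf '%s\n' "$@" | grep -q --line-regexp -- "${elem}"
}

# Hypothetical helper: the first argument is compared as a fixed string (-F)
in_list_fixed() {
    local elem="$1"
    shift
    printf '%s\n' "$@" | grep -q -F --line-regexp -- "${elem}"
}

processed=("data[1].ipynb" "report.ipynb")

# As a regex, "[1]" is a bracket expression, so the name does not match itself
if in_list_regex "data[1].ipynb" "${processed[@]}"; then
    echo "regex: found"
else
    echo "regex: not found"
fi

# As a fixed string, the name is compared character by character
if in_list_fixed "data[1].ipynb" "${processed[@]}"; then
    echo "fixed: found"
else
    echo "fixed: not found"
fi
```

With regular-expression matching, a notebook whose name contains characters such as brackets can go undetected, while a dot in the pattern can match characters it should not. Fixed-string matching avoids both cases.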
If no files were modified, we exit with exit code 0 and the commit proceeds.
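The effect of the exit code can be observed without Jupyter by using a stand-in hook. The following sketch is an assumption-laden illustration, not the hook from this article: it creates a temporary repository, installs a dummy pre-commit script that first exits with 1 and then with 0, and passes a made-up user name and e-mail so that the commit does not depend on your global git configuration.

```shell
# Throwaway repository with a stand-in hook; names and identity are hypothetical
repo=$(mktemp -d)
cd "$repo" || exit 1
git init --quiet
mkdir -p .git/hooks

echo "data" > file.txt
git add file.txt

# A dummy hook that always fails, mimicking the "files were modified" branch
printf '#!/bin/bash\necho "hook: rejecting commit"\nexit 1\n' > .git/hooks/pre-commit
chmod +x .git/hooks/pre-commit

if git -c user.name="Example" -c user.email="example@example.com" \
        commit --quiet -m "first attempt"; then
    first_attempt="accepted"
else
    first_attempt="rejected"
fi
echo "First attempt: ${first_attempt}"

# A dummy hook that always succeeds, mimicking the "nothing to clear" branch
printf '#!/bin/bash\nexit 0\n' > .git/hooks/pre-commit

if git -c user.name="Example" -c user.email="example@example.com" \
        commit --quiet -m "second attempt"; then
    second_attempt="accepted"
else
    second_attempt="rejected"
fi
echo "Second attempt: ${second_attempt}"
```

In day-to-day use the same cycle applies to the real hook: when a commit is interrupted, add the cleaned notebooks to the staging area with git add and run git commit again.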