Clearing Output Data in Jupyter Notebooks using a Bash Script Hook

In my previous article, I described why you may need to clear output data in your Jupyter notebooks. At the time, I was participating in a pre-sale project for AI Superior, and we needed a quick solution. That is why I used the Python-based pre-commit framework to create a pipeline that clears output data. However, this approach requires you to install an additional Python package on your system, which might not always be possible. Therefore, I decided back then that I would implement this approach as a pure Bash script. Recently, I found some spare time and decided to dig deeper into this topic. As a result of my explorations, I developed a git pre-commit hook that clears Jupyter output cells and wrote this article describing it. If you are a fan of the ‘show me the code’ approach and do not want to read the article, you can find the final script here.

Update 14/07/2021: Thanks to @davdis, I have fixed a bug that caused the number of modified notebooks to be reported incorrectly.

Testbed

As in the previous article, I use the same testbed to exemplify my approach. Using this link, you can download a test dataset and a proof-of-concept notebook.

Alternatively, you can create these files manually. Create a file dataset/private_dataset.csv and copy-paste the following data:

private_dataset.csv

first_column,second_column
0,10
1,11
2,12
3,13
4,14
5,15
6,16
7,17
8,18
9,19
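
If you prefer not to copy and paste, the same file can be generated from the shell. This is only a convenience sketch that reproduces the ten rows listed above; it assumes you run it from the root of the project directory.

```shell
#!/bin/bash
# Sketch: generate dataset/private_dataset.csv with the ten rows shown above.
mkdir -p dataset
{
  echo "first_column,second_column"
  for i in {0..9}; do
    # second_column is always first_column + 10 in this toy dataset
    echo "${i},$((i + 10))"
  done
} > dataset/private_dataset.csv
```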

After that, create a simple Jupyter notebook to analyze this dataset. For instance, you can use the following one as a first draft:

Data Analysis Notebook

import pandas as pd
df = pd.read_csv('dataset/private_dataset.csv')
df.head()

   first_column  second_column
0             0             10
1             1             11
2             2             12
3             3             13
4             4             14

As before, the .gitignore file looks as follows:

.gitignore

# ignoring auxiliary directories and files
.ipynb_checkpoints
.venv
.directory

# ignoring data in the dataset directory
dataset/

If you have cloned an existing repository, it already contains a directory with git hook scripts (.git/hooks). If you are starting a new project, initialize a git repository first:

$ git init

Now, we are ready to write a hook to clear Jupyter notebook output cells.

Solution

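Go to the .git/hooks directory, find the pre-commit.sample script, copy it, and rename the copy to pre-commit. Git only runs hooks that are executable, and copying the sample normally keeps its executable bit. Now, open the file and replace its content with the following: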

#!/bin/bash
#
# This pre-commit hook clears output cells in Jupyter notebooks.

# setting bash strict mode
set -o errexit 
set -o pipefail
set -o nounset
IFS=$'\n\t'


# functions
function elementIn () {
  local elem="$1"  # Save first argument in a variable
  shift            # Shift all arguments to the left (original $1 gets lost)
  local item
  # Compare literally: a grep pattern would treat '.' in a filename as a wildcard
  for item in "$@"; do
    if [[ "$item" == "$elem" ]]; then
      return 0
    fi
  done
  return 1
}

# what commit we should compare against (initial or HEAD)
if git rev-parse --verify HEAD >/dev/null 2>&1
then
	against=HEAD
else
	# Initial commit: diff against an empty tree object
	against=$(git hash-object -t tree /dev/null)
fi

# finding all staged *.ipynb files added, copied, modified or renamed since last commit
# (the grep pattern is quoted so that the shell does not strip the backslash)
IPYNB_FILES=()
while IFS='' read -r line; do
	IPYNB_FILES+=( "$line" )
done < <(git diff-index --name-only --cached --diff-filter=ACMR "${against}" -- | grep -i '\.ipynb$' )
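# nothing to do if no notebooks are staged; this also avoids expanding an
# empty array under 'nounset', which fails on bash older than 4.4
if [[ ${#IPYNB_FILES[@]} -eq 0 ]]; then
	echo "No ipynb files were modified!"
	exit 0
fi
for FILE in "${IPYNB_FILES[@]}"; do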
	echo "Processing file: '$FILE'"
	# you may need to provide the full path to 'jupyter' executable
	# if it is not in the path
	jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace "$FILE"
	# echo "Current EXIT_CODE value: $?"
done

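# list all modified files
MODIFIED_FILES=()
while IFS='' read -r line; do
	MODIFIED_FILES+=( "$line" )
done < <(git ls-files --modified --exclude-standard)
AMOUNT=0
# the ${arr[@]+...} form keeps 'nounset' from failing on an empty array in bash < 4.4
for mfile in ${MODIFIED_FILES[@]+"${MODIFIED_FILES[@]}"}; do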
	if elementIn "$mfile" "${IPYNB_FILES[@]}"; then
		echo "'$mfile' has been modified by pre-commit hook!" 
		AMOUNT=$((AMOUNT+1))
	fi
done

if [[ $AMOUNT -eq 0 ]]; then
	echo "No ipynb files were modified!"
	exit 0
else
	echo "Pre-commit hook modified $AMOUNT ipynb files!"
	exit 1
fi

Now, whenever you commit, this hook script looks for staged *.ipynb files and clears their output cells.

This script requires some explanation. First, to discover which files have been changed, it determines which commit (stored in the against variable) should be used as a reference. This is usually HEAD, unless this is the first commit in the repository, in which case an empty tree object is used instead. Then, the script checks the difference (git diff-index) between the staging area (--cached) and the snapshot identified by against. From the files that have been added, copied, modified or renamed (--diff-filter=ACMR), we select the names (--name-only) ending with .ipynb (grep -i with the \.ipynb$ pattern). Each such file is processed with the jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace command, which clears all output cells of the notebook in place.
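
The filename filter is easy to try in isolation. The sketch below is not part of the hook: the staged_paths variable is a hypothetical stand-in for the output of git diff-index --name-only, and the file names in it are made up. It shows why the pattern is best passed to grep in single quotes: without quotes the shell removes the backslash, the dot matches any character, and a path such as docs/my_ipynb would slip through.

```shell
#!/bin/bash
# Hypothetical list of staged paths, standing in for `git diff-index --name-only` output.
staged_paths=$'analysis.ipynb\nREADME.md\nnotes/Draft.IPYNB\ndocs/my_ipynb'

# Quoted pattern: '\.' is a literal dot, '$' anchors the match at the end of the
# line, and -i makes the extension check case-insensitive.
notebooks=$(printf '%s\n' "$staged_paths" | grep -i '\.ipynb$')

printf '%s\n' "$notebooks"
```

With the sample list above, only analysis.ipynb and notes/Draft.IPYNB are printed.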

Then, we list all modified (--modified) files (git ls-files), excluding the ignored ones (--exclude-standard). If a modified file is among those that have been processed by the jupyter nbconvert command, it means that the file contained uncleared output cells. If such files are found, we exit with exit code 1. This aborts the commit, so that we have a chance to add the cleaned files to the staging area and commit again.
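
The decision logic can be reduced to a few lines. The following is a simplified sketch rather than the hook itself: the processed and modified arrays are hypothetical stand-ins for the two lists the hook builds, and the status is stored in a variable instead of being passed to exit so that the snippet can be run on its own.

```shell
#!/bin/bash
# Hypothetical stand-ins for the two lists the hook builds.
processed=("analysis.ipynb" "report.ipynb")   # notebooks passed to nbconvert
modified=("report.ipynb" "README.md")         # files with unstaged changes

# Count the files that appear in both lists.
touched=0
for mfile in "${modified[@]}"; do
  for pfile in "${processed[@]}"; do
    if [[ "$mfile" == "$pfile" ]]; then
      touched=$((touched + 1))
      break
    fi
  done
done

# A non-zero exit status from a pre-commit hook makes git abort the commit.
if [[ $touched -eq 0 ]]; then
  status=0
else
  status=1
fi
echo "notebooks changed by clearing: ${touched}, hook exit status: ${status}"
```

Here only report.ipynb is in both lists, so the count is 1 and the commit would be aborted. README.md is ignored because the hook never processed it.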

If no files are modified, we exit with exit code 0 and the commit proceeds.
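
To double-check that a notebook is really clean, you can look for stored outputs in its JSON. The hasOutputs helper below is a hypothetical addition, not part of the hook, and it is only a heuristic: it assumes the nbformat 4 layout, in which every stored output carries an "output_type" key. The two files it creates are tiny stand-ins, not complete notebooks.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the given notebook file still has stored outputs.
# Heuristic: in nbformat 4, every stored output has an "output_type" key.
function hasOutputs () {
  grep -q '"output_type":' "$1"
}

# Two tiny stand-in files (not complete notebooks) to exercise the helper.
tmpdir=$(mktemp -d)
printf '%s\n' '{"cells": [{"cell_type": "code", "outputs": [{"output_type": "stream"}]}]}' > "${tmpdir}/dirty.ipynb"
printf '%s\n' '{"cells": [{"cell_type": "code", "outputs": []}]}' > "${tmpdir}/clean.ipynb"

for nb in "${tmpdir}/dirty.ipynb" "${tmpdir}/clean.ipynb"; do
  if hasOutputs "$nb"; then
    echo "$(basename "$nb"): outputs present"
  else
    echo "$(basename "$nb"): no outputs"
  fi
done
```

Running the same check on a real notebook after an aborted commit should report no outputs once the hook has cleared it.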
