Clearing Output Data in Jupyter Notebooks using Pre-commit Framework
Recently, I participated in a project at AI Superior aimed at analyzing a dataset with sensitive data. Since the data had to remain private, we initially shared the dataset through a secure channel and took measures to prevent its accidental distribution: we put the dataset in a separate directory and configured git to ignore this folder and other directories containing intermediate processing results. However, while working on this project I noticed that Jupyter notebook, a de facto standard tool for data analysis, may be a source of sensitive data leakage.
Update 27/10/2020: I have developed a git hook to clear Jupyter output cells data that does not rely on pre-commit framework. You can find the description of the approach in the article “Clearing Output Data in Jupyter Notebooks using a Bash Script Hook”.
Update 31/10/2020: I have found a better approach to clear Jupyter output cells data that relies on git attributes. You can find my latest finding in the article “Clearing Output Data in Jupyter Notebooks using Git Attributes”.
The issue is that a Jupyter notebook stores not only the code but also the output produced when the code cells are executed. The output may contain pieces of sensitive information; therefore, if the notebook is shared publicly, anyone can read this private data.
To exemplify this issue, let us consider the following artificial project (you can find it here). Let us imagine that the dataset containing private data
private_dataset.csv is stored in a directory
dataset/. This directory is added to
.gitignore file, thus the files in it are ignored by git and do not appear in the list of files for staging. Hence, the dataset remains local.
first_column,second_column
0,10
1,11
2,12
3,13
4,14
5,15
6,16
7,17
8,18
9,19
# ignoring auxiliary directories and files
.ipynb_checkpoints
.venv
.directory

# ignoring data in the dataset directory
dataset/
Now, let us create a simple Jupyter notebook to analyze this dataset. Typically, the first actions during the analysis are the following:
Data Analysis Notebook
import pandas as pd
df = pd.read_csv('dataset/private_dataset.csv')
We load the dataset into a dataframe using the
pandas.read_csv() function and then check that the dataset is loaded properly, e.g., by calling the
pandas.DataFrame.head() function. As a result, Jupyter outputs the first 5 entries of the dataset. When you save the notebook, Jupyter stores this output along with the code. Therefore, people who have access to the notebook may get access to portions of the sensitive data (although the dataset itself is not shared).
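For completeness, these first cells can be sketched as a standalone script (pandas is assumed to be installed; the CSV content is inlined here instead of being read from dataset/private_dataset.csv):

```python
import io

import pandas as pd

# Inlined stand-in for dataset/private_dataset.csv from the example project
csv_text = "first_column,second_column\n0,10\n1,11\n2,12\n3,13\n4,14\n5,15\n"

df = pd.read_csv(io.StringIO(csv_text))

# In Jupyter, the value of the last expression in a cell is rendered and,
# crucially, saved into the .ipynb file together with the code
print(df.head())
```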
The obvious solution to this issue is to clear all outputs before committing the notebook to the version control system. When you have a Jupyter notebook open, you can do this by selecting the Cell -> All Output -> Clear menu item. However, in my experience you sometimes forget to do this, causing a data leak that is hard to plug (once a commit is pushed to a public repository, it is very hard to remove). Therefore, to prevent this we must not commit notebooks that contain output data.
An additional benefit of such a prevention system is a reduction of polluting commits in the repository history. Indeed, instead of pandas.DataFrame.head() I often use the pandas.DataFrame.sample() function, which outputs n rows chosen randomly rather than the n top records of a dataset. Therefore, each execution of such a notebook produces different output that has to be processed by git (committed or discarded).
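The difference can be sketched as follows (a toy dataframe stands in for the real dataset, and fixed random_state values emulate two separate notebook runs):

```python
import pandas as pd

df = pd.DataFrame({"first_column": range(10), "second_column": range(10, 20)})

# head() is deterministic: the same rows are rendered on every run
top = df.head(5)

# sample() picks rows at random, so each notebook execution renders a
# different output cell -- and therefore a different diff for git
run1 = df.sample(n=5, random_state=1)  # random_state stands in for one run
run2 = df.sample(n=5, random_state=2)  # ...and for another run
print(run1.index.tolist())
print(run2.index.tolist())
```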
To address this issue, let us build a pipeline that automatically checks every commit and removes Jupyter notebook output data. To develop such a system we will make use of git's hooks functionality. Git provides facilities to execute custom scripts before or after certain git actions are carried out; these facilities and the scripts are called hooks.
You can develop these hook scripts using your system’s scripting facilities (check script samples in the
.git/hooks directory), however I will use the pre-commit framework that simplifies this process considerably.
However, before using this tool we have to install the framework. The pre-commit framework is developed in Python, therefore we have to install it using Python tooling. In a previous article, I have described how I configure my Python environment; here I will work within the same setup, but you may need to adapt the instructions for yours. At first, let us install pre-commit into the tools3 pyenv environment:
$ pyenv activate tools3
$ pip install pre-commit
$ pyenv deactivate
Since the tools3 environment is activated globally, the path to the pre-commit executable is added to the PATH variable, and we can run this tool using only its name:
$ pyenv which pre-commit
/home/yury/.pyenv/versions/tools3/bin/pre-commit
$ pre-commit --version
pre-commit 2.4.0
Now, let us create a pre-commit hook to address our problem. In order to do this, in the root of your git repository create the
.pre-commit-config.yaml file with the following content:
repos:
  - repo: local
    hooks:
      - id: jupyter-nb-clear-output
        name: jupyter-nb-clear-output
        files: \.ipynb$
        stages: [commit]
        language: system
        entry: jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace
This YAML code describes a new pre-commit hook. Let us consider what this configuration means. Since we are developing a new pre-commit hook ourselves, the repo option should be equal to local (usually, the repo option specifies the repository from which pre-commit hooks should be downloaded). The
hooks section describes what hooks should be used. In this case, we create a new hook with
id equal to jupyter-nb-clear-output (the id is used to uniquely identify a hook) and with the same name (used for output purposes). The
\.ipynb$ file pattern tells that the hook should be applied only to the files with the
.ipynb extension (Jupyter notebook files). Additionally, it should be run only during the git commit stage (the stages: [commit] option). The last two options, language and entry, specify what language is used to install the hook and what executable to run, respectively. As an entry point I use
jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace. This command clears all Jupyter notebook output data by applying the Jupyter nbconvert
ClearOutputPreprocessor filter to a file and storing the result to the same file (the
--inplace option). You can test this command on a Jupyter notebook:
$ jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace PoC.ipynb
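As a side note, the files value from the configuration above is a Python regular expression that pre-commit matches against staged file paths (with search semantics), which is easy to check directly:

```python
import re

# The same pattern as in the files option of .pre-commit-config.yaml
pattern = re.compile(r"\.ipynb$")

print(bool(pattern.search("PoC.ipynb")))        # matches
print(bool(pattern.search("notes/PoC.ipynb")))  # matches in subdirectories too
print(bool(pattern.search("PoC.ipynb.bak")))    # no match: $ anchors the end
```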
Since the jupyter command is available globally in my setup, I do not need to provide the full path to the executable. Moreover, I run this command without the poetry run prefix because I do not need the packages installed in the project's virtual environment.
Once the .pre-commit-config.yaml file is created, we have to activate the configuration by installing the hook described in this file. The simplest way to do this is to run the following command in the root directory of the repository:
$ pre-commit install
pre-commit installed at .git/hooks/pre-commit
After executing this command, a new
pre-commit file should be created in the
.git/hooks directory. Note that you have to run this command each time you have modified the configuration file.
Testing the Solution
Now, let us test our solution. After all actions, my test repository has the following structure:
$ tree -a -I '.venv|.directory|.ipynb_checkpoints|.git'
.
├── dataset
│   └── private_dataset.csv
├── .gitignore
├── PoC.ipynb
├── poetry.lock
├── .pre-commit-config.yaml
└── pyproject.toml
Let us stage our
PoC.ipynb notebook containing some output data and commit it to git:
$ git add PoC.ipynb
$ git commit -m "Update notebook"
jupyter-nb-clear-output..................................................Failed
- hook id: jupyter-nb-clear-output
- duration: 0.62s
- files were modified by this hook

[NbConvertApp] Converting notebook PoC.ipynb to notebook
[NbConvertApp] Writing 1029 bytes to PoC.ipynb
As you can see, when you try to commit, git calls the jupyter-nb-clear-output pre-commit hook. The commit fails because the hook modifies the PoC.ipynb file. We can see this using the git status command:
$ git status
On branch master

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

	new file:   PoC.ipynb

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   PoC.ipynb
Therefore, we have to add the modified files once again and commit:
$ git add PoC.ipynb
$ git commit -m "Update notebook"
jupyter-nb-clear-output..................................................Passed
- hook id: jupyter-nb-clear-output
- duration: 0.51s

[NbConvertApp] Converting notebook PoC.ipynb to notebook
[NbConvertApp] Writing 1029 bytes to PoC.ipynb
[master (root-commit) a6d61bd] Update notebook
 1 file changed, 61 insertions(+)
 create mode 100644 PoC.ipynb
Now the commit process passes, and if you open the notebook you will see that it does not contain any output.