Clearing Output Data in Jupyter Notebooks using Git Attributes

In my previous articles (“Clearing Output Data in Jupyter Notebooks using Pre-commit Framework” and “Clearing Output Data in Jupyter Notebooks using a Bash Script Hook”), I described how to clear output data in Jupyter notebooks using the pre-commit framework and a git hook script, respectively. Both approaches work and can be applied to your project repositories. However, I have recently found a third way to clear Jupyter notebook output cells that seems cleaner and easier to implement. In this article, I describe my latest findings.

Approach

Git provides facilities to assign custom attributes to pathnames. Later, these attributes can be used to influence the behavior of some git operations, for instance, check-out and check-in. Using this functionality, we can define filter commands that will be executed on every file with a particular attribute during check-in (filter.clean config parameter) and check-out (filter.smudge config parameter).

To implement this approach, we need to perform the following steps. First, we have to specify the attribute that will be assigned to all Jupyter notebook files. To do this, create either $GIT_DIR/info/attributes (if you do not want the file to appear in the git repository tree) or .gitattributes (if you want it to be under version control) with the following content:

*.ipynb filter=jupyternotebook

The first part (*.ipynb) is the pathname pattern to which you assign the attribute (filter) with a specific value (jupyternotebook). Personally, I use jupyternotebook, but you can choose any value you like. Just remember that we will use this value later in the git configuration file. It is also possible to define git attributes on the global and system levels (see this stackoverflow answer for details). This can be convenient if you need to modify the behavior of some git commands in every repository of the user or the machine, respectively.
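To confirm that the attribute is picked up, git can be asked which filter it would apply to a given path. The sketch below is a minimal check in a throwaway repository; the temporary directory and the attr_line variable are illustrative names, not part of the setup described here.

```shell
#!/bin/sh
set -eu

# Throwaway repository so the check does not touch a real project.
repo=$(mktemp -d)
git init -q "$repo"

# Same attribute line as in the article.
printf '*.ipynb filter=jupyternotebook\n' > "$repo/.gitattributes"

# Ask git which filter it would apply to a notebook path.
# PoC.ipynb does not have to exist for check-attr to answer.
attr_line=$(git -C "$repo" check-attr filter -- PoC.ipynb)
echo "$attr_line"
# prints: PoC.ipynb: filter: jupyternotebook

rm -rf "$repo"
```

In a real repository, running git check-attr filter -- PoC.ipynb from the repository root should be enough.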

To change the behavior of commands based on the attribute values, we should modify the git configuration. As you know, git config options can be set at the repository, user or system level. Since you might not need to clear Jupyter output data in every repository, I would recommend specifying these options at the repository level. To do this, open the $GIT_DIR/config file and add the following lines:

[filter "jupyternotebook"]
	clean = jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace %f
	required

These lines tell git to execute the specified filter.clean command (jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace) on every file (%f) whose filter attribute is set to jupyternotebook. The required option tells git that the command must succeed: if the filter fails, the operation is aborted instead of the content being passed through unfiltered.
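Instead of editing $GIT_DIR/config by hand, the same options can probably be set with git config. The following sketch does this in a temporary repository and reads the values back; the repository location and the variable names are assumptions made for illustration. Note that it writes required = true, which git treats the same as the bare required key shown above.

```shell
#!/bin/sh
set -eu

# Temporary repository used only for this demonstration.
repo=$(mktemp -d)
git init -q "$repo"

# Write the filter definition into the repository-level config.
git -C "$repo" config --local filter.jupyternotebook.clean \
    'jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace %f'
git -C "$repo" config --local filter.jupyternotebook.required true

# Read the values back to verify what git will use.
clean_cmd=$(git -C "$repo" config --get filter.jupyternotebook.clean)
required=$(git -C "$repo" config --bool --get filter.jupyternotebook.required)
echo "clean:    $clean_cmd"
echo "required: $required"

rm -rf "$repo"
```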

Whenever you change the clean filter, you have to renormalise your repository:

$ git add --renormalize .

After these changes, git will automatically run the command to clear output cells on every Jupyter notebook file added to the staging area.

Contrary to the git hook based approaches described in the previous articles, this approach modifies the files before they are added to the staging area. Therefore, you do not need to add the modified files twice.

How does this work? Let me illustrate the approach using the same testbed as in the previous article. Imagine that during an analysis we have modified the notebook PoC.ipynb, and it now contains some sensitive data in the output cells. To apply our filter.clean command, simply add the notebook to the staging area:

$ git add PoC.ipynb 
[NbConvertApp] Converting notebook PoC.ipynb to notebook
[NbConvertApp] Writing 1236 bytes to PoC.ipynb

Now, if you open the file, you will see that the output cells are cleared. Note that this approach does not require adding the files to the staging area again after the command is executed. Thus, you need to run one command fewer, which can save some time.
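Git's gitattributes documentation describes clean and smudge commands as filters that read content on standard input and write the result to standard output, and it advises against accessing the file on disk. A variant of the filter that follows this convention might look like the sketch below. It assumes that the installed nbconvert accepts the --to=notebook, --stdin, --stdout and --log-level options, and it uses a temporary repository and a variable name chosen only for illustration. With this variant only the staged content is cleared, while the notebook in the working directory keeps its outputs.

```shell
#!/bin/sh
set -eu

# Temporary repository used only for this demonstration.
repo=$(mktemp -d)
git init -q "$repo"

# Variant of the clean filter that acts purely on stdin/stdout.
# Assumption: nbconvert supports --stdin/--stdout in the installed version.
git -C "$repo" config --local filter.jupyternotebook.clean \
    'jupyter nbconvert --ClearOutputPreprocessor.enabled=True --to=notebook --stdin --stdout --log-level=ERROR'
git -C "$repo" config --local filter.jupyternotebook.required true

# Show the command git would run for files with filter=jupyternotebook.
stream_cmd=$(git -C "$repo" config --get filter.jupyternotebook.clean)
echo "$stream_cmd"

rm -rf "$repo"
```

After staging a notebook with such a filter, git show :PoC.ipynb prints the staged version, which makes it possible to check that the outputs are gone from what will be committed.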

Sharing Git Repository Config

Neither the approaches described in the previous articles nor the one described here can be shared and applied to a repository automatically: each requires some manual configuration by the git user. Indeed, with the pre-commit framework based approach, the configuration lives in the git-tracked .pre-commit-config.yaml file. However, to use it you have to install the pre-commit framework and activate it for the repository. Similarly, with the pure git hook based approach, the directory with the hooks can be tracked, but the repository configuration has to be modified to point to the directory location.

So far, I have not managed to find a way to overcome this issue. However, the workaround described below is tolerable. To apply the approach described in this article, I create two files, .gitattributes and .gitconfig, in the repository root:

.gitattributes:

*.ipynb filter=jupyternotebook

.gitconfig:

[filter "jupyternotebook"]
	clean = jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace %f
	required

Then, to include the options from the .gitconfig file in the local git configuration, I run the following command in the git repository root directory:

$ git config --local include.path ../.gitconfig

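This command adds the following entry to your $GIT_DIR/config file, thus forcing it to include the configuration options defined in the .gitconfig file. The path is resolved relative to the $GIT_DIR/config file itself, which is why it starts with ../: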

[include]
	path = ../.gitconfig

You need to run this command once after the repository is cloned. In my view, this is the best way I have found to share repository configuration options. However, if you know a way to share and apply git configuration automatically, let me know. You can use the same approach to set a custom path to the directory with your git hook scripts.
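The one-time setup can be checked by reading the filter option back through git config, which follows the include directive by default when no specific config file is selected. The sketch below recreates the two tracked files in a temporary repository (a hypothetical stand-in for a fresh clone), runs the include command and prints the resolved clean command.

```shell
#!/bin/sh
set -eu

# Stand-in for a freshly cloned repository.
repo=$(mktemp -d)
git init -q "$repo"

# The two tracked files from the article.
printf '*.ipynb filter=jupyternotebook\n' > "$repo/.gitattributes"
cat > "$repo/.gitconfig" <<'EOF'
[filter "jupyternotebook"]
	clean = jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace %f
	required
EOF

# One-time step after cloning: include the tracked config.
# The path is relative to $GIT_DIR/config, hence the leading ../
git -C "$repo" config --local include.path ../.gitconfig

# The filter option should now be visible through the include.
included_clean=$(git -C "$repo" config --get filter.jupyternotebook.clean)
echo "$included_clean"

rm -rf "$repo"
```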

Yury Zhauniarovich
R&D Engineer
Lead Data Scientist
Cyber Security Researcher
