data analysis

Clearing Output Data in Jupyter Notebooks using Git Attributes

In my previous articles (“Clearing Output Data in Jupyter Notebooks using Pre-commit Framework” and “Clearing Output Data in Jupyter Notebooks using a Bash Script Hook”), I described how to clear output data in Jupyter notebooks using the pre-commit framework and the git hook script correspondingly. Both these approaches are usable and could be applied for your project repositories. However, recently I have found the third way how to clear Jupyter notebook output cells that seems to me more clear and easier to implement. In this article, I describe my last findings.

Clearing Output Data in Jupyter Notebooks using a Bash Script Hook

In my previous article, I described why you may need to clear output data in your Jupyter notebooks. As at the time I participated in a pre-sail project for AI Superior, we required a quick solution to achieve this goal. That is why I used Python-based pre-commit framework to create a pipeline to clear output data. However, this approach requires you to install additional Python package into your system, that might not be always possible. Therefore, at the time I decided that I would implement this approach as a pure Bash script. Recently, I have found some spare time and decided to dig deeper into this topic. As a result of my explorations, I developed a git pre-commit hook that clears Jupyter output cells and wrote this article describing it. If you are an adept of ‘show me the code’ and do not want to read the article, you can find the final script here.

Clearing Output Data in Jupyter Notebooks using Pre-commit Framework

Recently, I have participated in a project at AI Superior aimed at the analysis of a dataset with sensitive data. So as the data have to remain private, initially we shared the dataset through a secure channel and took measures to prevent its accidental distribution (we put the dataset in a separate directory and configured git to ignore this folder and other directories containing intermediate processing results). However, working on this project I have noticed that Jupyter notebook, that is a kind of standard tool used for data analysis, may be a source of sensitive data leakage.