In my previous articles (“Clearing Output Data in Jupyter Notebooks using Pre-commit Framework” and “Clearing Output Data in Jupyter Notebooks using a Bash Script Hook”), I described how to clear output data in Jupyter notebooks using the pre-commit framework and the git hook script correspondingly. Both these approaches are usable and could be applied for your project repositories. However, recently I have found the third way how to clear Jupyter notebook output cells that seems to me more clear and easier to implement. In this article, I describe my last findings.
In my previous article, I described why you may need to clear output data in your Jupyter notebooks. As at the time I participated in a pre-sail project for AI Superior, we required a quick solution to achieve this goal. That is why I used Python-based pre-commit framework to create a pipeline to clear output data. However, this approach requires you to install additional Python package into your system, that might not be always possible. Therefore, at the time I decided that I would implement this approach as a pure Bash script. Recently, I have found some spare time and decided to dig deeper into this topic. As a result of my explorations, I developed a git pre-commit hook that clears Jupyter output cells and wrote this article describing it. If you are an adept of ‘show me the code’ and do not want to read the article, you can find the final script here.
Recently, I have participated in a project at AI Superior aimed at the analysis of a dataset with sensitive data. So as the data have to remain private, initially we shared the dataset through a secure channel and took measures to prevent its accidental distribution (we put the dataset in a separate directory and configured git to ignore this folder and other directories containing intermediate processing results). However, working on this project I have noticed that Jupyter notebook, that is a kind of standard tool used for data analysis, may be a source of sensitive data leakage.
Last several years I use git as my version control system (VCS) both for personal and work projects. If you are working in a team, usage of a VCS brings you a lot of benefits like change tracking, history viewing, merge issues resolving, etc. However, I have found it very handy to use even for personal projects: with git I can try different solutions simultaneously and then select the better one. Moreover, I can easily remember what I have done for a project recently. This task becomes much easier if commit messages for repository changes are written clearly and in accordance with the patterns and rules recommended by the VCS. Developers have already developed the best practices how to write commit messages. I refer the interested reader, e.g., to the article written by Chris Beams. In this article, I explain how VSCode can help writing commit messages in accordance with the rules.