Nowadays, it is a quite popular to store semi-structured information using JSON format. Indeed, JSON files have quite simple structure and can be easily read by human beings. JSON syntax allows one to represent complex dependencies in data and avoid data duplication. Moreover, all modern programming languages have libraries that facilitate JSON parsing and storing data into this format. Not surprisingly, JSON is extensively used to return data in Application Programming Interfaces (APIs) .
At the same time, data analysts prefer to deal with structured data represented in the form of series and dataframes. Unfortunately, transforming JSON data into structured format is not that straightforward. Previously, I preferred to develop code to parse manually complex JSON files and create a pandas dataframe from the parsed data. However, recently I have discovered a pandas function called json_normalize that saved me some time in my projects. In this article, I explain how you can start using it in your projects.
In my previous articles (“Clearing Output Data in Jupyter Notebooks using Pre-commit Framework” and “Clearing Output Data in Jupyter Notebooks using a Bash Script Hook”), I described how to clear output data in Jupyter notebooks using the pre-commit framework and the git hook script correspondingly. Both these approaches are usable and could be applied for your project repositories. However, recently I have found the third way how to clear Jupyter notebook output cells that seems to me more clear and easier to implement. In this article, I describe my last findings.
In my previous article, I described why you may need to clear output data in your Jupyter notebooks. As at the time I participated in a pre-sail project for AI Superior, we required a quick solution to achieve this goal. That is why I used Python-based pre-commit framework to create a pipeline to clear output data. However, this approach requires you to install additional Python package into your system, that might not be always possible. Therefore, at the time I decided that I would implement this approach as a pure Bash script. Recently, I have found some spare time and decided to dig deeper into this topic. As a result of my explorations, I developed a git pre-commit hook that clears Jupyter output cells and wrote this article describing it. If you are an adept of ‘show me the code’ and do not want to read the article, you can find the final script here.
Recently, I have participated in a project at AI Superior aimed at the analysis of a dataset with sensitive data. So as the data have to remain private, initially we shared the dataset through a secure channel and took measures to prevent its accidental distribution (we put the dataset in a separate directory and configured git to ignore this folder and other directories containing intermediate processing results). However, working on this project I have noticed that Jupyter notebook, that is a kind of standard tool used for data analysis, may be a source of sensitive data leakage.