Some time ago, I wrote an article showing how much data you miss if you rely only on client-side web analytics numbers. To keep that blog post up to date, I planned to refresh its figures from time to time. However, as you might expect, I soon got bored of manually collecting Cloudflare and Google Analytics data and inserting it into a spreadsheet. Being a normal human, I decided to automate the process. Luckily, both systems provide APIs that you can use to query and download the data. However, while developing the data collection script, I discovered several limitations that prompted me to write a new blog post instead of updating the old one.
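To give an idea of what such automation can look like, below is a minimal sketch of the Cloudflare side of the collection, using its GraphQL Analytics API and the requests library. The token, zone tag, date range, and selected fields are placeholders and assumptions for illustration; this is not the script discussed in the article, and the fields actually available depend on your plan.

```python
import requests

# Placeholders: substitute your own credentials (assumptions, not real values).
CF_API_TOKEN = "YOUR_API_TOKEN"
CF_ZONE_TAG = "YOUR_ZONE_TAG"
CF_GRAPHQL_URL = "https://api.cloudflare.com/client/v4/graphql"


def fetch_cloudflare_daily_stats(since: str, until: str) -> list[dict]:
    """Query Cloudflare's GraphQL Analytics API for per-day request statistics."""
    # httpRequests1dGroups aggregates zone traffic per day.
    query = f"""
    {{
      viewer {{
        zones(filter: {{zoneTag: "{CF_ZONE_TAG}"}}) {{
          httpRequests1dGroups(
            limit: 31,
            orderBy: [date_ASC],
            filter: {{date_geq: "{since}", date_leq: "{until}"}}
          ) {{
            dimensions {{ date }}
            sum {{ requests pageViews }}
            uniq {{ uniques }}
          }}
        }}
      }}
    }}
    """
    response = requests.post(
        CF_GRAPHQL_URL,
        headers={"Authorization": f"Bearer {CF_API_TOKEN}"},
        json={"query": query},
        timeout=30,
    )
    response.raise_for_status()
    groups = response.json()["data"]["viewer"]["zones"][0]["httpRequests1dGroups"]
    # Flatten each daily group into a plain record suitable for a spreadsheet or dataframe.
    return [
        {"date": g["dimensions"]["date"], **g["sum"], **g["uniq"]}
        for g in groups
    ]


if __name__ == "__main__":
    for day in fetch_cloudflare_daily_stats("2023-01-01", "2023-01-31"):
        print(day)
```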
In the previous article comparing the JAMstack services of two popular providers, I mentioned that one of my incentives for moving to Cloudflare was the basic server-side analytics it provides even to free-tier users. Extended analytics is available on both Cloudflare and Netlify as a paid option: on Cloudflare you have to subscribe to one of the paid accounts (the cheapest is the “Pro” plan at 20 US dollars per month); on Netlify you can either subscribe to the “Business” plan for 99 US dollars per member per month, or enable this feature for each of your sites for just 9 US dollars a month. If you need accurate web analytics data, I definitely recommend choosing one of these options because, as my analysis in this article shows, client-side analytics solutions (e.g., Google Analytics, Yandex Metrica or Microsoft Clarity) overlook a large portion of visitors’ interactions due to various anti-tracking solutions (personally, I use the uBlock Origin plugin in my web browser). In this article, I show how much data you may be missing.
Nowadays, it is quite popular to store semi-structured information in the JSON format. Indeed, JSON files have a fairly simple structure and can be easily read by humans. JSON syntax allows one to represent complex relationships in data and avoid data duplication. Moreover, all modern programming languages have libraries that facilitate parsing JSON and storing data in this format. Not surprisingly, JSON is extensively used to return data from Application Programming Interfaces (APIs).
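As a tiny illustration (the document below is made up for the example), a few lines of Python are enough to parse a nested JSON document into native data structures:

```python
import json

# A small, invented JSON document with nested structure:
# the author appears once instead of being duplicated for every post.
raw = """
{
  "author": {"name": "Alice", "id": 42},
  "posts": [
    {"title": "Hello", "tags": ["intro", "meta"]},
    {"title": "JSON and pandas", "tags": ["data"]}
  ]
}
"""

data = json.loads(raw)          # parse the text into dicts and lists
print(data["author"]["name"])   # -> Alice
print(len(data["posts"]))       # -> 2
```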
At the same time, data analysts prefer to deal with structured data represented in the form of series and dataframes. Unfortunately, transforming JSON data into such a structured format is not that straightforward. Previously, I used to write code to manually parse complex JSON files and create a pandas dataframe from the parsed data. However, I recently discovered a pandas function called json_normalize that has saved me some time in my projects. In this article, I explain how you can start using it in your projects.
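Here is a minimal sketch of what json_normalize does; the records below are invented for the example, similar in spirit to what an API might return. It flattens nested objects into columns and can expand a nested list of records into one row per record.

```python
import pandas as pd

# Invented example records with a nested "author" object and a list of "posts".
records = [
    {"id": 1, "author": {"name": "Alice", "country": "DE"},
     "posts": [{"title": "Hello"}, {"title": "JSON and pandas"}]},
    {"id": 2, "author": {"name": "Bob", "country": "FR"},
     "posts": [{"title": "Server-side analytics"}]},
]

# Flatten the nested "author" object into author.name / author.country columns.
flat = pd.json_normalize(records)
print(flat.columns.tolist())
# ['id', 'posts', 'author.name', 'author.country']

# Expand the nested list of posts into one row per post,
# keeping selected fields from the parent record as metadata.
posts = pd.json_normalize(records, record_path="posts",
                          meta=["id", ["author", "name"]])
print(posts)
#                    title  id author.name
# 0                  Hello   1       Alice
# 1        JSON and pandas   1       Alice
# 2  Server-side analytics   2         Bob
```

The record_path and meta arguments are what let you turn a nested list of records into a tidy table without writing the flattening loop yourself.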
In my previous articles (“Clearing Output Data in Jupyter Notebooks using Pre-commit Framework” and “Clearing Output Data in Jupyter Notebooks using a Bash Script Hook”), I described how to clear output data in Jupyter notebooks using the pre-commit framework and a git hook script, respectively. Both approaches work and can be applied to your project repositories. However, I have recently found a third way to clear Jupyter notebook output cells that seems cleaner and easier to implement. In this article, I describe my latest findings.
In my previous article, I described why you may need to clear output data in your Jupyter notebooks. Because at the time I was participating in a pre-sale project for AI Superior, we needed a quick solution to achieve this goal. That is why I used the Python-based pre-commit framework to create a pipeline that clears output data. However, this approach requires you to install an additional Python package on your system, which might not always be possible. Therefore, at the time I decided that I would eventually implement this approach as a pure Bash script. Recently, I found some spare time and decided to dig deeper into the topic. As a result of my explorations, I developed a git pre-commit hook that clears Jupyter output cells and wrote this article describing it. If you are a ‘show me the code’ kind of person and do not want to read the article, you can find the final script here.
Recently, I participated in a project at AI Superior aimed at analyzing a dataset containing sensitive data. Since the data had to remain private, we initially shared the dataset through a secure channel and took measures to prevent its accidental distribution (we put the dataset in a separate directory and configured git to ignore this folder and other directories containing intermediate processing results). However, while working on this project I noticed that Jupyter notebooks, a de facto standard tool for data analysis, may be a source of sensitive data leaks.