Flattening JSON data using Pandas

Nowadays, it is a quite popular to store semi-structured information using JSON format. Indeed, JSON files have quite simple structure and can be easily read by human beings. JSON syntax allows one to represent complex dependencies in data and avoid data duplication. Moreover, all modern programming languages have libraries that facilitate JSON parsing and storing data into this format. Not surprisingly, JSON is extensively used to return data in Application Programming Interfaces (APIs) .

At the same time, data analysts prefer to deal with structured data represented in the form of series and dataframes. Unfortunately, transforming JSON data into structured format is not that straightforward. Previously, I preferred to develop code to parse manually complex JSON files and create a pandas dataframe from the parsed data. However, recently I have discovered a pandas function called json_normalize that saved me some time in my projects. In this article, I explain how you can start using it in your projects.

You can find the code and the data for this article in the repository.
Table of Contents

Example

Let us consider how to use this function using a JSON file generated by Vertx service as an example. At Vertx, we develop a content-based multimedia search engine that allows one to find the source clip by providing a video/audio sample. It is like Shazam but both for video and audio.

Example of Vertx JSON

[
  {
    "matches": [
      {
        "album": "Elephunk",
        "artist": "Black Eyed Peas; Papa Roach",
        "segments": [
          {
            "duration": 24.9375,
            "que_offset": 0.0,
            "ref_offset": 18.0
          }
        ],
        "title": "Anxiety",
        "type": "music",
        "uid": 6475547275973858650
      },
      {
        "imdb_id": 365957,
        "segments": [
          {
            "duration": 24.9375,
            "que_offset": 0.0,
            "ref_offset": 3500.625
          }
        ],
        "title": "You Got Served",
        "type": "movie",
        "uid": 1566530810344932800,
        "year": 2004
      }
    ],
    "media_type": "audio",
    "source_path": "sample.mp4",
    "source_uid": 1758730481226206085,
    "status": "succeeded"
  },
  {
    "matches": [
      {
        "imdb_id": 332452,
        "segments": [
          {
            "duration": 21.0,
            "que_offset": 0.0,
            "ref_offset": 8608.5
          }
        ],
        "title": "Troy",
        "type": "movie",
        "uid": 1342729426672482861,
        "year": 2004
      }
    ],
    "media_type": "video",
    "source_path": "sample.mp4",
    "source_uid": 1758730481226206085,
    "status": "succeeded"
  }
]

So as the search results may be consumed both by human beings and by external services, we return them in JSON format because of its simplicity. At the same time, the resulting JSON has quite complex structure because:

  • It needs to return the results for two different media_type: video and audio;
  • Each media type may have its own set of metadata fields;
  • For each media type, there can be several matches (e.g., if a sample consists of several videos), each having its own metadata;
  • There can be several matched segments (e.g., if a sample contains several parts of a video).

When we analyze the results of our algorithm, the information, which is key for the analysis, is stored in the fields duration, key_offset and ref_offset. For instance, if we want to find out the total duration of duplicated content, we need to sum up the values in the corresponding field. Other fields are important to represent the context (e.g., the UIDs of the query and reference videos).

Straightforward Approach

Previously, in order to transform such complex JSON file into a pandas dataframe, I usually developed custom code to parse it. For instance, a function to create a pandas dataframe from the Vertx data may look in the following way:

def load_vertx_data(json_file):
    with open(json_file, 'r') as jf:
        json_data = json.load(jf)
    
    flattened_data = []
    for mte in json_data:
        que_uid = mte.get('source_uid', pd.NA)
        media_type = mte.get('media_type', pd.NA)
        source_path = mte.get('source_path', pd.NA)
        status = mte.get('status', pd.NA)
        for m in mte['matches']:
            artist = m.get('artist', pd.NA)
            album = m.get('album', pd.NA)
            title = m.get('title', pd.NA)
            type_ = m.get('type', pd.NA)
            ref_uid = m.get('uid', pd.NA)
            imdb_id = m.get('imdb_id', pd.NA)
            year = m.get('year', pd.NA)
            for s in m['segments']:
                duration = s['duration']
                que_offset = s['que_offset']
                ref_offset = s['ref_offset']
                entry = {
                    'que_uid': que_uid,
                    'media_type': media_type,
                    'source_path': source_path,
                    'status': status,
                    'artist': artist,
                    'album': album,
                    'title': title,
                    'type': type_,
                    'ref_uid': ref_uid,
                    'imdb_id': imdb_id,
                    'year': year,
                    'que_offset': que_offset,
                    'ref_offset': ref_offset,
                    'duration': duration,
                }
                flattened_data.append(entry)
    
    return pd.DataFrame(flattened_data)

At first in this function, create a Python dict object from the JSON data stored in a file:

with open(json_file, 'r') as jf:
    json_data = json.load(jf)

Then, we iterate over each media type entry mte and extract the contextual data of the queried video/audio: source_uid, media_type, source_path, and status data.

for mte in json_data:
    que_uid = mte.get('source_uid', pd.NA)
    media_type = mte.get('media_type', pd.NA)
    source_path = mte.get('source_path', pd.NA)
    status = mte.get('status', pd.NA)

There are two things to note here:

  1. I use get function to get the data. If the value for a field is missing, I substitute it with pd.NA value.
  2. I extract the value of the source_uid field into the que_uid variable. Historically, Vertx use source_uid to represent the UID of the queried video and uid for the UID of the reference video. However, for unification I prefer to use que_uid and ref_uid correspondingly.

Then, we iterate over the matches for each media type and for each match extract the contextual data the referenced video/audio: artist, album, title, type_, ref_uid, imdb_id and year.

for m in mte['matches']:
    artist = m.get('artist', pd.NA)
    album = m.get('album', pd.NA)
    title = m.get('title', pd.NA)
    type_ = m.get('type', pd.NA)
    ref_uid = m.get('uid', pd.NA)
    imdb_id = m.get('imdb_id', pd.NA)
    year = m.get('year', pd.NA)

type is a reserved keyword in Python, therefore we extract the value of the type field into the type_ variable (in order not to mess them). Note that video and audio matches have different set of contextual fields: for audio, the relevant fields are artist, album, title, type_, ref_uid; while for video, these fields are title, type_, ref_uid, imdb_id and year. However, if we want to have a unified dataframe we need to extract a superset of these fields, substituting with pd.NA the values irrelevant for this media type.

Finally, for each match we iterate over each segment and from each segment extract the data important for the analysis:

for s in m['segments']:
    duration = s['duration']
    que_offset = s['que_offset']
    ref_offset = s['ref_offset']

Note here that, instead of using the get function to extract a value, I use square brackets. In this case, if a value for duration, que_offset or ref_offset is missing, the function will generate an exception (KeyError). This is how you may express the fact that these fields are required in each record.

Then, I generate a flat dictionary from the extracted data and append it to the flattened_data list. In the end of the function, I create a pandas dataframe from the data stored in the list and return the dataframe.

I really like this approach because it is easy to remember, and can be used to parse any complex JSON file. Moreover, it allows me to generate additional aggregate information on each level. For instance, it is very easy to add the code to this function to calculate total duration for each match.

Flattening JSON using Pandas

Unfortunately, the approach described in the previous section is not very scalable. Indeed, to parse one type of JSON file you need to write a 40-lines-of-code function. If you have limited time, it is better to make use of pandas’s json_normalize function. From my experience, I see that this function is rarely used by data analysts. That’s why I decide to write this article.

The function, which uses json_normalize and achieves the result equivalent to the one considered in the previous section, is the following:

def load_vertx_data_using_json_normalize(json_file):
    with open(json_file, 'r') as jf:
        json_data = json.load(jf)
    df = json_normalize(
        json_data,
        record_path=['matches', 'segments'],
        meta=['source_uid', 'media_type', 'source_path', 'status', 
              ['matches', 'uid'], 
              ['matches', 'artist'],
              ['matches', 'album'],
              ['matches', 'title'],
              ['matches', 'type'],
              ['matches', 'imdb_id'],
              ['matches', 'year'],
        ],
        errors='ignore'
    )
    df = df.rename(
        columns={
            'source_uid': 'que_uid', 
            'matches.uid': 'ref_uid',
            'matches.artist': 'artist',
            'matches.album': 'album',
            'matches.title': 'title',
            'matches.type': 'type',
            'matches.imdb_id': 'imdb_id',
            'matches.year': 'year',
        },
    )
    
    return df

Although it is only 10 lines smaller, it is much easier to read. There, I perform only two actions: I use json_normalize to read and normalize json data, and rename function to rename the columns of the dataframe. The usage of the latter is not necessary but allows me to have the same column names as in the previous section.

Let us consider json_normalize function parameters closer. The first parameter json_data is a dictionary representation of json data.

The second parameter record_path specifies the path to the record data (usually, this is the data important for the analysis). Each piece of this data will become a separate row in the final dataframe. For instance, in our function we specify that the record data should be taken from matches -> segments.

The meta list enumerates all the contextual fields, data from which should appear in columns in the final dataframe. Note that if you want to include data from some internal fields, you need to specify the full path to them in the form of list (e.g., ). For instance, in our case if we want to have a column with reference UID data, we need to add ['matches', 'uid'] to the meta parameter list. Then, in the final dataframe, we would get a column which name would consist of these parts separated by . by default (you can change the separator using sep parameter).

I have mentioned in the previous section that media type entries have different set of context fields. The third parameter, errors (equal to ignore), asks the json_normalize function to ignore errors if some fields are not found.

Then, we use pandas’s rename function to rename columns in the final dataframe. I do this just to show how to get the same representation of the JSON data as we get in the previous section.

Yury Zhauniarovich
Yury Zhauniarovich
R&D Engineer
Lead Data Scientist

Related