category and group data

`category` and `group` data#

Inspection of the network traffic identifies two data feeds that look as if they are defined at the overall event level:

category-2025
allGroups-2025

The above path components are added to the end of the “base” live data feed API (Application Programming Interface) location https://www.dakar.live.worldrallyraidchampionship.com/api/ to generate the resource location web address:

APIs

APIs (application programming interfaces) are like web pages for machines. They define how to access or search for resources in a well-defined way, and provide results structured in a way that makes it other for other machines, or programming code, to work with.

These data sets are perhaps not the most interesting of the data feeds, but we have to start somewhere. And if you are a keen data junkie, it’s often the adminstrative data that lets you wrangle your datasets in really creative ways. If you are crioous enoguh, even the most mundane data set may trigger some interesting questions or observations. And if nothing else, working with any new data set can often be regarded as a fitness work out.

Importing Packages In Python#

The Python programming language is a quick-to-get-started-with programming language that is used as a general purpose programming language in a wide variety of contexts, from business applications to scientific research, gaming to data analysis.

Syntax is important, which is to say that where you put the spaces often matters, and you need to pay attention to spelling and case (capitalisation).

Some words, and some things that don’t quite look like properly spelled words, are reserved; which is to say, they are part of the actual language and you can’t use them for anything else.

Whilst you might think that the purpose of code is to get stuff done, most code is used to create code that is used to create other bits of code that may or may not be used to get stuff done at some point in the future. Added to that, most things you think you probably want to do with your code will have probably already been figured out by someone before you.

In the Python world, there is a huge number of packages that have been developed to do particular things.

In this book, or guide, or whatever it is, we’ll be using the pandas data analysis package a lot. This package has been developed over many years to provide a powerful set of tools for working with tabular datasets.

If you use spreadsheet applications such as Excel, you’ll already be familiar with the idea of tabular data. And you’ll probably also be familiar with the idea of column and row based operations, as well as formulas. The pandas package provides support for a similarly comnprehensive range of operations using code.

If you’re using one of the environments developed for use with this book, it should already have all the bits and bobs you need installed into it.

If the packages are installed and available in your environment, all you need to do to start working with them is to import them.

In the following code cell, I’m going to import the pandas package for working with tabular data. By convention, this is referred to by the shorthand label pd.

# This is a code cell
#
# Lines that start with a # are "comment" lines.
# Comments are not executed as code.
#  They provide an opportunity for writing free text,
# and explanatory text, in a code cell.

# To execute a line of code, click in the code cell to select it
# either click the play button, or use the keyboard shortcut `shift-enter`
import pandas as pd

Previewing the `Category` data#

We can preview the allGroups data by clicking on the link to the resource and viewing it “as a webpage”. By default, the display may not be overly friendly, so I tend to view it in a browser via a browser extension.

By inspection, I can see the data file is a JSON (Javascript Object Notation) data file: the presence of square brackets, identifying lists of items, and curly brackets ({}) wrapping key:value pairs give the game away. In Chrome, I use the JSONVue web extension for automatically previewing JSON data in a syntax sensitive way. The following screenshot of the category data is viewed in Chrome using this extension.

Screenshot of Dakar category data feed using Chrome JSONVue extension

We can also preview the contents of the file via browser developer tools by selecting the file in the Network view and then clicking on the Response tab to see what data was returned when the corresponding resource was requested.

We can also grab a copy of the data programmatically using the Python requests package.

To start with, let’s construct the URL for the resource we want to request. This is made up from two parts: the “base” API location, and the resource we want to request from that API.

One practical way of constructing the URL is to use a string template.

A string is just a series of alphanumeric and punctuation text characters.

The template part of the string allows us to specify a variable placeholder value that can be substituted with another value as required.

# In a programming language, we can use "variables" as shorthand
# ways of referring to other things.
# variableName = VARIABLE_VALUE

# In the following, the {path} element defineds a placeholder for a
# variable value substitution that can be applied later
dakar_api_template = "https://www.dakar.live.worldrallyraidchampionship.com/api/{path}"

We can define another variable that specifies the path to the API resource we want to request:

# Define the API path to the category resource
category_path = "category-2025"

# Substitute in the groups_path value into the API URL template
category_url = dakar_api_template.format(path=category_path)

# Placing a Python object at the end of a notebook code cell
# gives us a simple way of previewing its value when the cell is run
category_url

'https://www.dakar.live.worldrallyraidchampionship.com/api/category-2025'

Fudging the URL

If you are running the code associated with this book in the JupyterLite environment, you may run into problems trying to retrieve the data from the Dakar live data website. These problems arise from browser security policies associated with Cross-Origin Resource Sharing (CORS), which is to saying, trying to make requests to one website from another website.

One way around this problem is to request the data via a proxy (an intermediate website) that does support cross-origin requests.

When requesting a URL using the requests package I recommend wrapping the URL in the xurl() (cross-origin URL) function imported from the jupyterlite_simple_cors_proxy package when making requests using the requests package, and the furl() (file-object URL) function when making the request using the pandas read_json() package.

from jupyterlite_simple_cors_proxy import furl, xurl

Since we know the data returned from the API is presented as JSON data, we can attempt to parse it directly using the pd.read_json(URL) function:

# Use the pandas .read_json() function to load JSON data from a URL
# and parse it into a dataframe object
# Use furl() to handle CORS issues in Jupyterlite
category_df = pd.read_json(furl(category_url))

# View the result
category_df

	position	updatedAt	refueling	label	kmGoal	categoryLangs	mapDisplay	lastStage	promotionalDisplay	reference	shortLabel	liveDisplay	_bind	_origin	_id	_key	_updatedAt	_parent	isPointCat
0	8	2025-01-06T12:12:00+01:00	0	F	1000.0	[{'locale': 'en', 'variable': 'cat.name.F', 't...	True	12	True	2025-F	cat.name.F	False	category-2025	category-2025	0a0fb4dfc03bcb6b0e9c940b5a09ae05	label	1737386137350	millesime:28a3b2a601a2028494b004bb95233853	1.0
1	3	2025-01-06T12:11:14+01:00	1	A	NaN	[{'text': 'Car', 'locale': 'en', 'variable': '...	True	12	True	2025-A	cat.name.A	True	category-2025	category-2025	63b4f5da4591200d0a4cc239245eb03a	label	1737386137350	millesime:28a3b2a601a2028494b004bb95233853	NaN
2	7	2025-01-06T12:11:52+01:00	0	K	NaN	[{'locale': 'en', 'variable': 'cat.name.K', 't...	True	12	True	2025-K	cat.name.K	False	category-2025	category-2025	b0b0b0087d69431d87d04800e254c7c3	label	1737386137350	millesime:28a3b2a601a2028494b004bb95233853	1.0
3	0	2025-01-05T20:25:31+01:00	1	M	NaN	[{'text': 'Bike', 'variable': 'cat.name.M', 'l...	True	12	True	2025-M	cat.name.M	True	category-2025	category-2025	e377e0f7b61dcf8217845dc4fac8cc5d	label	1737386137350	millesime:28a3b2a601a2028494b004bb95233853	NaN

The category data seems to be quite limited, identifying some catregory codes (the labels F, A, K and M that we perhaps recognise from some of the API URL paths). There do appear to be text labels for various languages provided in the categoryLangs column.

Each entry in the categoryLangs column looks to be a list of dictionaries. We can select just that column from the dataframe as category_df["categoryLangs"] and then unpack each list to a separate row by applying the .explode() method to the column.

This unpacks each list to a separate row:

category_df[["reference", "categoryLangs"]].explode(column="categoryLangs")

	reference	categoryLangs
0	2025-F	{'locale': 'en', 'variable': 'cat.name.F', 'te...
0	2025-F	{'variable': 'cat.name.F', 'text': 'DAKAR FUTU...
0	2025-F	{'variable': 'cat.name.F', 'text': 'DAKAR FUTU...
0	2025-F	{'variable': 'cat.name.F', 'locale': 'ar', 'te...
1	2025-A	{'text': 'Car', 'locale': 'en', 'variable': 'c...
1	2025-A	{'text': 'Auto', 'locale': 'fr', 'variable': '...
1	2025-A	{'text': 'Coche', 'variable': 'cat.name.A', 'l...
1	2025-A	{'text': 'سيارات', 'locale': 'ar', 'variable':...
2	2025-K	{'locale': 'en', 'variable': 'cat.name.K', 'te...
2	2025-K	{'variable': 'cat.name.K', 'text': 'Classic', ...
2	2025-K	{'variable': 'cat.name.K', 'text': 'Classic', ...
2	2025-K	{'variable': 'cat.name.K', 'text': 'Classic', ...
3	2025-M	{'text': 'Bike', 'variable': 'cat.name.M', 'lo...
3	2025-M	{'text': 'Moto', 'locale': 'fr', 'variable': '...
3	2025-M	{'text': 'Moto', 'locale': 'es', 'variable': '...
3	2025-M	{'variable': 'cat.name.M', 'text': 'دراجة', 'l...

The result is a pandas.Series object (essentially, a 1-dimensional dataframe), where each item is a Python dictionary. It’s still rather hard to read in this form, but we can “normalise” it by parsing each dictionary into its own dataframe row using the pd.json_normalize() function:

# Parse the exploded Series of dicts into a dataframe
pd.json_normalize(category_df["categoryLangs"].explode())

	locale	variable	text
0	en	cat.name.F	DAKAR FUTURE MISSION 1000
1	fr	cat.name.F	DAKAR FUTURE MISSION 1000
2	es	cat.name.F	DAKAR FUTURE MISSION 1000
3	ar	cat.name.F	DAKAR FUTURE MISSION 1000
4	en	cat.name.A	Car
5	fr	cat.name.A	Auto
6	es	cat.name.A	Coche
7	ar	cat.name.A	سيارات
8	en	cat.name.K	Classic
9	fr	cat.name.K	Classic
10	es	cat.name.K	Classic
11	ar	cat.name.K	Classic
12	en	cat.name.M	Bike
13	fr	cat.name.M	Moto
14	es	cat.name.M	Moto
15	ar	cat.name.M	دراجة

By inspection, we can now see that the `A category refers to the Auto or Car category, the M category to Moto, which is to say, Bike, and so on.

Previewing the `allGroups` data#

As before, we can preview the allGroups data by clicking on the link and viewing the response in a web browser, or in browser developer tools by clicking on a corresponding netwrok traffic link and viewing the response.

Preview of JSON data response in browser dev tools

We can, of course, also retrieve the resource programmatically.

The quickest way of loading this data into a pandas dataframe is to load the JSON data from the live data API directly using the pd.read_json(URL) function.

But let’s step back a moment to see what sorts of thing might be involved in making such a request.

To begin with, how can we get the data from the API? The Python language provides a range of low level tools for making a connection to remote web server and retrieving the result. At a slightly higher level of abstraction, the requests package provides a set of tools for requesting resources from web locations, checking the response status of such requests, and accessing the response data.

To make use of the package, we need to load it it:

import requests

In order to make the request, we need to construct our target URL:

# Define the API path to the groups resource
groups_path = "allGroups-2025"

# If you are thinking ahead, you might be wondering
# whether we could specify a YEAR variable and then
# substitute that in. And indeed we could.
# YEAR = 2025
# In this case it would be simpler to direct apply the substitution
# using by using a Python f-string construction:
# groups_path = f"allGroups-{YEAR}"
# The preceding f" says: substitute in values of corresponding variables

# Substitute in the groups_path value into the API URL template
groups_url = dakar_api_template.format(path=groups_path)

# Preview the URL
groups_url

'https://www.dakar.live.worldrallyraidchampionship.com/api/allGroups-2025'

We can now request the resource from the URL using the get() command from the requests package:

# Request the resource from the specified URL
# Use xurl() to handle CORS issues in Jupyterlite
r = requests.get(xurl(groups_url))

# We can preview the returned text as a text string.
# The [:100] limit says: "just return the first 100 characters"
r.text[:100]

'[{"updatedAt":"2025-01-05T20:25:31+01:00","position":0,"shortLabel":"cat.name.F_M","liveDisplay":fal'

Noting that the data returned from the API is presented as JSON data, we can use the .json() method that is pre-defined on the requests response object to parse the data as such and return a corresponding Python object:

# Preview the response data as parsed JSON data
groups_json = r.json()

# Uncomment the final line to preview the result
# groups_json

By inspection of the groups_json dictionary, or even just the first few characters of the original response data, we note the data is structured in the following way:

[{"categoryGroupLangs":[{"text":"M1000 - Bike","locale":"en","variable":"cat.name.F_M"}...

The opening square bracket ([) identifies that we have a list of items. The curly brackets ({) identify that we likely have a list of object items (in Python, a set or a dictionary, depending on the structure of the contents).

Knowing that we have a list, we can index into the list to display the first item in the list, which has index value 0. We can also index the nth value as [n-1] and the last item in the list as index [-1].

# Preview the first item (index value 0) in the list
groups_json[0]
# We could also index the second item as [1],
# the fourth as [3], the n'th as [n-1],
# and the final item as [-1]

# Note that comments are ignored and don't count as
# "last line in the cell" display items...

{'updatedAt': '2025-01-05T20:25:31+01:00',
 'position': 0,
 'shortLabel': 'cat.name.F_M',
 'liveDisplay': False,
 'reference': '2025-F-M',
 'categoryGroupLangs': [{'locale': 'en',
   'text': 'M1000 - Bike',
   'variable': 'cat.name.F_M'},
  {'text': 'M1000 - Moto', 'locale': 'fr', 'variable': 'cat.name.F_M'},
  {'locale': 'es', 'variable': 'cat.name.F_M', 'text': 'M1000 - Moto'},
  {'variable': 'cat.name.F_M', 'text': 'M1000 - Bike', 'locale': 'ar'}],
 'label': 'M',
 'tinyLabel': 'M',
 'refueling': 0,
 'promotionalDisplay': False,
 '_bind': 'allGroups-2025',
 '_origin': 'categoryGroup-2025-F',
 '_id': 'ae6882fe1fcb5dbef0a7e42d7ca802c1',
 '_key': 'label',
 '_updatedAt': 1737386137501,
 '_parent': 'category-2025:0a0fb4dfc03bcb6b0e9c940b5a09ae05'}

Having a sense of the shape and structure of the data then suggests how we might want to try to represent it in a more convenient way, at least from the perspective of quickly previewing it more completely.

The pandas package has a range of tools for parsing JSON data objects into regular tabular datasets known as dataframes (commonly referred to by the short form, df). A simple list of disctionaries can be cast to a dataframe by calling pd.DataFrame(LIST_OF_DICTS)

We will see later how more complex nested dictionaries of the form often generated from JSON data files can often be more conveniently parsed to a dataframe using the pandas.json_normalize() function.

# Parse the list of dicts data into a dataframe
groups_df = pd.DataFrame(groups_json)
# For more complex JSON objects,
# pd.json_normalize() may be more appropriate

# Just display the "head()" of the df; which is to say,
# just display the first few rows, in this case, 5
groups_df.head(5)
# Using .head() without specifying the number of lines is also valid.

	updatedAt	position	shortLabel	liveDisplay	reference	categoryGroupLangs	label	tinyLabel	promotionalDisplay	_bind	_origin	_id	_key	_updatedAt	_parent	color
0	2025-01-05T20:25:31+01:00	0	cat.name.F_M	False	2025-F-M	[{'locale': 'en', 'text': 'M1000 - Bike', 'var...	M	M	False	allGroups-2025	categoryGroup-2025-F	ae6882fe1fcb5dbef0a7e42d7ca802c1	label	1737386137501	category-2025:0a0fb4dfc03bcb6b0e9c940b5a09ae05	NaN
1	2025-01-05T20:25:31+01:00	1	cat.name.F_Q	False	2025-F-Q	[{'locale': 'en', 'text': 'M1000 - Quad', 'var...	Q	Q	False	allGroups-2025	categoryGroup-2025-F	d02ea9bfe0a842c0650bba3c2cf33ecc	label	1737386137501	category-2025:0a0fb4dfc03bcb6b0e9c940b5a09ae05	NaN
2	2025-01-05T20:25:31+01:00	2	cat.name.F_A	False	2025-F-A	[{'variable': 'cat.name.F_A', 'text': 'M1000 -...	A	A	False	allGroups-2025	categoryGroup-2025-F	d7cb4906cafa111859658683cf820133	label	1737386137501	category-2025:0a0fb4dfc03bcb6b0e9c940b5a09ae05	NaN
3	2025-01-05T20:25:31+01:00	4	cat.name.F_T5	False	2025-F-T5	[{'variable': 'cat.name.F_T5', 'locale': 'en',...	T5	C	False	allGroups-2025	categoryGroup-2025-F	e0dcecc76fec5cd2cd095a09b821242d	label	1737386137501	category-2025:0a0fb4dfc03bcb6b0e9c940b5a09ae05	NaN
4	2025-01-05T20:25:31+01:00	3	cat.name.F_T4	False	2025-F-T4	[{'text': 'M1000 - SSV', 'variable': 'cat.name...	T4	SSV	False	allGroups-2025	categoryGroup-2025-F	f38a8e2eb62f13bc043106815cb187bd	label	1737386137501	category-2025:0a0fb4dfc03bcb6b0e9c940b5a09ae05	NaN

Skimming over the data, some of the columns don’t seem to provide information that looks to be immediately useful or useable. We can perhaps get a better feel for those columns by looking at the unique values that are contained within them:

groups_df['_key'].unique(), groups_df["promotionalDisplay"].unique(
), groups_df["refueling"].unique()

(array(['label'], dtype=object), array([False,  True]), array([0]))

The _key column doesn’t seem to be useful at all — the only value provided is label, so no rows are distinguished from any other rows. The refueling column also appears to be less than informative. Let’s drop those columns, along with the updatedAt and _updatedAt columns.

In the following cell, we automate the dropping of the columns, somewhat. To begin with, we specify a list of columns we want to drop from the dataframe. We iterate through each of those in turn using the for ... in ... construction, which takes each list item in turn then applies the consequent operations using that column value.

The conditional if statement checks that the column name is one of the columns in the current dataframe — if we try to drop a column that is not in the current dataframe, we get an error. (Note that we could override such errors by setting .drop(..., errors="ignore") but I thought this might be a good opportunity to show how we can do a simple column existence check).

# Drop the _key column (axis=1) from the data frame
# Using inplace=True applies the modification directly to the dataframe
for c in ['_key', 'refueling', 'updatedAt', '_updatedAt']:
    if c in groups_df.columns:
        groups_df.drop(c, axis=1, inplace=True)
    # We could avoid the need for checking the existence
    # of the column names by using:
    # groups_df.drop(c, axis=1, inplace=True, errors="ignore")

I’m intrigued by the "promotionalDisplay" column. What might be special about the columns where that value is True?

We can select particular rows in a dataframe by passing in a Boolean index vector of True/False, retaining those rows with the True index selection value. As the "promotionalDisplay" column is a column of booleans, we can take that column of values (as groups_df["promotionalDisplay"]) and select rows in the groups_df on that basis:

# Use the boolean promotionalDisplay column values to
# index the selection of rows from the dataframe
groups_df[groups_df["promotionalDisplay"]]
# The returned rows will all have promotionalDisplay=True

	position	shortLabel	liveDisplay	reference	categoryGroupLangs	label	tinyLabel	promotionalDisplay	_bind	_origin	_id	_parent	color
5	4	cat.name.A_T3	True	2025-A-T3	[{'text': 'Challenger', 'locale': 'en', 'varia...	T3	CHG	True	allGroups-2025	categoryGroup-2025-A	15f329900afa29e3e6b099ae681ebe12	category-2025:63b4f5da4591200d0a4cc239245eb03a	#E04E39
6	5	cat.name.A_T4	True	2025-A-T4	[{'text': 'SSV', 'locale': 'en', 'variable': '...	T4	SSV	True	allGroups-2025	categoryGroup-2025-A	423ea731fdcba5cda62c8334985889b0	category-2025:63b4f5da4591200d0a4cc239245eb03a	#A7C6ED
7	1	cat.name.A_T2	True	2025-A-T2	[{'locale': 'en', 'text': 'Stock', 'variable':...	T2	STK	True	allGroups-2025	categoryGroup-2025-A	4dac064bf100bc806b91e7f2e7758297	category-2025:63b4f5da4591200d0a4cc239245eb03a	#C7C9C7
8	0	cat.name.A_T1	True	2025-A-T1	[{'locale': 'en', 'variable': 'cat.name.A_T1',...	T1	ULT	True	allGroups-2025	categoryGroup-2025-A	b49155b3f5670d2a907aa01e319876b8	category-2025:63b4f5da4591200d0a4cc239245eb03a	#EBBC4E
9	6	cat.name.A_T5	True	2025-A-T5	[{'variable': 'cat.name.A_T5', 'text': 'Truck'...	T5	TRK	True	allGroups-2025	categoryGroup-2025-A	f1a437ac1135c9d9a5e33f5096f95259	category-2025:63b4f5da4591200d0a4cc239245eb03a	#2D2926
16	0	cat.name.M_GP	True	2025-M-GP	[{'text': 'Rally GP', 'locale': 'en', 'variabl...	GP	GP	True	allGroups-2025	categoryGroup-2025-M	1e9e9ae28b6e9498ca39609d5e403992	category-2025:e377e0f7b61dcf8217845dc4fac8cc5d	#EBBC4E
17	1	cat.name.M_R2	True	2025-M-R2	[{'locale': 'en', 'variable': 'cat.name.M_R2',...	R2	R2	True	allGroups-2025	categoryGroup-2025-M	eb1c119b536c526d5acea2a1e8a4d854	category-2025:e377e0f7b61dcf8217845dc4fac8cc5d	#A7C6ED

Presenting the data in the tabular form certainly makes it easier to review, at least, when viewed interactivley - print PDF documents are not so useful for previewing very wide tables. This data still doesn’t really feel very interesting, but it does provide labels that distinguish various categories of vehicle, so that may be useful.

It also looks like there may be some other possibly useful label information in the categoryGroupLangs is hard to read in that display. So let’s look at the contents of that categoryGroupLangs column by selecting just that column; and because the table has a large number of rows, let’s just preview the first 3 rows; and to simplify the display further, let’s cast the result to a list, which won’t be subject to any summarising dataframe display options preferred by the notebook cell output.

# Select a column, limit the display to the first three dataframe rows,
# and then display the single column, three row resulting df to a list
groups_df["categoryGroupLangs"][:3].to_list()

[[{'locale': 'en', 'text': 'M1000 - Bike', 'variable': 'cat.name.F_M'},
  {'text': 'M1000 - Moto', 'locale': 'fr', 'variable': 'cat.name.F_M'},
  {'locale': 'es', 'variable': 'cat.name.F_M', 'text': 'M1000 - Moto'},
  {'variable': 'cat.name.F_M', 'text': 'M1000 - Bike', 'locale': 'ar'}],
 [{'locale': 'en', 'text': 'M1000 - Quad', 'variable': 'cat.name.F_Q'},
  {'text': 'M1000 - Quad', 'variable': 'cat.name.F_Q', 'locale': 'fr'},
  {'text': 'M1000 - Quad', 'locale': 'es', 'variable': 'cat.name.F_Q'},
  {'text': 'M1000 - Quad', 'variable': 'cat.name.F_Q', 'locale': 'ar'}],
 [{'variable': 'cat.name.F_A', 'text': 'M1000 - Car', 'locale': 'en'},
  {'text': 'M1000 - Auto', 'variable': 'cat.name.F_A', 'locale': 'fr'},
  {'text': 'M1000 - Coche', 'locale': 'es', 'variable': 'cat.name.F_A'},
  {'variable': 'cat.name.F_A', 'locale': 'ar', 'text': 'M1000 - Car'}]]

What we see here is that we appear to have several different language labels associated with each row. It might be useful if we could somehow “bend” those values into different language columns.

This is another area where the pandas package comes into its own. But first, let’s think through what we need to do. In each row of the table, the categoryGroupLangs cell contains a list of dictionaries. Each dictionary has a variable, (which by inspection of the dataframe seems to match the shortLabel column value in the dataframe); the locale specfies a language code, and the text is a description of the vehicle type.

If we explode() the categoryGroupLangs column, we can unpack the list in each cell and generate one row for each list item. This produces a pandas.Series datatype; each index value in the Series matches the index value of the parent row in the dataframe that was exploded to form that Series row.

groups_df["categoryGroupLangs"].explode()

   {'locale': 'en', 'text': 'M1000 - Bike', 'vari...
   {'text': 'M1000 - Moto', 'locale': 'fr', 'vari...
   {'locale': 'es', 'variable': 'cat.name.F_M', '...
   {'variable': 'cat.name.F_M', 'text': 'M1000 - ...
   {'locale': 'en', 'text': 'M1000 - Quad', 'vari...
                            ...                        
  {'locale': 'ar', 'variable': 'cat.name.M_GP', ...
  {'locale': 'en', 'variable': 'cat.name.M_R2', ...
  {'text': 'Rally 2', 'locale': 'fr', 'variable'...
  {'variable': 'cat.name.M_R2', 'text': 'Rally 2...
  {'text': 'رالي 2', 'variable': 'cat.name.M_R2'...
Name: categoryGroupLangs, Length: 72, dtype: object

We can map the dictionary in each row of the Series to a dataframe row by using the pd.json_normalize() function. Note that this resets the index values, so at this point we have lost an index based mapping back to the original dataframe rows.

category_langs_df = pd.json_normalize(
    groups_df["categoryGroupLangs"].explode())
category_langs_df

	locale	text	variable
0	en	M1000 - Bike	cat.name.F_M
1	fr	M1000 - Moto	cat.name.F_M
2	es	M1000 - Moto	cat.name.F_M
3	ar	M1000 - Bike	cat.name.F_M
4	en	M1000 - Quad	cat.name.F_Q
...	...	...	...
67	ar	رالي GP	cat.name.M_GP
68	en	Rally 2	cat.name.M_R2
69	fr	Rally 2	cat.name.M_R2
70	es	Rally 2	cat.name.M_R2
71	ar	رالي 2	cat.name.M_R2

72 rows × 3 columns

Another way of reaching the same result is to parse the nested categoryGroupLangs dictionary in each original JSON record using the pd.json_normalize() function:

# Parse a specified path in the original JSON data records
pd.json_normalize(groups_json, record_path="categoryGroupLangs")

	locale	text	variable
0	en	M1000 - Bike	cat.name.F_M
1	fr	M1000 - Moto	cat.name.F_M
2	es	M1000 - Moto	cat.name.F_M
3	ar	M1000 - Bike	cat.name.F_M
4	en	M1000 - Quad	cat.name.F_Q
...	...	...	...
67	ar	رالي GP	cat.name.M_GP
68	en	Rally 2	cat.name.M_R2
69	fr	Rally 2	cat.name.M_R2
70	es	Rally 2	cat.name.M_R2
71	ar	رالي 2	cat.name.M_R2

72 rows × 3 columns

Inspection of this data shows it to be in a long format, with each row having a very simple structure.

We note from repeated variable values across the rows that we have different locale/text labels for the same variable value. This suggests that it may be more informative to view the data in a wide format, where each row includes the values for each language label for a given variable value.

# Wrapping the code in brackets is a syntactic trick that
# allows us to fettle the presentation of the code,
# splitting dot separated components out onto separate lines
categoryGroupLangs_wide = (
    pd.json_normalize(
        groups_df["categoryGroupLangs"].explode())
    .pivot(
        index='variable',
        columns='locale',
        values='text',
    )
    .reset_index()
)

categoryGroupLangs_wide

locale	variable	ar	en	es	fr
0	cat.name.A_T1	Ultimate	Ultimate	Ultimate	Ultimate
1	cat.name.A_T2	Stock	Stock	Stock	Stock
2	cat.name.A_T3	Challenger	Challenger	Challenger	Challenger
3	cat.name.A_T4	SSV	SSV	SSV	SSV
4	cat.name.A_T5	شاحنة	Truck	Camión	Camion
5	cat.name.F_A	M1000 - Car	M1000 - Car	M1000 - Coche	M1000 - Auto
6	cat.name.F_M	M1000 - Bike	M1000 - Bike	M1000 - Moto	M1000 - Moto
7	cat.name.F_Q	M1000 - Quad	M1000 - Quad	M1000 - Quad	M1000 - Quad
8	cat.name.F_T4	M1000 - SSV	M1000 - SSV	M1000 - SSV	M1000 - SSV
9	cat.name.F_T5	M1000 - Truck	M1000 - Truck	M1000 - Camión	M1000 - Camion
10	cat.name.K_H1	Classic Auto (Low average)	Classic Auto (Low average)	Classic Auto (Media Baja)	Classic Auto (Moyenne Basse)
11	cat.name.K_H1T	Classic Truck (Low average)	Classic Truck (Low average)	Classic Truck (Media Baja)	Classic Truck (Moyenne Basse)
12	cat.name.K_H2	Classic Auto (moderate average)	Classic Auto (moderate average)	Classic Auto (Media moderada)	Classic Auto (Moyenne moyenne)
13	cat.name.K_H2T	Classic Truck (moderate average)	Classic Truck (moderate average)	Classic Truck (Media moderada)	Classic Truck (Moyenne moyenne)
14	cat.name.K_H3	Classic Auto (intermediate average)	Classic Auto (intermediate average)	Classic Auto (Media intermedia)	Classic Auto (Moyenne intermédiaire)
15	cat.name.K_H4	Classic Auto (high average)	Classic Auto (high average)	Classic Auto (Media de alta)	Classic Auto (Moyenne haute)
16	cat.name.M_GP	رالي GP	Rally GP	Rally GP	Rally GP
17	cat.name.M_R2	رالي 2	Rally 2	Rally 2	Rally 2

Recalling that the variable values match the shortLabel value in the original dataframe, we can merge categoryGroupLangs_wide columns back into groups_df dataframe based on matching variable/shortLabel values:

# Merge the original dataframe and the derived wide data frame
# using matching values across the shortLabel and variable columns
groups_df = pd.merge(groups_df, categoryGroupLangs_wide,
                     left_on="shortLabel", right_on='variable')

# Drop the redundant categoryGroupLangs and variable columns
groups_df.drop("categoryGroupLangs", axis=1, inplace=True)
groups_df.drop("variable", axis=1, inplace=True)

# View the column headers of the new dataframe
groups_df.columns

Index(['position', 'shortLabel', 'liveDisplay', 'reference', 'label',
       'tinyLabel', 'promotionalDisplay', '_bind', '_origin', '_id', '_parent',
       'color', 'ar', 'en', 'es', 'fr'],
      dtype='object')

The new groups_df dataframe now includes columns for the various language labels for each category.

Loading the Data from a Single Line of Code#

The previous sections show how we can build up a series of steps that request a data feed from the Dakar rally live data site and then manipulate the data until we get it into a form we are happy with.

We can also bundle these steps into one or more functions inside a package of our own creation to make it easier to distribute and use these functions.

Adding the `groups` Data to a Database#

Loading the data “live

category and group data

Contents

category and group data#

Importing Packages In Python#

Previewing the Category data#

Previewing the allGroups data#

Loading the Data from a Single Line of Code#

Adding the groups Data to a Database#

`category` and `group` data#

Previewing the `Category` data#

Previewing the `allGroups` data#

Adding the `groups` Data to a Database#