Stage Data#
The stage data provides general information about each stage, including the stage code (identifier), stage number, stage name, start and end dates, and cancellation status. It also identifies stage sector information, distinguishing link sectors from competitive stage sectors; within a competitive stage sector, a list of sections identifies the start and end distances of each section into the competitive stage, along with its terrain type.
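Schematically, the nested structure looks something like this. The field names are taken from the API responses we inspect below; the values are purely illustrative:

```python
# Illustrative sketch of a nested stage record
# (field names from the API; values are made up for illustration)
stage_record = {
    "code": "01000",  # stage identifier
    "stage": 1,  # stage number
    "startDate": "2025-01-04T00:00:00+03:00",
    "endDate": "2025-01-04T00:00:00+03:00",
    "isCancelled": 0,
    "sectors": [  # link and competitive sectors
        {"code": "01100", "type": "LIA"},  # link (liaison) sector
        {
            "code": "01200",
            "type": "SPE",  # competitive (special) sector
            "grounds": [  # terrain types within the special
                {
                    "name": "ground.name.1",
                    "sections": [  # start/finish distances (km) into the stage
                        {"section": 1, "start": 0, "finish": 27}
                    ],
                }
            ],
        },
        {"code": "01300", "type": "LIA"},  # link sector back to the bivouac
    ],
}
```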
# Load in the required packages
import pandas as pd
from jupyterlite_simple_cors_proxy import furl, xurl
# Also load in our custom function
from dakar_utils_2025 import mergeInLangLabels
# Generate the API URL pattern
dakar_api_template = "https://www.dakar.live.worldrallyraidchampionship.com/api/{path}"
# Define the year
YEAR = 2025
# Define the category
CATEGORY = "A"
# Define the API path to the stage resource
# Use a Python f-string to instantiate variable values directly
stage_path = f"stage-{YEAR}-{CATEGORY}"
# Define the URL
stage_url = dakar_api_template.format(path=stage_path)
# Preview the path and the URL
stage_path, stage_url
('stage-2025-A',
'https://www.dakar.live.worldrallyraidchampionship.com/api/stage-2025-A')
# Load in data
# Use furl() to handle CORS issues in Jupyterlite
stage_df = pd.read_json(furl(stage_url))
stage_df.columns
Index(['updatedAt', 'isCancelled', 'mapCategoryDisplay', 'timezone',
'isDelayed', 'sectors', 'stageLangs', 'startDate', 'length', 'stage',
'type', 'date', 'stageWithBonus', 'podiumDisplay', 'mapDisplay', 'code',
'endDate', 'marathon', 'groupsAsCategory', 'lastColumnDisplay',
'generalDisplay', '_bind', '_origin', '_id', '_key', '_updatedAt',
'_parent', 'qualities', '$category', '_gets', 'categoryLangs',
'reference', 'liveDisplay', 'refueling', 'position', 'lastStage',
'shortLabel', 'promotionalDisplay', 'label'],
dtype='object')
Inspecting the column names, I wonder: does the stageLangs column define a set of language label mappings?
# Preview the stageLangs value in the first row
stage_df["stageLangs"][0]
[{'variable': 'stage.name.09000', 'text': 'RIYADH > HARADH', 'locale': 'en'},
{'variable': 'stage.name.09000', 'locale': 'fr', 'text': 'RIYADH > HARADH'},
{'text': 'RIYADH > HARADH', 'locale': 'es', 'variable': 'stage.name.09000'},
{'locale': 'ar', 'variable': 'stage.name.09000', 'text': 'الرياض > حرض'}]
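Each entry pairs a locale with a label text, so for a single record the list can be collapsed into a simple locale-to-text dictionary with a comprehension (a quick sketch, using a couple of the entries above as data):

```python
# Two stageLangs entries, as returned by the API
langs = [
    {"variable": "stage.name.09000", "text": "RIYADH > HARADH", "locale": "en"},
    {"variable": "stage.name.09000", "locale": "fr", "text": "RIYADH > HARADH"},
]

# Collapse the list of label dicts to a locale -> text mapping
labels = {d["locale"]: d["text"] for d in langs}
# e.g. labels["en"] gives 'RIYADH > HARADH'
```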
It looks like it does, so we can merge that data in using the function we defined in the previous chapter. But which column do we need to merge against?
# Preview the values in the first row in the dataframe
stage_df.iloc[0]
updatedAt 2025-01-11T14:53:48+01:00
isCancelled 0.0
mapCategoryDisplay A
timezone Asia/Riyadh
isDelayed 0.0
sectors [{'id': 23937, 'groupsLength': [], 'code': '09...
stageLangs [{'variable': 'stage.name.09000', 'text': 'RIY...
startDate 2025-01-14T00:00:00+03:00
length 589.0
stage 9.0
type STA
date 2025-01-14 00:00:00+03:00
stageWithBonus 0.0
podiumDisplay ce
mapDisplay True
code 09000
endDate 2025-01-14T00:00:00+03:00
marathon 0.0
groupsAsCategory None
lastColumnDisplay ce
generalDisplay 1.0
_bind stage-2025-A
_origin stage-2025-A
_id 5f83172bf9f0bb97d14b83a7c0099fa8
_key stage
_updatedAt 1737386142308
_parent category-2025:63b4f5da4591200d0a4cc239245eb03a
qualities {'ce': {'*': 2, 'T5': 2, 'T1': 2, 'T3': 2, 'T2...
$category category-2025:63b4f5da4591200d0a4cc239245eb03a
_gets {'category': '$category'}
categoryLangs NaN
reference NaN
liveDisplay NaN
refueling NaN
position NaN
lastStage NaN
shortLabel NaN
promotionalDisplay NaN
label NaN
Name: 0, dtype: object
Inspection of the columns in the original dataframe suggests there is no direct mapping onto the variable value, which takes the form stage.name.09000. We do note, however, that there is a code value which maps onto the numeric part of variable, so we can create our own merge column, which we might call variable, and then merge on that.
# Create a dummy column to match on
stage_df["variable"] = "stage.name." + stage_df["code"]
# Update the dataframe by using our new function to
# merge in the exploded and widened language labels
stage_df = mergeInLangLabels(stage_df, "stageLangs", key="variable")
# Preview the dataframe, limited to a few illustrative columns
stage_df[["code", "en", "ar"]].head()
| | code | en | ar |
|---|---|---|---|
| 0 | 09000 | RIYADH > HARADH | الرياض > حرض |
| 1 | 07000 | AL DUWADIMI > AL DUWADIMI | الدوادمي > الدوادمي |
| 2 | 06000 | HAIL > AL DUWADIMI | حائل > الدوادمي |
| 3 | 10000 | HARADH > SHUBAYTAH | حرض > شبيطة |
| 4 | 0P000 | BISHA | BISHA |
Let’s skim the first few rows of the table, also adding in a new column to help identify the stage code more explicitly, and sorting the rows by start date.
stage_df['stage_code'] = stage_df['code']
stage_df.sort_values("startDate", inplace=True)
stage_df.reset_index(drop=True, inplace=True)
stage_df.head()
| | updatedAt | isCancelled | mapCategoryDisplay | timezone | isDelayed | sectors | startDate | length | stage | type | ... | position | lastStage | shortLabel | promotionalDisplay | label | ar | en | es | fr | stage_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-01-08T17:24:08+01:00 | 0.0 | NaN | Asia/Riyadh | 0.0 | [{'groupsLength': [], 'startTime': '2025-01-03... | 2025-01-03T00:00:00+03:00 | 77.0 | 0.0 | PRO | ... | NaN | NaN | NaN | NaN | NaN | BISHA | BISHA | BISHA | BISHA | 0P000 |
| 1 | 2025-01-08T17:24:08+01:00 | 0.0 | A | Asia/Riyadh | 0.0 | [{'id': 23803, 'startTime': '2025-01-04T04:30:... | 2025-01-04T00:00:00+03:00 | 499.0 | 1.0 | STA | ... | NaN | NaN | NaN | NaN | NaN | بيشة > بيشة | BISHA > BISHA | BISHA > BISHA | BISHA > BISHA | 01000 |
| 2 | 2025-01-08T17:24:08+01:00 | 0.0 | M | Asia/Riyadh | 0.0 | [{'powerStage': False, 'length': 45, 'startTim... | 2025-01-05T00:00:00+03:00 | 1058.0 | 2.0 | STA | ... | NaN | NaN | NaN | NaN | NaN | بيشة > بيشة | BISHA > BISHA | BISHA > BISHA | BISHA > BISHA | 02000 |
| 3 | 2025-01-08T17:24:08+01:00 | 0.0 | A | Asia/Riyadh | 0.0 | [{'powerStage': False, 'arrivalTime': '2025-01... | 2025-01-07T00:00:00+03:00 | 963.0 | 3.0 | STA | ... | NaN | NaN | NaN | NaN | NaN | بيشة > الحناكية | BISHA > AL HENAKIYAH | BISHA > AL HENAKIYAH | BISHA > AL HENAKIYAH | 03000 |
| 4 | 2025-01-08T17:24:08+01:00 | 0.0 | A | Asia/Riyadh | 0.0 | [{'type': 'LIA', 'grounds': [{'groundLangs': [... | 2025-01-08T00:00:00+03:00 | 588.0 | 4.0 | STA | ... | NaN | NaN | NaN | NaN | NaN | الحناكية > العلا | AL HENAKIYAH > ALULA | AL HENAKIYAH > ALULA | AL HENAKIYAH > ALULA | 04000 |
5 rows × 43 columns
From the date columns, we noticed that the original dataframe was not in order, which is why we sorted the rows by start date. We could alternatively have ordered by stage number, which appears to be given by the stage column.
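As a sketch of ordering by stage number instead (using a toy stand-in rather than the real stage_df), sorting on the stage column puts the prologue (stage 0) first:

```python
import pandas as pd

# Toy stand-in for stage_df: sorting on the stage number column
# is an alternative to sorting on startDate
df = pd.DataFrame({
    "stage_code": ["02000", "0P000", "01000"],
    "stage": [2, 0, 1],
    "startDate": ["2025-01-05", "2025-01-03", "2025-01-04"],
})

# Order by stage number and renumber the index
df = df.sort_values("stage").reset_index(drop=True)
```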
Inspection of the data also reveals that the sectors column typically contains three elements: a link sector, the competitive stage sector, and a second link sector. The competitive stage sector is further subdivided into a list of grounds, or terrain types, with associated language mappings. Within each ground is a list of sections, with a section number and start and finish values that identify the start and finish distances in kilometres into the stage for that section.
Parsing each row#
Let’s get the JSON sectors data for a single row and see if we can find a sensible way of parsing it.
I am imagining producing something like two linked dataframes:
one containing the top-level sector data for each stage;
one containing the surface type by section; I imagine this dataframe to have columns stage, sector, section, start, finish, and surface_type, and then perhaps language mappings for the surface type.
Let’s start by looking at the metadata for each sector.
sectors_df = pd.json_normalize(stage_df["sectors"].explode())
sectors_df.head()
| | groupsLength | startTime | code | type | length | powerStage | arrivalTime | id | grounds | groupsLength.A_T5 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | 2025-01-03T06:45:00+00:00 | 0P100 | LIA | 26 | False | 2025-01-03T07:35:00+00:00 | 23800 | NaN | NaN |
| 1 | [] | 2025-01-03T07:35:00+00:00 | 0P200 | SPE | 29 | False | 2025-01-03T07:54:00+00:00 | 23801 | [{'percentage': 6, 'color': '#753a05', 'sectio... | NaN |
| 2 | [] | 2025-01-03T07:54:00+00:00 | 0P300 | LIA | 24 | False | 2025-01-03T08:24:00+00:00 | 23802 | NaN | NaN |
| 3 | [] | 2025-01-04T04:30:00+00:00 | 01100 | LIA | 86 | False | 2025-01-04T06:25:00+00:00 | 23803 | NaN | NaN |
| 4 | [] | 2025-01-04T06:25:00+00:00 | 01200 | SPE | 413 | False | 2025-01-04T10:35:00+00:00 | 23804 | [{'name': 'ground.name.1', 'percentage': 18, '... | NaN |
The stage number is not explicitly listed, but we can derive it from the first two characters of the code.
We can also add a “sector number”.
# Generate an appropriate stage code
sectors_df['stage_code'] = sectors_df['code'].str[:2] + '000'
# Generate a sector number
sectors_df['sector_number'] = sectors_df.groupby('stage_code').cumcount() + 1
sectors_df.head()
| | groupsLength | startTime | code | type | length | powerStage | arrivalTime | id | grounds | groupsLength.A_T5 | stage_code | sector_number |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | 2025-01-03T06:45:00+00:00 | 0P100 | LIA | 26 | False | 2025-01-03T07:35:00+00:00 | 23800 | NaN | NaN | 0P000 | 1 |
| 1 | [] | 2025-01-03T07:35:00+00:00 | 0P200 | SPE | 29 | False | 2025-01-03T07:54:00+00:00 | 23801 | [{'percentage': 6, 'color': '#753a05', 'sectio... | NaN | 0P000 | 2 |
| 2 | [] | 2025-01-03T07:54:00+00:00 | 0P300 | LIA | 24 | False | 2025-01-03T08:24:00+00:00 | 23802 | NaN | NaN | 0P000 | 3 |
| 3 | [] | 2025-01-04T04:30:00+00:00 | 01100 | LIA | 86 | False | 2025-01-04T06:25:00+00:00 | 23803 | NaN | NaN | 01000 | 1 |
| 4 | [] | 2025-01-04T06:25:00+00:00 | 01200 | SPE | 413 | False | 2025-01-04T10:35:00+00:00 | 23804 | [{'name': 'ground.name.1', 'percentage': 18, '... | NaN | 01000 | 2 |
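The groupby().cumcount() idiom used to generate the sector number simply counts rows within each group, starting from zero, preserving the original row order. A minimal standalone demonstration with toy data:

```python
import pandas as pd

# Toy dataframe: three sectors for one stage, two for another
df = pd.DataFrame({"stage_code": ["0P000", "0P000", "0P000", "01000", "01000"]})

# cumcount() numbers rows within each group from 0; add 1 to start at 1
df["sector_number"] = df.groupby("stage_code").cumcount() + 1
```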
If the sector is an SPE type, we have the grounds information to play with. This is split over various sections, with start and finish distances identifying each section, as well as a surface type and a “percentage” value.
# Get the sectors with grounds data
competitive_sectors = sectors_df[['grounds', 'code']].dropna(
axis="index").explode('grounds').reset_index(drop=True)
competitive_sectors.head()
| | grounds | code |
|---|---|---|
| 0 | {'percentage': 6, 'color': '#753a05', 'section... | 0P200 |
| 1 | {'color': '#1dc942', 'percentage': 24, 'sectio... | 0P200 |
| 2 | {'color': '#efc07c', 'groundLangs': [{'variabl... | 0P200 |
| 3 | {'color': '#753a05', 'name': 'ground.name.1', ... | 0P200 |
| 4 | {'sections': [{'finish': 16, 'start': 14, 'sec... | 0P200 |
Tidy the sectors dataframe:
# Sort sectors by stage and sector
sectors_df.sort_values("code", inplace=True)
# Simplify the sectors dataframe
sectors_df = sectors_df[["stage_code", "code", "id", "sector_number", "powerStage",
"length", "startTime", "type", "arrivalTime"]].reset_index(drop=True)
sectors_df.head()
| | stage_code | code | id | sector_number | powerStage | length | startTime | type | arrivalTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 01000 | 01100 | 23803 | 1 | False | 86 | 2025-01-04T04:30:00+00:00 | LIA | 2025-01-04T06:25:00+00:00 |
| 1 | 01000 | 01200 | 23804 | 2 | False | 413 | 2025-01-04T06:25:00+00:00 | SPE | 2025-01-04T10:35:00+00:00 |
| 2 | 02000 | 02100 | 23805 | 1 | False | 45 | 2025-01-05T03:25:00+00:00 | LIA | 2025-01-05T04:40:00+00:00 |
| 3 | 02000 | 02200 | 23806 | 2 | False | 967 | 2025-01-05T04:40:00+00:00 | SPE | 2025-01-05T13:00:00+00:00 |
| 4 | 02000 | 02300 | 23807 | 3 | False | 46 | 2025-01-05T13:00:00+00:00 | LIA | 2025-01-05T14:00:00+00:00 |
And simplify and tidy the “top level” stages dataframe:
stage_cols = ['stage_code', 'stage', 'date', 'startDate', 'endDate', 'isCancelled', 'generalDisplay', 'isDelayed', 'marathon',
'length', 'type', 'timezone', 'stageWithBonus', 'mapCategoryDisplay', 'podiumDisplay', '_bind', 'ar', 'en', 'es', 'fr']
stage_df = stage_df[stage_cols]
Let’s see if we can pull out the data for the mixed surface types in a sensible way.
competitive_sectors.iloc[0].to_dict()
{'grounds': {'percentage': 6,
'color': '#753a05',
'sections': [{'finish': 1, 'start': 0, 'section': 1},
{'start': 2, 'finish': 2, 'section': 3},
{'finish': 6, 'start': 6, 'section': 5}],
'name': 'ground.name.1',
'groundLangs': [{'locale': 'en',
'text': 'Dirt Track',
'variable': 'ground.name.1'},
{'locale': 'fr', 'variable': 'ground.name.1', 'text': 'Piste terre'},
{'variable': 'ground.name.1', 'text': 'Pista de tierra', 'locale': 'es'},
{'locale': 'ar', 'variable': 'ground.name.1', 'text': 'الحجارة'}]},
'code': '0P200'}
We can pull the data out into three components:
the percentage of each surface type by stage
the surface type for each section
the surface type language labels
def flatten_grounds_data(df):
"""
Flatten nested grounds data into section, percentage, and surface label tables.
Parameters:
df (pandas.DataFrame): DataFrame with 'grounds' and 'code' columns, where grounds contains nested dictionary data
Returns:
tuple of pandas.DataFrame: (section_df, percentage_df, surfaces_df), with one row per section, per surface percentage entry, and per surface type respectively
"""
# Create empty lists to store flattened data
flattened_data = []
percentage_data = []
surface_types = []
_surface_types = []
# Sort by stage sector
df.sort_values("code", inplace=True)
# Create a mapping for translations
for _, row in df.iterrows():
ground_data = row['grounds']
# Create a translations dictionary
translations = {f"text_{lang['locale']}": lang['text']
for lang in ground_data['groundLangs']}
_stype = translations["text_en"].lower()
percentage_data.append(
{'code': row['code'],
'percentage': ground_data['percentage'],
# 'ground_name': ground_data['name'],
'color': ground_data['color'], "type": _stype})
if _stype not in _surface_types:
_surface_types.append(_stype)
surface_types.append({"type": _stype, **translations})
# Create a record for each section
for section in ground_data['sections']:
section_record = {
'code': row['code'],
'ground_name': ground_data['name'],
'section': section['section'],
'start': section['start'],
'finish': section['finish'],
'color': ground_data['color'],
# 'percentage': ground_data['percentage'],
"type": _stype
}
flattened_data.append(section_record)
# Create DataFrame from flattened data
section_df = pd.DataFrame(flattened_data)
percentage_df = pd.DataFrame(percentage_data)
surfaces_df = pd.DataFrame(surface_types)
# Sort columns for better organization
# Ignore: 'percentage', 'ground_name',
fixed_columns = ['code', 'section', 'start', 'finish',
'color', "type"]
lang_columns = [
col for col in section_df.columns if col.startswith('text_')]
section_df = section_df[fixed_columns + sorted(lang_columns)]
# Drop duplicates if any still exist
section_df = section_df.drop_duplicates()
section_df = section_df.sort_values(
['code', 'section'])
section_df.reset_index(drop=True, inplace=True)
return section_df, percentage_df, surfaces_df
section_surfaces, stage_surfaces, surfaces = flatten_grounds_data(
competitive_sectors)
section_surfaces.head()
| | code | section | start | finish | color | type |
|---|---|---|---|---|---|---|
| 0 | 01200 | 1 | 0 | 27 | #efc07c | sand |
| 1 | 01200 | 2 | 27 | 32 | #753a05 | dirt track |
| 2 | 01200 | 3 | 32 | 32 | #1dc942 | gravel track |
| 3 | 01200 | 4 | 32 | 41 | #efc07c | sand |
| 4 | 01200 | 5 | 41 | 42 | #753a05 | dirt track |
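As a quick sanity check on a table of this shape, we might verify that the section start distances never decrease within a sector. A sketch using a toy version of section_surfaces (note that zero-length sections, where start equals finish, are allowed, presumably due to rounding to whole kilometres):

```python
import pandas as pd

# Toy version of the section_surfaces table for one sector
df = pd.DataFrame({
    "code": ["01200"] * 3,
    "section": [1, 2, 3],
    "start": [0, 27, 32],
    "finish": [27, 32, 32],
})

# Within each sector, start distances should be non-decreasing
ok = df.groupby("code")["start"].apply(lambda s: s.is_monotonic_increasing).all()
```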
stage_surfaces.head(10)
| | code | percentage | color | type |
|---|---|---|---|---|
| 0 | 01200 | 18 | #753a05 | dirt track |
| 1 | 01200 | 18 | #753a05 | dirt track |
| 2 | 01200 | 28 | #1dc942 | gravel track |
| 3 | 01200 | 53 | #efc07c | sand |
| 4 | 01200 | 18 | #753a05 | dirt track |
| 5 | 01200 | 28 | #1dc942 | gravel track |
| 6 | 01200 | 53 | #efc07c | sand |
| 7 | 01200 | 18 | #753a05 | dirt track |
| 8 | 01200 | 53 | #efc07c | sand |
| 9 | 01200 | 28 | #1dc942 | gravel track |
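The stage_surfaces preview shows repeated rows: each surface type appears once per ground entry, and a surface may occur in several ground entries for the same sector. The repeats can be collapsed with drop_duplicates(). A sketch with toy data:

```python
import pandas as pd

# Toy version of stage_surfaces with repeated surface rows
df = pd.DataFrame({
    "code": ["01200", "01200", "01200"],
    "percentage": [18, 18, 53],
    "type": ["dirt track", "dirt track", "sand"],
})

# Collapse exact-duplicate rows and renumber the index
deduped = df.drop_duplicates().reset_index(drop=True)
```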
surfaces
| | type | text_en | text_fr | text_es | text_ar |
|---|---|---|---|---|---|
| 0 | dirt track | Dirt Track | Piste terre | Pista de tierra | الحجارة |
| 1 | gravel track | Gravel Track | Piste empierrée | Tierra | التراب |
| 2 | sand | Sand | Sable | Arena | الرمال |
| 3 | asphalt road | Asphalt Road | Goudron | Asfalto | الزفت |
| 4 | dunes | Dunes | Dunes | Dunas | الكثبان |
stage_surfaces[stage_surfaces["code"]=="11200"]
| | code | percentage | color | type |
|---|---|---|---|---|
| 121 | 11200 | 60 | #efc07c | sand |
| 122 | 11200 | 40 | #ff7200 | dunes |
| 123 | 11200 | 38 | #efc07c | sand |
| 124 | 11200 | 61 | #ff7200 | dunes |
Differences Between Stage Data for Different Categories#
Are there actually any differences in the stage data for the different categories?
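One hedged way to check would be to load the stage tables for two categories and compare them on their shared columns using an outer merge with an indicator; rows flagged as anything other than "both" differ between categories. The column names and values below are illustrative toy stand-ins, not real API data:

```python
import pandas as pd

# Toy stand-ins for simplified stage tables from two categories
cat_a = pd.DataFrame({"code": ["01000", "02000"], "length": [499, 1058]})
cat_m = pd.DataFrame({"code": ["01000", "02000"], "length": [499, 1060]})

# An outer merge on the common columns, with an indicator column,
# flags rows that appear on only one side
diff = cat_a.merge(cat_m, how="outer", indicator=True)
mismatches = diff[diff["_merge"] != "both"]
```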