stage Data

`stage` Data#

The stage data provides general information about a stage, including the astage code (identifier), stage number, stage name, start and end dates, and cancellation status. The stage information also identifes stage sector information, that identifies link sections and competitive stage sections, and within a competitive stage section a list of sections which identify start and end section distances into the competitive stage and the terrain types for each section.

# Load in the required packages
import pandas as pd
from jupyterlite_simple_cors_proxy import furl, xurl

# Also load in our custom function
from dakar_utils_2025 import mergeInLangLabels

# Generate the API URL pattern
dakar_api_template = "https://www.dakar.live.worldrallyraidchampionship.com/api/{path}"

# Define the year
YEAR = 2025
# Define the category
CATEGORY = "A"

# Define the API path to the stage resource
# Use a Python f-string to instantiate variable values directly
stage_path = f"stage-{YEAR}-{CATEGORY}"

# Define the URL
stage_url = dakar_api_template.format(path=stage_path)

# Preview the path and the URL
stage_path, stage_url

('stage-2025-A',
 'https://www.dakar.live.worldrallyraidchampionship.com/api/stage-2025-A')

# Load in data
# Use furl() to handle CORS issues in Jupyterlite
stage_df = pd.read_json(furl(stage_url))
stage_df.columns

Index(['updatedAt', 'isCancelled', 'mapCategoryDisplay', 'timezone',
       'isDelayed', 'sectors', 'stageLangs', 'startDate', 'length', 'stage',
       'type', 'date', 'stageWithBonus', 'podiumDisplay', 'mapDisplay', 'code',
       'endDate', 'marathon', 'groupsAsCategory', 'lastColumnDisplay',
       'generalDisplay', '_bind', '_origin', '_id', '_key', '_updatedAt',
       '_parent', 'qualities', '$category', '_gets', 'categoryLangs',
       'reference', 'liveDisplay', 'refueling', 'position', 'lastStage',
       'shortLabel', 'promotionalDisplay', 'label'],
      dtype='object')

By inspecting the column names, does the stageLangs column define a set of label language mappings, I wonder?

# Preview the first three rows of the stageLangs column
stage_df["stageLangs"][0]

[{'variable': 'stage.name.09000', 'text': 'RIYADH > HARADH', 'locale': 'en'},
 {'variable': 'stage.name.09000', 'locale': 'fr', 'text': 'RIYADH > HARADH'},
 {'text': 'RIYADH > HARADH', 'locale': 'es', 'variable': 'stage.name.09000'},
 {'locale': 'ar', 'variable': 'stage.name.09000', 'text': 'الرياض > حرض'}]

It looks like it does, so we can flatten that data in using the function we defined in the previous chapter. But what column do we need to merge against?

# Preview the values in the first row in the dataframe
stage_df.iloc[0]

updatedAt                                     2025-01-11T14:53:48+01:00
isCancelled                                                         0.0
mapCategoryDisplay                                                    A
timezone                                                    Asia/Riyadh
isDelayed                                                           0.0
sectors               [{'id': 23937, 'groupsLength': [], 'code': '09...
stageLangs            [{'variable': 'stage.name.09000', 'text': 'RIY...
startDate                                     2025-01-14T00:00:00+03:00
length                                                            589.0
stage                                                               9.0
type                                                                STA
date                                          2025-01-14 00:00:00+03:00
stageWithBonus                                                      0.0
podiumDisplay                                                        ce
mapDisplay                                                         True
code                                                              09000
endDate                                       2025-01-14T00:00:00+03:00
marathon                                                            0.0
groupsAsCategory                                                   None
lastColumnDisplay                                                    ce
generalDisplay                                                      1.0
_bind                                                      stage-2025-A
_origin                                                    stage-2025-A
_id                                    5f83172bf9f0bb97d14b83a7c0099fa8
_key                                                              stage
_updatedAt                                                1737386142308
_parent                  category-2025:63b4f5da4591200d0a4cc239245eb03a
qualities             {'ce': {'*': 2, 'T5': 2, 'T1': 2, 'T3': 2, 'T2...
$category                category-2025:63b4f5da4591200d0a4cc239245eb03a
_gets                                         {'category': '$category'}
categoryLangs                                                       NaN
reference                                                           NaN
liveDisplay                                                         NaN
refueling                                                           NaN
position                                                            NaN
lastStage                                                           NaN
shortLabel                                                          NaN
promotionalDisplay                                                  NaN
label                                                               NaN
Name: 0, dtype: object

Inspection of the columns in the original dataframe suggests there is no direct mapping to the variable value, which takes the form stage.name.09000. We do note, however, there is a code value which maps the numeric part of variable, so we can create out own merge column, which we might call variable, and then

# Create a dummy colum to match on
stage_df["variable"] = "stage.name." + stage_df["code"]

# Update the dataframe by using our new function to
# merge in the exploded and widenened language labels
stage_df = mergeInLangLabels(stage_df, "stageLangs", key="variable")

# Preview the dataframe, limited to a few illustrative columns
stage_df[["code", "en", "ar"]].head()

	code	en	ar
0	09000	RIYADH > HARADH	الرياض > حرض
1	07000	AL DUWADIMI > AL DUWADIMI	الدوادمي > الدوادمي
2	06000	HAIL > AL DUWADIMI	حائل > الدوادمي
3	10000	HARADH > SHUBAYTAH	حرض > شبيطة
4	0P000	BISHA	BISHA

Let’s skim the first few rows of the table, also adding in new column to help identify the stage code more explcitly.

stage_df['stage_code'] = stage_df['code']
stage_df.sort_values("startDate", inplace=True)
stage_df.reset_index(drop=True, inplace=True)

stage_df.head()

	updatedAt	mapCategoryDisplay	timezone	sectors	startDate	length	stage	type	...	position	lastStage	shortLabel	promotionalDisplay	label	ar	en	es	fr	stage_code
0	2025-01-08T17:24:08+01:00	NaN	Asia/Riyadh	[{'groupsLength': [], 'startTime': '2025-01-03...	2025-01-03T00:00:00+03:00	77.0	0.0	PRO	...	NaN	NaN	NaN	NaN	NaN	BISHA	BISHA	BISHA	BISHA	0P000
1	2025-01-08T17:24:08+01:00	A	Asia/Riyadh	[{'id': 23803, 'startTime': '2025-01-04T04:30:...	2025-01-04T00:00:00+03:00	499.0	1.0	STA	...	NaN	NaN	NaN	NaN	NaN	بيشة > بيشة	BISHA > BISHA	BISHA > BISHA	BISHA > BISHA	01000
2	2025-01-08T17:24:08+01:00	M	Asia/Riyadh	[{'powerStage': False, 'length': 45, 'startTim...	2025-01-05T00:00:00+03:00	1058.0	2.0	STA	...	NaN	NaN	NaN	NaN	NaN	بيشة > بيشة	BISHA > BISHA	BISHA > BISHA	BISHA > BISHA	02000
3	2025-01-08T17:24:08+01:00	A	Asia/Riyadh	[{'powerStage': False, 'arrivalTime': '2025-01...	2025-01-07T00:00:00+03:00	963.0	3.0	STA	...	NaN	NaN	NaN	NaN	NaN	بيشة > الحناكية	BISHA > AL HENAKIYAH	BISHA > AL HENAKIYAH	BISHA > AL HENAKIYAH	03000
4	2025-01-08T17:24:08+01:00	A	Asia/Riyadh	[{'type': 'LIA', 'grounds': [{'groundLangs': [...	2025-01-08T00:00:00+03:00	588.0	4.0	STA	...	NaN	NaN	NaN	NaN	NaN	الحناكية > العلا	AL HENAKIYAH > ALULA	AL HENAKIYAH > ALULA	AL HENAKIYAH > ALULA	04000

5 rows × 43 columns

From the date columns, we notice that the dataframe is not ordered. To provide a more natural ordering, we could order the rows by start date, or stage number, which appears to be given by the stage column, for example.

Inspection of the data also reveals that the sectors column typically appears to contain three elements: a link sector, the competititve stage sector, and a second link sector. The competitive stage sector is further subdivided into a list of grounds or terrain types, with associated language mappings. Within in each ground is a list of sections with a section number and a start and finish value that identify start and finish distances in kilometres into the stage for that section.

Parsing each row#

Let’s get the JSON sectors data for single row and see if we can find a sensible way of parsing that.

I am imagining producing something like a two linked dataframes:

one containing the top-level sector data for each stage;
one containing the surface type by section; I imagine this dataframe to have columns stage, sector, section, start, finish, surface_type and then perhaps language mappings for the surface type.

Let’s start by looking at the metadata for each sector.

sectors_df = pd.json_normalize(stage_df[ "sectors"].explode())

sectors_df.head()

	groupsLength	startTime	code	type	length	powerStage	arrivalTime	id	grounds	groupsLength.A_T5
0	[]	2025-01-03T06:45:00+00:00	0P100	LIA	26	False	2025-01-03T07:35:00+00:00	23800	NaN	NaN
1	[]	2025-01-03T07:35:00+00:00	0P200	SPE	29	False	2025-01-03T07:54:00+00:00	23801	[{'percentage': 6, 'color': '#753a05', 'sectio...	NaN
2	[]	2025-01-03T07:54:00+00:00	0P300	LIA	24	False	2025-01-03T08:24:00+00:00	23802	NaN	NaN
3	[]	2025-01-04T04:30:00+00:00	01100	LIA	86	False	2025-01-04T06:25:00+00:00	23803	NaN	NaN
4	[]	2025-01-04T06:25:00+00:00	01200	SPE	413	False	2025-01-04T10:35:00+00:00	23804	[{'name': 'ground.name.1', 'percentage': 18, '...	NaN

The stage number is not explicitly listed, but we can derive it from the first two characters of the code.

We can also add a “sector number”.

# Generate an appropriate stage code
sectors_df['stage_code'] = sectors_df['code'].str[:2] + '000'

# Generate a sector number
sectors_df['sector_number'] = sectors_df.groupby('stage_code').cumcount() + 1

sectors_df.head()

	groupsLength	startTime	code	type	length	powerStage	arrivalTime	id	grounds	groupsLength.A_T5	stage_code	sector_number
0	[]	2025-01-03T06:45:00+00:00	0P100	LIA	26	False	2025-01-03T07:35:00+00:00	23800	NaN	NaN	0P000	1
1	[]	2025-01-03T07:35:00+00:00	0P200	SPE	29	False	2025-01-03T07:54:00+00:00	23801	[{'percentage': 6, 'color': '#753a05', 'sectio...	NaN	0P000	2
2	[]	2025-01-03T07:54:00+00:00	0P300	LIA	24	False	2025-01-03T08:24:00+00:00	23802	NaN	NaN	0P000	3
3	[]	2025-01-04T04:30:00+00:00	01100	LIA	86	False	2025-01-04T06:25:00+00:00	23803	NaN	NaN	01000	1
4	[]	2025-01-04T06:25:00+00:00	01200	SPE	413	False	2025-01-04T10:35:00+00:00	23804	[{'name': 'ground.name.1', 'percentage': 18, '...	NaN	01000	2

If the sector is an SPE type, we have the grounds information to play with. This is split over various sections, with start and finish distances identifying each section, as well as a surface type and a “percentage” value.

# Get the sectors with grounds data
competitive_sectors = sectors_df[['grounds', 'code']].dropna(
    axis="index").explode('grounds').reset_index(drop=True)

competitive_sectors.head()

	grounds	code
0	{'percentage': 6, 'color': '#753a05', 'section...	0P200
1	{'color': '#1dc942', 'percentage': 24, 'sectio...	0P200
2	{'color': '#efc07c', 'groundLangs': [{'variabl...	0P200
3	{'color': '#753a05', 'name': 'ground.name.1', ...	0P200
4	{'sections': [{'finish': 16, 'start': 14, 'sec...	0P200

Tidy the sectors dataframe:

# Sort sectors by stage and sector
sectors_df.sort_values("code", inplace=True)
# Simplify the sectors dataframe
sectors_df = sectors_df[["stage_code", "code", "id", "sector_number", "powerStage",
                             "length", "startTime", "type", "arrivalTime"]].reset_index(drop=True)

sectors_df.head()

	stage_code	code	id	sector_number	powerStage	length	startTime	type	arrivalTime
0	01000	01100	23803	1	False	86	2025-01-04T04:30:00+00:00	LIA	2025-01-04T06:25:00+00:00
1	01000	01200	23804	2	False	413	2025-01-04T06:25:00+00:00	SPE	2025-01-04T10:35:00+00:00
2	02000	02100	23805	1	False	45	2025-01-05T03:25:00+00:00	LIA	2025-01-05T04:40:00+00:00
3	02000	02200	23806	2	False	967	2025-01-05T04:40:00+00:00	SPE	2025-01-05T13:00:00+00:00
4	02000	02300	23807	3	False	46	2025-01-05T13:00:00+00:00	LIA	2025-01-05T14:00:00+00:00

And simplify and tidy the “top level” stages dataframe:

stage_cols = ['stage_code', 'stage', 'date', 'startDate', 'endDate', 'isCancelled', 'generalDisplay', 'isDelayed', 'marathon',
              'length', 'type',  'timezone', 'stageWithBonus', 'mapCategoryDisplay', 'podiumDisplay', '_bind', 'ar', 'en', 'es', 'fr']
stage_df = stage_df[stage_cols]

Let’s see if we can pull out the data for the mixed surface types in a sensible way.

competitive_sectors.iloc[0].to_dict()

{'grounds': {'percentage': 6,
  'color': '#753a05',
  'sections': [{'finish': 1, 'start': 0, 'section': 1},
   {'start': 2, 'finish': 2, 'section': 3},
   {'finish': 6, 'start': 6, 'section': 5}],
  'name': 'ground.name.1',
  'groundLangs': [{'locale': 'en',
    'text': 'Dirt Track',
    'variable': 'ground.name.1'},
   {'locale': 'fr', 'variable': 'ground.name.1', 'text': 'Piste terre'},
   {'variable': 'ground.name.1', 'text': 'Pista de tierra', 'locale': 'es'},
   {'locale': 'ar', 'variable': 'ground.name.1', 'text': 'الحجارة'}]},
 'code': '0P200'}

We can pull the data out into three components:

the percentage of each surface type by stage
the surface type for each section
the surface type language labels

def flatten_grounds_data(df):
    """
    Flatten nested grounds data into a wide DataFrame format.

    Parameters:
    df (pandas.DataFrame): DataFrame with 'grounds' and 'id' columns where grounds contains nested dictionary data

    Returns:
    pandas.DataFrame: Flattened DataFrame with one row per section
    """

    # Create empty lists to store flattened data
    flattened_data = []
    percentage_data = []
    surface_types = []
    _surface_types = []

    # Sort by stage sector
    df.sort_values("code", inplace=True)

    # Create a mapping for translations
    for _, row in df.iterrows():
        ground_data = row['grounds']

        # Create a translations dictionary
        translations = {f"text_{lang['locale']}": lang['text']
                        for lang in ground_data['groundLangs']}

        _stype = translations["text_en"].lower()

        percentage_data.append(
            {'code': row['code'],
             'percentage': ground_data['percentage'],
             # 'ground_name': ground_data['name'],
             'color': ground_data['color'], "type": _stype})

        if _stype not in _surface_types:
            _surface_types.append(_stype)
            surface_types.append({"type": _stype, **translations})

        # Create a record for each section
        for section in ground_data['sections']:
            section_record = {
                'code': row['code'],
                'ground_name': ground_data['name'],
                'section': section['section'],
                'start': section['start'],
                'finish': section['finish'],
                'color': ground_data['color'],
                # 'percentage': ground_data['percentage'],
                "type": _stype
            }
            flattened_data.append(section_record)

    # Create DataFrame from flattened data
    section_df = pd.DataFrame(flattened_data)
    percentage_df = pd.DataFrame(percentage_data)
    surfaces_df = pd.DataFrame(surface_types)

    # Sort columns for better organization
    # Ignore: 'percentage', 'ground_name',
    fixed_columns = ['code',  'section', 'start', 'finish',
                     'color', "type"]
    lang_columns = [
        col for col in section_df.columns if col.startswith('text_')]
    section_df = section_df[fixed_columns + sorted(lang_columns)]

    # Drop duplicates if any still exist
    section_df = section_df.drop_duplicates()
    section_df = section_df.sort_values(
        ['code', 'section'])
    section_df.reset_index(drop=True, inplace=True)

    return section_df, percentage_df, surfaces_df

section_surfaces, stage_surfaces, surfaces = flatten_grounds_data(
    competitive_sectors)
section_surfaces.head()

	code	section	start	finish	color	type
0	01200	1	0	27	#efc07c	sand
1	01200	2	27	32	#753a05	dirt track
2	01200	3	32	32	#1dc942	gravel track
3	01200	4	32	41	#efc07c	sand
4	01200	5	41	42	#753a05	dirt track

stage_surfaces.head(10)

	code	percentage	color	type
0	01200	18	#753a05	dirt track
1	01200	18	#753a05	dirt track
2	01200	28	#1dc942	gravel track
3	01200	53	#efc07c	sand
4	01200	18	#753a05	dirt track
5	01200	28	#1dc942	gravel track
6	01200	53	#efc07c	sand
7	01200	18	#753a05	dirt track
8	01200	53	#efc07c	sand
9	01200	28	#1dc942	gravel track

surfaces

	type	text_en	text_fr	text_es	text_ar
0	dirt track	Dirt Track	Piste terre	Pista de tierra	الحجارة
1	gravel track	Gravel Track	Piste empierrée	Tierra	التراب
2	sand	Sand	Sable	Arena	الرمال
3	asphalt road	Asphalt Road	Goudron	Asfalto	الزفت
4	dunes	Dunes	Dunes	Dunas	الكثبان

stage_surfaces[stage_surfaces["code"]=="11200"]

	code	percentage	color	type
121	11200	60	#efc07c	sand
122	11200	40	#ff7200	dunes
123	11200	38	#efc07c	sand
124	11200	61	#ff7200	dunes

Differences Between Stage Data for Different Categories#

Are there actually any differences in the stage data for the different categories?

stage Data

Contents

stage Data#

Parsing each row#

Differences Between Stage Data for Different Categories#

`stage` Data#