stage Data#

The stage data provides general information about a stage, including the astage code (identifier), stage number, stage name, start and end dates, and cancellation status. The stage information also identifes stage sector information, that identifies link sections and competitive stage sections, and within a competitive stage section a list of sections which identify start and end section distances into the competitive stage and the terrain types for each section.

# Load in the required packages
import pandas as pd
from jupyterlite_simple_cors_proxy import furl, xurl

# Also load in our custom function
from dakar_utils_2025 import mergeInLangLabels
# Generate the API URL pattern
dakar_api_template = "https://www.dakar.live.worldrallyraidchampionship.com/api/{path}"

# Define the year
YEAR = 2025
# Define the category
CATEGORY = "A"

# Define the API path to the stage resource
# Use a Python f-string to instantiate variable values directly
stage_path = f"stage-{YEAR}-{CATEGORY}"

# Define the URL
stage_url = dakar_api_template.format(path=stage_path)

# Preview the path and the URL
stage_path, stage_url
('stage-2025-A',
 'https://www.dakar.live.worldrallyraidchampionship.com/api/stage-2025-A')
# Load in data
# Use furl() to handle CORS issues in Jupyterlite
stage_df = pd.read_json(furl(stage_url))
stage_df.columns
Index(['updatedAt', 'isCancelled', 'mapCategoryDisplay', 'timezone',
       'isDelayed', 'sectors', 'stageLangs', 'startDate', 'length', 'stage',
       'type', 'date', 'stageWithBonus', 'podiumDisplay', 'mapDisplay', 'code',
       'endDate', 'marathon', 'groupsAsCategory', 'lastColumnDisplay',
       'generalDisplay', '_bind', '_origin', '_id', '_key', '_updatedAt',
       '_parent', 'qualities', '$category', '_gets', 'categoryLangs',
       'reference', 'liveDisplay', 'refueling', 'position', 'lastStage',
       'shortLabel', 'promotionalDisplay', 'label'],
      dtype='object')

By inspecting the column names, does the stageLangs column define a set of label language mappings, I wonder?

# Preview the first three rows of the stageLangs column
stage_df["stageLangs"][0]
[{'variable': 'stage.name.09000', 'text': 'RIYADH > HARADH', 'locale': 'en'},
 {'variable': 'stage.name.09000', 'locale': 'fr', 'text': 'RIYADH > HARADH'},
 {'text': 'RIYADH > HARADH', 'locale': 'es', 'variable': 'stage.name.09000'},
 {'locale': 'ar', 'variable': 'stage.name.09000', 'text': 'الرياض > حرض'}]

It looks like it does, so we can flatten that data in using the function we defined in the previous chapter. But what column do we need to merge against?

# Preview the values in the first row in the dataframe
stage_df.iloc[0]
updatedAt                                     2025-01-11T14:53:48+01:00
isCancelled                                                         0.0
mapCategoryDisplay                                                    A
timezone                                                    Asia/Riyadh
isDelayed                                                           0.0
sectors               [{'id': 23937, 'groupsLength': [], 'code': '09...
stageLangs            [{'variable': 'stage.name.09000', 'text': 'RIY...
startDate                                     2025-01-14T00:00:00+03:00
length                                                            589.0
stage                                                               9.0
type                                                                STA
date                                          2025-01-14 00:00:00+03:00
stageWithBonus                                                      0.0
podiumDisplay                                                        ce
mapDisplay                                                         True
code                                                              09000
endDate                                       2025-01-14T00:00:00+03:00
marathon                                                            0.0
groupsAsCategory                                                   None
lastColumnDisplay                                                    ce
generalDisplay                                                      1.0
_bind                                                      stage-2025-A
_origin                                                    stage-2025-A
_id                                    5f83172bf9f0bb97d14b83a7c0099fa8
_key                                                              stage
_updatedAt                                                1737386142308
_parent                  category-2025:63b4f5da4591200d0a4cc239245eb03a
qualities             {'ce': {'*': 2, 'T5': 2, 'T1': 2, 'T3': 2, 'T2...
$category                category-2025:63b4f5da4591200d0a4cc239245eb03a
_gets                                         {'category': '$category'}
categoryLangs                                                       NaN
reference                                                           NaN
liveDisplay                                                         NaN
refueling                                                           NaN
position                                                            NaN
lastStage                                                           NaN
shortLabel                                                          NaN
promotionalDisplay                                                  NaN
label                                                               NaN
Name: 0, dtype: object

Inspection of the columns in the original dataframe suggests there is no direct mapping to the variable value, which takes the form stage.name.09000. We do note, however, there is a code value which maps the numeric part of variable, so we can create out own merge column, which we might call variable, and then

# Create a dummy colum to match on
stage_df["variable"] = "stage.name." + stage_df["code"]

# Update the dataframe by using our new function to
# merge in the exploded and widenened language labels
stage_df = mergeInLangLabels(stage_df, "stageLangs", key="variable")

# Preview the dataframe, limited to a few illustrative columns
stage_df[["code", "en", "ar"]].head()
code en ar
0 09000 RIYADH > HARADH الرياض > حرض
1 07000 AL DUWADIMI > AL DUWADIMI الدوادمي > الدوادمي
2 06000 HAIL > AL DUWADIMI حائل > الدوادمي
3 10000 HARADH > SHUBAYTAH حرض > شبيطة
4 0P000 BISHA BISHA

Let’s skim the first few rows of the table, also adding in new column to help identify the stage code more explcitly.

stage_df['stage_code'] = stage_df['code']
stage_df.sort_values("startDate", inplace=True)
stage_df.reset_index(drop=True, inplace=True)

stage_df.head()
updatedAt isCancelled mapCategoryDisplay timezone isDelayed sectors startDate length stage type ... position lastStage shortLabel promotionalDisplay label ar en es fr stage_code
0 2025-01-08T17:24:08+01:00 0.0 NaN Asia/Riyadh 0.0 [{'groupsLength': [], 'startTime': '2025-01-03... 2025-01-03T00:00:00+03:00 77.0 0.0 PRO ... NaN NaN NaN NaN NaN BISHA BISHA BISHA BISHA 0P000
1 2025-01-08T17:24:08+01:00 0.0 A Asia/Riyadh 0.0 [{'id': 23803, 'startTime': '2025-01-04T04:30:... 2025-01-04T00:00:00+03:00 499.0 1.0 STA ... NaN NaN NaN NaN NaN بيشة > بيشة BISHA > BISHA BISHA > BISHA BISHA > BISHA 01000
2 2025-01-08T17:24:08+01:00 0.0 M Asia/Riyadh 0.0 [{'powerStage': False, 'length': 45, 'startTim... 2025-01-05T00:00:00+03:00 1058.0 2.0 STA ... NaN NaN NaN NaN NaN بيشة > بيشة BISHA > BISHA BISHA > BISHA BISHA > BISHA 02000
3 2025-01-08T17:24:08+01:00 0.0 A Asia/Riyadh 0.0 [{'powerStage': False, 'arrivalTime': '2025-01... 2025-01-07T00:00:00+03:00 963.0 3.0 STA ... NaN NaN NaN NaN NaN بيشة > الحناكية BISHA > AL HENAKIYAH BISHA > AL HENAKIYAH BISHA > AL HENAKIYAH 03000
4 2025-01-08T17:24:08+01:00 0.0 A Asia/Riyadh 0.0 [{'type': 'LIA', 'grounds': [{'groundLangs': [... 2025-01-08T00:00:00+03:00 588.0 4.0 STA ... NaN NaN NaN NaN NaN الحناكية > العلا AL HENAKIYAH > ALULA AL HENAKIYAH > ALULA AL HENAKIYAH > ALULA 04000

5 rows × 43 columns

From the date columns, we notice that the dataframe is not ordered. To provide a more natural ordering, we could order the rows by start date, or stage number, which appears to be given by the stage column, for example.

Inspection of the data also reveals that the sectors column typically appears to contain three elements: a link sector, the competititve stage sector, and a second link sector. The competitive stage sector is further subdivided into a list of grounds or terrain types, with associated language mappings. Within in each ground is a list of sections with a section number and a start and finish value that identify start and finish distances in kilometres into the stage for that section.

Parsing each row#

Let’s get the JSON sectors data for single row and see if we can find a sensible way of parsing that.

I am imagining producing something like a two linked dataframes:

  • one containing the top-level sector data for each stage;

  • one containing the surface type by section; I imagine this dataframe to have columns stage, sector, section, start, finish, surface_type and then perhaps language mappings for the surface type.

Let’s start by looking at the metadata for each sector.

sectors_df = pd.json_normalize(stage_df[ "sectors"].explode())

sectors_df.head()
groupsLength startTime code type length powerStage arrivalTime id grounds groupsLength.A_T5
0 [] 2025-01-03T06:45:00+00:00 0P100 LIA 26 False 2025-01-03T07:35:00+00:00 23800 NaN NaN
1 [] 2025-01-03T07:35:00+00:00 0P200 SPE 29 False 2025-01-03T07:54:00+00:00 23801 [{'percentage': 6, 'color': '#753a05', 'sectio... NaN
2 [] 2025-01-03T07:54:00+00:00 0P300 LIA 24 False 2025-01-03T08:24:00+00:00 23802 NaN NaN
3 [] 2025-01-04T04:30:00+00:00 01100 LIA 86 False 2025-01-04T06:25:00+00:00 23803 NaN NaN
4 [] 2025-01-04T06:25:00+00:00 01200 SPE 413 False 2025-01-04T10:35:00+00:00 23804 [{'name': 'ground.name.1', 'percentage': 18, '... NaN

The stage number is not explicitly listed, but we can derive it from the first two characters of the code.

We can also add a “sector number”.

# Generate an appropriate stage code
sectors_df['stage_code'] = sectors_df['code'].str[:2] + '000'

# Generate a sector number
sectors_df['sector_number'] = sectors_df.groupby('stage_code').cumcount() + 1

sectors_df.head()
groupsLength startTime code type length powerStage arrivalTime id grounds groupsLength.A_T5 stage_code sector_number
0 [] 2025-01-03T06:45:00+00:00 0P100 LIA 26 False 2025-01-03T07:35:00+00:00 23800 NaN NaN 0P000 1
1 [] 2025-01-03T07:35:00+00:00 0P200 SPE 29 False 2025-01-03T07:54:00+00:00 23801 [{'percentage': 6, 'color': '#753a05', 'sectio... NaN 0P000 2
2 [] 2025-01-03T07:54:00+00:00 0P300 LIA 24 False 2025-01-03T08:24:00+00:00 23802 NaN NaN 0P000 3
3 [] 2025-01-04T04:30:00+00:00 01100 LIA 86 False 2025-01-04T06:25:00+00:00 23803 NaN NaN 01000 1
4 [] 2025-01-04T06:25:00+00:00 01200 SPE 413 False 2025-01-04T10:35:00+00:00 23804 [{'name': 'ground.name.1', 'percentage': 18, '... NaN 01000 2

If the sector is an SPE type, we have the grounds information to play with. This is split over various sections, with start and finish distances identifying each section, as well as a surface type and a “percentage” value.

# Get the sectors with grounds data
competitive_sectors = sectors_df[['grounds', 'code']].dropna(
    axis="index").explode('grounds').reset_index(drop=True)

competitive_sectors.head()
grounds code
0 {'percentage': 6, 'color': '#753a05', 'section... 0P200
1 {'color': '#1dc942', 'percentage': 24, 'sectio... 0P200
2 {'color': '#efc07c', 'groundLangs': [{'variabl... 0P200
3 {'color': '#753a05', 'name': 'ground.name.1', ... 0P200
4 {'sections': [{'finish': 16, 'start': 14, 'sec... 0P200

Tidy the sectors dataframe:

# Sort sectors by stage and sector
sectors_df.sort_values("code", inplace=True)
# Simplify the sectors dataframe
sectors_df = sectors_df[["stage_code", "code", "id", "sector_number", "powerStage",
                             "length", "startTime", "type", "arrivalTime"]].reset_index(drop=True)

sectors_df.head()
stage_code code id sector_number powerStage length startTime type arrivalTime
0 01000 01100 23803 1 False 86 2025-01-04T04:30:00+00:00 LIA 2025-01-04T06:25:00+00:00
1 01000 01200 23804 2 False 413 2025-01-04T06:25:00+00:00 SPE 2025-01-04T10:35:00+00:00
2 02000 02100 23805 1 False 45 2025-01-05T03:25:00+00:00 LIA 2025-01-05T04:40:00+00:00
3 02000 02200 23806 2 False 967 2025-01-05T04:40:00+00:00 SPE 2025-01-05T13:00:00+00:00
4 02000 02300 23807 3 False 46 2025-01-05T13:00:00+00:00 LIA 2025-01-05T14:00:00+00:00

And simplify and tidy the “top level” stages dataframe:

stage_cols = ['stage_code', 'stage', 'date', 'startDate', 'endDate', 'isCancelled', 'generalDisplay', 'isDelayed', 'marathon',
              'length', 'type',  'timezone', 'stageWithBonus', 'mapCategoryDisplay', 'podiumDisplay', '_bind', 'ar', 'en', 'es', 'fr']
stage_df = stage_df[stage_cols]

Let’s see if we can pull out the data for the mixed surface types in a sensible way.

competitive_sectors.iloc[0].to_dict()
{'grounds': {'percentage': 6,
  'color': '#753a05',
  'sections': [{'finish': 1, 'start': 0, 'section': 1},
   {'start': 2, 'finish': 2, 'section': 3},
   {'finish': 6, 'start': 6, 'section': 5}],
  'name': 'ground.name.1',
  'groundLangs': [{'locale': 'en',
    'text': 'Dirt Track',
    'variable': 'ground.name.1'},
   {'locale': 'fr', 'variable': 'ground.name.1', 'text': 'Piste terre'},
   {'variable': 'ground.name.1', 'text': 'Pista de tierra', 'locale': 'es'},
   {'locale': 'ar', 'variable': 'ground.name.1', 'text': 'الحجارة'}]},
 'code': '0P200'}

We can pull the data out into three components:

  • the percentage of each surface type by stage

  • the surface type for each section

  • the surface type language labels

def flatten_grounds_data(df):
    """
    Flatten nested grounds data into a wide DataFrame format.

    Parameters:
    df (pandas.DataFrame): DataFrame with 'grounds' and 'id' columns where grounds contains nested dictionary data

    Returns:
    pandas.DataFrame: Flattened DataFrame with one row per section
    """

    # Create empty lists to store flattened data
    flattened_data = []
    percentage_data = []
    surface_types = []
    _surface_types = []

    # Sort by stage sector
    df.sort_values("code", inplace=True)

    # Create a mapping for translations
    for _, row in df.iterrows():
        ground_data = row['grounds']

        # Create a translations dictionary
        translations = {f"text_{lang['locale']}": lang['text']
                        for lang in ground_data['groundLangs']}

        _stype = translations["text_en"].lower()

        percentage_data.append(
            {'code': row['code'],
             'percentage': ground_data['percentage'],
             # 'ground_name': ground_data['name'],
             'color': ground_data['color'], "type": _stype})

        if _stype not in _surface_types:
            _surface_types.append(_stype)
            surface_types.append({"type": _stype, **translations})

        # Create a record for each section
        for section in ground_data['sections']:
            section_record = {
                'code': row['code'],
                'ground_name': ground_data['name'],
                'section': section['section'],
                'start': section['start'],
                'finish': section['finish'],
                'color': ground_data['color'],
                # 'percentage': ground_data['percentage'],
                "type": _stype
            }
            flattened_data.append(section_record)

    # Create DataFrame from flattened data
    section_df = pd.DataFrame(flattened_data)
    percentage_df = pd.DataFrame(percentage_data)
    surfaces_df = pd.DataFrame(surface_types)

    # Sort columns for better organization
    # Ignore: 'percentage', 'ground_name',
    fixed_columns = ['code',  'section', 'start', 'finish',
                     'color', "type"]
    lang_columns = [
        col for col in section_df.columns if col.startswith('text_')]
    section_df = section_df[fixed_columns + sorted(lang_columns)]

    # Drop duplicates if any still exist
    section_df = section_df.drop_duplicates()
    section_df = section_df.sort_values(
        ['code', 'section'])
    section_df.reset_index(drop=True, inplace=True)

    return section_df, percentage_df, surfaces_df
section_surfaces, stage_surfaces, surfaces = flatten_grounds_data(
    competitive_sectors)
section_surfaces.head()
code section start finish color type
0 01200 1 0 27 #efc07c sand
1 01200 2 27 32 #753a05 dirt track
2 01200 3 32 32 #1dc942 gravel track
3 01200 4 32 41 #efc07c sand
4 01200 5 41 42 #753a05 dirt track
stage_surfaces.head(10)
code percentage color type
0 01200 18 #753a05 dirt track
1 01200 18 #753a05 dirt track
2 01200 28 #1dc942 gravel track
3 01200 53 #efc07c sand
4 01200 18 #753a05 dirt track
5 01200 28 #1dc942 gravel track
6 01200 53 #efc07c sand
7 01200 18 #753a05 dirt track
8 01200 53 #efc07c sand
9 01200 28 #1dc942 gravel track
surfaces
type text_en text_fr text_es text_ar
0 dirt track Dirt Track Piste terre Pista de tierra الحجارة
1 gravel track Gravel Track Piste empierrée Tierra التراب
2 sand Sand Sable Arena الرمال
3 asphalt road Asphalt Road Goudron Asfalto الزفت
4 dunes Dunes Dunes Dunas الكثبان
stage_surfaces[stage_surfaces["code"]=="11200"]
code percentage color type
121 11200 60 #efc07c sand
122 11200 40 #ff7200 dunes
123 11200 38 #efc07c sand
124 11200 61 #ff7200 dunes

Differences Between Stage Data for Different Categories#

Are there actually any differences in the stage data for the different categories?