Mazda CX-30 Price Data - Preparation

Preparing and cleaning car price data.

R
CarPrice
Mazda CX-30
Author

John Bates

Published

April 3, 2023

The aim of this notebook is to prepare an R dataframe for analysis. We start with some data that has been collected about car prices for a number of variants of the Mazda CX-30. The car prices are in £ Sterling and have been obtained from a number of sources.

The processing in this notebook will result in the creation of a serialisation of a cleaned dataframe to disk in R’s native RDS format.

We use the term dataframe but will actually be using a Tidyverse tibble which can be thought of as an enhanced dataframe. The tibble package, and several others that we will be using, are made available by referencing the tidyverse package.

Note

You can install the tidyverse package by running the one-off command:

install.packages("tidyverse")
library(tidyverse)

Our data is already in a CSV file on disk that we can read into a dataframe and then we can convert the dataframe into a tibble.

cars <- read_csv("mazda.csv") |> as_tibble()

The pipe-forward operator, which is similar (but not identical) to the one found in the F# language and even earlier dialects of ML, takes the result of the command to its left and passes it as the first argument of the function on its right. So the above code fragment is functionally the same as this:

cars <- as_tibble(read_csv("mazda.csv"))

But using the pipe-forward operator, especially when chained in runs of more than one, makes the code easier to read.

str(cars)
tibble [816 × 7] (S3: tbl_df/tbl/data.frame)
 $ rowno      : num [1:816] 1 2 3 4 5 6 7 8 9 10 ...
 $ id         : num [1:816] 2.02e+14 2.02e+14 2.02e+14 2.02e+14 2.02e+14 ...
 $ title      : chr [1:816] "Mazda Cx-30" "Mazda CX-30" "Mazda Cx-30" "Mazda Cx-30" ...
 $ subtitle   : chr [1:816] "2.0 Skyactiv-X MHEV GT Sport Tech 5dr" "2.0 e-Skyactiv G MHEV SE-L 5dr" "SE-L LUX MHEV 2.0 5dr" "2.0 e-Skyactiv X MHEV GT Sport Tech 5dr Hatchback" ...
 $ price      : num [1:816] 24990 18785 21995 29994 34745 ...
 $ spec       : chr [1:816] "2021 (71 reg) SUV 6,542 miles 2.0L 186PS Manual Petrol Full service history" "2021 SUV 14,814 miles 2.0L 122PS Manual Petrol" "2020 (20 reg) SUV 17,548 miles 2.0L 122PS Manual Petrol" "2023 (23 reg) SUV 225 miles 2.0L 186PS Manual Petrol" ...
 $ advert_date: POSIXct[1:816], format: "2023-04-03 16:26:50" "2023-04-03 16:26:50" ...

Running str on the tibble shows that we have 816 rows on 7 columns. Two rows are badly named: title should be named model and subtitle would be better named variant_detail because it contains a collection of information that we can examine to determine more about the particular variant of the model that each row describes. We can rename the rows by using the rename function from the dplyr package (which is itself part of the umbrella tidyverse package)

cars <- cars |> rename(
    model = title,
    variant_detail = subtitle
)

Examining the contents of the tibble we can see that there are some rows for models other than the CX-30.

table(cars$model)

       Hyundai Tucson           Mazda Cx-30           Mazda CX-30 
                    6                   314                   489 
Mazda CX-30 HATCHBACK            Mazda CX-5            Mazda CX30 
                    3                     2                     2 

Most are variations on the CX-30 model but the Hyundai Tucson and the Mazda CX-5 rows will need to be removed. We can do this with the filter function of the dplyr package. We also don’t need the rowno, id or advert_date columns, so we remove them using the select function of the ‘dplyr’ package:

cars <- cars |>
    select(-c(id, rowno, advert_date)) |>
    filter(
        model != "Hyundai Tucson",
        model != "Mazda CX-5"
    )
head(cars)
# A tibble: 6 × 4
  model       variant_detail                                       price spec   
  <chr>       <chr>                                                <dbl> <chr>  
1 Mazda Cx-30 2.0 Skyactiv-X MHEV GT Sport Tech 5dr                24990 2021 (…
2 Mazda CX-30 2.0 e-Skyactiv G MHEV SE-L 5dr                       18785 2021 S…
3 Mazda Cx-30 SE-L LUX MHEV 2.0 5dr                                21995 2020 (…
4 Mazda Cx-30 2.0 e-Skyactiv X MHEV GT Sport Tech 5dr Hatchback    29994 2023 (…
5 Mazda CX-30 2.0 186ps 2WD GT Sport Tech Auto / Stone Leather 5dr 34745 SUV 2.…
6 Mazda CX-30 2.0 e-SKYACTIV G MHEV SE-L Lux Euro 6 (s/s) 5dr      25335 SUV 2.…

It is easy to see from the remaining 4 columns most of the useful information about the variant of the car lies within the variant_detail and the spec columns:

head(cars$variant_detail)
[1] "2.0 Skyactiv-X MHEV GT Sport Tech 5dr"               
[2] "2.0 e-Skyactiv G MHEV SE-L 5dr"                      
[3] "SE-L LUX MHEV 2.0 5dr"                               
[4] "2.0 e-Skyactiv X MHEV GT Sport Tech 5dr Hatchback"   
[5] "2.0 186ps 2WD GT Sport Tech Auto / Stone Leather 5dr"
[6] "2.0 e-SKYACTIV G MHEV SE-L Lux Euro 6 (s/s) 5dr"     
head(cars$spec)
[1] "2021 (71 reg) SUV 6,542 miles 2.0L 186PS Manual Petrol Full service history"         
[2] "2021 SUV 14,814 miles 2.0L 122PS Manual Petrol"                                      
[3] "2020 (20 reg) SUV 17,548 miles 2.0L 122PS Manual Petrol"                             
[4] "2023 (23 reg) SUV 225 miles 2.0L 186PS Manual Petrol"                                
[5] "SUV 2.0L 186PS Manual Petrol First year road tax included Full manufacturer warranty"
[6] "SUV 2.0L 122PS Manual Petrol First year road tax included Full manufacturer warranty"

The task here is to identify the useful information that could be extracted from these two columns and then write functions to extract that information into new columns in the tibble. We show how this is done using the mutate function of the dplyr package which allows us to change the value of existing columns or to create new columns. Where the calculation of value for a column is anything other than trivial the work has been handed off to a function. You can see the definition of those functions by clicking the “show the code” link below.

Show the code
bhp_from_spec <- function(spec){
    bhp <- str_extract(spec, "([:digit:]{3})(PS|BHP)", group = 1)
    return(as.numeric(bhp))
}
miles_from_spec <- function(spec){
    miles <- str_extract(spec, "([0-9,]+) miles", group = 1)
    miles <- str_replace(miles, ",", "")
    miles <- if_else(is.na(miles), 0, as.numeric(miles))
    return(miles)
}
new_used_from_spec_and_age <- function(spec, age){
    miles <- miles_from_spec(spec)
    warranty <- str_detect(
        str_to_lower(spec),
        "full *manufacturer *warranty"
    )
    return(if_else(
        miles == 0 & (warranty == TRUE | age == 0),
        "New",
        "Used"
    ))
}
reg_from_spec <- function(spec){
    matches <- str_match(spec, "(?<year>[0-9]{4}) \\((?<reg>[0-9]{2}) reg\\)")
    reg <- matches[, "reg"]
    return(reg)
}
year_from_spec <- function(spec){
    matches1 <- str_match(spec, "(?<year>[0-9]{4}) \\((?<reg>[0-9]{2}) reg\\)")
    year1 <- matches1[, "year"]
    matches2 <- str_match(spec, "(?<year>20[12][0-9])")
    year2 <- matches2[, "year"]
    return(if_else(is.na(year1), year2, year1))
}
age_in_months <- function(reg){
    #table_offset_in_months <- interval(ymd("20230301"), today()) / dmonths() # nolint
    table_offset_in_months <- 0
    months = table_offset_in_months +
        12 * case_when(
        reg == 23 ~ 0.0,
        reg == 72 ~ 0.5,
        reg == 22 ~ 1.0,
        reg == 71 ~ 1.5,
        reg == 21 ~ 2.0,
        reg == 70 ~ 2.5,
        reg == 20 ~ 3.0,
        reg == 69 ~ 3.5,
        reg == 19 ~ 4.0,
        reg == 68 ~ 4.5,
        reg == 18 ~ 5.0,
        .default = NA
        )
    return(months)
}
age_in_months_from_year <- function(spec){
    matches <- str_match(spec, "(?<year>20[12][0-9])")
    year <- matches[, "year"]
    #table_offset_in_months <- interval(ymd("20230301"), today()) / dmonths() # nolint
    table_offset_in_months <- 0
    months = table_offset_in_months +
        12 * case_when(
        year == 2023 ~ 0.0,
        year == 2022 ~ 1.0,
        year == 2021 ~ 2.0,
        year == 2020 ~ 3.0,
        year == 2019 ~ 4.0,
        year == 2018 ~ 5.0,
        .default = NA
        )
    return(months)

}
variant_from_detail <- function(variant_detail){
    detail <- str_to_lower(variant_detail)
    test <- \(pat) str_detect(fixed(detail), pat)
    variant <- case_when(
        test("se-l lux") ~ "SE-L Lux",
        test("se-l") ~ "SE-L",
        test("sport lux") ~ "Sport Lux",
        test("gt sport tech") ~ "GT Sport Tech",
        test("gt sport") ~ "GT Sport",
        test("anniversary edition") ~ "100th Anniversary Edition",
        test("spt lux") ~ "Sport Lux",
        test("sport nav") ~ "Sport Nav",
        .default = str_glue("XXXX: {variant_detail}")
    )
    return(variant)
}
drive_from_detail <- function(variant_detail){
    detail <- str_to_lower(variant_detail)
    test <- \(pat) str_detect(fixed(detail), pat)
    drive <- case_when(
        test("awd") ~ "AWD",
        test("4wd") ~ "AWD",
        .default = "2WD"
    )
    return(drive)
}
engine_from_detail_and_bhp <- function(variant_detail, bhp) {
    detail <- str_to_lower(variant_detail)
    test <- \(pat) str_detect(fixed(detail), pat)
    variant1 <- case_when(
        test("skyactiv x") ~ "e-Skyactiv X",
        test("skyactiv-x") ~ "e-Skyactiv X",
        test("skyactiv g") ~ "e-Skyactiv G",
        test("skyactiv-g") ~ "e-Skyactiv G",
    )
    variant2 <- if_else(
        bhp == "High",
        "e-Skyactiv X",
        "e-Skyactiv G"
    )
    return(if_else(
        !is.na(variant1),
        variant1,
        if_else(
            !is.na(variant2),
            variant2,
            str_glue("XXXX: {variant_detail}")
        )
    ))
}
price_new_from_variant_engine_and_drive <- function(variant, engine, drive) { # nolint
    key <- paste(variant, drive, engine, sep = "/")
    key <- str_to_lower(key)
    price_new <- case_when(
        key == "se-l lux/2wd/e-skyactiv g" ~ "26145",
        key == "se-l lux/2wd/e-skyactiv x" ~ "28005",
        key == "se-l/2wd/e-skyactiv g" ~ "24645",
        key == "sport lux/2wd/e-skyactiv g" ~ "26995",
        key == "sport lux/2wd/e-skyactiv x" ~ "29005",
        key == "gt sport edition/2wd/e-skyactiv g" ~ "28855",
        key == "gt sport edition/2wd/e-skyactiv x" ~ "31115",
        key == "gt sport edition/awd/e-skyactiv x" ~ "33355",
        key == "gt sport/2wd/e-skyactiv g" ~ "28905",
        key == "gt sport/2wd/e-skyactiv x" ~ "30915",
        key == "gt sport/awd/e-skyactiv x" ~ "33655",
        key == "gt sport tech edition/2wd/e-skyactiv g" ~ "29755",
        key == "gt sport tech edition/2wd/e-skyactiv x" ~ "32015",
        key == "gt sport tech edition/awd/e-skyactiv x" ~ "34255",
        key == "gt sport tech/2wd/e-skyactiv g" ~ "29805",
        key == "gt sport tech/2wd/e-skyactiv x" ~ "31815",
        key == "gt sport tech/awd/e-skyactiv x" ~ "34555",
        .default = str_glue("XXXX: {variant} {engine} {drive}")
    )
    return(as.numeric(price_new))
}

The main body of the code uses those functions to determine the value of a number of new columns.

cars <- cars |>
mutate(
    model = "Mazda CX-30",
    variant = variant_from_detail(variant_detail),
    drive = drive_from_detail(variant_detail),
    fuel = if_else(
        str_detect(spec, "Petrol"),
        "Petrol",
        "Diesel",
        "Unknown"
    ),
    price = price,
    transmission = if_else(
        str_detect(spec, "Manual"),
        "Manual",
        "Automatic",
        "Unknown"
    ),
    bhp = if_else(
        bhp_from_spec(spec) > 175,
        "High",
        "Low"
    ),
    engine = engine_from_detail_and_bhp(variant_detail, bhp),
    price_new = price_new_from_variant_engine_and_drive(variant, engine, drive), # nolint: line_length_linter.
    miles = miles_from_spec(spec),
    reg = reg_from_spec(spec),
    year = year_from_spec(spec),
    age_in_months = if_else(
        !is.na(reg),
        age_in_months(reg),
        age_in_months_from_year(spec)
    ),
    mid_age_in_months = if_else(
        !is.na(reg),
        age_in_months(reg)+3,
        age_in_months_from_year(spec)+6
    ),
    new_used = new_used_from_spec_and_age(spec, age_in_months),
    spec = spec,
    depreciation = if_else(price_new > price, price_new - price, 0),
    depreciation_per_month = depreciation / mid_age_in_months,
    miles_per_month = miles / mid_age_in_months
)

Quite a lot is going on here. In particular,

  1. All of the assignments and operations are on vectors of data.
  2. New or mutated columns are able to be defined in terms of previously named columns.

R is an array processing language which means that it has operations that are capable of working on a vector of values, so there are no manually written loops in the code above and even the if_else tests apply to a vector of values.

In the case where we appear to be assigning a scalar value (“Mazda CX-30”) to the column model R will arrange that the scalar value is recycled as many times as required to fill the entire vector value of the column.

model = "Mazda CX-30",

In the case where we are comparing the value of a column with a constant scalar value:

bhp_from_spec(spec) > 175,

R will recycle the constant (175) to be a vector of the same length as the vector returned by the call to bhp_from_spec(spec) allowing the comparison to be made between two vectors of equal length and resulting in a vector of TRUE or FALSE values.

We calculated depreciation_per_month and miles_per_month from mid_age_in_months which was in turn calculated from reg which was determined from spec.

We now have a tibble with many more columns:

str(cars)
tibble [808 × 20] (S3: tbl_df/tbl/data.frame)
 $ model                 : chr [1:808] "Mazda CX-30" "Mazda CX-30" "Mazda CX-30" "Mazda CX-30" ...
 $ variant_detail        : chr [1:808] "2.0 Skyactiv-X MHEV GT Sport Tech 5dr" "2.0 e-Skyactiv G MHEV SE-L 5dr" "SE-L LUX MHEV 2.0 5dr" "2.0 e-Skyactiv X MHEV GT Sport Tech 5dr Hatchback" ...
 $ price                 : num [1:808] 24990 18785 21995 29994 34745 ...
 $ spec                  : chr [1:808] "2021 (71 reg) SUV 6,542 miles 2.0L 186PS Manual Petrol Full service history" "2021 SUV 14,814 miles 2.0L 122PS Manual Petrol" "2020 (20 reg) SUV 17,548 miles 2.0L 122PS Manual Petrol" "2023 (23 reg) SUV 225 miles 2.0L 186PS Manual Petrol" ...
 $ variant               : 'glue' chr [1:808] "GT Sport Tech" "SE-L" "SE-L Lux" "GT Sport Tech" ...
 $ drive                 : chr [1:808] "2WD" "2WD" "2WD" "2WD" ...
 $ fuel                  : chr [1:808] "Petrol" "Petrol" "Petrol" "Petrol" ...
 $ transmission          : chr [1:808] "Manual" "Manual" "Manual" "Manual" ...
 $ bhp                   : chr [1:808] "High" "Low" "Low" "High" ...
 $ engine                : 'glue' chr [1:808] "e-Skyactiv X" "e-Skyactiv G" "e-Skyactiv G" "e-Skyactiv X" ...
 $ price_new             : num [1:808] 31815 24645 26145 31815 31815 ...
 $ miles                 : num [1:808] 6542 14814 17548 225 0 ...
 $ reg                   : chr [1:808] "71" NA "20" "23" ...
 $ year                  : chr [1:808] "2021" "2021" "2020" "2023" ...
 $ age_in_months         : num [1:808] 18 24 36 0 NA NA NA 36 NA NA ...
 $ mid_age_in_months     : num [1:808] 21 30 39 3 NA NA NA 39 NA NA ...
 $ new_used              : chr [1:808] "Used" "Used" "Used" "Used" ...
 $ depreciation          : num [1:808] 6825 5860 4150 1821 0 ...
 $ depreciation_per_month: num [1:808] 325 195 106 607 NA ...
 $ miles_per_month       : num [1:808] 312 494 450 75 NA ...

The columns variant_detail and spec are now redundant and we will drop them from the tibble later. But the determination of new column values has thrown up a few records for which we don’t have enough information and we will delete those rows.

table(cars$variant)

100th Anniversary Edition                  GT Sport             GT Sport Tech 
                        4                       187                       200 
                     SE-L                  SE-L Lux                 Sport Lux 
                       62                       113                       241 
                Sport Nav 
                        1 

The Mazda site defines only the models “SE-L”, “SE-L Lux”, “Sport Lux”, “GT Sport”, “GT Sport Tech”. We also have rows for “100th Anniversary Edition” and “Sport Nav” - There are a very small number of these and they are either bad data, old data or of no interest to us and so we will remove them.

(cars |> filter(is.na(price_new)) |> count())[[1]]
[1] 29

We have written a function to determine the new price of each of the vehicles. The new price can be found from the Mazda site using a combination of variant, engine size and drive (2WD or AWD). We want to be able to use the new price of a given vehicle to help us estimate how much the vehicle has depreciated over time.

The above query shows us that for a small number of rows we were unable to find a price for the given combination of these column values (For example, some of the lower priced variants are not available with a large engine size or four wheel drive). Where such a combination is found our new price lookup function returns NA - we will remove those rows from the data too:

cars <- cars |>
filter(
    variant != "100th Anniversary Edition",
    variant != "Sport Nav",
    !is.na(price_new)
)

Next we create factors for those columns that are to be treated as such:

cars$model <- as_factor(cars$model)
cars$variant <-
    as_factor(cars$variant) |>
    fct_reorder(cars$price_new)
cars$drive <- as_factor(cars$drive)
cars$engine <-
    as_factor(cars$engine) |>
    fct_reorder(cars$price_new)
cars$new_used <- as_factor(cars$new_used)
cars$reg <- as_factor(cars$reg)
cars$fuel <- as_factor(cars$fuel)
cars$transmission <- as_factor(cars$transmission)
cars$bhp <- as_factor(cars$bhp)

The as_factor function is part of the tidyverse forcats package which makes it easier to work with factors than using the built-in R factor support. The fct_reorder function has allowed us to order the variant factor so that it is arranged in order of increasing (median) new price - sorted from cheapest variant to most expensive. We do the same kind of ordering for the engine - the “e-Skyactiv G” engine is cheaper than the “e-Skyactiv X” engine.

So now:

table(cars$variant)

         SE-L      SE-L Lux     Sport Lux      GT Sport GT Sport Tech 
           42           113           237           187           200 

This improves the appearance of charts and tables that involve the vehicle variant by displaying them in increasing order of cost.

Finally we select the final set of columns that we want to see in our tibble and we save the tibble to a disk file for use in later analyses.

cars <- cars |>
select(c(
    model, variant, drive, engine, new_used,
    miles, miles_per_month,
    price, price_new,
    depreciation, depreciation_per_month,
    fuel, transmission,
    age_in_months
))
saveRDS(cars, file="mazda.rds")

Later we can re-hydrate the cars tibble in another notebook with a call to readRDS.

cars <- readRDS("mazda.rds")

Assumptions Made

We have succeeded in extracting information from a small number of descriptive columns (spec and variant_detail) into a larger number of factor columns. We should find it easier to work with these new columns but a number of assumptions have been made in our processing.

  1. We identified a New car as a vehicle that had zero miles and either a “full manufacturer warranty” statement in the spec column or we determined its age to be zero.
  2. We looked for a couple of different formats of registration plate or year in the spec column and rejected vehicles which didn’t match either of these patterns.
  3. Our age_in_months and age_in_months_from_year functions makes use of a table that will date with time. We ought to be able to do better than this.
  4. We have hardwired new prices from the Mazda website into our function that calculates the new price of a vehicle. Some of the older vehicles in our data set may have had a different price when new and also may have expensive extras such as custom paint colours that would have added to their new price. We have ignored that complexity.

Determination of Per Month Depreciation and Mileage

There are several approximations used in the determination of the per month depreciation and mileage.

To estimate depreciation of a vehicle we simply subtract the vehicle price from our estimate of its new price. However, for cars that are new or nearly new and which have some expensive extras it is sometimes possible for the asking price to be greater than our estimated new price. In that situation we set the depreciation to zero.

depreciation = if_else(price_new > price, price_new - price, 0),

To calculate the per month depreciation or mileage we need an estimate of how old the vehicle is. The only information that we have about the age of the vehicle comes from its registration number or registration year. For some vehicles we have the registration number, for others we have the registration year. The two digit age identifier from the registration number gives us a 6 month window in which we know that the car was registered. The vehicle registration year gives us a 12 month window in which we know that the vehicle was registered. So, which of these two we have for a particular vehicle, determines how accurately we can estimate its age - to within 6 months or within 12 months.

We also need to make an assumption about where in the range of 6 or 12 months the vehicle is likely to be. We make the assumption that the age of a vehicle falls in the middle of whichever kind of range (6 or 12 months) we have for the vehicle. The variable mid_age_in_months contains this information and we use that to calculate depreciation_per_month and miles_per_month.

mid_age_in_months = if_else(
    !is.na(reg),
    age_in_months(reg)+3,
    age_in_months_from_year(spec)+6
)
...
depreciation_per_month = depreciation / mid_age_in_months
miles_per_month = miles / mid_age_in_months

So, in the absence of the full set of data, we have to make some big assumptions - but they don’t seem unreasonable ones.

The next task is to use the data in the tibble to learn something about the depreciation of these vehicles over time.

Next: Mazda CX-30 Price Data - Visualisation