Mazda 3 Price Data - Preparation

Preparing and cleaning car price data.

R
CarPrice
Mazda 3
Author

John Bates

Published

May 2, 2023

The aim of this notebook is to prepare an R dataframe for analysis. We start with some data that has been collected about car prices for a number of variants of the Mazda 3. The car prices are in £ Sterling and have been obtained from a number of sources.

The processing in this notebook will result in the creation of a serialisation of a cleaned dataframe to disk in R’s native RDS format.

This notebook follows the same series of steps as those described in more detail in the equivalent notebooks for the CX-30. The aim of this notebook is to rapidly repeat the same steps but for the Mazda 3 and so has much less narrative is given.

library(tidyverse)

Our data is already in a CSV file on disk that we can read into a dataframe and then we can convert the dataframe into a tibble.

cars <- read_csv("mazda3.csv") |> as_tibble()

Running str on the tibble shows that we have 669 rows on 7 columns.

str(cars)
tibble [669 × 7] (S3: tbl_df/tbl/data.frame)
 $ rowno      : num [1:669] 1 2 3 4 5 6 7 8 9 10 ...
 $ id         : num [1:669] 2.02e+14 2.02e+14 2.02e+14 2.02e+14 2.02e+14 ...
 $ title      : chr [1:669] "Mazda3" "Mazda 3" "Mazda 3" "Mazda 3" ...
 $ subtitle   : chr [1:669] "2.0 e-Skyactiv G MHEV GT Sport Tech 5dr Leather Seats" "2.0 SKYACTIV G MHEV SE-L LUX 5DR" "SE-L LUX MHEV 2.0 5dr" "GT SPORT TECH MHEV 2.0 5dr" ...
 $ price      : num [1:669] 21000 17250 19995 18995 14490 ...
 $ spec       : chr [1:669] "2021 (71 reg) Hatchback 16,843 miles 2.0L 122PS Manual Petrol" "2020 (20 reg) Hatchback 12,875 miles 2.0L 122PS Manual Petrol 1 owner" "2021 (21 reg) Hatchback 7,891 miles 2.0L 122PS Manual Petrol" "2019 (69 reg) Hatchback 28,947 miles 2.0L 122PS Manual Petrol" ...
 $ advert_date: chr [1:669] "23/04/2023" "23/04/2023" "23/04/2023" "23/04/2023" ...

Rename the title and subtitle columns.

cars <- cars |> rename(
    model = title,
    variant_detail = subtitle
)

Examining the contents of the tibble we can see that there are some rows for models other than the Mazda 3.

table(cars$model)

      Honda Civic          KIA Ceed           Mazda 2           Mazda 3 
               13                 3                 3               418 
Mazda 3 Hatchback            Mazda2            Mazda3  MAZDA3 HATCHBACK 
                2                 7               221                 2 

Most are variations on the Mazda 3 model but the Honda, Kia and Mazda 2 rows will need to be removed. We can do this with the filter function of the dplyr package. We also don’t need the rowno, id or advert_date columns, so we remove them using the select function of the ‘dplyr’ package:

cars <- cars |>
    select(-c(id, rowno, advert_date)) |>
    filter(
        model != "Honda Civic", 
        model != "KIA Ceed",
        model != "Mazda2",
        model != "Mazda 2",
    )
table(cars$model)

          Mazda 3 Mazda 3 Hatchback            Mazda3  MAZDA3 HATCHBACK 
              418                 2               221                 2 

It is easy to see from the remaining 4 columns that most of the useful information about the variant of the car lies within the variant_detail and the spec columns:

head(cars$variant_detail)
[1] "2.0 e-Skyactiv G MHEV GT Sport Tech 5dr Leather Seats"
[2] "2.0 SKYACTIV G MHEV SE-L LUX 5DR"                     
[3] "SE-L LUX MHEV 2.0 5dr"                                
[4] "GT SPORT TECH MHEV 2.0 5dr"                           
[5] "2.0 SKYACTIV-G Sport Nav Euro 6 (s/s) 5dr"            
[6] "2.0 Skyactiv G MHEV Sport Lux 5dr Petrol Hatchback"   
head(cars$spec)
[1] "2021 (71 reg) Hatchback 16,843 miles 2.0L 122PS Manual Petrol"         
[2] "2020 (20 reg) Hatchback 12,875 miles 2.0L 122PS Manual Petrol 1 owner" 
[3] "2021 (21 reg) Hatchback 7,891 miles 2.0L 122PS Manual Petrol"          
[4] "2019 (69 reg) Hatchback 28,947 miles 2.0L 122PS Manual Petrol"         
[5] "2018 (18 reg) Hatchback 42,577 miles 2.0L 121PS Manual Petrol 2 owners"
[6] "2019 (69 reg) Hatchback 25,664 miles 2.0L 122PS Manual Petrol"         

The task here is to identify the useful information that could be extracted from these two columns and then write functions to extract that information into new columns in the tibble. We show how this is done using the mutate function of the dplyr package which allows us to change the value of existing columns or to create new columns. Where the calculation of value for a column is anything other than trivial the work has been handed off to a function. You can see the definition of those functions by clicking the “show the code” link below.

Show the code
bhp_from_spec <- function(spec){
    bhp <- str_extract(spec, "([:digit:]{3})(PS|BHP)", group = 1)
    return(as.numeric(bhp))
}
miles_from_spec <- function(spec){
    miles <- str_extract(spec, "([0-9,]+) miles", group = 1)
    miles <- str_replace(miles, ",", "")
    miles <- if_else(is.na(miles), 0, as.numeric(miles))
    return(miles)
}
new_used_from_spec_and_age <- function(spec, age){
    miles <- miles_from_spec(spec)
    warranty <- str_detect(
        str_to_lower(spec),
        "full *manufacturer *warranty"
    )
    return(if_else(
        miles == 0 & (warranty == TRUE | age == 0),
        "New",
        "Used"
    ))
}
reg_from_spec <- function(spec){
    matches <- str_match(spec, "(?<year>[0-9]{4}) \\((?<reg>[0-9]{2}) reg\\)")
    reg <- matches[, "reg"]
    return(reg)
}
year_from_spec <- function(spec){
    matches1 <- str_match(spec, "(?<year>[0-9]{4}) \\((?<reg>[0-9]{2}) reg\\)")
    year1 <- matches1[, "year"]
    matches2 <- str_match(spec, "(?<year>20[12][0-9])")
    year2 <- matches2[, "year"]
    return(if_else(is.na(year1), year2, year1))
}
age_in_months <- function(reg){
    #table_offset_in_months <- interval(ymd("20230301"), today()) / dmonths() # nolint
    table_offset_in_months <- 0
    months = table_offset_in_months +
        12 * case_when(
        reg == 23 ~ 0.0,
        reg == 72 ~ 0.5,
        reg == 22 ~ 1.0,
        reg == 71 ~ 1.5,
        reg == 21 ~ 2.0,
        reg == 70 ~ 2.5,
        reg == 20 ~ 3.0,
        reg == 69 ~ 3.5,
        reg == 19 ~ 4.0,
        reg == 68 ~ 4.5,
        reg == 18 ~ 5.0,
        reg == 67 ~ 5.5,
        reg == 17 ~ 6.0,
        .default = NA
        )
    return(months)
}
age_in_months_from_year <- function(spec){
    matches <- str_match(spec, "(?<year>20[12][0-9])")
    year <- matches[, "year"]
    #table_offset_in_months <- interval(ymd("20230301"), today()) / dmonths() # nolint
    table_offset_in_months <- 0
    months = table_offset_in_months +
        12 * case_when(
        year == 2023 ~ 0.0,
        year == 2022 ~ 1.0,
        year == 2021 ~ 2.0,
        year == 2020 ~ 3.0,
        year == 2019 ~ 4.0,
        year == 2018 ~ 5.0,
        .default = NA
        )
    return(months)

}
variant_from_detail <- function(variant_detail){
    detail <- str_to_lower(variant_detail)
    test <- \(pat) str_detect(fixed(detail), pat)
    variant <- case_when(
        test("se-l lux") ~ "SE-L Lux",
        test("se-l") ~ "SE-L",
        test("sport lux") ~ "Sport Lux",
        test("gt sport tech") ~ "GT Sport Tech",
        test("gt sport") ~ "GT Sport",
        .default = str_glue("XXXX: {variant_detail}")
    )
    return(variant)
}
drive_from_detail <- function(variant_detail){
    detail <- str_to_lower(variant_detail)
    test <- \(pat) str_detect(fixed(detail), pat)
    drive <- case_when(
        test("awd") ~ "AWD",
        test("4wd") ~ "AWD",
        .default = "2WD"
    )
    return(drive)
}
engine_from_detail_and_bhp <- function(variant_detail, bhp) {
    detail <- str_to_lower(variant_detail)
    test <- \(pat) str_detect(fixed(detail), pat)
    variant1 <- case_when(
        test("skyactiv x") ~ "e-Skyactiv X",
        test("skyactiv-x") ~ "e-Skyactiv X",
        test("skyactiv g") ~ "e-Skyactiv G",
        test("skyactiv-g") ~ "e-Skyactiv G",
    )
    variant2 <- if_else(
        bhp == "High",
        "e-Skyactiv X",
        "e-Skyactiv G"
    )
    return(if_else(
        !is.na(variant1),
        variant1,
        if_else(
            !is.na(variant2),
            variant2,
            str_glue("XXXX: {variant_detail}")
        )
    ))
}
price_new_from_variant_and_engine <- function(variant, engine) { # nolint
    key <- paste(variant, engine, sep = "/")
    key <- str_to_lower(key)
    price_new <- case_when(
        key == "se-l lux/e-skyactiv g" ~ "24385",
        key == "se-l lux/e-skyactiv x" ~ "26285",
        key == "se-l/e-skyactiv g" ~ "23285",
        key == "sport lux/e-skyactiv g" ~ "25485",
        key == "sport lux/e-skyactiv x" ~ "27685",
        key == "gt sport/e-skyactiv g" ~ "27385",
        key == "gt sport/e-skyactiv x" ~ "29585",
        key == "gt sport tech/e-skyactiv g" ~ "28285",
        key == "gt sport tech/e-skyactiv x" ~ "30485",
        .default = str_glue("XXXX: {variant} {engine}")
    )
    return(as.numeric(price_new))
}

The main body of the code uses those functions to determine the value of a number of new columns.

cars <- cars |>
mutate(
    model = "Mazda 3",
    variant = variant_from_detail(variant_detail),
    drive = drive_from_detail(variant_detail),
    fuel = if_else(
        str_detect(spec, "Petrol"),
        "Petrol",
        "Diesel",
        "Unknown"
    ),
    price = price,
    transmission = if_else(
        str_detect(spec, "Manual"),
        "Manual",
        "Automatic",
        "Unknown"
    ),
    bhp = if_else(
        bhp_from_spec(spec) > 175,
        "High",
        "Low"
    ),
    engine = engine_from_detail_and_bhp(variant_detail, bhp),
    price_new = price_new_from_variant_and_engine(variant, engine), # nolint: line_length_linter.
    miles = miles_from_spec(spec),
    reg = reg_from_spec(spec),
    year = year_from_spec(spec),
    age_in_months = if_else(
        !is.na(reg),
        age_in_months(reg),
        age_in_months_from_year(spec)
    ),
    mid_age_in_months = if_else(
        !is.na(reg),
        age_in_months(reg)+3,
        age_in_months_from_year(spec)+6
    ),
    new_used = new_used_from_spec_and_age(spec, age_in_months),
    spec = spec,
    depreciation = if_else(price_new > price, price_new - price, 0),
    depreciation_per_month = depreciation / mid_age_in_months,
    miles_per_month = miles / mid_age_in_months
)

We now have a tibble with many more columns:

str(cars)
tibble [643 × 20] (S3: tbl_df/tbl/data.frame)
 $ model                 : chr [1:643] "Mazda 3" "Mazda 3" "Mazda 3" "Mazda 3" ...
 $ variant_detail        : chr [1:643] "2.0 e-Skyactiv G MHEV GT Sport Tech 5dr Leather Seats" "2.0 SKYACTIV G MHEV SE-L LUX 5DR" "SE-L LUX MHEV 2.0 5dr" "GT SPORT TECH MHEV 2.0 5dr" ...
 $ price                 : num [1:643] 21000 17250 19995 18995 14490 ...
 $ spec                  : chr [1:643] "2021 (71 reg) Hatchback 16,843 miles 2.0L 122PS Manual Petrol" "2020 (20 reg) Hatchback 12,875 miles 2.0L 122PS Manual Petrol 1 owner" "2021 (21 reg) Hatchback 7,891 miles 2.0L 122PS Manual Petrol" "2019 (69 reg) Hatchback 28,947 miles 2.0L 122PS Manual Petrol" ...
 $ variant               : 'glue' chr [1:643] "GT Sport Tech" "SE-L Lux" "SE-L Lux" "GT Sport Tech" ...
 $ drive                 : chr [1:643] "2WD" "2WD" "2WD" "2WD" ...
 $ fuel                  : chr [1:643] "Petrol" "Petrol" "Petrol" "Petrol" ...
 $ transmission          : chr [1:643] "Manual" "Manual" "Manual" "Manual" ...
 $ bhp                   : chr [1:643] "Low" "Low" "Low" "Low" ...
 $ engine                : 'glue' chr [1:643] "e-Skyactiv G" "e-Skyactiv G" "e-Skyactiv G" "e-Skyactiv G" ...
 $ price_new             : num [1:643] 28285 24385 24385 28285 NA ...
 $ miles                 : num [1:643] 16843 12875 7891 28947 42577 ...
 $ reg                   : chr [1:643] "71" "20" "21" "69" ...
 $ year                  : chr [1:643] "2021" "2020" "2021" "2019" ...
 $ age_in_months         : num [1:643] 18 36 24 42 60 42 24 42 30 NA ...
 $ mid_age_in_months     : num [1:643] 21 39 27 45 63 45 27 45 33 NA ...
 $ new_used              : chr [1:643] "Used" "Used" "Used" "Used" ...
 $ depreciation          : num [1:643] 7285 7135 4390 9290 NA ...
 $ depreciation_per_month: num [1:643] 347 183 163 206 NA ...
 $ miles_per_month       : num [1:643] 802 330 292 643 676 ...

The columns variant_detail and spec are now redundant and we will drop them from the tibble later. But the determination of new column values has thrown up a few records for which we don’t have enough information and we will delete those rows.

The Mazda site defines only the models “SE-L”, “SE-L Lux”, “Sport Lux”, “GT Sport”, and “GT Sport Tech”. We make use of the fact that the determination of price_new will fail if we have not recognised the variant

(cars |> filter(is.na(price_new)) |> count())[[1]]
[1] 137

We have written a function to determine the new price of each of the vehicles. The new price can be found from the Mazda site using a combination of variant, and engine size. We want to be able to use the new price of a given vehicle to help us estimate how much the vehicle has depreciated over time.

The above query shows us that for 137 vehicles we were unable to find a price for the given combination of these column values. We will also remove record for which we cannot determine the age_in_months and those for all-wheel-drive vehicles.

cars <- cars |>
filter(
    !is.na(price_new),
    !is.na(age_in_months),
    drive != "AWD"
)

Next we create factors for those columns that are to be treated as such:

cars$model <- as_factor(cars$model)
cars$variant <-
    as_factor(cars$variant) |>
    fct_reorder(cars$price_new)
cars$drive <- as_factor(cars$drive)
cars$engine <-
    as_factor(cars$engine) |>
    fct_reorder(cars$price_new)
cars$new_used <- as_factor(cars$new_used)
cars$reg <- as_factor(cars$reg)
cars$fuel <- as_factor(cars$fuel)
cars$transmission <- as_factor(cars$transmission)
cars$bhp <- as_factor(cars$bhp)

So now we have the following numbers of vehicles of each variant remaining.

table(cars$variant)

         SE-L      SE-L Lux     Sport Lux      GT Sport GT Sport Tech 
           41            57           143            81            57 

We select the final set of columns that we want to see in our tibble and we save the tibble to a disk file for use in later analyses.

cars <- cars |>
select(c(
    model, variant, drive, engine, new_used,
    miles, miles_per_month,
    price, price_new,
    depreciation, depreciation_per_month,
    fuel, transmission,
    age_in_months
))
saveRDS(cars, file="mazda3.rds")

Next: Mazda 3 Price Data - Visualisation