library(tidyverse)
Mazda CX-30 Price Data - Preparation
Preparing and cleaning car price data.
The aim of this notebook is to prepare an R dataframe for analysis. We start with some data that has been collected about car prices for a number of variants of the Mazda CX-30. The car prices are in £ Sterling and have been obtained from a number of sources.
The processing in this notebook will result in the creation of a serialisation of a cleaned dataframe to disk in R’s native RDS format.
We use the term dataframe but will actually be using a Tidyverse tibble
which can be thought of as an enhanced dataframe. The tibble package, and several others that we will be using, are made available by referencing the tidyverse
package.
You can install the tidyverse package by running the one-off command:
install.packages("tidyverse")
Our data is already in a CSV file on disk that we can read into a dataframe and then we can convert the dataframe into a tibble.
<- read_csv("mazda.csv") |> as_tibble() cars
The pipe-forward operator, which is similar (but not identical) to the one found in the F# language and even earlier dialects of ML, takes the result of the command to its left and passes it as the first argument of the function on its right. So the above code fragment is functionally the same as this:
<- as_tibble(read_csv("mazda.csv")) cars
But using the pipe-forward operator, especially when chained in runs of more than one, makes the code easier to read.
str(cars)
tibble [816 × 7] (S3: tbl_df/tbl/data.frame)
$ rowno : num [1:816] 1 2 3 4 5 6 7 8 9 10 ...
$ id : num [1:816] 2.02e+14 2.02e+14 2.02e+14 2.02e+14 2.02e+14 ...
$ title : chr [1:816] "Mazda Cx-30" "Mazda CX-30" "Mazda Cx-30" "Mazda Cx-30" ...
$ subtitle : chr [1:816] "2.0 Skyactiv-X MHEV GT Sport Tech 5dr" "2.0 e-Skyactiv G MHEV SE-L 5dr" "SE-L LUX MHEV 2.0 5dr" "2.0 e-Skyactiv X MHEV GT Sport Tech 5dr Hatchback" ...
$ price : num [1:816] 24990 18785 21995 29994 34745 ...
$ spec : chr [1:816] "2021 (71 reg) SUV 6,542 miles 2.0L 186PS Manual Petrol Full service history" "2021 SUV 14,814 miles 2.0L 122PS Manual Petrol" "2020 (20 reg) SUV 17,548 miles 2.0L 122PS Manual Petrol" "2023 (23 reg) SUV 225 miles 2.0L 186PS Manual Petrol" ...
$ advert_date: POSIXct[1:816], format: "2023-04-03 16:26:50" "2023-04-03 16:26:50" ...
Running str
on the tibble shows that we have 816 rows on 7 columns. Two rows are badly named: title
should be named model
and subtitle
would be better named variant_detail
because it contains a collection of information that we can examine to determine more about the particular variant of the model that each row describes. We can rename the rows by using the rename
function from the dplyr
package (which is itself part of the umbrella tidyverse
package)
<- cars |> rename(
cars model = title,
variant_detail = subtitle
)
Examining the contents of the tibble we can see that there are some rows for models other than the CX-30.
table(cars$model)
Hyundai Tucson Mazda Cx-30 Mazda CX-30
6 314 489
Mazda CX-30 HATCHBACK Mazda CX-5 Mazda CX30
3 2 2
Most are variations on the CX-30 model but the Hyundai Tucson and the Mazda CX-5 rows will need to be removed. We can do this with the filter
function of the dplyr
package. We also don’t need the rowno
, id
or advert_date
columns, so we remove them using the select
function of the ‘dplyr’ package:
<- cars |>
cars select(-c(id, rowno, advert_date)) |>
filter(
!= "Hyundai Tucson",
model != "Mazda CX-5"
model
)head(cars)
# A tibble: 6 × 4
model variant_detail price spec
<chr> <chr> <dbl> <chr>
1 Mazda Cx-30 2.0 Skyactiv-X MHEV GT Sport Tech 5dr 24990 2021 (…
2 Mazda CX-30 2.0 e-Skyactiv G MHEV SE-L 5dr 18785 2021 S…
3 Mazda Cx-30 SE-L LUX MHEV 2.0 5dr 21995 2020 (…
4 Mazda Cx-30 2.0 e-Skyactiv X MHEV GT Sport Tech 5dr Hatchback 29994 2023 (…
5 Mazda CX-30 2.0 186ps 2WD GT Sport Tech Auto / Stone Leather 5dr 34745 SUV 2.…
6 Mazda CX-30 2.0 e-SKYACTIV G MHEV SE-L Lux Euro 6 (s/s) 5dr 25335 SUV 2.…
It is easy to see from the remaining 4 columns most of the useful information about the variant of the car lies within the variant_detail
and the spec
columns:
head(cars$variant_detail)
[1] "2.0 Skyactiv-X MHEV GT Sport Tech 5dr"
[2] "2.0 e-Skyactiv G MHEV SE-L 5dr"
[3] "SE-L LUX MHEV 2.0 5dr"
[4] "2.0 e-Skyactiv X MHEV GT Sport Tech 5dr Hatchback"
[5] "2.0 186ps 2WD GT Sport Tech Auto / Stone Leather 5dr"
[6] "2.0 e-SKYACTIV G MHEV SE-L Lux Euro 6 (s/s) 5dr"
head(cars$spec)
[1] "2021 (71 reg) SUV 6,542 miles 2.0L 186PS Manual Petrol Full service history"
[2] "2021 SUV 14,814 miles 2.0L 122PS Manual Petrol"
[3] "2020 (20 reg) SUV 17,548 miles 2.0L 122PS Manual Petrol"
[4] "2023 (23 reg) SUV 225 miles 2.0L 186PS Manual Petrol"
[5] "SUV 2.0L 186PS Manual Petrol First year road tax included Full manufacturer warranty"
[6] "SUV 2.0L 122PS Manual Petrol First year road tax included Full manufacturer warranty"
The task here is to identify the useful information that could be extracted from these two columns and then write functions to extract that information into new columns in the tibble. We show how this is done using the mutate
function of the dplyr
package which allows us to change the value of existing columns or to create new columns. Where the calculation of value for a column is anything other than trivial the work has been handed off to a function. You can see the definition of those functions by clicking the “show the code” link below.
Show the code
<- function(spec){
bhp_from_spec <- str_extract(spec, "([:digit:]{3})(PS|BHP)", group = 1)
bhp return(as.numeric(bhp))
}<- function(spec){
miles_from_spec <- str_extract(spec, "([0-9,]+) miles", group = 1)
miles <- str_replace(miles, ",", "")
miles <- if_else(is.na(miles), 0, as.numeric(miles))
miles return(miles)
}<- function(spec, age){
new_used_from_spec_and_age <- miles_from_spec(spec)
miles <- str_detect(
warranty str_to_lower(spec),
"full *manufacturer *warranty"
)return(if_else(
== 0 & (warranty == TRUE | age == 0),
miles "New",
"Used"
))
}<- function(spec){
reg_from_spec <- str_match(spec, "(?<year>[0-9]{4}) \\((?<reg>[0-9]{2}) reg\\)")
matches <- matches[, "reg"]
reg return(reg)
}<- function(spec){
year_from_spec <- str_match(spec, "(?<year>[0-9]{4}) \\((?<reg>[0-9]{2}) reg\\)")
matches1 <- matches1[, "year"]
year1 <- str_match(spec, "(?<year>20[12][0-9])")
matches2 <- matches2[, "year"]
year2 return(if_else(is.na(year1), year2, year1))
}<- function(reg){
age_in_months #table_offset_in_months <- interval(ymd("20230301"), today()) / dmonths() # nolint
<- 0
table_offset_in_months = table_offset_in_months +
months 12 * case_when(
== 23 ~ 0.0,
reg == 72 ~ 0.5,
reg == 22 ~ 1.0,
reg == 71 ~ 1.5,
reg == 21 ~ 2.0,
reg == 70 ~ 2.5,
reg == 20 ~ 3.0,
reg == 69 ~ 3.5,
reg == 19 ~ 4.0,
reg == 68 ~ 4.5,
reg == 18 ~ 5.0,
reg .default = NA
)return(months)
}<- function(spec){
age_in_months_from_year <- str_match(spec, "(?<year>20[12][0-9])")
matches <- matches[, "year"]
year #table_offset_in_months <- interval(ymd("20230301"), today()) / dmonths() # nolint
<- 0
table_offset_in_months = table_offset_in_months +
months 12 * case_when(
== 2023 ~ 0.0,
year == 2022 ~ 1.0,
year == 2021 ~ 2.0,
year == 2020 ~ 3.0,
year == 2019 ~ 4.0,
year == 2018 ~ 5.0,
year .default = NA
)return(months)
}<- function(variant_detail){
variant_from_detail <- str_to_lower(variant_detail)
detail <- \(pat) str_detect(fixed(detail), pat)
test <- case_when(
variant test("se-l lux") ~ "SE-L Lux",
test("se-l") ~ "SE-L",
test("sport lux") ~ "Sport Lux",
test("gt sport tech") ~ "GT Sport Tech",
test("gt sport") ~ "GT Sport",
test("anniversary edition") ~ "100th Anniversary Edition",
test("spt lux") ~ "Sport Lux",
test("sport nav") ~ "Sport Nav",
.default = str_glue("XXXX: {variant_detail}")
)return(variant)
}<- function(variant_detail){
drive_from_detail <- str_to_lower(variant_detail)
detail <- \(pat) str_detect(fixed(detail), pat)
test <- case_when(
drive test("awd") ~ "AWD",
test("4wd") ~ "AWD",
.default = "2WD"
)return(drive)
}<- function(variant_detail, bhp) {
engine_from_detail_and_bhp <- str_to_lower(variant_detail)
detail <- \(pat) str_detect(fixed(detail), pat)
test <- case_when(
variant1 test("skyactiv x") ~ "e-Skyactiv X",
test("skyactiv-x") ~ "e-Skyactiv X",
test("skyactiv g") ~ "e-Skyactiv G",
test("skyactiv-g") ~ "e-Skyactiv G",
)<- if_else(
variant2 == "High",
bhp "e-Skyactiv X",
"e-Skyactiv G"
)return(if_else(
!is.na(variant1),
variant1,if_else(
!is.na(variant2),
variant2,str_glue("XXXX: {variant_detail}")
)
))
}<- function(variant, engine, drive) { # nolint
price_new_from_variant_engine_and_drive <- paste(variant, drive, engine, sep = "/")
key <- str_to_lower(key)
key <- case_when(
price_new == "se-l lux/2wd/e-skyactiv g" ~ "26145",
key == "se-l lux/2wd/e-skyactiv x" ~ "28005",
key == "se-l/2wd/e-skyactiv g" ~ "24645",
key == "sport lux/2wd/e-skyactiv g" ~ "26995",
key == "sport lux/2wd/e-skyactiv x" ~ "29005",
key == "gt sport edition/2wd/e-skyactiv g" ~ "28855",
key == "gt sport edition/2wd/e-skyactiv x" ~ "31115",
key == "gt sport edition/awd/e-skyactiv x" ~ "33355",
key == "gt sport/2wd/e-skyactiv g" ~ "28905",
key == "gt sport/2wd/e-skyactiv x" ~ "30915",
key == "gt sport/awd/e-skyactiv x" ~ "33655",
key == "gt sport tech edition/2wd/e-skyactiv g" ~ "29755",
key == "gt sport tech edition/2wd/e-skyactiv x" ~ "32015",
key == "gt sport tech edition/awd/e-skyactiv x" ~ "34255",
key == "gt sport tech/2wd/e-skyactiv g" ~ "29805",
key == "gt sport tech/2wd/e-skyactiv x" ~ "31815",
key == "gt sport tech/awd/e-skyactiv x" ~ "34555",
key .default = str_glue("XXXX: {variant} {engine} {drive}")
)return(as.numeric(price_new))
}
The main body of the code uses those functions to determine the value of a number of new columns.
<- cars |>
cars mutate(
model = "Mazda CX-30",
variant = variant_from_detail(variant_detail),
drive = drive_from_detail(variant_detail),
fuel = if_else(
str_detect(spec, "Petrol"),
"Petrol",
"Diesel",
"Unknown"
),price = price,
transmission = if_else(
str_detect(spec, "Manual"),
"Manual",
"Automatic",
"Unknown"
),bhp = if_else(
bhp_from_spec(spec) > 175,
"High",
"Low"
),engine = engine_from_detail_and_bhp(variant_detail, bhp),
price_new = price_new_from_variant_engine_and_drive(variant, engine, drive), # nolint: line_length_linter.
miles = miles_from_spec(spec),
reg = reg_from_spec(spec),
year = year_from_spec(spec),
age_in_months = if_else(
!is.na(reg),
age_in_months(reg),
age_in_months_from_year(spec)
),mid_age_in_months = if_else(
!is.na(reg),
age_in_months(reg)+3,
age_in_months_from_year(spec)+6
),new_used = new_used_from_spec_and_age(spec, age_in_months),
spec = spec,
depreciation = if_else(price_new > price, price_new - price, 0),
depreciation_per_month = depreciation / mid_age_in_months,
miles_per_month = miles / mid_age_in_months
)
Quite a lot is going on here. In particular,
- All of the assignments and operations are on vectors of data.
- New or mutated columns are able to be defined in terms of previously named columns.
R is an array processing language which means that it has operations that are capable of working on a vector of values, so there are no manually written loops in the code above and even the if_else tests apply to a vector of values.
In the case where we appear to be assigning a scalar value (“Mazda CX-30”) to the column model
R will arrange that the scalar value is recycled as many times as required to fill the entire vector value of the column.
= "Mazda CX-30", model
In the case where we are comparing the value of a column with a constant scalar value:
bhp_from_spec(spec) > 175,
R will recycle the constant (175) to be a vector of the same length as the vector returned by the call to bhp_from_spec(spec)
allowing the comparison to be made between two vectors of equal length and resulting in a vector of TRUE or FALSE values.
We calculated depreciation_per_month
and miles_per_month
from mid_age_in_months
which was in turn calculated from reg
which was determined from spec
.
We now have a tibble with many more columns:
str(cars)
tibble [808 × 20] (S3: tbl_df/tbl/data.frame)
$ model : chr [1:808] "Mazda CX-30" "Mazda CX-30" "Mazda CX-30" "Mazda CX-30" ...
$ variant_detail : chr [1:808] "2.0 Skyactiv-X MHEV GT Sport Tech 5dr" "2.0 e-Skyactiv G MHEV SE-L 5dr" "SE-L LUX MHEV 2.0 5dr" "2.0 e-Skyactiv X MHEV GT Sport Tech 5dr Hatchback" ...
$ price : num [1:808] 24990 18785 21995 29994 34745 ...
$ spec : chr [1:808] "2021 (71 reg) SUV 6,542 miles 2.0L 186PS Manual Petrol Full service history" "2021 SUV 14,814 miles 2.0L 122PS Manual Petrol" "2020 (20 reg) SUV 17,548 miles 2.0L 122PS Manual Petrol" "2023 (23 reg) SUV 225 miles 2.0L 186PS Manual Petrol" ...
$ variant : 'glue' chr [1:808] "GT Sport Tech" "SE-L" "SE-L Lux" "GT Sport Tech" ...
$ drive : chr [1:808] "2WD" "2WD" "2WD" "2WD" ...
$ fuel : chr [1:808] "Petrol" "Petrol" "Petrol" "Petrol" ...
$ transmission : chr [1:808] "Manual" "Manual" "Manual" "Manual" ...
$ bhp : chr [1:808] "High" "Low" "Low" "High" ...
$ engine : 'glue' chr [1:808] "e-Skyactiv X" "e-Skyactiv G" "e-Skyactiv G" "e-Skyactiv X" ...
$ price_new : num [1:808] 31815 24645 26145 31815 31815 ...
$ miles : num [1:808] 6542 14814 17548 225 0 ...
$ reg : chr [1:808] "71" NA "20" "23" ...
$ year : chr [1:808] "2021" "2021" "2020" "2023" ...
$ age_in_months : num [1:808] 18 24 36 0 NA NA NA 36 NA NA ...
$ mid_age_in_months : num [1:808] 21 30 39 3 NA NA NA 39 NA NA ...
$ new_used : chr [1:808] "Used" "Used" "Used" "Used" ...
$ depreciation : num [1:808] 6825 5860 4150 1821 0 ...
$ depreciation_per_month: num [1:808] 325 195 106 607 NA ...
$ miles_per_month : num [1:808] 312 494 450 75 NA ...
The columns variant_detail
and spec
are now redundant and we will drop them from the tibble later. But the determination of new column values has thrown up a few records for which we don’t have enough information and we will delete those rows.
table(cars$variant)
100th Anniversary Edition GT Sport GT Sport Tech
4 187 200
SE-L SE-L Lux Sport Lux
62 113 241
Sport Nav
1
The Mazda site defines only the models “SE-L”, “SE-L Lux”, “Sport Lux”, “GT Sport”, “GT Sport Tech”. We also have rows for “100th Anniversary Edition” and “Sport Nav” - There are a very small number of these and they are either bad data, old data or of no interest to us and so we will remove them.
|> filter(is.na(price_new)) |> count())[[1]] (cars
[1] 29
We have written a function to determine the new price of each of the vehicles. The new price can be found from the Mazda site using a combination of variant, engine size and drive (2WD or AWD). We want to be able to use the new price of a given vehicle to help us estimate how much the vehicle has depreciated over time.
The above query shows us that for a small number of rows we were unable to find a price for the given combination of these column values (For example, some of the lower priced variants are not available with a large engine size or four wheel drive). Where such a combination is found our new price lookup function returns NA - we will remove those rows from the data too:
<- cars |>
cars filter(
!= "100th Anniversary Edition",
variant != "Sport Nav",
variant !is.na(price_new)
)
Next we create factors for those columns that are to be treated as such:
$model <- as_factor(cars$model)
cars$variant <-
carsas_factor(cars$variant) |>
fct_reorder(cars$price_new)
$drive <- as_factor(cars$drive)
cars$engine <-
carsas_factor(cars$engine) |>
fct_reorder(cars$price_new)
$new_used <- as_factor(cars$new_used)
cars$reg <- as_factor(cars$reg)
cars$fuel <- as_factor(cars$fuel)
cars$transmission <- as_factor(cars$transmission)
cars$bhp <- as_factor(cars$bhp) cars
The as_factor
function is part of the tidyverse forcats
package which makes it easier to work with factors than using the built-in R factor support. The fct_reorder
function has allowed us to order the variant factor so that it is arranged in order of increasing (median) new price - sorted from cheapest variant to most expensive. We do the same kind of ordering for the engine - the “e-Skyactiv G” engine is cheaper than the “e-Skyactiv X” engine.
So now:
table(cars$variant)
SE-L SE-L Lux Sport Lux GT Sport GT Sport Tech
42 113 237 187 200
This improves the appearance of charts and tables that involve the vehicle variant by displaying them in increasing order of cost.
Finally we select the final set of columns that we want to see in our tibble and we save the tibble to a disk file for use in later analyses.
<- cars |>
cars select(c(
model, variant, drive, engine, new_used,
miles, miles_per_month,
price, price_new,
depreciation, depreciation_per_month,
fuel, transmission,
age_in_months
))saveRDS(cars, file="mazda.rds")
Later we can re-hydrate the cars
tibble in another notebook with a call to readRDS
.
<- readRDS("mazda.rds") cars
Assumptions Made
We have succeeded in extracting information from a small number of descriptive columns (spec
and variant_detail
) into a larger number of factor columns. We should find it easier to work with these new columns but a number of assumptions have been made in our processing.
- We identified a New car as a vehicle that had zero miles and either a “full manufacturer warranty” statement in the
spec
column or we determined its age to be zero. - We looked for a couple of different formats of registration plate or year in the
spec
column and rejected vehicles which didn’t match either of these patterns. - Our
age_in_months
andage_in_months_from_year
functions makes use of a table that will date with time. We ought to be able to do better than this. - We have hardwired new prices from the Mazda website into our function that calculates the new price of a vehicle. Some of the older vehicles in our data set may have had a different price when new and also may have expensive extras such as custom paint colours that would have added to their new price. We have ignored that complexity.
Determination of Per Month Depreciation and Mileage
There are several approximations used in the determination of the per month depreciation and mileage.
To estimate depreciation of a vehicle we simply subtract the vehicle price from our estimate of its new price. However, for cars that are new or nearly new and which have some expensive extras it is sometimes possible for the asking price to be greater than our estimated new price. In that situation we set the depreciation to zero.
= if_else(price_new > price, price_new - price, 0), depreciation
To calculate the per month depreciation or mileage we need an estimate of how old the vehicle is. The only information that we have about the age of the vehicle comes from its registration number or registration year. For some vehicles we have the registration number, for others we have the registration year. The two digit age identifier from the registration number gives us a 6 month window in which we know that the car was registered. The vehicle registration year gives us a 12 month window in which we know that the vehicle was registered. So, which of these two we have for a particular vehicle, determines how accurately we can estimate its age - to within 6 months or within 12 months.
We also need to make an assumption about where in the range of 6 or 12 months the vehicle is likely to be. We make the assumption that the age of a vehicle falls in the middle of whichever kind of range (6 or 12 months) we have for the vehicle. The variable mid_age_in_months
contains this information and we use that to calculate depreciation_per_month and miles_per_month.
= if_else(
mid_age_in_months !is.na(reg),
age_in_months(reg)+3,
age_in_months_from_year(spec)+6
)
...= depreciation / mid_age_in_months
depreciation_per_month = miles / mid_age_in_months miles_per_month
So, in the absence of the full set of data, we have to make some big assumptions - but they don’t seem unreasonable ones.
The next task is to use the data in the tibble to learn something about the depreciation of these vehicles over time.