library(tidyverse)
Mazda 3 Price Data - Preparation
Preparing and cleaning car price data.
The aim of this notebook is to prepare an R dataframe for analysis. We start with some data that has been collected about car prices for a number of variants of the Mazda 3. The car prices are in £ Sterling and have been obtained from a number of sources.
The processing in this notebook will result in the creation of a serialisation of a cleaned dataframe to disk in R’s native RDS format.
This notebook follows the same series of steps as those described in more detail in the equivalent notebooks for the CX-30. The aim of this notebook is to rapidly repeat the same steps but for the Mazda 3 and so has much less narrative is given.
Our data is already in a CSV file on disk that we can read into a dataframe and then we can convert the dataframe into a tibble.
<- read_csv("mazda3.csv") |> as_tibble() cars
Running str
on the tibble shows that we have 669 rows on 7 columns.
str(cars)
tibble [669 × 7] (S3: tbl_df/tbl/data.frame)
$ rowno : num [1:669] 1 2 3 4 5 6 7 8 9 10 ...
$ id : num [1:669] 2.02e+14 2.02e+14 2.02e+14 2.02e+14 2.02e+14 ...
$ title : chr [1:669] "Mazda3" "Mazda 3" "Mazda 3" "Mazda 3" ...
$ subtitle : chr [1:669] "2.0 e-Skyactiv G MHEV GT Sport Tech 5dr Leather Seats" "2.0 SKYACTIV G MHEV SE-L LUX 5DR" "SE-L LUX MHEV 2.0 5dr" "GT SPORT TECH MHEV 2.0 5dr" ...
$ price : num [1:669] 21000 17250 19995 18995 14490 ...
$ spec : chr [1:669] "2021 (71 reg) Hatchback 16,843 miles 2.0L 122PS Manual Petrol" "2020 (20 reg) Hatchback 12,875 miles 2.0L 122PS Manual Petrol 1 owner" "2021 (21 reg) Hatchback 7,891 miles 2.0L 122PS Manual Petrol" "2019 (69 reg) Hatchback 28,947 miles 2.0L 122PS Manual Petrol" ...
$ advert_date: chr [1:669] "23/04/2023" "23/04/2023" "23/04/2023" "23/04/2023" ...
Rename the title and subtitle columns.
<- cars |> rename(
cars model = title,
variant_detail = subtitle
)
Examining the contents of the tibble we can see that there are some rows for models other than the Mazda 3.
table(cars$model)
Honda Civic KIA Ceed Mazda 2 Mazda 3
13 3 3 418
Mazda 3 Hatchback Mazda2 Mazda3 MAZDA3 HATCHBACK
2 7 221 2
Most are variations on the Mazda 3 model but the Honda, Kia and Mazda 2 rows will need to be removed. We can do this with the filter
function of the dplyr
package. We also don’t need the rowno
, id
or advert_date
columns, so we remove them using the select
function of the ‘dplyr’ package:
<- cars |>
cars select(-c(id, rowno, advert_date)) |>
filter(
!= "Honda Civic",
model != "KIA Ceed",
model != "Mazda2",
model != "Mazda 2",
model
)table(cars$model)
Mazda 3 Mazda 3 Hatchback Mazda3 MAZDA3 HATCHBACK
418 2 221 2
It is easy to see from the remaining 4 columns that most of the useful information about the variant of the car lies within the variant_detail
and the spec
columns:
head(cars$variant_detail)
[1] "2.0 e-Skyactiv G MHEV GT Sport Tech 5dr Leather Seats"
[2] "2.0 SKYACTIV G MHEV SE-L LUX 5DR"
[3] "SE-L LUX MHEV 2.0 5dr"
[4] "GT SPORT TECH MHEV 2.0 5dr"
[5] "2.0 SKYACTIV-G Sport Nav Euro 6 (s/s) 5dr"
[6] "2.0 Skyactiv G MHEV Sport Lux 5dr Petrol Hatchback"
head(cars$spec)
[1] "2021 (71 reg) Hatchback 16,843 miles 2.0L 122PS Manual Petrol"
[2] "2020 (20 reg) Hatchback 12,875 miles 2.0L 122PS Manual Petrol 1 owner"
[3] "2021 (21 reg) Hatchback 7,891 miles 2.0L 122PS Manual Petrol"
[4] "2019 (69 reg) Hatchback 28,947 miles 2.0L 122PS Manual Petrol"
[5] "2018 (18 reg) Hatchback 42,577 miles 2.0L 121PS Manual Petrol 2 owners"
[6] "2019 (69 reg) Hatchback 25,664 miles 2.0L 122PS Manual Petrol"
The task here is to identify the useful information that could be extracted from these two columns and then write functions to extract that information into new columns in the tibble. We show how this is done using the mutate
function of the dplyr
package which allows us to change the value of existing columns or to create new columns. Where the calculation of value for a column is anything other than trivial the work has been handed off to a function. You can see the definition of those functions by clicking the “show the code” link below.
Show the code
<- function(spec){
bhp_from_spec <- str_extract(spec, "([:digit:]{3})(PS|BHP)", group = 1)
bhp return(as.numeric(bhp))
}<- function(spec){
miles_from_spec <- str_extract(spec, "([0-9,]+) miles", group = 1)
miles <- str_replace(miles, ",", "")
miles <- if_else(is.na(miles), 0, as.numeric(miles))
miles return(miles)
}<- function(spec, age){
new_used_from_spec_and_age <- miles_from_spec(spec)
miles <- str_detect(
warranty str_to_lower(spec),
"full *manufacturer *warranty"
)return(if_else(
== 0 & (warranty == TRUE | age == 0),
miles "New",
"Used"
))
}<- function(spec){
reg_from_spec <- str_match(spec, "(?<year>[0-9]{4}) \\((?<reg>[0-9]{2}) reg\\)")
matches <- matches[, "reg"]
reg return(reg)
}<- function(spec){
year_from_spec <- str_match(spec, "(?<year>[0-9]{4}) \\((?<reg>[0-9]{2}) reg\\)")
matches1 <- matches1[, "year"]
year1 <- str_match(spec, "(?<year>20[12][0-9])")
matches2 <- matches2[, "year"]
year2 return(if_else(is.na(year1), year2, year1))
}<- function(reg){
age_in_months #table_offset_in_months <- interval(ymd("20230301"), today()) / dmonths() # nolint
<- 0
table_offset_in_months = table_offset_in_months +
months 12 * case_when(
== 23 ~ 0.0,
reg == 72 ~ 0.5,
reg == 22 ~ 1.0,
reg == 71 ~ 1.5,
reg == 21 ~ 2.0,
reg == 70 ~ 2.5,
reg == 20 ~ 3.0,
reg == 69 ~ 3.5,
reg == 19 ~ 4.0,
reg == 68 ~ 4.5,
reg == 18 ~ 5.0,
reg == 67 ~ 5.5,
reg == 17 ~ 6.0,
reg .default = NA
)return(months)
}<- function(spec){
age_in_months_from_year <- str_match(spec, "(?<year>20[12][0-9])")
matches <- matches[, "year"]
year #table_offset_in_months <- interval(ymd("20230301"), today()) / dmonths() # nolint
<- 0
table_offset_in_months = table_offset_in_months +
months 12 * case_when(
== 2023 ~ 0.0,
year == 2022 ~ 1.0,
year == 2021 ~ 2.0,
year == 2020 ~ 3.0,
year == 2019 ~ 4.0,
year == 2018 ~ 5.0,
year .default = NA
)return(months)
}<- function(variant_detail){
variant_from_detail <- str_to_lower(variant_detail)
detail <- \(pat) str_detect(fixed(detail), pat)
test <- case_when(
variant test("se-l lux") ~ "SE-L Lux",
test("se-l") ~ "SE-L",
test("sport lux") ~ "Sport Lux",
test("gt sport tech") ~ "GT Sport Tech",
test("gt sport") ~ "GT Sport",
.default = str_glue("XXXX: {variant_detail}")
)return(variant)
}<- function(variant_detail){
drive_from_detail <- str_to_lower(variant_detail)
detail <- \(pat) str_detect(fixed(detail), pat)
test <- case_when(
drive test("awd") ~ "AWD",
test("4wd") ~ "AWD",
.default = "2WD"
)return(drive)
}<- function(variant_detail, bhp) {
engine_from_detail_and_bhp <- str_to_lower(variant_detail)
detail <- \(pat) str_detect(fixed(detail), pat)
test <- case_when(
variant1 test("skyactiv x") ~ "e-Skyactiv X",
test("skyactiv-x") ~ "e-Skyactiv X",
test("skyactiv g") ~ "e-Skyactiv G",
test("skyactiv-g") ~ "e-Skyactiv G",
)<- if_else(
variant2 == "High",
bhp "e-Skyactiv X",
"e-Skyactiv G"
)return(if_else(
!is.na(variant1),
variant1,if_else(
!is.na(variant2),
variant2,str_glue("XXXX: {variant_detail}")
)
))
}<- function(variant, engine) { # nolint
price_new_from_variant_and_engine <- paste(variant, engine, sep = "/")
key <- str_to_lower(key)
key <- case_when(
price_new == "se-l lux/e-skyactiv g" ~ "24385",
key == "se-l lux/e-skyactiv x" ~ "26285",
key == "se-l/e-skyactiv g" ~ "23285",
key == "sport lux/e-skyactiv g" ~ "25485",
key == "sport lux/e-skyactiv x" ~ "27685",
key == "gt sport/e-skyactiv g" ~ "27385",
key == "gt sport/e-skyactiv x" ~ "29585",
key == "gt sport tech/e-skyactiv g" ~ "28285",
key == "gt sport tech/e-skyactiv x" ~ "30485",
key .default = str_glue("XXXX: {variant} {engine}")
)return(as.numeric(price_new))
}
The main body of the code uses those functions to determine the value of a number of new columns.
<- cars |>
cars mutate(
model = "Mazda 3",
variant = variant_from_detail(variant_detail),
drive = drive_from_detail(variant_detail),
fuel = if_else(
str_detect(spec, "Petrol"),
"Petrol",
"Diesel",
"Unknown"
),price = price,
transmission = if_else(
str_detect(spec, "Manual"),
"Manual",
"Automatic",
"Unknown"
),bhp = if_else(
bhp_from_spec(spec) > 175,
"High",
"Low"
),engine = engine_from_detail_and_bhp(variant_detail, bhp),
price_new = price_new_from_variant_and_engine(variant, engine), # nolint: line_length_linter.
miles = miles_from_spec(spec),
reg = reg_from_spec(spec),
year = year_from_spec(spec),
age_in_months = if_else(
!is.na(reg),
age_in_months(reg),
age_in_months_from_year(spec)
),mid_age_in_months = if_else(
!is.na(reg),
age_in_months(reg)+3,
age_in_months_from_year(spec)+6
),new_used = new_used_from_spec_and_age(spec, age_in_months),
spec = spec,
depreciation = if_else(price_new > price, price_new - price, 0),
depreciation_per_month = depreciation / mid_age_in_months,
miles_per_month = miles / mid_age_in_months
)
We now have a tibble with many more columns:
str(cars)
tibble [643 × 20] (S3: tbl_df/tbl/data.frame)
$ model : chr [1:643] "Mazda 3" "Mazda 3" "Mazda 3" "Mazda 3" ...
$ variant_detail : chr [1:643] "2.0 e-Skyactiv G MHEV GT Sport Tech 5dr Leather Seats" "2.0 SKYACTIV G MHEV SE-L LUX 5DR" "SE-L LUX MHEV 2.0 5dr" "GT SPORT TECH MHEV 2.0 5dr" ...
$ price : num [1:643] 21000 17250 19995 18995 14490 ...
$ spec : chr [1:643] "2021 (71 reg) Hatchback 16,843 miles 2.0L 122PS Manual Petrol" "2020 (20 reg) Hatchback 12,875 miles 2.0L 122PS Manual Petrol 1 owner" "2021 (21 reg) Hatchback 7,891 miles 2.0L 122PS Manual Petrol" "2019 (69 reg) Hatchback 28,947 miles 2.0L 122PS Manual Petrol" ...
$ variant : 'glue' chr [1:643] "GT Sport Tech" "SE-L Lux" "SE-L Lux" "GT Sport Tech" ...
$ drive : chr [1:643] "2WD" "2WD" "2WD" "2WD" ...
$ fuel : chr [1:643] "Petrol" "Petrol" "Petrol" "Petrol" ...
$ transmission : chr [1:643] "Manual" "Manual" "Manual" "Manual" ...
$ bhp : chr [1:643] "Low" "Low" "Low" "Low" ...
$ engine : 'glue' chr [1:643] "e-Skyactiv G" "e-Skyactiv G" "e-Skyactiv G" "e-Skyactiv G" ...
$ price_new : num [1:643] 28285 24385 24385 28285 NA ...
$ miles : num [1:643] 16843 12875 7891 28947 42577 ...
$ reg : chr [1:643] "71" "20" "21" "69" ...
$ year : chr [1:643] "2021" "2020" "2021" "2019" ...
$ age_in_months : num [1:643] 18 36 24 42 60 42 24 42 30 NA ...
$ mid_age_in_months : num [1:643] 21 39 27 45 63 45 27 45 33 NA ...
$ new_used : chr [1:643] "Used" "Used" "Used" "Used" ...
$ depreciation : num [1:643] 7285 7135 4390 9290 NA ...
$ depreciation_per_month: num [1:643] 347 183 163 206 NA ...
$ miles_per_month : num [1:643] 802 330 292 643 676 ...
The columns variant_detail
and spec
are now redundant and we will drop them from the tibble later. But the determination of new column values has thrown up a few records for which we don’t have enough information and we will delete those rows.
The Mazda site defines only the models “SE-L”, “SE-L Lux”, “Sport Lux”, “GT Sport”, and “GT Sport Tech”. We make use of the fact that the determination of price_new will fail if we have not recognised the variant
|> filter(is.na(price_new)) |> count())[[1]] (cars
[1] 137
We have written a function to determine the new price of each of the vehicles. The new price can be found from the Mazda site using a combination of variant, and engine size. We want to be able to use the new price of a given vehicle to help us estimate how much the vehicle has depreciated over time.
The above query shows us that for 137 vehicles we were unable to find a price for the given combination of these column values. We will also remove record for which we cannot determine the age_in_months and those for all-wheel-drive vehicles.
<- cars |>
cars filter(
!is.na(price_new),
!is.na(age_in_months),
!= "AWD"
drive )
Next we create factors for those columns that are to be treated as such:
$model <- as_factor(cars$model)
cars$variant <-
carsas_factor(cars$variant) |>
fct_reorder(cars$price_new)
$drive <- as_factor(cars$drive)
cars$engine <-
carsas_factor(cars$engine) |>
fct_reorder(cars$price_new)
$new_used <- as_factor(cars$new_used)
cars$reg <- as_factor(cars$reg)
cars$fuel <- as_factor(cars$fuel)
cars$transmission <- as_factor(cars$transmission)
cars$bhp <- as_factor(cars$bhp) cars
So now we have the following numbers of vehicles of each variant remaining.
table(cars$variant)
SE-L SE-L Lux Sport Lux GT Sport GT Sport Tech
41 57 143 81 57
We select the final set of columns that we want to see in our tibble and we save the tibble to a disk file for use in later analyses.
<- cars |>
cars select(c(
model, variant, drive, engine, new_used,
miles, miles_per_month,
price, price_new,
depreciation, depreciation_per_month,
fuel, transmission,
age_in_months
))saveRDS(cars, file="mazda3.rds")