Mazda CX-30 Price Data - Tabulation

Tabulation of car price data.

Categories: R, CarPrice, Mazda CX-30
Author: John Bates
Published: April 14, 2023

This notebook looks at methods of tabulating the car price dataset (see Car Price Data - Preparation). The aim is to present simple one- or two-dimensional summary tables with as little code as possible. All data manipulation is done with functions from the dplyr and tidyr packages, both of which are part of the tidyverse.

I feel sure that there should be a simple R package for creating simple tables, with aggregated values, from a dataframe. But, at the time of writing, I could not find one that was simple or mature enough to persuade me to use it. Hopefully that will change soon.

In the meantime, I demonstrate one technique that uses the tidyverse packages to build a two-dimensional table that handles the calculation and display of aggregated data and includes the correct calculation of aggregate values for row, column and grand totals.

Note

In this notebook, when we talk about the rows or columns of a displayed table, we use the terms row dimension and column dimension. Using dimension, rather than variable or column, makes it clear that we mean the horizontal or vertical dimension of a displayed table rather than a property of a dataframe or underlying model.

The technique for creating a table is to build a dataframe that holds the entire table structure and then to display it using the appropriate mechanism. Output in this notebook is by means of the knitr::kable function.

We start by reading in the cars tibble that we created and saved.

library(tidyverse)
cars <- readRDS("../01.preparation/mazda.rds")

We then restrict it to used cars for which we were able to calculate the monthly depreciation.

cars <- cars |>
  filter(new_used == "Used") |>
  filter(!is.na(depreciation_per_month))

Suppose that we would like to create a simple ordered, one-dimensional listing that shows the count and median monthly depreciation for each combination of variant, engine, and drive.

We choose the columns that we are interested in using the select function.

Then we use group_by followed by summarise to reduce the data. Our output will contain a row for each combination of the columns named in the call to group_by - these columns will be shown in the output.

In addition, any values that we define in the call to summarise appear in the listing as columns immediately after the group-by columns. Their values are calculated separately for each group, that is, for each combination of the group-by columns. The aggregating functions can make use of any of the columns that we have selected from the dataframe. The special function n() returns the size of the current group.

The arrange function orders the output into ascending or descending order based on the values of one or more columns.

cars |>
  select(depreciation_per_month, variant, engine, drive) |>
  group_by(variant, engine, drive) |>
  summarise(n=n(), median_dpm = median(depreciation_per_month)) |>
  arrange(median_dpm) |>
  knitr::kable(digits=0)
| variant       | engine       | drive |  n | median_dpm |
|:--------------|:-------------|:------|---:|-----------:|
| SE-L Lux      | e-Skyactiv G | 2WD   | 48 |        186 |
| SE-L          | e-Skyactiv G | 2WD   | 32 |        195 |
| SE-L Lux      | e-Skyactiv X | 2WD   | 27 |        218 |
| Sport Lux     | e-Skyactiv G | 2WD   | 88 |        223 |
| GT Sport      | e-Skyactiv G | 2WD   | 27 |        233 |
| Sport Lux     | e-Skyactiv X | 2WD   | 89 |        241 |
| GT Sport Tech | e-Skyactiv G | 2WD   | 13 |        252 |
| GT Sport      | e-Skyactiv X | 2WD   | 60 |        253 |
| GT Sport Tech | e-Skyactiv X | 2WD   | 82 |        285 |
| GT Sport      | e-Skyactiv X | AWD   | 12 |        304 |
| GT Sport Tech | e-Skyactiv X | AWD   | 13 |        306 |
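
The same pattern can be tried without the Mazda data file. The following is a minimal sketch on a small made-up tibble (the column values are invented purely for illustration). It also shows two options not used above: desc() for descending order and the .groups argument of summarise, which returns an ungrouped result.

```r
library(dplyr)

# Made-up data for illustration only; not taken from the Mazda dataset.
toy <- tibble(
  variant = c("A", "A", "A", "B", "B"),
  engine  = c("G", "G", "X", "G", "G"),
  dpm     = c(100, 120, 200, 150, 170)
)

listing <- toy |>
  group_by(variant, engine) |>
  # n() counts the rows in each group; .groups = "drop" returns an
  # ungrouped result and suppresses the regrouping message.
  summarise(n = n(), median_dpm = median(dpm), .groups = "drop") |>
  # desc() sorts from the largest median to the smallest.
  arrange(desc(median_dpm))

listing
```

The result has one row per variant and engine combination that occurs in the data, with the largest median first.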

If we want to create a two-dimensional table we need to use the pivot_wider function to declare which column of the dataframe will supply the column dimension of the table and which values will be displayed in the cells of the table.

cars |>
  select(variant, engine, dpm=depreciation_per_month) |>
  group_by(variant, engine) |>
  summarise( dpm = median(dpm)) |>
  pivot_wider(names_from=engine, values_from=dpm) |>
  knitr::kable(digits=0)
| variant       | e-Skyactiv G | e-Skyactiv X |
|:--------------|-------------:|-------------:|
| SE-L          |          195 |           NA |
| SE-L Lux      |          186 |          218 |
| Sport Lux     |          223 |          241 |
| GT Sport      |          233 |          261 |
| GT Sport Tech |          252 |          291 |

We have seen that without using pivot_wider we get a listing output that includes our two group-by columns (variant and engine). We use the names_from argument to pivot_wider to declare the dataframe column that will provide the values for the column dimension of our table. We use the values_from argument to pivot_wider to declare the dataframe column whose values will be displayed in the cells of the table. The remaining group-by column will provide the values for the row dimension of the table.

In the above table we have reduced the dataframe to a variant value on the rows, an engine name on the columns and the median monthly depreciation values for each combination of row value and column value in each of the cells.

In general, the procedure is to group the data by the two variables that are to make up the row and column values. Then, summarise the group by the statistic that you want to display in the table cells.

Specify the variable that you want to use to supply the column values of the table in the names_from argument to pivot_wider and the summary statistic that you want to use to supply the cell values of the table in the values_from argument to pivot_wider.

By default, missing values will be shown in the table as NA. If you want them to be replaced by an alternative value then the values_fill argument can supply the replacement.
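
As a minimal sketch of the difference, the following uses a small made-up tibble of counts (invented for illustration) in which one combination is missing, and pivots it with and without values_fill.

```r
library(dplyr)
library(tidyr)

# Made-up counts for illustration only; the B/X combination is absent.
toy <- tibble(
  variant = c("A", "A", "B"),
  engine  = c("G", "X", "G"),
  n       = c(3L, 5L, 2L)
)

# Default behaviour: the missing B/X cell becomes NA.
with_na <- toy |>
  pivot_wider(names_from = engine, values_from = n)

# values_fill supplies the replacement; zero is a natural choice for counts.
with_zero <- toy |>
  pivot_wider(names_from = engine, values_from = n, values_fill = 0L)

with_na
with_zero
```

Zero is a sensible fill for counts. For other statistics, such as a median, a filled-in zero can be mistaken for a real value, so leaving the cell as NA may be clearer. If only the display needs to change, knitr also offers the knitr.kable.NA option, for example options(knitr.kable.NA = ""), which controls how NA is printed by kable.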

If you need to represent more than two dimensions in the table, you can combine two of them on the columns by providing both to names_from. This only really works if the number of distinct values is relatively small and the values themselves are relatively short.

cars |>
  select(variant, engine, drive, dpm=depreciation_per_month) |>
  group_by(variant, engine, drive) |>
  summarise(n = n()) |>
  pivot_wider(
    names_from=c(engine,drive), values_from=n, values_fill=0
  ) |>
  knitr::kable(digits=0)
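| variant       | e-Skyactiv G_2WD | e-Skyactiv X_2WD | e-Skyactiv X_AWD |
|:--------------|-----------------:|-----------------:|-----------------:|
| SE-L          |               32 |                0 |                0 |
| SE-L Lux      |               48 |               27 |                0 |
| Sport Lux     |               88 |               89 |                0 |
| GT Sport      |               27 |               60 |               12 |
| GT Sport Tech |               13 |               82 |               13 |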
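
The combined column names are built by joining the values with an underscore. The names_sep argument of pivot_wider changes that separator. A minimal sketch on made-up data (invented for illustration):

```r
library(dplyr)
library(tidyr)

# Made-up counts for illustration only.
toy <- tibble(
  variant = c("A", "A", "B"),
  engine  = c("G", "X", "X"),
  drive   = c("2WD", "2WD", "AWD"),
  n       = c(3L, 5L, 2L)
)

wide <- toy |>
  pivot_wider(
    names_from = c(engine, drive), values_from = n,
    values_fill = 0L,
    names_sep = " / "   # separator placed between the engine and drive values
  )

names(wide)
```

This only changes the column headings; the width problem described above remains.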

A better solution is to put the variables with more, or longer, values on the table rows and the variable with the fewest and shortest values on the table columns.

cars |>
  select(variant, engine, drive, dpm=depreciation_per_month) |>
  group_by(variant, engine, drive) |>
  summarise(n = n()) |>
  pivot_wider(names_from=drive, values_from=n, values_fill=0) |>
  knitr::kable(digits=0)
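| variant       | engine       | 2WD | AWD |
|:--------------|:-------------|----:|----:|
| SE-L          | e-Skyactiv G |  32 |   0 |
| SE-L Lux      | e-Skyactiv G |  48 |   0 |
| SE-L Lux      | e-Skyactiv X |  27 |   0 |
| Sport Lux     | e-Skyactiv G |  88 |   0 |
| Sport Lux     | e-Skyactiv X |  89 |   0 |
| GT Sport      | e-Skyactiv G |  27 |   0 |
| GT Sport      | e-Skyactiv X |  60 |  12 |
| GT Sport Tech | e-Skyactiv G |  13 |   0 |
| GT Sport Tech | e-Skyactiv X |  82 |  13 |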

If none of the arrangements looks good then revert to using the one-dimensional listing format with one or more summarised values that we started this notebook with.

A simple form of table caption can be provided in the call to knitr::kable.

cars |>
  select(variant, engine, drive, dpm=depreciation_per_month) |>
  group_by(engine, drive) |>
  summarise(n = n()) |>
  pivot_wider(names_from=drive, values_from=n, values_fill=0) |>
  knitr::kable(digits=0, caption="A Count of Drive by Engine")
A Count of Drive by Engine

| engine       | 2WD | AWD |
|:-------------|----:|----:|
| e-Skyactiv G | 208 |   0 |
| e-Skyactiv X | 258 |  25 |

Row Totals - Counts

The mechanism can be extended to include row totals. The technique is to create one dataframe for the body of the table and another for the row total, and then to combine them with bind_rows.

Note

The terminology of row-totals and column-totals can be a little confusing. In this notebook a row-total is a row that appears beneath the main body of a table and a column-total is a column that appears to the right of the main body of a table.

body <- cars |>
  select(engine, drive) |>
  group_by(engine, drive) |>
  summarise(n = n()) |>
  pivot_wider(names_from=drive, values_from=n, values_fill=0) 

rowtotals <- body |>
  # drop any grouping left over from the body before totalling
  ungroup() |>
  summarise(
    across(where(is.factor), ~"Total"),
    across(where(is.numeric), sum)
  ) 

bind_rows(body, rowtotals) |>
  knitr::kable(digits=0)
| engine       | 2WD | AWD |
|:-------------|----:|----:|
| e-Skyactiv G | 208 |   0 |
| e-Skyactiv X | 258 |  25 |
| Total        | 466 |  25 |
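
The code above labels the total row only if the row dimension is stored as a factor, because it selects the label column with where(is.factor). The following sketch, on made-up data invented for illustration, shows one possible variation that also handles character columns. It uses dplyr's count as a shorthand for group_by followed by summarise(n = n()).

```r
library(dplyr)
library(tidyr)

# Made-up data for illustration only; engine is a character column here.
toy <- tibble(
  engine = c("G", "G", "X", "X", "X"),
  drive  = c("2WD", "2WD", "2WD", "AWD", "AWD")
)

body <- toy |>
  count(engine, drive) |>
  pivot_wider(names_from = drive, values_from = n, values_fill = 0L)

rowtotals <- body |>
  summarise(
    # label every factor or character column, then sum the numeric ones
    across(where(~ is.factor(.x) || is.character(.x)), ~ "Total"),
    across(where(is.numeric), sum)
  )

counts_table <- bind_rows(body, rowtotals)
counts_table
```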

This is fine for simple additive counts, but for aggregates we need something better.

Row Totals - Aggregates

The problem with tables containing aggregates is that the aggregate of a column of aggregates is not the same as the aggregate you would obtain had you not previously grouped on the rows. An easy way to see this is to imagine dividing a collection of values up into two unequally sized groups. You can calculate the mean of each group but the mean of the whole collection is not equal in value to the mean of the two group means (it would be if the two groups were equally sized).
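
The effect is easy to check with a few made-up numbers. This sketch uses invented values in two groups of unequal size and compares the aggregate of the group aggregates with the aggregate of the whole collection, for both the mean and the median.

```r
# Made-up values split into two groups of unequal size.
small_group <- c(10, 20)
large_group <- c(30, 40, 50, 60)
everything  <- c(small_group, large_group)

# Aggregate of the group aggregates
mean_of_means     <- mean(c(mean(small_group), mean(large_group)))
median_of_medians <- median(c(median(small_group), median(large_group)))

# Aggregate of the whole collection
overall_mean   <- mean(everything)
overall_median <- median(everything)

c(mean_of_means = mean_of_means, overall_mean = overall_mean)
c(median_of_medians = median_of_medians, overall_median = overall_median)
```

In both cases the aggregate of the group aggregates is 30, while the aggregate of the whole collection is 35.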

To calculate aggregates correctly for row totals we need to go back to the cars dataframe and perform a separate grouping for each kind of total. Instead of grouping and summarising the body dataframe as we did above, we obtain the row totals by grouping the cars dataframe by the column dimension alone (here, drive). This gives one aggregate value per table column. The row total is then a single-row dataframe that holds an aggregate value in each of those columns and a blank value or text label in any column for which no aggregate can be calculated. We populate these blank or label columns using the mutate function.

body <- cars |>
  select(engine, drive, dpm=depreciation_per_month) |>
  group_by(engine, drive) |>
  summarise(dpm = median(dpm, na.rm=T)) |>
  pivot_wider(names_from=drive, values_from=dpm, values_fill=0) 

rowtotals <- cars |>
  select(drive, dpm=depreciation_per_month) |>
  group_by(drive) |>
  summarise(dpm = median(dpm, na.rm=T)) |>
  pivot_wider(names_from=drive, values_from=dpm, values_fill=0) |>
  mutate(engine = "Median")

bind_rows(body, rowtotals) |>
  knitr::kable(digits=0)
| engine       | 2WD | AWD |
|:-------------|----:|----:|
| e-Skyactiv G | 208 |   0 |
| e-Skyactiv X | 243 | 305 |
| Median       | 228 | 305 |

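You will notice that the aggregate value for the 2WD column is no longer the sum of the numbers above it; it is not even their mean or median. It is the median monthly depreciation of all 2WD cars in the cars dataframe, calculated without grouping by engine. Note also that the 0 shown for e-Skyactiv G in the AWD column comes from values_fill=0: there are no e-Skyactiv G AWD cars in the data, so it marks an empty group rather than a median of zero.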

Row, Column and Grand Totals - Aggregates

To create a table with the full set of totals we reduce (group-by and summarise) the cars dataframe to build each of the totals, so that we handle aggregates correctly, and then we stitch them together into a table.

body <- cars |>
  select(engine, drive, dpm=depreciation_per_month) |>
  group_by(engine, drive) |>
  summarise(dpm = median(dpm, na.rm=T)) |>
  pivot_wider(names_from=drive, values_from=dpm, values_fill=0) 

rowtotals <- cars |>
  select(drive, dpm=depreciation_per_month) |>
  group_by(drive) |>
  summarise(dpm = median(dpm, na.rm=T)) |>
  pivot_wider(names_from=drive, values_from=dpm, values_fill=0) |>
  mutate(engine = "Median")

coltotals <- cars |>
  select(engine, dpm=depreciation_per_month) |>
  group_by(engine) |>
  summarise(Median = median(dpm, na.rm=T))

grandtotal <- cars |>
  select(dpm=depreciation_per_month) |>
  summarise(engine="Median", Median = median(dpm, na.rm=T)) 

# Join on engine, the column that holds the row dimension of the table
all <- inner_join(
  bind_rows(body, rowtotals),
  bind_rows(coltotals, grandtotal),
  by = "engine"
)
knitr::kable(all, digits=0)
| engine       | 2WD | AWD | Median |
|:-------------|----:|----:|-------:|
| e-Skyactiv G | 208 |   0 |    208 |
| e-Skyactiv X | 243 | 305 |    244 |
| Median       | 228 | 305 |    234 |

This seems like quite a lot of work for a 3-row by 4-column table, but it is quite flexible. If you can see from the above code fragment how the table is constructed, there is no need to read the next section, which simply shows a breakdown of each stage of the table construction.
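
If the same kind of table is needed more than once, the steps can be wrapped in a function. The following is only a sketch: median_table is a hypothetical helper, not part of any package, and its argument names and internal column names (row_key, col_key, val, total_val) are assumptions made for this example. It takes column names as strings, leaves empty cells as NA rather than filling them with 0, and is demonstrated on made-up data.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: a two-way table of medians with a totals row and a
# totals column, each calculated from the ungrouped data.
median_table <- function(data, row_var, col_var, value_var, label = "Median") {
  # Copy the three columns under fixed names so the steps mirror the text.
  d <- data |>
    transmute(
      row_key = as.character(.data[[row_var]]),
      col_key = as.character(.data[[col_var]]),
      val     = .data[[value_var]]
    )

  body <- d |>
    group_by(row_key, col_key) |>
    summarise(val = median(val), .groups = "drop") |>
    pivot_wider(names_from = col_key, values_from = val)

  rowtotals <- d |>
    group_by(col_key) |>
    summarise(val = median(val), .groups = "drop") |>
    pivot_wider(names_from = col_key, values_from = val) |>
    mutate(row_key = label)

  coltotals <- d |>
    group_by(row_key) |>
    summarise(total_val = median(val), .groups = "drop")

  grandtotal <- d |>
    summarise(row_key = label, total_val = median(val))

  out <- inner_join(
    bind_rows(body, rowtotals),
    bind_rows(coltotals, grandtotal),
    by = "row_key"
  )
  # Restore the caller's column name and label the totals column.
  names(out)[names(out) == "row_key"]   <- row_var
  names(out)[names(out) == "total_val"] <- label
  out
}

# Made-up data for illustration only.
toy <- tibble(
  engine = c("G", "G", "G", "X", "X", "X", "X"),
  drive  = c("2WD", "2WD", "2WD", "2WD", "2WD", "AWD", "AWD"),
  dpm    = c(100, 110, 120, 200, 300, 400, 500)
)

tab <- median_table(toy, "engine", "drive", "dpm")
tab
```

The sketch assumes that none of the data values collide with the label or with the internal column names. The result can be passed to knitr::kable in the same way as the tables above.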

Assembly of the Complete Table - Bit by Bit

We begin by creating the body of the table.

body <- cars |>
  select(engine, drive, dpm=depreciation_per_month) |>
  group_by(engine, drive) |>
  summarise(dpm = median(dpm, na.rm=T)) |>
  pivot_wider(names_from=drive, values_from=dpm, values_fill=0) 
knitr::kable(body, digits=0)
| engine       | 2WD | AWD |
|:-------------|----:|----:|
| e-Skyactiv G | 208 |   0 |
| e-Skyactiv X | 243 | 305 |

Then we create the row totals as a single row dataframe with a value for each table column. It does not matter that the engine column appears as the last column in the dataframe.

rowtotals <- cars |>
  select(drive, dpm=depreciation_per_month) |>
  group_by(drive) |>
  summarise(dpm = median(dpm, na.rm=T)) |>
  pivot_wider(names_from=drive, values_from=dpm, values_fill=0) |>
  mutate(engine = "Median")
knitr::kable(rowtotals, digits=0)
| 2WD | AWD | engine |
|----:|----:|:-------|
| 228 | 305 | Median |

We create the column totals as a dataframe with a row for each row in the main body of the table. Each row contains a field (or fields) that allows us to find the table row with which to associate it, and a value to use as the column total.

coltotals <- cars |>
  select(engine, dpm=depreciation_per_month) |>
  group_by(engine) |>
  summarise(Median = median(dpm, na.rm=T))
knitr::kable(coltotals, digits=0)
| engine       | Median |
|:-------------|-------:|
| e-Skyactiv G |    208 |
| e-Skyactiv X |    244 |

The grand total is a dataframe with a single row and a column that will link it to the rowtotal row of the table and a value to use as the grand total.

grandtotal <- cars |>
  select(dpm=depreciation_per_month) |>
  summarise(engine="Median", Median = median(dpm, na.rm=T)) 
knitr::kable(grandtotal, digits=0)
| engine | Median |
|:-------|-------:|
| Median |    234 |

We bind the body and the rowtotals together into a single dataframe. The rowtotals appear as the final row of the dataframe. The values in the engine column are going to be used to position the column totals and the grand total correctly later.

bind_rows(body, rowtotals) |>
  knitr::kable(digits=0)
| engine       | 2WD | AWD |
|:-------------|----:|----:|
| e-Skyactiv G | 208 |   0 |
| e-Skyactiv X | 243 | 305 |
| Median       | 228 | 305 |

So we now have the main body of the table and the row totals in a single dataframe. Next we need to create a similar dataframe containing the column totals and the grand total.

We bind the coltotals and the grandtotal together into a single dataframe.

bind_rows(coltotals, grandtotal) |>
  knitr::kable(digits=0)
| engine       | Median |
|:-------------|-------:|
| e-Skyactiv G |    208 |
| e-Skyactiv X |    244 |
| Median       |    234 |

Finally, we glue the two dataframes together by joining them on their common engine column. This is the column that contains values of the row dimension of the table.

The join ensures that the columns from the body and rowtotals appear before the columns from the coltotals and grandtotal. So the effect that we get is that the Median column appears as the last table column.

# Join on engine, the column that holds the row dimension of the table
all <- inner_join(
  bind_rows(body, rowtotals),
  bind_rows(coltotals, grandtotal),
  by = "engine"
)
knitr::kable(all, digits=0)
| engine       | 2WD | AWD | Median |
|:-------------|----:|----:|-------:|
| e-Skyactiv G | 208 |   0 |    208 |
| e-Skyactiv X | 243 | 305 |    244 |
| Median       | 228 | 305 |    234 |

And that is our complete table.

The deconstruction makes it sound like more work than it really is.