Mazda 3 Price Data - Tabulation

Tabulation of car price data.

R
CarPrice
Mazda 3
Author

John Bates

Published

May 3, 2023

This notebook looks at methods of tabulating the data in the car price data dataset (see Mazda 3 Price Data - Preparation). The aim is to look at ways of presenting simple one or two dimensional summary tables with as little code as possible. All of the manipulation of data will be done using functions from the dplyr or tidyr packages, both of which are part of the tidyverse umbrella package.

I feel sure that there should be a simple R package for creating simple tables, with aggregated values, from a dataframe. But, at the time of writing, I could not find one that was simple or mature enough to persuade me to use it. Hopefully that will change soon.

In the meantime, I demonstrate one technique that uses the tidyverse packages to build a two-dimensional table that handles the calculation and display of aggregated data and includes the correct calculation of aggregate values for row, column and grand totals.

Note

In this notebook where we want to talk about the rows or columns of a displayed table we will use the terms row dimension and column dimension. Using dimension, rather than variable or column, will help make it clear that we are talking about the horizontal or vertical dimension of a displayed table rather than a property of a dataframe or underlying model.

The technique for creating a table is to build a dataframe that holds the entire table structure and then to display it using the appropriate mechanism. Output in this notebook is by means of the knitr::kable function.

We start by reading in the cars tibble that we created and saved.

library(tidyverse)
cars <- readRDS("../01.preparation/mazda3.rds")

For the purposes of this notebook we ensure that we have a value of monthly depreciation for all vehicles by setting it to 250 (a high value) for all vehicles for which we do not already have a value.

cars <-
  cars |>
  mutate(depreciation_per_month = ifelse(new_used == "New", 250, depreciation_per_month))

Suppose that we would like to create a simple ordered, one-dimensional listing that shows the count and median monthly depreciation for each combination of variant, and engine.

We choose the columns that we are interested in using the select function.

Then we use group_by followed by summarise to reduce the data. Our output will contain a row for each combination of the columns named in the call to group_by - these columns will be shown in the output.

In addition, any values that we define in the call to summarize will also show up in the output in columns in the listing immediately following the group-by columns. Their values will be calculated for each of the groups - each of the combinations of the group-by columns. The aggregating functions can make use of any of the columns that we have selected from the dataframe. The special function n() returns the current group size.

The arrange function orders the output into ascending or descending order based on the values of one or more columns.

cars |>
  select(depreciation_per_month, variant, engine) |>
  group_by(variant, engine) |>
  summarise(n=n(), median_dpm = median(depreciation_per_month)) |>
  arrange(median_dpm) |>
  knitr::kable(digits=0)
variant engine n median_dpm
SE-L e-Skyactiv G 41 161
SE-L Lux e-Skyactiv G 51 188
Sport Lux e-Skyactiv G 114 197
GT Sport e-Skyactiv G 53 214
SE-L Lux e-Skyactiv X 6 222
GT Sport e-Skyactiv X 28 233
Sport Lux e-Skyactiv X 29 239
GT Sport Tech e-Skyactiv X 15 249
GT Sport Tech e-Skyactiv G 42 250

If we want to create a two-dimensional table we need to use the pivot_wider function to declare which column of the dataframe will appear at the table column dimension and what values will be displayed in the cells of the table.

cars |>
  select(variant, engine, dpm=depreciation_per_month) |>
  group_by(variant, engine) |>
  summarise( dpm = median(dpm)) |>
  pivot_wider(names_from=engine, values_from=dpm) |>
  knitr::kable(digits=0)
variant e-Skyactiv G e-Skyactiv X
SE-L 161 NA
SE-L Lux 188 222
Sport Lux 197 239
GT Sport 214 233
GT Sport Tech 250 249

We have seen that without using pivot_wider we get a listing output that includes our two group-by columns (variant and engine). We use the names_from argument to pivot_wider to declare the dataframe column that will provide the values for the column dimension of our table. We use the values_from argument to pivot_wider to declare the dataframe column whose values will be displayed in the cells of the table. The remaining group-by column will provide the values for the row dimension of the table.

cars |>
  select(variant, engine, new_used, dpm=depreciation_per_month) |>
  group_by(variant, engine, new_used) |>
  summarise( dpm = median(dpm)) |>
  pivot_wider(names_from=engine, values_from=dpm) |>
  knitr::kable(digits=0)
variant new_used e-Skyactiv G e-Skyactiv X
SE-L Used 161 NA
SE-L Lux Used 187 213
SE-L Lux New 250 250
Sport Lux Used 194 229
Sport Lux New 250 250
GT Sport Used 207 233
GT Sport New 250 250
GT Sport Tech Used 223 249
GT Sport Tech New 250 NA

In the above table we have reduced the dataframe to a variant value on the rows, an engine name on the columns and the median monthly depreciation values for each combination of row value and column value in each of the cells.

In general, the procedure is to group the data by the two variables that are to make up the row and column values. Then, summarise the group by the statistic that you want to display in the table cells.

Specify the variable that you want to use to supply the column values of the table in the names_from argument to pivot_wider and the summary statistic that you want to use to supply the cell values of the table in the values_from argument to pivot_wider.

By default, missing values will be shown in the table as NA. If you want them to be replaced by an alternative value then the values_fill argument can supply the replacement.

If you need to represent more than two dimensions in the table you can combine two into the columns by providing them both to names_from but this only really works if the number of values are relatively small and the length of the values of the columns are also relatively short.

cars |>
  select(variant, engine, new_used, dpm=depreciation_per_month) |>
  group_by(variant, engine, new_used) |>
  summarise(n = n()) |>
  pivot_wider(
    names_from=c(engine,new_used), values_from=n, values_fill=0
  ) |>
  knitr::kable(digits=0)
variant e-Skyactiv G_Used e-Skyactiv G_New e-Skyactiv X_Used e-Skyactiv X_New
SE-L 41 0 0 0
SE-L Lux 50 1 5 1
Sport Lux 109 5 26 3
GT Sport 50 3 27 1
GT Sport Tech 39 3 15 0

A better solution is to use the longer variables as table rows and use the smallest and shortest variable as the table columns.

cars |>
  select(variant, engine, new_used, dpm=depreciation_per_month) |>
  group_by(variant, engine, new_used) |>
  summarise(n = n()) |>
  pivot_wider(names_from=new_used, values_from=n, values_fill=0) |>
  knitr::kable(digits=0)
variant engine Used New
SE-L e-Skyactiv G 41 0
SE-L Lux e-Skyactiv G 50 1
SE-L Lux e-Skyactiv X 5 1
Sport Lux e-Skyactiv G 109 5
Sport Lux e-Skyactiv X 26 3
GT Sport e-Skyactiv G 50 3
GT Sport e-Skyactiv X 27 1
GT Sport Tech e-Skyactiv G 39 3
GT Sport Tech e-Skyactiv X 15 0

If none of the arrangements looks good then revert to using the one-dimensional listing format with one or more summarised values that we started this notebook with.

A simple form of table caption can be provided in the call to knitr::kable.

cars |>
  select(variant, engine, new_used, dpm=depreciation_per_month) |>
  group_by(engine, new_used) |>
  summarise(n = n()) |>
  pivot_wider(names_from=new_used, values_from=n, values_fill=0) |>
  knitr::kable(digits=0, caption="A Count of New/Used by Engine")
A Count of New/Used by Engine
engine Used New
e-Skyactiv G 289 12
e-Skyactiv X 73 5

Row Totals - Counts

The mechanism can be extended to include row totals. The technique is to create a dataframe for the body of the table and another dataframe for the rowtotal and then to combine them with bind_rows.

Note

The terminology of row-totals and column-totals can be a little confusing. In this notebook a row-total is a row that appears beneath the main body of a table and a column-total is a column that appears to the right of the main body of a table.

body <- cars |>
  select(engine, new_used) |>
  group_by(engine, new_used) |>
  summarise(n = n()) |>
  pivot_wider(names_from=new_used, values_from=n, values_fill=0) 

rowtotals <- body |>
  group_by() |>
  summarise(
    across(where(is.factor), ~"Total"),
    across(where(is.numeric), sum)
  ) 
bind_rows(body, rowtotals) |>
knitr::kable(digits=0)
engine Used New
e-Skyactiv G 289 12
e-Skyactiv X 73 5
Total 362 17

This is fine for simple additive counts, but for aggregates we need something better.

Row Totals - Aggregates

The problem with tables containing aggregates is that the aggregate of a column of aggregates is not the same as the aggregate you would obtain had you not previously grouped on the rows. An easy way to see this is to imagine dividing a collection of values up into two unequally sized groups. You can calculate the mean of each group but the mean of the whole collection is not equal in value to the mean of the two group means (it would be if the two groups were equally sized).

To calculate aggregates correctly for row totals we need to start with the cars dataframe and perform a different grouping for each kind of aggregate. Instead of grouping and summarising the body dataframe as we did above, we obtain row totals by grouping the cars dataframe across the table columns. This allows us to calculate an aggregate value for each group (table column) and our rowtotal is constructed as a single row dataframe that contains an aggregate value in each column and a blank value or text label in any column for which no aggregate can be calculated. We populate these blank or label columns using the mutate function.

body <- cars |>
  select(engine, new_used, dpm=depreciation_per_month) |>
  group_by(engine, new_used) |>
  summarise(dpm = median(dpm, na.rm=T)) |>
  pivot_wider(names_from=new_used, values_from=dpm, values_fill=0) 

rowtotals <- cars |>
  select(new_used, dpm=depreciation_per_month) |>
  group_by(new_used) |>
  summarise(dpm = median(dpm, na.rm=T)) |>
  pivot_wider(names_from=new_used, values_from=dpm, values_fill=0) |>
  mutate(engine = "Median")

bind_rows(body, rowtotals) |>
knitr::kable(digits=0)
engine Used New
e-Skyactiv G 197 250
e-Skyactiv X 233 250
Median 202 250

You will notice that the aggregate value for the Used column is no longer the sum of the numbers above it - it is not even the mean or median of them. It is the median of the values of the dataframe Used column but not grouped by engine.

Row, Column and Grand Totals - Aggregates

To create a table with the full set of totals we reduce (group-by and summarise) the cars dataframe to build each of the totals, so that we handle aggregates correctly, and then we stitch them together into a table.

body <- cars |>
  select(engine, new_used, dpm=depreciation_per_month) |>
  group_by(engine, new_used) |>
  summarise(dpm = median(dpm, na.rm=T)) |>
  pivot_wider(names_from=new_used, values_from=dpm, values_fill=0) 

rowtotals <- cars |>
  select(new_used, dpm=depreciation_per_month) |>
  group_by(new_used) |>
  summarise(dpm = median(dpm, na.rm=T)) |>
  pivot_wider(names_from=new_used, values_from=dpm, values_fill=0) |>
  mutate(engine = "Median")

coltotals <- cars |>
  select(engine, dpm=depreciation_per_month) |>
  group_by(engine) |>
  summarise(Median = median(dpm, na.rm=T))

grandtotal <- cars |>
  select(dpm=depreciation_per_month) |>
  summarise(engine="Median", Median = median(dpm, na.rm=T)) 

all <-inner_join(
  bind_rows(body, rowtotals),
  bind_rows(coltotals, grandtotal)
)
knitr::kable(all, digits=0)
engine Used New Median
e-Skyactiv G 197 250 197
e-Skyactiv X 233 250 238
Median 202 250 202

This seems like quite a lot of work for a 3-row by 4-column table but it is quite flexible. If you can see from the above code fragment how the table is constructed there is no need to read the next section which simply shows a breakdown of each stage of the table construction.

Assembly of the Complete Table - Bit by Bit

We begin by creating the body of the table.

Show the code
body <- cars |>
  select(engine, new_used, dpm=depreciation_per_month) |>
  group_by(engine, new_used) |>
  summarise(dpm = median(dpm, na.rm=T)) |>
  pivot_wider(names_from=new_used, values_from=dpm, values_fill=0) 
knitr::kable(body, digits=0)
engine Used New
e-Skyactiv G 197 250
e-Skyactiv X 233 250

Then we create the row totals as a single row dataframe with a value for each table column. It does not matter that the engine column appears as the last column in the dataframe.

Show the code
rowtotals <- cars |>
  select(new_used, dpm=depreciation_per_month) |>
  group_by(new_used) |>
  summarise(dpm = median(dpm, na.rm=T)) |>
  pivot_wider(names_from=new_used, values_from=dpm, values_fill=0) |>
  mutate(engine = "Median")
knitr::kable(rowtotals, digits=0)
Used New engine
202 250 Median

We create the column totals as a dataframe with a row for each row in the main body of the table. Each row contains a field (or fields) that allows us to find the table row with which to associate it, and a value to use as the column total.

Show the code
coltotals <- cars |>
  select(engine, dpm=depreciation_per_month) |>
  group_by(engine) |>
  summarise(Median = median(dpm, na.rm=T))
knitr::kable(coltotals, digits=0)
engine Median
e-Skyactiv G 197
e-Skyactiv X 238

The grand total is a dataframe with a single row and a column that will link it to the rowtotal row of the table and a value to use as the grand total.

Show the code
grandtotal <- cars |>
  select(dpm=depreciation_per_month) |>
  summarise(engine="Median", Median = median(dpm, na.rm=T)) 
knitr::kable(grandtotal, digits=0)
engine Median
Median 202

We bind the body and the rowtotals together into a single dataframe. The rowtotals appear as the final row of the dataframe. The values in the engine column are going to be used to position the column totals and the grand total correctly later.

Show the code
bind_rows(body, rowtotals) |>
knitr::kable(digits=0)
engine Used New
e-Skyactiv G 197 250
e-Skyactiv X 233 250
Median 202 250

So we now have the main body of the table and the row totals in a single dataframe. Next we need to create a similar dataframe containing the column totals and the grand total.

We bind the coltotals and the grandtotal together into a single dataframe.

Show the code
bind_rows(coltotals, grandtotal) |>
knitr::kable(digits=0)
engine Median
e-Skyactiv G 197
e-Skyactiv X 238
Median 202

Finally, we glue the two dataframes together by joining them on their common engine column. This is the column that contains values of the row dimension of the table.

The join ensures that the columns from the body and rowtotals appear before the columns from the coltotals and grandtotal. So the effect that we get is that the Median column appears as the last table column.

Show the code
all <-inner_join(
  bind_rows(body, rowtotals),
  bind_rows(coltotals, grandtotal)
)
knitr::kable(all, digits=0)
engine Used New Median
e-Skyactiv G 197 250 197
e-Skyactiv X 233 250 238
Median 202 250 202

And that is our complete table.

The deconstruction makes is sound like more work than it really is.