Mazda CX-30 Price Data - Tabulation

Tabulation of car price data.

CarPrice

Mazda CX-30

Author

John Bates

Published

April 14, 2023

This notebook looks at methods of tabulating the data in the car price data dataset (see Car Price Data - Preparation). The aim is to look at ways of presenting simple one or two dimensional summary tables with as little code as possible. All of the manipulation of data will be done using functions from the dplyr or tidyr packages, both of which are part of the tidyverse umbrella package.

I feel sure that there should be a simple R package for creating simple tables, with aggregated values, from a dataframe. But, at the time of writing, I could not find one that was simple or mature enough to persuade me to use it. Hopefully that will change soon.

In the meantime, I demonstrate one technique that uses the tidyverse packages to build a two-dimensional table that handles the calculation and display of aggregated data and includes the correct calculation of aggregate values for row, column and grand totals.

Note

In this notebook where we want to talk about the rows or columns of a displayed table we will use the terms row dimension and column dimension. Using dimension, rather than variable or column, will help make it clear that we are talking about the horizontal or vertical dimension of a displayed table rather than a property of a dataframe or underlying model.

The technique for creating a table is to build a dataframe that holds the entire table structure and then to display it using the appropriate mechanism. Output in this notebook is by means of the knitr::kable function.

We start by reading in the cars tibble that we created and saved.

library(tidyverse)
cars <- readRDS("../01.preparation/mazda.rds")

And then restrict it to just Used cars, and cars for which we were able to calculate the monthly depreciation.

cars <- cars |>
  filter(new_used == "Used") |>
  filter(!is.na(depreciation_per_month))

Suppose that we would like to create a simple ordered, one-dimensional listing that shows the count and median monthly depreciation for each combination of variant, engine, and drive.

We choose the columns that we are interested in using the select function.

Then we use group_by followed by summarise to reduce the data. Our output will contain a row for each combination of the columns named in the call to group_by - these columns will be shown in the output.

In addition, any values that we define in the call to summarize will also show up in the output in columns in the listing immediately following the group-by columns. Their values will be calculated for each of the groups - each of the combinations of the group-by columns. The aggregating functions can make use of any of the columns that we have selected from the dataframe. The special function n() returns the current group size.

The arrange function orders the output into ascending or descending order based on the values of one or more columns.

cars |>
  select(depreciation_per_month, variant, engine, drive) |>
  group_by(variant, engine, drive) |>
  summarise(n=n(), median_dpm = median(depreciation_per_month)) |>
  arrange(median_dpm) |>
  knitr::kable(digits=0)

variant	engine	drive	n	median_dpm
SE-L Lux	e-Skyactiv G	2WD	48	186
SE-L	e-Skyactiv G	2WD	32	195
SE-L Lux	e-Skyactiv X	2WD	27	218
Sport Lux	e-Skyactiv G	2WD	88	223
GT Sport	e-Skyactiv G	2WD	27	233
Sport Lux	e-Skyactiv X	2WD	89	241
GT Sport Tech	e-Skyactiv G	2WD	13	252
GT Sport	e-Skyactiv X	2WD	60	253
GT Sport Tech	e-Skyactiv X	2WD	82	285
GT Sport	e-Skyactiv X	AWD	12	304
GT Sport Tech	e-Skyactiv X	AWD	13	306

If we want to create a two-dimensional table we need to use the pivot_wider function to declare which column of the dataframe will appear at the table column dimension and what values will be displayed in the cells of the table.

cars |>
  select(variant, engine, dpm=depreciation_per_month) |>
  group_by(variant, engine) |>
  summarise( dpm = median(dpm)) |>
  pivot_wider(names_from=engine, values_from=dpm) |>
  knitr::kable(digits=0)

variant	e-Skyactiv G	e-Skyactiv X
SE-L	195	NA
SE-L Lux	186	218
Sport Lux	223	241
GT Sport	233	261
GT Sport Tech	252	291

We have seen that without using pivot_wider we get a listing output that includes our two group-by columns (variant and engine). We use the names_from argument to pivot_wider to declare the dataframe column that will provide the values for the column dimension of our table. We use the values_from argument to pivot_wider to declare the dataframe column whose values will be displayed in the cells of the table. The remaining group-by column will provide the values for the row dimension of the table.

cars |>
  select(variant, engine, dpm=depreciation_per_month) |>
  group_by(variant, engine) |>
  summarise( dpm = median(dpm)) |>
  pivot_wider(names_from=engine, values_from=dpm) |>
  knitr::kable(digits=0)

variant	e-Skyactiv G	e-Skyactiv X
SE-L	195	NA
SE-L Lux	186	218
Sport Lux	223	241
GT Sport	233	261
GT Sport Tech	252	291

In the above table we have reduced the dataframe to a variant value on the rows, an engine name on the columns and the median monthly depreciation values for each combination of row value and column value in each of the cells.

In general, the procedure is to group the data by the two variables that are to make up the row and column values. Then, summarise the group by the statistic that you want to display in the table cells.

Specify the variable that you want to use to supply the column values of the table in the names_from argument to pivot_wider and the summary statistic that you want to use to supply the cell values of the table in the values_from argument to pivot_wider.

By default, missing values will be shown in the table as NA. If you want them to be replaced by an alternative value then the values_fill argument can supply the replacement.

If you need to represent more than two dimensions in the table you can combine two into the columns by providing them both to names_from but this only really works if the number of values are relatively small and the length of the values of the columns are also relatively short.

cars |>
  select(variant, engine, drive, dpm=depreciation_per_month) |>
  group_by(variant, engine, drive) |>
  summarise(n = n()) |>
  pivot_wider(
    names_from=c(engine,drive), values_from=n, values_fill=0
  ) |>
  knitr::kable(digits=0)

variant	e-Skyactiv G_2WD	e-Skyactiv X_2WD	e-Skyactiv X_AWD
SE-L	32	0	0
SE-L Lux	48	27	0
Sport Lux	88	89	0
GT Sport	27	60	12
GT Sport Tech	13	82	13

A better solution is to use the longer variables as table rows and use the smallest and shortest variable as the table columns.

cars |>
  select(variant, engine, drive, dpm=depreciation_per_month) |>
  group_by(variant, engine, drive) |>
  summarise(n = n()) |>
  pivot_wider(names_from=drive, values_from=n, values_fill=0) |>
  knitr::kable(digits=0)

variant	engine	2WD	AWD
SE-L	e-Skyactiv G	32	0
SE-L Lux	e-Skyactiv G	48	0
SE-L Lux	e-Skyactiv X	27	0
Sport Lux	e-Skyactiv G	88	0
Sport Lux	e-Skyactiv X	89	0
GT Sport	e-Skyactiv G	27	0
GT Sport	e-Skyactiv X	60	12
GT Sport Tech	e-Skyactiv G	13	0
GT Sport Tech	e-Skyactiv X	82	13

If none of the arrangements looks good then revert to using the one-dimensional listing format with one or more summarised values that we started this notebook with.

A simple form of table caption can be provided in the call to knitr::kable.

cars |>
  select(variant, engine, drive, dpm=depreciation_per_month) |>
  group_by(engine, drive) |>
  summarise(n = n()) |>
  pivot_wider(names_from=drive, values_from=n, values_fill=0) |>
  knitr::kable(digits=0, caption="A Count of Drive by Engine")

A Count of Drive by Engine
engine	2WD	AWD
e-Skyactiv G	208	0
e-Skyactiv X	258	25

Row Totals - Counts

The mechanism can be extended to include row totals. The technique is to create a dataframe for the body of the table and another dataframe for the rowtotal and then to combine them with bind_rows.

Note

The terminology of row-totals and column-totals can be a little confusing. In this notebook a row-total is a row that appears beneath the main body of a table and a column-total is a column that appears to the right of the main body of a table.

body <- cars |>
  select(engine, drive) |>
  group_by(engine, drive) |>
  summarise(n = n()) |>
  pivot_wider(names_from=drive, values_from=n, values_fill=0) 

rowtotals <- body |>
  group_by() |>
  summarise(
    across(where(is.factor), ~"Total"),
    across(where(is.numeric), sum)
  ) 
bind_rows(body, rowtotals) |>
knitr::kable(digits=0)

engine	2WD	AWD
e-Skyactiv G	208	0
e-Skyactiv X	258	25
Total	466	25

This is fine for simple additive counts, but for aggregates we need something better.

Row Totals - Aggregates

The problem with tables containing aggregates is that the aggregate of a column of aggregates is not the same as the aggregate you would obtain had you not previously grouped on the rows. An easy way to see this is to imagine dividing a collection of values up into two unequally sized groups. You can calculate the mean of each group but the mean of the whole collection is not equal in value to the mean of the two group means (it would be if the two groups were equally sized).

To calculate aggregates correctly for row totals we need to start with the cars dataframe and perform a different grouping for each kind of aggregate. Instead of grouping and summarising the body dataframe as we did above, we obtain row totals by grouping the cars dataframe across the table columns. This allows us to calculate an aggregate value for each group (table column) and our rowtotal is constructed as a single row dataframe that contains an aggregate value in each column and a blank value or text label in any column for which no aggregate can be calculated. We populate these blank or label columns using the mutate function.

body <- cars |>
  select(engine, drive, dpm=depreciation_per_month) |>
  group_by(engine, drive) |>
  summarise(dpm = median(dpm, na.rm=T)) |>
  pivot_wider(names_from=drive, values_from=dpm, values_fill=0) 

rowtotals <- cars |>
  select(drive, dpm=depreciation_per_month) |>
  group_by(drive) |>
  summarise(dpm = median(dpm, na.rm=T)) |>
  pivot_wider(names_from=drive, values_from=dpm, values_fill=0) |>
  mutate(engine = "Median")

bind_rows(body, rowtotals) |>
knitr::kable(digits=0)

engine	2WD	AWD
e-Skyactiv G	208	0
e-Skyactiv X	243	305
Median	228	305

You will notice that the Aggregate value for the 2WD column is no longer the sum of the numbers above it - it is not even the mean or median of them. It is the median of the values of the dataframe 2WD column but not grouped by engine.

Row, Column and Grand Totals - Aggregates

To create a table with the full set of totals we reduce (group-by and summarise) the cars dataframe to build each of the totals, so that we handle aggregates correctly, and then we stitch them together into a table.

body <- cars |>
  select(engine, drive, dpm=depreciation_per_month) |>
  group_by(engine, drive) |>
  summarise(dpm = median(dpm, na.rm=T)) |>
  pivot_wider(names_from=drive, values_from=dpm, values_fill=0) 

rowtotals <- cars |>
  select(drive, dpm=depreciation_per_month) |>
  group_by(drive) |>
  summarise(dpm = median(dpm, na.rm=T)) |>
  pivot_wider(names_from=drive, values_from=dpm, values_fill=0) |>
  mutate(engine = "Median")

coltotals <- cars |>
  select(engine, dpm=depreciation_per_month) |>
  group_by(engine) |>
  summarise(Median = median(dpm, na.rm=T))

grandtotal <- cars |>
  select(dpm=depreciation_per_month) |>
  summarise(engine="Median", Median = median(dpm, na.rm=T)) 

all <-inner_join(
  bind_rows(body, rowtotals),
  bind_rows(coltotals, grandtotal)
)
knitr::kable(all, digits=0)

engine	2WD	AWD	Median
e-Skyactiv G	208	0	208
e-Skyactiv X	243	305	244
Median	228	305	234

This seems like quite a lot of work for a 3-row by 4-column table but it is quite flexible. If you can see from the above code fragment how the table is constructed there is no need to read the next section which simply shows a breakdown of each stage of the table construction.

Assembly of the Complete Table - Bit by Bit

We begin by creating the body of the table.

Show the code

body <- cars |>
  select(engine, drive, dpm=depreciation_per_month) |>
  group_by(engine, drive) |>
  summarise(dpm = median(dpm, na.rm=T)) |>
  pivot_wider(names_from=drive, values_from=dpm, values_fill=0) 
knitr::kable(body, digits=0)

engine	2WD	AWD
e-Skyactiv G	208	0
e-Skyactiv X	243	305

Then we create the row totals as a single row dataframe with a value for each table column. It does not matter that the engine column appears as the last column in the dataframe.

Show the code

rowtotals <- cars |>
  select(drive, dpm=depreciation_per_month) |>
  group_by(drive) |>
  summarise(dpm = median(dpm, na.rm=T)) |>
  pivot_wider(names_from=drive, values_from=dpm, values_fill=0) |>
  mutate(engine = "Median")
knitr::kable(rowtotals, digits=0)

2WD	AWD	engine
228	305	Median

We create the column totals as a dataframe with a row for each row in the main body of the table. Each row contains a field (or fields) that allows us to find the table row with which to associate it, and a value to use as the column total.

Show the code

coltotals <- cars |>
  select(engine, dpm=depreciation_per_month) |>
  group_by(engine) |>
  summarise(Median = median(dpm, na.rm=T))
knitr::kable(coltotals, digits=0)

engine	Median
e-Skyactiv G	208
e-Skyactiv X	244

The grand total is a dataframe with a single row and a column that will link it to the rowtotal row of the table and a value to use as the grand total.

Show the code

grandtotal <- cars |>
  select(dpm=depreciation_per_month) |>
  summarise(engine="Median", Median = median(dpm, na.rm=T)) 
knitr::kable(grandtotal, digits=0)

engine	Median
Median	234

We bind the body and the rowtotals together into a single dataframe. The rowtotals appear as the final row of the dataframe. The values in the engine column are going to be used to position the column totals and the grand total correctly later.

Show the code

bind_rows(body, rowtotals) |>
knitr::kable(digits=0)

engine	2WD	AWD
e-Skyactiv G	208	0
e-Skyactiv X	243	305
Median	228	305

So we now have the main body of the table and the row totals in a single dataframe. Next we need to create a similar dataframe containing the column totals and the grand total.

We bind the coltotals and the grandtotal together into a single dataframe.

Show the code

bind_rows(coltotals, grandtotal) |>
knitr::kable(digits=0)

engine	Median
e-Skyactiv G	208
e-Skyactiv X	244
Median	234

Finally, we glue the two dataframes together by joining them on their common engine column. This is the column that contains values of the row dimension of the table.

The join ensures that the columns from the body and rowtotals appear before the columns from the coltotals and grandtotal. So the effect that we get is that the Median column appears as the last table column.

Show the code

all <-inner_join(
  bind_rows(body, rowtotals),
  bind_rows(coltotals, grandtotal)
)
knitr::kable(all, digits=0)

engine	2WD	AWD	Median
e-Skyactiv G	208	0	208
e-Skyactiv X	243	305	244
Median	228	305	234

And that is our complete table.

The deconstruction makes is sound like more work than it really is.