library(tidyverse)
<- readRDS("../01.preparation/mazda3.rds") cars
Mazda 3 Price Data - Tabulation
Tabulation of car price data.
This notebook looks at methods of tabulating the data in the car price data dataset (see Mazda 3 Price Data - Preparation). The aim is to look at ways of presenting simple one or two dimensional summary tables with as little code as possible. All of the manipulation of data will be done using functions from the dplyr
or tidyr
packages, both of which are part of the tidyverse
umbrella package.
I feel sure that there should be a simple R package for creating simple tables, with aggregated values, from a dataframe. But, at the time of writing, I could not find one that was simple or mature enough to persuade me to use it. Hopefully that will change soon.
In the meantime, I demonstrate one technique that uses the tidyverse
packages to build a two-dimensional table that handles the calculation and display of aggregated data and includes the correct calculation of aggregate values for row, column and grand totals.
In this notebook where we want to talk about the rows or columns of a displayed table we will use the terms row dimension and column dimension. Using dimension, rather than variable or column, will help make it clear that we are talking about the horizontal or vertical dimension of a displayed table rather than a property of a dataframe or underlying model.
The technique for creating a table is to build a dataframe that holds the entire table structure and then to display it using the appropriate mechanism. Output in this notebook is by means of the knitr::kable
function.
We start by reading in the cars tibble that we created and saved.
For the purposes of this notebook we ensure that we have a value of monthly depreciation for all vehicles by setting it to 250 (a high value) for all vehicles for which we do not already have a value.
<-
cars |>
cars mutate(depreciation_per_month = ifelse(new_used == "New", 250, depreciation_per_month))
Suppose that we would like to create a simple ordered, one-dimensional listing that shows the count and median monthly depreciation for each combination of variant, and engine.
We choose the columns that we are interested in using the select
function.
Then we use group_by
followed by summarise
to reduce the data. Our output will contain a row for each combination of the columns named in the call to group_by
- these columns will be shown in the output.
In addition, any values that we define in the call to summarize
will also show up in the output in columns in the listing immediately following the group-by columns. Their values will be calculated for each of the groups - each of the combinations of the group-by columns. The aggregating functions can make use of any of the columns that we have selected from the dataframe. The special function n()
returns the current group size.
The arrange
function orders the output into ascending or descending order based on the values of one or more columns.
|>
cars select(depreciation_per_month, variant, engine) |>
group_by(variant, engine) |>
summarise(n=n(), median_dpm = median(depreciation_per_month)) |>
arrange(median_dpm) |>
::kable(digits=0) knitr
variant | engine | n | median_dpm |
---|---|---|---|
SE-L | e-Skyactiv G | 41 | 161 |
SE-L Lux | e-Skyactiv G | 51 | 188 |
Sport Lux | e-Skyactiv G | 114 | 197 |
GT Sport | e-Skyactiv G | 53 | 214 |
SE-L Lux | e-Skyactiv X | 6 | 222 |
GT Sport | e-Skyactiv X | 28 | 233 |
Sport Lux | e-Skyactiv X | 29 | 239 |
GT Sport Tech | e-Skyactiv X | 15 | 249 |
GT Sport Tech | e-Skyactiv G | 42 | 250 |
If we want to create a two-dimensional table we need to use the pivot_wider
function to declare which column of the dataframe will appear at the table column dimension and what values will be displayed in the cells of the table.
|>
cars select(variant, engine, dpm=depreciation_per_month) |>
group_by(variant, engine) |>
summarise( dpm = median(dpm)) |>
pivot_wider(names_from=engine, values_from=dpm) |>
::kable(digits=0) knitr
variant | e-Skyactiv G | e-Skyactiv X |
---|---|---|
SE-L | 161 | NA |
SE-L Lux | 188 | 222 |
Sport Lux | 197 | 239 |
GT Sport | 214 | 233 |
GT Sport Tech | 250 | 249 |
We have seen that without using pivot_wider
we get a listing output that includes our two group-by columns (variant and engine). We use the names_from
argument to pivot_wider
to declare the dataframe column that will provide the values for the column dimension of our table. We use the values_from
argument to pivot_wider
to declare the dataframe column whose values will be displayed in the cells of the table. The remaining group-by column will provide the values for the row dimension of the table.
|>
cars select(variant, engine, new_used, dpm=depreciation_per_month) |>
group_by(variant, engine, new_used) |>
summarise( dpm = median(dpm)) |>
pivot_wider(names_from=engine, values_from=dpm) |>
::kable(digits=0) knitr
variant | new_used | e-Skyactiv G | e-Skyactiv X |
---|---|---|---|
SE-L | Used | 161 | NA |
SE-L Lux | Used | 187 | 213 |
SE-L Lux | New | 250 | 250 |
Sport Lux | Used | 194 | 229 |
Sport Lux | New | 250 | 250 |
GT Sport | Used | 207 | 233 |
GT Sport | New | 250 | 250 |
GT Sport Tech | Used | 223 | 249 |
GT Sport Tech | New | 250 | NA |
In the above table we have reduced the dataframe to a variant value on the rows, an engine name on the columns and the median monthly depreciation values for each combination of row value and column value in each of the cells.
In general, the procedure is to group the data by the two variables that are to make up the row and column values. Then, summarise the group by the statistic that you want to display in the table cells.
Specify the variable that you want to use to supply the column values of the table in the names_from
argument to pivot_wider
and the summary statistic that you want to use to supply the cell values of the table in the values_from
argument to pivot_wider
.
By default, missing values will be shown in the table as NA
. If you want them to be replaced by an alternative value then the values_fill
argument can supply the replacement.
If you need to represent more than two dimensions in the table you can combine two into the columns by providing them both to names_from
but this only really works if the number of values are relatively small and the length of the values of the columns are also relatively short.
|>
cars select(variant, engine, new_used, dpm=depreciation_per_month) |>
group_by(variant, engine, new_used) |>
summarise(n = n()) |>
pivot_wider(
names_from=c(engine,new_used), values_from=n, values_fill=0
|>
) ::kable(digits=0) knitr
variant | e-Skyactiv G_Used | e-Skyactiv G_New | e-Skyactiv X_Used | e-Skyactiv X_New |
---|---|---|---|---|
SE-L | 41 | 0 | 0 | 0 |
SE-L Lux | 50 | 1 | 5 | 1 |
Sport Lux | 109 | 5 | 26 | 3 |
GT Sport | 50 | 3 | 27 | 1 |
GT Sport Tech | 39 | 3 | 15 | 0 |
A better solution is to use the longer variables as table rows and use the smallest and shortest variable as the table columns.
|>
cars select(variant, engine, new_used, dpm=depreciation_per_month) |>
group_by(variant, engine, new_used) |>
summarise(n = n()) |>
pivot_wider(names_from=new_used, values_from=n, values_fill=0) |>
::kable(digits=0) knitr
variant | engine | Used | New |
---|---|---|---|
SE-L | e-Skyactiv G | 41 | 0 |
SE-L Lux | e-Skyactiv G | 50 | 1 |
SE-L Lux | e-Skyactiv X | 5 | 1 |
Sport Lux | e-Skyactiv G | 109 | 5 |
Sport Lux | e-Skyactiv X | 26 | 3 |
GT Sport | e-Skyactiv G | 50 | 3 |
GT Sport | e-Skyactiv X | 27 | 1 |
GT Sport Tech | e-Skyactiv G | 39 | 3 |
GT Sport Tech | e-Skyactiv X | 15 | 0 |
If none of the arrangements looks good then revert to using the one-dimensional listing format with one or more summarised values that we started this notebook with.
A simple form of table caption can be provided in the call to knitr::kable
.
|>
cars select(variant, engine, new_used, dpm=depreciation_per_month) |>
group_by(engine, new_used) |>
summarise(n = n()) |>
pivot_wider(names_from=new_used, values_from=n, values_fill=0) |>
::kable(digits=0, caption="A Count of New/Used by Engine") knitr
engine | Used | New |
---|---|---|
e-Skyactiv G | 289 | 12 |
e-Skyactiv X | 73 | 5 |
Row Totals - Counts
The mechanism can be extended to include row totals. The technique is to create a dataframe for the body of the table and another dataframe for the rowtotal and then to combine them with bind_rows
.
The terminology of row-totals and column-totals can be a little confusing. In this notebook a row-total is a row that appears beneath the main body of a table and a column-total is a column that appears to the right of the main body of a table.
<- cars |>
body select(engine, new_used) |>
group_by(engine, new_used) |>
summarise(n = n()) |>
pivot_wider(names_from=new_used, values_from=n, values_fill=0)
<- body |>
rowtotals group_by() |>
summarise(
across(where(is.factor), ~"Total"),
across(where(is.numeric), sum)
) bind_rows(body, rowtotals) |>
::kable(digits=0) knitr
engine | Used | New |
---|---|---|
e-Skyactiv G | 289 | 12 |
e-Skyactiv X | 73 | 5 |
Total | 362 | 17 |
This is fine for simple additive counts, but for aggregates we need something better.
Row Totals - Aggregates
The problem with tables containing aggregates is that the aggregate of a column of aggregates is not the same as the aggregate you would obtain had you not previously grouped on the rows. An easy way to see this is to imagine dividing a collection of values up into two unequally sized groups. You can calculate the mean of each group but the mean of the whole collection is not equal in value to the mean of the two group means (it would be if the two groups were equally sized).
To calculate aggregates correctly for row totals we need to start with the cars dataframe and perform a different grouping for each kind of aggregate. Instead of grouping and summarising the body dataframe as we did above, we obtain row totals by grouping the cars dataframe across the table columns. This allows us to calculate an aggregate value for each group (table column) and our rowtotal is constructed as a single row dataframe that contains an aggregate value in each column and a blank value or text label in any column for which no aggregate can be calculated. We populate these blank or label columns using the mutate
function.
<- cars |>
body select(engine, new_used, dpm=depreciation_per_month) |>
group_by(engine, new_used) |>
summarise(dpm = median(dpm, na.rm=T)) |>
pivot_wider(names_from=new_used, values_from=dpm, values_fill=0)
<- cars |>
rowtotals select(new_used, dpm=depreciation_per_month) |>
group_by(new_used) |>
summarise(dpm = median(dpm, na.rm=T)) |>
pivot_wider(names_from=new_used, values_from=dpm, values_fill=0) |>
mutate(engine = "Median")
bind_rows(body, rowtotals) |>
::kable(digits=0) knitr
engine | Used | New |
---|---|---|
e-Skyactiv G | 197 | 250 |
e-Skyactiv X | 233 | 250 |
Median | 202 | 250 |
You will notice that the aggregate value for the Used column is no longer the sum of the numbers above it - it is not even the mean or median of them. It is the median of the values of the dataframe Used column but not grouped by engine.
Row, Column and Grand Totals - Aggregates
To create a table with the full set of totals we reduce (group-by and summarise) the cars dataframe to build each of the totals, so that we handle aggregates correctly, and then we stitch them together into a table.
<- cars |>
body select(engine, new_used, dpm=depreciation_per_month) |>
group_by(engine, new_used) |>
summarise(dpm = median(dpm, na.rm=T)) |>
pivot_wider(names_from=new_used, values_from=dpm, values_fill=0)
<- cars |>
rowtotals select(new_used, dpm=depreciation_per_month) |>
group_by(new_used) |>
summarise(dpm = median(dpm, na.rm=T)) |>
pivot_wider(names_from=new_used, values_from=dpm, values_fill=0) |>
mutate(engine = "Median")
<- cars |>
coltotals select(engine, dpm=depreciation_per_month) |>
group_by(engine) |>
summarise(Median = median(dpm, na.rm=T))
<- cars |>
grandtotal select(dpm=depreciation_per_month) |>
summarise(engine="Median", Median = median(dpm, na.rm=T))
<-inner_join(
all bind_rows(body, rowtotals),
bind_rows(coltotals, grandtotal)
)::kable(all, digits=0) knitr
engine | Used | New | Median |
---|---|---|---|
e-Skyactiv G | 197 | 250 | 197 |
e-Skyactiv X | 233 | 250 | 238 |
Median | 202 | 250 | 202 |
This seems like quite a lot of work for a 3-row by 4-column table but it is quite flexible. If you can see from the above code fragment how the table is constructed there is no need to read the next section which simply shows a breakdown of each stage of the table construction.
Assembly of the Complete Table - Bit by Bit
We begin by creating the body of the table.
Show the code
<- cars |>
body select(engine, new_used, dpm=depreciation_per_month) |>
group_by(engine, new_used) |>
summarise(dpm = median(dpm, na.rm=T)) |>
pivot_wider(names_from=new_used, values_from=dpm, values_fill=0)
::kable(body, digits=0) knitr
engine | Used | New |
---|---|---|
e-Skyactiv G | 197 | 250 |
e-Skyactiv X | 233 | 250 |
Then we create the row totals as a single row dataframe with a value for each table column. It does not matter that the engine column appears as the last column in the dataframe.
Show the code
<- cars |>
rowtotals select(new_used, dpm=depreciation_per_month) |>
group_by(new_used) |>
summarise(dpm = median(dpm, na.rm=T)) |>
pivot_wider(names_from=new_used, values_from=dpm, values_fill=0) |>
mutate(engine = "Median")
::kable(rowtotals, digits=0) knitr
Used | New | engine |
---|---|---|
202 | 250 | Median |
We create the column totals as a dataframe with a row for each row in the main body of the table. Each row contains a field (or fields) that allows us to find the table row with which to associate it, and a value to use as the column total.
Show the code
<- cars |>
coltotals select(engine, dpm=depreciation_per_month) |>
group_by(engine) |>
summarise(Median = median(dpm, na.rm=T))
::kable(coltotals, digits=0) knitr
engine | Median |
---|---|
e-Skyactiv G | 197 |
e-Skyactiv X | 238 |
The grand total is a dataframe with a single row and a column that will link it to the rowtotal row of the table and a value to use as the grand total.
Show the code
<- cars |>
grandtotal select(dpm=depreciation_per_month) |>
summarise(engine="Median", Median = median(dpm, na.rm=T))
::kable(grandtotal, digits=0) knitr
engine | Median |
---|---|
Median | 202 |
We bind the body and the rowtotals together into a single dataframe. The rowtotals appear as the final row of the dataframe. The values in the engine column are going to be used to position the column totals and the grand total correctly later.
Show the code
bind_rows(body, rowtotals) |>
::kable(digits=0) knitr
engine | Used | New |
---|---|---|
e-Skyactiv G | 197 | 250 |
e-Skyactiv X | 233 | 250 |
Median | 202 | 250 |
So we now have the main body of the table and the row totals in a single dataframe. Next we need to create a similar dataframe containing the column totals and the grand total.
We bind the coltotals and the grandtotal together into a single dataframe.
Show the code
bind_rows(coltotals, grandtotal) |>
::kable(digits=0) knitr
engine | Median |
---|---|
e-Skyactiv G | 197 |
e-Skyactiv X | 238 |
Median | 202 |
Finally, we glue the two dataframes together by joining them on their common engine
column. This is the column that contains values of the row dimension of the table.
The join ensures that the columns from the body and rowtotals appear before the columns from the coltotals and grandtotal. So the effect that we get is that the Median column appears as the last table column.
Show the code
<-inner_join(
all bind_rows(body, rowtotals),
bind_rows(coltotals, grandtotal)
)::kable(all, digits=0) knitr
engine | Used | New | Median |
---|---|---|---|
e-Skyactiv G | 197 | 250 | 197 |
e-Skyactiv X | 233 | 250 | 238 |
Median | 202 | 250 | 202 |
And that is our complete table.
The deconstruction makes is sound like more work than it really is.