
Tabyl: A Frequency Table for the Modern R User | by Zvonimir Boban | May, 2023

Out with the old, in with the new!

Zvonimir Boban
Towards Data Science
Image created using Canva Image Generator

Anyone who has worked with categorical data has eventually come across the need to calculate the absolute number and proportion of a certain class. This article introduces the tabyl function for creating frequency tables through a series of hands-on examples.

What does tabyl bring to the table (no pun intended :D)?

The tabyl function is a feature of the janitor package in R. It’s a very convenient tool for creating contingency tables, otherwise known as frequency tables or cross-tabulations. Here are some of the benefits of using tabyl:

1. Easy syntax: tabyl has an easy-to-use syntax. It can take one, two, or three variables, and it automatically returns the counts as a data frame (for three variables, a list of data frames), with proportions included for one-way tables.

2. Flexibility: tabyl can generate one-way (single variable), two-way (two variables), and three-way (three variables) contingency tables. This flexibility makes it suitable for a wide range of applications.

3. Automatic calculation of proportions: tabyl automatically calculates the proportions (percentages) for one-way contingency tables. For two and three-way tables, the same result can be accomplished in combination with the adorn_percentages function from the same package.

4. Compatibility with dplyr: The output of tabyl is a data frame, which makes it fully compatible with dplyr functions and the tidyverse ecosystem. This means you can easily pipe (%>%) the output into further data wrangling or visualization functions.

5. Neat and informative output: tabyl provides neat and informative output, which labels the first column with the variable name, making it easier to interpret the results.

For all these reasons, tabyl is a great choice when you want to create frequency tables in R. It simplifies many steps and integrates well with the tidyverse approach to data analysis.
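To make the first two points concrete, here is a minimal sketch of the three call shapes. It assumes the janitor package is installed and uses the built-in mtcars data purely as a stand-in, since it ships with R; the variable names one_way, two_way and three_way are just illustrative.

```r
library(janitor)

# mtcars ships with R; it is only a stand-in dataset for this sketch
one_way   <- tabyl(mtcars, cyl)            # columns: cyl, n, percent
two_way   <- tabyl(mtcars, cyl, gear)      # one row per cyl, one column per gear
three_way <- tabyl(mtcars, cyl, gear, am)  # a list of two-way tables, one per value of am

one_way
two_way
names(three_way)
```

Note that the three-way call returns a list rather than a single data frame, so each element has to be handled separately (for example with lapply).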

The dataset

Photo by Hans Veth on Unsplash

This post will demonstrate the benefits of the tabyl function from the janitor package using the data on the edibility of different types of mushrooms depending on their odor. Here, I will be using a tidied dataset under the name mushrooms, but you can access the original data on Kaggle. Below is the code used for cleaning the data.

library(tidyverse)
library(janitor)

mushrooms <- read_csv("mushrooms.csv") %>%
  select(class, odor) %>%
  mutate(
    class = case_when(
      class == "p" ~ "poisonous",
      class == "e" ~ "edible"
    ),
    odor = case_when(
      odor == "a" ~ "almond",
      odor == "l" ~ "anise",
      odor == "c" ~ "creosote",
      odor == "y" ~ "fishy",
      odor == "f" ~ "foul",
      odor == "m" ~ "musty",
      odor == "n" ~ "none",
      odor == "p" ~ "pungent",
      odor == "s" ~ "spicy"
    )
  )
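If you want to try the recoding step without downloading the file, the sketch below does the same translation on a three-row toy data frame. The rows are made up for illustration, and the named lookup vectors are an alternative to case_when that I am swapping in here, not the approach used above.

```r
library(dplyr)

# Hypothetical toy rows standing in for the raw Kaggle file (not real data)
raw <- data.frame(class = c("p", "e", "e"), odor = c("p", "a", "l"))

# Named vectors act as code -> label dictionaries
class_lookup <- c(p = "poisonous", e = "edible")
odor_lookup  <- c(a = "almond", l = "anise", c = "creosote", y = "fishy",
                  f = "foul", m = "musty", n = "none", p = "pungent", s = "spicy")

# Indexing a named vector by the codes returns the matching labels;
# codes missing from the lookup become NA, just like an unmatched case_when
tidy <- raw %>%
  mutate(class = unname(class_lookup[class]),
         odor  = unname(odor_lookup[odor]))

tidy
```

The lookup-vector form keeps the code-to-label mapping in one place, which can be handy when the same dictionary is reused elsewhere.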

If you are unfamiliar with the above syntax, please check out a hands-on guide to using the tidyverse in one of my earlier articles.

The old

In order to better understand which advantages tabyl offers, let’s first make a frequency table using the base R table function.

table(mushrooms$class)

   edible poisonous
     4208      3916

table(mushrooms$odor, mushrooms$class)

           edible poisonous
  almond      400         0
  anise       400         0
  creosote      0       192
  fishy         0       576
  foul          0      2160
  musty         0        36
  none       3408       120
  pungent       0       256
  spicy         0       576

Unsurprisingly, it turns out that odor is a great predictor of mushroom edibility, with anything “funny-smelling” probably being poisonous. Thank you evolution! Also, nearly half of the mushrooms in the dataset are poisonous (3916 of 8124), and even some odorless ones are, so it’s always important to be cautious when picking mushrooms on your own.

If we want to be able to use the variable names directly without specifying the $ operator, we would need to use the with command to make the dataset available to the table function.

mush_table <- with(mushrooms, table(odor, class))

Unfortunately, if we want to upgrade to proportions instead of absolute numbers, we cannot use the same function and need another one instead: prop.table.

prop.table(mush_table)

          class
odor            edible   poisonous
  almond   0.049236829 0.000000000
  anise    0.049236829 0.000000000
  creosote 0.000000000 0.023633678
  fishy    0.000000000 0.070901034
  foul     0.000000000 0.265878877
  musty    0.000000000 0.004431315
  none     0.419497784 0.014771049
  pungent  0.000000000 0.031511571
  spicy    0.000000000 0.070901034

By default, this divides every cell by the grand total, so all the entries together sum to 1 (for example, 400 / 8124 ≈ 0.049 for almond). If we want row-wise or column-wise proportions, we can specify the margin argument (1 for row-wise and 2 for column-wise).

prop.table(mush_table, margin = 1)

          class
odor           edible  poisonous
  almond   1.00000000 0.00000000
  anise    1.00000000 0.00000000
  creosote 0.00000000 1.00000000
  fishy    0.00000000 1.00000000
  foul     0.00000000 1.00000000
  musty    0.00000000 1.00000000
  none     0.96598639 0.03401361
  pungent  0.00000000 1.00000000
  spicy    0.00000000 1.00000000
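The effect of the margin argument is easiest to see on a tiny table. The sketch below rebuilds just the none and foul rows from the counts shown earlier (the object name small is only illustrative) and checks what each setting normalizes by.

```r
# Two rows of the odor-by-class counts from above, entered by hand
small <- as.table(matrix(
  c(3408, 0, 120, 2160), nrow = 2,
  dimnames = list(odor = c("none", "foul"), class = c("edible", "poisonous"))
))

overall <- prop.table(small)              # each cell divided by the grand total
by_row  <- prop.table(small, margin = 1)  # each row sums to 1
by_col  <- prop.table(small, margin = 2)  # each column sums to 1

sum(overall)
rowSums(by_row)
colSums(by_col)
```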

All these special functions can feel cumbersome and hard to remember, so a single function which contains all the above functionality would be nice to have.

Additionally, if we check the type of the created object using the class(mush_table) command, we see that it is of class table.

This creates a compatibility problem, since nowadays R users are mostly using the tidyverse ecosystem which is centered around applying functions to data.frame type objects and stringing the results together using the pipe (%>%) operator.
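For completeness, base R does offer a bridge: as.data.frame() converts a table object into a data frame. The sketch below, again using two hand-entered rows of the counts above, shows that the result comes back in long format with a Freq column, so it still needs reshaping before it looks like a cross-tabulation.

```r
# Two rows of the odor-by-class counts from above, entered by hand
small <- as.table(matrix(
  c(3408, 0, 120, 2160), nrow = 2,
  dimnames = list(odor = c("none", "foul"), class = c("edible", "poisonous"))
))

class(small)                    # "table"

long <- as.data.frame(small)    # one row per odor/class combination
class(long)                     # "data.frame"
long
```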

The new

Let’s do the same things with the tabyl function.

tabyl(mushrooms, class)

     class    n   percent
    edible 4208 0.5179714
 poisonous 3916 0.4820286

mush_tabyl <- tabyl(mushrooms, odor, class)
mush_tabyl

     odor edible poisonous
   almond    400         0
    anise    400         0
 creosote      0       192
    fishy      0       576
     foul      0      2160
    musty      0        36
     none   3408       120
  pungent      0       256
    spicy      0       576

Compared to the corresponding table output, the tables produced by tabyl are tidier, with the variable name explicitly stated in the header of the first column. Moreover, for the one-way table, aside from the counts, the proportions are automatically generated as well.

We can also notice that we didn’t have to use the with function to be able to specify the variable names directly. Additionally, running class(mush_tabyl) tells us that the resulting object inherits from the data.frame class (alongside tabyl), which ensures tidyverse compatibility!
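Because the result is a data frame, it can go straight into dplyr verbs. Here is a small sketch of that, assuming dplyr and janitor are installed and using the built-in mtcars data as a stand-in; the cutoff of 10 is arbitrary.

```r
library(dplyr)
library(janitor)

freq <- tabyl(mtcars, cyl)   # columns: cyl, n, percent
class(freq)

# Keep only the categories with more than 10 observations,
# most frequent first
common <- freq %>%
  filter(n > 10) %>%
  arrange(desc(percent))

common
```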

The adorned janitor

Image created using Canva Image Generator

For additional tabyl functionalities, the janitor package also contains a series of adorn functions. To get the percentages, we simply pipe the resulting frequency table to the adorn_percentages function, which computes row-wise percentages by default.

mush_tabyl %>% adorn_percentages()

     odor    edible  poisonous
   almond 1.0000000 0.00000000
    anise 1.0000000 0.00000000
 creosote 0.0000000 1.00000000
    fishy 0.0000000 1.00000000
     foul 0.0000000 1.00000000
    musty 0.0000000 1.00000000
     none 0.9659864 0.03401361
  pungent 0.0000000 1.00000000
    spicy 0.0000000 1.00000000

If we want the column-wise percentages, we can specify the denominator argument as “col”.

mush_tabyl %>% adorn_percentages(denominator = "col")

     odor     edible   poisonous
   almond 0.09505703 0.000000000
    anise 0.09505703 0.000000000
 creosote 0.00000000 0.049029622
    fishy 0.00000000 0.147088866
     foul 0.00000000 0.551583248
    musty 0.00000000 0.009193054
     none 0.80988593 0.030643514
  pungent 0.00000000 0.065372829
    spicy 0.00000000 0.147088866
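Besides "row" and "col", the denominator argument also accepts "all", which divides by the grand total. A quick sketch comparing the three, with mtcars once more as a stand-in and a small helper (cells, my own name) that strips the label column before summing:

```r
library(janitor)

counts <- tabyl(mtcars, cyl, gear)

by_row <- adorn_percentages(counts, denominator = "row")  # rows sum to 1
by_col <- adorn_percentages(counts, denominator = "col")  # columns sum to 1
by_all <- adorn_percentages(counts, denominator = "all")  # whole table sums to 1

# Helper (illustrative): drop the first (label) column, keep the numeric cells
cells <- function(x) as.matrix(as.data.frame(x)[, -1])

rowSums(cells(by_row))
colSums(cells(by_col))
sum(cells(by_all))
```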

The tabyl and adorn combo even enables us to easily combine both the number and percentage in the same table cell…

mush_tabyl %>% adorn_percentages %>% adorn_ns

     odor           edible         poisonous
   almond  1.0000000 (400)    0.00000000 (0)
    anise  1.0000000 (400)    0.00000000 (0)
 creosote    0.0000000 (0)  1.00000000 (192)
    fishy    0.0000000 (0)  1.00000000 (576)
     foul    0.0000000 (0) 1.00000000 (2160)
    musty    0.0000000 (0)   1.00000000 (36)
     none 0.9659864 (3408)  0.03401361 (120)
  pungent    0.0000000 (0)  1.00000000 (256)
    spicy    0.0000000 (0)  1.00000000 (576)

… or add the totals to the rows and columns.

mush_tabyl %>% adorn_totals(c("row", "col"))

     odor edible poisonous Total
   almond    400         0   400
    anise    400         0   400
 creosote      0       192   192
    fishy      0       576   576
     foul      0      2160  2160
    musty      0        36    36
     none   3408       120  3528
  pungent      0       256   256
    spicy      0       576   576
    Total   4208      3916  8124
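The adorn functions can also be chained into one report-style pipeline. The sketch below is one possible ordering, not something from the examples above: it adds adorn_pct_formatting and adorn_title, two further janitor functions, and again uses mtcars as a stand-in dataset.

```r
library(dplyr)    # loaded for the %>% pipe
library(janitor)

report <- mtcars %>%
  tabyl(cyl, gear) %>%
  adorn_totals("row") %>%               # append a Total row of counts
  adorn_percentages("row") %>%          # convert counts to row-wise proportions
  adorn_pct_formatting(digits = 1) %>%  # format proportions as percent strings
  adorn_ns() %>%                        # append the underlying counts in parentheses
  adorn_title("combined")               # put both variable names in the corner header

report
```

The order matters here: totals are added while the cells are still counts, and adorn_ns comes after the percentage formatting so that the counts are appended to the finished strings.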

Conclusion

The tabyl() function from the janitor package in R offers a user-friendly and flexible solution for creating one-way, two-way, or three-way contingency tables. It excels in automatically computing proportions and producing tidy data frames that integrate seamlessly with the tidyverse ecosystem, especially dplyr. Its outputs are well-structured and easy to interpret, and it can be further enhanced with adorn functions, simplifying the overall process of generating informative frequency tables. This makes tabyl() a highly beneficial tool in data analysis in R.

