The janitor functions expedite the initial data exploration and cleaning that comes with any new data set. This catalog describes the usage for each function.
Functions for everyday use.
clean_names()
Call this function every time you read data.
It works in a %>%
pipeline, and handles problematic variable names, especially those that are so well preserved by readxl::read_excel()
and readr::read_csv()
.
_
as a separator# Load dplyr for the %>% pipe
library(dplyr)
# Create a data.frame with dirty names
test_df <- as.data.frame(matrix(ncol = 6))
names(test_df) <- c("hIgHlo", "REPEAT VALUE", "REPEAT VALUE",
"% successful (2009)", "abc@!*", "")
Clean the variable names, returning a data.frame:
test_df %>%
clean_names()
#> h_ig_hlo repeat_value repeat_value_2 percent_successful_2009 abc x
#> 1 NA NA NA NA NA NA
Compare to what base R produces:
make.names(names(test_df))
#> [1] "hIgHlo" "REPEAT.VALUE" "REPEAT.VALUE"
#> [4] "X..successful..2009." "abc..." "X"
tabyl()
- a better version of table()
tabyl()
is a tidyverse-oriented replacement for table()
. It counts combinations of one, two, or three variables, and then can be formatted with a suite of adorn_*
functions to look just how you want. For instance:
mtcars %>%
tabyl(gear, cyl) %>%
adorn_totals("col") %>%
adorn_percentages("row") %>%
adorn_pct_formatting(digits = 2) %>%
adorn_ns()
#> gear 4 6 8 Total
#> 3 6.67% (1) 13.33% (2) 80.00% (12) 100.00% (15)
#> 4 66.67% (8) 33.33% (4) 0.00% (0) 100.00% (12)
#> 5 40.00% (2) 20.00% (1) 40.00% (2) 100.00% (5)
Learn more in the tabyls vignette.
get_dupes()
This is for hunting down and examining duplicate records during data cleaning - usually when there shouldn’t be any.
For example, in a tidy data frame you might expect to have a unique ID repeated for each year, and year repeated for each unique ID, but no duplicated pairs of unique ID & year. Say you want to check for their presence, and study any such duplicated records.
get_dupes()
returns the records (and inserts a count of duplicates) so you can sleuth out the problematic cases:
get_dupes(mtcars, wt, cyl) # or mtcars %>% get_dupes(wt, cyl) if you prefer to pipe
#> # A tibble: 4 x 12
#> wt cyl dupe_… mpg disp hp drat qsec vs am gear carb
#> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3.44 6.00 2 19.2 168 123 3.92 18.3 1.00 0 4.00 4.00
#> 2 3.44 6.00 2 17.8 168 123 3.92 18.9 1.00 0 4.00 4.00
#> 3 3.57 8.00 2 14.3 360 245 3.21 15.8 0 0 3.00 4.00
#> 4 3.57 8.00 2 15.0 301 335 3.54 14.6 0 1.00 5.00 8.00
Smaller functions for use in particular situations. More human-readable than the equivalent code they replace.
excel_numeric_to_date()
Ever load data from Excel and see a value like 42223
where a date should be? This function converts those serial numbers to class Date
, and contains an option for specifying the alternate date system for files created with Excel for Mac 2008 and earlier versions (which count from a different starting point).
excel_numeric_to_date(41103)
#> [1] "2012-07-13"
excel_numeric_to_date(41103, date_system = "mac pre-2011")
#> [1] "2016-07-14"
remove_empty_cols()
and remove_empty_rows()
One-line wrapper functions that do what they say. For cases like cleaning Excel files containing empty rows and columns.
q <- data.frame(v1 = c(1, NA, 3),
v2 = c(NA, NA, NA),
v3 = c("a", NA, "b"))
q %>%
remove_empty_cols() %>%
remove_empty_rows()
#> v1 v3
#> 1 1 a
#> 3 3 b
top_levels()
Originally designed for use with Likert survey data stored as factors. Returns a tbl_df
frequency table with appropriately-named rows, grouped into head/middle/tail groups.
NA
values.f <- factor(c("strongly agree", "agree", "neutral", "neutral", "disagree", "strongly agree"),
levels = c("strongly agree", "agree", "neutral", "disagree", "strongly disagree"))
top_levels(f)
#> f n percent
#> strongly agree, agree 3 0.5000000
#> neutral 2 0.3333333
#> disagree, strongly disagree 1 0.1666667
top_levels(f, n = 1)
#> f n percent
#> strongly agree 2 0.3333333
#> agree, neutral, disagree 4 0.6666667
#> strongly disagree 0 0.0000000