This function compares two versions of a dataset returning the dataset with the added, removed or changed rows identified, using the daff package. The compared dataset can then be exported into an Excel spreadsheet to quickly identify where values have been changed using conditional formatting, on text containing #.

compare_dataset_versions(old_version, new_version)

Arguments

old_version

The earlier version of the dataset as a data frame.

new_version

The later version of the dataset as a data frame.

Value

The data frame with an additional difference column indicating new, removed or updated rows highlighted with #.

Details

An initial check should be performed prior to comparing versions to check that the column names are identical and that there has not been any addition or removal of columns between dataset versions, so that the dataset schema can be made the same between versions if necessary. This check can be done using the compare function in the waldo package.

Examples

suppressPackageStartupMessages({ library(store) suppressWarnings({ library(palmerpenguins) library(dplyr) }) }) # select top 5 heaviest penguins from each species on each island heaviest_penguins <- penguins %>% select(species, island, body_mass_g) %>% group_by(species, island) %>% arrange(desc(body_mass_g)) %>% slice_head(n = 5) %>% ungroup() heaviest_penguins
#> # A tibble: 25 x 3 #> species island body_mass_g #> <fct> <fct> <int> #> 1 Adelie Biscoe 4775 #> 2 Adelie Biscoe 4725 #> 3 Adelie Biscoe 4600 #> 4 Adelie Biscoe 4400 #> 5 Adelie Biscoe 4300 #> 6 Adelie Dream 4650 #> 7 Adelie Dream 4600 #> 8 Adelie Dream 4475 #> 9 Adelie Dream 4450 #> 10 Adelie Dream 4400 #> # ... with 15 more rows
suppressPackageStartupMessages({ suppressWarnings({ library(dplyr) }) }) ## each version will require an unique identifier heaviest_penguins <- heaviest_penguins %>% mutate(id = row_number()) %>% relocate(id) ## old_version: exclude Chinstrap penguins heaviest_penguins_old <- heaviest_penguins %>% filter(species != "Chinstrap") ## new_version: exclude Gentoo penguins and convert body mass to kilograms heaviest_penguins_new <- heaviest_penguins %>% filter(species != "Gentoo") %>% mutate(body_mass_g = body_mass_g / 1000) %>% rename(body_mass_kg = body_mass_g) # check columns and column names are identical between versions waldo::compare(heaviest_penguins_old, heaviest_penguins_new)
#> `names(old)`: "id" "species" "island" "body_mass_g" #> `names(new)`: "id" "species" "island" "body_mass_kg" #> #> `old$id[13:20]`: 13 14 15 21 22 23 24 25 #> `new$id[13:20]`: 13 14 15 16 17 18 19 20 #> #> `old$species[13:20]`: 1 1 1 3 3 3 3 3 #> `new$species[13:20]`: 1 1 1 2 2 2 2 2 #> #> `old$island[13:20]`: 3 3 3 1 1 1 1 1 #> `new$island[13:20]`: 3 3 3 2 2 2 2 2 #> #> `old$body_mass_g` is an integer vector (4775, 4725, 4600, 4400, 4300, ...) #> `new$body_mass_g` is absent #> #> `old$body_mass_kg` is absent #> `new$body_mass_kg` is a double vector (4.775, 4.725, 4.6, 4.4, 4.3, ...)
# make columns same between versions heaviest_penguins_old <- heaviest_penguins_old %>% rename(body_mass = body_mass_g) heaviest_penguins_new <- heaviest_penguins_new %>% rename(body_mass = body_mass_kg) # compare versions of dataset suppressWarnings(compare_dataset_versions(heaviest_penguins_old, heaviest_penguins_new))
#> # A tibble: 25 x 5 #> differences id species island body_mass #> <chr> <chr> <chr> <chr> <chr> #> 1 # row changed 1 Adelie Biscoe 4775 # 4.775 #> 2 # row changed 2 Adelie Biscoe 4725 # 4.725 #> 3 # row changed 3 Adelie Biscoe 4600 # 4.6 #> 4 # row changed 4 Adelie Biscoe 4400 # 4.4 #> 5 # row changed 5 Adelie Biscoe 4300 # 4.3 #> 6 # row changed 6 Adelie Dream 4650 # 4.65 #> 7 # row changed 7 Adelie Dream 4600 # 4.6 #> 8 # row changed 8 Adelie Dream 4475 # 4.475 #> 9 # row changed 9 Adelie Dream 4450 # 4.45 #> 10 # row changed 10 Adelie Dream 4400 # 4.4 #> # ... with 15 more rows