class: center, middle, inverse, title-slide #
Managing Computational Reproducibility in a Shifting R Package Landscape ### Kiegan Rice (advised by Heike Hofmann and Ulrike Genschel) ### ISU Graphics Group ### April 24, 2020 --- <style> /* colors: #EEB422, #8B0000, #191970, #00a8cc */ a, a > code { color: #EEB422; text-decoration: none; } .remark-slide-content { background-color: #FFFFFF; border-top: 80px solid #191970; font-size: 20px; font-weight: 300; line-height: 1.5; padding: 1em 2em 1em 2em } .inverse { background-color: #191970; border-top: 80px solid #191970; text-shadow: none; background-position: 50% 75%; background-size: 150px; } .remark-slide-content > h1 { font-family: 'Goudy Old Style'; font-weight: normal; font-size: 45px; margin-top: -95px; margin-left: -00px; color: #FFFFFF; } .title-slide { background-color: #FFFFFF; border-top: 15px solid #FFFFFF; background-image: none; } .title-slide > h1 { color: #111111; font-size: 40px; text-shadow: none; font-weight: 400; text-align: left; margin-left: 15px; padding-top: 80px; } .title-slide > h2 { margin-top: -25px; padding-bottom: -20px; color: #111111; text-shadow: none; font-weight: 300; font-size: 35px; text-align: left; margin-left: 15px; } .title-slide > h3 { color: #111111; text-shadow: none; font-weight: 300; font-size: 20px; text-align: left; margin-left: 15px; margin-bottom: -20px; } body { font-family: 'Goudy Old Style'; } .remark-slide-number { font-size: 13pt; font-family: 'Goudy Old Style'; color: #272822; opacity: 1; } .inverse .remark-slide-number { font-size: 13pt; font-family: 'Goudy Old Style'; color: #FAFAFA; opacity: 1; } </style> <style type="text/css"> .tiny{font-size: 30%} .small{font-size: 50%} .medium{font-size: 75%} .left-code { color: #777; width: 40%; height: 92%; float: left; } .right-plot { width: 59%; float: right; padding-left: 1%; } .right-code { color: #777; width: 40%; height: 92%; float: right; } </style> # What is Computational Reproducibility? ## The short answer: -- ![](images/darkest_timeline.gif) --- # What is Computational Reproducibility? ## The long answer: -- There are lots of facets! -- For a good overview of several approaches to computational reproducibility, see the [ROpenSci reproducibility guide](https://ropensci.github.io/reproducibility-guide/sections/introduction/). -- 1. Literate programming and reporting - Creating self-contained documents that include the code used to generate results and figures. - The largest step forward in this direction in recent history (for R users) has been the creation of [`knitr`](https://yihui.org/knitr/) by Yihui Xie. 2. Numerical reproducibility of results - Code of any kind used to complete an analysis or generate results - Reproduction of analysis or results across users/machines, or over time Often, (1) helps with *transparency* of methodology in research, while (2) helps with *consistency* of methodology. --- # Computational Reproducibility in R R has a language model that makes it uniquely frustrating to manage computational reproducibility. -- **`base` R** - provides base language model - data structures (logical, character, double, factor, vector, list, matrix, data.frame, etc.) - framework for object manipulation (`$`, `[]`, `[[]]`, etc.) - framework for function syntax (arguments, body, return object, etc.) - basic functionality: (`for`, `while`, `apply`, `ifelse`, etc.) - also includes some packages (e.g., `stats`, that complete data analysis) -- **user-developed packages** - many data analysis and data science capabilities are available through packages - `randomForest`, `lme4`, `tidyverse`, `knitr`.... -- Packages are subject to change (this *includes* base packages like `stats`). *Note: This framework exists in Python, too, but not SAS.* --- # Computational Reproducibility in R There are a few things to note about the package framework **R packages can change over time** - Package developers can change anything they want in their own packages - [CRAN (Comprehensive R Archive Network)](https://cran.r-project.org/) provides a way for users to install packages - CRAN packages go through a review process, but versions do still change over time - GitHub versions of many popular packages also exist (usually between CRAN releases) **R package versioning is user/machine-specific** - It is up to the user to install and upgrade packages - Different users can have different sets of packages AND different versions of those packages - Functionality then differs across users and machines - If installing from CRAN, will install most recent release So what does this mean practically for data analysis? --- # Computational Reproducibility in R Suppose we have a data analysis pipeline like so: <img src="images/pipeline_v1_update.png" width="\textwidth" /> -- Now suppose that the underlying code changes: <img src="images/pipeline_coding_update.png" width="\textwidth" /> We now have four possible variants of our quantitative result, depending on how the code changes affect our process. --- class: inverse, middle, center # Existing Reproducibility Approaches in R --- # Package Versioning Packages 📦 **`checkpoint` package** [(Revolution Analytics)](https://github.com/RevolutionAnalytics/checkpoint) - stored versions of CRAN mirror from every day - can insert a "checkpoint" in scripts for users - users then use all packages from CRAN on day of checkpoint **`packrat` package** [(RStudio)](https://rstudio.github.io/packrat/) - private package library 📦 📖 - works on a project level, creating a library for that R project - interacts with projects/scripts - (*personal opinion*): bulky and makes projects a little hard to use - really awesome way to create a project-specific library! **`pkgnet` package** [(Uptake)](https://uptake.github.io/pkgnet/) - creates "package reports" - package dependency network (visual and data file) - function relationship/dependency network (visual and data file) - this tool is very cool --- # What is missing? All the packages mentioned are meant for **static** package versioning. - Meant to store/track packages used *at a particular point in time*. - Storing and maintaining a computational environment is an important aspect of computational reproducibility! - Does not allow for changes in packages across machines and over time - Ensures the same version (usually an old version) of a package will be used. We propose an alternate approach which takes advantage of new developments and improvements in packages... **Adaptive computational reproducibility!!** --- class: inverse, middle, center # Our Approach ![](images/change-is-good.gif)<!-- --> --- # Adaptive Computational Reproducibility We take a two-step approach: 1. Define and refine the scope of package dependency for a script or project. - This lives somewhere in the space between `pkgnet` and `packrat`. - Focuses on taking and visualizing **inventories** of packages and their dependencies - Assess the level of vulnerability a script has to changes 2. Compare package dependency networks over time and across users/machines - Identify specific package differences between inventories - Identify places in a script where differences are used - Gives users a chance to identify differences *before* the code breaks To explain this approach, we will look at a brief example. --- # Step 1: Define and refine vulnerabilities Consider the following data analysis process that makes use of five R packages: <img src="images/pipeline_small_rfuncs.png" width="\textwidth" /> -- We can take these five packages, and get a *package inventory* using the `manager` package (👨‍💻📦👨‍💻): ```r devtools::install_github("kiegan/manager") library(manager) project_inventory <- take_inventory(packages = c("tidyr", "stringr", "purrr", "dplyr", "randomForest")) ``` -- We can then visualize that inventory... --- # Step 1: Define and refine vulnerabilities ```r plot_inventory(project_inventory) ``` ![](graphics-group-manager_files/figure-html/unnamed-chunk-8-1.png)<!-- --> --- # Quick side note 👀 ```r plot_inventory(tidyverse_inventory) ``` ![](graphics-group-manager_files/figure-html/unnamed-chunk-10-1.png)<!-- --> --- # Step 1: Define and refine vulnerabilities **We can see a few things:** 1. `take_inventory` and `plot_inventory` together are a quick and easy way to get at your vulnerabilities 2. Your projects are probably more vulnerable to package changes than you realize! - *anecdotally*, I know several people who have had packages disrupted by a "level 3" or "level 4" change. 3. You maybe probably definitely shouldn't just call `library(tidyverse)` if you want to make your project reproducible. Try to define specific packages you want to use to refine the scope! **A few things to note:** 1. Inventories are taken using an iterative level-by-level approach - not meant to present an entire network of all connections - `pkgnet` can do this for one package! - if a package is already in the "tree" at a higher level, it isn't included again. - meant to enumerate all vulnerabilities, not identify all connections. 2. Inventories only focus on Depends and Imports! 3. Inventories catalog all package objects, so that we can compare them... --- # Step 2: Compare inventories We can make use of catalogued package objects to compare package inventories to find: - **nominal differences**: differences in package versions (CRAN versions, git commits, etc.) - **practical differences**: differences in package objects across versions! We care more about **practical differences**, and we can identify them automatically using the `compare_inventory` function. The `compare_inventory` function uses strategically-placed **MD5 checksums** to summarize pieces of the inventory, such as: 1. package meta-information (name, version, source, and github repo/commit) 2. list of package dependencies 3. package objects meta-information (object name, object checksum, parameter checksum, object type, etc.) 4. package object checksum 5. package object parameter checksum This is computationally much faster than comparing large objects, and lets us quickly filter out packages and objects which are identical across inventories and versions. --- # Step 2: Compare inventories Consider two users who take inventories using our five packages (`tidyr`, `stringr`, `purrr`, `dplyr`, `randomForest`). ```r kiegan_vs_amy <- compare_inventory(inventory1 = kiegan_inventory, inventory2 = amy_inventory, summary_file = "data/kiegan_vs_amy.txt") ``` We get three results: 1. a text summary saved to "summary_file" 2. a `$table` option which includes a row for each different object and some summary columns 3. a `$objects` option which includes a row for each different object and all relevant info, including the deparsed objects from both inventories for easy comparison. --- # Step 2: Compare inventories Text summary: <img src="images/kiegan-vs-amy.png" width="80%" /> --- # Step 2: Compare inventories `$table` object: ```r head(kiegan_vs_amy$table) ``` ``` ## # A tibble: 6 x 6 ## package_name object_name object_presence object_changes obj_same params_same ## <chr> <chr> <chr> <chr> <lgl> <lgl> ## 1 cli is_ansi_tty Object in both … Objects change… FALSE TRUE ## 2 digest digest Object in both … Objects change… FALSE TRUE ## 3 dplyr contains Object in both … Both object an… FALSE FALSE ## 4 dplyr ends_with Object in both … Both object an… FALSE FALSE ## 5 dplyr enquos Object in both … Objects change… FALSE TRUE ## 6 dplyr everything Object in both … Both object an… FALSE FALSE ``` --- # Step 2: Compare inventories `$objects` object: ```r head(kiegan_vs_amy$objects) ``` ``` ## # A tibble: 6 x 22 ## package_name object_name object_presence object_changes obj_same params_same ## <chr> <chr> <chr> <chr> <lgl> <lgl> ## 1 cli is_ansi_tty Object in both… Objects chang… FALSE TRUE ## 2 digest digest Object in both… Objects chang… FALSE TRUE ## 3 dplyr contains Object in both… Both object a… FALSE FALSE ## 4 dplyr ends_with Object in both… Both object a… FALSE FALSE ## 5 dplyr enquos Object in both… Objects chang… FALSE TRUE ## 6 dplyr everything Object in both… Both object a… FALSE FALSE ## # … with 16 more variables: package_source_inv1 <chr>, ## # package_version_inv1 <chr>, package_gitrepo_inv1 <chr>, ## # package_gitcommit_inv1 <chr>, function_text_inv1 <list>, ## # fun_params_inv1 <list>, obj_checksum_inv1 <chr>, ## # params_checksum_inv1 <chr>, package_source_inv2 <chr>, ## # package_version_inv2 <chr>, package_gitrepo_inv2 <chr>, ## # package_gitcommit_inv2 <chr>, function_text_inv2 <list>, ## # fun_params_inv2 <list>, obj_checksum_inv2 <chr>, params_checksum_inv2 <chr> ``` --- class: inverse, middle, center # Case Study <img src="images/tidyr-hex.png" width="150" /> --- # `tidyr` 📦 background **`tidyr`** is a package by [Hadley Wickham and Lionel Henry of RStudio](https://tidyr.tidyverse.org/), and part of the `tidyverse`. It is meant to help users create **tidy** data: - each column is a variable - each row is an observation - each cell contains a single value `tidyr` is a widely-used package, with monthly downloads in the hundreds of thousands: ![](graphics-group-manager_files/figure-html/unnamed-chunk-17-1.png)<!-- --> --- # `tidyr` 📦 changes `tidyr` is also still subject to change relatively often: - two package versions released in 2019, one so far in 2020<sup>1</sup> - this is true for many tidyverse packages, and leaves people's code and packages vulnerable ... -- <img src="images/romain-dplyr-tweet.png" width="591" /> -- Consider the following example: - user with version 0.8.3 (released Mar 2, 2019) - updates to version 1.0.0 (released Sep 13, 2019) - all other packages on computer unchanged .footnote[[1] https://github.com/tidyverse/tidyr/releases] --- # `tidyr` 📦 changes ```r compare_tidyr <- compare_inventory(inventory1 = tidyr_083, inventory2 = tidyr_100) ``` <img src="images/tidyr_compare_summary.png" width="1840" /> --- # `tidyr` 📦 changes **Version 0.8.3** `nest()` function: `ChickWeight %>% nest(-Diet) %>% glimpse()` ``` ## Rows: 4 ## Columns: 2 ## $ Diet <fct> 1, 2, 3, 4 ## $ data <list> [<data.frame[220 x 3]>, <data.frame[120 x 3]>, <data.frame[120 … ``` -- **Version 1.0.0** `nest()` function: `ChickWeight %>% nest(-Diet) %>% glimpse()`<sup>1</sup> ``` ## Rows: 4 ## Columns: 2 ## $ ...1 <fct> 1, 2, 3, 4 ## $ data <list> [<data.frame[220 x 3]>, <data.frame[120 x 3]>, <data.frame[120 … ``` -- *Note the difference between the column names.* .footnote[[1] Note that in the RStudio console, the version 1.0.0 code will throw an error and not run. RMarkdown files (papers and slides) seem to render it just fine, but with incorrect column name.] --- # `tidyr` 📦 conclusions 1. PSA: Check yo' scripts! 2. Syntax of `nest()` has changed 3. Second PSA: new functions have been introduced to replace `gather()` and `spread()` - (`pivot_longer` and `pivot_wider`) 4. These are the types of changes that can leave R users frustrated - install a new package version - code is broken - have to spend a bunch of time hunting down the problem - this happens on a regular basis! -- Now, we will look at another example... --- class: inverse, middle, center # Case Study <img src="images/bulletxtrctr-hex.png" width="150" /><img src="images/grooveFinder-hex.png" width="150" /><img src="images/x3ptools-hex.png" width="150" /> --- # `bulletxtrctr` 📦 background `bulletxtrctr` is one of several packages developed by a team of researchers at the Center for Statistics and Applications in Forensic Evidence (CSAFE) at Iowa State. It is meant to automatically complete forensic evidence comparison, using high-resolution images of bullet evidence. The package works in concert with three other packages: `x3ptools`, `grooveFinder`, and `bulletcp` to complete data analysis of captured `x3p` files: -- <img src="images/pipeline-packages.png" width="\textwidth" /> --- # `bulletxtrctr` 📦 background The suite of packages is both **used** and **developed** by a small-ish team of researchers. This means: 1. The packages are under active development and subject to change 2. Team members may have different versions of packages at any given time 3. Analyses run by different users on different machines (or at different times), are vulnerable to changes. We can first look at the package dependency tree for one user (*full disclosure: it's me*) --- # `bulletxtrctr` 📦 background ```r plot_inventory(bulletxtrctr_inventory) ``` ![](graphics-group-manager_files/figure-html/unnamed-chunk-27-1.png)<!-- --> --- # `bulletxtrctr` study set-up But what we really want to study is the nominal and practical differences across users on the team. I asked researchers on the bullet team to use the `manager` package to take an inventory of the `bulletxtrctr` dependency network and send me the resulting inventories. I got: - 9 different inventories - 6 users (one user on 3 machines, another user on 2 machines) We can investigate first the nominal differences: - what versions of each package do team members have? First, we look at the four bullet packages. --- # Bullet Package Versions <img src="graphics-group-manager_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" /> .medium[**Insights:** - Only two machines have the same set of versions of the four packages (User K-Laptop and User S-Desktop) - This is only the four bullet packages, but there are 100 packages total in the `bulletxtrctr` package dependency tree. - Of the 96 non-bullet packages in the tree, 73 have at least two versions across the machines involved. ] --- # Other Package Versions <img src="graphics-group-manager_files/figure-html/unnamed-chunk-29-1.png" width="77%" style="display: block; margin: auto;" /> --- # Bullet Project Package Versions So far we have only seen **nominal** differences, but what we have already seen shows a lot of variability! What is more important is the **practical** differences. --- # Bullet Packages: Practical Differences <img src="graphics-group-manager_files/figure-html/unnamed-chunk-30-1.png" width="80%" /> --- class: inverse, middle, center # Wrapping Up --- # Conclusions and Future Work <img src="images/shadowy-place-r-packages.jpg" width="33%" style="display: block; margin: auto;" /> --- # Conclusions and Future Work **Conclusions** - your code might be more vulnerable than you think! - our tools provide a way to quickly assess things and identify changes - team collaboration - identify changes *before* they break your code -- **Ironies That I Am Aware Of** - the `manager` package relies on some `tidyverse` packages -- **Future Work** - making the `manager` package less dependent on `tidyverse` packages - script mapping to identify a script's "flow" and where a potential code change starts affecting things - visuals to better compare two dependency trees - including a way to look at how it would change if you removed a package dependency - C/C++ code is difficult to get around - when it is compiled, it gets a user-specific "address" - always different --- class: inverse, middle, center # Questions?