A Paradigm for Managing Computational Reproducibility in a Changing Software Package Landscape

# A Paradigm for Managing Computational Reproducibility in a Changing Software Package Landscape
### Kiegan Rice and Heike Hofmann
### Symposium on Data Science and Statistics
### June 4, 2020

---

<style>

/* colors: #EEB422, #8B0000, #191970, #00a8cc */
a, a > code {
  color: #EEB422;
  text-decoration: none;
}

.remark-slide-content {
  background-color: #FFFFFF;
  border-top: 80px solid #696969;
  font-size: 20px;
  font-weight: 300;
  line-height: 1.5;
  padding: 1em 2em 1em 2em
}

.inverse {
  background-color: #696969;
  border-top: 80px solid #696969;
  text-shadow: none;
	background-position: 50% 75%;
  background-size: 150px;
}

.remark-slide-content > h1 {
  font-family: 'Goudy Old Style';
  font-weight: normal;
  font-size: 45px;
  margin-top: -95px;
  margin-left: -00px;
  color: #FFFFFF;
}

.title-slide {
  background-color: #696969;
  border-top: 15px solid #696969;
  background-image: none;
}

.title-slide > h1  {
  color: #FFFFFF;
  font-size: 40px;
  text-shadow: none;
  font-weight: 400;
  text-align: left;
  margin-left: 15px;
  padding-top: 80px;
}
.title-slide > h2  {
  margin-top: -25px;
  padding-bottom: -20px;
  color: #FFFFFF;
  text-shadow: none;
  font-weight: 300;
  font-size: 35px;
  text-align: left;
  margin-left: 15px;
}
.title-slide > h3  {
  color: #FFFFFF;
  text-shadow: none;
  font-weight: 300;
  font-size: 20px;
  text-align: left;
  margin-left: 15px;
  margin-bottom: -20px;
}

body {
  font-family: 'Goudy Old Style';
}

.remark-slide-number {
  font-size: 13pt;
  font-family: 'Goudy Old Style';
  color: #272822;
  opacity: 1;
}

.inverse .remark-slide-number {
  font-size: 13pt;
  font-family: 'Goudy Old Style';
  color: #FAFAFA;
  opacity: 1;
}

</style>

.left-code {
  color: #777;
  width: 40%;
  height: 92%;
  float: left;
}
.right-plot {
  width: 59%;
  float: right;
  padding-left: 1%;
}

.right-code {
  color: #777;
  width: 40%;
  height: 92%;
  float: right;
}

</style>

# Data Science and Computation

**Data science pipelines**  
- Modularized, linear pipelines of sequential decisions  
- Series of decisions applied to data  
- Actions utilize underlying code

<br />

<br />
--

**Computational Reproducibility Question**  
- Can a pipeline (and its results) be reproduced?

---

# Defining Computational Reproducibility

**Literate programming and reporting**  
  - Self-contained documents, containing code to generate results and figures  
  - *transparency* of methodology

**Numerical reproducibility of results**  
  - Code used to generate results  
  - Reproduce across users/machines, over time  
  - *consistency* of methodology

---

# Computational Reproducibility in R

**R Language Model**  
- base R: data structures, object manipulation, and function syntax. A few packages.   
- user-developed packages: many data analysis and data science capabilities  
  - e.g. `randomForest`, `lme4`, `tidyverse`, `knitr`  
- packages subject to change

**R package versioning is user/machine-specific**  
  - User must install and update packages  
  - Different sets of packages, different versions of packages  
  - Functionality differences across users and machines  
  - CRAN (Comprehensive R Archive Network)  
    - Review process  
    - Versions change over time

What does this mean for reproducing analyses in R?

---

# Existing Reproducibility Approaches in R

---

# Package Versioning Packages

**`checkpoint` package** [(Revolution Analytics)](https://github.com/RevolutionAnalytics/checkpoint)  
  - stored versions of CRAN mirror from every day 
  - can insert a "checkpoint" in scripts for users  
  - users then use all package versions from CRAN at checkpoint

**`packrat` package** [(RStudio)](https://rstudio.github.io/packrat/)
  - private package library   
  - works on a project level, creating a library for that R project

**`pkgnet` package** [(Uptake)](https://uptake.github.io/pkgnet/)
  - creates "package reports" 
  - package dependency network 
    - interactive visualization and data table  
  - function relationship/dependency network  
    - interactive visualization and data table  
  
---

# What is missing?

Prior approaches are meant for **static** package versioning.

- Meant to document packages used *at a particular point in time*.  
  - Storing and maintaining a computational environment    
  - Ensures the same version of a package will be used  
  - Doesn't take advantage of changes and innovations  
  
We propose an alternate approach:

**adaptive computational reproducibility**

---

# Adaptive Computational Reproducibility

---

# Two-Step Approach

**Step 1**: Define and refine the scope of package dependency for a script or project.  
  - This lives in the space between `pkgnet` and `packrat`. 
  - Focuses on taking and visualizing **inventories** of packages and their dependencies  
  - Assess the level of vulnerability a script has to changes  
  
<br />

**Step 2**: Compare package inventories over time and across users/machines  
  - Identify specific package differences between inventories  
  - Identify places in a script where differences are used  
  - Gives users a chance to identify differences *before* the code breaks

<br />
Tools to take this approach in R: `manager` package

---

# Step 1: Define and refine vulnerabilities

Consider the following data analysis process that makes use of five R packages:

We can take these five packages, and get a *package inventory* using the `manager` package:

```r
devtools::install_github("kiegan/manager")
library(manager)
project_inventory <- take_inventory(packages = c("tidyr", "stringr", 
                                                 "purrr", "dplyr", 
                                                 "randomForest"))
```

We can then visualize that inventory...

---

# Step 1: Define and refine vulnerabilities

```r
plot_inventory(project_inventory)
```

![](sdss-manager_files/figure-html/unnamed-chunk-7-1.png)

---

# Step 1: Define and refine vulnerabilities

**Taking inventory**:

1. Using an iterative level-by-level approach  
  - not meant to present an entire network of all connections  
      - `pkgnet` can do this for one package!  
  - if a package is already in the dependency tree at a higher level, it isn't included again.  
2. Inventories only focus on packages listed as Depends and Imports  
3. Inventories catalog all package objects

---

# Quick side note

```r
plot_inventory(tidyverse_inventory)
```

![](sdss-manager_files/figure-html/unnamed-chunk-9-1.png)

---

# Step 2: Compare inventories

Use catalogued package objects to compare package inventories to find:

- **nominal differences**  
  - **practical differences**

Strategically-placed **MD5 checksums** to summarize pieces of the inventory:

- package meta-information  
- package objects 
- package object parameters

<br />
- computationally much faster than comparing large objects
- can quickly identify packages and objects which are not identical

---

# Step 2: Compare inventories

Consider two users who take inventories using our five packages (`tidyr`, `stringr`, `purrr`, `dplyr`, `randomForest`).

```r
kiegan_vs_amy <- compare_inventory(inventory1 = kiegan_inventory,
                                   inventory2 = amy_inventory,
                                   summary_file = "data/kiegan_vs_amy.txt")
```

We get three results:

1. a text summary saved to "summary_file"
2. a `$table` object: row for each different object and some summary columns 
3. a `$objects` object: row for each different object and additional info 
  - deparsed objects from both inventories   
  
<br />

Requires two independent inventories to be taken and loaded at the same time.

---

# Step 2: Compare inventories

Text summary:

---

# Step 2: Compare inventories

`$objects` object:

```r
head(kiegan_vs_amy$objects)
```

```
## # A tibble: 6 x 22
##   package_name object_name object_presence object_changes obj_same params_same
##   <chr>        <chr>       <chr>           <chr>          <lgl>    <lgl>      
## 1 cli          is_ansi_tty Object in both… Objects chang… FALSE    TRUE       
## 2 digest       digest      Object in both… Objects chang… FALSE    TRUE       
## 3 dplyr        contains    Object in both… Both object a… FALSE    FALSE      
## 4 dplyr        ends_with   Object in both… Both object a… FALSE    FALSE      
## 5 dplyr        enquos      Object in both… Objects chang… FALSE    TRUE       
## 6 dplyr        everything  Object in both… Both object a… FALSE    FALSE      
## # … with 16 more variables: package_source_inv1 <chr>,
## #   package_version_inv1 <chr>, package_gitrepo_inv1 <chr>,
## #   package_gitcommit_inv1 <chr>, function_text_inv1 <list>,
## #   fun_params_inv1 <list>, obj_checksum_inv1 <chr>,
## #   params_checksum_inv1 <chr>, package_source_inv2 <chr>,
## #   package_version_inv2 <chr>, package_gitrepo_inv2 <chr>,
## #   package_gitcommit_inv2 <chr>, function_text_inv2 <list>,
## #   fun_params_inv2 <list>, obj_checksum_inv2 <chr>, params_checksum_inv2 <chr>
```

---

# Case Study

---

# `tidyr` background

**`tidyr`** is a package by [Hadley Wickham and Lionel Henry of RStudio](https://tidyr.tidyverse.org/), and part of the `tidyverse`.

It is meant to help users create **tidy** data:  
  - each column is a variable
  - each row is an observation  
  - each cell contains a single value  
  
`tidyr` is a widely-used package, with monthly downloads in the hundreds of thousands:

![](sdss-manager_files/figure-html/unnamed-chunk-15-1.png)

---

# `tidyr` changes

`tidyr` is also still subject to change relatively often:  
  - two package versions released in 2019, one so far in 2020<sup>1</sup>  
  - this is true for many tidyverse packages, and leaves people's code and packages vulnerable

<br />

Consider the following example: 
  - user with version 0.8.3 (released Mar 2, 2019)  
  - updates to version 1.0.0 (released Sep 13, 2019)  
  - all other packages on computer unchanged

---

# `tidyr` changes

```r
compare_tidyr <- compare_inventory(inventory1 = tidyr_083, inventory2 = tidyr_100)
```

---

# `tidyr` changes

**Version 0.8.3** `nest()` function:

`ChickWeight %>% nest(-Diet) %>% glimpse()`

```
## Rows: 4
## Columns: 2
## $ Diet <fct> 1, 2, 3, 4
## $ data <list> [<data.frame[220 x 3]>, <data.frame[120 x 3]>, <data.frame[120 …
```

**Version 1.0.0** `nest()` function:

`ChickWeight %>% nest(-Diet) %>% glimpse()`<sup>1</sup>

```
## Rows: 4
## Columns: 2
## $ ...1 <fct> 1, 2, 3, 4
## $ data <list> [<data.frame[220 x 3]>, <data.frame[120 x 3]>, <data.frame[120 …
```

*Note the difference between the column names.*

.footnote[[1] Note that in the RStudio console, the version 1.0.0 code will throw an error and not run. RMarkdown files (papers and slides) seem to render it just fine, but with incorrect column name.]

---

# `tidyr` conclusions

<br /> 
- Syntax of `nest()` has changed

<br />

- Developers provide release notes and warning messages

<br />  
- Can still break your code!  
  
---

# Wrapping Up

---

# Conclusions and Future Work

**Conclusions**  
  - identify changes *before* they break your code  
    - changes to packages over time  
    - differences in package versions across machines  
  - our tools provide a way to quickly assess things and identify changes  
  - team collaboration or reproducibility over time

**Future Work**  
  - making the `manager` package less dependent on `tidyverse` packages
  - C/C++ code is difficult to get around
    - when it is compiled, it gets a user-specific "address"   
  - **interactivity**: 
    - script mapping: identify how scripts use functions and the relationship between them  
    - comparing two dependency trees  
    - remove a package dependency - how much does that simplify the tree?

---

# Thank You!