class: center, middle, inverse, title-slide .title[ # Applying R to Big Data Processing ] .subtitle[ ## Dtplyr as an Example ] .author[ ### Kaiping Yang ] .institute[ ### BeiGene ] --- <style type="text/css"> .remark-code { max-height: 500px; } </style> ## Background #### Pharmaceutical Research ##### Clinical Trial Data Analysis - During the development of new drugs, clinical trials generate a large amount of data on drug safety, efficacy, and dose-response relationships. - Big data analytics enables researchers to identify drug effects more quickly and accurately, and to assess risks and benefits. ##### Genomics and Bioinformatics - As the cost of genome sequencing decreases, vast amounts of genomic data are used to understand disease mechanisms, discover new targets, and develop personalized medicines. - Big data analysis techniques can process these complex data, accelerating the development of precision medicine. ##### Drug Repurposing - Analyzing data from existing drugs can reveal new uses for old drugs, significantly reducing the cost and time of drug development. --- ## Background #### Real-World Studies ##### Drug Effectiveness and Safety Assessment - Evaluating the real-world effectiveness and safety of drugs using observational studies and patient-reported outcomes. ##### Medical Decision Support - Providing healthcare professionals with data-driven insights to support clinical decision-making. ##### Personalized Medicine and Precision Treatment - Utilizing real-world data to tailor treatments to individual patients, improving outcomes and minimizing adverse effects. --- ### dplyr api flowchart <center><img src="files/dplyr api flowchart.png" height="500px" /></center> --- ### dtplyr api .panelset[ .panel[.panel-name[dtplyr] ```r library(data.table) library(dtplyr) iris_lazy_dt <- lazy_dt(iris) results_lazy_dt <- iris_lazy_dt %>% filter(Sepal.Length > 5.0) %>% select(Species, Sepal.Length) results_lazy_dt ``` ``` ## Source: local data table [118 x 2] ## Call: `_DT1`[Sepal.Length > 5, .(Species, Sepal.Length)] ## ## Species Sepal.Length ## <fct> <dbl> ## 1 setosa 5.1 ## 2 setosa 5.4 ## 3 setosa 5.4 ## 4 setosa 5.8 ## 5 setosa 5.7 ## 6 setosa 5.4 ## # ℹ 112 more rows ## ## # Use as.data.table()/as.data.frame()/as_tibble() to access results ``` ] .panel[.panel-name[show_query] ```r class(results_lazy_dt) ``` ``` ## [1] "dtplyr_step_subset" "dtplyr_step" ``` ```r results_lazy_dt %>% show_query() ``` ``` ## `_DT1`[Sepal.Length > 5, .(Species, Sepal.Length)] ``` ] .panel[.panel-name[collect] ```r results <- results_lazy_dt %>% collect() results ``` ``` ## # A tibble: 118 × 2 ## Species Sepal.Length ## <fct> <dbl> ## 1 setosa 5.1 ## 2 setosa 5.4 ## 3 setosa 5.4 ## 4 setosa 5.8 ## 5 setosa 5.7 ## 6 setosa 5.4 ## 7 setosa 5.1 ## 8 setosa 5.7 ## 9 setosa 5.1 ## 10 setosa 5.4 ## # ℹ 108 more rows ``` ] ] --- ### dbplyr api .panelset[ .panel[.panel-name[dbplyr] ```r library(dbplyr) con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:") copy_to(con, iris, temporary = FALSE, name = "iris_table") results_db <- con %>% tbl("iris_table") %>% filter(Sepal.Length > 5.0) %>% select(Species, Sepal.Length) results_db ``` ``` ## # Source: SQL [?? 
x 2] ## # Database: sqlite 3.41.2 [:memory:] ## Species Sepal.Length ## <chr> <dbl> ## 1 setosa 5.1 ## 2 setosa 5.4 ## 3 setosa 5.4 ## 4 setosa 5.8 ## 5 setosa 5.7 ## 6 setosa 5.4 ## 7 setosa 5.1 ## 8 setosa 5.7 ## 9 setosa 5.1 ## 10 setosa 5.4 ## # ℹ more rows ``` ] .panel[.panel-name[show_query] ```r class(results_db) ``` ``` ## [1] "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql" ## [4] "tbl_lazy" "tbl" ``` ```r results_db %>% show_query() ``` ``` ## <SQL> ## SELECT `Species`, `Sepal.Length` ## FROM `iris_table` ## WHERE (`Sepal.Length` > 5.0) ``` ] .panel[.panel-name[collect] ```r results <- results_db %>% collect() results ``` ``` ## # A tibble: 118 × 2 ## Species Sepal.Length ## <chr> <dbl> ## 1 setosa 5.1 ## 2 setosa 5.4 ## 3 setosa 5.4 ## 4 setosa 5.8 ## 5 setosa 5.7 ## 6 setosa 5.4 ## 7 setosa 5.1 ## 8 setosa 5.7 ## 9 setosa 5.1 ## 10 setosa 5.4 ## # ℹ 108 more rows ``` ```r DBI::dbDisconnect(con) ``` ] ] --- ### sparklyr api .panelset[ .panel[.panel-name[sparklyr] ```r library(sparklyr) # spark_install() sc <- spark_connect(master = "local") iris_spark <- copy_to(sc, iris, "iris_spark", overwrite = TRUE) results_spark <- iris_spark %>% select(Species, Sepal_Length) %>% filter(Sepal_Length > 5.0) results_spark ``` ``` # Source: spark<?> [?? x 2] Species Sepal_Length <chr> <dbl> 1 setosa 5.1 2 setosa 5.4 3 setosa 5.4 4 setosa 5.8 5 setosa 5.7 6 setosa 5.4 7 setosa 5.1 8 setosa 5.7 9 setosa 5.1 10 setosa 5.4 # ℹ more rows # ℹ Use `print(n = ...)` to see more rows ``` ] .panel[.panel-name[show_query] ```r class(results_spark) results_spark %>% show_query() ``` ``` [1] "tbl_spark" "tbl_sql" "tbl_lazy" "tbl" <SQL> SELECT `Species`, `Sepal_Length` FROM `iris_spark` WHERE (`Sepal_Length` > 5.0) ``` ] .panel[.panel-name[collect] ```r results <- results_spark %>% collect() results spark_disconnect(sc) ``` ``` # A tibble: 118 × 2 Species Sepal_Length <chr> <dbl> 1 setosa 5.1 2 setosa 5.4 3 setosa 5.4 4 setosa 5.8 5 setosa 5.7 6 setosa 5.4 7 setosa 5.1 8 setosa 5.7 9 setosa 5.1 10 setosa 5.4 # ℹ 108 more rows # ℹ Use `print(n = ...)` to see more rows ``` ] ] --- ### arrow api .panelset[ .panel[.panel-name[arrow] ```r library(arrow) iris_arrow <- iris %>% to_duckdb() %>% to_arrow() results_arrow <- iris_arrow %>% filter(Sepal.Length > 5.0) %>% select(Species, Sepal.Length) results_arrow ``` ``` RecordBatchReader (query) Species: string Sepal.Length: double * Filter: (Sepal.Length > 5) See $.data for the source Arrow object ``` ] .panel[.panel-name[show_query] ```r class(results_arrow) # [1] "arrow_dplyr_query" results_arrow %>% show_query() # ExecPlan with 4 nodes: # 3:SinkNode{} # 2:ProjectNode{projection=[Species, Sepal.Length]} # 1:FilterNode{filter=(Sepal.Length > 5)} # 0:SourceNode{} ``` ] .panel[.panel-name[collect] ```r results <- results_arrow %>% collect() results ``` ``` # A tibble: 118 × 2 Species Sepal.Length <chr> <dbl> 1 setosa 5.1 2 setosa 5.4 3 setosa 5.4 4 setosa 5.8 5 setosa 5.7 6 setosa 5.4 7 setosa 5.1 8 setosa 5.7 9 setosa 5.1 10 setosa 5.4 # ℹ 108 more rows # ℹ Use `print(n = ...)` to see more rows ``` ] ] --- ## Background - **Pharmaceutical Industry Data**: Clinical trials, genomes, drug reactions, etc. - **R's Role**: Versatile data science tool with rich data manipulation capabilities. - **data.table**: High-performance package for large-scale data operations. - **dplyr**: Concise syntax for data manipulation with verb functions. - **dtplyr**: Combines the best of both worlds for efficient data processing. 
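In practice the workflow is the one the API slides above already demonstrated. A minimal sketch (using the built-in `mtcars` data for illustration): wrap a data frame with `lazy_dt()`, pipe it through ordinary dplyr verbs, and materialize only at the end.

```r
library(dtplyr)
library(dplyr)

# Wrap once: subsequent verbs are recorded, not executed
mtcars_dt <- lazy_dt(mtcars)

mtcars_dt %>%
  filter(cyl == 6) %>%
  group_by(gear) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  as_tibble()   # the translated data.table call runs only here
```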
.panelset[
.panel[.panel-name[R Code]

```r
library(dtplyr)
library(data.table)
library(dplyr)
library(tidyverse)
library(microbenchmark)
library(knitr)
```

]
]

---

## Introduction to data.table

### Why `data.table`?

* concise syntax: fast to type, fast to read
* fast speed
* memory efficient
* careful API lifecycle management
* community
* feature rich

---

### Features

* fast and friendly delimited **file reader**: **[`?fread`](https://rdatatable.gitlab.io/data.table/reference/fread.html)**, see also [convenience features for _small_ data](https://github.com/Rdatatable/data.table/wiki/Convenience-features-of-fread)
* fast and feature rich delimited **file writer**: **[`?fwrite`](https://rdatatable.gitlab.io/data.table/reference/fwrite.html)**
* low-level **parallelism**: many common operations are internally parallelized to use multiple CPU threads
* fast and scalable aggregations; e.g. 100GB in RAM (see [benchmarks](https://duckdblabs.github.io/db-benchmark/) on up to **two billion rows**)
* fast and feature rich joins: **ordered joins** (e.g. rolling forwards, backwards, nearest and limited staleness), **[overlapping range joins](https://github.com/Rdatatable/data.table/wiki/talks/EARL2014_OverlapRangeJoin_Arun.pdf)** (similar to `IRanges::findOverlaps`), **[non-equi joins](https://github.com/Rdatatable/data.table/wiki/talks/ArunSrinivasanUseR2016.pdf)** (i.e. joins using operators `>, >=, <, <=`), **aggregate on join** (`by=.EACHI`), **update on join**
* fast add/update/delete columns **by reference** by group using no copies at all
* fast and feature rich **reshaping** data: **[`?dcast`](https://rdatatable.gitlab.io/data.table/reference/dcast.data.table.html)** (_pivot/wider/spread_) and **[`?melt`](https://rdatatable.gitlab.io/data.table/reference/melt.data.table.html)** (_unpivot/longer/gather_)
* **any R function from any R package** can be used in queries, not just the subset of functions made available by a database backend; columns of type `list` are also supported
* has **[no dependencies](https://en.wikipedia.org/wiki/Dependency_hell)** at all other than base R itself, for simpler production/maintenance
* the R dependency is **as old as possible for as long as possible**, dated April 2014, and we continuously test against that version; e.g. v1.11.0 released on 5 May 2018 bumped the dependency up from 5 year old R 3.0.0 to 4 year old R 3.1.0

---

### data.table syntax

```r
library(data.table)
DT = as.data.table(iris)

# FROM[WHERE, SELECT, GROUP BY]
# DT  [i,     j,      by]
DT[Petal.Width > 1.0, mean(Petal.Length), by = Species]
#       Species       V1
#1:  versicolor 4.362791
#2:   virginica 5.552000
```

<center><img src="files/data table syntax.png" height="300px" /></center>

---

### Function reference

* **between() `%between%` inrange() `%inrange%`**: Convenience functions for range subsets.
* **fcoalesce()**: Coalescing missing values
* **duplicated(<data.table>) unique(<data.table>) anyDuplicated(<data.table>) uniqueN()**: Determine Duplicate Rows
* **fcase()**: Fast case-when
* **fifelse()**: Fast if-else
* **fsort()**: Fast parallel sort
* **first() last()**: First/last item of an object
* **like() `%like%` `%ilike%` `%flike%` `%plike%`**: Convenience function for calling grep.
* **na.omit(<data.table>)**: Remove rows with missing values on columns specified
* **nafill() setnafill()**: Fill missing values
* **`%notin%`**: Convenience operator for checking if an example is not in a set of elements
* **fintersect() fsetdiff() funion() fsetequal()**: Set operations for data tables
* **shift()**: Fast lead/lag for vectors and lists

---

### any R function from any R package

.panelset[
.panel[.panel-name[simulated data]

```r
set.seed(123)
x = rnorm(200)
df <- data.table(
  group = rep(c("A", "B"), each = 100),
  x = rnorm(200),
  # note: `x` below is the vector defined above, not the column `x`;
  # data.table() evaluates its arguments in the calling scope
  y = rnorm(200) + 0.5 * x + rnorm(200, sd = 0.2)
)
```

```
      group          x            y
     <char>      <num>        <num>
  1:      A  2.1988103 -0.138991390
  2:      A  1.3124130 -1.289209562
  3:      A -0.2651451  0.137939824
  4:      A  0.5431941 -0.296800882
  5:      A -0.4143399  0.893416905
 ---
196:      B  1.5039006  0.006455795
197:      B -0.7741449  0.250878838
198:      B  0.8457315 -2.656341062
199:      B -1.2606829  0.921820920
200:      B -0.3545424  1.396673491
```

]
.panel[.panel-name[r base]

```r
groups <- split(df, df$group)
lapply(groups, function(g) coef(lm(y ~ x, data = g))["x"])
```

```
$A
          x
-0.08606835

$B
          x
-0.02820232
```

]
.panel[.panel-name[in data.table]

```r
df[ , coef(lm(y ~ x))['x'], by = group]
```

```
    group          V1
   <char>       <num>
1:      A -0.08606835
2:      B -0.02820232
```

]
]

---

## Import data

.panelset[
.panel[.panel-name[adlb]

```r
set.seed(123)
nrows = 1e7
# nrows = 1e2
ADLB <- tibble(
  STUDYID = rep("StudyXYZ", nrows),
  USUBJID = paste("SUBJ", sprintf("%08d", 1:nrows), sep = ""),
  SAFFL = sample(c("Y", "N"), size = nrows, replace = TRUE, prob = c(0.8, 0.2)),
  EFFFL = sample(c("Y", "N"), size = nrows, replace = TRUE, prob = c(0.7, 0.3)),
  COHORT = sample(c("A", "B", "C"), size = nrows, replace = TRUE),
  TRT01A = sample(c("DrugX", "DrugY", "Placebo"), size = nrows, replace = TRUE),
  PARAMCD = sample(c("ALT", "AST", "BILI", "ALP", "GGT", "LDH"), nrows, replace = TRUE),
  AVAL = round(runif(nrows, min = 5, max = 250), digits = 1),
  AVISIT = sample(c("Baseline", "Week 1", "Week 4", "Week 8", "Week 12"), nrows, replace = TRUE),
  AVISITN = as.numeric(factor(AVISIT, levels = c("Baseline", "Week 1", "Week 4", "Week 8", "Week 12"))),
  ABLFL = sample(c("Y", "N"), nrows, replace = TRUE, prob = c(0.1, 0.9))
)
head(ADLB)
# write_csv(ADLB, "data/ADLB.csv")
```

]
.panel[.panel-name[tibble]

```r
ADLB <- vroom::vroom("data/ADLB.csv")
head(ADLB)
```

```
# A tibble: 6 × 11
  STUDYID  USUBJID      SAFFL EFFFL COHORT TRT01A  PARAMCD  AVAL AVISIT   AVISITN ABLFL
  <chr>    <chr>        <chr> <chr> <chr>  <chr>   <chr>   <dbl> <chr>      <dbl> <chr>
1 StudyXYZ SUBJ00000001 Y     Y     B      DrugY   BILI    180.  Week 8         4 N
2 StudyXYZ SUBJ00000002 Y     Y     C      DrugX   ALT     144.  Week 1         2 N
3 StudyXYZ SUBJ00000003 Y     Y     A      DrugX   GGT     108   Week 4         3 N
4 StudyXYZ SUBJ00000004 N     Y     C      Placebo LDH      41.2 Week 8         4 N
5 StudyXYZ SUBJ00000005 N     Y     A      DrugX   LDH      36.7 Week 8         4 N
6 StudyXYZ SUBJ00000006 Y     Y     C      Placebo LDH     123.  Baseline       1 N
```

]
.panel[.panel-name[data.table]

```r
ADLB_dt <- as.data.table(ADLB)
head(ADLB_dt)
```

```
    STUDYID      USUBJID SAFFL EFFFL COHORT  TRT01A PARAMCD  AVAL   AVISIT AVISITN ABLFL
1: StudyXYZ SUBJ00000001     Y     Y      B   DrugY    BILI 180.4   Week 8       4     N
2: StudyXYZ SUBJ00000002     Y     Y      C   DrugX     ALT 143.7   Week 1       2     N
3: StudyXYZ SUBJ00000003     Y     Y      A   DrugX     GGT 108.0   Week 4       3     N
4: StudyXYZ SUBJ00000004     N     Y      C Placebo     LDH  41.2   Week 8       4     N
5: StudyXYZ SUBJ00000005     N     Y      A   DrugX     LDH  36.7   Week 8       4     N
6: StudyXYZ SUBJ00000006     Y     Y      C Placebo     LDH 123.2 Baseline       1     N
```

]
.panel[.panel-name[lazy_dt]

```r
ADLB_lazy_dt <- lazy_dt(ADLB)
ADLB_lazy_dt
```

```
Source: local data table [10,000,000 x 11]
Call:   `_DT1`

  STUDYID  USUBJID      SAFFL EFFFL COHORT TRT01A  PARAMCD  AVAL AVISIT   AVISITN ABLFL
  <chr>    <chr>        <chr> <chr> <chr>  <chr>   <chr>   <dbl> <chr>      <dbl> <chr>
1 StudyXYZ SUBJ00000001 Y     Y     B      DrugY   BILI    180.  Week 8         4 N
2 StudyXYZ SUBJ00000002 Y     Y     C      DrugX   ALT     144.  Week 1         2 N
3 StudyXYZ SUBJ00000003 Y     Y     A      DrugX   GGT     108   Week 4         3 N
4 StudyXYZ SUBJ00000004 N     Y     C      Placebo LDH      41.2 Week 8         4 N
5 StudyXYZ SUBJ00000005 N     Y     A      DrugX   LDH      36.7 Week 8         4 N
6 StudyXYZ SUBJ00000006 Y     Y     C      Placebo LDH     123.  Baseline       1 N
# ℹ 9,999,994 more rows
# ℹ Use `print(n = ...)` to see more rows

# Use as.data.table()/as.data.frame()/as_tibble() to access results
```

]
]

---

## Purpose

- **Introduction**: Use of dtplyr for big data processing in the pharmaceutical industry.
- **Examples**: Data reading, cleaning, processing, and analysis with dtplyr.
- **Comparison**: Performance and ease of use versus other R packages.
- **Pros & Cons**: Advantages, disadvantages, and application scenarios of dtplyr.

<center><img src="files/dtplyr_bridging_the_big_data_gap.jpg" height="350px" /></center>

---

## Methods

- **Data Size**: Millions of rows.
- **R Version**: 4.1.2
- **Packages**: dtplyr, data.table, dplyr, tidyverse, microbenchmark, etc.
- **dtplyr Usage**: `lazy_dt()` for conversion, dplyr verbs for operations, and conversion back to data frames or tibbles.
- **Performance**: `microbenchmark()` for timing comparisons and `autoplot()` for visualization.

---

### Performance Comparison

- Fast functions of data.table

.panelset[
.panel[.panel-name[fsort]

```r
x = runif(1e6)
system.time(ans1 <- sort(x, method="quick"))
#>    user  system elapsed
#>   0.109   0.003   0.113
system.time(ans2 <- fsort(x))
#>    user  system elapsed
#>   0.027   0.002   0.028
identical(ans1, ans2)
#> [1] TRUE
```

]
.panel[.panel-name[fifelse]

```r
microbenchmark(
  ifelse  = ifelse(x > 0.5, 1, 0),
  if_else = if_else(x > 0.5, 1, 0),
  fifelse = fifelse(x > 0.5, 1, 0),
  times = 10
)
```

```
Unit: milliseconds
    expr       min        lq      mean    median        uq       max neval
  ifelse 23.448985 23.635879 23.732932 23.739346 23.860450 23.930417    10
 if_else 33.108476 33.186554 33.429599 33.471577 33.643490 33.780251    10
 fifelse  4.579173  4.620743  4.720137  4.675604  4.724404  5.215244    10
```

]
.panel[.panel-name[fcase]

```r
microbenchmark(
  case_when = case_when(
    x < 0.5 ~ 0,
    x >= 0.5 ~ 1
  ),
  fcase = fcase(
    x < 0.5, 0,
    x >= 0.5, 1
  ),
  times = 10
)
```

```
Unit: milliseconds
      expr      min       lq     mean   median       uq      max neval
 case_when 39.61633 39.81768 40.12419 39.93951 40.30975 41.18214    10
     fcase 13.68499 13.89618 14.20279 13.98189 14.21023 15.90134    10
```

]
]

---

### Performance Comparison

- Reading large files with `fread()`.
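In the benchmark below, note that `vroom` appears fastest largely because it indexes the file lazily (via ALTREP) and defers materializing values, while `fread()` parses everything eagerly. When only part of a file is needed, `fread()` can also skip unneeded columns. A minimal sketch, assuming the `data/ADLB.csv` file written earlier:

```r
library(data.table)

# Read only the columns a downstream analysis needs;
# `select` avoids storing the other columns in memory
adlb_slim <- fread(
  "data/ADLB.csv",
  select = c("USUBJID", "PARAMCD", "AVAL", "AVISIT")
)
```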
.panelset[ .panel[.panel-name[R code] ```r microbenchmark( read.csv = read.csv("data/ADLB.csv"), read_csv = readr::read_csv("data/ADLB.csv"), vroom = vroom::vroom("data/ADLB.csv"), fread = fread("data/ADLB.csv"), times = 1 ) # Unit: milliseconds # expr min lq mean median uq max neval # read.csv 24714.3880 24714.3880 24714.3880 24714.3880 24714.3880 24714.3880 1 # read_csv 13850.4502 13850.4502 13850.4502 13850.4502 13850.4502 13850.4502 1 # vroom 693.9692 693.9692 693.9692 693.9692 693.9692 693.9692 1 # fread 2410.7339 2410.7339 2410.7339 2410.7339 2410.7339 2410.7339 1 ``` ] ] --- ### Performance Comparison - Data cleaning with `filter()`, `mutate()`. .panelset[ .panel[.panel-name[filter()] ```r microbenchmark( dplyr = ADLB %>% filter(SAFFL == "Y"), data.table = ADLB_dt[SAFFL == "Y"], dtplyr = lazy_dt(ADLB) %>% filter(SAFFL == "Y") %>% collect(), dtplyr_lazy_dt = ADLB_lazy_dt %>% filter(SAFFL == "Y") %>% collect(), dtplyr_step = ADLB_lazy_dt %>% filter(SAFFL == "Y"), times = 10 ) ``` ``` # Unit: microseconds # expr min lq mean median uq max neval # dplyr 647833.231 726671.896 912982.2005 765481.822 776278.598 2195434.633 10 # data.table 548526.187 556034.098 741727.8233 716915.549 906294.220 1026066.912 10 # dtplyr 797393.558 905034.695 1277613.8551 1162588.351 1289447.394 2617747.699 10 # dtplyr_lazy_dt 451409.459 461740.580 726950.0772 849271.776 905560.158 957123.854 10 # dtplyr_step 383.841 387.971 625.2246 670.492 743.212 863.952 10 ``` ] .panel[.panel-name[mutate()] ```r ADLB_dt_copy <- copy(ADLB_dt) microbenchmark( dplyr = ADLB %>% mutate(SUBJID = paste(STUDYID, USUBJID )), data.table = ADLB_dt_copy[, `:=`(SUBJID =paste(STUDYID, USUBJID ))], dtplyr = lazy_dt(ADLB) %>% mutate(SUBJID =paste(STUDYID, USUBJID )) %>% collect(), dtplyr_lazy_dt = ADLB_lazy_dt %>% mutate(SUBJID = paste(STUDYID, USUBJID )) %>% collect(), times = 10 ) # Unit: seconds # expr min lq mean median uq max neval # dplyr 3.155472 3.170916 3.237791 3.199588 3.204866 3.694258 10 # data.table 1.573593 1.581879 1.585378 1.585486 1.589344 1.595787 10 # dtplyr 1.929725 2.108639 2.453532 2.198572 2.902294 3.111678 10 # dtplyr_lazy_dt 1.869382 1.877024 2.016532 1.897585 1.932124 2.574179 10 ``` ] ] --- ### Performance Comparison <!-- Database-like ops benchmark --> - Data processing with `group_by()`, `left_join()`. 
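The group-by and join timings below come from the database-like ops benchmark. To see what dtplyr actually executes for these verbs, `show_query()` prints the generated data.table call; a small sketch with made-up tables:

```r
library(dtplyr)
library(dplyr)

# Toy tables, made up only to inspect the translation
visits <- lazy_dt(data.frame(id = c(1, 1, 2), aval = c(10, 12, 9)))
arms   <- lazy_dt(data.frame(id = 1:2, trt = c("DrugX", "Placebo")))

visits %>%
  left_join(arms, by = "id") %>%
  group_by(trt) %>%
  summarise(mean_aval = mean(aval)) %>%
  show_query()
```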
.panelset[
.panel[.panel-name[groupby]
<center><img src="files/groupby_benchmark.png" height="350px" /></center>
]
.panel[.panel-name[join]
<center><img src="files/join_benchmark.png" height="350px" /></center>
]
]

---

### Performance Comparison

- Data processing

.panelset[
.panel[.panel-name[dplyr]

```r
dplyr_process <- function(ADLB){
  ADLB %>%
    filter(SAFFL == "Y") %>%
    arrange(TRT01A, PARAMCD, AVAL) %>%
    group_by(TRT01A, PARAMCD) %>%
    mutate(SUBJID = paste(STUDYID, USUBJID)) %>%
    summarise(a = mean(AVAL))
}
```

]
.panel[.panel-name[data.table]

```r
dt_process <- function(ADLB_dt){
  ADLB_dt[SAFFL == "Y"][order(TRT01A, PARAMCD, AVAL)][
    , `:=`(SUBJID = paste(STUDYID, USUBJID)), by = .(TRT01A, PARAMCD)][
    , .(a = mean(AVAL)), keyby = .(TRT01A, PARAMCD)]
}
```

]
.panel[.panel-name[dtplyr]

```r
dtplyr_process <- function(ADLB){
  lazy_dt(ADLB) %>%
    filter(SAFFL == "Y") %>%
    arrange(TRT01A, PARAMCD, AVAL) %>%
    group_by(TRT01A, PARAMCD) %>%
    mutate(SUBJID = paste(STUDYID, USUBJID)) %>%
    summarise(a = mean(AVAL)) %>%
    collect()
}
```

]
.panel[.panel-name[dtplyr_lazy_dt]

```r
dtplyr_lazy_dt_process <- function(ADLB_lazy_dt){
  ADLB_lazy_dt %>%
    filter(SAFFL == "Y") %>%
    arrange(TRT01A, PARAMCD, AVAL) %>%
    group_by(TRT01A, PARAMCD) %>%
    mutate(SUBJID = paste(STUDYID, USUBJID)) %>%
    summarise(a = mean(AVAL)) %>%
    collect()
}
```

]
.panel[.panel-name[microbenchmark]

```r
ADLB <- ADLB[1:10000,]
ADLB_dt <- as.data.table(ADLB)
ADLB_lazy_dt <- lazy_dt(ADLB)

microbenchmark(
  dplyr = dplyr_process(ADLB),
  data.table = dt_process(ADLB_dt),
  dtplyr = dtplyr_process(ADLB),
  dtplyr_lazy_dt = dtplyr_lazy_dt_process(ADLB_lazy_dt),
  times = 10
)
# Unit: milliseconds
#           expr       min        lq      mean    median        uq      max neval
#          dplyr 47.190455 48.033038 48.235663 48.394003 48.495019 48.91615    10
#     data.table  6.061097  6.157647  6.477774  6.494478  6.763869  6.90094    10
#         dtplyr 12.714876 12.807087 13.017500 13.047317 13.111517 13.35109    10
# dtplyr_lazy_dt 11.814074 11.913484 12.080253 12.015099 12.306755 12.40006    10
```

]
]

---

## Results

- **Tasks with dtplyr**:
  - Reading large files with `fread()`.
  - Data cleaning with `filter()`, `mutate()`, `recode()`, and `na.omit()`.
  - Data processing with `group_by()`, `summarise()`, `arrange()`, and `left_join()`.
  - Data analysis with `count()`, `top_n()`, and `ggplot()` for visualization.
- **Performance Comparison**:
  - dtplyr is generally faster than dplyr.
  - data.table outperforms in grouping and joining operations.
- **Pros & Cons of dtplyr**:
  - A compromise between dplyr and data.table.
  - High performance with ease of use.
  - Limitations include limited dplyr function support, no in-place updates, and no distributed computing.

---

## Case Study

### Biomarker analysis: Post-hoc analysis dataset

- The dataset has 1.24 million rows and 31 columns
- The dataset includes the results of standard post-hoc analyses of different biomarkers

### Real-world research: Drug sales dataset

- The dataset has 4.21 million rows and 36 columns
- The dataset records the sales information of each drug every day.
### Real-world research: symphony database

- The dataset has 4.64 billion rows
- The dataset contains 10 years of data on Medicare claims diagnoses in the United States

---

### Post-hoc analysis dataset

.panelset[
.panel[.panel-name[dataset]

```r
head(result_data_raw)
```

```
# A tibble: 6 × 22
  Type  BMK_N…¹ Endpo…² Model Effect Descr…³ TRT   BMK    Pval   EST   LWR   UPR FREQN
  <chr> <chr>   <chr>   <chr> <chr>  <chr>   <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 BMUT… ALL     DCR     ""    Treat… T/C @ … T/C   All     xxx   xxx   xxx   xxx xxx
2 BMUT… ALL     OBJR    ""    Treat… T/C @ … T/C   All     xxx   xxx   xxx   xxx xxx
3 BMUT… ALL     OS      ""    Treat… T/C @ … T/C   All     xxx   xxx   xxx   xxx xxx
4 BMUT… ALL     PFS     ""    Treat… T/C @ … T/C   All     xxx   xxx   xxx   xxx xxx
5 BMUT… BMUT_A… DCR     "Pre… Predi… (T/C @… T/C   L1/L0   xxx   xxx   xxx   xxx xxx
6 BMUT… BMUT_A… DCR     "Pre… Progn… L1/L0 … C     L1/L0   xxx   xxx   xxx   xxx xxx
# … with 9 more variables: FREQN1 <chr>, FREQN2 <chr>, KMMEDI <chr>, KMMEDI1 <chr>,
#   KMMEDI2 <chr>, LRT_PVAL <dbl>, RESRATE <chr>, RESRATE1 <chr>, RESRATE2 <chr>, and
#   abbreviated variable names ¹BMK_Name, ²Endpoint, ³Description
# ℹ Use `colnames()` to see all variable names
```

]
.panel[.panel-name[dplyr]

```r
get_bmk_rank <- function(result_data_raw){
  Cutoffs <- result_data_raw %>%
    distinct(Type, BMK_Name) %>%
    mutate(Cutoff = str_extract(BMK_Name, "__\\w+?$") %>% str_remove('^__') %>% coalesce('Default'))
  Effects <- result_data_raw %>%
    distinct(Model, Effect) %>%
    mutate(Effect1 = ifelse(Model == 'Predictive' & Effect == 'Prognostic', "Prognostic ", Effect))
  BMK_Names <- result_data_raw %>%
    distinct(Type, BMK_Name, Effect) %>%
    filter(Effect == "Prog + Pred" | (Type != "" & BMK_Name == "ALL")) %>%
    distinct(paste(Type, BMK_Name)) %>%
    pull()
  result_data_raw %>%
    filter(paste(Type, BMK_Name) %in% BMK_Names) %>%
    arrange(Type) %>%
    mutate(
      Pval = round(Pval, 4), LRT_PVAL = round(LRT_PVAL, 4),
      Estimate = glue::glue("{v}({l}, {u})", v = sprintf("%.2f", EST), l = sprintf("%.2f", LWR), u = sprintf("%.2f", UPR))
    ) %>%
    left_join(Cutoffs, by = c("Type", "BMK_Name")) %>%
    left_join(Effects, by = c("Model", "Effect")) %>%
    select(-Effect) %>%
    mutate(Type = str_to_upper(Type),
           BMK_Name = ifelse(BMK_Name == "ALL", paste(str_to_upper(Type), "ALL", sep = "__"), BMK_Name)) %>%
    rename(Biomarker = BMK_Name, Effect = Effect1, P_value = Pval)
}
```

]
.panel[.panel-name[dtplyr]

```r
get_bmk_rank_dt <- function(result_data_raw){
  result_data_raw_lazy_dt <- lazy_dt(result_data_raw)
  Cutoffs <- result_data_raw_lazy_dt %>%
    distinct(Type, BMK_Name) %>%
    mutate(Cutoff = str_extract(BMK_Name, "__\\w+?$") %>% str_remove('^__') %>% coalesce('Default'))
  Effects <- result_data_raw_lazy_dt %>%
    distinct(Model, Effect) %>%
    mutate(Effect1 = ifelse(Model == 'Predictive' & Effect == 'Prognostic', "Prognostic ", Effect))
  BMK_Names <- result_data_raw_lazy_dt %>%
    distinct(Type, BMK_Name, Effect) %>%
    filter(Effect == "Prog + Pred" | (Type != "" & BMK_Name == "ALL")) %>%
    distinct(paste(Type, BMK_Name)) %>%
    pull()
  result_data_raw_lazy_dt %>%
    filter(paste(Type, BMK_Name) %in% BMK_Names) %>%
    arrange(Type) %>%
    mutate(
      Pval = round(Pval, 4), LRT_PVAL = round(LRT_PVAL, 4),
      Estimate = glue::glue("{v}({l}, {u})", v = sprintf("%.2f", EST), l = sprintf("%.2f", LWR), u = sprintf("%.2f", UPR))
    ) %>%
    left_join(Cutoffs, by = c("Type", "BMK_Name")) %>%
    left_join(Effects, by = c("Model", "Effect")) %>%
    select(-Effect) %>%
    mutate(Type = str_to_upper(Type),
           BMK_Name = ifelse(BMK_Name == "ALL", paste(str_to_upper(Type), "ALL", sep = "__"), BMK_Name)) %>%
    rename(Biomarker = BMK_Name, Effect = Effect1, P_value = Pval) %>%
    as_tibble()
}
```

]
.panel[.panel-name[microbenchmark]

```r
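# Benchmark the dplyr and dtplyr implementations on the full
# post-hoc analysis dataset (1.24 million rows), 10 runs each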
microbenchmark::microbenchmark(
  get_bmk_rank(result_data_raw),
  get_bmk_rank_dt(result_data_raw),
  times = 10
)
# Unit: seconds
#            expr       min        lq      mean    median        uq       max neval cld
#    get_bmk_rank 10.678574 11.063172 11.326080 11.279834 11.717900 11.816315    10   b
# get_bmk_rank_dt  2.902138  3.398134  3.830075  3.938931  4.226232  4.921175    10  a
```

]
]

---

### Drug sales dataset

.panelset[
.panel[.panel-name[dataset]

```r
head(drug_sales)
```

```
# A tibble: 6 × 36
  HOSPCODE   BRAND_EN   ORDERDATE_ym DSM_TERRITORY_CODE month_day Day_1 Day_2 Day_3 Day_4 Day_5 Day_6 Day_7 Day_8 Day_9 Day_10 Day_11 Day_12
  <chr>      <chr>      <chr>        <chr>                  <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1 ""         xxxxxxxxxx 2022-01-01   ""                        31    NA    NA    NA    NA    NA    NA    NA    NA    NA     NA     NA     NA
2 "xxxxxxxx" xxxxxxxxxx 2022-01-01   ""                        31    NA    NA    NA    NA    NA    NA    NA    NA    NA     NA     NA     NA
3 "xxxxxxxx" xxxxxxxxxx 2022-01-01   ""                        31    NA    NA    NA    NA    NA    NA    NA    NA    NA     NA     NA     NA
4 "xxxxxxxx" xxxxxxxxxx 2022-01-01   ""                        31    NA    NA    NA    NA    NA    NA    NA    NA    NA     NA     NA     NA
5 "xxxxxxxx" xxxxxxxxxx 2022-01-01   ""                        31    NA    NA    NA    NA    NA    NA    NA    NA    NA     NA     NA     NA
6 "xxxxxxxx" xxxxxxxxxx 2022-01-01   ""                        31    NA    NA    NA    NA    NA    NA    NA    NA    NA     NA     NA     NA
# ℹ 19 more variables: Day_13 <dbl>, Day_14 <dbl>, Day_15 <dbl>, Day_16 <dbl>, Day_17 <dbl>, Day_18 <dbl>, Day_19 <dbl>, Day_20 <dbl>,
#   Day_21 <dbl>, Day_22 <dbl>, Day_23 <dbl>, Day_24 <dbl>, Day_25 <dbl>, Day_26 <dbl>, Day_27 <dbl>, Day_28 <dbl>, Day_29 <dbl>,
#   Day_30 <dbl>, Day_31 <dbl>
```

]
.panel[.panel-name[dplyr]

```r
system.time({
  drug_sales %>%
    left_join(drug_sales1 %>%
                select(ORDERDATE_ym, DSM_TERRITORY_CODE, HOSPCODE, BRAND_EN, month_day, paste0("last_day_", 16:1)),
              by = c("HOSPCODE", "ORDERDATE_ym", "DSM_TERRITORY_CODE", "BRAND_EN", "month_day"),
              multiple = "all") %>%
    select(HOSPCODE, ORDERDATE_ym, DSM_TERRITORY_CODE, paste0("Day_", 1:16), paste0("last_day_", 1:16)) %>%
    mutate_at(c(paste0("Day_", 1:16)), ~ if_else(is.na(.), 0, .)) %>%
    mutate_at(c(paste0("last_day_", 1:16)), ~ if_else(is.na(.), 0, .)) %>%
    mutate(
      sum_10day_before = Day_1 + Day_2 + Day_3 + Day_4 + Day_5 + Day_6 + Day_7 + Day_8 + Day_9 + Day_10,
      sum_10day_last = last_day_1 + last_day_2 + last_day_3 + last_day_4 + last_day_5 + last_day_6 + last_day_7 + last_day_8 + last_day_9 + last_day_10
    )
})
```

```
   user  system elapsed
674.151   0.956 673.282
```

]
.panel[.panel-name[dtplyr]

```r
system.time({
  lazy_dt(drug_sales) %>%
    left_join(drug_sales1 %>%
                select(ORDERDATE_ym, DSM_TERRITORY_CODE, HOSPCODE, BRAND_EN, month_day, paste0("last_day_", 16:1)),
              by = c("HOSPCODE", "ORDERDATE_ym", "DSM_TERRITORY_CODE", "BRAND_EN", "month_day"),
              multiple = "all") %>%
    select(HOSPCODE, ORDERDATE_ym, DSM_TERRITORY_CODE, paste0("Day_", 1:16), paste0("last_day_", 1:16)) %>%
    mutate_at(c(paste0("Day_", 1:16)), ~ if_else(is.na(.), 0, .)) %>%
    mutate_at(c(paste0("last_day_", 1:16)), ~ if_else(is.na(.), 0, .)) %>%
    mutate(
      sum_10day_before = Day_1 + Day_2 + Day_3 + Day_4 + Day_5 + Day_6 + Day_7 + Day_8 + Day_9 + Day_10,
      sum_10day_last = last_day_1 + last_day_2 + last_day_3 + last_day_4 + last_day_5 + last_day_6 + last_day_7 + last_day_8 + last_day_9 + last_day_10
    ) %>%
    collect()
})
```

```
  user  system elapsed
12.169   3.027  13.714
```

]
]

---

### symphony database

.panelset[
.panel[.panel-name[redshift database]

```r
library(DBI)
library(dbplyr)
conn <- DBI::dbConnect(odbc::odbc(),
                       Driver = "redshift",
                       Server = "xxx",
                       Port = "xxx",
                       Database = "beigenexxx",
                       UID = "xxx",
                       PWD = "xxx")
dx <- tbl(conn, in_schema("beigenexxx_xxx", "dx"))
dx %>% count()
```

```
# Source:   lazy query [?? x 1]
# Database: Redshift
#   8.0.2[xxx.redshift.amazonaws.com/beigenexxx]
           n
     <int64>
1 4644781332
```

]
.panel[.panel-name[dbplyr]

```r
library(dbplyr)
dx %>%
  filter(grepl("C83\\.1", diagnosis_code) & service_date >= as.Date("2019-01-01")) %>%
  distinct(patient_id, service_date) %>%
  group_by(patient_id) %>%
  mutate(index_date = min(service_date),
         count = sum(ifelse(service_date - index_date <= 180, 1, 0))) %>%
  ungroup() %>%
  filter(count >= 2) %>%
  distinct(patient_id, index_date) %>%
  count()
```

```
# Source:   lazy query [?? x 1]
# Database: Redshift
#   8.0.2[xxx.redshift.amazonaws.com/beigenexxx]
        n
  <int64>
1   23657
```

]
]

---

## Conclusion

- **dtplyr for Big Data in Pharmaceuticals**:
  - Improves data analysis efficiency and readability.
- **Usefulness of dtplyr**:
  - Aids analysts in processing big data without compromising syntax simplicity or performance.
- **Recommendation**:
  - Analysts should choose R packages based on data size and analysis needs.
  - Flexibly use dtplyr, data.table, dplyr, and other tools for optimal results.

---

## References

- [big-data-dtplyr](https://www.business-science.io/code-tools/2019/08/15/big-data-dtplyr.html)
- [https://dtplyr.tidyverse.org/index.html](https://dtplyr.tidyverse.org/index.html)
- [https://rdatatable.gitlab.io/data.table/](https://rdatatable.gitlab.io/data.table/)
- [https://r4ds.hadley.nz/](https://r4ds.hadley.nz/)
- [https://spark.posit.co/](https://spark.posit.co/)

---

class: center, middle

# Thank You

Kaiping Yang

email: kaiping.yang@beigene.com