China Pharma RUG meeting program
China Pharma RUG 2026 Annual Conference – March 27, 2026
Venues: Shanghai (J&J Campus) · Beijing (BeOne Medicines) · Online
| Time | Presentation Title | Speaker | Company | Location |
|---|---|---|---|---|
| 08:30 - 09:00 | Registration & Opening | |||
| 09:00 - 10:15 | Session 1 · Submission & AI | |||
| 09:00 - 09:25 | Producing Submission-Quality Report Directly from R | Yusi Liu | BeOne Medicines | Beijing |
| 09:25 - 09:50 | From Pilot to Practice: Navigating the New Era of R-Based Regulatory Submissions | He Liu & Wenlan Pan | Johnson & Johnson | Shanghai |
| 09:50 - 10:15 | Chef, Waiter, and the Menu: Upgrading Query Chat into a Verifiable Analytics Experience | Ding Cheng | Daiichi Sankyo | Online |
| 10:15 - 10:30 | Coffee Break | |||
| 10:30 - 11:45 | Session 2 · TLG & Reporting | |||
| 10:30 - 10:55 | NEST 2.0 — crane, Roche gtsummary and Our Experiment with AI | Joe Zhu | Roche | Shanghai |
| 10:55 - 11:20 | From CDISC ARS to Production: An End-to-End Implementation of Metadata-Driven TLF Automation in Clinical Trials | Qing Zou | Sanofi | Online |
| 11:20 - 11:45 | Making Clinical Reporting Easier: Enhancing gtsummary/analysis result dataset with reporter | Xiecheng Gu | Boehringer Ingelheim | Shanghai |
| 11:45 - 13:00 | Lunch | |||
| 13:00 - 14:15 | Session 3 · Reporting & AI Applications | |||
| 13:00 - 13:25 | The reporter Package: A Powerful Reporting Package to Replicate SAS Output | Bill Huang | Parexel | Shanghai |
| 13:25 - 13:50 | BIP Copilot: An LLM-Powered R Shiny Application for Interactive Biomarker Data Analysis via Natural Language | Kaiping Yang | BeOne Medicines | Shanghai |
| 13:50 - 14:15 | AI-Driven Meta-Analysis: Transforming Literature Overload into Structured Evidence | Lingjie Shen | 南京白色巨塔临床研究 | Shanghai |
| 14:15 - 15:30 | Session 4 · General & AI in Clinical Research | |||
| 14:15 - 14:40 | Data Science to Empower Clinical Trials | Haitao Xu | 泰格医药 (Tigermed) | Shanghai |
| 14:40 - 15:05 | Becoming Antifragile with a Fast-Evolving R Ecosystem | Zhu Bochen | Johnson & Johnson | Shanghai |
| 15:05 - 15:30 | Leveraging R and AI in Clinical Research: Maximizing Analytical Efficacy While Ensuring Data Safety | Haomin Yang & Qianwang Wang | BeOne Medicines | Shanghai |
| 15:30 - 15:45 | Tea Break | |||
| 15:45 - 17:00 | Session 5 · Automation, SDTM & Platform | |||
| 15:45 - 16:10 | ADRGgenius: Revolutionizing Analysis Data Reviewer’s Guide Creation via Automated Workflows | Weiwei Jiao | MSD | Beijing |
| 16:10 - 16:35 | R-Powered Clinical Data Standardization: Sanofi’s End-to-End SDTM Implementation & Customized Tool Ecosystem | Longfei Li | Sanofi | Beijing |
| 16:35 - 17:00 | AI-Powered Data Automation and Tool Development with Positron on the Posit Platform | Yunqing Hu | Posit | Online |
| 17:00 - 17:15 | Closing Remarks | | | |
Speakers & Abstracts
Producing Submission-Quality Report Directly from R
Speaker: Yusi Liu, BeOne Medicines
Clinical Study Reports (CSRs) rely on large volumes of Tables, Listings, and Figures (TLFs) that must meet strict formatting expectations and integrate smoothly into document assembly workflows. Historically, SAS dominated this space, but R is increasingly used across biostatistics and statistical programming, including for regulated submissions and related pilot initiatives. As R adoption expands from exploratory work to regulated deliverables, producing publication-quality, submission-ready TLFs directly from R has become an important practical topic. This talk introduces approaches to producing formatted submission outputs from R, with emphasis on (1) tables and listings and (2) figures, while keeping assembly reproducible and automatable.
For tables/listings, we outline how to combine computation-oriented frameworks (rtables) with presentation engines (r2rtf and flextable) and introduce three practical formatting controls that repeatedly drive “CSR-like” quality: (a) width management in twips (the native unit used widely in RTF) to control column proportions and page fit; (b) estimating wrapped line counts to stabilize row heights and footnote pagination; and (c) using these controls to split long tables into parts and output them across multiple pages. Twips and related RTF measurement conventions are well defined (1 inch = 1440 twips), enabling deterministic layout calculations.
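The twip arithmetic described above is deterministic and easy to sketch. The following is a minimal illustration (written in Python for brevity; the talk's actual workflow is in R), where the page width, column proportions, and average character width are invented assumptions, not values from the talk:

```python
# Illustrative sketch: twip-based column sizing and wrapped-line estimation.
# Page dimensions, proportions, and char_twips are assumptions for this example.

TWIPS_PER_INCH = 1440  # RTF convention: 1 inch = 1440 twips

def column_widths_twips(proportions, usable_inches):
    """Convert fractional column proportions into integer twip widths."""
    total = usable_inches * TWIPS_PER_INCH
    return [round(p * total) for p in proportions]

def wrapped_line_count(text, col_twips, char_twips=120):
    """Rough estimate of wrapped lines in a cell, assuming a fixed
    average character advance width (char_twips) for the table font."""
    chars_per_line = max(1, col_twips // char_twips)
    return -(-len(text) // chars_per_line)  # ceiling division

# A 9-inch usable width (landscape letter minus margins), four columns.
widths = column_widths_twips([0.3, 0.3, 0.2, 0.2], usable_inches=9)
print(widths)  # -> [3888, 3888, 2592, 2592]
print(wrapped_line_count("Adverse events leading to discontinuation", widths[0]))  # -> 2
```

Because every width is an integer number of twips, the same inputs always yield the same layout, which is what makes pagination and row-height decisions reproducible.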
For figures, we provide output patterns for RTF/DOCX/PDF, as well as SAS7BDAT/CSV formats for QC tables: r2rtf for embedding images into RTF and assembling multi-part documents, officer for Word document manipulation and insertion, and rmarkdown for PDF generation via scripted .Rmd creation and rendering. We conclude with method-selection tables and a pragmatic checklist (templates, validation, version control, and batch production) aimed at CSR production teams adopting R at scale.
NEST 2.0 — crane, Roche gtsummary and Our Experiment with AI
Speaker: Joe Zhu, Roche
The open-source NEST framework has successfully accelerated clinical reporting by providing a robust collection of R packages. Building on this collaborative success, we now introduce NEST 2.0, a significant architectural evolution designed for greater efficiency and scalability. This presentation will unveil the key enhancements of NEST 2.0 and demonstrate how its modern framework serves as a launchpad for the next paradigm shift: artificial intelligence. We will explore the synergy between NEST 2.0 and cutting-edge AI, including using large language models (LLMs) for automated narrative generation and machine learning for deeper exploratory analysis. By integrating intelligence directly into the analysis workflow, we are moving towards a future where insights are generated faster and with greater depth. We invite the community to join us in shaping this next generation of clinical analytics.
From CDISC ARS to Production: An End-to-End Implementation of Metadata-Driven TLF Automation in Clinical Trials
Speaker: Qing Zou, Sanofi
In the evolving landscape of statistical analyses of clinical trial data, efficient and accurate reporting is essential. Generating Tables, Listings and Figures (TLFs) has long been a tedious task, from designing TLF shells in traditional static Word documents to writing algorithms to generate those TLFs. While CDISC has introduced the Analysis Results Standard (ARS) as a logical data model to describe analysis results metadata, a significant gap exists between this conceptual framework and its practical operationalization in production environments. This presentation addresses that gap by demonstrating our end-to-end implementation journey — from adapting the CDISC ARS model to delivering a fully functional metadata-driven TLF generation system.
Our implementation is built on three pillars. First, we extended the CDISC ARS logical model to create a TLF Metadata schema that accommodates real-world reporting requirements, including shell design specifications, display formatting rules, and statistical method definitions. Second, we developed an interactive web-based user interface that enables statisticians to design TLF shells, populate and validate analysis metadata, and manage version-controlled metadata assets across studies — all without requiring deep knowledge of the underlying ARS structure. Third, we introduced an Analysis Output Metadata (AOM) framework and built a robust code generation engine. This engine automatically parses the integrated metadata into executable programming logic, producing standardized code for Analysis Results Dataset (ARD) creation while ensuring full traceability from metadata design to final output.
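The third pillar, a code generation engine that parses metadata into executable logic, can be sketched at a toy scale. The schema, template, and method name below are invented for illustration (Sanofi's actual ARS extension is not public), and Python stands in for the production language:

```python
# Toy illustration of metadata-driven code generation: a single analysis
# metadata record (hypothetical schema, not the actual ARS extension) is
# rendered into one line of R-style code for ARD creation.

analysis_meta = {
    "analysis_id": "AN01_AE_SUMM",       # invented identifier
    "dataset": "adae",
    "group_by": ["TRT01A"],
    "variable": "AEDECOD",
    "method": "count_distinct_subjects",
}

# One template per statistical method; the engine picks by method name.
TEMPLATES = {
    "count_distinct_subjects": (
        "{dataset} |> group_by({groups}) |> "
        "summarise(n = n_distinct(USUBJID)) # {analysis_id}"
    ),
}

def generate_code(meta):
    """Render one metadata record into an executable analysis statement,
    keeping the analysis_id in a comment for traceability."""
    template = TEMPLATES[meta["method"]]
    return template.format(
        dataset=meta["dataset"],
        groups=", ".join(meta["group_by"]),
        analysis_id=meta["analysis_id"],
    )

print(generate_code(analysis_meta))
```

Embedding the analysis identifier in the generated code is one simple way to preserve the metadata-to-output traceability the abstract emphasizes.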
This approach has demonstrated measurable improvements in programming efficiency, cross-study consistency, and reproducibility of analysis results. We will also discuss the practical challenges encountered during implementation — including balancing ARS model complexity with user accessibility, ensuring backward compatibility with legacy TLFs, and designing a flexible yet robust code generation architecture — along with the solutions we adopted.
Keywords: CDISC ARS, TLF Metadata, Analysis Results Dataset (ARD), Analysis Output Metadata (AOM), Code Generation, Automation, Metadata-Driven
Making Clinical Reporting Easier: Enhancing gtsummary/analysis result dataset with reporter
Speaker: Xiecheng Gu, Boehringer Ingelheim
The aim is to produce flexible, submission-ready TLFs (tables, listings, figures) from a single, reproducible pipeline. Tables are built from ARD (Analysis Results Data) and gtsummary, and then rendered to RTF or TXT using reporter or r2rtf, with control over layout, pagination, and footnotes. One analysis definition can thus drive multiple output formats and styles. ARD may be saved to RDS for audit and template reuse. The result is TLFs that satisfy submission requirements and remain adaptable to study- and regulator-specific needs.
The reporter Package: A Powerful Reporting Package to Replicate SAS Output
Speaker: Bill Huang, Parexel
For clinical trial research utilizing the R programming language, a primary objective is to faithfully replicate SAS output. The reporter package represents the most comprehensive and user-friendly reporting solution currently available, with capabilities close to those of PROC REPORT/ODS in SAS. It can not only produce reports in TXT, PDF, RTF, DOCX, and HTML formats, but also process large outputs efficiently. Amgen and other leading pharmaceutical companies have adopted reporter as their primary reporting package. As one of the authors of reporter, I will provide a progressive introduction to the package’s fundamental usage, ranging from basic to advanced applications. I will then demonstrate how reporter can generate output nearly identical to SAS using concise code, illustrated through several standard clinical trial tables and figures. Finally, I will present recently implemented features and outline our roadmap for future enhancements.
BIP Copilot: An LLM-Powered R Shiny Application for Interactive Biomarker Data Analysis via Natural Language
Speaker: Kaiping Yang, BeOne Medicines
Background: Biomarker-driven analysis is central to oncology drug development, yet routine tasks — querying results across multiple clinical studies, generating Kaplan-Meier plots, forest plots, and baseline characteristic tables — require significant programming effort and domain expertise, creating bottlenecks for cross-functional collaboration.
Methods: We developed BIP Copilot, an R Shiny application that integrates Large Language Models (LLMs) with clinical biomarker databases to enable natural language-driven data analysis. The application leverages the ellmer R package to connect to GPT-4.1 via Databricks, implementing an LLM Agent architecture with function calling (tool use). Six specialized tools are registered with the LLM: SQL query execution, query optimization via Retrieval-Augmented Generation (RAG), Kaplan-Meier survival plots (survminer, patchwork), forest plots (forestploter), boxplots (ggplot2), and baseline characteristic tables (rtables). Data are stored in DuckDB for efficient querying, with patient-level snapshot data in .qs format. A key design feature supports multi-study comparison, allowing users to generate cross-study KM plots in a single request. A secondary AI analysis module enables automated statistical interpretation of generated visualizations.
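The tool-registration pattern described in the Methods section can be sketched generically. The snippet below (Python for illustration; the application itself is R/ellmer) shows the core idea: each capability is a named function in a registry, and a dispatcher routes the model's tool call to it. The tool names, argument shapes, and the fabricated model reply are all assumptions, not BIP Copilot's actual interface:

```python
# Minimal sketch of an LLM agent tool registry and dispatcher.
# Tool names and the fake model reply are illustrative assumptions.

TOOLS = {}

def register_tool(name):
    """Decorator that files a function under a tool name the LLM can call."""
    def wrap(fn):
        TOOLS[name] = fn
        return fn
    return wrap

@register_tool("run_sql")
def run_sql(query: str) -> str:
    # The real app would execute this against DuckDB; here we just echo.
    return f"executed: {query}"

@register_tool("km_plot")
def km_plot(studies: list) -> str:
    # Stands in for a cross-study Kaplan-Meier plotting function.
    return f"KM plot for studies {', '.join(studies)}"

def dispatch(tool_call):
    """Route a model-emitted tool call to the registered function."""
    fn = TOOLS[tool_call["name"]]
    return fn(**tool_call["arguments"])

# A fabricated model reply requesting a multi-study KM plot.
reply = {"name": "km_plot", "arguments": {"studies": ["STUDY-001", "STUDY-002"]}}
print(dispatch(reply))  # -> KM plot for studies STUDY-001, STUDY-002
```

This separation is what makes the abstract's claim plausible: adding a new analytical capability means registering one more function, with no change to the dispatch loop or existing tools.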
Results: BIP Copilot enables non-programmers to perform complex biomarker queries — such as “What was the most statistically significant prognostic marker in BGB-XXXX-XXX?” or “In which studies is TMUT_XXX statistically significant?” — and receive both tabular results and publication-ready visualizations within seconds. The tool-layer architecture allows rapid extension of analytical capabilities without modifying core plotting functions.
Conclusions: By combining LLM reasoning with validated R statistical packages, BIP Copilot demonstrates a practical framework for AI-augmented clinical biomarker analysis that preserves statistical rigor while dramatically lowering the barrier to data access in drug development teams.
AI-Driven Meta-Analysis: Transforming Literature Overload into Structured Evidence
Speaker: Lingjie Shen, 南京白色巨塔临床研究有限公司
In the era of evidence-based medicine and real-world data expansion, meta-analysis and evidence synthesis have become essential components of drug development and regulatory decision-making. However, the traditional meta-analysis workflow remains highly manual—requiring extensive literature screening, full-text review, data extraction, and structured data curation. This process is time-consuming, resource-intensive, and difficult to scale, posing a major bottleneck for statistical and medical teams.
This presentation addresses these practical challenges by introducing an AI-driven framework powered by large language models (LLMs) to accelerate and partially automate the meta-analysis pipeline. We will demonstrate how LLMs can:
- Process and interpret large volumes of publications
- Extract predefined study characteristics and endpoint data
- Generate structured datasets ready for statistical modeling
- Integrate seamlessly with conventional meta-analytic methods
Through real-world examples, we illustrate how AI can compress weeks of manual effort into hours, improve data consistency, and enhance reproducibility. We will also discuss implementation considerations, governance, and potential impact within pharmaceutical statistical teams.
Data Science to Empower Clinical Trials
Speaker: Haitao Xu, Tigermed
In modern clinical trials, the ability to make timely, data-driven decisions is critical for operational success, participant safety, and resource optimization. However, many teams remain constrained by static reporting cycles and manual data processes. This presentation introduces a transformative approach: leveraging R and its ecosystem to construct an end-to-end, automated data pipeline that delivers daily-updated analytics and interactive visualizations, directly empowering stakeholders across the clinical research spectrum.
Becoming Antifragile with a Fast-Evolving R Ecosystem
Speaker: Zhu Bochen, Johnson & Johnson
R offers remarkable flexibility and innovation, yet its ecosystem remains both rapidly evolving and inherently fragile. To become antifragile—gaining strength through uncertainty—R users must balance exploration with disciplined practice. This presentation highlights three essential principles: defensive coding to manage ambiguity and reduce failure risks; defensive estimation to realistically assess new packages and technologies; and defensive promising to align cross‑functional expectations, recognizing that R (especially Shiny) is capable and fancy, but not magic. Together, these practices enable teams to innovate responsibly and build robust, trustworthy analytical workflows in an ever‑changing R landscape.
Leveraging R and AI in Clinical Research: Maximizing Analytical Efficacy While Ensuring Data Safety
Speaker: Haomin Yang & Qianwang Wang, BeOne Medicines
Integrating Large Language Models (LLMs) with R programming transforms clinical data analysis by enabling natural language queries, automated code generation, and interactive visualizations. However, transmitting sensitive clinical trial data to external AI services introduces significant data governance risks, creating tension between analytical innovation and patient privacy. This presentation addresses these challenges by demonstrating an R Shiny-based AI agent that allows users to analyze clinical datasets via conversational prompts, generating and executing R code in real time to produce tables, statistics, and visualizations.
To balance this analytical efficacy with rigorous data safety, we detail a novel “dual-track architecture” that mitigates data leakage. In this framework, the AI receives only structural metadata (variable names, types, dimensions), while patient-level data remains securely confined within the local R session. Most importantly, a sandboxed execution environment ensures all generated code runs in strict isolation to avoid unauthorized system access, accidental source data modification, or the execution of unsafe commands. We also introduce a hybrid execution mode using Databricks and the Claude Code SDK. This allows users to send anonymized data samples to cloud-based agents to obtain more accurate R code generation and enhanced analytical support, subsequently executing the returned code locally. Through live demonstrations, technical architecture discussions, and practical guidance, this session will equip attendees with the strategies needed to implement secure, privacy-preserving AI solutions in highly regulated pharmaceutical environments.
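The "metadata only" half of the dual-track design can be sketched concretely: the payload sent to the external model describes structure (variable names, types, dimensions) while the values never leave the local session. The snippet below is an illustrative Python sketch with an invented toy dataset, not the presenters' R implementation:

```python
# Sketch of metadata-only extraction for a dual-track architecture.
# The dataset and column names are invented for illustration.

def structural_metadata(rows):
    """Summarise a list-of-dicts dataset without exposing any cell values:
    only row count, column names, and inferred column types are returned."""
    if not rows:
        return {"n_rows": 0, "columns": {}}
    columns = {name: type(value).__name__ for name, value in rows[0].items()}
    return {"n_rows": len(rows), "columns": columns}

patient_data = [
    {"USUBJID": "001-001", "AGE": 54, "AVAL": 1.73},
    {"USUBJID": "001-002", "AGE": 61, "AVAL": 1.58},
]

payload = structural_metadata(patient_data)
print(payload)  # -> {'n_rows': 2, 'columns': {'USUBJID': 'str', 'AGE': 'int', 'AVAL': 'float'}}

# No patient-level value appears in the payload, so only this summary
# would be sent to the external model for code-generation prompts.
assert "001-001" not in str(payload)
```

The generated code then runs locally against the full data, which is the point of the architecture: the model reasons over shape, never over content.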
Keywords: R Shiny, LLM, Clinical Data, Data Privacy, AI Agent, Dual-Track Architecture, Regulatory Compliance
ADRGgenius: Revolutionizing Analysis Data Reviewer’s Guide Creation via Automated Workflows
Speaker: Weiwei Jiao, MSD
The Analysis Data Reviewer’s Guide (ADRG) is a critical document within e-submission packages required by regulatory agencies such as the FDA and EMA. Traditionally, compiling an ADRG involves manually extracting and integrating information from multiple study files, a process that is both time-consuming and prone to errors. To address these challenges, ADRGgenius, an R Shiny application, has been developed to streamline the updating of ADRG content based on a standardized template and related study documents. This innovative tool leverages the reticulate package to seamlessly integrate Python functionality within the R environment, enabling the creation of a user-friendly interface via R Shiny. By automating generation of key ADRG sections according to the ADRG template, ADRGgenius significantly improves efficiency and accuracy, reduces manual effort, and ensures consistency across documents.
R-Powered Clinical Data Standardization: Sanofi’s End-to-End SDTM Implementation & Customized Tool Ecosystem
Speaker: Longfei Li, Sanofi
In clinical research, CDISC SDTM (Study Data Tabulation Model) serves as the cornerstone of data standardization, where the compliance, accuracy, and traceability of its generation process directly impact clinical trial timelines and quality. Sanofi has built an end-to-end solution for the entire SDTM lifecycle based on R, integrating toolchains such as RStudio, Kubeflow, and Docker. Specifically, Docker-based containers enable precise governance of R versions and packages—cross-functional teams (upstream and downstream) adopt consistent container environments for dataset extraction and interaction, ensuring traceability throughout the process. This presentation will focus on the core practices of the solution: standardized metadata management via the custom R package loadmeta, standardized derivation of key variables using sdtmeris, efficient dataset comparison and validation with compare2DF, and visual troubleshooting of log issues through logviewer.
AI-Powered Data Automation and Tool Development with Positron on the Posit Platform
Speaker: Yunqing Hu, 源资信息科技(上海)有限公司
This presentation centers on AI-driven data automation and custom tool development leveraging Positron—Posit’s next-generation AI-integrated development environment (IDE). It highlights streamlined workflows for direct ingestion of EDC metadata into the R environment to expedite data automation, alongside end-to-end data flow orchestration for key clinical deliverables, including SDTM mapping, ADaM dataset generation, and TLF production.
We will further demonstrate how Positron harnesses real-time R session context and seamless large language model (LLM) integration to deliver targeted, context-aware AI support. The ellmer package serves as a critical bridge, offering a robust R interface to connect with diverse LLM providers such as ChatGPT and Gemini. By combining Positron’s advanced AI functionalities with the enterprise-grade governance frameworks of the Posit Platform, this solution adheres to stringent reproducibility and regulatory standards, encompassing GxP, 21 CFR Part 11, and CDISC compliance requirements.