China Pharma RUG meeting program
China Pharma RUG 2026 Annual Conference – March 27, 2026
Venues: Shanghai (J&J Campus) · Beijing (BeOne Medicines) · Online
| Time | Presentation Title | Speaker | Company | Location |
|---|---|---|---|---|
| 08:30 - 09:00 | Registration & Opening | |||
| 09:00 - 10:15 | Session 1 · Submission & AI | |||
| 09:00 - 09:25 | Producing Submission-Quality Report Directly from R | Yusi Liu | BeOne Medicines | Beijing |
| 09:25 - 09:50 | From Pilot to Practice: Navigating the New Era of R-Based Regulatory Submissions | He Liu & Wenlan Pan | Johnson & Johnson | Shanghai |
| 09:50 - 10:15 | Chef, Waiter, and the Menu: Upgrading Query Chat into a Verifiable Analytics Experience | Ding Cheng | Daiichi Sankyo | Online |
| 10:15 - 10:30 | Coffee Break | |||
| 10:30 - 11:45 | Session 2 · TLG & Reporting | |||
| 10:30 - 10:55 | NEST 2.0 — crane, Roche gtsummary and Our Experiment with AI | Joe Zhu | Roche | Shanghai |
| 10:55 - 11:20 | From CDISC ARS to Production: An End-to-End Implementation of Metadata-Driven TLF Automation in Clinical Trials | Qing Zou | Sanofi | Online |
| 11:20 - 11:45 | Making Clinical Reporting Easier: Enhancing gtsummary/analysis result dataset with reporter | Xiecheng Gu | Boehringer Ingelheim | Shanghai |
| 11:45 - 13:00 | Lunch | |||
| 13:00 - 14:15 | Session 3 · Reporting & AI Applications | |||
| 13:00 - 13:25 | The reporter Package: A Powerful Reporting Package to Replicate SAS Output | Bill Huang | Parexel | Shanghai |
| 13:25 - 13:50 | BIP Copilot: An LLM-Powered R Shiny Application for Interactive Biomarker Data Analysis via Natural Language | Kaiping Yang | BeOne Medicines | Shanghai |
| 13:50 - 14:15 | AI-Driven Meta-Analysis: Transforming Literature Overload into Structured Evidence | Lingjie Shen | 南京白色巨塔临床研究 | Shanghai |
| 14:15 - 15:30 | Session 4 · General & AI in Clinical Research | |||
| 14:15 - 14:40 | Data Science to Empower Clinical Trials | Haitao Xu | 泰格医药 (Tigermed) | Shanghai |
| 14:40 - 15:05 | Becoming Antifragile with a Fast-Evolving R Ecosystem | Zhu Bochen | Johnson & Johnson | Shanghai |
| 15:05 - 15:30 | Leveraging R and AI in Clinical Research: Maximizing Analytical Efficacy While Ensuring Data Safety | Haomin Yang & Qianwang Wang | BeOne Medicines | Shanghai |
| 15:30 - 15:45 | Tea Break | |||
| 15:45 - 17:00 | Session 5 · Automation, SDTM & Platform | |||
| 15:45 - 16:10 | ADRGgenius: Revolutionizing Analysis Data Reviewer’s Guide Creation via Automated Workflows | Weiwei Jiao | MSD | Beijing |
| 16:10 - 16:35 | R-Powered Clinical Data Standardization: Sanofi’s End-to-End SDTM Implementation & Customized Tool Ecosystem | Longfei Li | Sanofi | Beijing |
| 16:35 - 17:00 | AI-Powered Data Automation and Tool Development with Positron on the Posit Platform | Yunqing Hu | Posit | Online |
| 17:00 - 17:15 | Closing Remarks | | | |
Speakers & Abstracts
Producing Submission-Quality Report Directly from R
Speaker: Yusi Liu, BeOne Medicines
Clinical Study Reports (CSRs) rely on large volumes of Tables, Listings, and Figures (TLFs) that must meet strict formatting expectations and integrate smoothly into document assembly workflows. Historically, SAS dominated this space, but R is increasingly used across biostatistics and statistical programming, including for regulated submissions and related pilot initiatives. As R adoption expands from exploratory work to regulated deliverables, producing publication-quality, submission-ready TLFs directly from R has become an important practical topic. This talk introduces approaches to producing formatted submission outputs from R, with emphasis on (1) tables and listings and (2) figures, while keeping assembly reproducible and automatable.
For tables/listings, we outline how to combine computation-oriented frameworks (rtables) with presentation engines (r2rtf and flextable) and introduce three practical formatting controls that repeatedly drive “CSR-like” quality: (a) width management in twips (the native unit used widely in RTF) to control column proportions and page fit; (b) estimating wrapped line counts to stabilize row heights and footnote pagination; and (c) using these controls to split long tables into parts and output them across multiple pages. Twips and related RTF measurement conventions are well defined (1 inch = 1440 twips), enabling deterministic layout calculations.
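The twip arithmetic described above is deterministic and easy to sketch. The following is a minimal illustration (written in Python for brevity; the talk's actual workflow is in R), where the page width, column proportions, and average character width are invented assumptions, not values from the talk:

```python
# Illustrative sketch: twip-based column sizing and wrapped-line estimation.
# Page dimensions, proportions, and char_twips are assumptions for this example.

TWIPS_PER_INCH = 1440  # RTF convention: 1 inch = 1440 twips

def column_widths_twips(proportions, usable_inches):
    """Convert fractional column proportions into integer twip widths."""
    total = usable_inches * TWIPS_PER_INCH
    return [round(p * total) for p in proportions]

def wrapped_line_count(text, col_twips, char_twips=120):
    """Rough estimate of wrapped lines in a cell, assuming a fixed
    average character advance width (char_twips) for the table font."""
    chars_per_line = max(1, col_twips // char_twips)
    return -(-len(text) // chars_per_line)  # ceiling division

# A 9-inch usable width (landscape letter minus margins), four columns.
widths = column_widths_twips([0.3, 0.3, 0.2, 0.2], usable_inches=9)
print(widths)  # -> [3888, 3888, 2592, 2592]
print(wrapped_line_count("Adverse events leading to discontinuation", widths[0]))  # -> 2
```

Because every width is an integer number of twips, the same inputs always yield the same layout, which is what makes pagination and row-height decisions reproducible.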
For figures, we provide output patterns for RTF/DOCX/PDF, as well as SAS7BDAT/CSV formats for QC tables: r2rtf for embedding images into RTF and assembling multi-part documents, officer for Word document manipulation and insertion, and rmarkdown for PDF generation via scripted .Rmd creation and rendering. We conclude with method-selection tables and a pragmatic checklist (templates, validation, version control, and batch production) aimed at CSR production teams adopting R at scale.
NEST 2.0 — crane, Roche gtsummary and Our Experiment with AI
Speaker: Joe Zhu, Roche
The open-source NEST framework has successfully accelerated clinical reporting by providing a robust collection of R packages. Building on this collaborative success, we now introduce NEST 2.0, a significant architectural evolution designed for greater efficiency and scalability. This presentation will unveil the key enhancements of NEST 2.0 and demonstrate how its modern framework serves as a launchpad for the next paradigm shift: artificial intelligence. We will explore the synergy between NEST 2.0 and cutting-edge AI, including using large language models (LLMs) for automated narrative generation and machine learning for deeper exploratory analysis. By integrating intelligence directly into the analysis workflow, we are moving towards a future where insights are generated faster and with greater depth. We invite the community to join us in shaping this next generation of clinical analytics.
From CDISC ARS to Production: An End-to-End Implementation of Metadata-Driven TLF Automation in Clinical Trials
Speaker: Qing Zou, Sanofi
In the evolving landscape of statistical analyses of clinical trial data, efficient and accurate reporting is essential. Generating Tables, Listings and Figures (TLFs) has long been a tedious task, from designing TLF shells in traditional static Word documents to writing algorithms to generate those TLFs. While CDISC has introduced the Analysis Results Standard (ARS) as a logical data model to describe analysis results metadata, a significant gap exists between this conceptual framework and its practical operationalization in production environments. This presentation addresses that gap by demonstrating our end-to-end implementation journey — from adapting the CDISC ARS model to delivering a fully functional metadata-driven TLF generation system.
Our implementation is built on three pillars. First, we extended the CDISC ARS logical model to create a TLF Metadata schema that accommodates real-world reporting requirements, including shell design specifications, display formatting rules, and statistical method definitions. Second, we developed an interactive web-based user interface that enables statisticians to design TLF shells, populate and validate analysis metadata, and manage version-controlled metadata assets across studies — all without requiring deep knowledge of the underlying ARS structure. Third, we introduced an Analysis Output Metadata (AOM) framework and built a robust code generation engine. This engine automatically parses the integrated metadata into executable programming logic, producing standardized code for Analysis Results Dataset (ARD) creation while ensuring full traceability from metadata design to final output.
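The third pillar, a code generation engine that parses metadata into executable logic, can be sketched at a toy scale. The schema, template, and method name below are invented for illustration (Sanofi's actual ARS extension is not public), and Python stands in for the production language:

```python
# Toy illustration of metadata-driven code generation: a single analysis
# metadata record (hypothetical schema, not the actual ARS extension) is
# rendered into one line of R-style code for ARD creation.

analysis_meta = {
    "analysis_id": "AN01_AE_SUMM",       # invented identifier
    "dataset": "adae",
    "group_by": ["TRT01A"],
    "variable": "AEDECOD",
    "method": "count_distinct_subjects",
}

# One template per statistical method; the engine picks by method name.
TEMPLATES = {
    "count_distinct_subjects": (
        "{dataset} |> group_by({groups}) |> "
        "summarise(n = n_distinct(USUBJID)) # {analysis_id}"
    ),
}

def generate_code(meta):
    """Render one metadata record into an executable analysis statement,
    keeping the analysis_id in a comment for traceability."""
    template = TEMPLATES[meta["method"]]
    return template.format(
        dataset=meta["dataset"],
        groups=", ".join(meta["group_by"]),
        analysis_id=meta["analysis_id"],
    )

print(generate_code(analysis_meta))
```

Embedding the analysis identifier in the generated code is one simple way to preserve the metadata-to-output traceability the abstract emphasizes.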
This approach has demonstrated measurable improvements in programming efficiency, cross-study consistency, and reproducibility of analysis results. We will also discuss the practical challenges encountered during implementation — including balancing ARS model complexity with user accessibility, ensuring backward compatibility with legacy TLFs, and designing a flexible yet robust code generation architecture — along with the solutions we adopted.
Keywords: CDISC ARS, TLF Metadata, Analysis Results Dataset (ARD), Analysis Output Metadata (AOM), Code Generation, Automation, Metadata-Driven
Making Clinical Reporting Easier: Enhancing gtsummary/analysis result dataset with reporter
Speaker: Xiecheng Gu, Boehringer Ingelheim
The aim is to produce flexible, submission-ready TLFs (tables, listings, figures) from a single, reproducible pipeline. Tables are built from ARD (Analysis Results Data) and gtsummary, and then rendered to RTF or TXT using reporter or r2rtf, with control over layout, pagination, and footnotes. One analysis definition can thus drive multiple output formats and styles. ARD may be saved to RDS for audit and template reuse. The result is TLFs that satisfy submission requirements and remain adaptable to study- and regulator-specific needs.
The reporter Package: A Powerful Reporting Package to Replicate SAS Output
Speaker: Bill Huang, Parexel
For clinical trial research utilizing the R programming language, a primary objective is to faithfully replicate SAS output. The reporter package represents the most comprehensive and user-friendly reporting solution currently available, with capabilities close to those of PROC REPORT/ODS in SAS. It can not only produce reports in TXT, PDF, RTF, DOCX, and HTML formats, but also process large outputs efficiently. Amgen and other leading pharmaceutical companies have adopted reporter as their primary reporting package. As one of the authors of reporter, I will provide a progressive introduction to the package’s fundamental usage, ranging from basic to advanced applications. I will then demonstrate how reporter can generate output nearly identical to SAS using concise code, illustrated through several standard clinical trial tables and figures. Finally, I will present recently implemented features and outline our roadmap for future enhancements.
BIP Copilot: An LLM-Powered R Shiny Application for Interactive Biomarker Data Analysis via Natural Language
Speaker: Kaiping Yang, BeOne Medicines
Background: Biomarker-driven analysis is central to oncology drug development, yet routine tasks — querying results across multiple clinical studies, generating Kaplan-Meier plots, forest plots, and baseline characteristic tables — require significant programming effort and domain expertise, creating bottlenecks for cross-functional collaboration.
Methods: We developed BIP Copilot, an R Shiny application that integrates Large Language Models (LLMs) with clinical biomarker databases to enable natural language-driven data analysis. The application leverages the ellmer R package to connect to GPT-4.1 via Databricks, implementing an LLM Agent architecture with function calling (tool use). Six specialized tools are registered with the LLM: SQL query execution, query optimization via Retrieval-Augmented Generation (RAG), Kaplan-Meier survival plots (survminer, patchwork), forest plots (forestploter), boxplots (ggplot2), and baseline characteristic tables (rtables). Data are stored in DuckDB for efficient querying, with patient-level snapshot data in .qs format. A key design feature supports multi-study comparison, allowing users to generate cross-study KM plots in a single request. A secondary AI analysis module enables automated statistical interpretation of generated visualizations.
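The tool-registration pattern described in the Methods section can be sketched generically. The snippet below (Python for illustration; the application itself is R/ellmer) shows the core idea: each capability is a named function in a registry, and a dispatcher routes the model's tool call to it. The tool names, argument shapes, and the fabricated model reply are all assumptions, not BIP Copilot's actual interface:

```python
# Minimal sketch of an LLM agent tool registry and dispatcher.
# Tool names and the fake model reply are illustrative assumptions.

TOOLS = {}

def register_tool(name):
    """Decorator that files a function under a tool name the LLM can call."""
    def wrap(fn):
        TOOLS[name] = fn
        return fn
    return wrap

@register_tool("run_sql")
def run_sql(query: str) -> str:
    # The real app would execute this against DuckDB; here we just echo.
    return f"executed: {query}"

@register_tool("km_plot")
def km_plot(studies: list) -> str:
    # Stands in for a cross-study Kaplan-Meier plotting function.
    return f"KM plot for studies {', '.join(studies)}"

def dispatch(tool_call):
    """Route a model-emitted tool call to the registered function."""
    fn = TOOLS[tool_call["name"]]
    return fn(**tool_call["arguments"])

# A fabricated model reply requesting a multi-study KM plot.
reply = {"name": "km_plot", "arguments": {"studies": ["STUDY-001", "STUDY-002"]}}
print(dispatch(reply))  # -> KM plot for studies STUDY-001, STUDY-002
```

This separation is what makes the abstract's claim plausible: adding a new analytical capability means registering one more function, with no change to the dispatch loop or existing tools.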
Results: BIP Copilot enables non-programmers to perform complex biomarker queries — such as “What was the most statistically significant prognostic marker in BGB-XXXX-XXX?” or “In which studies is TMUT_XXX statistically significant?” — and receive both tabular results and publication-ready visualizations within seconds. The tool-layer architecture allows rapid extension of analytical capabilities without modifying core plotting functions.
Conclusions: By combining LLM reasoning with validated R statistical packages, BIP Copilot demonstrates a practical framework for AI-augmented clinical biomarker analysis that preserves statistical rigor while dramatically lowering the barrier to data access in drug development teams.
AI-Driven Meta-Analysis: Transforming Literature Overload into Structured Evidence
Speaker: Lingjie Shen, 南京白色巨塔临床研究有限公司
In the era of evidence-based medicine and real-world data expansion, meta-analysis and evidence synthesis have become essential components of drug development and regulatory decision-making. However, the traditional meta-analysis workflow remains highly manual—requiring extensive literature screening, full-text review, data extraction, and structured data curation. This process is time-consuming, resource-intensive, and difficult to scale, posing a major bottleneck for statistical and medical teams.
This presentation addresses these practical challenges by introducing an AI-driven framework powered by large language models (LLMs) to accelerate and partially automate the meta-analysis pipeline. We will demonstrate how LLMs can:
- Process and interpret large volumes of publications
- Extract predefined study characteristics and endpoint data
- Generate structured datasets ready for statistical modeling
- Integrate seamlessly with conventional meta-analytic methods
Through real-world examples, we illustrate how AI can compress weeks of manual effort into hours, improve data consistency, and enhance reproducibility. We will also discuss implementation considerations, governance, and potential impact within pharmaceutical statistical teams.
Data Science to Empower Clinical Trials
Speaker: Haitao Xu, Tigermed
In modern clinical trials, the ability to make timely, data-driven decisions is critical for operational success, participant safety, and resource optimization. However, many teams remain constrained by static reporting cycles and manual data processes. This presentation introduces a transformative approach: leveraging R and its ecosystem to construct an end-to-end, automated data pipeline that delivers daily-updated analytics and interactive visualizations, directly empowering stakeholders across the clinical research spectrum.
Becoming Antifragile with a Fast-Evolving R Ecosystem
Speaker: Zhu Bochen, Johnson & Johnson
R offers remarkable flexibility and innovation, yet its ecosystem remains both rapidly evolving and inherently fragile. To become antifragile—gaining strength through uncertainty—R users must balance exploration with disciplined practice. This presentation highlights three essential principles: defensive coding to manage ambiguity and reduce failure risks; defensive estimation to realistically assess new packages and technologies; and defensive promising to align cross‑functional expectations, recognizing that R (especially Shiny) is capable and fancy, but not magic. Together, these practices enable teams to innovate responsibly and build robust, trustworthy analytical workflows in an ever‑changing R landscape.
Leveraging R and AI in Clinical Research: Maximizing Analytical Efficacy While Ensuring Data Safety
Speaker: Haomin Yang & Qianwang Wang, BeOne Medicines
Integrating Large Language Models (LLMs) with R programming transforms clinical data analysis by enabling natural language queries, automated code generation, and interactive visualizations. However, transmitting sensitive clinical trial data to external AI services introduces significant data governance risks, creating tension between analytical innovation and patient privacy. This presentation addresses these challenges by demonstrating an R Shiny-based AI agent that allows users to analyze clinical datasets via conversational prompts, generating and executing R code in real time to produce tables, statistics, and visualizations.
To balance this analytical efficacy with rigorous data safety, we detail a novel “dual-track architecture” that mitigates data leakage. In this framework, the AI receives only structural metadata (variable names, types, dimensions), while patient-level data remains securely confined within the local R session. Most importantly, a sandboxed execution environment ensures all generated code runs in strict isolation to avoid unauthorized system access, accidental source data modification, or the execution of unsafe commands. We also introduce a hybrid execution mode using Databricks and the Claude Code SDK. This allows users to send anonymized data samples to cloud-based agents to obtain more accurate R code generation and enhanced analytical support, subsequently executing the returned code locally. Through live demonstrations, technical architecture discussions, and practical guidance, this session will equip attendees with the strategies needed to implement secure, privacy-preserving AI solutions in highly regulated pharmaceutical environments.
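The "metadata only" half of the dual-track design can be sketched concretely: the payload sent to the external model describes structure (variable names, types, dimensions) while the values never leave the local session. The snippet below is an illustrative Python sketch with an invented toy dataset, not the presenters' R implementation:

```python
# Sketch of metadata-only extraction for a dual-track architecture.
# The dataset and column names are invented for illustration.

def structural_metadata(rows):
    """Summarise a list-of-dicts dataset without exposing any cell values:
    only row count, column names, and inferred column types are returned."""
    if not rows:
        return {"n_rows": 0, "columns": {}}
    columns = {name: type(value).__name__ for name, value in rows[0].items()}
    return {"n_rows": len(rows), "columns": columns}

patient_data = [
    {"USUBJID": "001-001", "AGE": 54, "AVAL": 1.73},
    {"USUBJID": "001-002", "AGE": 61, "AVAL": 1.58},
]

payload = structural_metadata(patient_data)
print(payload)  # -> {'n_rows': 2, 'columns': {'USUBJID': 'str', 'AGE': 'int', 'AVAL': 'float'}}

# No patient-level value appears in the payload, so only this summary
# would be sent to the external model for code-generation prompts.
assert "001-001" not in str(payload)
```

The generated code then runs locally against the full data, which is the point of the architecture: the model reasons over shape, never over content.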
Keywords: R Shiny, LLM, Clinical Data, Data Privacy, AI Agent, Dual-Track Architecture, Regulatory Compliance
ADRGgenius: Revolutionizing Analysis Data Reviewer’s Guide Creation via Automated Workflows
Speaker: Weiwei Jiao, MSD
The Analysis Data Reviewer’s Guide (ADRG) is a critical document within e-submission packages required by regulatory agencies such as the FDA and EMA. Traditionally, compiling an ADRG involves manually extracting and integrating information from multiple study files, a process that is both time-consuming and prone to errors. To address these challenges, ADRGgenius, an R Shiny application, has been developed to streamline the updating of ADRG content based on a standardized template and related study documents. This innovative tool leverages the reticulate package to seamlessly integrate Python functionality within the R environment, enabling the creation of a user-friendly interface via R Shiny. By automating generation of key ADRG sections according to the ADRG template, ADRGgenius significantly improves efficiency and accuracy, reduces manual effort, and ensures consistency across documents.
R-Powered Clinical Data Standardization: Sanofi’s End-to-End SDTM Implementation & Customized Tool Ecosystem
Speaker: Longfei Li, Sanofi
In clinical research, CDISC SDTM (Study Data Tabulation Model) serves as the cornerstone of data standardization, where the compliance, accuracy, and traceability of its generation process directly impact clinical trial timelines and quality. Sanofi has built an end-to-end solution for the entire SDTM lifecycle based on R, integrating toolchains such as RStudio, Kubeflow, and Docker. Specifically, Docker-based containers enable precise governance of R versions and packages—cross-functional teams (upstream and downstream) adopt consistent container environments for dataset extraction and interaction, ensuring traceability throughout the process. This presentation will focus on the core practices of the solution: standardized metadata management via the custom R package loadmeta, standardized derivation of key variables using sdtmeris, efficient dataset comparison and validation with compare2DF, and visual troubleshooting of log issues through logviewer.
AI-Powered Data Automation and Tool Development with Positron on the Posit Platform
Speaker: Yunqing Hu, 源资信息科技(上海)有限公司
This presentation centers on AI-driven data automation and custom tool development leveraging Positron—Posit’s next-generation AI-integrated development environment (IDE). It highlights streamlined workflows for direct ingestion of EDC metadata into the R environment to expedite data automation, alongside end-to-end data flow orchestration for key clinical deliverables, including SDTM mapping, ADaM dataset generation, and TLF production.
We will further demonstrate how Positron harnesses real-time R session context and seamless large language model (LLM) integration to deliver targeted, context-aware AI support. The ellmer package serves as a critical bridge, offering a robust R interface to connect with diverse LLM providers such as ChatGPT and Gemini. By combining Positron’s advanced AI functionalities with the enterprise-grade governance frameworks of the Posit Platform, this solution adheres to stringent reproducibility and regulatory standards, encompassing GxP, 21 CFR Part 11, and CDISC compliance requirements.