Keynote Session 2

Keynote Session 2:

+ Keynote Session 2 Chair
+ Keynote 2: Large language models for cancer diagnosis and medical discovery (8:30 AM - 9:30 AM, May 17, 2025)

Keynote Session 2 Chair: Dr. Ying Lu, Stanford

Ying Lu, Ph.D., is Professor in the Department of Biomedical Data Science, and by courtesy in the Department of Radiology and the Department of Health Research and Policy, Stanford University. He is the Co-Director of the Stanford Center for Innovative Study Design and of the Biostatistics Core of the Stanford Cancer Institute. Before his current position, he was the director of the VA Cooperative Studies Program Palo Alto Coordinating Center (2009-2016) and a Professor of Biostatistics and Radiology at the University of California, San Francisco (1994-2009). His research areas are biostatistics methodology and applications in clinical trials, statistical evaluation of medical diagnostic tests, and medical decision making. He serves as the biostatistical associate editor for JCO Precision Oncology and as co-editor of the Cancer Research Section of the New England Journal of Statistics and Data Science. Dr. Lu is an elected fellow of the American Association for the Advancement of Science and the American Statistical Association. Dr. Lu initiated the Stat4Onc Annual Symposium with Dr. Ji and Dr. Kummar in 2017 and is the PI of the R13 NCI grant for this conference.


Keynote 2: Large language models for cancer diagnosis and medical discovery

May 17, 2025

Speaker: Professor Robert Tibshirani, PhD
Stanford University

Robert Tibshirani is a Professor of Biomedical Data Science, and of Statistics, at Stanford University. He has made important contributions to the statistical analysis of complex datasets. Some of his best-known contributions are the Lasso, which uses L1 penalization in regression and related problems, generalized additive models, and Significance Analysis of Microarrays (SAM). He also co-authored five widely used books: ‘Generalized Additive Models’, ‘An Introduction to the Bootstrap’, ‘The Elements of Statistical Learning’, ‘An Introduction to Statistical Learning’, and ‘Sparsity in Statistics: the Lasso and its generalizations’. He is an active collaborator with many scientists at Stanford Medical School.

Tibshirani received the COPSS Presidents' Award in 1996. Given jointly by the world's leading statistical societies, the award recognizes outstanding contributions to statistics by a statistician under the age of 40. He was elected a Fellow of the Royal Society of Canada in 2001, the National Academy of Sciences in 2012, and the Royal Society of Britain in 2019. In 2021 he received the ISI Founders of Statistics Prize for his 1996 paper Regression Shrinkage and Selection via the Lasso. In 2024 he received the COPSS Distinguished Achievement Award and the WNAR/IBS Outstanding Impact Award.

Abstract

This will be a two-part talk. In the first part, I introduce LLM-Lasso, a novel framework that leverages large language models (LLMs) to guide feature selection in Lasso ℓ1 regression. Unlike traditional methods that rely solely on numerical data, LLM-Lasso incorporates domain-specific knowledge extracted from natural language, enhanced through a retrieval-augmented generation (RAG) pipeline, to seamlessly integrate data-driven modeling with contextual insights. LLM-Lasso outperforms standard Lasso and existing feature selection baselines, all while ensuring the LLM operates without prior access to the datasets. To our knowledge, this is the first approach to effectively integrate conventional feature selection techniques directly with LLM-based domain-specific reasoning.

Joint work with Erica Zhang, Ryunosuke Goto, Naomi Sagan, Jurik Mutter, Nick Phillips, Ash Alizadeh, Kangwook Lee, Jose Blanchet, and Mert Pilanci
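The abstract does not spell out how the LLM's guidance enters the Lasso. One plausible mechanism, assumed here purely for illustration, is a set of per-feature penalty factors: features judged relevant get a small factor (a light penalty), and features judged irrelevant get a large one. The sketch below is a plain weighted Lasso solved by coordinate descent on a toy dataset. The function names, the toy data, and the hard-coded penalty factors are all hypothetical; no LLM or RAG pipeline is called, and this is not the authors' implementation.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator used by Lasso coordinate descent."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def weighted_lasso(X, y, lam, penalty_factors, n_iter=200):
    """Minimize (1/2n)*||y - X b||^2 + lam * sum_j w_j * |b_j|.

    X is a list of n rows with p floats each; w_j are the penalty factors.
    No intercept is fitted. Solved by cyclic coordinate descent.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta, since beta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Covariance of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n
            rho += col_sq[j] * beta[j]
            new_b = soft_threshold(rho, lam * penalty_factors[j]) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_b
    return beta


# Toy data with two orthogonal features; y = 2*x1 + 1*x2 exactly.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [3.0, 1.0, -1.0, -3.0]

# Standard Lasso: every feature has penalty factor 1.
beta_uniform = weighted_lasso(X, y, lam=0.5, penalty_factors=[1.0, 1.0])

# Hypothetical guided factors: feature 1 favored, feature 2 disfavored.
beta_guided = weighted_lasso(X, y, lam=0.5, penalty_factors=[0.2, 4.0])

print(beta_uniform)  # approximately [1.5, 0.5]
print(beta_guided)   # approximately [1.9, 0.0]
```

With uniform factors both coefficients are shrunk by the same amount. With the guided factors, the favored feature is shrunk less and the disfavored feature is dropped, which is the sense in which external knowledge can steer feature selection under this assumption.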

In the second part, my graduate student Min Woo Sun will present Biomedica, an open-source dataset derived from the PubMed Central Open Access subset, containing over 6 million scientific articles and 24 million image-text pairs, along with 27 metadata fields (including expert human annotations). To overcome the challenges of accessing our large-scale dataset, we provide scalable streaming and search APIs through a web server, facilitating seamless integration with AI systems. We demonstrate the utility of the Biomedica dataset by building embedding models, chat-style models, and retrieval-augmented chat agents. Notably, all our AI models surpass previous open systems in their respective categories, underscoring the critical role of diverse, high-quality, and large-scale biomedical data.

Joint work with Alejandro Lozano, James Burgess, Jeffrey J. Nirschl, Christopher Polzak, Yuhui Zhang, Liangyu Chen, Jeffrey Gu, Ivan Lopez, Josiah Aklilu, Anita Rau, Austin Wolfgang Katzer, Collin Chiu, Orr Zohar, Xiaohan Wang, Alfred Seunghoon Song, Chiang Chia-Chun, and Serena Yeung-Levy


Keynote Session 2 discussant: Dr. Dacheng Liu

Dacheng Liu is a Highly Distinguished Therapeutic Area and Methodology Statistician at Boehringer Ingelheim, with 20 years of experience in the pharmaceutical industry. He provides leadership in driving statistical quality and fostering innovation in company-wide clinical development programs across all therapeutic areas. He represents Boehringer Ingelheim in industry-wide groups and leads collaborations with US partners from both industry and academia. Before his current role, Dacheng held positions as the Global Head of Clinical Data Sciences and the US Head of Statistics, where he led both US and global teams in clinical drug development for the company pipeline. Dacheng has extensive experience leading early- and late-phase development in multiple disease areas. He has over 40 publications in the areas of clinical research, trial design, statistical methodologies, and AI/machine learning.