In biomedical studies involving electronic health records, manually extracting gold-standard phenotype data is labor-intensive and limited in scale. The rise of generative AI, particularly large language models (LLMs), offers a systematic and significantly faster alternative through predictions, such as automated computational phenotypes (ACPs). However, directly substituting gold-standard data with these predictions, without addressing their differences, can introduce bias and lead to misleading conclusions. To address this challenge, we adopt a semi-supervised learning framework that integrates both labeled data (with gold-standard annotations) and unlabeled data (without gold-standard annotations) under the covariate shift paradigm. We propose doubly robust and semiparametrically efficient estimators to infer general target parameters. Through a rigorous efficiency analysis, we compare scenarios with and without the incorporation of LLM-derived predictions. Furthermore, we situate our approach within the existing literature, drawing connections to prediction-powered inference and its extensions, as well as seemingly unrelated concepts such as surrogacy. To validate our theoretical findings, we conduct extensive synthetic experiments and apply our method to real-world data, demonstrating its practical advantages.
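To make the connection to prediction-powered inference concrete, here is a minimal sketch of the basic prediction-powered estimator of a mean: use model predictions on the large unlabeled set, then debias with the residuals measured on the small gold-standard labeled set. This illustrates the general idea the abstract references, not the speaker's specific estimators; the data-generating setup (a predictor with a fixed offset) is purely hypothetical.

```python
import numpy as np

def ppi_mean(y_labeled, yhat_labeled, yhat_unlabeled):
    """Prediction-powered estimate of E[Y].

    Combines the mean of predictions on the unlabeled set with a
    "rectifier": the average prediction error on the labeled set.
    """
    rectifier = np.mean(y_labeled - yhat_labeled)  # measured bias of the predictor
    return np.mean(yhat_unlabeled) + rectifier

# Illustrative simulation: a biased "LLM" predictor yhat = y + 0.5 + noise.
rng = np.random.default_rng(0)
n, N = 100, 10_000                              # small labeled, large unlabeled
y_lab = rng.normal(2.0, 1.0, n)                 # gold-standard labels (true mean 2.0)
yhat_lab = y_lab + 0.5 + rng.normal(0, 0.2, n)  # predictions on labeled set
y_unlab = rng.normal(2.0, 1.0, N)               # never observed in practice
yhat_unlab = y_unlab + 0.5 + rng.normal(0, 0.2, N)

naive = np.mean(yhat_unlab)                 # plug-in: inherits the predictor's offset
ppi = ppi_mean(y_lab, yhat_lab, yhat_unlab) # debiased estimate
```

Here the naive plug-in estimate is off by roughly the predictor's systematic offset, while the rectified estimate recovers the true mean up to sampling noise on the labeled set.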
Jiwei Zhao: Statistical Benefits when Incorporating LLM-Derived Predictions: Old Wine in a New Bottle?
April 10, 2025 1:30 pm - 2:30 pm ET