Re-engineering a machine learning phenotype to adapt to the changing COVID-19 landscape: a machine learning modelling study from the N3C and RECOVER consortia

Aug 25, 2025

Miles Crosskey

Tomas McIntee

Sandy Preiss

Daniel Brannock

John M Baratta

Yun Jae Yoo

Emily Hadley

Frank Blanceró

Robert Chew

Johanna Loomba

Abhishek Bhatia

Christopher G Chute

Melissa Haendel

Richard Moffitt

Emily R Pfaff

Abstract

Background: In 2021, we used the National COVID Cohort Collaborative (N3C) as part of the National Institutes of Health RECOVER Initiative to develop a machine learning pipeline to identify patients with a high probability of having post-acute sequelae of SARS-CoV-2 infection, or long COVID. However, the increased home testing, missing documentation, and reinfections that characterise the pandemic beyond 2022 necessitated re-engineering our original model to account for these changes in the COVID-19 research landscape.

Methods: Our updated XGBoost model was trained on 72,745 patient records (36,238 with long COVID and 36,507 with no evidence of long COVID). It gathered data for each patient in overlapping 100-day periods that progressed through time and issued a probability of long COVID for each 100-day period. We ran the model on patients in N3C (n=5,875,065) who met specified criteria from Jan 1, 2020, to June 22, 2023. Each patient was given a model score that predicted long COVID status for each 100-day window.

Findings: The updated model had an area under the receiver operating characteristic curve of 0.90. Using our model, we estimate the overall prevalence of long COVID among the COVID-19 positive cohort within the N3C repository to be 10.4%.

Interpretation: By eschewing the COVID-19 index date as an anchor point for analysis, we can assess the probability of long COVID among patients who might have tested at home, who have suspected (but untested) cases of COVID-19, or who have had multiple SARS-CoV-2 reinfections.
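
The windowing idea described in the Methods can be illustrated with a minimal sketch. This is not the study's pipeline: the abstract gives the 100-day window length and the study dates, but not the amount of overlap, so the 30-day stride below is an assumption, and `overlapping_windows`, `score_patient`, and `toy_score` are hypothetical names. The `toy_score` function only stands in for the trained XGBoost model, which is not reproduced here.

```python
from datetime import date, timedelta

WINDOW_DAYS = 100  # window length stated in the abstract
STRIDE_DAYS = 30   # ASSUMPTION: the abstract does not state the overlap


def overlapping_windows(start, end, window_days=WINDOW_DAYS, stride_days=STRIDE_DAYS):
    """Yield inclusive (window_start, window_end) pairs that fit inside [start, end]."""
    window_start = start
    span = timedelta(days=window_days - 1)
    while window_start + span <= end:
        yield window_start, window_start + span
        window_start += timedelta(days=stride_days)


def score_patient(events, windows, score_fn):
    """Score one patient per window, with no reference to a COVID-19 index date.

    events: list of (event_date, code) tuples for a single patient.
    score_fn: hypothetical stand-in for a trained classifier.
    """
    results = []
    for w_start, w_end in windows:
        codes = [code for event_date, code in events if w_start <= event_date <= w_end]
        results.append(((w_start, w_end), score_fn(codes)))
    return results


def toy_score(codes):
    """Hypothetical scorer: share of codes in the window flagged as relevant."""
    flagged = {"fatigue", "dyspnea"}  # illustrative labels only
    return sum(code in flagged for code in codes) / len(codes) if codes else 0.0


# Study period taken from the abstract.
windows = list(overlapping_windows(date(2020, 1, 1), date(2023, 6, 22)))

# Made-up events for a single illustrative patient.
events = [(date(2021, 3, 1), "fatigue"), (date(2021, 3, 20), "cough")]
scores = score_patient(events, windows, toy_score)
```

Because each window is scored on its own, a patient with no recorded positive test still receives a probability for every period in which they have data, which is the property the Interpretation relies on.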

Type

Journal article

Publication

The Lancet Digital Health

Last updated on Aug 25, 2025

COVID-19, Long COVID, Machine Learning, N3C, RECOVER