Posted by Alumni from Nature
April 16, 2026
Dubious data sets are being used to train artificial-intelligence models that are designed to predict people's risk of stroke and diabetes, researchers report in a preprint¹ on medRxiv. Some of the models seem to have been used in clinical settings, although it's not clear whether this has led to flawed diagnoses. At least two journals are investigating studies that used these data sets.

Adrian Barnett, a statistician at the Queensland University of Technology in Brisbane, Australia, and his colleagues identified 124 peer-reviewed papers that report training machine-learning models on one of two open-access health data sets, both of which provide little information about where the data came from. An analysis revealed multiple oddities that would not be expected for data from real people, leading Barnett and his colleagues to suspect that the data could have been fabricated. 'It was an enormous surprise to come across something like that,' Barnett says. At least two of the models have been...