Common misconceptions about validation studies

Affiliations

  • 1 Department of Epidemiology, Boston University School of Public Health, Boston, MA, USA.
  • 2 Department of Global Health, Boston University School of Public Health, Boston, MA, USA.
  • 3 Department of Epidemiology, Rollins School of Public Health, Emory University, Atlanta, GA, USA.
  • 4 Department of Epidemiology, University of Pittsburgh School of Public Health, Pittsburgh, PA, USA.
  • PMID: 32617564
  • PMCID: PMC7750925
  • DOI: 10.1093/ije/dyaa090

Information bias is common in epidemiology and can substantially diminish the validity of study results. Validation studies, in which an investigator compares the accuracy of a measure with a gold standard measure, are an important way to understand and mitigate this bias. More attention is being paid to the importance of validation studies in recent years, yet they remain rare in epidemiologic research and, in our experience, they remain poorly understood. Many epidemiologists have not had any experience with validation studies, either in the classroom or in their work. We present an example of misclassification of a dichotomous exposure to elucidate some important misunderstandings about how to conduct validation studies to generate valid information. We demonstrate that careful attention to the design of validation studies is central to determining how the bias parameters (e.g. sensitivity and specificity or positive and negative predictive values) can be used in quantitative bias analyses to appropriately correct for misclassification. Whether sampling is done based on the true gold standard measure, the misclassified measure or at random will determine which parameters are valid and the precision of those estimates. Whether or not the validation is done stratified by other key variables (e.g. by the exposure) will also determine the validity of those estimates. We also present sample questions that can be used to teach these concepts. Increasing the presence of validation studies in the classroom could have a positive impact on their use and improve the validity of estimates of effect in epidemiologic research.

Keywords: Information bias; misclassification; sensitivity; specificity; validation studies.

© The Author(s) 2020; all rights reserved. Published by Oxford University Press on behalf of the International Epidemiological Association.



Questionnaire validation practice: a protocol for a systematic descriptive literature review of health literacy assessments

BMJ Open, Volume 9, Issue 10

  • Melanie Hawkins 1 (http://orcid.org/0000-0001-5704-0490),
  • Gerald R Elsworth 1,
  • Richard H Osborne 2
  • 1 School of Health and Social Development, Faculty of Health, Deakin University, Burwood, Victoria, Australia
  • 2 Global Health and Equity, Faculty of Health, Arts and Design, Swinburne University of Technology, Hawthorn, Victoria, Australia
  • Correspondence to Melanie Hawkins; melanie.hawkins@deakin.edu.au

Introduction Contemporary validity testing theory holds that validity lies in the extent to which a proposed interpretation and use of test scores is justified, the evidence for which is dependent on both quantitative and qualitative research methods. Despite this, we hypothesise that development and validation studies for assessments in the field of health primarily report a limited range of statistical properties, and that a systematic theoretical framework for validity testing is rarely applied. Using health literacy assessments as an exemplar, this paper outlines a protocol for a systematic descriptive literature review about the types of validity evidence being reported and whether the evidence is reported within a theoretical framework.

Methods and analysis A systematic descriptive literature review of qualitative and quantitative research will be used to investigate the scope of validation practice in the rapidly growing field of health literacy assessment. This review method employs a frequency analysis to reveal potentially interpretable patterns of phenomena in a research area; in this study, patterns in types of validity evidence reported, as assessed against the criteria of the 2014 Standards for Educational and Psychological Testing, and in the number of studies using a theoretical validity testing framework. The search process will be consistent with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement. Outcomes of the review will describe patterns in reported validity evidence, methods used to generate the evidence and theoretical frameworks underpinning validation practice and claims. This review will inform a theoretical basis for future development and validity testing of health assessments in general.

Ethics and dissemination Ethics approval is not required for this systematic review because only published research will be examined. Dissemination of the review findings will be through publication in a peer-reviewed journal, at conference presentations and in the lead author’s doctoral thesis.

  • validity testing theory
  • health literacy
  • health assessment
  • measurement

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

https://doi.org/10.1136/bmjopen-2019-030753


Strengths and limitations of this study

This is the first systematic literature review to examine types of validity evidence for a range of health literacy assessments within the framework of the authoritative reference for validity testing theory, The Standards for Educational and Psychological Testing.

The review is grounded in the contemporary definition of validity as a quality of the interpretations and inferences made from measurement scores rather than as solely based on the properties of a measurement instrument.

The search for the review will be limited only by the end search date (March 2019) because health literacy is a relatively new field and the earliest relevant publications are expected to date back only about 30 years.

All definitions of health literacy and all types of health literacy assessment instruments will be included.

A limitation of the review is that the search will be restricted to studies published and instruments developed in the English language, and this may introduce an English language and culture bias.

Introduction

Historically, the focus of validation practice has been on the statistical properties of a test or other measurement instrument, and this has been adopted as the basis of validity testing for individual and population assessments in the field of health. 1 However, advancements in validity testing theory hold that validity lies in the justification of a proposed interpretation of test scores for an intended purpose, the evidence for which includes but is not limited to the test’s statistical properties. 2–7 Therefore, to validate means to investigate , through a range of methods, the extent to which a proposed interpretation and use of test scores is justified. 7–9 The term ‘test’ in this paper is used in the same sense as Cronbach uses it in his 1971 Test Validation chapter 8 to refer to all procedures for collecting data about individuals and populations. In health, these procedures include objective tests (eg, clinical assessments) and subjective tests (eg, patient questionnaires) or a combination of both and may involve quantitative (eg, questionnaire) or qualitative methods (eg, interview). The act of testing results in data that require interpretation. In the field of health, such interpretations are usually used for making decisions about individuals or populations. The process of validation needs to provide evidence that these interpretations and decisions are credible, and a theoretical framework to guide this process is warranted. 1 2 10

The authoritative reference for validity testing theory comes from education and psychology: the Standards for Educational and Psychological Testing (the Standards). 3 The Standards define validity as ‘the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests’ and state that ‘the process of validation involves accumulating relevant evidence to provide a sound scientific basis for the proposed score interpretations’ (p.11). 3 A test’s proposed score interpretation and use is described in Kane’s argument-based approach to validation as an interpretation/use argument (IUA; also called an interpretive argument). 11 12 Validity testing theory requires test developers and users to generate and evaluate a range of validity evidence such that a validity argument can determine the plausibility of the IUA. 3 7 9 11 12 Despite this contemporary stance on validity testing theory and practice, the application of validity testing theory and methodology is not common practice for individual and population assessments in the field of health. 1 Furthermore, there are calls for developers, users and translators/adapters of health assessments to establish theoretically driven validation plans for IUAs such that validity evidence can be systematically collected and evaluated. 1 2 7 10

The Standards provide a theoretical framework that can be used or adapted to form a validation plan for development of a new test or to evaluate the validity of an IUA for a new context. 1 2 Based on the notion that construct validity is the foundation of test development and use, the theoretical framework of the Standards outlines five sources of evidence on which validity arguments should be founded: (1) test content, (2) response processes, (3) internal structure, (4) relationship of scores to other variables and (5) validity and the consequences of testing ( table 1 ). 3

Table 1. The five sources of validity evidence 3

Validity testing in the health context

Two of the five sources of validity evidence defined by the Standards (internal structure and relationship of scores to other variables) have a focus on the statistical properties of a test. However, the other three (test content, response processes and consequences of testing) are strongly reliant on evidence based on qualitative research methods. Greenhalgh et al have called for more credence and publication space to be given to qualitative research in the health sciences. 13 Zumbo and Chan (p.350, 2014) call specifically for more validity evidence from qualitative and mixed methods. 1 It is time to systematically assess if test developers and users in health are generating and integrating a range of quantitative and qualitative evidence to support inferences made from these data. 1

In chapter 1 of their book, Zumbo and Chan report the results of a systematic search of validation studies from the 1960s to 2010. Results from this search for the health sciences categories of ‘life satisfaction, well-being or quality of life’ and ‘health or medicine’ show a dramatic increase since the 1990s in the publication of validation studies that produce primarily what is classified as construct validity. 1 Given this was a snapshot review of validation practice during these years, the authors do not delve into the methods used to generate evidence for construct validity. However, Barry et al, in a systematic review investigating the frequency with which psychometric properties were reported for validity and reliability in health education and behaviour (also published in 2014), found that the primary methods used to generate evidence for construct validity were factor analysis, correlation coefficients and χ². 14 This limited view of construct validity as simply correlation between items or tests measuring the same or similar constructs is at odds with the Standards, where evaluation and integration of evidence from perhaps several other sources (ie, test content, response processes, internal structure, relationships with theoretically predicted external variables, and intended and unintended consequences) is needed to determine the degree to which a construct is represented by score interpretations (p.11). 3

Health literacy

This literature review will examine validity evidence for health literacy assessments. Health literacy is a relatively new area of measurement, and there has been a rapid development in the definition and measurement of this multi-dimensional concept. 15–18 Health literacy is now a priority of the WHO, 19 and many countries have incorporated it into health policy, 20–24 and are including it in national health surveys. 25–27

Definitions of health literacy range from those for functional health literacy (ie, a focus on comprehension and numeric abilities) to multi-dimensional definitions such as that used by the WHO: ‘the cognitive and social skills which determine the motivation and ability of individuals to gain access to, understand and use information in ways which promote and maintain good health’. 28 The general purpose of health literacy assessment is to determine pathways to facilitate access to and improve understanding and use of health information and services, as well as to improve or support the health literacy responsiveness of health services. 28–31 However, these two uses of data (in general, to improve patient outcomes and to improve organisational procedures) may require evaluative integration of different types of evidence to justify score interpretations to inform patient interventions or organisational change. 3 7 9 11 32 A strong and coherent evidence-based conception of the health literacy construct is required to support score interpretations. 14 33–35 Decisions that arise from measurements of health literacy will affect individuals and populations and, as such, there must be a strong argument for the validity of score interpretations for each measurement purpose.

To enhance the quality and transparency of the proposed systematic descriptive literature review, this protocol paper outlines the scope and purpose of the review. 36 37 Using the theoretical framework of the five sources of validity evidence of the Standards , and health literacy assessments as an exemplar, the results of this systematic descriptive literature review will indicate current validation practice. The assumptions that underlie this literature review are that, despite the advancement of contemporary validity testing theory in education and psychology, a systematic theoretical framework for validity testing has not been applied in the field of health, and that validation practice for health assessments remains centred on general psychometric properties that typically provide insufficient evidence that the test is fit for its intended use. The purpose of the review is to investigate quantitative and qualitative validity evidence reported for the development and testing of health literacy assessments to describe patterns in the types of validity evidence reported, 38–45 and identify use of theory for validation practice. Specifically, the review will address the following questions:

What is being reported as validity evidence for health literacy assessment data?

Do the studies place the validity evidence within a validity testing framework, such as that offered by the Standards ?

Methods and analysis

Review method

This review is designed to provide the basis for a critique of validation practice for health literacy assessments within the context of the validity testing framework of the Standards. It is not an evaluation of the specific arguments that authors have made about validity from the data that have been gathered for individual measurement instruments. The review is intended to quantify the types of validity evidence being reported, so a systematic descriptive literature review was chosen as the most appropriate review technique. Described by King and He (2005) 42 as belonging towards the qualitative end of a continuum of review techniques, a descriptive literature review nevertheless employs a frequency analysis to reveal interpretable patterns in a research area: in this review, the types of validity evidence being reported for health literacy assessments and the number of studies that refer to a validity testing framework. A descriptive literature review can include qualitative and quantitative research and is based on a systematic and exhaustive review method. 38–41 43 44 The method for this review will be guided by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement. 46

Eligibility criteria

This literature review is not an assessment of participant data but a collation of reported validity evidence. As such, the focus is not on the participants in the studies but on the evidence presented in support of the validity of interpretations and uses of health literacy assessment data. This means that it will be the type of study that is considered for inclusion rather than the type of study participant. Inclusion criteria are as follows:

Development/application/validation studies about health literacy assessments : We expect to find many papers that describe the development and initial validation studies of health literacy assessments. Papers that use an existing health literacy assessment to measure outcomes but do not claim to conduct validity testing will not be included. Studies of comparison (eg, participant groups) or of prediction (eg, health literacy and hospital admissions) will be included only if the authors openly claim that the study results contribute validation evidence for the health literacy assessment instrument.

Not limited by date : There will be no start date to the search such that papers about validation and health literacy assessments from the early days of health literacy measurement will be included in the search. Health literacy is a relatively new concept and the earliest papers are expected to date back only about 30 years. The end search date was in March 2019.

Studies published and health literacy assessments developed in the English language : Due to resource limitations, the search will be restricted to studies published in the English language and instruments developed in the English language. Translated instruments will be excluded. We realise that these exclusions introduce an English language and culture bias, and we recommend that a similar descriptive review of published studies about health literacy assessments developed in or translated to other languages is warranted.

Qualitative and quantitative research methods : Given that comprehensive validity testing includes both qualitative and quantitative methods, studies employing either or both will be included.

All definitions of health literacy : Definitions of health literacy have been accumulating over the past 30 years and reflect a range of health literacy testing methods as well as contexts, interpretations and uses of the data. We include all definitions of health literacy and all types of health literacy assessment instruments, which may include objective, subjective, uni-dimensional and multi-dimensional measurement instruments.

Exclusion criteria

Systematic reviews and other types of reviews captured by the search will not be included in the analysis. However, before being excluded, the reference lists will be checked for articles that may have been missed by the database search. Predictive, association or other comparative studies that do not explicitly claim in the abstract to contribute validity evidence will also not be included. Instruments developed in languages other than English, and translation studies, will be excluded as noted previously.

Information sources

Systematic electronic searches of the following databases will be conducted in EBSCOhost: MEDLINE Complete, Global Health, CINAHL Complete, PsycINFO and Academic Search Complete. EMBASE will also be searched. The electronic database search will be supplemented by searching for dissertations and theses through proquest.com, dissertation.com and openthesis.org. Reference lists of pertinent systematic reviews that are identified in the search will be scanned, as well as article reference lists and the authors’ personal reference lists, to ensure all relevant articles have been captured. The search terms will use medical subject headings and text words related to types of assessment instruments, health literacy, validation and validity testing. Peer reviewed full articles and examined theses will be included in the search.

Search strategy

An expert university librarian has been consulted as part of planning the literature search strategy. The strategy will focus on health literacy, types of assessment instruments, validation and validity, and methods used to determine the validity of interpretation and use of data from health literacy assessments. The search terms have been determined through scoping searches and examining search terms from other measurement and health literacy systematic reviews. The database searches were completed in March 2019 and the search terms used are described in online supplementary file 1 .


Study selection

Literature search results will be saved and the titles and abstracts downloaded to Endnote Reference Manager X9. Titles and abstracts of the search results will be screened for duplicates and according to the inclusion and exclusion criteria. The full texts of articles that seem to meet the eligibility criteria or that are potentially eligible will then be obtained and screened. Excluded articles and reasons for exclusions will be recorded. The PRISMA flow diagram will be used to document the review process. 46

Data extraction

The data extraction framework will be adapted from tables in Hawkins et al 2 (p.1702) and Cox and Owen (p.254). 47 Data extraction from eligible articles will be conducted by one reviewer (MH) and comprehensively checked by a second reviewer (GE).

Subjective and objective health literacy assessments will be identified along with those that combine objective and subjective items or scales. Data to be extracted will include the date and source of publication; the context of the study (eg, country, type of organisation/institution, type of investigation, representative population); statements about the use of a theoretical validity testing framework; the types of validity evidence reported; the methods used to generate the evidence; and the validation claims made by the authors of the papers, as based on their reported evidence.

Data synthesis and analysis

A descriptive analysis of extracted data, as based on the theoretical framework of the Standards , will be used to identify patterns in the types of validity evidence being reported, the methods used to generate the evidence and theoretical frameworks underlying validation practice. Where possible and relevant to the concept of validity, changes in validation practice and assessment of health literacy over time will be explored. It is possible that one study may use more than one method and generate more than one type of validity evidence. Statements about a theoretical underpinning to the generation of validity evidence will be collated.
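The frequency analysis itself is computationally straightforward. As a rough illustration only (the records, study names, evidence categories and field names below are hypothetical and do not come from the extraction framework described above), a tally of evidence types and framework use could be sketched in Python as follows:

```python
from collections import Counter

# Hypothetical extraction records: one entry per included study, listing the
# Standards' sources of validity evidence reported and whether a validity
# testing framework was cited. Values are illustrative placeholders only.
extracted = [
    {"study": "Study A", "evidence": ["test content", "internal structure"], "framework_cited": False},
    {"study": "Study B", "evidence": ["internal structure", "relations to other variables"], "framework_cited": True},
    {"study": "Study C", "evidence": ["response processes"], "framework_cited": False},
]

# Frequency of each type of validity evidence across included studies
evidence_counts = Counter(e for record in extracted for e in record["evidence"])

# Number of studies that place their evidence within a validity testing framework
framework_count = sum(record["framework_cited"] for record in extracted)

print(evidence_counts)
print(f"{framework_count} of {len(extracted)} studies cited a validity testing framework")
```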

Patient and public involvement

Patients and the public were not involved in the development or design of this literature review.

With the increasing use of health assessment data for decision-making, the health of individuals and populations relies on test developers and users to provide evidence for validity arguments for the interpretations and uses of these data. This systematic descriptive literature review will collate existing validity evidence for health literacy assessments developed in English and identify patterns of reporting frequency according to the five sources of evidence in the Standards, and establish whether the validity evidence is being placed within a theoretical framework for validation planning. 3 The potential implications of this review include finding that, when assessed against the Standards’ theoretical framework, current validation practice in health literacy (and possibly in health assessment in general) has limited capacity for determining valid score interpretation and use. The Standards’ framework challenges the long-held perception in health assessment that validity refers to an assessment tool rather than to the interpretation of data for a specific use. 48 49

The validity of decisions based on research data is a critical aspect of health services research. Our understanding of the phenomena we research is dependent on the quality of our measurement of the constructs of interest, which, in turn, affects the validity of the inferences we make and actions we take from data interpretations. 6 7 Too often the measurement quality is considered separate to the decisions that need to be made. 6 50 However, questionable measurement (perhaps through use of an instrument that was developed using suboptimal methods, was inappropriately applied or through gaps in validity testing) cannot lead to valid inferences. 3 50 To make appropriate and responsible decisions for individuals, communities, health services and policy development, we must consider the integrity of the instruments, and the context and purpose of measurement, to justify decisions and actions based on the data.

A limitation of the review is that the search will be restricted to studies published and instruments developed in the English language, and this may introduce an English language and culture bias. A similar review of health literacy assessments developed in or translated to other languages is warranted. A further limitation is that we rely on the information authors provide in identified articles. It is possible that some authors have an incomplete understanding of the specific methods they are using and reporting, and may not accurately or clearly provide details on validity testing procedures employed. Documentation for decisions made during data extraction will be kept by the researchers.

Health literacy is a relatively new area of research. We are fortunate to be at the start of a burgeoning field and can include all publications about validity testing of English-language health literacy assessments. The inclusion of the earliest to the most recent publications provides the opportunity to understand changes and advancements in health literacy measurement and methods of analysis since the introduction of the concept of health literacy. Using health literacy assessments as an exemplar, the outcomes of this review will guide and inform a theoretical basis for the future practice of validity testing of health assessments in general to ensure, as far as is possible, the integrity of the inferences made from data for individual and population benefits.

Acknowledgments

The authors acknowledge and thank Rachel West, Deakin University Liaison Librarian, for her expertise and advice during the preparation of this systematic literature review.


Twitter @4MelanieHawkins

Contributors MH and RHO conceptualised the research question and analytical plan. Under supervision from RHO, MH led the development of the search strategy, selection criteria, data extraction criteria and analysis method, which was then comprehensively assessed and checked by GRE. MH drafted the initial manuscript and led subsequent drafts. GRE and RHO read and provided feedback on manuscript iterations. All authors approved the final manuscript. RHO is the guarantor.

Funding MH is funded by a National Health and Medical Research Council (NHMRC) of Australia Postgraduate Scholarship (APP1150679). RHO is funded in part through a National Health and Medical Research Council (NHMRC) of Australia Principal Research Fellowship (APP1155125).

Competing interests None declared.

Patient consent for publication Not required.

Ethics approval Ethics approval is not required for this systematic review because only published research will be examined. Dissemination will be through publication in a peer-reviewed journal and at conference presentations, and in the lead author’s doctoral thesis.

Provenance and peer review Not commissioned; externally peer reviewed.



Common misconceptions about validation studies


Matthew P Fox, Timothy L Lash, Lisa M Bodnar, Common misconceptions about validation studies, International Journal of Epidemiology, Volume 49, Issue 4, August 2020, Pages 1392–1396, https://doi.org/10.1093/ije/dyaa090


Information bias is common in epidemiology and can greatly impact the validity of study results, yet validation studies, which could help us to understand and mitigate this bias, are rarely done.

It has been our experience that students of epidemiology have limited training on how to conduct a validation study and as such, many misconceptions about validation studies remain.

Proper understanding of how to implement a validation study could have a strong impact on the validity of study results.

Information bias is common in epidemiology 1 and can substantially diminish the validity of study results. 2–5 Validation studies, in which an investigator compares the accuracy of a measure with a gold standard measure, 6 are an important way to understand and mitigate this bias. Validation data can be combined with quantitative bias analysis methods 7–10 to compute bias-adjusted estimates that account for the systematic error and yield uncertainty intervals that represent total uncertainty better than conventional confidence intervals. Further, their estimates of values for bias parameters, such as the sensitivity and specificity, may be transported to other studies that use the same measurement instrument, which improves the overall state of the science in the topic area. 11 As with other epidemiologic studies, validation studies must be carefully designed and implemented to be useful.
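To make the link between validation data and quantitative bias analysis concrete, the sketch below shows the simplest (non-probabilistic) correction of a 2x2 table for exposure misclassification using sensitivity and specificity, in the spirit of the bias-analysis methods cited above. It is a minimal illustration rather than the authors' code: the function names and observed counts are invented (chosen only so the classified-exposure margins match the worked example introduced below), it assumes non-differential misclassification, and it ignores random error.

```python
def bias_adjust_exposure(a_obs, b_obs, c_obs, d_obs, se, sp):
    """Back-calculate the expected 'true' cell counts of a 2x2 exposure-by-outcome
    table from observed counts affected by exposure misclassification.

    a_obs, c_obs: classified-exposed and classified-unexposed cases
    b_obs, d_obs: classified-exposed and classified-unexposed non-cases
    se, sp: sensitivity and specificity of exposure classification, assumed
            non-differential (identical in cases and non-cases).
    """
    cases, noncases = a_obs + c_obs, b_obs + d_obs
    # Observed exposed = Se * truly exposed + (1 - Sp) * truly unexposed; invert:
    a_true = (a_obs - (1 - sp) * cases) / (se - (1 - sp))
    b_true = (b_obs - (1 - sp) * noncases) / (se - (1 - sp))
    return a_true, b_true, cases - a_true, noncases - b_true


def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)


# Invented observed counts and assumed classification parameters:
a, b, c, d = 240, 2160, 260, 4440
or_conventional = odds_ratio(a, b, c, d)
or_bias_adjusted = odds_ratio(*bias_adjust_exposure(a, b, c, d, se=0.95, sp=0.92))
```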

More attention is being paid to the importance of validation studies, and several journals 12 , 13 have even created submission categories for validation studies, yet they remain rare in epidemiologic research. To address this gap, we pose a series of questions that could be used on an exam about validation studies or in teaching validation concepts, and use the explanation of the answers to dispel misconceptions as well as to provide teaching examples that can be used to prevent these misconceptions from taking hold in the first place.

Validation study designs

We will work with an example of misclassification of a dichotomous exposure, though the principles we discuss apply equally to misclassification of dichotomous outcomes and covariates. Suppose we conducted an observational study of the relationship between self-reported human papillomavirus (HPV) vaccination and cancer precursor conditions among 7100 respondents, of whom 2100 were classified as exposed and 5000 were classified as unexposed ( Table 1 , Panel A). The self-reported history of HPV vaccination was likely misclassified compared with a gold standard, such as comprehensive medical record review. The questions pertain to whether we would need to conduct an internal validation study, or whether we can instead apply validation study estimates from the existing literature.

Question 1. True/False: It is usually valid to apply estimates of positive and negative predictive value found from external sources (e.g. the literature) to another study population to conduct a bias analysis that will adjust study estimates for misclassification.

Question 2. True/False: It is usually valid to apply estimates of sensitivity and specificity found from external sources (e.g. the literature) to another study population to conduct a bias analysis that will adjust study estimates for misclassification.

Full study population (Panel A) and three possible validation studies of 200 people and the bias parameters that would be estimated from each of the three validation study designs (note that whereas all four bias parameters are presented, they are not all valid; estimates that are not valid are marked '(not valid)'). Note: E is HPV vaccination; cell counts fixed by the sampling design are marked with an asterisk.

Panel A. Full study population

                 True E+     True E−     Total
Classified E+    2000        400         2400       PPV = 0.83
Classified E−    100         4600        4700       NPV = 0.98
Total            2100        5000        7100
                 Se = 0.95   Sp = 0.92   Prevalence = 0.30

Panel B. Validation study design 1: 100 classified as exposed, 100 classified as unexposed

                 True E+     True E−     Total
Classified E+    83          17          100*       PPV = 0.83
Classified E−    2           98          100*       NPV = 0.98
Total            85          115         200
                 Se = 0.98 (not valid)   Sp = 0.85 (not valid)   Prevalence = 0.43

Panel C. Validation study design 2: 100 truly exposed, 100 truly unexposed

                 True E+     True E−     Total
Classified E+    95          8           103        PPV = 0.92 (not valid)
Classified E−    5           92          97         NPV = 0.95 (not valid)
Total            100*        100*        200
                 Se = 0.95   Sp = 0.92   Prevalence = 0.50

Panel D. Validation study design 3: 200 subjects chosen randomly

                 True E+     True E−     Total
Classified E+    56          11          68         PPV = 0.83
Classified E−    3           130         132        NPV = 0.98
Total            59          141         200*
                 Se = 0.95   Sp = 0.92   Prevalence = 0.30

Whereas Se and Sp can vary between populations, PPV and NPV can vary more strongly because the prevalence of the variable is likely to change between populations. Furthermore, if the exposure is associated with the outcome, then the prevalence of exposure would differ within outcome groups, and estimates of PPV and NPV from an external source would have to be available within the outcome categories. In our example, transportability of PPV and NPV from an external source to our study population would require that the prevalence of HPV vaccination is the same in the external population as in our population, that the association between HPV vaccination and cancer precursor conditions is the same in the external population as in our population, and that estimates of PPV and NPV are available within categories of cancer precursor conditions defined in the same way. These conditions are difficult to achieve, illustrating why PPV and NPV are therefore ordinarily considered to be less transportable between populations than Se and Sp, which do not require the same conditions to be transportable.
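The dependence of the predictive values on prevalence follows directly from Bayes' theorem. A small sketch (illustrative only; the Se and Sp values are those of Table 1, and the alternative prevalences are invented for contrast) shows how PPV and NPV shift when the same classification instrument is applied in populations with different exposure prevalences:

```python
def ppv(se, sp, prev):
    """Positive predictive value implied by Se, Sp and true exposure prevalence."""
    return se * prev / (se * prev + (1 - sp) * (1 - prev))

def npv(se, sp, prev):
    """Negative predictive value implied by the same three quantities."""
    return sp * (1 - prev) / (sp * (1 - prev) + (1 - se) * prev)

# Se = 0.95 and Sp = 0.92 as in Table 1; only the exposure prevalence changes.
# A prevalence of 2100/7100 (the example population) reproduces PPV = 0.83 and
# NPV = 0.98 from Panel A; the other prevalences are invented for contrast.
for prev in (0.10, 2100 / 7100, 0.50):
    print(f"prevalence={prev:.2f}  PPV={ppv(0.95, 0.92, prev):.2f}  NPV={npv(0.95, 0.92, prev):.2f}")
```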

Without sensitivity and specificity data from the literature to apply to our population, there are three main approaches for how to sample participants into an internal validation study. 6 However, only certain parameters can be estimated from each design.

Parameters that can be estimated from each validation study design

Question 3. In validation design 1, we sample respondents conditional on their imperfectly classified measure (e.g. sample 100 respondents who self-reported as vaccinated and 100 respondents who self-reported as unvaccinated) ( Table 1 , Panel B). What parameters can be validly calculated? Check all that apply.

   i. Se of exposure classification.

   ii. Sp of exposure classification.

  iii. PPV of exposure classification.

   iv. NPV of exposure classification.

The sample is taken within strata of those classified as exposed and unexposed. Sampling based on classified status changes the marginal exposure prevalence in the validation sample (43%) vs the study population (30%). The estimated predictive values within strata of classified exposure status will be valid because they do not rely on the marginal prevalence, but estimates of Se and Sp will be biased due to the change in exposure prevalence on the margins. With design 1, investigators can validly estimate predictive values, but not Se and Sp (at least not directly). Accordingly, the results generated are unlikely to be transportable to another study population, because predictive values depend on prevalence, as discussed above. The correct answer to question 3 should be PPV and NPV only [i.e. (iii) and (iv)].
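The Panel B counts make this checkable in a few lines. The final step below also sketches one reading of 'at least not directly': the valid predictive values can be combined with the classified-exposure margins of the full study (Panel A) to recover Se indirectly. This is an illustrative calculation, not code from the article:

```python
# Validation design 1 (Table 1, Panel B): 100 respondents sampled from the
# classified-exposed and 100 from the classified-unexposed.
tp, fp = 83, 17   # classified E+: truly exposed, truly unexposed
fn, tn = 2, 98    # classified E-: truly exposed, truly unexposed

ppv = tp / (tp + fp)        # 0.83 -> valid, matches the full study population
npv = tn / (fn + tn)        # 0.98 -> valid
se_naive = tp / (tp + fn)   # 0.98 -> biased (population value is 0.95)
sp_naive = tn / (fp + tn)   # 0.85 -> biased (population value is 0.92)

# Indirect route to Se: weight the valid predictive values by the
# classified-exposure margins of the full study (2400 E+, 4700 E-).
truly_exposed_among_classified_pos = ppv * 2400
truly_exposed_among_classified_neg = (1 - npv) * 4700
se_indirect = truly_exposed_among_classified_pos / (
    truly_exposed_among_classified_pos + truly_exposed_among_classified_neg
)   # ~0.95, the population sensitivity
```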

In validation design 2, we sample respondents conditional on their gold standard measure (e.g. sample 100 respondents who have evidence of vaccination in the medical record and 100 respondents who do not) ( Table 1 , Panel C). This approach allows for the valid calculation of Se and Sp, but not predictive values, again because the sampling changes the marginal prevalence and therefore biases the estimates of the predictive values. Because of this sampling approach, other investigators may be able to apply the generated estimates of Se and Sp to their study. However, this design is seldom feasible because, except in some unusual cases, investigators would rarely have the true gold standard measure on everyone in the study to sample. Thus design 2 is unlikely to be implemented in practice. It is sometimes possible that a subset of the study population has both the gold standard measure and the misclassified measure. For example, a subset of the 7100 respondents might receive healthcare from a healthcare system that maintains a high-quality vaccine registry, and this subset could be included in a validation study that estimates sensitivity and specificity. This design does not truly sample conditional on the gold standard, however; it is a convenience sampling design.

In design 3, we take a random sample of the study population (e.g. sample 200 respondents independent of their true or classified vaccination status). Design 3 (Table 1, Panel D) samples independent of classification or truth, so Se, Sp, PPV and NPV can be estimated and the estimates of Se and Sp can be useful to the investigators and other stakeholders using the same misclassified variable. However, this design allows no control over the expected sample size in each cell, which is determined by the distribution of the classified and true variables in the population. As a result, the bias parameters may be imprecisely estimated. 14
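The precision point can be seen with a quick simulation under the Panel A population (purely illustrative; the article reports no simulation). In repeated random samples of 200, the cell counts, and hence Se and Sp, vary considerably because only about 59 truly exposed respondents are expected per sample:

```python
import random

# Full study population from Table 1, Panel A, as (classified, truth) pairs.
population = (
    [("E+", "T+")] * 2000 + [("E+", "T-")] * 400 +
    [("E-", "T+")] * 100 + [("E-", "T-")] * 4600
)

def design3_estimates(n=200, rng=random):
    """Se and Sp from one random validation sample of n respondents."""
    sample = rng.sample(population, n)
    tp = sum(1 for s in sample if s == ("E+", "T+"))
    fp = sum(1 for s in sample if s == ("E+", "T-"))
    fn = sum(1 for s in sample if s == ("E-", "T+"))
    tn = sum(1 for s in sample if s == ("E-", "T-"))
    se = tp / (tp + fn) if (tp + fn) else float("nan")
    sp = tn / (fp + tn) if (fp + tn) else float("nan")
    return se, sp

# The spread across repeated samples illustrates the limited precision of
# design 3: the expected cell sizes are left entirely to chance.
estimates = [design3_estimates() for _ in range(1000)]
```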

Importance of stratification in validation studies

In our study of HPV vaccine and cancer precursor conditions, should we stratify estimates of exposure classification by a second variable, in this case, the outcome?

Question 4. In design 1, we validate self-reported vaccination status using a gold standard in a random sample of those classified as exposed and a random sample of those classified as unexposed. What assumption did we make about exposure misclassification with respect to the outcome? Check all that apply. We assume that the exposure misclassification is as follows.

Non-differential with respect to the outcome.

Differential with respect to the outcome.

Independent with respect to outcome classification.

Dependent with respect to outcome classification.

By ‘non-differential’ we mean Se and Sp of exposure classification do not differ by outcome status. By ‘differential’ we mean either Se or Sp of exposure classification differs by outcome status. By independent misclassification we mean that the rate of misclassification of the exposure does not depend on the rate of misclassification of the outcome (i.e. the errors are not correlated).

As noted above, design 1 only allows us to calculate predictive values after sampling conditional on the self-reported vaccination status. As explained above, if vaccination exposure is associated with cancer precursor conditions, then the prevalence of vaccination will be different in cases and non-cases. When estimating predictive values, therefore, it is imperative to stratify the estimates within outcome groups. Failure to do so implicitly assumes differential misclassification, although possibly in error. The lesson here is simple: when conducting a validation study, stratify results by other key variables (e.g. with exposure misclassification stratify by outcome, and with outcome misclassification, stratify by the exposure). This, of course, requires more study resources, but it is essential to generating valid estimates of classification parameters to inform bias-adjusted estimates of exposure effects.
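As a sketch of how outcome-stratified predictive values would feed into a bias-adjusted estimate, the snippet below applies the simple predictive-value reclassification described in bias-analysis texts. All counts and predictive values are hypothetical, and this is not code from the article:

```python
def expected_true_exposure(n_classified_pos, n_classified_neg, ppv, npv):
    """Expected numbers of truly exposed and truly unexposed in one outcome
    stratum, given its classified-exposure counts and the predictive values
    that apply to that stratum."""
    truly_exposed = ppv * n_classified_pos + (1 - npv) * n_classified_neg
    truly_unexposed = (n_classified_pos + n_classified_neg) - truly_exposed
    return truly_exposed, truly_unexposed

# Hypothetical counts and outcome-specific predictive values: the case
# stratum's PPV/NPV are applied to cases and the non-case stratum's to
# non-cases before the effect estimate is recomputed.
case_exp, case_unexp = expected_true_exposure(300, 700, ppv=0.88, npv=0.97)
ctrl_exp, ctrl_unexp = expected_true_exposure(2100, 4000, ppv=0.80, npv=0.99)

or_bias_adjusted = (case_exp * ctrl_unexp) / (case_unexp * ctrl_exp)
```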

Biases in validation studies

Question 5. Validation studies can suffer from which of the following sources of error? Check all that apply:

random error

confounding

selection bias

measurement error

Random error (as the full study population is not included), selection bias (if selection is related to both inclusion in the study and classification accuracy) and measurement error (if the gold standard is not a true gold standard 15 ) can all occur in validation studies. Confounding is more complicated. Confounding occurs when two variables (say, exposure and outcome) have a common cause; bias results when the effect of that third variable mixes with the effect of the exposure on the outcome. In validation studies, we have a single variable measured two different ways. It is theoretically possible to represent a validation study in a way that would present a common cause of both measures, but we do not see this as confounding in the typical sense in which it is used in epidemiology. This is not to say that there are not factors that predict the rates of misclassification, 16 only that we do not consider this confounding in the usual structural sense of the bias.

We recommend adding design and analysis of validation studies to the curriculum of graduate epidemiology programmes. This addition is recommended at the master’s level and is essential for anyone training at the doctoral level. Further, hands-on training in validation studies as part of graduate epidemiology programmes would allow students to see real-world challenges that occur in conducting validation studies. Our survey results suggest that few epidemiologists have had formal training in validation study design and analysis, lack confidence in their ability to conduct such studies, and make predictable errors. Asking doctoral students to conduct a validation study as part of their thesis would have the added benefit of ensuring students had some experience with primary data collection, which not all doctoral programmes require.

There is more work to be done to educate epidemiologists on the need for high-quality validation studies and the optimal design of these studies. Given the ubiquity of misclassification within epidemiologic research and the known implications it can have for study results, it is essential that we increase competence on this topic and encourage students and training programmes to consider validation studies as part of original research.

This work was supported in part by the US National Library of Medicine [R01LM013049] awarded to T.L.L.

M.P.F. and T.L.L. have both published a textbook on bias analysis for which they receive royalties. T.L.L. provides epidemiologic methods consulting services to the Amgen Methods Council, including services on the topic of quantitative bias analysis. He receives <$5000 per year in consulting fees and travel support.

1. Jurek AM, Greenland S, Maldonado G, Church TR. Proper interpretation of non-differential misclassification effects: expectations vs observations. Int J Epidemiol 2005;34:680–87.
2. Marshall JR, Hastrup JL. Mismeasurement and the resonance of strong confounders: uncorrelated errors. Am J Epidemiol 1996;143:1069–78.
3. Kim MY, Goldberg JD. The effects of outcome misclassification and measurement error on the design and analysis of therapeutic equivalence trials. Stat Med 2001;20:2065–78.
4. Jurek AM, Greenland S, Maldonado G. How far from non-differential does exposure or disease misclassification have to be to bias measures of association away from the null? Int J Epidemiol 2008;37:382–85.
5. Brenner H, Savitz DA. The effects of sensitivity and specificity of case selection on validity, sample size, precision, and power in hospital-based case-control studies. Am J Epidemiol 1990;132:81–92.
6. Marshall RJ. Validation study methods for estimating exposure proportions and odds ratios with misclassified data. J Clin Epidemiol 1990;43:941–47.
7. Lash TL, Fox M, Fink AK. Applying Quantitative Bias Analysis to Epidemiologic Data. New York: Springer, 2009.
8. Greenland S. Basic methods for sensitivity analysis of biases. Int J Epidemiol 1996;25:1107–16.
9. Greenland S, Lash T. Bias Analysis. In: Rothman K, Greenland S, Lash T (eds). Modern Epidemiology. 3rd edn. Philadelphia: Lippincott Williams and Wilkins, 2008:345–80.
10. Greenland S. Multiple bias modelling for analysis of observational data. J R Stat Soc Ser A 2005;168:1–25.
11. Ioannidis J. Why most published research findings are false. PLoS Med 2005;2:e124.
12. Ehrenstein V, Petersen I, Smeeth L, et al. Helping everyone do better: a call for validation studies of routinely recorded health data. Clin Epidemiol 2016;8:49–51.
13. Lash TL, Olshan AF. EPIDEMIOLOGY announces the ‘validation study’ submission category. Epidemiology 2016;27:613–14.
14. Holcroft CA, Spiegelman D. Design of validation studies for estimating the odds ratio of exposure-disease relationships when exposure is misclassified. Biometrics 1999;55:1193–201.
15. Wacholder S, Armstrong B, Hartge P. Validation studies using an alloyed gold standard. Am J Epidemiol 1993;137:1251.
16. Banack HR, Stokes A, Fox MP, et al. Stratified probabilistic bias analysis for body mass index–related exposure misclassification in postmenopausal women. Epidemiology 2018;29:604–13.

Email alerts

Citing articles via, looking for your next opportunity.

  • About International Journal of Epidemiology
  • Recommend to your Library

Affiliations

  • Online ISSN 1464-3685
  • Copyright © 2024 International Epidemiological Association
  • About Oxford Academic
  • Publish journals with us
  • University press partners
  • What we publish
  • New features  
  • Open access
  • Institutional account management
  • Rights and permissions
  • Get help with access
  • Accessibility
  • Advertising
  • Media enquiries
  • Oxford University Press
  • Oxford Languages
  • University of Oxford

Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide

  • Copyright © 2024 Oxford University Press
  • Cookie settings
  • Cookie policy
  • Privacy policy
  • Legal notice

This Feature Is Available To Subscribers Only

Sign In or Create an Account

This PDF is available to Subscribers Only

For full access to this pdf, sign in to an existing account, or purchase an annual subscription.


Why is data validation important in research?


Data collection and analysis are among the most important aspects of conducting research. High-quality data allows researchers to interpret findings accurately, serves as a foundation for future studies, and lends credibility to their research. Research is therefore scrutinized for any suspicion of fraud or data falsification, and even unintentional errors in data can be viewed as research misconduct. Data integrity is thus essential to protect both your reputation and the reliability of your study.

Given the nature of research and the sheer volume of data collected in large-scale studies, errors are bound to occur. One way to avoid “bad” or erroneous data is through data validation.

What is data validation?

Data validation is the process of examining the quality and accuracy of collected data before it is processed and analysed. It not only ensures accuracy but also confirms the completeness of your data. However, data validation is time-consuming and can delay analysis significantly. So, is this step really important?

Importance of data validation

Data validation is important for several aspects of a well-conducted study:

  • To ensure a robust dataset: The primary aim of data validation is to ensure an error-free dataset for further analysis. This is especially important if you or other researchers plan to use the dataset for future studies or to train machine learning models.
  • To get a clearer picture of the data: Data validation also includes ‘cleaning up’ the data, i.e., removing inputs that are incomplete, not standardized, or outside the range specified for your study. This process could also shed light on previously unknown patterns in the data and provide additional insights regarding the findings.
  • To get accurate results: If your dataset has discrepancies, it will impact the final results and lead to inaccurate interpretations. Data validation can help identify errors, thus increasing the accuracy of your results.
  • To mitigate the risk of forming incorrect hypotheses: Only inferences and hypotheses backed by solid data are considered valid. Data validation can thus help you form logical and well-reasoned hypotheses.
  • To ensure the legitimacy of your findings: The integrity of your study is often determined by how reproducible it is. Data validation can enhance the reproducibility of your findings.

Data validation in research

Data validation is necessary for all types of research. For quantitative research, which utilizes measurable data points, the quality of data can be enhanced by selecting the correct methodology, avoiding biases in the study design, choosing an appropriate sample size and type, and conducting suitable statistical analyses.

In contrast, qualitative research, which includes surveys or behavioural studies, is prone to incomplete and/or poor-quality data, both because survey participants may respond inaccurately and because observational studies are inherently subjective. It is therefore extremely important to validate data by incorporating clear and objective questions in surveys, bullet-proofing multiple-choice questions, and setting standard parameters for data collection.

Importantly, for studies that utilize machine learning approaches or mathematical models, validating the data model is as important as validating the data inputs. Thus, for the generation of automated data validation protocols, one must rely on appropriate data structures, content, and file types to avoid errors due to automation.
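As a minimal illustration of such automated checks, the sketch below (in R, with hypothetical variable names and plausibility limits) flags records that are incomplete, out of range, or not standardized before analysis; real protocols would encode the limits and permitted codes agreed in the study protocol.

    # Minimal automated data validation sketch (hypothetical variables and limits)
    records <- data.frame(
      id     = 1:5,
      age    = c(34, 29, NA, 210, 47),          # years; 210 is implausible
      sex    = c("F", "female", "M", "F", "M"), # mixed coding to be standardized
      weight = c(61.2, 80.5, 55.0, -3.1, 72.4)  # kg; negative value is invalid
    )

    # Completeness check: any missing values per record
    records$incomplete <- !complete.cases(records[, c("age", "sex", "weight")])

    # Range checks: flag values outside pre-specified plausible limits
    records$age_out_of_range    <- !is.na(records$age)    & (records$age < 0 | records$age > 120)
    records$weight_out_of_range <- !is.na(records$weight) & (records$weight <= 0 | records$weight > 300)

    # Standardization check: sex must use the agreed codes "F" or "M"
    records$sex_nonstandard <- !records$sex %in% c("F", "M")

    # Records failing any check are set aside for review rather than analysed as-is
    records[records$incomplete | records$age_out_of_range |
            records$weight_out_of_range | records$sex_nonstandard, ]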

Although data validation may seem like an unnecessary or time-consuming step, it is critical to the integrity of your study and well worth the effort. To learn more about how to validate data effectively, head over to Elsevier Author Services!


Evaluation of clinical prediction models (part 2): how to undertake an external validation study

  • Lucinda Archer , assistant professor in biostatistics 1 2 ,
  • Kym I E Snell , associate professor in biostatistics 1 2 ,
  • Joie Ensor , associate professor in biostatistics 1 2 ,
  • Paula Dhiman , senior researcher in medical statistics 3 ,
  • Glen P Martin , senior lecturer in health data sciences 4 ,
  • Laura J Bonnett , senior lecturer in medical statistics 5 ,
  • Gary S Collins , professor of medical statistics 3
  • 1 Institute of Applied Health Research, College of Medical and Dental Sciences, University of Birmingham, Birmingham B15 2TT, UK
  • 2 National Institute for Health and Care Research (NIHR) Birmingham Biomedical Research Centre, Birmingham, UK
  • 3 Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford, UK
  • 4 Division of Informatics, Imaging and Data Science, Faculty of Biology, Medicine and Health, University of Manchester, Manchester Academic Health Science Centre, Manchester, UK
  • 5 Department of Biostatistics, University of Liverpool, Liverpool, UK
  • Correspondence to: r.d.riley{at}bham.ac.uk (or @Richard_D_Riley on Twitter)
  • Accepted 13 September 2023

External validation studies are an important but often neglected part of prediction model research. In this article, the second in a series on model evaluation, Riley and colleagues explain what an external validation study entails and describe the key steps involved, from establishing a high quality dataset to evaluating a model’s predictive performance and clinical usefulness.

A clinical prediction model is used to calculate predictions for an individual conditional on their characteristics. Such predictions might be of a continuous value (eg, blood pressure, fat mass) or the probability of a particular event occurring (eg, disease recurrence), and are often in the context of a particular time point (eg, probability of disease recurrence within the next 12 months). Clinical prediction models are traditionally based on a regression equation but are increasingly derived using artificial intelligence or machine learning methods (eg, random forests, neural networks). Regardless of the modelling approach, part 1 in this series emphasises the importance of model evaluation, and the role of external validation studies to quantify a model’s predictive performance in one or more target population(s) for model deployment. 1 Here, in part 2, we describe how to undertake such an external validation study and guide researchers through the steps involved, with a particular focus on the statistical methods and measures required, complementing other existing work. 2 3 4 5 6 7 8 9 10 11 12 13 These steps form the minimum requirement for external validation of any clinical prediction models, including those based on artificial intelligence, machine learning or regression.

Summary points

External validation is the evaluation of a model’s predictive performance in a different (but relevant) dataset, which was not used in the development process

An external validation study involves five key steps: obtaining a suitable dataset, making outcome predictions, evaluating predictive performance, assessing clinical usefulness, and clearly reporting findings

The validation dataset should represent the target population and setting in which the model is planned to be implemented

At a minimum, the validation dataset must contain the information needed to apply the model (ie, to make predictions) and make comparisons to observed outcomes

A model’s predictive performance should be examined in terms of overall fit, calibration, and discrimination, in the overall population and ideally in key subgroups (eg, defined by ethnic group), as part of fairness checks

Calibration should be examined across the entire range of predicted values, and at each relevant time point for which predictions are being made, using a calibration plot including a smoothed flexible calibration curve

Where the goal is for predictions to direct decision making, a prediction model should also be evaluated for its clinical usefulness, for example, using net benefit and decision curves

Although a well calibrated model is ideal, a miscalibrated model might still have clinical usefulness

The TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) statement provides guidance on how to report external validation studies

What do we mean by external validation?

External validation is the evaluation of a model’s predictive performance in a different (but relevant) dataset, which was not used in the development process. 1 5 7 14 15 16 17 18 It does not involve refitting the model to compare how the refitted model equation (or its performance) changes compared to the original model. Rather, it involves applying a model as originally specified and then quantifying the accuracy of the predictions made. Five key steps are involved: obtaining a suitable dataset, making outcome predictions, evaluating predictive performance, assessing clinical usefulness, and clearly reporting findings. In this article, we outline these steps, using real examples for illustration.

Step 1: Obtaining a suitable dataset for external validation

The first step of an external validation study is obtaining a suitable, high quality dataset.

What quality issues should be considered in an external validation dataset?

A high quality dataset is more easily attained when initiating a prospective study to collect data for external validation, but this approach is potentially time consuming and expensive. The use of existing datasets (eg, from electronic health records) is convenient and often cheaper but is of limited value if the quality is low (eg, predictors are missing, outcome or predictor measurement methods do not reflect actual practice, or time of event is not recorded). Also, some existing datasets have a narrower case mix than the wider target population owing to specific entry criteria; for instance, UK Biobank is a highly selective cohort, restricted to individuals aged between 40 and 69—therefore, its use for external validation would leave uncertainty about a model’s validity for the wider population (including those aged <40 or >69).

To help judge whether an existing dataset is suitable for use in an external validation study, we recommend using the signalling questions within the Prediction model Risk Of Bias ASsessment Tool (PROBAST) domains for Participant Selection, Predictors and Outcome ( box 1 ). 19 20 Fundamentally, the dataset should be fit for purpose, such that it represents the target population, setting, and implementation of the model in clinical practice. For instance, it should have patient inclusion and exclusion criteria that match those in the target population and setting for use (eg, in the UK, prediction models intended for use in primary care might consider databases such as QResearch, Clinical Practice Research Datalink, Secure Anonymised Information Linkage, and The Health Improvement Network); measure predictors at or before the start point intended for making predictions; ensure measurement methods (for predictors and outcomes) reflect those to be used in practice; and have suitable follow-up information to cover the time points of interest for outcome prediction. It should also have a suitable sample size to ensure precise estimates of predictive performance (see part 3 of our series), 21 and ideally the amount of missing data should be small (see section on dealing with missing data, below).

Signalling questions within the first three domains of PROBAST (Prediction model Risk Of Bias ASsessment Tool) 19 20 that are important to consider when ensuring a dataset for external validation is fit for purpose

Domain 1: participant selection

Were appropriate data sources used—for example, cohort or randomised trial for prognostic prediction model research, or cross sectional study for diagnostic prediction model research?

Were all inclusions and exclusions of participants appropriate?

Domain 2: predictors

Were predictors defined and assessed in a similar way for all participants?

Were predictor assessments made without knowledge of outcome data?

Are all predictors available at the time the model is intended to be used?

Domain 3: outcome

Was the outcome determined appropriately?

Was a prespecified or standard outcome definition used?

Were predictors excluded from the outcome definition?

Was the outcome defined and determined in a similar way for all participants?

Was the outcome determined without knowledge of predictor information?

Was the time interval between predictor assessment and outcome determination appropriate?

What population and setting should be used for external validation of a prediction model?

Researchers should focus on evaluating a model’s target validity, 1 3 such that the validation study represents the target population and setting in which the model is planned to be implemented (otherwise it will have little value). The validation study might include the same populations and settings that were used to develop the model. However, it could be a deliberate intention to evaluate a model’s performance in a different target population (eg, country) or setting (eg, secondary care) than that used in model development. For this reason, multiple external validation studies are conducted for the same model, to evaluate performance across different populations and settings. For example, the predictive performance of the Nottingham Prognostic Index has been evaluated in many external validation studies. 22 The more external validations that confirm good performance of a model in different populations and settings, the more likely it will be useful in untested populations and settings.

Most external validation studies are based on data that are convenient (eg, already available from a previous study) or easy to collect locally. As such, they often only evaluate a model’s performance in a specific target setting or (sub)population. To help clarify the scope of the external validation, Debray et al 5 recommend that researchers quantify the relatedness between the development and validation datasets and make clear whether the focus of the external validation is on reproducibility or transportability. Reproducibility relates to when the external validation dataset is from a population and setting similar to that used for model development. Reproducibility is also examined when applying internal validation methods (eg, cross validation, bootstrapping) to the original development data during the model development, as discussed in our first paper. 1 Conversely, transportability relates to external validation in a deliberately different population or setting, for which model performance is often expected to change owing to possible differences in predictor effects and the participant case mix compared with the original development dataset (eg, when moving from a primary care to a secondary care setting).

What information needs to be recorded in the external validation dataset?

At a minimum, the external validation dataset must contain the information needed to apply the model (ie, to make predictions) and make comparisons to observed outcomes. This required information means that, for each participant, the dataset should contain the outcome of interest and the values of any predictors included in the model. For time-to-event outcomes, any censoring times (ie, end of follow-up) and the time of any outcome occurrence should also be recorded. Fundamentally, the outcome should be reliably measured, and the recorded predictor information must reflect how, and the moment when, the model will be deployed in practice. For example, for a model to be used before surgery to predict 28 day mortality after surgery, it should use predictors that are available before surgery, and not any perioperative or postoperative predictors.

Step 2: Making predictions from the model

Once the external validation dataset is finalised (ready for analysis), the next step is to apply the existing prediction model to derive predicted values for each participant. This step should not be done manually, but rather with appropriate (statistical) code that applies the model to each participant in the external validation dataset and computes predicted outcome values from their multiple predictor values. Some models, typically those based on black box artificial intelligence or machine learning methods, can by design only be made available directly (by the model developers) as a software object, or accessed via a specific system or server. Figure 1 illustrates the general format of using regression based prediction models to estimate outcome values or event probabilities (risks), and figure 2 and figure 3 provide two case studies.
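To make this step concrete, the sketch below applies a small logistic regression model to a validation dataset in R. The coefficients, predictor names, and data are hypothetical (they are not the GUSTO-I model of figure 2); in practice the published model would be applied exactly as specified, ideally using code or software supplied by the developers.

    # Hypothetical logistic regression model: intercept and coefficients as "published"
    coefs <- c(intercept = -4.2, age = 0.05, sbp = -0.01, diabetes = 0.6)

    # Validation data must contain every predictor in the model plus the observed outcome
    valid_data <- data.frame(
      age      = c(55, 72, 63),
      sbp      = c(130, 95, 150),
      diabetes = c(0, 1, 0),
      died30   = c(0, 1, 0)   # observed binary outcome
    )

    # Linear predictor for each participant, then predicted event probability
    lp <- coefs["intercept"] +
          coefs["age"]      * valid_data$age +
          coefs["sbp"]      * valid_data$sbp +
          coefs["diabetes"] * valid_data$diabetes
    valid_data$pred_prob <- plogis(lp)   # inverse logit: exp(lp) / (1 + exp(lp))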

Fig 1

Application of an existing prediction model to derive predicted values for each participant in the external validation dataset


Fig 2

Example of a binary outcome prediction model to be externally validated in new data. SD=standard deviation; IQR=interquartile range

Fig 3

Example of a time-to-event outcome prediction model to be externally validated in new data. SD=standard deviation; IQR=interquartile range

Figure 2 shows a prediction model developed using the US West subset of the GUSTO-I data (2188 individuals, 135 events), which estimates the probability of 30 day mortality after an acute myocardial infarction. 23 The logistic regression model includes eight predictors (with 10 predictor parameters, since an additional two parameters are required to capture the categorical Killip classification). For illustration of externally validating this model, we use the remaining data from the GUSTO-I dataset (with thanks to Duke Clinical Research Institute), 23 which contains all eight predictor variables and outcome information for 38 642 individuals.

Figure 3 shows a prediction model for calculating the five year probability of a recurrence in patients with a diagnosis of primary breast cancer. This survival model was developed for illustrative purposes in 1546 (node positive) participants (974 events) from the Rotterdam breast cancer study, 18 24 including eight predictors with 10 predictor parameters. External validation is carried out using data from the German Breast Cancer Study Group, which contains all eight predictor variables and outcome information for 686 patients (with 299 events). 18 25 26 27

Once the predictions have been calculated for each participant, it is good practice to summarise their observed distribution, for example, as a histogram, with summary statistics such as the mean and standard deviation. This presentation is illustrated for the two examples in figure 2 and figure 3 , separately for those individuals with and without the outcome event.
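Assuming predicted probabilities (pred_prob) and a binary outcome (y) have already been computed, one minimal way to produce such summaries in base R is sketched below.

    # Summarise predicted probabilities overall and by observed outcome group
    summary(pred_prob)
    tapply(pred_prob, y, function(p) c(mean = mean(p), sd = sd(p)))

    # Histograms of predictions for those with and without the outcome event
    par(mfrow = c(1, 2))
    hist(pred_prob[y == 1], main = "Outcome event",    xlab = "Predicted probability")
    hist(pred_prob[y == 0], main = "No outcome event", xlab = "Predicted probability")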

Step 3: Quantifying a model’s predictive performance

The third step is to quantify a model’s predictive performance in terms of overall fit, calibration, and discrimination. This step requires suitable statistical software, which is discussed in supplementary material S1, 28 29 30 31 32 and example code is provided at www.prognosisresearch.com/software .

Overall fit

Overall performance of a prediction model for a continuous outcome is quantified by R², the proportion of the total variance of outcome values that is explained by the model, with values closer to 1 preferred. Often this value is multiplied by 100, to give the percentage of variation explained. Generalisations of R² for binary or time-to-event outcomes have also been proposed, such as the Cox-Snell R² (this has a maximum value below 1), 33 Nagelkerke’s R² (a scaled version of the Cox-Snell R², which has a maximum value of 1), 34 O’Quigley’s R², 35 Royston’s R², 36 and Royston and Sauerbrei’s R²_D. 37 We particularly recommend reporting the Cox-Snell R² value, as it is needed in sample size calculations for future model development studies. 38

Another overall measure of fit is the mean squared error of predictions, which for continuous outcomes can be obtained on external validation by calculating the mean of the squared difference between participants’ observed outcomes and their estimated (from the model) outcomes. An extension of the mean square error for binary or time-to-event outcomes is the Brier score, 39 40 which compares observed outcomes and estimated probabilities. Overall fit performance estimates are shown for the two examples in table 1 .
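For a binary outcome, these overall fit measures can be computed directly from the observed outcomes and the model's predicted probabilities. The sketch below (assuming a vector y coded 0/1 and a vector p of predicted probabilities) is a minimal illustration of the Brier score and of validation-data versions of the Cox-Snell and Nagelkerke R².

    # Overall fit measures for a binary outcome (y = 0/1, p = predicted probabilities)
    p <- pmin(pmax(p, 1e-10), 1 - 1e-10)   # guard against predictions of exactly 0 or 1
    brier <- mean((y - p)^2)               # mean squared difference between outcomes and predictions

    # Cox-Snell R^2 in the validation data: compare the log likelihood of the model's
    # predictions with that of a null model using the overall event proportion
    n        <- length(y)
    ll_model <- sum(y * log(p) + (1 - y) * log(1 - p))
    ll_null  <- sum(y * log(mean(y)) + (1 - y) * log(1 - mean(y)))
    r2_coxsnell   <- 1 - exp(2 * (ll_null - ll_model) / n)
    r2_nagelkerke <- r2_coxsnell / (1 - exp(2 * ll_null / n))   # scaled to a maximum of 1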

Table 1. Predictive performance of example models when examined in the external validation population. Data are estimates (95% confidence intervals). AUROC=area under the receiver operating characteristic curve


Calibration plots

Calibration refers to the assessment of whether observed and predicted values agree. 41 For example, whether observed event probabilities agree with a model’s estimated event probabilities (risks). Although an individual’s event probability cannot be observed (we only know if they had the outcome event or not), we can still examine calibration of predicted and observed probabilities by deriving smoothed calibration curves fitted using all the individuals’ observed outcomes and the model’s estimated event probabilities ( fig 4 and fig 5 ). At external validation, some miscalibration between the predicted and observed values should be anticipated. The more different the validation dataset is compared with the development dataset (eg, in terms of population case mix, outcome event proportion, timing and measurement of predictors, outcome definition), the greater the potential for miscalibration. Similarly, models developed using low quality approaches (eg, small datasets, unrepresentative samples, unpenalised rather than penalised regression) have greater potential for miscalibration on external validation.

Fig 4

Calibration plots for binary outcome prediction model on external validation. Example shows probability of 30 day mortality after an acute myocardial infarction. Area below the dashed line=where the model’s risk estimates are too high; area above the dashed line=where the model’s risk estimates are too low; 10 circles=10 groups defined by tenths of the distribution of estimated risks; histograms at the bottom of graphs show the distribution of risk estimates for each outcome group

Fig 5

Calibration plots for time-to-event prediction model on external validation. Example shows time-to-event outcome: probability of five year recurrence after a diagnosis of primary breast cancer. Area below the dashed line=where the model’s risk estimates are too high; area above the dashed line=where the model’s risk estimates are too low; 10 circles=10 groups defined by tenths of the distribution of estimated risks; histograms at the bottom of graphs show the distribution of risk estimates for each outcome group

Calibration should be examined across the entire range of predicted values (eg, probabilities between 0 to 1), and at each relevant time point for which predictions are being made. Van Calster et al outline a hierarchy of calibration checks, 42 ranging from the overall mean to subgroups defined by patterns of predictor values. Fundamentally, calibration should be visualised graphically using a calibration plot that compares observed and predicted values in the external validation dataset, and the plot must include a smoothed flexible calibration curve (with a confidence interval) as fitted in the individual data using a smoother or splines. 42 43

Many researchers, however, do not report a calibration plot, 44 and those that do tend to only report grouped points rather than a calibration curve across the entire range. Grouping can be gamed (eg, by altering the number of groups), only reveals calibration in the ill defined groups themselves, and caps the calibration assessment at the average predicted value in the lowest and highest group. Hence, grouping enables researchers to (deliberately) obfuscate any meaningful assessment of miscalibration in particular ranges of predicted values (an example is shown below). A calibration curve provides a more complete picture. For continuous outcomes, the calibration plot and smoothed curve can be supplemented by presenting the pair of observed (y axis) against predicted (x axis) values for all participants. For binary or time-to-event outcomes, observed (y axis) event probabilities against the model’s estimated event probabilities (x axis) can be added for groups defined by, for example, 10ths or 20ths of the model’s predictions—again, to supplement (not replace) a smoothed calibration curve. 43

The calibration plot should be presented in a square format, and the axes should not be distorted (eg, by changing the scale of one of the axes, or having uneven spacing across the range of values) as this could hide miscalibration in particular regions. Researchers should also add the distribution of the predicted values underneath the calibration plot, to show the spread of predictions in the validation dataset, perhaps even for each of the event and non-event groups separately.
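A minimal sketch of such a plot for a binary outcome is given below, using a loess smoother for the flexible calibration curve (again assuming vectors y and p as before); dedicated routines, such as those in the R packages referenced for supplementary material S1, produce more refined versions with confidence intervals.

    # Calibration plot sketch for a binary outcome (y = 0/1, p = predicted probabilities)
    ord        <- order(p)
    smooth_fit <- loess(y ~ p)                      # flexible (smoothed) calibration curve

    plot(p[ord], predict(smooth_fit)[ord], type = "l",
         xlim = c(0, 1), ylim = c(0, 1), asp = 1,   # square format, undistorted axes
         xlab = "Predicted probability", ylab = "Observed proportion")
    abline(0, 1, lty = 2)                           # 45 degree line of ideal calibration

    # Optional grouped points (tenths of predicted risk) to supplement, not replace, the curve
    grp <- cut(p, quantile(p, probs = seq(0, 1, 0.1)), include.lowest = TRUE)
    points(tapply(p, grp, mean), tapply(y, grp, mean), pch = 19)

    rug(p)   # distribution of the predicted values beneath the plot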

If censoring occurs before the time point of interest in the validation dataset, then the true outcome event status is unknown for the censored individuals, which makes it difficult to directly plot the calibration of model predictions at the time point of interest. A common approach is to create groups (eg, 10 groups defined by tenths of the model’s estimated event probabilities), and to plot the model’s average estimated probability against the observed (1–Kaplan-Meier) event probability for each group. However, this approach is unsatisfactory, because the number of groups and the thresholds used to define them are arbitrary; hence, it only provides information on subjectively chosen groups of participants and does not provide granular information on calibration or miscalibration at specified values or ranges of predicted values. To manage this problem, a smoothed calibration curve can be plotted that examines calibration across the entire range of predicted values (analogous to the calibration plot for binary outcomes) at a particular time point. This approach can be achieved using pseudo-observations (or pseudo-values), 45 46 47 48 or flexible adaptive hazard regression or a Cox model using restricted cubic splines. 49 More details are provided in supplementary material S2.

Calibration plots and curves for the two examples are shown in figure 4 and figure 5 . The calibration plot for the binary outcome example ( fig 4 ) shows good calibration for event probabilities between 0 and 0.15. For calculated event probabilities beyond 0.2, the model overestimates the probability of mortality, as revealed by the smoothed calibration curve lying below the diagonal line. Had only grouped points been included (and not a smoothed curve across individuals), the extent of the miscalibration in the range of model predictions above 0.2 would be hidden. For example, consider if the calibration had been checked for 10 groups based on tenths of predicted values (see 10 circles in fig 4 ). Because most of the data involve patients with model predictions less than 0.2, nine of the 10 groups fall below predictions of 0.2. Further, the model’s estimated probabilities in the upper group have a mean of about 0.4, and information above this value is completely lost, incidentally where the miscalibration is most pronounced based on the smoothed curve across all individuals. Therefore, figure 4 demonstrates our earlier point that categorising into groups loses and hides information, and that the calibration curve is essential to show information across the whole range of predictions, including values close to 1.

Although a well calibrated model is ideal, a miscalibrated model might still have clinical usefulness. For example, in figure 4 , miscalibration is most pronounced in regions where the model’s estimated mortality risks are very high (eg, >0.3), with actual observed risks about 0.05 to 0.3 lower. However, in this setting, whether a patient is deemed to have high or very high mortality risks is unlikely to change clinical decisions for that patient. By contrast, in regions where clinical risk thresholds are more relevant (eg, predictions ranging from 0.05 to 0.1), calibration is very good and so the model might still be useful in clinical practice despite the miscalibration at higher risks (see step 4).

The calibration plot for the time-to-event outcome example shows that the predictions are systematically lower than the observed event risk at five years ( fig 5 ), with most of the calibration curve lying above the diagonal. In particular, for predictions between 0.1 and 0.8, the model appears to systematically underestimate the probability of recurrence within five years of a breast cancer diagnosis.

The calibration curve’s confidence interval is important to reveal the precision of the calibration assessment. It also quantifies the uncertainty of the actual risk in a group of individuals defined by a particular predicted value. For example, for the group of individuals with an estimated risk of 0.8 in figure 5, the 95% confidence interval around the curve suggests that this group’s actual risk is likely between 0.78 and 1.

Quantifying calibration performance

Calibration plots with calibration curves should also be supplemented with statistical measures that summarise the calibration performance observed in the plot. 50 Calibration should not be assessed using the Hosmer-Lemeshow test, or related tests such as the Nam-D’Agostino test or Gronnesby-Borgan test, because these require arbitrary grouping of participants that, along with sample size, can influence the calculated P value, and they do not quantify the actual magnitude or direction of any miscalibration. Rather, calibration should be quantified by the calibration slope (ideal value of 1), calibration-in-the-large (ideal value of 0) and—for binary or time-to-event outcomes—the observed/expected (O/E) ratio (ideal value of 1) or conversely the E/O ratio. A detailed explanation for each of these measures is given in supplementary material S3. Estimates of these measures should be reported alongside confidence intervals, and derived for the dataset as a whole and, ideally, also for key subgroups (eg, different ethnic groups, regions). To quantify overall miscalibration based on the calibration curve, the estimated or integrated calibration index can be used, which respectively measure an average of the squared or absolute differences between the estimated calibration curve and the 45 degree (diagonal) line of ideal calibration. 51 52
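For a binary outcome, these summary measures can be estimated with a few lines of base R, as sketched below (y and p as before, with 0 < p < 1); time-to-event outcomes require censoring-aware analogues.

    # Calibration measures for a binary outcome (y = 0/1, p = predicted probabilities)
    lp <- qlogis(p)   # linear predictor (log odds) implied by the model's predictions

    # Observed/expected ratio: observed events vs events expected under the model (ideal = 1)
    oe_ratio <- sum(y) / sum(p)

    # Calibration-in-the-large: intercept when the linear predictor is fixed as an offset (ideal = 0)
    citl <- coef(glm(y ~ offset(lp), family = binomial))[1]

    # Calibration slope: slope of the linear predictor when refitted to the validation data (ideal = 1)
    cal_slope <- coef(glm(y ~ lp, family = binomial))[2]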

Calibration measures are summarised in table 1 for the two examples, which confirm the visual findings in the calibration plots. For example, the binary outcome prediction model has a calibration slope of 0.72 (95% confidence interval 0.70 to 0.75), suggesting that predictions are too extreme; this is driven by the overprediction in those with estimated event probabilities above 0.2 ( fig 4 ). The time-to-event prediction model has an O/E ratio of 1.27, suggesting that the observed event probabilities are systematically higher than the model’s estimated values, which is seen by the smoothed calibration curve lying mainly above the diagonal line ( fig 5 ). Such situations could motivate model updating to improve calibration performance. 53

The results also emphasise how one calibration measure alone does not provide a full picture. For example, the calibration slope is close to 1 for the time-to-event prediction model (1.10, 95% confidence interval 0.88 to 1.33), but there is clear miscalibration owing to the O/E ratio of 1.27 (1.22 to 1.32). Conversely, O/E ratio is 1.01 (1.01 to 1.02) in the binary outcome example, suggesting good overall agreement, but the calibration slope is 0.72 (0.70 to 0.75) owing to the overestimation of high risks ( fig 4 ). Hence, all measures of calibration should be reported together and—fundamentally—alongside a calibration plot with a smoothed calibration curve.

Quantifying discrimination performance

Discrimination refers to how well a model’s predictions separate between two groups of participants: those who have (or develop) the outcome and those who do not have (or do not develop) the outcome. Therefore, discrimination is only relevant for prediction models of binary and time-to-event outcomes, and not continuous outcomes.

Discrimination is quantified by the concordance (c) statistic (index), 11 54 where a value of 1 indicates the model has perfect discrimination, while a value of 0.5 indicates the model discriminates no better than chance. For binary outcomes, it is equivalent to the area under the receiver operating characteristic curve (AUROC). It gives the probability that, for any randomly selected pair of participants, one with and one without the outcome, the model assigns a higher probability to the participant with the outcome. What constitutes a high c statistic is context specific; in some fields where strong predictors exist, a c statistic of 0.8 might be considered high, but in others where prediction is more difficult, values of 0.6 might be deemed high. The c statistic also depends on the case mix distribution. Presenting an ROC curve over and above the c statistic (AUROC) has very little, if any, benefit. 55 56 Similarly, traditional measures of test accuracy such as sensitivity and specificity are not as relevant for prediction models, because the focus should be on the overall performance of the model’s predictions without forcing thresholds to define so-called high and low groups. If thresholds are important for clinical decision making, then clinical utility should be assessed at those thresholds, for example, using net benefit and decision curves (see step 4).
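For a binary outcome, the c statistic can be computed without add-on packages via its rank (Mann-Whitney) formulation, as in the minimal sketch below (y and p as before).

    # c statistic (AUROC) via the Mann-Whitney rank formulation (y = 0/1, p = predictions)
    n1 <- sum(y == 1)                # participants with the outcome
    n0 <- sum(y == 0)                # participants without the outcome
    r  <- rank(p)                    # ranks of the predictions (ties get mid-ranks)
    c_stat <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)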

Generalisations of the c statistic have been proposed for time-to-event models, most notably Harrell’s C index, but many other variants are available, including Efron’s estimator, Uno’s estimator, Gönen and Heller’s estimator, and case mix adjusted estimates. 54 57 Royston’s D statistic is another measure of discrimination, 37 interpreted as the log hazard ratio comparing two equally sized groups defined by dichotomising the (assumed normally distributed) linear predictor from the developed model at the median value. Higher values for the D statistic indicate greater discrimination.

Harrell’s C index and Royston’s D statistic measure discrimination over all time points up to a particular time point (or end of follow-up). However, usually an external validation study aims to examine a model’s predictive performance at a particular time point, and so time dependent discrimination measures are more informative, such as an inverse probability of censoring weighted estimate of the time dependent area under the ROC curve for the time point of interest (t). 58

Discrimination performance for the two examples is shown in table 1 and is promising in both cases. For the binary outcome example, 80.8% of pairs are concordant (c statistic 0.81, 95% confidence interval 0.80 to 0.82). The time-to-event example has a Harrell’s C index of 0.67 (0.64 to 0.70) and a time dependent AUROC of 0.71 (0.65 to 0.76), suggesting that the model’s discrimination at five years is slightly higher than the discrimination performance averaged across all time points.

Step 4: Quantifying clinical utility

Where the goal is for predictions to direct decision making, a prediction model should also be evaluated for its overall benefit on participant and healthcare outcomes; also known as its clinical utility. 16 59 60 For example, if a model estimates a patient’s event probability above a certain threshold value (eg, >0.1), then the patient and their healthcare professionals could decide on some clinical action (eg, above current clinical care), such as use of a particular treatment, monitoring strategy, or lifestyle change. When externally validating the model, the clinical utility of this approach can be quantified by the net benefit, a measure that weighs the benefits (eg, improved patient outcomes) against the harms (eg, worse patient outcomes, additional costs). 61 62 It requires the researchers to choose a probability (risk) threshold, at or above which there will be a clinical action. The threshold should be chosen before a clinical utility analysis, based on discussion with clinical experts and patient focus groups, and indeed there might be a range of thresholds of interest, because a single threshold is unlikely to be acceptable for all clinical settings and individuals. Then, a decision curve can be used to display a model’s net benefit across the range of chosen threshold values, and compared with other decision making strategies (eg, other models, or options such as treat all and treat none). Further explanation is provided in supplementary material S4, and more detailed guidance is provided in previous tutorials. 61 62 63
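A minimal sketch of the net benefit calculation and a simple decision curve for a binary outcome is shown below (y and p as before); the threshold range is illustrative only and would in practice be prespecified with clinical and patient input.

    # Net benefit of the "treat per model" strategy at threshold pt (y = 0/1, p = predictions)
    net_benefit <- function(y, p, pt) {
      n  <- length(y)
      tp <- sum(p >= pt & y == 1)   # treated and had the event (true positives)
      fp <- sum(p >= pt & y == 0)   # treated unnecessarily (false positives)
      tp / n - fp / n * pt / (1 - pt)
    }

    # Decision curve: net benefit of the model vs "treat all" across a range of thresholds
    thresholds <- seq(0.01, 0.5, by = 0.01)
    nb_model   <- sapply(thresholds, function(t) net_benefit(y, p, t))
    nb_all     <- sapply(thresholds, function(t) mean(y) - (1 - mean(y)) * t / (1 - t))

    plot(thresholds, nb_model, type = "l", xlab = "Threshold probability", ylab = "Net benefit")
    lines(thresholds, nb_all, lty = 2)    # treat all
    abline(h = 0, lty = 3)                # treat none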

We apply this clinical utility step to the two examples in figure 6 and figure 7, and show results across the entire 0 to 1 probability range for illustration, although in practice a narrower range would be predetermined by clinical and patient groups, as mentioned. Figure 6 shows that the binary outcome model has a positive net benefit for all thresholds below 0.44 (the range in which clinical thresholds are likely to fall in this setting), with greater net benefit than the treat all strategy at all thresholds. Figure 7 shows that the time-to-event outcome model has a positive net benefit for thresholds up to 0.79, but does not provide added benefit over the treat all strategy if key thresholds fall below 0.38.

Fig 6

Decision curves showing net benefit for binary outcome prediction model across a range of threshold probabilities that define when some clinical action (eg, treatment) is warranted. Example shows probability of 30 day mortality after an acute myocardial infarction. Threshold probability=risk needed to initiate a particular treatment or clinical action; positive values of net benefit indicate clinical utility; treat all=strategy of initiating the particular treatment (or clinical action) for all patients regardless of their estimated risk; treat none=strategy of not initiating the treatment (or clinical action) for any patient; treat per model=strategy of initiating the treatment (or clinical action) for those patients whose estimated risk is at or above the threshold probability. An interactive version of this graphic is available at: https://public.flourish.studio/visualisation/15175981/

Fig 7

Decision curves showing net benefit for time-to-event prediction model across a range of threshold probabilities that define when some clinical action (eg, treatment) is warranted. Example shows probability of five year recurrence after a diagnosis of primary breast cancer. Threshold probability=risk needed to initiate a particular treatment or clinical action; positive values of net benefit indicate clinical utility; treat all=strategy of initiating the particular treatment (or clinical action) for all patients regardless of their estimated risk; treat none=strategy of not initiating the treatment (or clinical action) for any patient; treat per model=strategy of initiating the treatment (or clinical action) for those patients whose estimated risk is at or above the threshold probability. An interactive version of this graphic is available at: https://public.flourish.studio/visualisation/15162451/

Step 5: Clear and transparent reporting

The Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement provides guidance on how to report studies validating a multivariable prediction model. 50 64 For example, the guidance recommends specifying all measures calculated to evaluate model performance and, at a minimum, reporting calibration (graphically and quantified) and discrimination, along with corresponding confidence intervals. With the introduction of new sample size criteria for both developing and validating prediction models, 21 38 65 66 67 68 69 70 71 we also recommend reporting either the Cox-Snell or Nagelkerke R², and the distribution of the linear predictor (eg, histograms for those with and without the outcome event, as shown in fig 2 and fig 3, and at the base of the plots in fig 4 and fig 5). These additional reporting recommendations not only provide information on the performance of the model but also provide researchers with key information needed to estimate sample sizes for further external validation, model updating, or when developing new models. 38 65 66 68

Special topics

Dealing with missing data

The external validation dataset might contain missing data in some of the predictor variables or the outcome. A variety of methods are available to deal with missing data, including analysis of complete cases, single imputation (eg, mean or regression imputation), and multiple imputation. Handling of missing data during external validation is an unresolved topic and an area of active research. 72 73 74 Occasionally the model developers will specify how to deal with missing predictor values during model deployment; in that situation, the external validation should primarily assess that recommended strategy. However, most existing models do not specify or even consider how to deal with missing predictor values at deployment, and an external validation might then need to examine a range of plausible options, such as single or multiple imputation.

Checking subgroups and algorithmic fairness

An important part of external validation is to check a model’s predictive performance in key clusters (eg, countries, regions) and subgroups (eg, defined by sex, ethnic group), for example, as part of examining algorithm fairness. This is discussed in more detail in paper 1 of our series. 1

Multiple external validation studies and individual participant data meta-analyses

Where interest lies in a model’s transportability to multiple populations and settings, multiple external validation studies are often needed. 5 75 76 77 Then, not only is the overall (average) model performance of interest, but also the heterogeneity in performance across the different settings and populations. 5 Heterogeneity can be examined through data sharing initiatives and by using individual participant data meta-analyses, as described elsewhere. 4 78

Competing events

Sometimes competing events can occur that prevent a main event of interest from being observed, such as death before a second hip replacement. In this situation, if a model’s predictions are to be evaluated in the context of the real world (ie, where the competing event will reduce the probability of the main event from occurring), then the predictive performance estimates must account for the competing event in the statistical analysis (eg, when deriving calibration curves). This topic is covered in a related paper in The BMJ on validation of models in competing risks settings. 9

Conclusions

External validation studies should be highly valued by the research community. A model is never completely validated, 3 79 because its predictive performance could change across target settings, populations, and subgroups, and might deteriorate over time owing to improvements in care (leading to calibration drift). Thus, external validation studies should be viewed as a necessary and continual part of evaluating a model’s performance. In the next article in this series, we describe how to calculate the sample size required for such studies. 21

Data availability statement

The GUSTO-I dataset is freely available, for which we kindly acknowledge the Duke Clinical Research Institute. It can be loaded into R by typing: load(url('https://hbiostat.org/data/gusto.rda')).

Contributors: RDR and GSC conceived the paper and produced the first draft. LA provided the examples. All authors provided comments and suggested changes, which were then addressed by RDR and GSC. RDR revised the article, with feedback from GSC and then all authors. RDR is guarantor. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.

Funding: This paper presents independent research supported (for RDR, LA, KIES, JE, PD, and GSC) by an Engineering and Physical Sciences Research Council (EPSRC) grant for “Artificial intelligence innovation to accelerate health research” (EP/Y018516/1); a National Institute for Health and Care Research (NIHR)-Medical Research Council (MRC) Better Methods Better Research grant (for RDR, LA, KIES, JE, and GSC) (MR/V038168/1), and the NIHR Birmingham Biomedical Research Centre at the University Hospitals Birmingham NHS Foundation Trust and the University of Birmingham (for RDR, LA, JE, and KIES). The views expressed are those of the author(s) and not necessarily those of the NHS, NIHR, or UK Department of Health and Social Care. GSC was also supported by Cancer Research UK (programme grant C49297/A27294). The funders had no role in considering the study design or in the collection, analysis, interpretation of data, writing of the report, or decision to submit the article for publication.

Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/disclosure-of-interest/ and declare: funding from the EPSRC, NIHR-MRC, NIHR Birmingham Biomedical Research Centre, and Cancer Research UK for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.

Patient and public involvement: Patients or the public were not involved in the design, or conduct, or reporting, or dissemination of our research.

Provenance and peer review: Not commissioned; externally peer reviewed.

This is an Open Access article distributed in accordance with the terms of the Creative Commons Attribution (CC BY 4.0) license, which permits others to distribute, remix, adapt and build upon this work, for commercial use, provided the original work is properly cited. See: http://creativecommons.org/licenses/by/4.0/ .

Software cited in the text includes:

  • Harrell FE Jr. rms: Regression Modeling Strategies. R package version 6.7-0. https://CRAN.R-project.org/package=rms
  • Gerds T, Ohlendorff J, Ozenne B. riskRegression: Risk Regression Models and Prediction Scores for Survival Analysis with Competing Risks. R package version 2023.03.22. https://CRAN.R-project.org/package=riskRegression
  • Gerds TA. pec: Prediction Error Curves for Risk Prediction Models in Survival Analysis. R package version 2023.04.12. https://CRAN.R-project.org/package=pec

Validation studies for population-based intervention coverage indicators: design, analysis, and interpretation

Melinda K Munos

1 Institute for International Programs, Johns Hopkins University Bloomberg School of Public Health, Baltimore, Maryland, USA

Ann K Blanc

2 Population Council, New York, New York, USA

Emily D Carter

Thomas P Eisele

3 Center for Applied Malaria Research and Evaluation, Tulane University School of Public Health and Tropical Medicine, New Orleans, Louisiana, USA

Steve Gesuale

4 Independent consultant, Bend, Oregon, USA

Joanne Katz

5 Department of International Health, Johns Hopkins University Bloomberg School of Public Health, Baltimore, Maryland, USA

Tanya Marchant

6 Department of Disease Control, London School of Hygiene & Tropical Medicine, London, UK

Cynthia K Stanton

7 Stanton-Hill Research, LLC, Moultonborough, New Hampshire, USA

Harry Campbell

8 Centre for Population Health Sciences, University of Edinburgh, Edinburgh, Scotland, UK

Abstract

Population-based intervention coverage indicators are widely used to track country and program progress in improving health and to evaluate health programs. Indicator validation studies that compare survey responses to a “gold standard” measure are useful to understand whether the indicator provides accurate information. The Improving Coverage Measurement (ICM) Core Group has developed and implemented a standard approach to validating coverage indicators measured in household surveys, described in this paper.

The general design of these studies includes measurement of true health status and intervention receipt (gold standard), followed by interviews with the individuals observed, and a comparison of the observations (gold standard) to the responses to survey questions. The gold standard should use a data source external to the respondent to document need for and receipt of an intervention. Most frequently, this is accomplished through direct observation of clinical care, and/or use of a study-trained clinician to obtain a gold standard diagnosis. Follow-up interviews with respondents should employ standard survey questions, where they exist, as well as alternative or additional questions that can be compared against the standard household survey questions.

Indicator validation studies should report on participation at every stage, and provide data on reasons for non-participation. Metrics of individual validity (sensitivity, specificity, area under the receiver operating characteristic curve) and population-level validity (inflation factor) should be reported, as well as the percent of survey responses that are “don’t know” or missing. Associations between interviewer and participant characteristics and measures of validity should be assessed and reported.

Conclusions

These methods allow respondent-reported coverage measures to be validated against more objective measures of need for and receipt of an intervention, and should be considered together with cognitive interviewing, discriminative validity, or reliability testing to inform decisions about which indicators to include in household surveys. Public health researchers should assess the evidence for validity of existing and proposed household survey coverage indicators and consider validation studies to fill evidence gaps.

Population-based measures of intervention coverage, defined as the proportion of individuals in need of a service or intervention who actually receive the service or intervention, are used at the country and global level to track progress in delivering high impact interventions to populations in need [ 1 ] and to evaluate the impact of large-scale health programs. Nationally representative household surveys implemented by The Demographic and Health Surveys (DHS) Program [ 2 ] and the Multiple Indicator Cluster Survey Program (MICS) supported by UNICEF [ 3 ] have been providing population-based estimates of intervention coverage in low and middle income country (LMIC) settings since the 1980s and 1990s, respectively.

The validity of survey questions is often investigated by assessing whether questions represent the items of interest (content validity), examining associations between survey items that are expected to be correlated or not correlated (construct validity), assessing whether survey data agree with an objective gold standard (criterion validity), or examining the cognitive processes that respondents use when answering survey questions (cognitive interviewing) [ 4 , 5 ]. In addition, assessments of missing information, patterns of missingness, and reliability can provide information on the quality of data elicited by survey questions.

Historically, validation research on DHS and MICS MNCH intervention coverage indicators has focused primarily on content validity, construct validity, and cognitive interviewing, as well as assessments of data quality [ 6 - 10 ]. All of these methods provide valuable information about survey questions and data; however, they cannot quantify the extent to which survey-based measures of intervention coverage differ from the objective “truth” that they seek to measure (criterion validity). In addition, assessing construct validity of intervention coverage indicators can be challenging, as the correlation between these questions and other survey items may be unknown or may vary between settings. For these reasons, studies comparing survey responses to a gold standard measure of what actually happened are essential to understanding whether the indicator provides accurate information about country or program performance, or whether the information provided is not useful or even misleading. Until recently there have been few such validation studies for coverage indicators measured in household surveys.

The Improving Coverage Measurement (ICM) Core Group, first under the Child Health Epidemiology Reference Group (CHERG) and then under a separate grant from the Bill & Melinda Gates Foundation, has developed and implemented a standard approach to validating coverage indicators measured in household surveys by comparing them against a gold standard. The results of these studies are published in this Research Theme on improving coverage measures for Maternal, Neonatal and Child Health (MNCH) and in a previous Collection published in 2013 [ 11 ].

This paper aims to describe the approach used by ICM to validate coverage indicators in household surveys, including considerations around the study design and methods, analysis, presentation and interpretation of the results, and strengths and limitations of these studies. This approach has been developed and implemented for maternal, newborn, and child health coverage indicators, but these methods can be extended to other areas, such as nutrition and reproductive health coverage indicators.

Following reporting guidelines promotes transparency, quality, and completeness of reporting and facilitates information/data synthesis for the purpose of systematic reviews and meta-analyses. Validation study methods and results should present data described in the Standards for Reporting Diagnostic Accuracy (STARD) guidelines (see http://www.equator-network.org/reporting-guidelines/stard ) [ 12 ]. Further guidance on reporting of studies on validity of measures for prediction of a reference value can be found in the Cochrane Collaboration Diagnostic Test Accuracy Working Group guidelines (see http://srdta.cochrane.org and Cochrane Handbook at http://srdta.cochrane.org/handbook-dta-reviews ).

General study design

Studies validating coverage measures seek to compare coverage estimates obtained from representative household surveys with “true” coverage. Methods for coverage indicator validation studies are based on methods used to evaluate the validity of diagnostic tests. These approaches assess the ability of coverage measurement tools (typically, household survey questionnaires) to correctly classify an individual’s need for and receipt of an intervention by comparing the data collected using the tool against a gold standard measure. The general study design for validation of coverage indicators collected through household surveys includes 1) direct observation or measurement of true health status and intervention delivery among a sample of individuals (gold standard), 2) a recall period, preferably similar to that allowed for recall of the intervention in household surveys, 3) survey interviews with the individuals observed in Step 1, using questions worded as in the household survey, and 4) a comparison of the observations (gold standard) to the responses to the survey questions ( Figure 1 ).

Figure 1. Overall design of Improving Coverage Measurement (ICM) coverage validation studies.

Gold standard: direct observation or measurement of true health status and intervention delivery

The ideal gold standard would be an objective measurement of the “truth” – ie, the true health status of an individual and whether they received the intervention in question. However, any gold standard will be subject to some degree of measurement error, depending on the intervention in question, the context, and the method used to obtain the gold standard. As a result, we cannot recommend a single gold standard, but rather recommend that investigators take the following considerations into account when selecting a gold standard:

  • Measurement error: What are the potential sources and degree of measurement error in the gold standard measure? Can they be mitigated or quantified, eg, through improved training or standardization of data collection practices? If a biomarker is used, has it been validated for this purpose?
  • Bias: Is the measurement error likely to be differential according to relevant variables, such as whether the intervention was received, participant health status/diagnosis, or education or other socio-demographic characteristics?
  • Effect on reporting: How likely is the gold standard (including the participant enrollment process, if any) to have an effect on participant reporting or recall of health status or intervention receipt?
  • Feasibility: How feasible is the gold standard to implement across the required sample size, within a reasonable amount of time and within the available budget?

The primary approach for obtaining gold standard measures for receipt of an intervention is direct observation of clinical care. This observation is often conducted by a trained clinician, such as a nurse or nurse/midwife [ 13 , 14 ]. Provider medical records may also be considered the gold standard for some validation studies, if of sufficiently high quality [ 15 ], or may be used to complement other gold standard methods [ 16 ]. Validation studies for careseeking indicators may employ GPS tracking, barcoded ID cards, physical tokens, or other methods for documenting patient visits to health providers in the study area [ 16 ]. Examples of gold standards in this Collection include direct observation of clinical care, examination of health facility records, use of GPS and bar-coded identification cards to track care-seeking, and triangulation of several different measures.

For some coverage indicators, the determination of need (ie, the denominator) is relatively straightforward and is based on factors such as age, sex, or pregnancy status. These characteristics have not typically been validated, although reporting of age and pregnancy status are both prone to some degree of reporting error. For coverage indicators for curative interventions, however, the denominator is based on an individual’s diagnosis or reported signs and symptoms, and should be validated against a gold standard. Studies that seek to validate measures of need should aim to enroll individuals both with and without the condition in order to obtain precise measures of specificity. Where the assessment of need is complex (eg, diagnosis of pneumonia), it is desirable to employ study-trained clinicians to diagnose and manage enrolled individuals in order to ensure an accurate gold standard [ 17 ], but trained lay workers may also be used for less complex assessments, such as measuring birthweight [ 18 ].

Validation studies may also use survey interviews conducted as soon as possible after the provision of the intervention as the gold standard, under the assumption that limiting the recall period will reduce error. This approach differs from other gold standards in that it does not make use of a data source external to the survey respondent. These gold standard interviews, coupled with later follow-up interviews, are useful for assessing indicator reliability and the erosion of recall over time. However, this approach is not well-suited to evaluating the overall accuracy of an indicator, as the respondent may provide inaccurate responses in the gold standard interview. For example, the respondent may not understand, or may misunderstand, the survey question(s), leading to inaccurate responses. For this reason it is particularly important that the interviewer be well-trained, but even so, responses may be inaccurate. Respondents also may not know which intervention they received. For example, a woman may receive a uterotonic injection after childbirth but may not be told the nature or purpose of the injection. Finally, respondents may not be able to report accurately on their need for an intervention, because they were not given a diagnosis, they misunderstood their diagnosis, or they were mis-diagnosed.

If there are concerns that the gold standard for a particular study could change participant recall or behavior, the study may incorporate control groups to assess this effect and adjust for it. For example, a careseeking validation study that gave women phones with a GPS-tracking app to detect careseeking visits, with frequent follow-up visits to enrolled households, used two control groups to assess the effect of the phones and the follow-up visits. The first control group did not receive phones but still had monthly follow-up visits, while the second group did not receive phones and had only one follow-up visit.

Recall period

Following the direct observation or measurement of health status and/or intervention delivery, time is allowed to elapse before the follow-up visit to administer the survey questionnaire. In general, individuals should be interviewed within the range of the relevant recall period employed by the household survey. Where possible, it may be beneficial to build a greater range of recall periods into the study design to assess recall decay and to determine the optimal recall period for a specific intervention. Where long recall periods (eg, 1 year or more) are used, it may also be useful to begin by evaluating a shorter recall period, and use these interim results to decide whether evaluation of a longer recall period is warranted. This was the approach used by a maternal indicator validation study in Kenya [ 14 , 19 ].

Survey interviews

When designing the survey tools for coverage validation studies, attention should be paid to the wording of survey questions and to the accuracy of translation from one language to another. Survey questions (and their translations, to the extent possible) should match other existing standardized surveys. Where this is not possible – for example, where questions differ between household surveys – the intent and structure of the questions should be preserved. At the time the survey interviews are conducted, other questions may be added with the aim of improving the validity of the indicator. These may include testing new questions, prompts, or use of recall aids such as videos, photographs, or drawings of activities or treatments related to the intervention. However, it is important that these additional questions or tools only be administered after the original survey questions to avoid biasing responses. Alternately, one could randomize the order of the prompts/new questions to be able to examine whether a modification improves validity relative to the original survey question. However, this would require a larger sample size. Assessment of wording and cognition around survey questions may also be conducted [ 5 , 20 ] or additional probing may be used to identify reasons for incorrect responses. Examples of such studies are the testing of “pill board” photographs to improve caregiver recall of treatment for symptoms of acute respiratory infection, testing of videos of severe pneumonia to improve the accuracy of reporting of acute lower respiratory infection [ 17 ], and qualitative in-depth interviews and focus group discussions with respondents and data collectors on recall of birth outcomes in Nepal [ 18 ] and labor and delivery experiences in Kenya [ 19 ]. The survey tool should also collect information on demographic or other covariates to assess whether need for or receipt of interventions are more accurately reported among some subgroups than others, or whether interviewer characteristics influence reporting.

Sample size

The sample size for indicator validation studies is generally determined based on the desired level of precision for sensitivity and specificity estimates. A level of precision of approximately 5-6 percentage points is usually sufficient for drawing inferences regarding the indicator while maintaining a reasonable sample size (approximately 200-400 individuals with and without the intervention). If multiple arms are included in the study – for example if different recall periods or question formulations are being tested – the sample size should ensure an adequate level of precision in each arm.

The sample size and/or sampling design should also take into account the prevalence of the intervention or behavior assessed. For example, if validating maternal recall of delivery by cesarean section, the study must ensure that sufficient deliveries by cesarean section (rare in many low income settings) are observed to obtain the desired level of precision. This can be accomplished either by increasing the total sample size for the gold standard to ensure that enough cases with and without the intervention are observed (eg, [ 21 ]), or by purposively enrolling individuals who have and have not received the intervention (eg, [ 17 ]). The former approach may be more appropriate when a study is validating multiple indicators, only some of which are rare.

In addition to precision and intervention prevalence, the sample size should account for non-response, loss to follow up, and any design effect if cluster sampling is used; the magnitude of these adjustments will vary according to the setting, indicator, and time interval between enrollment and follow up interview.
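To make the precision reasoning above concrete, the following minimal sketch (in Python, with purely illustrative planning values rather than figures from any ICM study) shows how a target confidence-interval half-width for sensitivity or specificity translates into the number of gold-standard positive (or negative) individuals required, before inflating for non-response, loss to follow-up, and design effect.

```python
from math import ceil
from scipy.stats import norm

def n_for_proportion(expected, half_width, confidence=0.95):
    """Approximate sample size so that a proportion (e.g. sensitivity)
    is estimated within +/- half_width, using the normal approximation."""
    z = norm.ppf(1 - (1 - confidence) / 2)
    return ceil(z**2 * expected * (1 - expected) / half_width**2)

# Hypothetical planning values: expected sensitivity of 0.80 and a desired
# precision of +/- 5.5 percentage points.
n_true_positive = n_for_proportion(expected=0.80, half_width=0.055)
print(n_true_positive)  # ~204 individuals who truly received the intervention

# The same calculation applies to specificity among true negatives; both
# counts are then inflated for non-response, loss to follow-up and any
# design effect from cluster sampling.
```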

Ethical considerations

The gold standard approaches described here, as well as the survey interviews, are considered human subjects research and require informed consent from clients and providers and ethical approval from Institutional Review Boards. Ethical issues to consider in the design of coverage validation studies include whether and how to intervene if an observer witnesses poor quality or harmful clinical care; the education or skill level of the observer; how to develop a consent form that does not bias participant responses; and how to obtain informed consent from women in labor or from severely ill individuals (or their caregivers).

Study participants

It is important to present data on the sampling procedures, including the level of participation in any comparison groups. The results should state the number of participants eligible to be included in the study and the number included at each stage, including the gold standard measure, survey interview, and in the final analyses. For follow-up studies like these, this information is often best presented in a flowchart ( Figure 2 ). The main reasons for non-participation should be given in order to assess representativeness to the underlying population. An assessment of whether missing data occur at random should be made, where possible. Without going back and surveying some of the participants with missing data, one cannot definitively identify whether data are missing at random. Scientific knowledge about the data or examining patterns in the data can help identify whether data are missing at random but this requires judgement [ 22 ].

Figure 2. Example of flowchart of participant enrollment and follow-up (A. Blanc, personal communication, 28 December 2017).

A table should present the characteristics of the population studied so that representativeness of the study population to the underlying population can be assessed. This should generally include demographic data (eg, age, sex, ethnicity), risk factors, disease characteristics and details of any comorbidity, where appropriate. If the study includes a comparison group, the table should present the characteristics of each group and should include potential confounding factors.

Individual-level validity

The validity of the survey data in measuring the parameter under study should then be presented. The results section provides data on the comparison of survey data to a gold standard. This is generally presented in terms of sensitivity, specificity and positive and negative predictive values in a summary 2 × 2 table of test (survey) positive / negative by true (gold standard) positive / negative status ( Table 1 and Table 2 ). Where appropriate and where sufficient sample size permits, stratified 2 × 2 tables can be presented by the covariates of interest. A simplified 2 × 2 table for less technical audiences may also be used ( Figure 3 ). If the study is validating both the numerator and denominator of an indicator, 2 × 2 tables, sensitivity (SN), specificity (SP), and other validation metrics should be presented separately for each.

2 × 2 validation table

                          Gold standard positive    Gold standard negative    Total
Survey positive           a                         b                         a + b
Survey negative           c                         d                         c + d
Total                     a + c                     b + d                     n

Definitions of common validation metrics*

Term | Definition | Formula†
Sensitivity | The proportion of individuals who truly received an intervention who were classified as having received the intervention by survey questions. | a/(a + c)
Specificity | The proportion of individuals who truly did not receive an intervention who were classified as not having received the intervention by survey questions. | d/(b + d)
Area under the receiver operating characteristic curve (AUC) | The probability that a test will correctly classify a randomly selected set of one positive observation and one negative observation. | Calculated as the area under the curve of sensitivity plotted against (1 − specificity)
Accuracy | The proportion of individuals surveyed who were correctly classified as having received or not having received the intervention. | (a + d)/n
Positive predictive value | The probability that an individual received an intervention, given that they reported receiving the intervention. | a/(a + b)
Negative predictive value | The probability that an individual did not receive an intervention, given that they reported not receiving the intervention. | d/(c + d)

*Adapted from [ 23 , 24 ].

†Variables are defined in Table 1 .
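As a concrete companion to Table 1 and Table 2, the following sketch (Python; the cell counts are hypothetical) computes the individual-level validity metrics from the a, b, c and d cells, together with Wilson 95% confidence intervals for sensitivity and specificity. It is a minimal illustration of the formulas above, not code used in the ICM studies.

```python
from statsmodels.stats.proportion import proportion_confint

def validation_metrics(a, b, c, d, alpha=0.05):
    """Individual-level validity metrics from a 2x2 validation table.
    a: survey+/gold+, b: survey+/gold-, c: survey-/gold+, d: survey-/gold-."""
    n = a + b + c + d
    sn = a / (a + c)                      # sensitivity
    sp = d / (b + d)                      # specificity
    metrics = {
        "sensitivity": sn,
        "specificity": sp,
        "ppv": a / (a + b),
        "npv": d / (c + d),
        "accuracy": (a + d) / n,          # percent agreement
        "auc": (sn + sp) / 2,             # AUC for a single SN/SP pair
    }
    # Wilson 95% confidence intervals for sensitivity and specificity
    metrics["sensitivity_ci"] = proportion_confint(a, a + c, alpha, method="wilson")
    metrics["specificity_ci"] = proportion_confint(d, b + d, alpha, method="wilson")
    return metrics

# Hypothetical counts: 180 true positives reported, 40 missed,
# 30 false positives, 250 true negatives.
print(validation_metrics(a=180, b=30, c=40, d=250))
```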

Figure 3. Simplified 2 × 2 table.

For interventions with very low or very high prevalence/incidence, validation may not be possible due to sample size. For example, in the studies in this Collection, validation results were not presented if there were fewer than 5 observations in any cell of the 2 × 2 tables for sensitivity and specificity of a coverage indicator. In such cases, it is important that the percent agreement between survey data and the gold standard be presented.

In addition, area under the curve (AUC) can be presented. AUC is calculated as the area under the curve for a plot of SN against 1 − SP ( Figure 4 ) and is useful as a summary measure of the individual validity of an indicator. AUC can take on values between 0 and 1. An AUC of 1 indicates that survey questions provide a perfect measure of whether an individual received an intervention; an AUC of 0.5 indicates that the survey questions are no better than a random guess to determine whether an individual received an intervention or did not; and an AUC of less than 0.5 indicates that the questions produce misleading information about whether an individual received an intervention – worse than a random guess. Most AUCs for coverage validation studies will fall in the range of (0.5, 1). The AUC for these types of validation studies represents the average of the sensitivity and specificity and is useful as a summary metric. If used, AUC should be presented alongside SN and SP in order to better understand the kinds of reporting bias affecting the indicator.

Figure 4. Example of area under the curve plot for sensitivity-specificity pairs.

As noted above, percent agreement, often referred to as “Accuracy”, may also be reported. Although percent agreement is more intuitive than AUC and can be computed with small cell sizes, it varies based on the underlying prevalence of the indicator being validated, and at very high and very low prevalence levels may distort the true individual-level validity of a measure [ 25 ]. Where 2 × 2 cell sizes permit, we recommend always reporting sensitivity, specificity, and AUC as the primary metrics of individual-level validity in order to accurately reflect overall validity and understand whether an intervention is being under- or over-reported.

Comparisons of the validity of different survey questions can be made by comparing SN, SP, and AUC values. These measures of individual validity are fixed characteristics and should be invariant to the prevalence of intervention coverage in a particular setting.

These validation measures should be expressed with appropriate 95% confidence intervals, and should be presented separately for each study site in a multi-centre study.

Variation in indicator validity should be reported by participant and interviewer characteristics – including age, sex, parity, and ethnicity where relevant and available. Differences by characteristics should be tested using chi-squared tests (or Fisher exact tests where cell counts are <10). Associations between interviewer and participant characteristics and measures of validity can be assessed by logistic regression. In circumstances where the validity of the measure has been evaluated in several settings, these analyses may help in understanding the reasons for variation across populations. In some studies, the findings of parallel qualitative data studies on question wording may be presented in order to help interpret quantitative findings [ 19 ].
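The subgroup comparisons described above might be implemented as in the following sketch (Python, using scipy and statsmodels); the data frame, its columns and the counts are hypothetical, and the switch between the chi-squared and Fisher exact tests follows the rule of thumb given in the text.

```python
import pandas as pd
from scipy.stats import chi2_contingency, fisher_exact
import statsmodels.formula.api as smf

# Hypothetical per-respondent data restricted to gold-standard positives, so
# "correct" (1/0) indicates whether the intervention was correctly reported.
df = pd.DataFrame({
    "correct": [1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1],
    "parity":  ["primiparous"] * 6 + ["multiparous"] * 6,
    "age":     [19, 22, 31, 27, 24, 35, 29, 41, 23, 30, 26, 38],
})

# Difference in sensitivity by parity: chi-squared, or Fisher if any cell < 10.
table = pd.crosstab(df["parity"], df["correct"])
test = fisher_exact(table) if (table.values < 10).any() else chi2_contingency(table)
print(test)

# Association between respondent characteristics and correct reporting,
# assessed by logistic regression as suggested in the text.
model = smf.logit("correct ~ age + C(parity)", data=df).fit(disp=False)
print(model.summary())
```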

Population-level validity

The inflation factor (IF) is the measured coverage value divided by the true coverage value in the population, and provides an estimate of the extent to which the survey-based estimate accurately reflects the true population coverage. If validation study participants were selected from the population using probability sampling, the IF can be calculated directly as the ratio of the measured coverage value in the validation study follow up interviews divided by the true population coverage value estimated from the measured sensitivity and specificity. The formula for estimating true population coverage is derived from Vecchio [ 26 ]:

P = (Pr − (1 − SP)) / (SN + SP − 1)

Where P is the true population coverage (prevalence), Pr is the measured population coverage, SN is the sensitivity, and SP the specificity. In the case where validation study participants are selected on disease status or intervention receipt, the measured coverage value must be estimated from the sensitivity and specificity estimates across a range of true coverage values, using the following equation from Vecchio [ 26 ]:

Pr = P(SN + SP − 1) + (1 − SP)

The IF represents the population-level validity of the measure and varies by the coverage prevalence in the population. The IF should be presented along with the true coverage and survey-based estimate of coverage in the study population. In addition, estimates of measured coverage should be plotted over a range of true underlying population coverage levels. An example of this is given in Figure 5 and shows that, at differing true coverage levels, respondent report could either over-estimate or substantially under-estimate the true coverage of the indicator (the Stata code for generating this graphic is given in Appendix S1 of the Online Supplementary Document) (Stata Corp, College Station, TX, USA). A graphic depicting the impact of prevalence on IF is used in multiple studies in this Collection.
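A minimal sketch of the two equations above (Python, with hypothetical sensitivity, specificity and coverage values): the first function corrects a measured coverage estimate to the implied true coverage, and the second projects the expected measured coverage – and hence the IF – across a range of true coverage levels, which is the relationship displayed in plots such as Figure 5. This mirrors the equations only; it is not the appendix Stata code.

```python
import numpy as np

def true_coverage(measured, sn, sp):
    """Correct a survey-measured coverage estimate to the implied true
    coverage, given estimated sensitivity (sn) and specificity (sp)."""
    return (measured - (1 - sp)) / (sn + sp - 1)

def measured_coverage(true, sn, sp):
    """Expected survey-measured coverage for a given true coverage."""
    return true * (sn + sp - 1) + (1 - sp)

# Hypothetical validation results for an indicator.
sn, sp = 0.85, 0.70

# Inflation factor (measured / true) across a range of true coverage levels.
true_levels = np.linspace(0.05, 0.95, 19)
measured_levels = measured_coverage(true_levels, sn, sp)
inflation_factor = measured_levels / true_levels
for p, pr, if_ in zip(true_levels, measured_levels, inflation_factor):
    print(f"true={p:.2f}  measured={pr:.2f}  IF={if_:.2f}")
```

With these illustrative values, low true coverage is over-estimated (IF well above 1) while high true coverage is under-estimated (IF below 1), which is the pattern the text describes.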

Figure 5. Inflation factor scatterplot.

“Don’t know” responses and missing data

The percent of survey responses that are “don’t know” or missing should always be reported. Variables with a high proportion of don’t know or missing responses should be flagged as being of concern for use in surveys, unless well documented statistical approaches to imputing missing data can be demonstrated.

It is vital that implementers, governments, and donors are able to trust the coverage measurement estimates that inform investment and programming decisions for priority interventions in reproductive, maternal, newborn, child and adolescent health [ 27 ]. These estimates are useful only to the extent that they accurately reflect true population coverage across the settings in which they are used. Thus, evidence about the validity of coverage data collected through household surveys across different settings, generated through rigorous and transparent validation studies, is essential. To enhance the knowledge base on methods for validating survey-based coverage indicators, this manuscript has described a general approach to designing validation studies that use a gold-standard measure. Here we discuss the strengths, limitations, and implications of these methods.

When reporting validation study findings, the study characteristics and their implications for interpretation of results need to be carefully documented, and potential limitations identified and discussed. A major strength of these methods is that they allow survey-based, respondent-reported coverage measures to be validated against more objective measures of need for and receipt of an intervention (the gold standard). However, all gold standard measures are subject to some degree of error. Publications and reports presenting validation results should include both a detailed description of the procedures for obtaining the gold-standard measure, and a reflection on the limitations of these procedures and their possible influence on results. The act of enrolling participants into the study and observing clinical care may influence individuals’ recollection of the care received. Medical records are not always complete or accurate, and information solicited from providers on content of a health service encounter may be subject to recall and social desirability bias. Nevertheless, events observed or reported by a health provider are often the best available source to create the gold standard against which to validate survey measures of coverage.

The design phase of validation studies should consider the likely limitations in current measurement practice and aim to produce results that can provide a way forward: not only to identify measures with high or low validity but also where possible to propose improved measurement methods. For example, by using well-established survey questions alongside new adaptations (eg, probes or memory prompts) the potential added value of a change to current wording may be understood. If the measurement method is constrained by a period of recall (for example because of sample size considerations) then testing a range of different recall periods, including the standard practice, can assist decision-making about future measurement approaches. As part of this plan the analytical protocol, including definitions and cut-offs, should be defined and align where possible with existing evidence or clearly state where and how definitions differ.

Most validation studies seek to inform recommendations about whether an indicator should be measured in household surveys. The AUC and IF are generally used for this purpose; however, there are no standards for what constitutes an acceptable level of individual or population-level bias. Earlier coverage indicator validation studies funded by ICM and the CHERG defined cutoffs for individual-level accuracy as high (AUC>0.70), moderate (0.60<AUC<0.70), and low (AUC<0.60), and population-level bias as low (0.75<IF<1.25), moderate (0.50<IF<1.5) and large (IF<0.50 or IF>1.5) [ 3 , 5 , 6 ]. Based on a further review of validation study results, ICM currently uses cut-offs of AUC≥0.70 AND 0.75<IF<1.25 for inclusion of an indicator in a large survey programme, with lower cut-offs (AUC≥0.60 OR 0.75<IF<1.25) for specialized, in-depth surveys. However, these criteria are by nature arbitrary, and higher or lower thresholds could be justified. Despite acknowledged limitations to the use of AUC and IF for indicator validation, the ICM Core Group strongly recommends the use of both because of the individual and population-level perspectives they provide. Individual-level and population-level validity should be considered together when making decisions about which indicators to measure; both should be at acceptable levels for an indicator to be added to a survey programme.
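The ICM cut-offs quoted above can be written as a simple decision rule. The sketch below (Python) encodes the thresholds exactly as stated in the text, with the caveat that they are acknowledged to be arbitrary and that study teams may justify different ones.

```python
def icm_recommendation(auc, inflation_factor):
    """Apply the ICM cut-offs described in the text to an indicator's
    AUC and inflation factor (IF)."""
    low_population_bias = 0.75 < inflation_factor < 1.25
    if auc >= 0.70 and low_population_bias:
        return "eligible for a large survey programme"
    if auc >= 0.60 or low_population_bias:
        return "eligible for specialized, in-depth surveys"
    return "not currently recommended"

# Hypothetical indicator results.
print(icm_recommendation(auc=0.74, inflation_factor=1.10))  # large survey programme
print(icm_recommendation(auc=0.63, inflation_factor=1.40))  # specialized surveys
print(icm_recommendation(auc=0.55, inflation_factor=1.60))  # not recommended
```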

For coverage indicators in which the denominator requires validation (eg, curative interventions), both the numerator and the denominator must meet the study’s pre-established thresholds. For rare conditions, such as pneumonia, it is important to consider the potential number of false positives given the specificity and true prevalence of the condition. If the survey questions result in many false positives in the denominator, the indicator will be misleading, even if the numerator can be accurately measured.

The context in which the validation study is conducted should also be taken into account in the interpretation of the results. As noted above, the IF will depend on the prevalence of the coverage indicator in the study setting, and it is likely that individual-level validity also varies somewhat between settings. For example, a validation study conducted in a relatively homogenous rural population may not produce the same results when applied in a more heterogeneous, urban setting, and the interpretation should reflect this. Similarly, in settings where careseeking is extremely low, validation studies using provider-client interactions or medical records as the gold standard may produce biased validity measures compared to the larger population in need of the intervention. For this reason, it is generally desirable to conduct validation studies in a range of settings before making a firm recommendation about the inclusion of an indicator in household surveys.

The careful tracking of the coverage of high impact health interventions is central to decision making around the investment of resources and around evaluating the impact of these investments on populations in need. There is increased attention to accountability regarding health care resources; however, the current state of routine health information system data remains variable [ 27 - 30 ] and a substantial percentage of the population does not access care from a public health facility when needed. Household surveys, in which adult women are asked to report on their own experience and that of their children, remain the most obvious option for the collection, analysis, and dissemination of intervention coverage data. The mandate for the ICM project has been to increase the amount and quality of the evidence around the validity of coverage indicators for high impact MNCH interventions.

The intention of this paper is to describe a methodology that we hope will contribute to higher quality and greater standardization of validation studies of intervention coverage measures – their design, analysis, and interpretation. The recommendations result from our experience over the last 7 years conducting 10 validation studies in diverse settings [ 13 , 16 - 19 , 21 , 28 , 29 ] thus not every study follows all recommendations presented here. For example, earlier studies 1) used the term “reference standard” vs our now recommended “gold standard”, accepting that any gold standard will have limitations; 2) used a cut-off of AUC≥0.6, which we decided was too low and increased to AUC≥0.7; 3) did not always publish the percent of don’t know responses; 4) did not present the results for the percent agreement between respondent reports and the gold standard for indicators for which validation results could not be presented due to sample size; and 5) used a variety of different graphics to illustrate results vs the graphs provided in this paper. Although there is no consensus on an acceptable level of accuracy for coverage indicators, we hope that using a more standardized methodology for indicator validation will improve the generalizability of results and lead to the development of alternative indicators, improved question wording, or innovations in data collection (eg, biomarkers). We note that most of the studies cited here focused on interventions delivered primarily through health facility settings; validation of home-based interventions presents additional design challenges, particularly around obtaining an appropriate gold standard. Evidence from indicator validation studies should be taken in conjunction with the results of other assessments such as cognitive interviewing, discriminative validity, or reliability testing to inform the interpretation of indicators and decisions about whether indicators/questions should be included in a particular survey or the appropriate length of recall.

Studies conducted using this methodology have led to a number of insights. For example, the proportion of children with symptoms of acute respiratory infection whose caretaker reports receiving antibiotic treatment has been found to be a poor proxy for the proportion of children with pneumonia who received antibiotic treatment in studies in Pakistan and Bangladesh [ 17 , 29 ]. Other studies that examined intrapartum and immediate postnatal interventions had mixed results, but generally found that women’s recall of many interventions that take place during and immediately following a delivery, especially those that require reporting on timing (eg, breastfeeding within the first hour) or sequence (eg, uterotonic given before or after delivery of the placenta), has low validity, although standard indicators like cesarean section tended to be better reported [ 13 , 19 , 21 ]. Interventions received during postnatal care visits seem to be more accurately reported [ 30 ]. Indicators of preterm birth and low birthweight based on mothers’ recall underestimated these indicators in Nepal [ 18 ]. On the other hand, a study of maternal recall of care-seeking for childhood illness in rural Zambia found that mothers’ reports were valid [ 16 ].

The studies in this and the previous Collection assess the validity for a range of coverage indicators for maternal, newborn, and child health. However, relatively few other health interventions reported on in household surveys have been the subject of validation studies. For example, to our knowledge, women’s reporting of the content of family planning counseling has not been validated. Few (if any) nutrition coverage indicators, such as vitamin A or iron-folic acid supplementation, or newborn health interventions, have been assessed for validity. As maternal, newborn, and child mortality rates fall and more attention is devoted to interventions for adolescent health and non-communicable diseases, it is likely that additional indicators and questions will be added to DHS and MICS. Public health researchers in these and other global health domains should assess the evidence for validity of existing and proposed survey-based coverage indicators and consider conducting validation studies to fill the evidence gaps for indicators that have a viable gold standard for comparison purposes. Assessing the validity of respondent reports in household surveys is a key component of improving the quality of the evidence used in planning, policy-making, and evaluation.

Acknowledgments

ICM Core Group: Fred Arnold (ICF), Ann Blanc (Population Council), Harry Campbell (University of Edinburgh), Emily Carter (JHSPH), Thomas Eisele (Tulane University), John Grove (WHO), Attila Hancioglu (MICS/Unicef), Shane M Khan (MICS/Unicef), Tanya Marchant (LSHTM), Melinda Munos (JHSPH), Cynthia Stanton (Stanton-Hill Research, LLC).

Acknowledgements: We thank Jennifer Bryce for her leadership in the development of these methods and in obtaining funding for this research. We would also like to thank Andrew Marsh for his editorial assistance with this manuscript.

Funding: This work was supported by the Improving Coverage Measurement (ICM) grant from the Bill & Melinda Gates Foundation (OPP1084442).

Authorship contributions: Conceptualized the paper: HC, AKB, TPE, CKS, MKM, JK, TM, FA, SMK. Wrote or critically revised the paper: MKM, AKB, TPE, JK, TM, CKS, EC, SG, SMK, AH, FA and HC. Agree with manuscript and conclusions: MKM, AKB, TPE, JK, TM, CKS, EC, SG, FA, SMK, AH, JG and HC.

Competing interests: Harry Campbell is the Co-Editor in Chief of the Journal of Global Health . To ensure that any possible conflict of interest relevant to the journal has been addressed, this article was reviewed according to best practice guidelines of international editorial organisations. All authors have completed the ICMJE uniform disclosure form at http://www.icmje.org/coi_disclosure.pdf (available upon request from the corresponding author) and declare no conflicts of interest.


DESIGN AND VALIDATION OF A QUESTIONNAIRE TO MEASURE RESEARCH SKILLS: EXPERIENCE WITH ENGINEERING STUDENTS

1 University of Armed Forces (Ecuador)

2 Council of Evaluation, Accreditation and Quality Assurance of Higher Education (Ecuador)

3 University of Jaén (Spain)

Received May 2016

Accepted September 2016

Universities in Latin American countries are undergoing major changes in their institutional and academic settings. One strategy for continuous improvement of the teaching and learning process is the incorporation of methods and teaching aids that seek to develop scientific research skills in students from the start of their undergraduate studies.

The aim of this study is to validate a questionnaire that measures research skills in engineering students.

Questionnaire validation was performed through a literature review, semantic and content validation by experts from three Latin American universities, and, finally, factorial and reliability validation. The instrument was applied to 150 students (75.3% men and 24.7% women) enrolled at the basic level of engineering.

The validated questionnaire has 20 items. The correlations between the factors of the instrument show a relationship and dependence between them, indicating the validity of the questionnaire. The reliability of the instrument was calculated using Cronbach's alpha coefficient, which reached a value of .91 for the total scale.

The statistical results used to validate the questionnaire were significant, allowing us to propose this experience as a starting point for further studies on the development of research skills in university students from other areas of knowledge.

Keywords – Validation, Learning, Research skills, University.

1. Introduction

The challenges of higher education can be translated into the need to train students to be more competitive in today's knowledge society; this requires developing and entrenching strong thinking skills, intellectual flexibility, creativity, analytical ability, and the capacity to replicate and create knowledge.

Authors such as Hurtado (2000), Lipman (2001), Restrepo (2003), Tünnermann (2003), Sayous (2007), García and Ladino (2008) and Brew (2013) agree on the need to train students to develop research skills from their undergraduate studies onward.

Research as a learning process should be conceived as the result of a process and strategy that begins to develop in students' first academic year, not as the culmination of their training. When students begin their research training only at the postgraduate level rather than during their undergraduate studies, they tend to see research more as a requirement to complete their studies than as a pillar of their education.

Hunter, Laursen and Seymour (2007) conducted a study in four liberal arts universities addressing fundamental questions about the benefits of participation in undergraduate research projects. The results were highly positive: students who participated in research reported "thinking and working" like a scientist (23%), wanting to become a scientist (20%), gaining personal and professional benefits (19%), clarifying or confirming career plans (16%), improving career prospects (10%), and improving skills such as arguing and presenting information, organizing projects and work, comprehension and written expression (8%). The most important finding of the study is that students wanted to "become scientists"; this was expressed by the students themselves and supported by half of the teachers' observations (52%). The teachers described changes in students' behavior, noting that students began to exhibit the behaviors and attitudes of a researcher, such as curiosity and initiative, became less fearful of taking responsibility for research and more willing to take risks, gained confidence in their ability to do research and an interest in contributing to science, and became better able to present and defend their research, among other changes.

Likewise, Ward, Bennett and Bauer (2003) conducted a study to evaluate the educational effectiveness of research experience for undergraduates; students indicated that getting involved in research projects facilitated their learning.

On the other hand, in the knowledge society, research is a fundamental function of every university (González, Galindo, Galindo & Gold, 2004; Cerda-Gutiérrez, 2006) and should therefore be linked not only to teachers but also to the learning processes of students (Nuñez, 2007).

The report of the Boyer Commission on education in research universities in the United States recommended implementing the research-based learning (RBL) method, because the higher education offered by American universities lacked adequate scientific literacy, showed low commitment to the creation and production of knowledge, and separated research from teaching activities in university classrooms (The Boyer Commission, 1998).

The Council for Undergraduate Research of the United States notes that the value of undergraduate research is unquestionable and that it should be seen as "an investigation conducted by a college student who makes an original intellectual or creative contribution to the discipline" (Council for Undergraduate Research, 2013).

The European Union has recognized learning through guided research (inquiry-based learning) as an ideal methodology to improve the teaching of science and mathematics (European Commission, 2008; European Commission, 2011; National Research Council, 2000, cited by Abril, Ariza, Quesada & García, 2013).

There are other experiences of incorporating research as a teaching-learning strategy at the undergraduate level. The University of Warwick in the UK, the University of Adelaide in Australia and the South Carolina Honors College in the United States are examples of institutions that have adopted a research-based learning focus in the learning processes of different degree programs: the first developed the model in different undergraduate degrees; the second developed a conceptual framework based on research in the curriculum of its different degrees; and the third used it as a curricular strategy that allows its graduates to be more competitive for scholarships and admission to professional schools (Martínez & Buendía, 2005).

Healey and Jenkins (2009), with reference to Griffiths (2004), developed a framework to help conceptualize and explain how research is integrated into the learning environment of undergraduate students, depending on whether the learning opportunity is student-centered or teacher-centered, and whether it focuses on the products of research or on the research process itself. The framework identifies four ways in which research can be introduced into teaching:

• Research-guided teaching: the curriculum is dominated by the research interests of the institution.

• Research-oriented teaching: the student learns about the research process, how knowledge is created, and the researcher's mindset.

• Research-based learning: students act as researchers and learn the associated skills; the curriculum is dominated by inquiry-based activities. Teaching aims to help students understand phenomena the way experts do.

• Inquiry-based learning: connects student learning to the context of a problem.

In the Latin American context, the Monterrey Institute of Technology and Higher Education defined research-based learning as the application of teaching and learning strategies intended to connect research with teaching and learning, allowing the partial or total incorporation of the student into research grounded in scientific methods under the supervision of a professor (Instituto Tecnológico de Estudios Superiores de Monterrey, 2010).

For Chávez (2013), Rojas and Méndez (2013), and Morales, Rincón and Romero (2004), research-based learning offers some additional advantages, in particular:

• It introduces the student to research and empowers the teachers who work with it.

• It establishes a link between academic programs and the institution's potential research areas and research groups.

• It promotes the development, during students' years of study, of the skills necessary for research (critical thinking, analysis, synthesis, leadership, creativity, entrepreneurship, problem solving, etc.), involving them in the process of scientific discovery through classroom work in their specific scientific disciplines.

• Students learn in the context of research that seeks new knowledge and acquire a commitment to lifelong learning.

• The teacher is able to direct the entire research process more efficiently, to the extent that successful experiences can be extrapolated to the classroom.

For other authors, the advantage of using the research-based learning approach lies in the development of skills, defined as the intellectual capacities associated with performing certain actions that the learner can carry out and which, for the most part, develop only through engagement in research tasks (Moreno, 2002).

Research-based learning is conceived as one of the strategies best suited to developing a research culture and research skills; it proposes that learning be built on real scenarios that link students and teachers in a knowledge-building process inspired by the process of scientific research.

Identifying research skills could guide teachers and researchers in including research as a learning method. The shortage of such instruments led to the creation of the "Self-rated skills for research-based learning" (AHABI) scale. The aim of this study is to develop an instrument to measure research skills that can serve as a reference for incorporating a research-based teaching focus into students' learning processes.

2. Methodology

The research implemented a quasi-experimental design with a control group and an experimental group. The experimental group was exposed to the RBL method and the control group was not. These groups were not matched by randomization and hence are not equivalent groups; they had already been formed. The design is therefore adjusted to the conditions typical of educational research (Arnal, Del Rincon and Latorre, 1992).

2.1. Participants

The study sample comprised 150 students (75.3% men and 24.7% women) aged between 16 and 27 years, divided into four groups belonging to the Physics, Differential Calculus, General Chemistry I, and General Chemistry classes. The sample was probabilistic and intentional, and consisted of groups of students in the basic cycle of engineering.

2.2. Instruments

The validated measurement instrument, "AHABI", is a self-rated questionnaire consisting of 20 items rated on a Likert scale (1 means strongly disagree and 5 strongly agree). The design and development of the Likert scale took place in three phases.

In the first phase, a literature review was conducted of the elements that influence the development of research skills in students and of the instruments proposed to measure these skills. Given the lack of specific instruments for analyzing variables related to research skills, some general references were reviewed, among them the "Scale of attitude towards learning research" (ESCAI) of Ruiz and Torres (2002); the questionnaire for identifying general skills and qualities for scientific research of Fernández, Cordeiro, Cordeiro and Pérez (2004); the Attitudes Toward Research scale of Papanastasiou (2005); the self-assessment tool for research skills of Rivera and Torres (2006); and an inventory of skills for university research (ICUNI) of Sierra, Alejo and Silva (2011).

In the second phase, to validate the contents of the questionnaire, 8 experts with experience in the field of educational research were selected from two universities in Spain and one university in Ecuador. They were asked to give their professional judgment on the semantics and content of the scale, to evaluate the structure of the questionnaire, the comprehensibility of the items, and the format and presentation of the questionnaire, and to consider the following questions: What other aspects should the scale capture? What items should be deleted? In this process, the experts recommended the removal of 20 items, leaving 36 of the 56 items initially included in the scale.

In the third phase we proceeded to the factorial validation of the instrument: the scale was subjected to reliability and validity analyses with a sample of 150 students who were not part of the course for the development of research skills but had characteristics similar to the experimental and control groups. This satisfied the basic assumption that a factorial analysis should be based on a sample of no fewer than 150 subjects (Morales, Urosa & Blanco, 2003). The main objective was to check whether the 36 items of the scale could be summarized in some way, that is, whether there are commonalities between them.

For this study, three factors were analyzed: the first is "Process scientific information", the second "Managing scientific information" and the third "Develop scientific information". For the analysis and contextualization of these factors, other data defining the sample were taken into account, such as the degree program and year of study, age, and sex.

2.3. Process

The procedures for informing students and obtaining their acceptance of participation in this study were fulfilled; we then proceeded to administer the questionnaires, which were completed confidentially and voluntarily. The self-administered questionnaire was completed during regular classes, with the authorization of the corresponding teacher, in the case of the control groups. For the experimental groups, the questionnaire was administered before and after training sessions on developing research skills. In both cases, the researcher was present and gave precise instructions for completing the questionnaire; students took approximately 20 minutes to respond to the instrument.

2.4. Statistical analysis

Data from this study were analyzed using the SPSS statistical program for Windows V.20.0. The internal reliability (Cronbach's alpha) of the questionnaire was calculated. The Varimax orthogonal rotation method helped us to group items into component factors that may explain the observed variance in the answers given by the subjects (Escalante & Caro, 2006). Next, we analyzed the correlations between variables, which must be high in order to perform the factorial analysis. The KMO sampling adequacy index (Kaiser, 1970) and Bartlett's test of sphericity (Bartlett, 1950) were also used.

2.5. Factorial validity

Once it was verified that the sample size was adequate for the study, we proceeded to study the factorial validity of the instrument to see whether the 36 items of the scale could be summarized, that is, grouped according to the commonalities between them. The homogeneity of the questionnaire was calculated, allowing us to remove 16 items that had a low level of discrimination and therefore an item-total correlation below .200, according to the recommendations of Elbel (1965). The resulting scale was composed of 20 items with a Cronbach's alpha reliability of .91.
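As an illustration of the two computations described in this paragraph – corrected item-total correlations used to screen out items with low discrimination, and Cronbach's alpha for the retained scale – the following minimal Python sketch uses a hypothetical respondents-by-items matrix of Likert scores (the original analysis was run in SPSS).

```python
import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a respondents-by-items matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def corrected_item_total(items: pd.DataFrame) -> pd.Series:
    """Correlation of each item with the total score excluding that item."""
    return pd.Series({
        col: items[col].corr(items.drop(columns=col).sum(axis=1))
        for col in items.columns
    })

# Hypothetical Likert (1-5) responses: 150 respondents x 36 candidate items.
rng = np.random.default_rng(0)
responses = pd.DataFrame(rng.integers(1, 6, size=(150, 36)),
                         columns=[f"item_{i+1}" for i in range(36)])

alpha_all = cronbach_alpha(responses)
r_it = corrected_item_total(responses)
print(f"Cronbach's alpha over all items: {alpha_all:.2f}")
print("Items flagged for removal (r < .200):", list(r_it[r_it < 0.200].index))
```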

The degree of correlation between the variables was then studied; the values of this analysis should be high in order to perform the factorial analysis. The KMO sampling adequacy index reached a value of .891 and Bartlett's test of sphericity reached 1429.971 (p < .001). These data indicate that the answers are substantially related, justifying the factor analysis.

We then determined the communalities, or the proportion of variance explained by the common factors, which resulted in three common factors. Given the absence of values close to zero, it can be stated that the 20 items are explained by the components. Principal components analysis with Varimax rotation revealed, after eleven iterations, the convergence of three factors that explain 54.848% of the variance. The first component explains the largest amount of variance, 24.708%, followed by the second factor with 16.436% and the third with 13.705%, as shown in Table 1.

 

 | Factor 1 | Factor 2 | Factor 3
% Variance | 24.708 | 16.436 | 13.705
% Accumulated | 24.708 | 41.143 | 54.848

Table 1. Variance explained by factor
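The exploratory factor analysis summarized above could be reproduced along the lines of the following sketch (Python), assuming the third-party factor_analyzer package; the responses data frame is hypothetical, since the original analysis was run in SPSS, so the sketch only illustrates the sequence of steps: KMO and Bartlett checks, followed by principal components extraction with Varimax rotation.

```python
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# Hypothetical Likert (1-5) responses for the 20 retained items.
rng = np.random.default_rng(1)
responses = pd.DataFrame(rng.integers(1, 6, size=(150, 20)),
                         columns=[f"item_{i+1}" for i in range(20)])

# Sampling adequacy and sphericity checks, as reported in the text.
chi_square, p_value = calculate_bartlett_sphericity(responses)
_, kmo_total = calculate_kmo(responses)
print(f"Bartlett chi2={chi_square:.2f} (p={p_value:.4f}), KMO={kmo_total:.3f}")

# Principal components extraction with Varimax rotation, three factors.
fa = FactorAnalyzer(n_factors=3, rotation="varimax", method="principal")
fa.fit(responses)

loadings = pd.DataFrame(fa.loadings_, index=responses.columns,
                        columns=["Factor 1", "Factor 2", "Factor 3"])
variance, proportion, cumulative = fa.get_factor_variance()
print(loadings.round(3))
print("Proportion of variance:", np.round(proportion, 3),
      "cumulative:", np.round(cumulative, 3))
```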

In the analysis, the items were ordered according to their degree of saturation, that is, the factor on which each item showed its highest loading (Table 2).

Item | Factor 1 | Factor 2 | Factor 3
1. I manage research articles on a topic drawn from scientific journals, databases, etc. |  | .510 | 
2. I recognize a scientific paper as distinct from a document from Wikipedia, Rincón del Vago, etc. |  | .408 | 
3. I know what a literature review is |  | .589 | 
4. I identify scientific journals |  | .628 | 
5. I recognize databases of scientific journals |  | .812 | 
6. I identify the structure of a scientific research article |  | .569 | 
7. I use scientific techniques to organize information | .563 |  | 
8. I analyze the main ideas of a scientific article | .648 |  | 
9. I reflect as I read a scientific article | .598 |  | 
10. I interpret data, graphics, etc. in a scientific article | .636 |  | 
11. I summarize scientific information | .664 |  | 
12. I critically discuss a research article | .544 |  | 
13. I draw conclusions after reviewing scientific literature | .632 |  | 
14. I prepare an abstract/essay on a research topic |  |  | .942
15. I use references according to the rules of scientific writing in texts that I prepare, whether abstracts or essays |  |  | .490
16. I write keywords in English for a research topic |  |  | .297
17. I identify a new research topic in the literature review | .450 |  | 
18. I am able to communicate orally the results of a review of scientific literature |  | .629 | 
19. I elaborate keywords for a research topic |  |  | .376
20. I contribute my ideas when developing a research topic |  |  | .347
Cronbach's alpha coefficient | .891 | .711 | .687

Table 2. Rotated component matrix and Cronbach's alpha by factor

The correlations between the factors indicate a relationship and dependence between them; we can therefore say that the data confirm the validity of the questionnaire, with a three-factor structure.

The first factor groups items 7, 8, 9, 10, 11, 12, 13 and 17; these items assess skills related to organizing the scientific information collected. This factor is given the name "Process scientific information".

The second factor corresponds to items 1, 2, 3, 4, 5, 6 and 18. These items assess skills related to managing and searching for scientific information. This factor is called "Managing scientific information".

The third factor groups items 14, 15, 16, 19 and 20; these items assess the application of new understandings and new working skills. This factor is named "Develop scientific information".

2.6. Reliability Analysis

Once the validity of the scale was established, the reliability of the instrument was calculated using Cronbach's alpha coefficient, which reaches a value of .91 for the total scale: .891 for Factor 1, "Process scientific information"; .711 for Factor 2, "Managing scientific information"; and .687 for Factor 3, "Develop scientific information". This indicates adequate internal consistency and makes the AHABI a reliable instrument. Table 3 shows the reliability of the scale through the item-total correlations; the values of Cronbach's alpha when each item is deleted range between .89 and .92, a narrow band that supports the basic dimensionality of the instrument.

 

Item       Scale mean if item deleted   Scale variance if item deleted   Corrected item-total correlation   Cronbach's alpha if item deleted

Item 1         79.31        293.75         0.59        .89
Item 2         78.53        307.82         0.43        .90
Item 3         78.87        318.07         0.24        .91
Item 4         78.65        301.11         0.60        .90
Item 5         79.45        292.45         0.72        .89
Item 6         79.25        302.86         0.54        .90
Item 7         79.81        306.67         0.38        .91
Item 8         79.30        310.43         0.32        .91
Item 9         80.07        295.33         0.58        .90
Item 10        79.92        302.30         0.57        .90
Item 11        80.69        349.07        -0.32        .92
Item 12        80.48        316.48         0.36        .91
Item 13        80.66        321.54         0.27        .91
Item 14        80.09        322.20         0.23        .91
Item 15        80.28        310.11         0.42        .90
Item 16        79.06        305.22         0.51        .90
Item 17        79.16        296.94         0.65        .89
Item 18        80.96        310.61         0.46        .90
Item 19        80.70        315.81         0.36        .91
Item 20        79.31        316.92         0.33        .91

Table 3. Correlation of items with the total scale
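For readers who wish to reproduce statistics of this kind, the following sketch computes Cronbach's alpha, the corrected item-total correlation, and alpha-if-item-deleted from a hypothetical respondents-by-items DataFrame `items`; it is illustrative only and does not use the study's data.

```python
# Hedged sketch: Cronbach's alpha and item-total statistics for a Likert scale.
import numpy as np
import pandas as pd

def cronbach_alpha(df: pd.DataFrame) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)."""
    k = df.shape[1]
    item_vars = df.var(axis=0, ddof=1)
    total_var = df.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def item_total_statistics(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for col in df.columns:
        rest = df.drop(columns=col)             # scale with the item removed
        rows.append({
            "item": col,
            "corrected_item_total_r": df[col].corr(rest.sum(axis=1)),
            "alpha_if_deleted": cronbach_alpha(rest),
        })
    return pd.DataFrame(rows)

print(f"Total-scale alpha: {cronbach_alpha(items):.2f}")
print(item_total_statistics(items).round(2))
```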

3. Discussion

The significant results of this research are consistent with those of other studies reported in the literature that have pointed to the effectiveness of teaching based on the development of research skills, which has proven to be a cognitively attractive learning strategy for undergraduate students in many disciplines, allowing them to work with increasingly academic and research-oriented logic (Hunter et al., 2007; Ward et al., 2003; Willison and O'Regan, 2007; Chaplin, 2003; Hoskins, Stevens & Nehm, 2007; Luckie, Maleszewski, Loznak & Krha, 2004). The results are likewise in line with the conclusions drawn by Willison (2009), reflected in his Research Skill Development framework and its application in several undergraduate programs (Tourism, Engineering, Health, etc.) at the University of Adelaide, which selects different types and facets of research for each area (literature review, field and laboratory research).

The factors that form the questionnaire, "Process scientific information", "Managing scientific information" and "Develop scientific information", could be used to evaluate proposals such as that of Bastidas (2013), who notes that in research-based teaching students act as researchers and learn the related skills, while teaching aims to help students understand phenomena the way experts do; the instrument could also be used to assess the methodological proposals of Rizo (2012) and Torres (2012).

4. Conclusions

The scale has high internal consistency, since Cronbach's alpha coefficient reached .91. In addition, the saturations of each item on its respective factor have high values. The correlations between factors indicate a good relationship and dependence among them, so we can say that this study has generated a valid instrument for measuring the learning of research skills, since the results presented, taken as a whole, confirm its high reliability as well as its factorial and content validity.

Factors 1, 2 and 3 group the questionnaire items and show adequate internal consistency, so the AHABI is deemed a reliable instrument for use in the evaluation of programs with a focus on teaching through research at the university.

Having reached the goal of the research, this study contributes new learning experiences to university classrooms, incorporating time to develop and evaluate innovative forms of education. The teaching-through-research method contributes not only at the micro-curriculum level, but also at the policy level, to the quality of higher education.

References

Abril, A. , Ariza, M. , Quesada, A. , & García, J. (2013). Creencias del profesorado en ejercicio y en formación sobre el aprendizaje por investigación. Revista Eureka sobre Enseñanza y Divulgación de las Ciencias , 11(1), 22-33.

Arnal, J. , del Rincón, D. , & Latorre, A. (1992). Investigación educativa. Metodologías de investigación educativa . Barcelona: Labor.

Bartlett, M.S. (1950). Test of significance in factor analysis . England : University of Manchester.

Bastidas, V. (2013). Aprendizaje Basado en Investigación . Retrieved from:   http://sitios.itesm.mx/va/dide2/tecnicas_didacticas/abi/abi.htm

Boyer Commission Report (1998). The Boyer Commission on Educating Undergraduates in the Research University, Reinventing Undergraduate Education: A Blueprint for America’s Research Universities . Retrieved from : http://www.niu.edu/engagedlearning/research/pdfs/Boyer_Report.pdf

Brew, A. (2013). Understanding the scope of undergraduate research: A framework for curricular and pedagogical decision-making. Higher Education , (66), 603-618. http://dx.doi.org/10.1007/s10734-013-9624-x

Cerda-Gutiérrez, H. (2006). Formación investigativa en la Educación Superior Colombiana . Universidad Cooperativa de Colombia. Bogotá: EDUCC.

Chávez, G. (2013). La investigación formativa en la universidad. Proyecto de investigación del Cuerpo Académico “Cambio educativo: discursos, actores y prácticas” . México : Universidad Autónoma de Nuevo León.

Chaplin, S. (2003). Guided development of independent inquiry in an anatomy/physiology laboratory. Advances in Physiology Education . Minnesota: University of St. Thomas.

Council for Undergraduate Research (2013). About CUR: Frequent questions . Retrieved from : http://www.cur.org/about_cur/frequently_asked_questions_/#2

Elbel, R. (1965). Measuring educational achievement . Englewood: Prentice-Hall.

Escalante, E. , & Caro, A. (2006). Investigación y análisis estadísticos de datos en SPSS . Mendoza: FEEyE.

Fernández, D. , Cordeiro, A. , Cordeiro, E. , & Pérez, C. (2004). Diseño de un cuestionario para la identificación de las habilidades generales y las cualidades del investigador científico. Pedagogía Universitaria , 9(1), 25-36.

García, G.A. , & Ladino, Y. (2008). Desarrollo de competencias científicas a través de una estrategia de enseñanza y aprendizaje por investigación. Studiositas , 3(3), 7-16.

González, J. , Galindo, N. , Galindo, J.L. , & Gold, M. (2004). Los paradigmas de la calidad educativa. De la autoevaluación a la acreditación . México: Unión de Universidades de América Latina.

Griffiths, R. (2004). Knowledge production and the research-learning nexus: The case of the built environment disciplines. Studies in Higher Education , 29(6), 709-726. http://dx.doi.org/10.1080/0307507042000287212

Healey M. , & Jenkins, A. (2009). Undergraduate Research and Inquiry . York: Higher Education Academy.

Hoskins, S., Stevens, L., & Nehm, R. (2007). Selective use of the primary literature transforms the classroom into a virtual laboratory. Genetics, 176, 1381-1389. http://dx.doi.org/10.1534/genetics.107.071183

Hunter, A., Laursen, S. , & Seymour, E. (2007). Becoming a Scientist: the role of undergraduate research in students’ cognitive, personal and professional development. Science Education , 91, 36 ‑ 74. http://dx.doi.org/10.1002/sce.20173

Hurtado, J. (2000). Retos y alternativas en la formación de investigadores . Venezuela: SYPAL.

Instituto Tecnológico de Estudios Superiores de Monterrey (2010). Aprendizaje basado en investigación. Investigación e Innovación Educativa . Retrieved from:  http://sitios.itesm.mx/va/dide2/tecnicas_didacticas/abi/copabi.htm

Kaiser, H.F. (1970). A second generation Little Jiffy. Psycometrika , 35, 401-415. http://dx.doi.org/10.1007/BF02291817

Lipman, M. (2001). Pensamiento Complejo y Educación . Madrid: Ediciones de la Torre.

Luckie, D., Maleszewski, J., Loznak, S., & Krha, M. (2004). Infusion of collaborative inquiry throughout a biology curriculum increases student learning: a four year study of "Teams and Streams". Advances in Physiology Education , 28, 199-209. http://dx.doi.org/10.1152/advan.00025.2004

Martínez, A., & Buendía, A. (2005). Aprendizaje basado en la investigación. Tecnológico de Monterrey. Retrieved from: http://www.mty.itesm.mx/rectoria/dda/rieee/pdf-05/29(EGADE).A.BuendiaA.Mtz.pdf

Morales, P. , Urosa, B. , & Blanco, A. (2003). Construcción de escalas de actitudes tipo Likert . Madrid: La Muralla.

Morales, O., Rincón, A., & Romero, J. (2004). Cómo enseñar a investigar en la universidad. Educere, 9(29), 217-224.

Moreno, M. (2002). Formación para la investigación centrada en el desarrollo de habilidades . México: Universidad de Guadalajara.

Núñez, N. (2007). Desarrollo de habilidades para la investigación (DHIN). Revista Iberoamericana de Educación , 44, 6-15.

Papanastasiou, E. (2005). Factor structure of the attitudes toward research scale. Statistics Education Research Journal , 4(1), 16-26.

Restrepo, B. (2003). Investigación formativa e investigación productiva de conocimiento en la universidad. Nómadas , 18, 195-202.

Rivera, Mª E. , & Torres, C. (2006). Percepción de los estudiantes universitarios de sus propias habilidades de investigación. Revista de la Comisión de Investigación de FIMPES , 1(1), 36-49.

Rizo, M (2012). Enseñar a investigar investigando. Experiencias de investigación en comunicación con estudiantes de la Licenciatura en Comunicación y Cultura de la Universidad Autónoma de la Ciudad de México . México: Universidad Autónoma de la Ciudad de México.

Rojas, M. , & Méndez, R. (2013). Cómo enseñar a investigar. Un reto para la pedagogía universitaria. Educ. , 1(16), 95-108.

Ruiz, C. , & Torres, V. (2002). Actitud hacia el aprendizaje de la investigación. Conceptualización y medición. Educación y Ciencias Humanas , X (18), 15-30.

Sayous, N. (2007). La investigación científica y el aprendizaje social para la producción de conocimientos en la formación del ingeniero civil. Ingeniería , 11(2), 39-46.

Sierra, M. , Alejo, M. , & Silva, F. (2011). Evaluación de competencias de investigación en alumnos de licenciatura en psicología . Retrieved from :        http://www.researchgate.net/publication/232069603_Evaluacin_de_competencias_de_investigacin_en_alumnos_de_licenciatura_en_Psicologa

Torres, A. (2012). Aprendizaje Basado en la Investigación. Técnicas Didácticas. Tecnológico de Monterrey . Retrieved from : http://rodin.uca.es:8081/xmlui/bitstream/handle/10498/15117/7313_Penaherrera.pdf?sequence=7

Tünnermann, C. (2003). La universidad latinoamericana ante los retos del siglo XXI . México: Universidad Autónoma de Yucatán Mérida.

Ward, C., Bennett, J., & Bauer, K.W. (2003). Content analysis of undergraduate research student evaluations. Retrieved from: https://www.sarc.miami.edu/ReinventionCenter/Public/assets/files/contentAnalysis.pdf

Willison, J. (2009). Multiple contexts, multiple outcomes, one conceptual framework for research skill development in the undergraduate curriculum. Spring, 29(3), 10-14.

Willison, J., & O’Regan. K. (2007). Commonly known, commonly not known, totally unknown: A framework for students becoming researchers. Higher Education Research and Development , 26(4), 393-405. http://dx.doi.org/10.1080/07294360701658609


Study reveals the benefits and downside of fasting


Low-calorie diets and intermittent fasting have been shown to have numerous health benefits: They can delay the onset of some age-related diseases and lengthen lifespan, not only in humans but many other organisms.

Many complex mechanisms underlie this phenomenon. Previous work from MIT has shown that one way fasting exerts its beneficial effects is by boosting the regenerative abilities of intestinal stem cells, which helps the intestine recover from injuries or inflammation.

In a study of mice, MIT researchers have now identified the pathway that enables this enhanced regeneration, which is activated once the mice begin “refeeding” after the fast. They also found a downside to this regeneration: When cancerous mutations occurred during the regenerative period, the mice were more likely to develop early-stage intestinal tumors.

“Having more stem cell activity is good for regeneration, but too much of a good thing over time can have less favorable consequences,” says Omer Yilmaz, an MIT associate professor of biology, a member of MIT’s Koch Institute for Integrative Cancer Research, and the senior author of the new study.

Yilmaz adds that further studies are needed before forming any conclusion as to whether fasting has a similar effect in humans.

“We still have a lot to learn, but it is interesting that being in either the state of fasting or refeeding when exposure to mutagen occurs can have a profound impact on the likelihood of developing a cancer in these well-defined mouse models,” he says.

MIT postdocs Shinya Imada and Saleh Khawaled are the lead authors of the paper, which appears today in Nature .

Driving regeneration

For several years, Yilmaz’s lab has been investigating how fasting and low-calorie diets affect intestinal health. In a 2018 study , his team reported that during a fast, intestinal stem cells begin to use lipids as an energy source, instead of carbohydrates. They also showed that fasting led to a significant boost in stem cells’ regenerative ability.

However, unanswered questions remained: How does fasting trigger this boost in regenerative ability, and when does the regeneration begin?

“Since that paper, we’ve really been focused on understanding what is it about fasting that drives regeneration,” Yilmaz says. “Is it fasting itself that’s driving regeneration, or eating after the fast?”

In their new study, the researchers found that stem cell regeneration is suppressed during fasting but then surges during the refeeding period. The researchers followed three groups of mice — one that fasted for 24 hours, another one that fasted for 24 hours and then was allowed to eat whatever they wanted during a 24-hour refeeding period, and a control group that ate whatever they wanted throughout the experiment.

The researchers analyzed intestinal stem cells’ ability to proliferate at different time points and found that the stem cells showed the highest levels of proliferation at the end of the 24-hour refeeding period. These cells were also more proliferative than intestinal stem cells from mice that had not fasted at all.

“We think that fasting and refeeding represent two distinct states,” Imada says. “In the fasted state, the ability of cells to use lipids and fatty acids as an energy source enables them to survive when nutrients are low. And then it’s the postfast refeeding state that really drives the regeneration. When nutrients become available, these stem cells and progenitor cells activate programs that enable them to build cellular mass and repopulate the intestinal lining.”

Further studies revealed that these cells activate a cellular signaling pathway known as mTOR, which is involved in cell growth and metabolism. One of mTOR’s roles is to regulate the translation of messenger RNA into protein, so when it’s activated, cells produce more protein. This protein synthesis is essential for stem cells to proliferate.

The researchers showed that mTOR activation in these stem cells also led to production of large quantities of polyamines — small molecules that help cells to grow and divide.

“In the refed state, you’ve got more proliferation, and you need to build cellular mass. That requires more protein, to build new cells, and those stem cells go on to build more differentiated cells or specialized intestinal cell types that line the intestine,” Khawaled says.

Too much of a good thing

The researchers also found that when stem cells are in this highly regenerative state, they are more prone to become cancerous. Intestinal stem cells are among the most actively dividing cells in the body, as they help the lining of the intestine completely turn over every five to 10 days. Because they divide so frequently, these stem cells are the most common source of precancerous cells in the intestine.

In this study, the researchers discovered that if they turned on a cancer-causing gene in the mice during the refeeding stage, they were much more likely to develop precancerous polyps than if the gene was turned on during the fasting state. Cancer-linked mutations that occurred during the refeeding state were also much more likely to produce polyps than mutations that occurred in mice that did not undergo the cycle of fasting and refeeding.

“I want to emphasize that this was all done in mice, using very well-defined cancer mutations. In humans it’s going to be a much more complex state,” Yilmaz says. “But it does lead us to the following notion: Fasting is very healthy, but if you’re unlucky and you’re refeeding after a fasting, and you get exposed to a mutagen, like a charred steak or something, you might actually be increasing your chances of developing a lesion that can go on to give rise to cancer.”

Yilmaz also noted that the regenerative benefits of fasting could be significant for people who undergo radiation treatment, which can damage the intestinal lining, or other types of intestinal injury. His lab is now studying whether polyamine supplements could help to stimulate this kind of regeneration, without the need to fast.

“This fascinating study provides insights into the complex interplay between food consumption, stem cell biology, and cancer risk,” says Ophir Klein, a professor of medicine at the University of California at San Francisco and Cedars-Sinai Medical Center, who was not involved in the study. “Their work lays a foundation for testing polyamines as compounds that may augment intestinal repair after injuries, and it suggests that careful consideration is needed when planning diet-based strategies for regeneration to avoid increasing cancer risk.”

The research was funded, in part, by a Pew-Stewart Trust Scholar award, the Marble Center for Cancer Nanomedicine, the Koch Institute-Dana Farber/Harvard Cancer Center Bridge Project, and the MIT Stem Cell Initiative.

Press mentions

A new study led by researchers at MIT suggests that fasting and then refeeding stimulates cell regeneration in the intestines, reports Katharine Lang for Medical News Today . However, notes Lang, researchers also found that fasting “carries the risk of stimulating the formation of intestinal tumors.” 

MIT researchers have discovered how fasting impacts the regenerative abilities of intestinal stem cells, reports Ed Cara for Gizmodo . “The major finding of our current study is that refeeding after fasting is a distinct state from fasting itself,” explain Prof. Ömer Yilmaz and postdocs Shinya Imada and Saleh Khawaled. “Post-fasting refeeding augments the ability of intestinal stem cells to, for example, repair the intestine after injury.” 

Prof. Ömer Yilmaz and his colleagues have discovered the potential health benefits and consequences of fasting, reports Max Kozlov for Nature . “There is so much emphasis on fasting and how long to be fasting that we’ve kind of overlooked this whole other side of the equation: what is going on in the refed state,” says Yilmaz.




Open Access

Study Protocol

Operational research to inform post-validation surveillance of lymphatic filariasis in Tonga study protocol: History of lymphatic filariasis elimination, rational, objectives, and design

Authors: Harriet Lawford, ‘Ofa Tukia, Joseph Takai, Sarah Sheridan, Colleen L. Lau

Affiliations: UQ Centre for Clinical Research, The University of Queensland, Brisbane, QLD, Australia; Public Health Division, Ministry of Health, Nuku’alofa, Tonga; National Centre for Immunisation Research and Surveillance of Vaccine Preventable Diseases, Sydney, NSW, Australia

Published: August 20, 2024
https://doi.org/10.1371/journal.pone.0307331

Lymphatic filariasis (LF), a mosquito-borne helminth infection, is an important cause of chronic disability globally. The World Health Organization has validated eight Pacific Island countries as having eliminated lymphatic filariasis (LF) as a public health problem, but there are limited data to support an evidence-based approach to post-validation surveillance (PVS). Tonga was validated as having eliminated LF in 2017 but no surveillance has been conducted since 2015. This paper describes a protocol for an operational research project investigating different PVS methods in Tonga to provide an evidence base for national and regional PVS strategies.

Programmatic baseline surveys and Transmission Assessment Surveys conducted between 2000–2015 were reviewed to identify historically ‘high-risk’ and ‘low-risk’ schools and communities. ‘High-risk’ were those with LF antigen (Ag)-positive individuals recorded in more than one survey, whilst ‘low-risk’ were those with no recorded Ag-positives. The outcome measure for ongoing LF transmission will be Ag-positivity, diagnosed using Alere™ Filariasis Test Strips. A targeted study will be conducted in May-July 2024 including: (i) high and low-risk schools and communities, (ii) boarding schools, and (iii) patients attending a chronic-disease clinic. We estimate a total sample size of 2,010 participants.

Conclusions

Our methodology for targeted surveillance of suspected ‘high-risk’ populations using historical survey data can be adopted by countries when designing their PVS strategies. The results of this study will allow us to understand the current status of LF in Tonga and will be used to develop the next phase of activities.

Citation: Lawford H, Tukia ‘, Takai J, Sheridan S, Lau CL (2024) Operational research to inform post-validation surveillance of lymphatic filariasis in Tonga study protocol: History of lymphatic filariasis elimination, rational, objectives, and design. PLoS ONE 19(8): e0307331. https://doi.org/10.1371/journal.pone.0307331

Editor: Marianne Clemence, Public Library of Science, UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND

Received: June 26, 2024; Accepted: July 1, 2024; Published: August 20, 2024

Copyright: © 2024 Lawford et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: No datasets were generated or analysed during the current study. All relevant data from this study will be made available upon study completion.

Funding: This work received financial support from the Coalition for Operational Research on Neglected Tropical Diseases (COR-NTD), which is funded at The Task Force for Global Health primarily by the Bill & Melinda Gates Foundation (OPP1190754) and by the United States Agency for International Development through its Neglected Tropical Diseases Program. Under the grant conditions of the Foundation, a Creative Commons Attribution 4.0 Generic License has already been assigned to the Author Accepted Manuscript version that might arise from this submission. This work was also supported by an Australian National Health and Medical Research Council (NHMRC) Investigator Grant (APP1158469 to CLL). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Lymphatic filariasis (LF) is a mosquito-borne parasitic infection caused by three species of filarial worms ( Wuchereria bancrofti , Brugia malayi , or Brugia timori ) [ 1 ]. In Tonga, diurnally sub-periodic W . bancrofti is the dominant species of parasite causing LF [ 2 ], which is transmitted primarily by Aedes tongae and Ae . tabu mosquitoes [ 3 ]. Both vectors are diurnal feeders that feed indoors and outdoors, and are container breeders with habitats that include tree and coconut holes, leaf axils, and artificial containers [ 4 ].

The Global Programme to Eliminate LF (GPELF) is one of the largest public health programs in the world [5]. GPELF was launched by the World Health Organization (WHO) in 2000 with the aim to (i) interrupt transmission through mass drug administration (MDA) of anthelminthic medicines, and (ii) control morbidity in affected populations by 2020 [6]. New milestones and targets beyond 2020 have been developed; the WHO Neglected Tropical Diseases Roadmap 2030 [7] proposes that all countries complete their MDA programmes, implement post-MDA or post-validation surveillance (PVS), and implement a minimum package of care for LF morbidity by 2030 [8]. MDA aims to reduce the prevalence of infections to a level where transmission is thought to be no longer sustainable [9], with the elimination criterion for Aedes vector areas set at <1.0% antigen (Ag) prevalence in 6-7-year-old children.

Many countries in the Pacific region have achieved validation of LF elimination; however, there is a limited evidence base to inform the development of effective and efficient PVS strategies. Operational research is required to determine effective sampling strategies to confirm the presence or absence of LF transmission post-validation, to identify any ongoing transmission in a cost-effective and timely manner, and to integrate surveillance into other public health programs to ensure long-term sustainability.

History of LF elimination activities in Tonga, 1925 to 2017

The Kingdom of Tonga is an archipelago of over 170 islands ( Fig 1 ). Historical surveys of LF prevalence in Tonga recorded microfilaria (Mf) prevalence ranging from 13.5% in Tongatapu (1925) to 71.0% in Niuatoputapu (1970), and nationwide Mf prevalence of 17.4% in 1976 [ 10 ]. In 1977, MDA of diethylcarbamazine (DEC) (one dose/month for 12 months) was implemented, following which nationwide surveys in 1979, 1983/4, and 1998/9 recorded Mf prevalence of 1.0%, 0.4%, and 0.63%, respectively, suggesting ongoing residual transmission [ 10 ].

Fig 1. Source: Nations Online Project (https://www.nationsonline.org/oneworld/map/tonga-map.htm).

https://doi.org/10.1371/journal.pone.0307331.g001

The Pacific Programme for the Elimination of LF (PacELF) was launched in 2000, following which the Ministry of Health (MOH) in Tonga established the National Programme to Eliminate LF (NPELF) which aimed to (i) achieve 100% geographic coverage for MDA by the year 2001, (ii) implement five effective rounds of MDA throughout the country, and (iii) achieve interruption of transmission by 2005 [ 10 ]. Each division/island group was designated as an implementation unit (IU), giving a total of five IUs: ‘Eua, Ha’apai, Ongo Niua, Tongatapu, and Vava’u.

A baseline survey (A Survey) of convenience sampling of adults (n = 4,002) at sentinel sites was conducted in 1999–2000 (prior to the first round of MDA) and found an Ag prevalence of 2.7% (n = 108 positive), following which three rounds of MDA (DEC at 6 mg/kg body weight and one tablet of albendazole at 400 mg) were distributed throughout the country between 2001–2003 [ 11 ]. Following the third round of MDA, a mid-term evaluation (B Survey) of adults (n = 3,294) in sentinel sites was implemented in 2003–2004, which recorded an Ag prevalence of 2.5% (n = 81 positive) [ 11 ]. Two further MDA rounds were conducted in 2004 and 2005. In 2006, a pre-stop MDA survey (C Survey) was implemented in all sites/villages in Ongo Niuas (where a high baseline Ag prevalence was recorded) and 4–6 sites in the remaining IUs. Overall, 2,927 participants were surveyed and 11 were Ag-positive, giving an overall Ag prevalence of 0.4% [ 11 ]. Five positive cases were identified in Niuatoputapu district, Ongo Niua, five in Ha’apai, and one in ‘Eua. Following the C Survey, a further round of MDA was distributed in 2006 in Niuatoputapu district only.

To determine whether MDA could be stopped, a stop MDA survey (Transmission Assessment Survey [TAS]-1/D Survey) was conducted in 2007 among 2,391 First Grade children (5–6 years old) in all IUs, with no children testing positive. In addition, dried blood spots (DBS) were collected from 797/801 children from ‘Eua, Ha’apai, and Vava’u for antifilarial antibody (Ab) testing with the Filariasis ELISA (Joseph et al , 2011 [ 12 ]). An overall Bm14 Ab prevalence of 6.3% was reported among these children, though there were no Ag-positive cases, suggesting that LF transmission remained successfully interrupted [ 12 ]. Of note, the five individuals from Niuatoputapu district, Ongo Niua, who tested positive in the 2006 C Survey were retested in 2007, and three of the five remained Ag-positive [ 11 ]. No Mf testing was conducted on these individuals.

TAS-2 was conducted in all IUs in 2011 and returned no positive samples among 2,451 children. However, a survey by Chu et al. (2014), which aimed to simultaneously assess the impact of MDA on LF and soil-transmitted helminths (STH) by incorporating STH testing into TAS-2, reported 7/2,434 Ag-positive students from six schools in the IUs of Tongatapu (n = 3), Ha’apai (n = 2), and Ongo Niua (n = 2), giving an overall Ag prevalence of 0.3% [13]. A final TAS (TAS-3) was conducted in 2015 among 2,806 children from all five divisions; one child tested positive in Ongo Niua, giving an overall Ag prevalence of 0.04% and suggesting continued interruption of LF transmission in Tonga. In 2017, Tonga successfully obtained WHO’s validation of elimination of LF as a public health problem [10]. Please refer to S1 Table for details of previous LF surveys in Tonga and Fig 2 for the estimated Ag prevalence and MDA coverage in the years leading to LF elimination in Tonga.

Fig 2. Programmatic baseline surveys (A) and Transmission Assessment Surveys (B) indicating antigen prevalence and mass drug administration coverage between 2000–2015, Tonga. Asterisks indicate an antigen prevalence of 0% (no Ag-positives were found).

https://doi.org/10.1371/journal.pone.0307331.g002

Study rationale

The identification of seven positive children in the 2011 TAS-2, and one positive child in the 2015 TAS-3, suggests that there may be persistent localised areas of LF transmission in some communities in Tonga despite five rounds of nationwide MDA (and six rounds of MDA in Niuatoputapu district, Ongo Niua). There has been no community surveillance of LF since 2015, so transmission may have been re-established but not yet identified. This study will be the first to determine whether LF transmission is still occurring in Tonga five years post-validation of LF elimination and, if transmission is found, will enable early identification of LF to allow a timely response to prevent widespread recrudescence of LF in Tonga. Further, we will investigate different PVS methodologies to determine the most efficient method to identify residual LF transmission in Tonga, and thereby provide an evidence base for national and regional PVS strategies.

In WHO’s Global Programme to Eliminate Lymphatic Filariasis, school-based TAS of 6-7-year-old school children are currently used to determine whether Ag prevalence has dropped below the critical threshold of 1.0% to signify LF elimination. However, rather than conducting a study to estimate Ag prevalence, we have chosen to adopt a targeted surveillance approach that aims to sample areas/people with the highest probability of LF infection to determine the presence or absence of disease. We believe that there are several advantages to conducting targeted surveillance for PVS. Firstly, the justification for conducting TAS among 6-7-year-olds is that young children should be protected from LF infection (and thus have low or zero Ag prevalence) if previous MDA rounds were successful. Furthermore, antigenemia detected in older children and adults may be due to infections pre-dating MDA [ 14 ]. However, previous surveys have shown that TAS-like sampling of children performs poorly for detecting microfoci of ongoing transmission, with significant persistence of LF reported in adults despite areas having met elimination targets in school-based TAS [ 15 ]. Studies have also reported significantly higher Ag prevalence in adults compared to children [ 16 ], and in community surveys (especially true for people >30 years) [ 17 ]. Therefore, we have chosen to test older aged school children (in the last two Grades/Forms of primary/high school) as well as community members to determine whether this increases the likelihood of detecting Ag-positives. We have also chosen to test children in the last two years of boarding school, as this provides a unique opportunity to access older, high-school aged children who (i) may have a higher Ag prevalence than 6-7-year-olds, and (ii) are from remote island communities that can be hard to access and may otherwise be missed.

We will also conduct a clinic-based survey. We hypothesise that individuals suffering from chronic diseases may have been less likely to take MDA in previous rounds due to serious illness or co-morbidities, or concerns about taking MDA in addition to their regular medications, and may therefore be more likely to have untreated LF infections. These ‘never treated’ individuals could potentially be acting as reservoirs of infection in communities, leading to sustained LF transmission [18–20]. Integrating LF testing into pre-established screening activities for patients attending chronic disease clinics is a less resource-intensive and more convenient means of testing this potentially high-risk population.

Lastly, evidence suggests that anti-filarial Ab markers could be more sensitive measures of transmission than Ag in low prevalence settings [ 21 ], and Ab prevalence may indicate pre-antigenemic LF infection and ongoing transmission or resurgence, thereby allowing a timelier response compared to testing for Ag alone [ 21 – 23 ]. This study will concurrently measure seroprevalence of LF Ag and Abs, thereby providing an opportunity to determine the utility of Ab as a marker of LF transmission or resurgence, and assess the sensitivity of different diagnostic tools in PVS settings.

Aim and objectives

The aim of this study is to investigate different approaches for PVS of LF and provide an evidence base for developing ongoing PVS strategies in Tonga and regionally. The primary objective is to determine if there is any evidence of ongoing transmission of LF in Tonga post-validation of elimination of LF as a public health problem. In addition, we will:

  • estimate the prevalence of Mf and Ag in communities with ongoing transmission.
  • describe the demographic characteristics and geographical distribution of Ag-positive individuals (if found).
  • investigate whether anti-filarial Abs ( Wb123 Ab, Bm14 Ab, Bm33 Ab) provide an earlier signal of ongoing transmission compared to Ag.
  • determine whether seroprevalence of Abs vary between populations with recently recorded LF exposure, and populations with no recorded history of LF exposure.

Study setting

To confirm the presence/absence of LF, this study will be conducted in four different settings: primary schools (‘low-risk’ and ‘high-risk’), high schools (including boarding schools), communities (‘low-risk’ and ‘high-risk’), and a diabetes outpatient clinic based at the national hospital ( Table 1 ).


https://doi.org/10.1371/journal.pone.0307331.t001

Identification of study sites

Programmatic baseline surveys and TAS conducted between 2000–2015 were reviewed to identify ‘high-risk’ and ‘low-risk’ locations. ‘Low-risk’ locations will be used as reference populations against which LF Ag and Ab seroprevalence from the targeted, high-risk schools and villages can be compared. The six primary schools that recorded Ag-positive children in the 2011 TAS were considered ‘high-risk’ primary schools, and the communities in which these schools are located were considered ‘high-risk’ communities. An additional primary school was selected because it was in a community that recorded high Ag prevalence in the 2003 B Survey. Four primary schools that recorded no Ag-positive children in the 2011 TAS were considered low-risk reference schools; these schools were from ‘low-risk’ communities that also recorded no positive cases in the 2006 community-based C Survey. Following consultation with the MOH, boarding schools believed to have a high proportion of students boarding from outer islands were selected. Lastly, also following consultation with the MOH, the diabetes clinic at Vaiola Hospital in Nuku’alofa was selected for the clinic-based component of the study because of the large number of patients who could be recruited.
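As a hedged illustration of the risk-classification logic described in the abstract (Ag-positives recorded in more than one survey versus none), the following Python sketch classifies sites from a hypothetical table of historical survey results; the site names and counts are invented, and the actual classification was done by manual review of the 2000–2015 surveys.

```python
# Hedged sketch: classifying sites as 'high-risk' or 'low-risk' from historical survey results.
# `surveys` is a hypothetical DataFrame with one row per site per survey and a count of Ag-positives.
import pandas as pd

surveys = pd.DataFrame({
    "site":        ["School A", "School A", "School B", "School C", "School C"],
    "survey_year": [2006, 2011, 2011, 2006, 2011],
    "ag_positive": [2, 1, 0, 0, 3],
})

per_site = surveys.groupby("site")["ag_positive"].agg(
    surveys_with_positives=lambda s: (s > 0).sum(),
    total_positives="sum",
)

def classify(row):
    if row["surveys_with_positives"] > 1:      # Ag-positives recorded in more than one survey
        return "high-risk"
    if row["total_positives"] == 0:            # no Ag-positives ever recorded
        return "low-risk"
    return "indeterminate"                     # requires case-by-case review

per_site["risk"] = per_site.apply(classify, axis=1)
print(per_site)
```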

Participant selection and target sample size

At selected schools, all students in Grades 5–6 of primary school, and Forms 6–7 of high school, will be enrolled in the study. Based on estimated student enrolment numbers from the Tonga Statistics Department [ 24 ], we estimate approximately 55 students to be enrolled per primary school (25 students in Grade 5 and 30 students in Grade 6), and 50 students per high school (25 students in Form 6 and 25 students in Form 7). In schools with low enrolment numbers, eligible grades in primary school will be expanded to include Grades 3–4 and Forms 1–2, and in high schools will be expanded to include Forms 4–5 and Technical and Vocational Education (TVET) students. If any Ag-positive students are identified, they will be followed-up at home and all household members ≥5 years old will be offered testing.

For the community survey, a list of households in the selected villages will be obtained from the Tonga Statistics Department. All households will be assigned a number from one to the total number of households per village. The total number of households in each village will be entered into a Random Number Generator. Fifteen unique random numbers between one and the total number of households per village will be generated. We will survey the households represented by these numbers. An additional five households per village will be selected in case any of the selected households are uninhabited or the inhabitants cannot be reached. Based on Tonga’s Population and Housing Census 2021 [ 25 ], we estimate five participants per household and 75 participants per community.
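A minimal sketch of the household sampling step described above, drawing 15 primary and 5 reserve household numbers per village with Python's standard library; the village names and household counts are placeholders, not the study's sampling frame.

```python
# Hedged sketch: selecting 15 primary and 5 reserve households per village by simple random sampling.
import random

households_per_village = {"Village A": 120, "Village B": 85}  # placeholder counts

random.seed(2024)  # fixed seed so the selection is reproducible/auditable
for village, n_households in households_per_village.items():
    drawn = random.sample(range(1, n_households + 1), k=20)   # 20 unique household numbers
    primary, reserve = drawn[:15], drawn[15:]                  # 15 to survey, 5 reserves
    print(village, "primary:", sorted(primary), "reserve:", sorted(reserve))
```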

For the clinic-based survey, all patients attending the diabetes clinic at Vaiola Hospital, Nuku’alofa, during the survey weeks will be invited to participate. The target sample size will be 200 patients (if no Ag-positives are detected, this provides an upper 95% CI of 1.5%). If any Ag-positive patients are identified, they will be followed-up at home and all household members ≥5 years old will be tested.
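As a quick check of the stated sample-size property, the 1.5% figure is consistent with the exact one-sided upper 95% confidence bound on prevalence when 0 of 200 patients test positive (a two-sided Clopper-Pearson interval would give roughly 1.8%); this interpretation is an assumption, as the protocol does not state which bound was used.

```python
# Hedged check: if 0 of n = 200 patients test Ag-positive, the exact one-sided
# upper 95% confidence bound on prevalence is 1 - 0.05**(1/n), roughly 1.5%.
n, alpha = 200, 0.05
upper_bound = 1 - alpha ** (1 / n)
print(f"Upper 95% bound with 0/{n} positives: {upper_bound:.2%}")  # ~1.49%
```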

The study has been approved by the Tongan National Health Ethics and Research Committee of the Tongan Ministry of Health (Ref#20240129) and ratified by The University of Queensland’s Human Research Ethics Committee (Project#2024/HE000493).

Eligibility criteria and consent process

For the school-based survey, all children in the selected Grades or Forms will be invited to participate. Approximately one week prior to the survey, school principals will be informed about the survey by a team member. Principals or teachers will be asked to distribute participant information sheets and consent forms to all children in the selected Grades or Forms and children will be asked to give these forms to their parents to sign. On the day of the survey, team members will collect signed consent forms and enrol consented students into the study.

For the community-based survey, all household members aged ≥5 years will be eligible to participate. A person will be considered a household member if they slept at the house the previous night or identify the house as their primary residence. Upon arrival at the household, a standard participant information sheet and consent form will be administered to all study participants. A team member will explain the purposes of the study and seek written consent (parental or guardian consent in the case of children <18 years of age) to participate in the study. The total number of household members, including those who are absent, will be recorded.

At the diabetes clinic, patients will be recruited while they are in the waiting area prior to their scheduled appointment. A team member will explain the purposes of the study and seek written consent (parental or guardian consent in the case of children <18 years of age) to participate in the study.

All participants will be advised that they can revoke their consent at any time without any prejudice.

Questionnaires

Standardized electronic questionnaires will be administered using EpiCollect5 [26]. Questionnaires will be used to collect demographic information (including age, sex, country/place of birth, village of residence, and occupation) and travel history in the past 12 months (both international and within Tonga). At selected households, schools, and the chronic disease clinic, an environmental assessment will be conducted by trained Environmental Health Officers as part of their routine inspections. These inspections will include an assessment of the materials of the household/school/clinic structure, water sourcing and storage, and potential vector breeding sites. GPS coordinates of each enrolled household will be recorded.

Blood collection and laboratory testing

For each enrolled participant, at least 300μL of blood will be collected by finger prick into heparin-coated BD Microtainer® Blood Collection Tubes. Alere™ Filariasis Test Strips (FTS) (Abbott, Scarborough, ME) will be used to detect LF antigen. For any Ag-positive samples, Mf slides will be prepared using methods described previously in the literature [ 27 ]. DBS will be collected from all individuals (irrespective of Ag positivity) for Multiplex Bead Assays (MBA) to detect anti-filarial Abs using methods described previously in the literature [ 21 ].

Sample and data linkage

To enable linkage of demographic variables, environmental variables, and FTS/DBS results collected in EpiCollect5, each participant will be given a unique identifying number that will be printed as QR code stickers and attached to consent forms, questionnaires, blood collection tubes, FTS, slides, and dried blood spots.
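A hedged sketch of how unique identifiers and matching QR code images could be generated; it assumes the third-party qrcode package (with Pillow) and an invented identifier format, neither of which is specified in the protocol.

```python
# Hedged sketch: generating unique participant identifiers and matching QR code images.
# Assumes the third-party `qrcode` package (with Pillow) is installed; the ID format is illustrative.
import uuid
import qrcode

def make_participant_label(prefix: str = "TONGA-LF") -> str:
    return f"{prefix}-{uuid.uuid4().hex[:8].upper()}"  # e.g. TONGA-LF-3F9A1C2B

participant_id = make_participant_label()
img = qrcode.make(participant_id)          # QR code encoding the identifier
img.save(f"{participant_id}.png")          # printed as stickers for forms, tubes, FTS, and DBS
print("Generated ID:", participant_id)
```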

Data analysis plan

The primary outcome measure to signal the presence of LF in Tonga will be a positive Ag test. Crude Ag and Ab prevalence with 95% confidence intervals (CI) will be estimated using binomial exact methods. Seroprevalence of LF Ab will be estimated by measuring IgG responses using MBA with Ag-specific cut-off values (measured by Median Fluorescence Intensity [MFI-bg]) used to determine seropositivity. Ag and Ab prevalence estimates will be adjusted for survey design and sex distribution for the school-based surveys, and adjusted for survey design, sex, and age distribution for the community-based survey based on Tonga’s Population and Housing Census 2021 [ 25 ].

Differences in demographic characteristics between Ag/Ab-positive and -negative individuals will be described using mean ± standard deviation (SD), median [interquartile range], or number (%), and tested using Student's t-test or the Mann–Whitney U test for continuous data and Pearson's chi-squared test of independence or Fisher's exact test for categorical data. Logistic regression will be used to assess associations between demographic variables and Ag/Ab positivity. Any variables with p < 0.2 on univariate analyses will be tested using multivariable logistic regression. Variables will be screened for potential collinearity using a variance inflation factor < 5, and final models will be selected using backward elimination, wherein variables are sequentially removed from the multivariable models to arrive at the most parsimonious models, in which variables with p < 0.05 will be retained.
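The following sketch illustrates the collinearity screen and logistic regression step with statsmodels on a hypothetical analysis DataFrame `df`; the covariate names are invented (and assumed to be numeric or 0/1 coded), and the backward-elimination loop is omitted for brevity.

```python
# Hedged sketch: multivariable logistic regression with a variance inflation factor (VIF) screen.
# `df` is a hypothetical DataFrame with an `ag_positive` outcome column (0/1) and numeric covariates.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

covariates = ["age", "sex", "travel_12m"]          # illustrative variable names
X = sm.add_constant(df[covariates].astype(float))  # design matrix with intercept

# VIF < 5 screen for collinearity (the constant term is skipped).
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print("VIFs:", vifs)

model = sm.Logit(df["ag_positive"], X).fit()
print(model.summary())
```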

If possible, the sensitivity of Ag vs Ab to detect LF transmission in the post-validation period will be determined as the percentage of individuals with a positive FTS test among those testing Ab-positive using MBA. A kappa statistic will be calculated to estimate the strength of agreement between FTS and Ab results. Agreement will be interpreted as follows: k < 0.00 indicates no agreement, k 0.00–0.20 poor agreement, k 0.21–0.40 fair agreement, k 0.41–0.60 moderate agreement, k 0.61–0.80 substantial agreement, and k 0.81–1.00 almost perfect agreement. Lastly, seroprevalence estimates and mean MFI-bg values will be compared between communities with a history of high LF transmission and ‘low-risk’ reference communities to test for significant differences.
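A short illustration of the binomial exact confidence interval and kappa calculations using statsmodels and scikit-learn; all counts and result arrays below are invented placeholders, not study data.

```python
# Hedged sketch: binomial exact (Clopper-Pearson) prevalence CI and FTS vs Ab agreement.
import numpy as np
from statsmodels.stats.proportion import proportion_confint
from sklearn.metrics import cohen_kappa_score

n_positive, n_tested = 3, 600                                    # placeholder counts
low, high = proportion_confint(n_positive, n_tested, alpha=0.05, method="beta")  # exact method
print(f"Ag prevalence: {n_positive / n_tested:.2%} (95% CI {low:.2%}-{high:.2%})")

fts_result = np.array([0, 0, 1, 0, 1, 0])                        # hypothetical 0/1 FTS results
ab_result  = np.array([0, 1, 1, 0, 1, 0])                        # hypothetical 0/1 Ab results
kappa = cohen_kappa_score(fts_result, ab_result)
print(f"Kappa (FTS vs Ab): {kappa:.2f}")                         # interpret with the bands above
```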

Recruitment

Recruitment of participants began on 8th May 2024 and is anticipated to be completed by 31st July 2024.

This study is a unique opportunity to test the effectiveness of PVS strategies in the Pacific Island setting, which may provide evidence for strategies that will be applicable to countries and territories with similar contexts. The methodology adopted in this study will provide an evidence base to develop PVS of LF in Tonga and other Pacific Island countries that are at a similar stage of LF elimination. In this protocol, we propose adopting a targeted sampling approach of ‘high-risk’ individuals from areas with historically high Ag prevalence based on survey data, and ‘low-risk’ individuals from areas where no Ag-positive individuals were identified in previous surveys who can serve as a reference population. A targeted sampling approach has the advantage of being a cost-effective means to detect the presence or absence of LF.

The results of this survey will allow us to understand the current status of LF in Tonga. This information will be used to develop the next phase of activities and the development of an appropriate strategic response to the following scenarios:

  • If no Ag-positive individuals are identified in areas where LF transmission is most likely to be found, we can be highly confident that LF has been eliminated as a public health problem in Tonga;
  • If there is low Ag prevalence, our results will be useful to inform a more extensive survey, and ongoing PVS strategy for Tonga; and
  • If high prevalence of Ag-positive, or any Mf-positive individuals are identified, more intensive surveillance and targeted MDA could be considered.

Depending on the study’s findings, other potential strategies for ongoing surveillance of LF could be considered, including opportunistic testing of blood samples from blood donors, antenatal clinics, pre-employment medicals, military recruits, and/or routine blood tests.

Lastly, we propose collecting DBS for MBA that will provide opportunities to measure the seroprevalence of anti-filarial Abs. MBA analysis will facilitate an assessment of the utility of Ab serosurveillance to signal LF transmission or resurgence, thereby providing evidence for PVS strategies that will be applicable to similar contexts in the region. Estimating Ab seroprevalence will provide opportunities to (i) further investigate the interpretation of Ab signals; and (ii) assess the sensitivity of Ab vs Ag to signal ongoing LF transmission, thereby providing evidence for PVS strategies that will be applicable to similar contexts in the region.

Supporting information

S1 Table. Detailed timeline of milestones towards LF elimination in Tonga.

https://doi.org/10.1371/journal.pone.0307331.s001

  • 2. Sasa M. Human filariasis. A global survey of epidemiology and control. University Park Press; 1976.
  • 9. World Health Organization. Global programme to eliminate lymphatic filariasis: monitoring and epidemiological assessment of mass drug administration. World Health Organization, Geneva, Switzerland. 2011.
  • 24. Tonga Statistics Department.

  • Open access
  • Published: 24 August 2024

A scoping review of large language model based approaches for information extraction from radiology reports

  • Daniel Reichenpfader   ORCID: orcid.org/0000-0002-8052-3359 1 , 2 ,
  • Henning Müller   ORCID: orcid.org/0000-0001-6800-9878 3 , 4 &
  • Kerstin Denecke   ORCID: orcid.org/0000-0001-6691-396X 1  

npj Digital Medicine volume 7, Article number: 222 (2024)


  • Computer science
  • Institutions
  • Medical imaging

Radiological imaging is a globally prevalent diagnostic method, yet the free text contained in radiology reports is not frequently used for secondary purposes. Natural Language Processing can provide structured data retrieved from these reports. This paper provides a summary of the current state of research on Large Language Model (LLM) based approaches for information extraction (IE) from radiology reports. We conduct a scoping review that follows the PRISMA-ScR guideline. Queries of five databases were conducted on August 1st 2023. Among the 34 studies that met inclusion criteria, only pre-transformer and encoder-based models are described. External validation shows a general performance decrease, although LLMs might improve generalizability of IE approaches. Reports related to CT and MRI examinations, as well as thoracic reports, prevail. Most common challenges reported are missing validation on external data and augmentation of the described methods. Different reporting granularities affect the comparability and transparency of approaches.


Introduction

In contemporary medicine, diagnostic tests, particularly various forms of radiological imaging, are vital for informed decision-making 1 . Radiologists create semi-structured free-text reports for imaging examinations by dictation, following a personal or institutional schema to organize the information they contain. Structured reporting, on the other hand, which is used in only a few institutions and for specific cases, offers a possibility to enhance automatic analysis of reports by defining standardized report layouts and contents.

Despite the potential benefits of structured reporting in radiology, its implementation often encounters resistance due to the possible temporary increase in radiologists’ workload, rendering the integration into clinical practice challenging 2 . Natural language processing (NLP) can provide the means to make structured information available by maintaining existing documentation procedures. NLP is defined as “tract of artificial intelligence and linguistics, devoted to making computers understand the statements or words written in human languages” 3 . Applied on radiology reports, methods related to NLP can extract clinically relevant information. Specifically, information extraction (IE) provides techniques to use this clinical information for secondary purposes, such as prediction, quality assurance or research.

IE, a subfield of NLP, involves extracting pertinent information from free text. Subtasks include named entity recognition (NER), relation extraction (RE), and template filling. These subtasks are realized using heuristic-based methods, machine learning-based techniques (e.g., support vector machines or Naïve Bayes), and deep learning-based methods 4 . Within the field of deep learning, a new class of models has recently emerged: large language models (LLMs).

LLMs are “deep learning models with a huge number of parameters trained in an unsupervised way on large volumes of text” 5 . These models typically exceed one million parameters and have proven highly effective in information extraction tasks. The transformer architecture, introduced in 2017, serves as the foundation for most contemporary LLMs and comprises two distinct architectural blocks: the encoder and the decoder. Both blocks apply an innovative approach to creating contextualized word embeddings called attention 6 . Before the ongoing “age of transformers”, recurrent neural network (RNN)-based models were regarded as state of the art for creating contextualized word embeddings. ELMo, a language model based on a bidirectional Long Short-Term Memory (BiLSTM) network 7 , is an example thereof. Noteworthy transformer-based LLMs include encoder-based models like BERT (2018) 8 , decoder-based models like GPT-3 (2020) 9 and GPT-4 (2023) 10 , as well as models applying both encoder and decoder blocks, e.g., Megatron-LM (2019) 11 . Models continue to evolve, being trained on expanding datasets and consistently surpassing the performance benchmarks established by previous state-of-the-art models. This raises the question of how these new models shape IE applied to radiology reports.
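To make the encoder-based setting concrete, the following minimal sketch (not taken from any of the included studies; the checkpoint name, label set and example sentence are illustrative assumptions) shows how a BERT-style encoder can be combined with a token-classification head, the typical setup for NER-based IE:

```python
# Minimal sketch of encoder-based token classification (NER) with the
# Hugging Face "transformers" library; checkpoint and labels are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "bert-base-cased"  # assumption: any BERT-style encoder checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# num_labels reflects a hypothetical information model, e.g. B-FINDING, I-FINDING, O
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=3)

sentence = "No focal consolidation is seen in the right lower lobe."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, sequence_length, num_labels)
predicted_label_ids = logits.argmax(dim=-1)  # one label id per sub-word token
print(predicted_label_ids)
```

In practice, such a classification head is first fine-tuned on annotated reports before its predictions become meaningful.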

Regarding existing literature on IE from radiology reports, several reviews are available, although these sources either miss current developments or focus only on a specific aspect or clinical domain, see Table 1 . The application of NLP to radiology reports for IE has already been the subject of two systematic reviews, in 2016 12 and 2021 13 . While the former is not freely available, the latter searched only Google Scholar and includes only one study based on LLMs. Davidson et al. focused on comparing the quality of studies applying NLP-related methods to radiology reports 14 . More recent reviews include a scoping review on the application of NLP to reports related to breast cancer 15 , a review on the extraction of cancer concepts from clinical notes 16 , and a systematic review on BERT-based NLP applications in radiology without a specific focus on information extraction 17 .

As LLMs have only recently gained strong momentum, a research gap exists: no overview of LLM-based approaches for IE from radiology reports is available. With this scoping review, we therefore intend to answer the following research question:

What is the state of research regarding information extraction from free-text radiology reports based on LLMs?

Specifically, we are interested in the subquestions that arise from the posed research question:

RQ.01 - Performance: What is the performance of LLMs for information extraction from radiology reports?

RQ.02 - Training and Modeling: Which models are used and how is the pre-training and fine-tuning process designed?

RQ.03 - Use cases: Which modalities and anatomical regions do the analyzed reports correspond to?

RQ.04 - Data and annotation: How much data was used to train the model, how was the annotation process designed and is the data publicly available?

RQ.05 - Challenges: What are open challenges and common limitations of existing approaches?

The objective of this scoping review is to answer the above-mentioned questions, provide an overview of recent developments, identify key trends and highlight future research by identifying outstanding challenges and limitations of current approaches.

Study selection

As shown in Fig. 1 , the systematic search yielded 1,237 records, retrieved from five databases. After removing duplicate records and records published before 2018, 374 records (title, abstract) were screened for eligibility. The screening process resulted in the exclusion of 302 records. The remaining 72 records were sought for full-text retrieval, of which 68 could be retrieved. During data extraction, 43 papers were excluded because they did not fulfill the inclusion criteria, which had not been apparent from the information provided in the abstract.

Fig. 1: Querying of five databases resulted in a total of 1237 sources of evidence eligible for screening. This number was reduced to 374 after deduplication and removal based on publication year. Eventually, 34 studies were included in this review after completion of the screening process.

Within the cited references of included papers, nine additional papers fulfilling all inclusion criteria were identified. Therefore, following the above-mentioned methodology, 34 records in total were included in this review.

Study characteristics

In the following, we organize the extracted information according to the structure of the extraction table, which in turn reflects the defined research questions. This review covers studies published between 01/01/2018 and 01/08/2023. The earliest included study was published in 2019. After eight included studies published in 2020, the topic reached its peak with eleven studies published in 2021. Eight studies published in 2022 were included, and six included studies were published in the first half of 2023.

Based on the corresponding author's address, 15 out of 34 papers are located in the USA, followed by six in China and three each in the UK and Germany. Other countries include Austria ( n  = 1), Canada ( n  = 2), Japan ( n  = 2), Spain ( n  = 1) and The Netherlands ( n  = 1) (Table 2 ).

Extracted information

This section describes the NLP tasks, the extracted entities, the information model development process and the data normalization strategies of the included studies.

Extracted concepts encompass various entities, attributes, and relations. These concepts relate to abnormalities 18 , 19 , 20 , anatomical information 21 , breast-cancer related concepts 22 , clinical findings 23 , 24 , 25 , devices 26 , diagnoses 27 , 28 , observations 29 , pathological concepts 30 , protected health information (PHI) 31 , recommendations 32 , scores (TI-RADS 33 , tumor response categories 34 ), spatial expressions 35 , 36 , 37 , staging-related information 38 , 39 , and stroke phenotypes 40 . Several papers extract various concepts, e.g., ref. 41 .

Studies solely describing document-level single-label classification were excluded from this review. Two studies apply document-level multi-class classification. Document-level multi-label classification is described in nine studies (26%), of which three classify more than two classes for each entity. The majority of the included studies ( n  = 21, 62%) describe NER methods; ten studies additionally apply RE methods. These studies encompass sequence-labeling and span-labeling approaches. Question answering (QA)-based methods are described in two studies, see Fig. 2 .

Fig. 2: The circles contain the absolute number of studies per task. NER = named entity recognition, RE = relation extraction, ML-CL = binary multi-label classification, MC-CL = multi-class classification, QA = question answering.

The number of extracted concepts (including entities, attributes, and relations) ranges from one entity in both papers describing multi-class classification 33 , 34 up to 64 entities described in a NER-based study 30 .

Three studies base their information model on clinical guidelines, namely the Response evaluation criteria in solid tumors 42 and the TNM Classification of Malignant Tumors (TNM) staging system 43 . Development by domain experts ( n  = 2), references to previous studies ( n  = 3), regulations of the Health Insurance Portability and Accountability Act 44 ( n  = 1), the Stanza radiology model 45 ( n  = 1) and references to previously developed schemes ( n  = 2) are other foundations for information model development. One study provides detailed information about the development process of the information model as supplementary information 19 . One study reports development of their information model based on the RadLex terminology 46 , another based on the National Cancer Institute Thesaurus 47 . 21 studies (62%) do not report any details regarding the development of the information model.

Out of the 34 included studies, only three describe methods to structure and/or normalize extracted information. While Torres-Lopez et al. apply rule-based methods to structure extracted data based on entity positions and combinations 30 , Sugimoto et al. additionally apply rule-based normalization based on a concept table 24 . Datta et al. describe a hybrid approach to normalize extracted entities by first generating concept candidates with BM25, a ranking algorithm, and then choosing the best equivalent with a BERT-based classifier 48 .
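As a rough illustration of such a two-stage normalization pipeline (a sketch in the spirit of the described approach, not the authors' code; the concept table and the re-ranking step are invented placeholders), BM25 can propose candidate concepts that a fine-tuned classifier would then re-rank:

```python
# Sketch of BM25 candidate generation followed by a (placeholder) re-ranking
# step; uses the rank_bm25 package, concept strings are invented examples.
from rank_bm25 import BM25Okapi

concept_table = [
    "pleural effusion",
    "pulmonary edema",
    "pneumothorax",
    "consolidation of the lung",
]
bm25 = BM25Okapi([concept.split() for concept in concept_table])

mention = "small left-sided effusion"
candidates = bm25.get_top_n(mention.split(), concept_table, n=2)
print(candidates)  # candidate concepts ranked by lexical overlap

def rerank(mention: str, candidates: list[str]) -> str:
    # Placeholder for a BERT-based classifier scoring (mention, candidate) pairs.
    return candidates[0]

print(rerank(mention, candidates))
```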

Regarding the distribution of annotated entities within the datasets, only one study reports on having conducted measures to counteract class imbalance 19 . Another study reports on not having used F1 score as a performance measure, as the F1 score is not suited when class imbalances are present 27 . Four studies (12%) report coarse entity distributions and seven studies (21%) describe granular entity distributions.

In the following, details regarding the reported model architectures and implementations are described, including base models, (further) pre-training and fine-tuning methods, hyperparameters, performance measures, external validation and hardware details.

For an overview of applied model architectures, see Table 3 . 28 out of 34 papers (82%) describe at least one transformer-based architecture, while the remaining six studies apply various adaptions of the Bidirectional Long Short-Term Memory (Bi-LSTM) architecture. Out of the 28 studies that describe transformer-based architectures, 27 are based on the BERT architecture 8 and one is based on the ERNIE architecture 49 . Eight studies (24%) describe further pre-training of a BERT-based, pre-trained model on in-house data. Eighteen studies (53%) use a BERT-based, pre-trained model without further pre-training. One study applies pre-training to other layers than the LLM. Two studies do not provide any details regarding the architecture of the BERT models. One study combines both BERT- and BiLSTM-based architectures 28 . Out of six studies that describe only BiLSTM-based architectures, two studies apply pre-training of word vectors based on word2vec 50 . 31 studies (91%) provide sufficient details about the fine-tuning process. Three studies do not provide details 24 , 39 , 51 .

Reported performance measures vary between included studies and include traditional measures like precision, recall, and accuracy as well as different variations of the F1 score (micro, macro, averaged, weighted, pooled). The performance of studies reporting an F1-score variation (including micro-, macro-, pooled, generalized, exact-match and weighted F1) is compared in Table 4 . If a study describes multiple models, the score of the best model was chosen. If two or more datasets are compared, the higher score was chosen. If applicable, the result of external validation is also presented. 22 studies (65%) report having conducted statistical tests, including cross-validation, the McNemar test, the Mann-Whitney U test and the Tukey-Kramer test.

Hyperparameters used to train the models (e.g., learning rate, batch size, embedding dimensions) are described in 28 studies (82%), however with varying degree of detail. Six studies (18%) do not report any details on hyperparameters. Seven studies (21%) describe a validation of their algorithm on training data from an external institution. Seven studies (21%) include details about hardware and computational resources spent during the training process.

In this section, we describe the study characteristics related to data sets, encompassing number of reports, data splits, modalities, anatomic regions, origin, language, and ethics approval.

Data set size used for fine-tuning ranges from 50 to 10,155 reports. The amount of external validation data ranges from 10% to 31% of the amount of data used for fine-tuning. For further pre-training of transformer-based architectures, 50,000 up to 3.8 million reports are used. Jantscher et al. additionally use the content of a public clinical knowledge platform ( DocCheck Flexicon 52 ) 53 . Zhang et al. only report the amount of data (3 GB) 54 . Jaiswal et al. performed further pre-training on the complete MIMIC-CXR corpus 29 . Two studies that described pre-training of word embeddings for Bi-LSTM-based architectures used 3.3 million and 317,130 reports, respectively 24 , 32 .

Data splits vary widely; the majority of studies ( n  = 23, 68%) divide their data into three sets, namely train-, validation- and test-set, with the most common split being 80/10/10, respectively. This split variation is reported in eight studies (24%). Seven studies (21%) use two sets only, four studies (12%) apply cross-validation-based methods.

19 studies (56%) describe the timeframe within which reports had been extracted. Dada et al. report the longest timeframe of 22 years, using reports between 1999 and 2021 for further pre-training 41 . The shortest timeframe reported is less than one year (2020–2021) 26 .

Several studies are based on publicly available datasets: MIMIC-CXR 55 was used once 29 while MIMIC 56 was used by two studies 40 , 57 . MIMIC-III 58 was used by six studies (18%) 37 , 40 , 48 , 57 , 59 , 60 . The Indiana chest X-ray collection 61 was used twice 35 , 36 . For external validation, MIMIC-II was applied by Mithun et al. 62 and MIMIC-CXR by Lau et al. 23 . While some of these studies use the datasets as-is, some perform additional annotation. Other studies use data from hospitals, hospital networks, other tertiary care institutions, medical big data companies, research centers, care centers or university research repositories.

Figures 3 and 4 show the frequencies of modalities and anatomical regions, respectively. Note that frequencies were counted on study-level and not weighted by the number of reports.

Fig. 3: The diagram shows absolute numbers of mentioned modalities. Several studies use reports obtained from multiple modalities. Other modalities include positron emission tomography-computed tomography (PET-CT) ( n  = 1) and ultrasound ( n  = 2). Three studies did not explicitly mention associated modalities. CT = computed tomography, MRI = magnetic resonance imaging.

Fig. 4: The diagram shows absolute numbers of mentioned anatomical regions. Several studies use reports corresponding to multiple anatomical regions. Other anatomical regions include the heart, abdomen, pelvis, “all body regions”, nose and thyroid ( n  = 1 each) and breast ( n  = 2). Four studies did not explicitly mention associated anatomical regions.

Report language was inferred from the location of the corresponding author's institution: most studies use English reports ( n  = 21, 62%), followed by Chinese ( n  = 6, 18%), German ( n  = 4, 12%), Japanese ( n  = 2, 6%) and Spanish ( n  = 1). The corresponding author of one study is located in the Netherlands, but the study uses data from an Indian hospital 62 .

19 studies (56%) explicitly state that the endeavor was approved by either a national committee or agency ( n  = 3, 9%) or a local institutional or hospital review board or committee ( n  = 15, 44%). One study reports approval only for in-house data, but not for the external validation set from another institution 33 .

Annotation process

28 studies (82%) describe an exclusively manual annotation process. Five studies (15%) explicitly state that each report was annotated by two persons independently. Lau et al. use annotated data to train a classifier that supports the annotation process by proposing only documents that contain potential annotations 32 . Two studies use tools for automated annotation with manual correction and review 29 , 31 . Lybarger et al. do not provide details on their augmentation of an existing dataset 21 ; three others do not report details as they either extract information available in the hospital information system 33 or exclusively use existing annotated datasets 36 , 59 .

Annotation tagging schemes mentioned include IOB(2), BISO and BIOES (short for beginning, inside, outside, start, end). The number of involved annotators ranges from one to five; roles include clinical coordinators, radiologists, radiology residents, medical and graduate students, medical informatics engineers, neurologists, neuro-radiologists, surgeons, radiological technologists and internists. Existing annotation guidelines are reported by three studies; four studies mention that instructions exist but do not provide details. 23 studies (68%) do not provide any information regarding annotation guidelines.

Inter-annotator-agreement (IAA) is reported by 23 (68%) studies. Measures include F1 score variants ( n  = 8, 24%), Cohen kappa ( n  = 7, 21%), Fleiss kappa ( n  = 19, 56%) and the intraclass correlation coefficient ( n  = 1). IAA results are reported by 16 studies (47%) and range, for Cohen kappa, from 81% to 93.7%. Eleven studies (32%) mention the tool used for annotation, including Brat 23 , 37 , 39 , 48 , 53 , 60 , Doccano 34 , TagEditor 30 , Talen 46 and two self-developed tools 19 , 63 .
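For illustration, Cohen's kappa for two annotators can be computed as follows (a toy sketch with synthetic labels, not data from any included study):

```python
# Toy computation of Cohen's kappa for two annotators using scikit-learn;
# the label sequences are synthetic.
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["finding", "normal", "finding", "finding", "normal", "normal"]
annotator_2 = ["finding", "normal", "finding", "normal", "normal", "normal"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(round(kappa, 2))  # 0.67 for this synthetic example
```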

Data and source code availability

Five studies (15%) state that data is available upon request. One study claims availability, although there is no data present in the referenced online repository 57 . One study published its dataset in a GitHub repository 35 . One study only uses annotations provided within a dataset with credentialed access 59 . The remaining 22 studies (65%) do not mention whether data is available or not. Regarding source code availability, ten studies (29%) claim their code to be available. The remaining 24 studies (71%) do not mention whether the source code is available or not.

Challenges and limitations

Various aspects related to limitations and challenges are described. The most commonly mentioned limitation is that studies use only data from a single institution 21 , 22 , 24 , 30 , 36 , 51 , 53 . Similarly, multiple studies mention validation on external or multi-institutional data as a future research direction 19 , 26 , 59 . Two studies mention the need for semantic enrichment or normalization of extracted information 48 , 54 .

Many studies report intentions to extend their described approaches to other report types 21 , 28 , 30 , 37 or report sections 22 , or to include other or more data sources 35 , 39 , 54 , entities 32 , 62 , body parts 46 , clinical contexts 34 or modalities 35 , 53 , 59 .

Additional limitations include the application to only a single modality or clinical area 21 , 46 , 53 , small dataset size 27 , 32 , 54 , technical limitations 27 , 63 , no negation detection 35 , 62 , few extracted entities 24 , 28 or result degradation upon evaluation on external data 19 or more recent reports 25 . Missing interpretability is mentioned by two studies 28 , 41 .

Performance measures reported in Table 4 cannot be compared directly due to differences in datasets, the number of extracted concepts and the heterogeneity of applied performance measures. External validation performed by six studies shows, in general, lower performance when the algorithm is applied to external data, i.e., data from a source different from the one used for training. The largest performance drop of 35% (overall F1 score) was reported in a Bi-LSTM-based study performing multi-label binary classification of only three entities on the document level 62 . In contrast, Torres-Lopez et al. extracted a total of 64 entities with a performance drop of only 3.16% (F1 score), although without providing details on their model architecture. The smallest performance drop amounts to only 0.74% (micro F1) for extracting seven entities based on a further pre-trained model 46 . However, it cannot be assumed that further pre-training generally increases model generalizability and therefore performance.

Upon analysis of performance, several inconsistencies between included studies impair comparability: First, there is no standardized measure or best practice to assess model performance for information extraction. Although the F1 score is generally the most frequently applied and best-known measure, many variations exist, including micro-, macro-, exact- and inexact-match scores, the weighted F1 score and 1-margin F1 scores. Furthermore, Zaman et al. argue that the macro-averaged F1 score or overall accuracy are not suited as performance measures when class imbalances are present 27 . For the same reason, Wood et al. use the F1 score only to assess binary classification and not multi-class classification 19 .
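The practical difference between these averaging schemes can be large under class imbalance, as the following toy example with synthetic labels illustrates:

```python
# Synthetic illustration of micro- vs macro-averaged F1 under class imbalance.
from sklearn.metrics import f1_score

y_true = ["normal"] * 18 + ["tumor"] * 2      # rare positive class
y_pred = ["normal"] * 20                      # the rare class is never predicted

print(f1_score(y_true, y_pred, average="micro"))  # 0.90, dominated by the majority class
print(f1_score(y_true, y_pred, average="macro"))  # ~0.47, penalizes the missed class
```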

While 22 studies apply some variation of cross-validation to assess model performance, 12 studies apply simple split validation methods. Singh et al. show that if data sets are small, simple split validation shows significant differences of performance measures compared to cross-validation 64 .

Specific statistical tests to compare the performance of different models include DeLong’s test to compare areas under the ROC curve 19 , 27 , the Tukey-Kramer method for multiple comparison analysis 46 and the McNemar test to compare the agreement between two models 22 . However, the appropriateness of each test method remains unclear, as shown by Demler et al. 65 .

In general, the equations used to compute performance metrics should always be included in the manuscript to improve understandability, e.g., as done in refs. 22 and 30 . To improve comparability of studies, scores for each class as well as a reasonable aggregated score over all classes should be reported.
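For reference, the standard definitions read as follows (these are the well-established formulas, not taken from a specific included study):

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

% Macro-averaging computes F1 per class c and averages over C classes;
% micro-averaging pools TP, FP and FN over all classes before computing F1.
F_1^{\text{macro}} = \frac{1}{C}\sum_{c=1}^{C} F_1^{(c)}, \qquad
F_1^{\text{micro}} = F_1\!\left(\textstyle\sum_c TP_c,\; \sum_c FP_c,\; \sum_c FN_c\right)
```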

This review identified only encoder-based or pre-transformer architectures and no generative models, such as GPT-4 (released in March 2023). The majority of the described models is based on the encoder-only BERT architecture, first described by Devlin et al. 8 . We see multiple possible reasons: First, while having been available since 2018 66 , generative models first needed time to become established as a new technology to be investigated and applied in the healthcare sector. Second, early generative models might have demonstrated poor performance due to their relatively small size and lack of domain-specific data for pre-training 67 . Third, poor performance might also be related to model hallucinations: Farquhar et al. define hallucination as “answering unreliably or without necessary information” 68 . Hallucinations include, among others, provision of wrong answers due to erroneous training data, lying in pursuit of a reward or errors related to reasoning and generalization 68 . In contrast, encoder-only models like the BERT architecture cannot hallucinate as they provide only context-aware embeddings of input data; the actual NLP task (e.g., sequence labeling, classification or regression) is performed by a relatively simple, downstream neural network, rendering this architecture more transparent and verifiable than generative models.

An advantage of LLMs is their capability to be customized to a specific language or domain (e.g., medicine). First, a base version of the model is trained using a large amount of unlabeled data; this process is called pre-training. The concept of transfer learning enables researchers to further customize a pre-trained model to a more specific domain (e.g., the clinical domain, another language or the reports of a certain hospital); this is also referred to as further pre-training. The process of training the model to perform a particular NLP task (e.g., classification) based on labeled data is called fine-tuning. These terms (pre-training, further pre-training, transfer learning and fine-tuning) tend to be confused by authors or replaced by other term variants, e.g., “supervised learning”. However, it is imperative to use clear and concise language to distinguish between the concepts mentioned above.
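To keep these terms apart, the following sketch (real Hugging Face APIs, but placeholder checkpoint, data and hyperparameters) contrasts further pre-training, which is unsupervised, with fine-tuning, which requires annotated data:

```python
# Sketch: further pre-training (unsupervised MLM) vs. fine-tuning (supervised
# task head); checkpoint name and the training data are placeholders.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          AutoModelForTokenClassification,
                          DataCollatorForLanguageModeling)

base = "bert-base-cased"                      # a generic pre-trained encoder
tokenizer = AutoTokenizer.from_pretrained(base)

# Further pre-training: continue masked-language-model training on a large
# corpus of unlabeled in-house radiology reports (no annotations required).
mlm_model = AutoModelForMaskedLM.from_pretrained(base)
mlm_collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
# Trainer(model=mlm_model, data_collator=mlm_collator, train_dataset=...).train()

# Fine-tuning: attach a task-specific head (here token classification) and
# train it on a much smaller, manually annotated dataset. In practice, `base`
# would be replaced by the checkpoint saved after further pre-training.
ner_model = AutoModelForTokenClassification.from_pretrained(base, num_labels=5)
# Trainer(model=ner_model, train_dataset=annotated_reports, ...).train()
```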

Seven included studies apply further pre-training as defined above. The effect of further pre-training depends on various factors, including specifications of the input model used or amount and quality of the data used for further pre-training. Interestingly, further pre-training of a pre-trained model to another language was not reported.

In contrast to the traditional further pre-training described above, Jaiswal et al. show how BERT-based models can achieve higher performance when little data is available by using contrastive pre-training 29 . The authors claim that their model achieves better results than conventional transformers when the number of annotated reports is limited.

Only two studies solve the task of information extraction based on extractive question answering 41 , 59 . Extractive question answering was already described in the original BERT paper 8 : Instead of generating a pooled embedding of the input text or one embedding per input token, a BERT model fine-tuned for question answering takes a question as input and outputs the start and end tokens of the text span that contains the answer to the posed question; this is also possible if no answer or multiple answers are contained within the text, as shown by Zhang et al. 69 .
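A minimal sketch of this extractive QA formulation (generic SQuAD-trained checkpoint and invented report snippet, purely for illustration) could look as follows:

```python
# Sketch of extractive question answering over a report snippet; the model
# checkpoint and the example text are illustrative, not from any included study.
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

report = ("There is a 2.3 cm spiculated mass in the right upper lobe. "
          "No pleural effusion is identified.")
result = qa(question="What is the size of the mass?", context=report)
print(result["answer"], result["start"], result["end"])  # answer span + character offsets
```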

The most common modalities for which reports of findings were used in the included studies are CT ( n  = 16), MRI ( n  = 15) and X-Ray ( n  = 14). CT reports appear to be the most common source when using in-house data. According to data provided by the Organisation for Economic Cooperation and Development (OECD), the availability of CT scanners and MRI machines has increased steadily during the past decades. Furthermore, there has been a general upwards trend in the number of performed CT and MRI interventions worldwide 70 . CT exams are fast and cheap compared to MRI.

The most common anatomical regions studied are thorax ( n  = 17) and brain ( n  = 8). There might be different reasons for this distribution. First, chest X-Ray is one of the most frequently performed imaging examinations. Second, six studies used reports obtained from MIMIC datasets, including thorax X-Ray, brain MRI and babygram examinations. Two studies used thorax X-Ray reports obtained from publicly available datasets. Furthermore, a report on the annual exposure from medical imaging in Switzerland shows that the thorax region is the third most common anatomical region of CT procedures (11.8%), preceded by abdomen and thorax (16.4%) and abdomen only (17.7%) 71 .

We identified several aspects that are interpreted differently across the included studies. One of the major ambiguities discovered concerns the definition of the terms test set and validation set: some studies use these two distinct terms interchangeably. However, agreement is needed on which set is used during parameter optimization of a model and which set is used for evaluation of the final model. Furthermore, studies report either the number of sentences or the number of documents, hindering comparability. It also remains unclear whether the stated dataset size includes documents without annotations or annotated data only. Report language is never explicitly stated.

Regarding annotation, it becomes apparent that there is no standard for the annotation process: the recommended number of annotators and their backgrounds, the number of reports, the number of reconciliation rounds and, especially, the IAA calculation methods all differ widely in the included papers.

Good practices observed in the included papers include reporting descriptive annotation statistics 35 and conducting complexity analyses of the report corpus 29 , 34 : such complexity metrics include, e.g., unique n-gram counts, lexical diversity as measured with the Yule I score, and the type-token ratio, as reported in ref. 46 . Wood et al. highlight the importance of splitting data on the patient level instead of the report level 19 .
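A patient-level split can be implemented, for example, with grouped splitting (a sketch with synthetic identifiers; not code from the cited study):

```python
# Sketch of a patient-level train/test split: reports from the same patient
# never end up in both sets. Identifiers and report names are synthetic.
from sklearn.model_selection import GroupShuffleSplit

reports = ["report_a", "report_b", "report_c", "report_d", "report_e", "report_f"]
patient_ids = ["p1", "p1", "p2", "p3", "p3", "p4"]  # several reports per patient

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(reports, groups=patient_ids))
print([patient_ids[i] for i in train_idx], [patient_ids[i] for i in test_idx])
```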

Last, we want to highlight interesting approaches: Fine et al. first use structured reports for fine-tuning and then apply the resulting model on unstructured reports 34 . Jaiswal et al. introduce three novel data augmentation techniques before fine-tuning their model based on contrastive learning 29 . Pérez-Díez et al. developed a randomization algorithm to substitute detected entities with synthetic alternatives to disguise undetected personal information 31 .

The mentioned challenges and limitations are manifold and diverse. Ten papers in total address the topic of generalizing to data from other institutions. Another challenge is that every study is limited in scope, be it to a restricted number of entities or, usually, to a single modality and clinical domain. Every included study is based on a pre-defined information model and fine-tuned on annotated data. This means that, as of August 2023, no truly generalized approach for IE had been described in the identified literature.

Upon interpretation of the above-mentioned results, several limitations of this review should be mentioned. First, the definition of information extraction proved to be challenging. We defined information extraction as a collective term for the NLP tasks of document-level multi-label classification (including binary or multiple classes for each label), NER (including RE), and question answering approaches; we excluded binary classification on the document level. A narrow definition of IE would possibly include only NER and RE, whereas the widest definition would also include binary document classification. With our approach, we wanted to ensure a balanced level of task complexity.

Furthermore, the definition of an LLM was also unclear. In the protocol for this review, LLMs are defined as “deep learning models with more than one million parameters, trained on unlabeled text data” 72 . Although BiLSTM-based architectures are not themselves pre-trained on large volumes of unlabeled text, the pre-trained word embeddings they apply, such as fastText and word2vec, motivated the inclusion of these architectures in this review. An additional argument for including BiLSTM-based architectures is ELMo, a BiLSTM-based architecture with roughly 13 million parameters that is referred to as one of the first LLMs. However, we decided not to include BiGRU-based architectures, as information on their parameter count was usually not available. A narrower definition would include only transformer-based architectures with billions of parameters. This definition seems to have recently reached consensus among researchers and in industry. As of the time of submission in June 2024, LLMs tend to be defined even more narrowly, including only generative models based on autoregressive sampling 73 . This might be due to generative models currently being the most common and frequent model architecture. In contrast, a wider definition would also potentially include BiGRU-based, CNN-based and other architectures. It also remains subject to discussion whether summarization can be regarded as information extraction; for this study, summarization was not included, potentially missing studies of interest, e.g., ref. 74 . Likewise, image-to-text report generation was excluded.

Regarding the search strategy, we decided not to include numerous model names in order to keep the complexity of the search term low. Instead, we initially included only the terms transformers and BERT . Eventually, only two search dimensions were used because otherwise the number of search results would have been too small. To minimize the number of missed studies, a search of the references of included studies was carried out, eventually leading to nine additionally included studies that were not covered by the search strategy. Nevertheless, our search strategy was not exhaustive: studies that used terms related to transformation or structuring of reports, e.g., refs. 75 , 76 , were missed as these terms were not part of the search strategy.

No generative models, and therefore no approaches based on generative models (including few-, single- or zero-shot learning), are included in the search results. This might be due to the fact that generative models only started to become widely accessible with the publication of ChatGPT in November 2022. Only later did open-source alternatives become available. However, due to the sensitive nature of patient data, utilization of publicly serviced models, e.g., GPT-4, is restricted by data protection rules. Until the cut-off time of this review, state-of-the-art open-source generative models, e.g., Llama 2 (70B), still required vast computational resources, restricting the possibilities of on-premise deployment within hospital infrastructures. Furthermore, early studies might so far only be published without peer review (e.g., on arXiv), excluding them from this review, e.g., ref. 77 . As no search updates were performed for this review, arXiv papers that were later peer-reviewed were also not included, e.g., ref. 78 . Relevant papers published in the ACL Anthology were also not included, potentially missing papers describing generative approaches, e.g., by Agrawal et al. 79 and Kartchner et al. 80 . Sources that did not mention “information extraction”, “named entity recognition” or “relation extraction” in the title or abstract and were not referred to by other papers were also not included, e.g., ref. 81 .

Given the diverse nature of the included studies alongside discrepancies in both the quality and quantity of reported data, a comprehensive analysis of the extracted information was deemed impossible. Future systematic reviews could enhance this comparison by refining the research question and subquestions to a more specific scope. However, according to the protocol for this scoping review, a purely descriptive presentation of findings was conducted.

Another potential limitation is the fact that data extraction was performed by one author (DR) only. However, prior to data extraction, two studies were extracted by two authors and the resulting information was compared. This led to the addition of six aspects to the original data extraction table, including details on hardware specifications, hyperparameters, ethical approval, the timeframe of the dataset and class imbalance measures.

Last, we want to highlight that this scoping review strictly adheres to the PRISMA-ScR and PRISMA-S guidelines. Our search strategy of five databases resulted in over 1200 primary search results, minimizing the risk of missing relevant studies. This risk was further minimized by carefully choosing a balanced definition of both IE and LLMs. As only peer-reviewed studies were taken into account, a certain study quality was furthermore ensured.

Due to the current rapid technical progress, we summarize the latest developments regarding LLMs in general, their application in medicine, as well as developments with regard to this review’s topic. We give an overview of studies published outside the scope of our review (published after August 1st 2023) as well as of the application of LLMs to clinical domains and tasks other than IE from radiology reports.

As of June 2024, the majority of recently published LLMs, whether commercial or open-source, are generative models based on the decoder block of the original transformer architecture. Two development strategies to increase model performance can be observed: the first is to simply increase the number of model parameters (and therefore model size), which also increases the demand for training data. The second is to optimize existing models through different techniques, including model pruning, quantization or distillation, as shown by Rohanian et al. 82 . Recent models include the Gemini family (2024) 83 , the T5 family 84 , Llama 3 (2024) 85 and Mixtral (2024) 86 . Moreover, research has increasingly been focusing on developing domain-specific models, e.g., Meditron, Med-PaLM 2, or Med-Gemini for the healthcare domain 87 , 88 , 89 .

In the broad clinical domain, these recent, generative LLMs show impressive capabilities, partly outperforming clinicians in test settings regarding, e.g., medical summary generation 90 , prediction of clinical outcomes 91 and answering of clinical questions 92 . Dagdelen et al. have recently demonstrated that, in the context of structured information extraction from scientific texts, even generative models require a few hundred training examples to effectively extract and organize information using the open-source model Llama-2 93 .

For the specific topic of structured IE from radiology reports, several papers and pre-prints have been published since August 2023. In general, it becomes apparent that resource-demanding generative models do not seem to show better results compared to encoder-based approaches, as shown by the following studies: When applying the open-source model Vicuna 94 to binary-label 13 concepts at the document level of radiology reports, Mukherjee et al. showed only moderate to substantial agreement with existing, less resource-demanding approaches 95 . Document-level binary labeling was also investigated by Adams et al., who compared GPT-4 to a BERT-based model further pre-trained on German medical documents 75 . In this comparison, the smaller, open-source model 96 outperformed GPT-4 for five out of nine concepts. The authors also tested GPT-4 on English radiology reports, however without providing detailed performance measures. Similarly, Hu et al. used ChatGPT as a commercial platform to extract eleven concepts from radiology reports without further fine-tuning or provision of examples 97 . The results show inferiority of ChatGPT in comparison with a previously described approach (BERT-based multiturn question answering 98 ) as well as a rule-based approach (averaged F1 scores: 0.88, 0.91, 0.93, respectively). Mallio et al. qualitatively compared several closed-source generative LLMs for structured reporting, although without clear results 99 . Additionally, several key gaps remain with the application of the above-mentioned generative models. For example, closed-source models continue to grow larger, requiring an increasing amount of scarce hardware resources and training data. Moreover, although large generative models currently show the best performance, they are less explainable than, e.g., the encoder-based architectures prevalent in this review’s results 100 .

Generative models and encoder-based models each offer unique advantages and disadvantages. Yang et al. show that generative models might excel at generalizing to external data by applying in-context learning 101 . Generative models are by design able to aggregate information and might therefore be more suitable for extracting more complex concepts. Open-source models are also becoming more efficient and compact, as seen in recent advancements, e.g., the Phi-3 model family 102 . However, generative models are usually computationally intensive and require substantial resources for training and deployment. While generative models still face issues regarding hallucination, this behavior might be mitigated by combining LLMs with knowledge graphs, as introduced by Gilbert et al. 103 .

On the other hand, encoder-based models, such as BERT, are highly effective at understanding and generating bidirectional contextual embeddings of input data, which makes them particularly strong in tasks requiring precise comprehension or annotation of text, such as extractive question answering or NER. They tend to be more resource-efficient during inference compared to generative models. However, encoder-based models often struggle with generating coherent text, a task where generative models excel. Additionally, while encoder-based models can be fine-tuned for specific tasks, they may not generalize as well as generative models. Moreover, research and industry currently focus on the development of generative models, as the last encoder-based architecture was published in 2021 104 . In summary, while generative models currently offer flexibility and powerful aggregation capabilities, encoder-based models provide efficiency and precision.

In this review, we provide a comprehensive overview of recent studies on LLM-based information extraction from radiology reports, published between January 2018 and August 2023. No generative model architectures for IE from radiology reports were described in the literature. Since August 2023, generative models have become more common, although they tend not to show a performance increase compared to pre-transformer and encoder-based architectures. According to the included studies, pre-transformer and encoder-based models show promising results, although comparison is hindered by different performance score calculation methods and vastly different datasets and tasks. LLMs might improve the generalizability of IE methods, although external validation is performed in only seven studies. The majority of studies used pre-trained LLMs without further pre-training on their own data. So far, research has focused on IE from reports related to CT and MRI examinations and most frequently on reports related to the thorax region. We recognize a lack of publicly available datasets. Furthermore, a lack of standardization of the annotation process results in potential differences in data quality. The source code is made available by only ten studies, limiting the reproducibility of the described methods. The most common challenges reported are missing validation on external data and the extension of the described methods to other clinical domains, report types, concepts, modalities and anatomical regions.


We conclude by highlighting the need to facilitate comparability of studies and to review generative AI-based approaches. We therefore plan to develop a reporting framework for the clinical application of NLP methods. This need is confirmed by Davidson et al., who also state that available guidelines are limited 14 ; journal-specific guidelines already exist 105 . Considering the periodic publication of larger, more capable generative models, transparent and verifiable reporting of all aspects described in this review is essential to compare and identify successful approaches. We furthermore suggest that future research focus on the optimization and standardization of annotation processes, for example to develop few-shot prompts. Currently, the correlation between annotation quality, quantity and model performance is unknown. Last, we recommend the development and publication of standardized, multilingual datasets to foster external validation of models.

Methods

This scoping review was conducted according to the JBI Manual for evidence synthesis and adheres to the PRISMA extension for scoping reviews (PRISMA-ScR). Regarding methodological details, we refer to the published protocol for this review 72 . In this section, we give an overview of the applied methodology and explain the adaptations made to the protocol. The completed PRISMA-ScR checklist is provided in Supplementary Table 1 .

Search strategy

The search strategy comprised three steps: First, a preliminary search was conducted by searching two databases (Google Scholar and PubMed), using keywords related to this review’s research question. Based on the results, a list of relevant search and index terms was retrieved, which in turn served as a basis for the iterative development of the full search query.

During search query development, different terms and dimensions of the research topic were combined into candidate queries that were run on PubMed. Comparing the quantity and relevance of the results showed that including only two dimensions, “radiology” and “information extraction”, provided the best balance between the quantity and quality of results; this combination was therefore chosen as the final search query.
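Purely to illustrate this two-dimension structure (the actual query strings are published in Supplementary Table 2; the terms below are assumptions, not the final query), a PubMed-style query could be assembled as follows:

```python
# Illustrative sketch of a two-dimension boolean query; NOT the published query.
dimension_radiology = ('("radiology"[Title/Abstract] OR '
                       '"radiology report*"[Title/Abstract])')
dimension_ie = ('("information extraction"[Title/Abstract] OR '
                '"named entity recognition"[Title/Abstract] OR '
                '"relation extraction"[Title/Abstract])')

query = f"{dimension_radiology} AND {dimension_ie}"
print(query)
```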

Second, a systematic search was carried out using the final version of the search query. The PubMed-based query was adapted to meet syntactical requirements of the other four databases, comprising IEEE Xplore, ACM Digital Library, Web of Science Core Collection and Embase. The systematic search was conducted on 01/08/2023, and included all sources of evidence (SOE) since database inception. No additional limits, restrictions, or filters were applied. The full query for each database as well as a completed PRISMA-S extension checklist are shown in Supplementary Table 2 and Supplementary Table 3 . Third, reference lists of included studies were manually checked for additional sources of evidence and included if fulfilling all inclusion criteria. No search updates were performed.

Inclusion criteria

Inclusion criteria were discussed among and agreed on by all three authors. No separation was made between exclusion and inclusion criteria; reports were included upon fulfillment of all the following six aspects:

C.01: The full-text SOE is retrievable.

C.02: The SOE was published after 31/12/2017.

C.03: The SOE is published in a peer-reviewed journal or conference proceeding.

C.04: The SOE describes original research, excluding reviews, comments, patents and white papers.

C.05: The SOE describes the application of NLP methods for the purpose of IE from free-text radiology reports.

C.06: The described approach is LLM-based (defined as deep learning models with more than one million parameters, trained on unlabeled text data).

Screening and data extraction

Record screening was performed by two authors (KD, DR) using the online platform Rayyan 106 . To improve alignment between reviewers regarding the inclusion criteria, a first batch of 25 records was screened individually. Two conflicting decisions were discussed and clarified, leading to the consensus that BiLSTM-based architectures might also qualify as LLMs and should therefore be included. In order to validate this change, a second batch of 25 records was screened and compared. Three conflicting decisions helped to clarify that, when an LLM-based architecture is not explicitly stated in the title or abstract, the record should still be marked as included to maximize the overall recall of relevant papers.

Upon clarification of the inclusion criteria, each remaining record (title, abstract) was screened twice. After completion of the screening process, conflicts (comprising differing decisions or records marked as “maybe”) were resolved by including all records that are marked at least once as “included”.

After screening, records were sought for full-text retrieval. Data extraction was performed by one author (DR). During the extraction phase, reports were excluded ex post when a violation of the inclusion criteria became apparent from the full text. Reference lists of included papers were screened for further reports to include. Changes to the published protocol for this review are documented in Supplementary Table 4 , including their description, reason, and date.

Data availability

The complete list of extracted documents for all queried databases as well as the completed data extraction table are available in the OSF repository, see https://doi.org/10.17605/OSF.IO/RWU5M .

Code availability

For data screening, the publicly available online platform rayyan.ai was used (free plan), see https://www.rayyan.ai .

Müskens, J. L. J. M., Kool, R. B., Van Dulmen, S. A. & Westert, G. P. Overuse of diagnostic testing in healthcare: a systematic review. BMJ Qual. Saf. 31 , 54–63 (2022).

Nobel, J. M., Van Geel, K. & Robben, S. G. F. Structured reporting in radiology: a systematic review to explore its potential. Eur. Radiol. 32 , 2837–2854 (2022).

Khurana, D., Koli, A., Khatter, K. & Singh, S. Natural language processing: state of the art, current trends and challenges. Multimed. Tools Appl. 82 , 3713–3744 (2023).

Jurafsky, D. & Martin, J. H. Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (Pearson Education, 2024).

Birhane, A., Kasirzadeh, A., Leslie, D. & Wachter, S. Science in the age of large language models. Nat. Rev. Phys. 5 , 277–280 (2023).

Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems , Vol. 30 (Curran Associates, Inc., 2017).

Peters, M. E. et al. Deep contextualized word representations. Preprint at arXiv:1802.05365 (2018).

Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Burstein, J., Doran, C. & Solorio, T. (eds.) In Proc. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , 4171–4186 (Association for Computational Linguistics, Minneapolis, Minnesota, 2019).

Brown, T. et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems , vol. 33, 1877–1901 (Curran Associates, Inc., 2020).

OpenAI et al. GPT-4 Technical Report. Preprint at arXiv:2303.08774 (2023).

Shoeybi, M. et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. Preprint at arXiv:1909.08053 (2020).

Pons, E., Braun, L. M. M., Hunink, M. G. M. & Kors, J. A. Natural language processing in radiology: a systematic review. Radiology 279 , 329–343 (2016).

Casey, A. et al. A systematic review of natural language processing applied to radiology reports. BMC Med. Inform. Decis. Mak. 21 , 179 (2021).

Davidson, E. M. et al. The reporting quality of natural language processing studies: systematic review of studies of radiology reports. BMC Med. Imaging 21 , 142 (2021).

Saha, A., Burns, L. & Kulkarni, A. M. A scoping review of natural language processing of radiology reports in breast cancer. Front. Oncol. 13 , 1160167 (2023).

Gholipour, M., Khajouei, R., Amiri, P., Hajesmaeel Gohari, S. & Ahmadian, L. Extracting cancer concepts from clinical notes using natural language processing: a systematic review. BMC Bioinform. 24 , 405 (2023).

Gorenstein, L., Konen, E., Green, M. & Klang, E. Bidirectional encoder representations from transformers in radiology: a systematic review of natural language processing applications. J. Am. Coll. Radiol. 21 , 914–941 (2024).

Wood, D. A. et al. Automated labelling using an attention model for radiology reports of MRI scans (ALARM). In Arbel, T. et al. (eds.) Proceedings of the Third Conference on Medical Imaging with Deep Learning , vol. 121 of Proceedings of Machine Learning Research , 811–826 (PMLR, 2020-07-06/2020-07-08).

Wood, D. A. et al. Deep learning to automate the labelling of head MRI datasets for computer vision applications. Eur. Radiol. 32 , 725–736 (2022).

Li, Z. & Ren, J. Fine-tuning ERNIE for chest abnormal imaging signs extraction. J. Biomed. Inform. 108 , 103492 (2020).

Lybarger, K., Damani, A., Gunn, M., Uzuner, O. Z. & Yetisgen, M. Extracting radiological findings with normalized anatomical information using a span-based BERT relation extraction model. AMIA Jt. Summits Transl. Sci. Proc. 2022 , 339–348 (2022).

Kuling, G., Curpen, B. & Martel, A. L. BI-RADS BERT and using section segmentation to understand radiology reports. J. Imaging 8 , 131 (2022).

Lau, W., Lybarger, K., Gunn, M. L. & Yetisgen, M. Event-based clinical finding extraction from radiology reports with pre-trained language model. J. Digit. Imaging 36 , 91–104 (2023).

Sugimoto, K. et al. End-to-end approach for structuring radiology reports. Stud. Health Technol. Inform. 270 , 203–207 (2020).

Zhang, Y. et al. Using recurrent neural networks to extract high-quality information from lung cancer screening computerized tomography reports for inter-radiologist audit and feedback quality improvement. JCO Clin. Cancer Inform. 7 , e2200153 (2023).

Tejani, A. S. et al. Performance of multiple pretrained BERT models to automate and accelerate data annotation for large datasets. Radiol. Artif. Intell. 4 , e220007 (2022).

Zaman, S. et al. Automatic diagnosis labeling of cardiovascular MRI by using semisupervised natural language processing of text reports. Radiol. Artif. Intell. 4 , e210085 (2022).

Liu, H. et al. Use of BERT (bidirectional encoder representations from transformers)-based deep learning method for extracting evidences in chinese radiology reports: Development of a computer-aided liver cancer diagnosis framework. J. Med. Internet Res. 23 , e19689 (2021).

Jaiswal, A. et al. RadBERT-CL: factually-aware contrastive learning for radiology report classification. In Proc. Machine Learning for Health , 196–208 (PMLR, 2021).

Torres-Lopez, V. M. et al. Development and validation of a model to identify critical brain injuries using natural language processing of text computed tomography reports. JAMA Netw. Open 5 , e2227109 (2022).

Pérez-Díez, I., Pérez-Moraga, R., López-Cerdán, A., Salinas-Serrano, J. M. & la Iglesia-Vayá, M. De-identifying Spanish medical texts - named entity recognition applied to radiology reports. J. Biomed. Semant. 12 , 6 (2021).

Lau, W., Payne, T. H., Uzuner, O. & Yetisgen, M. Extraction and analysis of clinically important follow-up recommendations in a large radiology dataset. AMIA Summits Transl. Sci. Proc. 2020 , 335–344 (2020).

Santos, T. et al. A fusion NLP model for the inference of standardized thyroid nodule malignancy scores from radiology report text. Annu. Symp. Proc. AMIA Symp. 2021 , 1079–1088 (2021).

Fink, M. A. et al. Deep learning–based assessment of oncologic outcomes from natural language processing of structured radiology reports. Radiol. Artif. Intell. 4 , e220055 (2022).

Datta, S. et al. Understanding spatial language in radiology: representation framework, annotation, and spatial relation extraction from chest X-ray reports using deep learning. J. Biomed. Inform. 108 , 103473 (2020).

Datta, S. & Roberts, K. Spatial relation extraction from radiology reports using syntax-aware word representations. AMIA Jt. Summits Transl. Sci. Proc. 2020 , 116–125 (2020).

Datta, S. & Roberts, K. A Hybrid deep learning approach for spatial trigger extraction from radiology reports. In Proc. Third International Workshop on Spatial Language Understanding , 50–55 (Association for Computational Linguistics, Online, 2020).

Zhang, H. et al. A novel deep learning approach to extract Chinese clinical entities for lung cancer screening and staging. BMC Med. Inform. Decis. Mak. 21 , 214 (2021).

Hu, D. et al. Automatic extraction of lung cancer staging information from computed tomography reports: Deep learning approach. JMIR Med. Inform. 9 , e27955 (2021).

Datta, S., Khanpara, S., Riascos, R. F. & Roberts, K. Leveraging spatial information in radiology reports for ischemic stroke phenotyping. AMIA Jt. Summits Transl. Sci. Proc. 2021 , 170–179 (2021).

Dada, A. et al. Information extraction from weakly structured radiological reports with natural language queries. Eur. Radiol. 34 , 330–337 (2023).

Eisenhauer, E. et al. New response evaluation criteria in solid tumours: revised RECIST guideline (version 1.1). Eur. J. Cancer 45 , 228–247 (2009).

Rosen, R. D. & Sapra, A. TNM Classification. In StatPearls (StatPearls Publishing, 2023).

University of California Berkeley. HIPAA PHI: definition of PHI and List of 18 Identifiers. https://cphs.berkeley.edu/hipaa/hipaa18.html# (2023).

Stanford NLP Group. Stanfordnlp/stanza. Stanford NLP (2024).

Sugimoto, K. et al. Extracting clinical terms from radiology reports with deep learning. J. Biomed. Inform. 116 , 103729 (2021).

US National Institutes of Health, National Cancer Institute. NCI Thesaurus. https://ncit.nci.nih.gov/ncitbrowser/ .

Datta, S., Godfrey-Stovall, J. & Roberts, K. RadLex normalization in radiology reports. AMIA Annu. Symp. Proc. 2020 , 338–347 (2021).

Zhang, Z. et al. ERNIE: Enhanced language representation with informative entities. In Proc. 57th Annual Meeting of the Association for Computational Linguistics , 1441–1451 (Association for Computational Linguistics, Florence, Italy, 2019).

Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient Estimation of Word Representations in Vector Space 1301.3781 (2013).

Huang, X., Chen, H. & Yan, J. D. Study on structured method of Chinese MRI report of nasopharyngeal carcinoma. BMC Med. Inform. Decis. Mak. 21 , 203 (2021).

DocCheck. DocCheck Flexicon. https://flexikon.doccheck.com/de/Hauptseite (2024).

Jantscher, M. et al. Information extraction from German radiological reports for general clinical text and language understanding. Sci. Rep. 13 , 2353 (2023).

Zhang, X. et al. Extracting comprehensive clinical information for breast cancer using deep learning methods. Int. J. Med. Inform. 132 , 103985 (2019).

Johnson, A. E. W. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6 , 317 (2019).

Moody, G. B. & Mark, R. G. The MIMIC Database (1992).

Datta, S. & Roberts, K. Weakly supervised spatial relation extraction from radiology reports. JAMIA Open 6 , ooad027 (2023).

Johnson, A. E. W. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3 , 160035 (2016).

Datta, S. & Roberts, K. Fine-grained spatial information extraction in radiology as two-turn question answering. Int. J. Med. Inform. 158 , 104628 (2022).

Datta, S. et al. Rad-SpatialNet: a frame-based resource for fine-grained spatial relations in radiology reports. In Calzolari, N. et al . (eds.) Proc. Twelfth Language Resources and Evaluation Conference , 2251–2260 (European Language Resources Association, Marseille, France, 2020).

Demner-Fushman, D. et al. Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Med. Inform. Assoc. 23 , 304–310 (2016).

Mithun, S. et al. Clinical concept-based radiology reports classification pipeline for lung carcinoma. J. Digit. Imaging 36 , 812–826 (2023).

Bressem, K. K. et al. Highly accurate classification of chest radiographic reports using a deep learning natural language model pre-trained on 3.8 million text reports. Bioinformatics 36 , 5255–5261 (2021).

Singh, V. et al. Impact of train/test sample regimen on performance estimate stability of machine learning in cardiovascular imaging. Sci. Rep. 11 , 14490 (2021).

Demler, O. V., Pencina, M. J. & D’Agostino, R. B. Misuse of DeLong test to compare AUCs for nested models. Stat. Med. 31 , 2577–2587 (2012).

Radford, A., Narasimhan, K., Salimans, T. & Sutskever, I. Improving language understanding by generative pre-training (2018).

Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29 , 1930–1940 (2023).

Farquhar, S., Kossen, J., Kuhn, L. & Gal, Y. Detecting hallucinations in large language models using semantic entropy. Nature 630 , 625–630 (2024).

Zhang, Y. & Xu, Z. BERT for question answering on SQuAD 2.0 (2019).

OECD. Diagnostic technologies (2023).

Viry, A. et al. Annual exposure of the Swiss population from medical imaging in 2018. Radiat. Prot. Dosim. 195 , 289–295 (2021).

Reichenpfader, D., Müller, H. & Denecke, K. Large language model-based information extraction from free-text radiology reports: a scoping review protocol. BMJ Open 13 , e076865 (2023).

Shanahan, M., McDonell, K. & Reynolds, L. Role play with large language models. Nature 623 , 493–498 (2023).

Liang, S. et al. Fine-tuning BERT Models for Summarizing German Radiology Findings. In Naumann, T., Bethard, S., Roberts, K. & Rumshisky, A. (eds.) Proc. 4th Clinical Natural Language Processing Workshop , 30–40 (Association for Computational Linguistics, Seattle, WA, 2022).

Adams, L. C. et al. Leveraging GPT-4 for post hoc transformation of free-text radiology reports into structured reporting: a multilingual feasibility study. Radiology 307 , e230725 (2023).

Nowak, S. et al. Transformer-based structuring of free-text radiology report databases. Eur. Radiol. 33 , 4228–4236 (2023).

Košprdić, M., Prodanović, N., Ljajić, A., Bašaragin, B. & Milošević, N. From zero to hero: harnessing transformers for biomedical named entity recognition in zero- and few-shot contexts 2305.04928 (2023).

Smit, A. et al. Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT. In Webber, B., Cohn, T., He, Y. & Liu, Y. (eds.) Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP) , 1500–1519 (Association for Computational Linguistics, Online, 2020).

Agrawal, M., Hegselmann, S., Lang, H., Kim, Y. & Sontag, D. Large language models are few-shot clinical information extractors. In Goldberg, Y., Kozareva, Z. & Zhang, Y. (eds.) Proc. Conference on Empirical Methods in Natural Language Processing , 1998–2022 (Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022).

Kartchner, D., Ramalingam, S., Al-Hussaini, I., Kronick, O. & Mitchell, C. Zero-shot information extraction for clinical meta-analysis using large language models. In Demner-fushman, D., Ananiadou, S. & Cohen, K. (eds.) Proc. 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks , 396–405 (Association for Computational Linguistics, Toronto, Canada, 2023).

Jupin-Delevaux, É. et al. BERT-based natural language processing analysis of French CT reports: application to the measurement of the positivity rate for pulmonary embolism. Res. Diagn. Interv. Imaging 6 , 100027 (2023).

Rohanian, O., Nouriborji, M., Kouchaki, S. & Clifton, D. A. On the effectiveness of compact biomedical transformers. Bioinformatics 39 , btad103 (2023).

Gemini Team, Google. Gemini: a family of highly capable multimodal models. https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf (2024).

Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer 1910.10683 (2023).

Meta. Llama-3 (2024).

Jiang, A. Q. et al. Mixtral of experts 2401.04088 (2024).

Chen, Z. et al. MEDITRON-70B: scaling medical pretraining for large language models 2311.16079 (2023).

Singhal, K. et al. Towards expert-level medical question answering with large language models 2305.09617 (2023).

Saab, K. et al. Capabilities of Gemini models in medicine 2404.18416 (2024).

Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30 , 1134–1142 (2024).

Jiang, L. Y. et al. Health system-scale language models are all-purpose prediction engines. Nature 619 , 357–362 (2023).

Singhal, K. et al. Large language models encode clinical knowledge. Nature 620 , 172–180 (2023).

Dagdelen, J. et al. Structured information extraction from scientific text with large language models. Nat. Commun. 15 , 1418 (2024).

Zheng, L. et al. Judging LLM-as-a-judge with MT-bench and chatbot arena. Adv. Neural Inf. Process Syst. 36 , 46595–46623 (2023).

Mukherjee, P., Hou, B., Lanfredi, R. B. & Summers, R. M. Feasibility of using the privacy-preserving large language model Vicuna for labeling radiology reports. Radiology 309 , e231147 (2023).

Bressem, K. K. et al. MEDBERT.de: a comprehensive German BERT model for the medical domain. Expert Syst. Appl. 237 , 121598 (2024).

Hu, D., Liu, B., Zhu, X., Lu, X. & Wu, N. Zero-shot information extraction from radiological reports using ChatGPT. Int. J. Med. Inform. 183 , 105321 (2024).

Hu, D., Li, S., Zhang, H., Wu, N. & Lu, X. Using natural language processing and machine learning to preoperatively predict lymph node metastasis for non–small cell lung cancer with electronic medical records: development and validation study. JMIR Med. Inform. 10 , e35475 (2022).

Mallio, C. A., Sertorio, A. C., Bernetti, C. & Beomonte Zobel, B. Large language models for structured reporting in radiology: performance of GPT-4, ChatGPT-3.5, Perplexity and Bing. La Radiol. Med. 128 , 808–812 (2023).

Zhao, H. et al. Explainability for large language models: a survey. ACM Trans. Intell. Syst. Technol. 15 , 1–38 (2024).

Yang, H. et al. Unveiling the generalization power of fine-tuned large language models. In Proc. of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (eds Duh, K., Gomez, H. & Bethard, S.) 884–899 (Association for Computational Linguistics, Mexico City, Mexico, 2024). https://doi.org/10.18653/v1/2024.naacl-long.51 .

Abdin, M. et al. Phi-3 technical report: a highly capable language model locally on your phone 2404.14219 (2024).

Gilbert, S., Kather, J. N. & Hogan, A. Augmented non-hallucinating large language models as medical information curators. npj Digital Med. 7 , 1–5 (2024).

He, P., Liu, X., Gao, J. & Chen, W. DeBERTa: decoding-enhanced BERT with disentangled attention 2006.03654 (2021).

Kakarmath, S. et al. Best practices for authors of healthcare-related artificial intelligence manuscripts. NPJ Digit. Med. 3 , 134 (2020).

Rayyan. AI-powered tool for systematic literature reviews (2021).

Si, Y., Wang, J., Xu, H. & Roberts, K. Enhancing clinical concept extraction with contextual embeddings. J. Am. Med. Inform. Assoc. 26 , 1297–1304 (2019).

Liu, Y. et al. RoBERTa: a robustly optimized BERT pretraining approach 1907.11692 (2019).

Lee, J. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36 , 1234–1240 (2020).

Alsentzer, E. et al. Publicly Available Clinical BERT Embeddings. In Rumshisky, A., Roberts, K., Bethard, S. & Naumann, T. (eds.) Proc. 2nd Clinical Natural Language Processing Workshop , 72–78 (Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019).

Deepset. German BERT. https://huggingface.co/bert-base-german-cased (2019).

Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. 3 , 2:1–2:23 (2021).

Sanh, V., Debut, L., Chaumond, J. & Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter 1910.01108 (2020).

Cui, Y., Che, W., Liu, T., Qin, B. & Yang, Z. Pre-training with whole word masking for Chinese BERT. IEEE/ACM Trans. Audio, Speech, Lang. Process. 29 , 3504–3514 (2021).

Peng, Y., Yan, S. & Lu, Z. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. In Proc. of the 18th BioNLP Workshop and Shared Task (eds Demner-Fushman, D., Cohen, K. B., Ananiadou, S. & Tsujii, J.) 58–65 (Association for Computational Linguistics, Florence, Italy, 2019). https://doi.org/10.18653/v1/W19-5006 .

Chan, B., Schweter, S. & Möller, T. German’s next language model. In Proc. of the 28th International Conference on Computational Linguistics (eds Scott, D., Bel, N. & Zong, C.) 6788–6796 (International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020). https://doi.org/10.18653/v1/2020.coling-main.598 .

Shrestha, M. Development of a Language Model for the Medical Domain . Ph.D. thesis (Rhine-Waal University of Applied Sciences, 2021).

Sellam, T. et al. The MultiBERTs: BERT reproductions for robustness analysis. In International Conference on Learning Representations (ICLR) (2022).

Wu, S. & He, Y. Enriching pre-trained language model with entity information for relation classification. In Proc. of the 28th ACM International Conference on Information and Knowledge Management , 2361–2364 (Association for Computing Machinery, New York, NY, USA, 2019). https://doi.org/10.1145/3357384.3358119 .

Beltagy, I., Lo, K. & Cohan, A. SciBERT: a pretrained language model for scientific text. In Proc. Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , (eds Inui, K., Jiang, J., Ng, V. & Wan, X.), 3615–3620 (Association for Computational Linguistics, Hong Kong, China, 2019).

Eberts, M. & Ulges, A. Span-based joint entity and relation extraction with transformer pre-training. In ECAI 2020 , 2006–2013 (IOS Press, 2020).

Yang, Z. et al. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems vol. 32 (Curran Associates, Inc., 2019).

Acknowledgements

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors. We thank Cornelia Zelger for her support during the search query definition process.

Author information

Authors and affiliations

Institute for Patient-Centered Digital Health, Bern University of Applied Sciences, Biel/Bienne, Switzerland

Daniel Reichenpfader & Kerstin Denecke

Faculty of Medicine, University of Geneva, Geneva, Switzerland

Daniel Reichenpfader

Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland

Henning Müller

Informatics Institute, HES-SO Valais-Wallis, Sierre, Switzerland

Henning Müller

Contributions

D.R. conceptualized the study, defined the methodology (including the search strategy), performed the database searches and managed the screening process. D.R. also performed the data extraction and authored the original draft. K.D. reviewed and edited the manuscript and participated in the screening process. H.M. provided supervision and contributed to the review of the manuscript.

Corresponding author

Correspondence to Daniel Reichenpfader.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary material

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .

About this article

Cite this article

Reichenpfader, D., Müller, H. & Denecke, K. A scoping review of large language model based approaches for information extraction from radiology reports. npj Digit. Med. 7 , 222 (2024). https://doi.org/10.1038/s41746-024-01219-0

Received: 21 February 2024

Accepted: 09 August 2024

Published: 24 August 2024

DOI: https://doi.org/10.1038/s41746-024-01219-0

Research shows our bodies go through rapid changes in our 40s and our 60s

For many people, reaching their mid-40s may bring unpleasant signs the body isn’t working as well as it once did. Injuries seem to happen more frequently. Muscles may feel weaker.

A new study, published Wednesday in Nature Aging , shows what may be causing the physical decline. Researchers have found that molecules and microorganisms both inside and outside our bodies are going through dramatic changes, first at about age 44 and then again when we hit 60. Those alterations may be causing significant differences in cardiovascular health and immune function.

The findings come from Stanford scientists who analyzed blood and other biological samples of 108 volunteers ages 25 to 75, who continued to donate samples for several years. 

“While it’s obvious that you’re aging throughout your entire life, there are two big periods where things really shift,” said the study’s senior author, Michael Snyder, a professor of genetics and director of the Center for Genomics and Personalized Medicine at Stanford Medicine. For example, “there’s a big shift in the metabolism of lipids when people are in their 40s and in the metabolism of carbohydrates when people are in their 60s.”

Lipids are fatty substances, including LDL, HDL and triglycerides, that perform a host of functions in the body, but they can be harmful if they build up in the blood.

The scientists tracked many kinds of molecules in the samples, including RNA and proteins, as well as the participants’ microbiomes.

The metabolic changes the researchers discovered indicate not that people in their 40s are burning calories more slowly but rather that the body is breaking food down differently. The scientists aren’t sure exactly what impact those changes have on health.

Previous research showed that resting energy use, or metabolic rate , didn’t change from ages 20 to 60. The new study’s findings don't contradict that.

The changes in metabolism affect how the body reacts to alcohol or caffeine, although the health consequences aren’t yet clear. In the case of caffeine, it may result in higher sensitivity. 

It’s also not known yet whether the shifts could be linked to lifestyle or behavioral factors. For example, the changes in alcohol metabolism might be because people are drinking more in their mid-40s, Snyder said.

For now, Snyder suggests people in their 40s keep a close eye on their lipids, especially LDL cholesterol.

“If they start going up, people might want to think about taking statins if that’s what their doctor recommends,” he said. Moreover, “knowing there’s a shift in the molecules that affect muscles and skin, you might want to warm up more before exercising so you don’t hurt yourself.”

Until we know better what those changes mean, the best way to deal with them would be to eat healthy foods and to exercise regularly, Snyder said.

Dr. Josef Coresh, founding director of the Optimal Aging Institute at the NYU Grossman School of Medicine, compared the new findings to the invention of the microscope.

“The beauty of this type of paper is the level of detail we can see in molecular changes,” said Coresh, a professor of medicine at the school. “But it will take time to sort out what individual changes mean and how we can tailor medications to those changes. We do know that the origins of many diseases happen in midlife when people are in their 40s, though the disease may occur decades later.”

The new study “is an important step forward,” said Dr. Lori Zeltser, a professor of pathology and cell biology at the Columbia University Vagelos College of Physicians and Surgeons. While we don’t know what the consequences of those metabolic changes are yet, “right now, we have to acknowledge that we metabolize food differently in our 40s, and that is something really new.”

The shifts the researchers found might help explain numerous age-related health changes, such as muscle loss, because “your body is breaking down food differently,” Zeltser said.

Linda Carroll is a regular health contributor to NBC News. She is coauthor of "The Concussion Crisis: Anatomy of a Silent Epidemic" and "Out of the Clouds: The Unlikely Horseman and the Unwanted Colt Who Conquered the Sport of Kings." 

New research suggests rainwater could have helped form the first protocell walls

UChicago-led study casts new light on the origins of life on Earth.

One of the major unanswered questions about the origin of life is how droplets of RNA floating around the primordial soup turned into the membrane-protected packets of life we call cells.

A new paper by researchers with the University of Chicago and the University of Houston proposes a solution.

They show how rainwater could have helped create a meshy wall around protocells 3.8 billion years ago, a critical step in the transition from tiny beads of RNA to every bacterium, plant, animal, and human that ever lived.

The paper was published Aug. 21 in Science Advances by UChicago Pritzker Molecular Engineering (PME) postdoctoral researcher Aman Agrawal and his co-authors—including PME Dean Emeritus Matthew Tirrell and Nobel Prize-winning biologist Jack Szostak, director of UChicago’s Chicago Center for the Origins of Life .

“This is a distinctive and novel observation,” said Tirrell.

Droplets and discovery

The research looks at “coacervate droplets”—naturally occurring compartments of complex molecules like proteins, lipids, and RNA. ( In the early 2000s , Szostak started looking at RNA as the first biological material to develop, rather than DNA.)

The droplets, which behave like drops of cooking oil in water, have long been eyed as a candidate for the first protocells. But there was a problem, Szostak found in 2014.

It wasn’t that these droplets couldn’t exchange molecules between each other, a key step in evolution; the problem was that they did it too well, and too fast. Any droplet containing a new, potentially useful pre-life mutation of RNA would exchange this RNA with the other RNA droplets within minutes, meaning they would quickly all be the same.

There would be no differentiation and no competition—meaning no evolution. And that means no life.

“If molecules continually exchange between droplets or between cells, then all the cells after a short while will look alike, and there will be no evolution because you are ending up with identical clones,” Agrawal said.

Agrawal started transferring coacervate droplets into distilled water during his PhD research at the University of Houston with Prof. Alamgir Karim , studying their behavior under an electric field. At this point, the research had nothing to do with the origin of life—just studying the fascinating material from an engineering perspective.

Karim had worked decades earlier at the University of Minnesota under one of the world’s top experts—Tirrell, who later became founding dean of the UChicago Pritzker School of Molecular Engineering. During a lunch with Agrawal and Karim, Tirrell brought up how the research into the effects of distilled water on coacervate droplets might relate to the origin of life on Earth. Tirrell asked where distilled water would have existed 3.8 billion years ago.

“I spontaneously said ‘rainwater!’ His eyes lit up and he was very excited at the suggestion,” Karim said. “So, you can say it was a spontaneous combustion of ideas or ideation!”

Tirrell brought Agrawal’s distilled water research to Szostak, who had recently joined the University of Chicago to lead a new push to understand the origins of life on Earth .

Working with RNA samples from Szostak, Agrawal found that transferring coacervate droplets into distilled water increased the time scale of RNA exchange – from mere minutes to several days. This was long enough for mutation, competition, and evolution.

Then, to make sure rain itself could work rather than distilled water, “We simply collected water from rain in Houston and tested the stability of our droplets in it, just to make sure what we are reporting is accurate,” Agrawal said.

In tests with the actual rainwater and with lab water modified to mimic the acidity of rainwater, they found the same results. The meshy walls formed, creating the conditions that could have led to life.

The chemical composition of the rain falling over Houston in the 2020s is not the rain that would have fallen 750 million years after the Earth formed, and the same can be said for the model protocell system Agrawal tested. The new paper proves that this approach of building a meshy wall around protocells is possible and can work together to compartmentalize the molecules of life, putting researchers closer than ever to finding the right set of chemical and environmental conditions that allow protocells to evolve.

“The molecules we used to build these protocells are just models until more suitable molecules can be found as substitutes,” Agrawal said. “While the chemistry would be a little bit different, the physics will remain the same.”

Interdisciplinary findings

Life is by nature interdisciplinary, so Szostak, the director of UChicago’s Chicago Center for the Origins of Life , said it was natural to collaborate with both UChicago PME , UChicago’s interdisciplinary school of molecular engineering, and the chemical engineering department at the University of Houston.

“Engineers have been studying the physical chemistry of these types of complexes—and polymer chemistry more generally—for a long time. It makes sense that there's expertise in the engineering school,” Szostak said. “When we're looking at something like the origin of life, it's so complicated and there are so many parts that we need people to get involved who have any kind of relevant experience.”

Citation: “Did the exposure of coacervate droplets to rain make them the first stable protocells?” Agrawal et al., Science Advances , August 21, 2024. DOI: 10.1126/sciadv.adn9657

Funding: Houston Endowment Fellowship, Welch Foundation, U.S. Department of Energy

— Adapted from an article published by the Pritzker School of Molecular Engineering .

COMMENTS

  1. Method of preparing a document for survey instrument validation by experts

    This paper is structured as follows: Section 1 provides the introduction to the need for a validation format for research, and the fundamentals of validation and the factors involved in validation from various literature studies are discussed in Section 2. Section 3 presents the methodology used in framing the validation format.

  2. A Step-by-step Guide to Questionnaire Validation Research

    4.3 The practice of combining a validation study and a research study together by using an unvalidated ... paper which contains a set of questions to assess ...

  3. Common misconceptions about validation studies

    Validation studies, in which an investigator compares the accuracy of a measure with a gold standard measure, are an important way to understand and mitigate this bias. More attention is being paid to the importance of validation studies in recent years, yet they remain rare in epidemiologic research and, in our experience, they remain poorly ...

  4. Best Practices for Developing and Validating Scales for Health, Social

    Domain identification. The first step is to articulate the domain(s) that you are endeavoring to measure. A domain or construct refers to the concept, attribute, or unobserved behavior that is the target of the study. Therefore, the domain being examined should be decided upon and defined before any item activity. A well-defined domain will provide a working knowledge of the phenomenon ...

  5. Common misconceptions about validation studies

    Validation studies, in which an investigator compares the accuracy of a measure with a gold standard measure, are an important way to understand and mitigate this bias. More attention is being paid to the importance of validation studies in recent years, yet they remain rare in epidemiologic research and, in our experience, they remain poorly ...

  6. Questionnaire validation practice: a protocol for a systematic

    Methods and analysis A systematic descriptive literature review of qualitative and quantitative research will be used to investigate the scope of validation practice in the rapidly growing field of health literacy assessment. This review method employs a frequency analysis to reveal potentially interpretable patterns of phenomena in a research area; in this study, patterns in types of validity ...

  7. Method of preparing a document for survey instrument validation by

    This paper is structured as follows: Section 1 provides the introduction to the need for a validation format for research, and the fundamentals of validation and the factors involved in validation from various literature studies are discussed in Section 2. Section 3 presents the methodology used in framing the validation format.

  8. Systematic literature review of validation methods for AI systems

    Consequently, validation challenges have been well observed in the earlier research, and our aim in this paper is to study the validation methods that resolve or alleviate these challenges. Gao et al. (2019) also argue that there is a deficiency in supporting tools for validating AI systems. Readily available tools are not non-existent.

  9. Validating Design Methods & Research: The Validation Square

    The George W. Woodruff School of Mechanical Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0405, USA. * Corresponding Author, to whom correspondence should be addressed: janet ...

  10. Development and validation of a questionnaire to measure research

    Surveys are among the most widely used tools in research impact evaluation. Quantitative approaches such as surveys are suggested for accountability purposes, as the most appropriate way that calls for transparency (Guthrie et al. 2013). They provide a broad overview of the status of a body of research and supply comparable, easy-to-analyze data referring to a range of researchers and/or grants.

  11. Common misconceptions about validation studies

    More attention is being paid to the importance of validation studies, and several journals 12, 13 have even created submission categories for validation studies, yet they remain rare in epidemiologic research. To address this gap, we pose a series of questions that could be used on an exam about validation studies or in teaching validation ...

  12. PDF Analytical Method Development and Validation: a Review

    product studies and improvement, formulation pilot batch testing, scale-up research, exchange of innovation to business scale groups, setting up stability conditions, and managing of in-process, finished pharmaceutical formulations, qualification of equipment, master documents, and process limit [4]. Stage 2: This involves the process validation phase.

  13. Verification, analytical validation, and clinical validation (V3): the

    Given (1) the historical context for the terms verification and validation in software and hardware standards, regulations, and guidances, and (2) the separated concepts of analytical and clinical ...

  14. Why is data validation important in research?

    Importance of data validation. Data validation is important for several aspects of a well-conducted study: To ensure a robust dataset: The primary aim of data validation is to ensure an error-free dataset for further analysis. This is especially important if you or other researchers plan to use the dataset for future studies or to train machine ...

  15. (PDF) Validation in Qualitative Research: General Aspects and

    The criteria for the validation of qualitative research are still open to discussion. This article has two aims: first, to present a summary of concepts, emerging from the field of qualitative ...

  16. Evaluation of clinical prediction models (part 2): how to undertake an

    External validation studies are an important but often neglected part of prediction model research. In this article, the second in a series on model evaluation, Riley and colleagues explain what an external validation study entails and describe the key steps involved, from establishing a high quality dataset to evaluating a model's predictive performance and clinical usefulness. A clinical ...

  17. Evidence for test validation: A guide for practitioners.

    Background: Validity is a core topic in educational and psychological assessment. Although there are many available resources describing the concept of validity, sources of validity evidence, and suggestions about how to obtain validity evidence; there is little guidance providing specific instructions for planning and carrying out validation studies. Method: In this paper we describe (a) the ...

  18. PDF The Role of Academic Validation in Developing Mattering and Academic

    This study examines academic validation, or the validation of students by instructors (Rendón, 1994; Rendón & Munoz, 2011). Regardless of full- or part-time status, living ... Rendón's (1994) research offered examples of validating practices demonstrated by faculty, including: (1) demonstrating genuine and authentic concern when teaching ...

  19. Validation studies for population-based intervention coverage

    Results. Indicator validation studies should report on participation at every stage, and provide data on reasons for non-participation. Metrics of individual validity (sensitivity, specificity, area under the receiver operating characteristic curve) and population-level validity (inflation factor) should be reported, as well as the percent of survey responses that are "don't know" or ...

  20. Verification Strategies for Establishing Reliability and Validity in

    The emphasis on strategies that are implemented during the research process has been replaced by strategies for evaluating trustworthiness and utility that are implemented once a study is completed. In this article, we argue that reliability and validity remain appropriate concepts for attaining rigor in qualitative research.

  21. Validation of a questionnaire on research-based learning with ...

    The statistical results to validate the questionnaire have been significant, allowing us to propose this experience as a starting point to implement further studies about the development of research skills in university students from other areas of knowledge. Keywords: Validation, Learning, Research skills, University.

  22. Validity in Qualitative Evaluation: Linking Purposes, Paradigms, and

    Peer debriefing is a form of external evaluation of the qualitative research process. Lincoln and Guba (1985, p. 308) describe the role of the peer reviewer as the "devil's advocate." It is a person who asks difficult questions about the procedures, meanings, interpretations, and conclusions of the investigation.

  23. The Process of Academic Validation Within a Comprehensive College

    Joseph A. Kitchen is an assistant professor of higher education at the University of Miami in Miami, Florida. Dr. Kitchen conducts quantitative, qualitative, and mixed-methods research and his research agenda spans several areas, with a central focus on the role of college transition, outreach, and support programs and interventions in promoting equitable outcomes and college success among ...

  24. Study reveals the benefits and downside of fasting

    MIT researchers have discovered how fasting impacts the regenerative abilities of intestinal stem cells, reports Ed Cara for Gizmodo. "The major finding of our current study is that refeeding after fasting is a distinct state from fasting itself," explain Prof. Ömer Yilmaz and postdocs Shinya Imada and Saleh Khawaled.

  25. Journal of Medical Internet Research

    Full-text reading and screening were performed independently by 2 reviewers, and information was extracted into a pretested template for the 5 research questions. Disagreements were resolved by a domain expert. The study protocol has previously been published. Results: The search resulted in a total of 764 papers.

  26. (PDF) Validation

    The validation is an important variable in typical research and development study, especially for the study which develops a product. According to Glod-Lendvai (2018), validation is to test the ...

  27. Operational research to inform post-validation surveillance of

    Background Lymphatic filariasis (LF), a mosquito-borne helminth infection, is an important cause of chronic disability globally. The World Health Organization has validated eight Pacific Island countries as having eliminated lymphatic filariasis (LF) as a public health problem, but there are limited data to support an evidence-based approach to post-validation surveillance (PVS). Tonga was ...

  28. A scoping review of large language model based approaches for ...

    Similarly, multiple studies mention validation on external or multi-institutional data as a future research direction 19,26,59. Two studies mention the need of semantic enrichment or normalization ...

  29. Research shows the ages our metabolism undergoes massive rapid changes

    Research shows our bodies go through rapid changes in our 40s and our 60s ... "The beauty of this type of paper is the level of detail we can see in molecular changes," said Coresh, a ...

  30. New research suggests rainwater could have helped form the first

    A new paper by researchers with the University of Chicago and the University of Houston proposes a solution. They show how rainwater could have helped create a meshy wall around protocells 3.8 billion years ago, a critical step in the transition from tiny beads of RNA to every bacterium, plant, animal, and human that ever lived.