| Factor 1 | Factor 2 | Factor 3 |
% Variance | 24.708 | 16.436 | 13.705 |
% accumulated | 24.708 | 41.143 | 54.848 |
Table 1. Variance explained by each factor
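The accumulated row of Table 1 is the running total of the per-factor percentages. The short sketch below recomputes it from the three published values; the function name `cumulative` is illustrative only, and because the published figures are already rounded, the recomputed totals may differ from the table in the third decimal place.

```python
# Percentage of variance explained by each factor, as reported in Table 1.
variance_pct = [24.708, 16.436, 13.705]


def cumulative(values):
    """Return the running total of a list of percentages, rounded to 3 decimals."""
    totals = []
    running = 0.0
    for value in values:
        running += value
        totals.append(round(running, 3))
    return totals


accumulated = cumulative(variance_pct)
print(accumulated)
```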
In the analysis, the items were ordered according to their degree of saturation; the highest loading appears on factor 3 (Table 2).
Items | Factor 1 | Factor 2 | Factor 3 |
1. I manage research articles on a topic drawn from scientific journals, databases, etc. | | .510 | |
2. I recognize a scientific paper in a document of Wikipedia, Rincón del Vago, etc. | | .408 | |
3. I know what a literature review is | | .589 | |
4. I identify scientific journals | | .628 | |
5. I recognize databases of scientific journals | | .812 | |
6. I identify the structure of a scientific research article | | .569 | |
7. I use scientific techniques to organize information | .563 | | |
8. I analyze the main ideas of a scientific article | .648 | | |
9. I reflect as I read a scientific article | .598 | | |
10. I interpret data, graphics, etc. of a scientific article | .636 | | |
11. I summarize scientific information | .664 | | |
12. I critically discuss a research article | .544 | | |
13. I make conclusions after reviewing scientific literature | .632 | | |
14. I prepare an abstract / essay of a research topic | | | .942 |
15. I use references according to the rules of scientific writing in a text that I elaborate, whether an abstract or essay | | | .490 |
16. I write keywords in English for a research topic | | | .297 |
17. I identify a new research topic in the literature review | .450 | | |
18. I am able to communicate orally the results of a review of scientific literature | | .629 | |
19. I elaborate keywords of a research topic | | | .376 |
20. I bring my ideas in developing a research topic | | | .347 |
Cronbach's alpha coefficients | .891 | .711 | .687 |
Table 2. Rotated component matrix and Cronbach's alpha by factor
The correlations among the factors indicate that they are related and dependent on one another; we can therefore say that the data confirm the validity of the questionnaire, with a three-factor structure.
The first factor groups items 7, 8, 9, 10, 11, 12, 13 and 17. These items assess skills related to organizing the scientific information collected. This factor is named "Process scientific information".
The second factor corresponds to items 1, 2, 3, 4, 5, 6 and 18. These items assess skills regarding the management and search of scientific information. This factor is named "Managing scientific information".
The third factor groups items 14, 15, 16, 19 and 20. These items assess the application of new understandings and new working skills. This factor is named "Develop scientific information".
2.6. Reliability Analysis
Once the validity of the scale was established, the reliability of the instrument was calculated with Cronbach's alpha coefficient, which reaches a value of .91 for the total scale; .891 for factor 1, "Process scientific information"; .711 for factor 2, "Managing scientific information"; and .687 for factor 3, "Develop scientific information". These values indicate adequate internal consistency, which makes the AHABI a reliable instrument. Table 3 shows the reliability of the scale through the item-total correlation, which reflects the difference in means between the groups with higher and lower total scores. In our analysis, Cronbach's alpha if an item is deleted ranges between .89 and .92, a narrow band that supports the basic dimensionality of the instrument.
Item | Scale mean if item deleted | Scale variance if item deleted | Corrected item-total correlation | Cronbach's alpha if item deleted |
Item 1 | 79.31 | 293.75 | 0.59 | .89 |
Item 2 | 78.53 | 307.82 | 0.43 | .90 |
Item 3 | 78.87 | 318.07 | 0.24 | .91 |
Item 4 | 78.65 | 301.11 | 0.60 | .90 |
Item 5 | 79.45 | 292.45 | 0.72 | .89 |
Item 6 | 79.25 | 302.86 | 0.54 | .90 |
Item 7 | 79.81 | 306.67 | 0.38 | .91 |
Item 8 | 79.30 | 310.43 | 0.32 | .91 |
Item 9 | 80.07 | 295.33 | 0.58 | .90 |
Item 10 | 79.92 | 302.30 | 0.57 | .90 |
Item 11 | 80.69 | 349.07 | -0.32 | .92 |
Item 12 | 80.48 | 316.48 | 0.36 | .91 |
Item 13 | 80.66 | 321.54 | 0.27 | .91 |
Item 14 | 80.09 | 322.20 | 0.23 | .91 |
Item 15 | 80.28 | 310.11 | 0.42 | .90 |
Item 16 | 79.06 | 305.22 | 0.51 | .90 |
Item 17 | 79.16 | 296.94 | 0.65 | .89 |
Item 18 | 80.96 | 310.61 | 0.46 | .90 |
Item 19 | 80.70 | 315.81 | 0.36 | .91 |
Item 20 | 79.31 | 316.92 | 0.33 | .91 |
Table 3. Item-total correlations for the total scale
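For readers who want to reproduce this kind of reliability analysis, the following is a minimal sketch of how Cronbach's alpha and the "alpha if item deleted" column can be computed from raw responses. The `responses` matrix is invented toy data, not the AHABI data set, and the function names are illustrative; the original analysis was presumably run in a statistics package rather than with this code.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])  # number of items
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def alpha_if_deleted(rows, index):
    """Cronbach's alpha recomputed after dropping the item at `index`."""
    reduced = [row[:index] + row[index + 1:] for row in rows]
    return cronbach_alpha(reduced)


# Hypothetical responses: 6 respondents x 4 items on a 5-point scale.
responses = [
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 4, 5],
    [2, 2, 3, 2],
    [4, 4, 4, 3],
    [1, 2, 1, 2],
]

alpha = cronbach_alpha(responses)
deleted = [alpha_if_deleted(responses, i) for i in range(len(responses[0]))]
print(f"alpha = {alpha:.3f}")
for i, value in enumerate(deleted, start=1):
    print(f"alpha if item {i} deleted = {value:.3f}")
```

An item whose removal raises alpha noticeably above the full-scale value is a candidate for review, which is how the last column of Table 3 is usually read.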
3. Discussion
The main results of this research are consistent with those of other studies in the literature that point to the effectiveness of a research-based approach to skills development. This approach has proved to be a cognitively attractive learning strategy for undergraduate students in many disciplines, allowing them to work increasingly with academic and research logic (Hunter et al., 2007; Ward et al., 2003; Willison and O'Regan, 2007; Chaplin, 2003; Hoskins, Stevens & Nehm, 2007; Luckie, Maleszewski, Loznak & Krha, 2004). The results are likewise in line with the conclusions drawn by Willison (2009), reflected in the proposed Research Skill Development framework and its application in some undergraduate programs (Tourism, Engineering, Health, etc.) at the University of Adelaide, where different types and facets of research are chosen for each area (literature review, field, laboratory).
The factors that form the questionnaire, "Process scientific information", "Managing scientific information" and "Develop scientific information", could be used to evaluate proposals such as that of Bastidas (2013), who notes that in research-based teaching students act as researchers and learn the related skills, and that teaching aims to help students understand phenomena the way experts do. The questionnaire could also be used to assess the methodological proposals of Rizo (2012) and Torres (2012).
4. Conclusions
The scale has high internal consistency, since the Cronbach's alpha coefficient reached .91. In addition, the saturations of each item on its respective factor have high values. Moreover, the correlations between factors indicate a good relationship and dependence between them, so we can say that this study has produced a valid instrument for measuring the learning of research skills, since the results presented, taken as a whole, confirm its high reliability as well as its factorial and content validity.
Factors 1, 2 and 3 group the questionnaire items and show adequate internal consistency, so the AHABI is deemed a reliable instrument for evaluating programs with a focus on research-based teaching at the university.
Having reached the goal of the research, this study contributes to the incorporation of new learning experiences in university classrooms that allow innovative forms of education to be developed and evaluated. The research-based teaching method contributes not only at the micro-curriculum level but also at the policy level, toward the quality of higher education.
References
Abril, A., Ariza, M., Quesada, A., & García, J. (2013). Creencias del profesorado en ejercicio y en formación sobre el aprendizaje por investigación. Revista Eureka sobre Enseñanza y Divulgación de las Ciencias, 11(1), 22-33.
Arnal, J., del Rincón, D., & Latorre, A. (1992). Investigación educativa. Metodologías de investigación educativa. Barcelona: Labor.
Bartlett, M.S. (1950). Test of significance in factor analysis. England: University of Manchester.
Bastidas, V. (2013). Aprendizaje Basado en Investigación. Retrieved from: http://sitios.itesm.mx/va/dide2/tecnicas_didacticas/abi/abi.htm
Boyer Commission Report (1998). The Boyer Commission on Educating Undergraduates in the Research University, Reinventing Undergraduate Education: A Blueprint for America’s Research Universities. Retrieved from: http://www.niu.edu/engagedlearning/research/pdfs/Boyer_Report.pdf
Brew, A. (2013). Understanding the scope of undergraduate research: A framework for curricular and pedagogical decision-making. Higher Education, (66), 603-618. http://dx.doi.org/10.1007/s10734-013-9624-x
Cerda-Gutiérrez, H. (2006). Formación investigativa en la Educación Superior Colombiana. Universidad Cooperativa de Colombia. Bogotá: EDUCC.
Chávez, G. (2013). La investigación formativa en la universidad. Proyecto de investigación del Cuerpo Académico “Cambio educativo: discursos, actores y prácticas”. México: Universidad Autónoma de Nuevo León.
Chaplin, S. (2003). Guided development of independent inquiry in an anatomy/physiology laboratory. Advances in Physiology Education. Minnesota: University of St. Thomas.
Council for Undergraduate Research (2013). About CUR: Frequent questions. Retrieved from: http://www.cur.org/about_cur/frequently_asked_questions_/#2
Elbel, R. (1965). Measuring educational achievement. Englewood: Prentice-Hall.
Escalante, E., & Caro, A. (2006). Investigación y análisis estadísticos de datos en SPSS. Mendoza: FEEyE.
Fernández, D., Cordeiro, A., Cordeiro, E., & Pérez, C. (2004). Diseño de un cuestionario para la identificación de las habilidades generales y las cualidades del investigador científico. Pedagogía Universitaria, 9(1), 25-36.
García, G.A., & Ladino, Y. (2008). Desarrollo de competencias científicas a través de una estrategia de enseñanza y aprendizaje por investigación. Studiositas, 3(3), 7-16.
González, J., Galindo, N., Galindo, J.L., & Gold, M. (2004). Los paradigmas de la calidad educativa. De la autoevaluación a la acreditación. México: Unión de Universidades de América Latina.
Griffiths, R. (2004). Knowledge production and the research-learning nexus: The case of the built environment disciplines. Studies in Higher Education, 29(6), 709-726. http://dx.doi.org/10.1080/0307507042000287212
Healey, M., & Jenkins, A. (2009). Undergraduate Research and Inquiry. York: Higher Education Academy.
Hoskins, S., Stevens, L., & Nehm, R. (2007). Selective Use of the primary literature transforms the classroom into a virtual laboratory. Genetics, 176, 1381-1389. http://dx.doi.org/10.1534/genetics.107.071183
Hunter, A., Laursen, S., & Seymour, E. (2007). Becoming a Scientist: the role of undergraduate research in students’ cognitive, personal and professional development. Science Education, 91, 36-74. http://dx.doi.org/10.1002/sce.20173
Hurtado, J. (2000). Retos y alternativas en la formación de investigadores. Venezuela: SYPAL.
Instituto Tecnológico de Estudios Superiores de Monterrey (2010). Aprendizaje basado en investigación. Investigación e Innovación Educativa. Retrieved from: http://sitios.itesm.mx/va/dide2/tecnicas_didacticas/abi/copabi.htm
Kaiser, H.F. (1970). A second generation Little Jiffy. Psychometrika, 35, 401-415. http://dx.doi.org/10.1007/BF02291817
Lipman, M. (2001). Pensamiento Complejo y Educación. Madrid: Ediciones de la Torre.
Luckie, D., Maleszewski, J., Loznak, S., & Krha, M. (2004). Infusion of collaborative inquiry throughout a biology curriculum increases student learning: a four year study of "Teams and Streams". Advances in Physiology Education, 28, 199-209. http://dx.doi.org/10.1152/advan.00025.2004
Martínez, A., & Buendía, A. (2005). Aprendizaje basado en la investigación. Tecnológico de Monterrey. Retrieved from: http://www.mty.itesm.mx/rectoria/dda/rieee/pdf-05/29(EGADE).A.BuendiaA.Mtz.pdf
Morales, P., Urosa, B., & Blanco, A. (2003). Construcción de escalas de actitudes tipo Likert. Madrid: La Muralla.
Morales, O., Rincón, A., & Romero, J. (2004). Cómo enseñar a investigar en la universidad. Educere, 9(29), 217-224.
Moreno, M. (2002). Formación para la investigación centrada en el desarrollo de habilidades. México: Universidad de Guadalajara.
Núñez, N. (2007). Desarrollo de habilidades para la investigación (DHIN). Revista Iberoamericana de Educación, 44, 6-15.
Papanastasiou, E. (2005). Factor structure of the attitudes toward research scale. Statistics Education Research Journal, 4(1), 16-26.
Restrepo, B. (2003). Investigación formativa e investigación productiva de conocimiento en la universidad. Nómadas, 18, 195-202.
Rivera, Mª E., & Torres, C. (2006). Percepción de los estudiantes universitarios de sus propias habilidades de investigación. Revista de la Comisión de Investigación de FIMPES, 1(1), 36-49.
Rizo, M. (2012). Enseñar a investigar investigando. Experiencias de investigación en comunicación con estudiantes de la Licenciatura en Comunicación y Cultura de la Universidad Autónoma de la Ciudad de México. México: Universidad Autónoma de la Ciudad de México.
Rojas, M., & Méndez, R. (2013). Cómo enseñar a investigar. Un reto para la pedagogía universitaria. Educ., 1(16), 95-108.
Ruiz, C., & Torres, V. (2002). Actitud hacia el aprendizaje de la investigación. Conceptualización y medición. Educación y Ciencias Humanas, X(18), 15-30.
Sayous, N. (2007). La investigación científica y el aprendizaje social para la producción de conocimientos en la formación del ingeniero civil. Ingeniería, 11(2), 39-46.
Sierra, M., Alejo, M., & Silva, F. (2011). Evaluación de competencias de investigación en alumnos de licenciatura en psicología. Retrieved from: http://www.researchgate.net/publication/232069603_Evaluacin_de_competencias_de_investigacin_en_alumnos_de_licenciatura_en_Psicologa
Torres, A. (2012). Aprendizaje Basado en la Investigación. Técnicas Didácticas. Tecnológico de Monterrey. Retrieved from: http://rodin.uca.es:8081/xmlui/bitstream/handle/10498/15117/7313_Penaherrera.pdf?sequence=7
Tünnermann, C. (2003). La universidad latinoamericana ante los retos del siglo XXI. México: Universidad Autónoma de Yucatán Mérida.
Ward, C., Bennett, J., & Bauer, K.W. (2003). Content analysis of undergraduate research student evaluations. Retrieved from: https://www.sarc.miami.edu/ReinventionCenter/Public/assets/files/contentAnalysis.pdf
Willison, J. (2009). Multiple contexts, multiple outcomes, one conceptual framework for research skill development in the undergraduate curriculum. Spring, (29) 3, 10-14.
Willison, J., & O’Regan, K. (2007). Commonly known, commonly not known, totally unknown: A framework for students becoming researchers. Higher Education Research and Development, 26(4), 393-405. http://dx.doi.org/10.1080/07294360701658609
This work is licensed under a Creative Commons Attribution 4.0 International License
Journal of Technology and Science Education, 2011-2024
Online ISSN: 2013-6374; Print ISSN: 2014-5349; DL: B-2000-2012
Publisher: OmniaScience
Low-calorie diets and intermittent fasting have been shown to have numerous health benefits: They can delay the onset of some age-related diseases and lengthen lifespan, not only in humans but many other organisms.
Many complex mechanisms underlie this phenomenon. Previous work from MIT has shown that one way fasting exerts its beneficial effects is by boosting the regenerative abilities of intestinal stem cells, which helps the intestine recover from injuries or inflammation.
In a study of mice, MIT researchers have now identified the pathway that enables this enhanced regeneration, which is activated once the mice begin “refeeding” after the fast. They also found a downside to this regeneration: When cancerous mutations occurred during the regenerative period, the mice were more likely to develop early-stage intestinal tumors.
“Having more stem cell activity is good for regeneration, but too much of a good thing over time can have less favorable consequences,” says Omer Yilmaz, an MIT associate professor of biology, a member of MIT’s Koch Institute for Integrative Cancer Research, and the senior author of the new study.
Yilmaz adds that further studies are needed before forming any conclusion as to whether fasting has a similar effect in humans.
“We still have a lot to learn, but it is interesting that being in either the state of fasting or refeeding when exposure to mutagen occurs can have a profound impact on the likelihood of developing a cancer in these well-defined mouse models,” he says.
MIT postdocs Shinya Imada and Saleh Khawaled are the lead authors of the paper, which appears today in Nature.
Driving regeneration
For several years, Yilmaz’s lab has been investigating how fasting and low-calorie diets affect intestinal health. In a 2018 study, his team reported that during a fast, intestinal stem cells begin to use lipids as an energy source, instead of carbohydrates. They also showed that fasting led to a significant boost in stem cells’ regenerative ability.
However, unanswered questions remained: How does fasting trigger this boost in regenerative ability, and when does the regeneration begin?
“Since that paper, we’ve really been focused on understanding what is it about fasting that drives regeneration,” Yilmaz says. “Is it fasting itself that’s driving regeneration, or eating after the fast?”
In their new study, the researchers found that stem cell regeneration is suppressed during fasting but then surges during the refeeding period. The researchers followed three groups of mice — one that fasted for 24 hours, another one that fasted for 24 hours and then was allowed to eat whatever they wanted during a 24-hour refeeding period, and a control group that ate whatever they wanted throughout the experiment.
The researchers analyzed intestinal stem cells’ ability to proliferate at different time points and found that the stem cells showed the highest levels of proliferation at the end of the 24-hour refeeding period. These cells were also more proliferative than intestinal stem cells from mice that had not fasted at all.
“We think that fasting and refeeding represent two distinct states,” Imada says. “In the fasted state, the ability of cells to use lipids and fatty acids as an energy source enables them to survive when nutrients are low. And then it’s the postfast refeeding state that really drives the regeneration. When nutrients become available, these stem cells and progenitor cells activate programs that enable them to build cellular mass and repopulate the intestinal lining.”
Further studies revealed that these cells activate a cellular signaling pathway known as mTOR, which is involved in cell growth and metabolism. One of mTOR’s roles is to regulate the translation of messenger RNA into protein, so when it’s activated, cells produce more protein. This protein synthesis is essential for stem cells to proliferate.
The researchers showed that mTOR activation in these stem cells also led to production of large quantities of polyamines — small molecules that help cells to grow and divide.
“In the refed state, you’ve got more proliferation, and you need to build cellular mass. That requires more protein, to build new cells, and those stem cells go on to build more differentiated cells or specialized intestinal cell types that line the intestine,” Khawaled says.
Too much of a good thing
The researchers also found that when stem cells are in this highly regenerative state, they are more prone to become cancerous. Intestinal stem cells are among the most actively dividing cells in the body, as they help the lining of the intestine completely turn over every five to 10 days. Because they divide so frequently, these stem cells are the most common source of precancerous cells in the intestine.
In this study, the researchers discovered that if they turned on a cancer-causing gene in the mice during the refeeding stage, they were much more likely to develop precancerous polyps than if the gene was turned on during the fasting state. Cancer-linked mutations that occurred during the refeeding state were also much more likely to produce polyps than mutations that occurred in mice that did not undergo the cycle of fasting and refeeding.
“I want to emphasize that this was all done in mice, using very well-defined cancer mutations. In humans it’s going to be a much more complex state,” Yilmaz says. “But it does lead us to the following notion: Fasting is very healthy, but if you’re unlucky and you’re refeeding after a fasting, and you get exposed to a mutagen, like a charred steak or something, you might actually be increasing your chances of developing a lesion that can go on to give rise to cancer.”
Yilmaz also noted that the regenerative benefits of fasting could be significant for people who undergo radiation treatment, which can damage the intestinal lining, or other types of intestinal injury. His lab is now studying whether polyamine supplements could help to stimulate this kind of regeneration, without the need to fast.
“This fascinating study provides insights into the complex interplay between food consumption, stem cell biology, and cancer risk,” says Ophir Klein, a professor of medicine at the University of California at San Francisco and Cedars-Sinai Medical Center, who was not involved in the study. “Their work lays a foundation for testing polyamines as compounds that may augment intestinal repair after injuries, and it suggests that careful consideration is needed when planning diet-based strategies for regeneration to avoid increasing cancer risk.”
The research was funded, in part, by a Pew-Stewart Trust Scholar award, the Marble Center for Cancer Nanomedicine, the Koch Institute-Dana Farber/Harvard Cancer Center Bridge Project, and the MIT Stem Cell Initiative.
Press mentions
A new study led by researchers at MIT suggests that fasting and then refeeding stimulates cell regeneration in the intestines, reports Katharine Lang for Medical News Today. However, notes Lang, researchers also found that fasting “carries the risk of stimulating the formation of intestinal tumors.”
MIT researchers have discovered how fasting impacts the regenerative abilities of intestinal stem cells, reports Ed Cara for Gizmodo. “The major finding of our current study is that refeeding after fasting is a distinct state from fasting itself,” explain Prof. Ömer Yilmaz and postdocs Shinya Imada and Saleh Khawaled. “Post-fasting refeeding augments the ability of intestinal stem cells to, for example, repair the intestine after injury.”
Prof. Ömer Yilmaz and his colleagues have discovered the potential health benefits and consequences of fasting, reports Max Kozlov for Nature. “There is so much emphasis on fasting and how long to be fasting that we’ve kind of overlooked this whole other side of the equation: what is going on in the refed state,” says Yilmaz.
Published on 23.8.2024 in Vol 26 (2024)
Open Access
Study Protocol
Roles Conceptualization, Methodology, Writing – original draft
Affiliation UQ Centre for Clinical Research, The University of Queensland, Brisbane, QLD, Australia
Roles Conceptualization, Methodology, Writing – review & editing
Affiliation Public Health Division, Ministry of Health, Nuku’alofa, Tonga
Affiliations UQ Centre for Clinical Research, The University of Queensland, Brisbane, QLD, Australia, National Centre for Immunisation Research and Surveillance of Vaccine Preventable Diseases, Sydney, NSW, Australia
Roles Conceptualization, Funding acquisition, Methodology, Writing – review & editing
Lymphatic filariasis (LF), a mosquito-borne helminth infection, is an important cause of chronic disability globally. The World Health Organization has validated eight Pacific Island countries as having eliminated LF as a public health problem, but there are limited data to support an evidence-based approach to post-validation surveillance (PVS). Tonga was validated as having eliminated LF in 2017 but no surveillance has been conducted since 2015. This paper describes a protocol for an operational research project investigating different PVS methods in Tonga to provide an evidence base for national and regional PVS strategies.
Programmatic baseline surveys and Transmission Assessment Surveys conducted between 2000–2015 were reviewed to identify historically ‘high-risk’ and ‘low-risk’ schools and communities. ‘High-risk’ were those with LF antigen (Ag)-positive individuals recorded in more than one survey, whilst ‘low-risk’ were those with no recorded Ag-positives. The outcome measure for ongoing LF transmission will be Ag-positivity, diagnosed using Alere™ Filariasis Test Strips. A targeted study will be conducted in May-July 2024 including: (i) high and low-risk schools and communities, (ii) boarding schools, and (iii) patients attending a chronic-disease clinic. We estimate a total sample size of 2,010 participants.
Our methodology for targeted surveillance of suspected ‘high-risk’ populations using historical survey data can be adopted by countries when designing their PVS strategies. The results of this study will allow us to understand the current status of LF in Tonga and will be used to develop the next phase of activities.
Citation: Lawford H, Tukia ‘, Takai J, Sheridan S, Lau CL (2024) Operational research to inform post-validation surveillance of lymphatic filariasis in Tonga study protocol: History of lymphatic filariasis elimination, rational, objectives, and design. PLoS ONE 19(8): e0307331. https://doi.org/10.1371/journal.pone.0307331
Editor: Marianne Clemence, Public Library of Science, UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND
Received: June 26, 2024; Accepted: July 1, 2024; Published: August 20, 2024
Copyright: © 2024 Lawford et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: No datasets were generated or analysed during the current study. All relevant data from this study will be made available upon study completion.
Funding: This work received financial support from the Coalition for Operational Research on Neglected Tropical Diseases (COR-NTD), which is funded at The Task Force for Global Health primarily by the Bill & Melinda Gates Foundation (OPP1190754) and by the United States Agency for International Development through its Neglected Tropical Diseases Program. Under the grant conditions of the Foundation, a Creative Commons Attribution 4.0 Generic License has already been assigned to the Author Accepted Manuscript version that might arise from this submission. This work was also supported by an Australian National Health and Medical Research Council (NHMRC) Investigator Grant (APP1158469 to CLL). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Lymphatic filariasis (LF) is a mosquito-borne parasitic infection caused by three species of filarial worms (Wuchereria bancrofti, Brugia malayi, or Brugia timori) [1]. In Tonga, diurnally sub-periodic W. bancrofti is the dominant species of parasite causing LF [2], which is transmitted primarily by Aedes tongae and Ae. tabu mosquitoes [3]. Both vectors are diurnal feeders that feed indoors and outdoors, and are container breeders with habitats that include tree and coconut holes, leaf axils, and artificial containers [4].
The Global Programme to Eliminate LF (GPELF) is one of the largest public health programs in the world [5]. GPELF was launched by the World Health Organization (WHO) in 2000 with the aim to (i) interrupt transmission through mass drug administration (MDA) of anthelminthic medicines, and (ii) control morbidity of affected populations by 2020 [6]. New milestones and targets beyond 2020 have been developed; the WHO Neglected Tropical Diseases Roadmap 2030 [7] proposes that all countries complete their MDA programmes, implement post-MDA or post-validation surveillance, and implement a minimum package of care for LF morbidity by 2030 [8]. MDA aims to reduce the prevalence of infections to a level where transmission is thought to be no longer sustainable [9], with the elimination criteria for Aedes vector areas set at <1.0% antigen (Ag) prevalence in 6-7-year-old children.
Many countries in the Pacific region have achieved validation of LF elimination; however, there is a limited evidence base to inform the development of effective and efficient PVS strategies. Operational research is required to determine effective sampling strategies to confirm the presence or absence of LF transmission post-validation, to identify any ongoing transmission in a cost-effective and timely manner, and to integrate surveillance into other public health programs to ensure long-term sustainability.
The Kingdom of Tonga is an archipelago of over 170 islands (Fig 1). Historical surveys of LF prevalence in Tonga recorded microfilaria (Mf) prevalence ranging from 13.5% in Tongatapu (1925) to 71.0% in Niuatoputapu (1970), and nationwide Mf prevalence of 17.4% in 1976 [10]. In 1977, MDA of diethylcarbamazine (DEC) (one dose/month for 12 months) was implemented, following which nationwide surveys in 1979, 1983/4, and 1998/9 recorded Mf prevalence of 1.0%, 0.4%, and 0.63%, respectively, suggesting ongoing residual transmission [10].
Source: Nations Online Project ( https://www.nationsonline.org/oneworld/map/tonga-map.htm ).
https://doi.org/10.1371/journal.pone.0307331.g001
The Pacific Programme for the Elimination of LF (PacELF) was launched in 2000, following which the Ministry of Health (MOH) in Tonga established the National Programme to Eliminate LF (NPELF) which aimed to (i) achieve 100% geographic coverage for MDA by the year 2001, (ii) implement five effective rounds of MDA throughout the country, and (iii) achieve interruption of transmission by 2005 [10]. Each division/island group was designated as an implementation unit (IU), giving a total of five IUs: ‘Eua, Ha’apai, Ongo Niua, Tongatapu, and Vava’u.
A baseline survey (A Survey) using convenience sampling of adults (n = 4,002) at sentinel sites was conducted in 1999–2000 (prior to the first round of MDA) and found an Ag prevalence of 2.7% (n = 108 positive), following which three rounds of MDA (DEC at 6 mg/kg body weight and one tablet of albendazole at 400 mg) were distributed throughout the country between 2001–2003 [11]. Following the third round of MDA, a mid-term evaluation (B Survey) of adults (n = 3,294) in sentinel sites was implemented in 2003–2004, which recorded an Ag prevalence of 2.5% (n = 81 positive) [11]. Two further MDA rounds were conducted in 2004 and 2005. In 2006, a pre-stop MDA survey (C Survey) was implemented in all sites/villages in Ongo Niua (where a high baseline Ag prevalence was recorded) and 4–6 sites in the remaining IUs. Overall, 2,927 participants were surveyed and 11 were Ag-positive, giving an overall Ag prevalence of 0.4% [11]. Five positive cases were identified in Niuatoputapu district, Ongo Niua, five in Ha’apai, and one in ‘Eua. Following the C Survey, a further round of MDA was distributed in 2006 in Niuatoputapu district only.
To determine whether MDA could be stopped, a stop MDA survey (Transmission Assessment Survey [TAS]-1/D Survey) was conducted in 2007 among 2,391 First Grade children (5–6 years old) in all IUs, with no children testing positive. In addition, dried blood spots (DBS) were collected from 797/801 children from ‘Eua, Ha’apai, and Vava’u for antifilarial antibody (Ab) testing with the Filariasis ELISA (Joseph et al., 2011 [12]). An overall Bm14 Ab prevalence of 6.3% was reported among these children, though there were no Ag-positive cases, suggesting that LF transmission remained successfully interrupted [12]. Of note, the five individuals from Niuatoputapu district, Ongo Niua, who tested positive in the 2006 C Survey were retested in 2007, and three of the five remained Ag-positive [11]. No Mf testing was conducted on these individuals.
TAS-2 was conducted in all IUs in 2011 and returned no positive samples among 2,451 children. However, a survey by Chu et al. (2014), which aimed to simultaneously assess the impact of MDAs on LF and STH by incorporating STH testing into TAS-2, reported 7/2,434 Ag-positive students from six schools in the IUs Tongatapu (n = 3), Ha’apai (n = 2), and Ongo Niua (n = 2), giving an overall Ag prevalence of 0.3% [ 13 ]. A final TAS (TAS-3) was conducted in 2015 among 2,806 children from all five divisions; one child tested positive in Ongo Niua, giving an overall Ag prevalence of 0.04% and suggesting continued interruption of LF transmission in Tonga. In 2017, Tonga successfully obtained WHO’s validation of elimination of LF as a public health problem [ 10 ]. Please refer to S1 Table for details of previous LF surveys in Tonga and Fig 2 for the estimated Ag prevalence and MDA coverage in the years leading to LF elimination in Tonga.
Fig 2. Programmatic baseline surveys (A) and Transmission Assessment Surveys (B) indicating antigen prevalence and mass drug administration coverage between 2000–2015, Tonga. Asterisks indicate an antigen prevalence of 0% (no Ag-positives were found).
https://doi.org/10.1371/journal.pone.0307331.g002
The identification of seven positive children in the 2011 TAS-2, and one positive child in the 2015 TAS-3, suggests that there may be persistent localised areas of LF transmission in some communities in Tonga despite five rounds of nationwide MDA (and six rounds of MDA in Niuatoputapu district, Ongo Niua). There has been no community surveillance of LF since 2015, thus transmission may have been re-established but not yet identified. This study will be the first to determine whether LF transmission is still occurring in Tonga five years post-validation of LF elimination and, if transmission is found, will enable early identification of LF to allow a timely response to prevent widespread recrudescence of LF in Tonga. Further, we will investigate different PVS methodologies to determine the most efficient method to identify residual LF transmission in Tonga, and thereby provide an evidence base for national and regional PVS strategies.
In WHO’s Global Programme to Eliminate Lymphatic Filariasis, school-based TAS of 6-7-year-old school children are currently used to determine whether Ag prevalence has dropped below the critical threshold of 1.0% to signify LF elimination. However, rather than conducting a study to estimate Ag prevalence, we have chosen to adopt a targeted surveillance approach that aims to sample areas/people with the highest probability of LF infection to determine the presence or absence of disease. We believe that there are several advantages to conducting targeted surveillance for PVS. Firstly, the justification for conducting TAS among 6-7-year-olds is that young children should be protected from LF infection (and thus have low or zero Ag prevalence) if previous MDA rounds were successful. Furthermore, antigenemia detected in older children and adults may be due to infections pre-dating MDA [ 14 ]. However, previous surveys have shown that TAS-like sampling of children performs poorly for detecting microfoci of ongoing transmission, with significant persistence of LF reported in adults despite areas having met elimination targets in school-based TAS [ 15 ]. Studies have also reported significantly higher Ag prevalence in adults compared to children [ 16 ], and in community surveys (especially true for people >30 years) [ 17 ]. Therefore, we have chosen to test older aged school children (in the last two Grades/Forms of primary/high school) as well as community members to determine whether this increases the likelihood of detecting Ag-positives. We have also chosen to test children in the last two years of boarding school, as this provides a unique opportunity to access older, high-school aged children who (i) may have a higher Ag prevalence than 6-7-year-olds, and (ii) are from remote island communities that can be hard to access and may otherwise be missed.
We will also conduct a clinic-based survey. We hypothesise that individuals suffering from chronic diseases may be less likely to have taken MDA in previous rounds due to serious illness or co-morbidities, or concerns about taking MDA in addition to their regular medications, and may therefore be more likely to have untreated LF infections. These ‘never treated’ individuals could potentially be acting as reservoirs of infection in communities, leading to sustained LF transmission [ 18 – 20 ]. Integrating LF testing into pre-established screening activities for patients attending chronic disease clinics is a less resource-intensive and more convenient means to test this potentially high-risk population.
Lastly, evidence suggests that anti-filarial Ab markers could be more sensitive measures of transmission than Ag in low prevalence settings [ 21 ], and Ab prevalence may indicate pre-antigenemic LF infection and ongoing transmission or resurgence, thereby allowing a timelier response compared to testing for Ag alone [ 21 – 23 ]. This study will concurrently measure seroprevalence of LF Ag and Abs, thereby providing an opportunity to determine the utility of Ab as a marker of LF transmission or resurgence, and assess the sensitivity of different diagnostic tools in PVS settings.
The aim of this study is to investigate different approaches for PVS of LF and provide an evidence base for developing ongoing PVS strategies in Tonga and regionally. The primary objective is to determine if there is any evidence of ongoing transmission of LF in Tonga post-validation of elimination of LF as a public health problem. In addition, we will:
To confirm the presence/absence of LF, this study will be conducted in four different settings: primary schools (‘low-risk’ and ‘high-risk’), high schools (including boarding schools), communities (‘low-risk’ and ‘high-risk’), and a diabetes outpatient clinic based at the national hospital ( Table 1 ).
https://doi.org/10.1371/journal.pone.0307331.t001
Programmatic baseline surveys and TAS conducted between 2000–2015 were reviewed to identify ‘high-risk’ and ‘low-risk’ locations. ‘Low-risk’ locations will be used as reference populations against which LF Ag and Ab seroprevalence from the targeted, high-risk schools and villages can be compared. The six primary schools that recorded Ag-positive children in the 2011 TAS were considered ‘high-risk’ primary schools, and the communities in which the schools are located were considered ‘high-risk’ communities. An additional primary school was selected as it was in a community that recorded high Ag prevalence in the 2003 B Survey. Four primary schools that recorded no Ag-positive children in the 2011 TAS were considered low-risk reference schools; these schools were from ‘low-risk’ communities that also recorded no positive cases in the 2006 community-based C Survey. Following consultation with the MOH, boarding schools that were believed to have a high proportion of students boarding from outer islands were selected. Lastly, following consultation with the MOH, the diabetes clinic at Vaiola Hospital in Nuku’alofa was selected for the clinic-based component of the study due to the large number of patients who could be recruited.
At selected schools, all students in Grades 5–6 of primary school, and Forms 6–7 of high school, will be enrolled in the study. Based on estimated student enrolment numbers from the Tonga Statistics Department [ 24 ], we estimate approximately 55 students to be enrolled per primary school (25 students in Grade 5 and 30 students in Grade 6), and 50 students per high school (25 students in Form 6 and 25 students in Form 7). In schools with low enrolment numbers, eligible grades in primary school will be expanded to include Grades 3–4 and Forms 1–2, and in high schools will be expanded to include Forms 4–5 and Technical and Vocational Education (TVET) students. If any Ag-positive students are identified, they will be followed up at home and all household members ≥5 years old will be offered testing.
For the community survey, a list of households in the selected villages will be obtained from the Tonga Statistics Department. All households will be assigned a number from one to the total number of households per village. The total number of households in each village will be entered into a Random Number Generator. Fifteen unique random numbers between one and the total number of households per village will be generated. We will survey the households represented by these numbers. An additional five households per village will be selected in case any of the selected households are uninhabited or the inhabitants cannot be reached. Based on Tonga’s Population and Housing Census 2021 [ 25 ], we estimate five participants per household and 75 participants per community.
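As a minimal sketch of the household selection just described, the following Python snippet draws 15 survey households and five reserve households for one village. The function name, the seed, and the village size of 120 households are hypothetical. Drawing the reserve households in the same draw as the primary households (so that no household is selected twice) is an assumption, as the protocol does not specify how the reserves are generated.

```python
import random


def select_households(total_households, n_primary=15, n_reserve=5, seed=None):
    """Draw unique household numbers (1..total_households) for one village.

    Returns (primary, reserve). Reserve households are drawn in the same
    sample as the primary ones (an assumption) so the two lists never overlap.
    """
    rng = random.Random(seed)
    # Never ask for more households than the village contains.
    n_draw = min(n_primary + n_reserve, total_households)
    drawn = rng.sample(range(1, total_households + 1), n_draw)
    return drawn[:n_primary], drawn[n_primary:]


# Hypothetical village with 120 listed households.
primary, reserve = select_households(total_households=120, seed=2024)
print("Survey households: ", sorted(primary))
print("Reserve households:", sorted(reserve))
```

Fixing the seed makes the selection reproducible for audit purposes; in the field, a different seed (or none) would be used per village.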
For the clinic-based survey, all patients attending the diabetes clinic at Vaiola Hospital, Nuku’alofa, during the survey weeks will be invited to participate. The target sample size will be 200 patients (if no Ag-positives are detected, this provides an upper 95% confidence limit of 1.5% for Ag prevalence). If any Ag-positive patients are identified, they will be followed up at home and all household members ≥5 years old will be tested.
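The 1.5% figure can be checked with a short calculation. The sketch below assumes a simple random sample and a perfectly accurate test. The protocol does not state which method was used, so both the exact one-sided binomial bound and the common "rule of three" approximation are shown; the function names are illustrative.

```python
def upper_bound_zero_positives(n, confidence=0.95):
    """Exact one-sided upper bound on prevalence when 0 of n test positive.

    Solves (1 - p)^n = alpha for p, where alpha = 1 - confidence.
    """
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n)


def rule_of_three(n):
    """Approximate 95% upper bound for zero events in n trials."""
    return 3.0 / n


n = 200  # target clinic sample size from the protocol
exact = upper_bound_zero_positives(n)
approx = rule_of_three(n)
print(f"Exact one-sided bound: {exact:.4%}")
print(f"Rule of three:         {approx:.4%}")
```

Both approaches round to 1.5% for a sample of 200, which is consistent with the stated target.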
The study has been approved by the Tongan National Health Ethics and Research Committee of the Tongan Ministry of Health (Ref#20240129) and ratified by The University of Queensland’s Human Research Ethics Committee (Project#2024/HE000493).
For the school-based survey, all children in the selected Grades or Forms will be invited to participate. Approximately one week prior to the survey, school principals will be informed about the survey by a team member. Principals or teachers will be asked to distribute participant information sheets and consent forms to all children in the selected Grades or Forms and children will be asked to give these forms to their parents to sign. On the day of the survey, team members will collect signed consent forms and enrol consented students into the study.
For the community-based survey, all household members aged ≥5 years will be eligible to participate. A person will be considered a household member if they slept at the house the previous night or identify the house as their primary residence. Upon arrival at the household, a standard participant information sheet and consent form will be administered to all study participants. A team member will explain the purposes of the study and seek written consent (parental or guardian consent in the case of children <18 years of age) to participate in the study. The total number of household members, including those who are absent, will be recorded.
At the diabetes clinic, patients will be recruited whilst they are in the waiting area prior to their scheduled appointment. A team member will explain the purposes of the study and seek written consent (parental or guardian consent in the case of children <18 years of age) to participate in the study.
All participants will be advised that they can revoke their consent at any time without any prejudice.
Standardized electronic questionnaires will be administered using EpiCollect5 [ 26 ]. Questionnaires will be used to collect demographic information (including age, sex, country/place of birth, village of residence, occupation) and travel history in the past 12 months (both international and within Tonga). At selected households, schools, and chronic disease clinic, an environmental assessment will be conducted by trained Environmental Health Officers as part of their routine inspections. These inspections will include an assessment of the materials of the household/school/clinic structure, water sourcing and storage, and potential vector breeding sites. GPS coordinates of each enrolled household will be recorded.
For each enrolled participant, at least 300 μL of blood will be collected by finger prick into heparin-coated BD Microtainer® Blood Collection Tubes. Alere™ Filariasis Test Strips (FTS) (Abbott, Scarborough, ME) will be used to detect LF antigen. For any Ag-positive samples, Mf slides will be prepared using methods described previously in the literature [ 27 ]. DBS will be collected from all individuals (irrespective of Ag positivity) for Multiplex Bead Assays (MBA) to detect anti-filarial Abs using methods described previously in the literature [ 21 ].
To enable linkage of demographic variables, environmental variables, and FTS/DBS results collected in EpiCollect5, each participant will be given a unique identifying number that will be printed as QR code stickers and attached to consent forms, questionnaires, blood collection tubes, FTS, slides, and dried blood spots.
The primary outcome measure to signal the presence of LF in Tonga will be a positive Ag test. Crude Ag and Ab prevalence with 95% confidence intervals (CI) will be estimated using binomial exact methods. Seroprevalence of LF Ab will be estimated by measuring IgG responses using MBA with Ag-specific cut-off values (measured by Median Fluorescence Intensity [MFI-bg]) used to determine seropositivity. Ag and Ab prevalence estimates will be adjusted for survey design and sex distribution for the school-based surveys, and adjusted for survey design, sex, and age distribution for the community-based survey based on Tonga’s Population and Housing Census 2021 [ 25 ].
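For illustration, an exact binomial (Clopper–Pearson) interval for a crude prevalence can be computed as in the sketch below. It uses only the Python standard library and finds the limits by bisection on the binomial distribution function; the function names are illustrative and this is not necessarily the software the study team will use. It covers crude estimates only and does not implement the adjustment for survey design, sex, or age described above. The example applies it to the baseline survey counts reported earlier (108 positives among 4,002 adults).

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def _solve_decreasing(f, target, iterations=60):
    """Find p in [0, 1] with f(p) = target, for f decreasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if f(mid) > target:
            lo = mid  # p is still too small
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(x, n, alpha=0.05):
    """Two-sided exact binomial confidence interval for x positives out of n."""
    if x == 0:
        lower = 0.0
    else:
        # Lower limit: P(X >= x | p) = alpha / 2
        lower = _solve_decreasing(lambda p: binom_cdf(x - 1, n, p), 1.0 - alpha / 2.0)
    if x == n:
        upper = 1.0
    else:
        # Upper limit: P(X <= x | p) = alpha / 2
        upper = _solve_decreasing(lambda p: binom_cdf(x, n, p), alpha / 2.0)
    return lower, upper


lo, hi = clopper_pearson(108, 4002)
print(f"Crude Ag prevalence {108 / 4002:.2%} (95% CI {lo:.2%} to {hi:.2%})")
```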
Differences in demographic characteristics between Ag/Ab-positive and Ag/Ab-negative individuals will be described using mean ± standard deviation (SD), median [interquartile range], or number (%), and tested using Student’s t -test or Mann–Whitney U test for continuous data and Pearson’s chi-squared test of independence or Fisher’s exact test for categorical data. Logistic regression will be used to assess any associations between demographic variables and Ag/Ab positivity. Any variables with p <0.2 on univariate analyses will be tested using multivariable logistic regression. Collinearity will be assessed using the variance inflation factor, with values <5 considered acceptable. Final models will be selected using backward elimination, wherein variables are sequentially removed from the multivariable models to arrive at the most parsimonious models, in which variables with p <0.05 will be retained.
If possible, the sensitivity of Ag relative to Ab for detecting LF transmission in the post-validation period will be determined as the percentage of individuals with a positive FTS test among those testing Ab-positive using MBA. A Kappa statistic will be calculated to estimate the strength of agreement between FTS and Ab results. Agreement will be interpreted as follows: k < 0.00, no agreement; k 0.00–0.20, poor agreement; k 0.21–0.40, fair agreement; k 0.41–0.60, moderate agreement; k 0.61–0.80, substantial agreement; and k 0.81–1.00, almost perfect agreement. Lastly, seroprevalence estimates and mean MFI-bg values will be compared between communities with a history of high LF transmission and ‘low-risk’ reference communities to test for significant differences.
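The agreement analysis can be sketched as follows, assuming FTS and Ab results are each reduced to positive/negative and cross-tabulated in a 2×2 table. The cell counts are hypothetical and serve only to exercise the calculation; the function names are illustrative.

```python
def cohen_kappa(both_pos, fts_only, ab_only, both_neg):
    """Cohen's kappa for two binary tests from a 2x2 table of counts."""
    n = both_pos + fts_only + ab_only + both_neg
    observed = (both_pos + both_neg) / n
    # Chance agreement from the marginal totals of each test.
    fts_pos, ab_pos = both_pos + fts_only, both_pos + ab_only
    fts_neg, ab_neg = ab_only + both_neg, fts_only + both_neg
    expected = (fts_pos * ab_pos + fts_neg * ab_neg) / (n * n)
    if expected == 1.0:
        return 1.0  # both tests constant and identical; kappa is undefined
    return (observed - expected) / (1.0 - expected)


def interpret_kappa(k):
    """Map kappa to the agreement bands listed in the protocol."""
    if k < 0.0:
        return "no agreement"
    for upper, label in [(0.20, "poor"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial")]:
        if k <= upper:
            return f"{label} agreement"
    return "almost perfect agreement"


# Hypothetical counts: FTS+/Ab+, FTS+/Ab-, FTS-/Ab+, FTS-/Ab-
a, b, c, d = 20, 5, 10, 65
kappa = cohen_kappa(a, b, c, d)
fts_pos_among_ab_pos = a / (a + c)  # proportion FTS-positive among Ab-positive
print(f"kappa = {kappa:.3f} ({interpret_kappa(kappa)})")
print(f"FTS-positive among Ab-positive: {fts_pos_among_ab_pos:.1%}")
```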
Recruitment of participants began on the 8th May 2024 and is anticipated to be completed by the 31st July 2024.
This study is a unique opportunity to test the effectiveness of PVS strategies in the Pacific Island setting, which may provide evidence for strategies that will be applicable to countries and territories with similar contexts. The methodology adopted in this study will provide an evidence base to develop PVS of LF in Tonga and other Pacific Island countries that are at a similar stage of LF elimination. In this protocol, we propose adopting a targeted sampling approach of ‘high-risk’ individuals from areas with historically high Ag prevalence based on survey data, and ‘low-risk’ individuals from areas where no Ag-positive individuals were identified in previous surveys who can serve as a reference population. A targeted sampling approach has the advantage of being a cost-effective means to detect the presence or absence of LF.
The results of this survey will allow us to understand the current status of LF in Tonga. This information will be used to develop the next phase of activities and the development of an appropriate strategic response to the following scenarios:
Depending on the study’s findings, other potential strategies for ongoing surveillance of LF could be considered, including opportunistic testing of blood samples from blood donors, antenatal clinics, pre-employment medicals, military recruits, and/or routine blood tests.
Lastly, we propose collecting DBS for MBA, which will provide opportunities to measure the seroprevalence of anti-filarial Abs. MBA analysis will facilitate an assessment of the utility of Ab serosurveillance to signal LF transmission or resurgence. Estimating Ab seroprevalence will provide opportunities to (i) further investigate the interpretation of Ab signals; and (ii) assess the sensitivity of Ab vs Ag to signal ongoing LF transmission, thereby providing evidence for PVS strategies that will be applicable to similar contexts in the region.
S1 Table. Detailed timeline of milestones towards LF elimination in Tonga.
https://doi.org/10.1371/journal.pone.0307331.s001
npj Digital Medicine volume 7, Article number: 222 (2024)
Radiological imaging is a globally prevalent diagnostic method, yet the free text contained in radiology reports is not frequently used for secondary purposes. Natural Language Processing can provide structured data retrieved from these reports. This paper provides a summary of the current state of research on Large Language Model (LLM) based approaches for information extraction (IE) from radiology reports. We conduct a scoping review that follows the PRISMA-ScR guideline. Queries of five databases were conducted on August 1st 2023. Among the 34 studies that met inclusion criteria, only pre-transformer and encoder-based models are described. External validation shows a general performance decrease, although LLMs might improve generalizability of IE approaches. Reports related to CT and MRI examinations, as well as thoracic reports, prevail. Most common challenges reported are missing validation on external data and augmentation of the described methods. Different reporting granularities affect the comparability and transparency of approaches.
Introduction.
In contemporary medicine, diagnostic tests, particularly various forms of radiological imaging, are vital for informed decision-making 1 . For imaging examinations, radiologists create semi-structured free-text radiology reports by dictation, following a personal or institutional schema to organize the information contained. Structured reporting, which is used in only a few institutions and for specific cases, offers a possibility to enhance automatic analysis of reports by defining standardized report layouts and contents.
Despite the potential benefits of structured reporting in radiology, its implementation often encounters resistance due to the possible temporary increase in radiologists’ workload, rendering the integration into clinical practice challenging 2 . Natural language processing (NLP) can provide the means to make structured information available while maintaining existing documentation procedures. NLP is defined as “tract of artificial intelligence and linguistics, devoted to making computers understand the statements or words written in human languages” 3 . Applied to radiology reports, methods related to NLP can extract clinically relevant information. Specifically, information extraction (IE) provides techniques to use this clinical information for secondary purposes, such as prediction, quality assurance or research.
IE, a subfield within NLP, involves extracting pertinent information from free-text. Subtasks include named entity recognition (NER), relation extraction (RE), and template filling. These subtasks are realized using heuristic-based methods, machine learning-based techniques (e.g., support vector machines or Naïve Bayes), and deep learning-based methods 4 . Within the field of deep learning, a new architecture of models has recently emerged, namely large language models (LLMs).
LLMs are “deep learning models with a huge number of parameters trained in an unsupervised way on large volumes of text” 5 . These models typically exceed one million parameters and have proven highly effective in information extraction tasks. The transformer architecture, introduced in 2017, serves as the foundation for most contemporary LLMs, comprising two distinct architectural blocks: the encoder and the decoder. Both blocks apply an innovative approach of creating contextualized word embeddings called attention 6 . Prior to the “age of transformers” still present today, recurrent neural network (RNN)-based LLMs were regarded as state-of-the-art for creating contextualized word embeddings. ELMo, a language model based on a bidirectional Long Short Term Memory (BiLSTM) network 7 , is an example thereof. Noteworthy transformer-based LLMs include encoder-based models like BERT (2018) 8 , decoder-based models like GPT-3 (2020) 9 and GPT-4 (2023) 10 , as well as models applying both encoder and decoder blocks, e.g., Megatron-LM (2019) 11 . Models continue to evolve, being trained on expanding datasets and consistently surpassing the performance benchmarks established by previous state-of-the-art models. The question arises how these new models shape IE applied to radiology reports.
Regarding existing literature concerning IE from radiology reports, several reviews are available, although these sources either miss current developments or only focus on a specific aspect or clinical domain, see Table 1 . The application of NLP to radiology reports for IE has already been subject to two systematic reviews in 2016 12 and 2021 13 . While the former is not freely available, the latter searches only Google Scholar and includes only one study based on LLMs. Davidson et al. focused on comparing the quality of studies applying NLP-related methods to radiology reports 14 . More recent reviews include a specific scoping review on the application of NLP to reports specifically related to breast cancer 15 , the extraction of cancer concepts from clinical notes 16 , and a systematic review on BERT-based NLP applications in radiology without a specific focus on information extraction 17 .
As LLMs have only recently gained strong momentum, a research gap exists: no overview of LLM-based approaches for IE from radiology reports is available. With this scoping review, we therefore intend to answer the following research question:
What is the state of research regarding information extraction from free-text radiology reports based on LLMs?
Specifically, we are interested in the subquestions that arise from the posed research question:
RQ.01 - Performance: What is the performance of LLMs for information extraction from radiology reports?
RQ.02 - Training and Modeling: Which models are used and how is the pre-training and fine-tuning process designed?
RQ.03 - Use cases: Which modalities and anatomical regions do the analyzed reports correspond to?
RQ.04 - Data and annotation: How much data was used to train the model, how was the annotation process designed and is the data publicly available?
RQ.05 - Challenges: What are open challenges and common limitations of existing approaches?
The objective of this scoping review is to answer the above-mentioned questions, provide an overview of recent developments, identify key trends and highlight future research by identifying outstanding challenges and limitations of current approaches.
As shown in Fig. 1 , the systematic search yielded 1,237 records, retrieved from five databases. After removing duplicate records and records published before 2018, 374 records (title, abstract) were screened for eligibility. The screening process resulted in the exclusion of 302 records. The remaining 72 records were sought for full-text-retrieval, of which 68 could be retrieved. During data extraction, 43 papers were excluded due to not fulfilling inclusion criteria, which was not apparent based on information provided in the abstract.
Querying of five databases resulted in a total of 1237 sources of evidence eligible for screening. This number was reduced to 374 after deduplication and removal based on publication year. Eventually, 34 studies were included in this review after completion of the screening process.
Within the cited references of included papers, nine additional papers fulfilling all inclusion criteria were identified. Therefore, following the above-mentioned methodology, 34 records in total were included in this review.
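The flow of records can be cross-checked with simple arithmetic. The snippet below is only a consistency check on the counts reported in this section; the variable names are illustrative.

```python
# Counts as reported in the text above.
records_identified = 1237
records_screened = 374       # after deduplication and removal of pre-2018 records
excluded_at_screening = 302
full_texts_retrieved = 68
excluded_at_extraction = 43
added_from_references = 9

removed_before_screening = records_identified - records_screened
sought_for_retrieval = records_screened - excluded_at_screening
not_retrieved = sought_for_retrieval - full_texts_retrieved
included = full_texts_retrieved - excluded_at_extraction + added_from_references

print(f"Removed before screening: {removed_before_screening}")
print(f"Sought for retrieval:     {sought_for_retrieval}")
print(f"Not retrieved:            {not_retrieved}")
print(f"Included in review:       {included}")
```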
In the following, we organize the extracted information according to the structure of the extraction table, which in turn reflects the defined research questions. This review covers studies that were published between 01/01/2018 and 01/08/2023. The earliest study included was published in 2019. After eight included studies published in 2020, the topic reaches its peak with eleven studies published in 2021. Eight studies of 2022 were included. Six included studies were published in the first half of 2023.
Based on corresponding author address, 15 out of 34 papers originate from the USA, followed by six from China and three each from the UK and Germany. Other countries include Austria ( n = 1), Canada ( n = 2), Japan ( n = 2), Spain ( n = 1) and The Netherlands ( n = 1) (Table 2 ).
This chapter describes the NLP task, the extracted entities, the information model development process and data normalization strategies of the included studies.
Extracted concepts encompass various entities, attributes, and relations. These concepts relate to abnormalities 18 , 19 , 20 , anatomical information 21 , breast-cancer related concepts 22 , clinical findings 23 , 24 , 25 , devices 26 , diagnoses 27 , 28 , observations 29 , pathological concepts 30 , protected health information (PHI) 31 , recommendations 32 , scores (TI-RADS 33 , tumor response categories 34 ), spatial expressions 35 , 36 , 37 , staging-related information 38 , 39 , and stroke phenotypes 40 . Several papers extract various concepts, e.g., ref. 41 .
Studies solely describing document-level single-label classification were excluded from this review. Two studies apply document-level multi-class classification. Document-level multi-label classification is described in nine studies (26%), of which only three classify more than two classes per entity. The majority of the included studies ( n = 21, 62%) describe NER methods; ten of these additionally apply RE methods. These studies encompass sequence-labeling and span-labeling approaches. Question answering (QA)-based methods are described in two studies, see Fig. 2 .
The circles contain the absolute number of studies per task. NER Named entity recognition, RE Relation extraction, ML-CL Binary multi-label classification, MC-CL Multi-class classification, QA Question answering.
The number of extracted concepts (including entities, attributes, and relations) ranges from one entity in both papers describing multi-class classification 33 , 34 up to 64 entities described in a NER-based study 30 .
Three studies base their information model on clinical guidelines, namely the Response evaluation criteria in solid tumors 42 and the TNM Classification of Malignant Tumors (TNM) staging system 43 . Development by domain experts ( n = 2), references to previous studies ( n = 3), regulations of the Health Insurance Portability and Accountability Act 44 ( n = 1), the Stanza radiology model 45 ( n = 1) and references to previously developed schemes ( n = 2) are other foundations for information model development. One study provides detailed information about the development process of the information model as supplementary information 19 . One study reports development of their information model based on the RadLex terminology 46 , another based on the National Cancer Institute Thesaurus 47 . 21 studies (62%) do not report any details regarding the development of the information model.
Out of the 34 included studies, only three describe methods to structure and/or normalize extracted information. While Torres-Lopez et al. apply rule-based methods to structure extracted data based on entity positions and combinations 30 , Sugimoto et al. additionally apply rule-based normalization based on a concept table 24 . Datta et al. describe a hybrid approach to normalize extracted entities by first generating concept candidates with BM25, a ranking algorithm, and then choosing the best equivalent with a BERT-based classifier 48 .
Regarding the distribution of annotated entities within the datasets, only one study reports on having conducted measures to counteract class imbalance 19 . Another study reports on not having used F1 score as a performance measure, as the F1 score is not suited when class imbalances are present 27 . Four studies (12%) report coarse entity distributions and seven studies (21%) describe granular entity distributions.
In the following, details regarding the reported model architectures and implementations are described, including base models, (further) pre-training and fine-tuning methods, hyperparameters, performance measures, external validation and hardware details.
For an overview of applied model architectures, see Table 3 . 28 out of 34 papers (82%) describe at least one transformer-based architecture, while the remaining six studies apply various adaptions of the Bidirectional Long Short-Term Memory (Bi-LSTM) architecture. Out of the 28 studies that describe transformer-based architectures, 27 are based on the BERT architecture 8 and one is based on the ERNIE architecture 49 . Eight studies (24%) describe further pre-training of a BERT-based, pre-trained model on in-house data. Eighteen studies (53%) use a BERT-based, pre-trained model without further pre-training. One study applies pre-training to other layers than the LLM. Two studies do not provide any details regarding the architecture of the BERT models. One study combines both BERT- and BiLSTM-based architectures 28 . Out of six studies that describe only BiLSTM-based architectures, two studies apply pre-training of word vectors based on word2vec 50 . 31 studies (91%) provide sufficient details about the fine-tuning process. Three studies do not provide details 24 , 39 , 51 .
Reported performance measures vary between included studies, including traditional measures like precision, recall, and accuracy as well as different variations of the F1 score (micro, macro, averaged, weighted, pooled). The performance of studies reporting an F1-score variation (including micro, macro, pooled, generalized, exact-match and weighted F1) is compared in Table 4 . If a study describes multiple models, the score of the best model was chosen. If two or more datasets are compared, the higher score was chosen. If applicable, the result of external validation is also presented. 22 studies (65%) report having conducted statistical tests, including cross-validation, McNemar test, Mann-Whitney U test and Tukey-Kramer test.
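To illustrate why the F1 variant matters when comparing studies, the following sketch computes micro-, macro- and support-weighted F1 from per-class counts. The two entity classes and their counts are hypothetical and not taken from any included study; they are chosen so that one class is frequent and well recognized while the other is rare and poorly recognized.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical per-class counts: (true positives, false positives, false negatives)
counts = {
    "finding": (90, 10, 10),   # frequent class, recognized well
    "device": (5, 5, 15),      # rare class, recognized poorly
}

# Macro F1: unweighted mean of the per-class F1 scores.
per_class = {name: f1(*c) for name, c in counts.items()}
macro_f1 = sum(per_class.values()) / len(per_class)

# Micro F1: pool the counts over all classes, then compute a single F1.
total_tp = sum(c[0] for c in counts.values())
total_fp = sum(c[1] for c in counts.values())
total_fn = sum(c[2] for c in counts.values())
micro_f1 = f1(total_tp, total_fp, total_fn)

# Weighted F1: per-class F1 weighted by support (tp + fn).
support = {name: c[0] + c[2] for name, c in counts.items()}
weighted_f1 = sum(per_class[n] * support[n] for n in counts) / sum(support.values())

print(f"macro F1    = {macro_f1:.3f}")
print(f"micro F1    = {micro_f1:.3f}")
print(f"weighted F1 = {weighted_f1:.3f}")
```

With imbalanced classes such as these, the macro average is pulled down by the rare class while the micro and weighted averages are dominated by the frequent one, so scores of different variants are not directly comparable.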
Hyperparameters used to train the models (e.g., learning rate, batch size, embedding dimensions) are described in 28 studies (82%), however with varying degree of detail. Six studies (18%) do not report any details on hyperparameters. Seven studies (21%) describe a validation of their algorithm on training data from an external institution. Seven studies (21%) include details about hardware and computational resources spent during the training process.
In this section, we describe the study characteristics related to data sets, encompassing number of reports, data splits, modalities, anatomic regions, origin, language, and ethics approval.
Data set size used for fine-tuning ranges from 50 to 10,155 reports. The amount of external validation data ranges from 10% to 31% of the amount of data used for fine-tuning. For further pre-training of transformer-based architectures, 50,000 up to 3.8 million reports are used. Jantscher et al. additionally use the content of a public clinical knowledge platform ( DocCheck Flexicon 52 ) 53 . Zhang et al. only report the amount of data (3 GB) 54 . Jaiswal et al. performed further pre-training on the complete MIMIC-CXR corpus 29 . Two studies that described pre-training of word embeddings for Bi-LSTM-based architectures used 3.3 million and 317,130 reports, respectively 24 , 32 .
Data splits vary widely; the majority of studies ( n = 23, 68%) divide their data into three sets, namely train-, validation- and test-set, with the most common split being 80/10/10, respectively. This split variation is reported in eight studies (24%). Seven studies (21%) use two sets only, four studies (12%) apply cross-validation-based methods.
19 studies (56%) describe the timeframe within which reports had been extracted. Dada et al. report the longest timeframe of 22 years, using reports between 1999 and 2021 for further pre-training 41 . The shortest timeframe reported is less than one year (2020–2021) 26 .
Several studies are based on publicly available datasets: MIMIC-CXR 55 was used once 29 while MIMIC 56 was used by two studies 40 , 57 . MIMIC-III 58 was used by six studies (18%) 37 , 40 , 48 , 57 , 59 , 60 . The Indiana chest X-ray collection 61 was used twice 35 , 36 . For external validation, MIMIC-II was applied by Mithun et al. 62 and MIMIC-CXR by Lau et al. 23 . While some of these studies use the datasets as-is, some perform additional annotation. Other studies use data from hospitals, hospital networks, other tertiary care institutions, medical big data companies, research centers, care centers or university research repositories.
Figures 3 and 4 show the frequencies of modalities and anatomical regions, respectively. Note that frequencies were counted on study-level and not weighted by the number of reports.
The diagram shows absolute numbers of mentioned modalities. Several studies use reports obtained from multiple modalities. Other modalities include positron emission tomography-computed tomography (PET-CT) ( n = 1) and ultrasound ( n = 2). Three studies did not explicitly mention associated modalities. Abbreviations: CT Computed tomography, MRI Magnetic resonance imaging.
The diagram shows absolute numbers of mentioned anatomical regions. Several studies use reports corresponding to multiple anatomical regions. Other anatomical regions include the heart, abdomen, pelvis, “all body regions”, nose, thyroid ( n = 1 each) and breast ( n = 2). Four studies did not explicitly mention associated anatomical regions.
Report language was inferred from the location of the institution of the corresponding author: most studies use English reports ( n = 21, 62%), followed by Chinese ( n = 6, 18%), German ( n = 4, 12%), Japanese ( n = 2, 6%) and Spanish ( n = 1). The corresponding author of one study is located in the Netherlands, but the study uses data from an Indian hospital 62 .
19 studies (56%) explicitly state that the endeavor was approved by either a national committee or agency ( n = 3, 9%) or a local institutional or hospital review board or committee ( n = 15, 44%). One study reports approval only for in-house data, but not for the external validation set from another institution 33 .
28 studies (82%) describe an exclusively manual annotation process. Five studies (15%) explicitly state that each report was annotated by two persons independently. Lau et al. use annotated data to train a classifier that supports the annotation process by proposing only documents that contain potential annotations 32 . Two studies use tools for automated annotation with manual correction and review 29 , 31 . Lybarger et al. do not provide details on their augmentation of an existing dataset 21 , three others do not report details as they either extract information available in the hospital information system 33 or exclusively use existing annotated datasets 36 , 59 .
Annotation tagging schemes mentioned include IOB(2), BISO and BIOES (short for beginning, inside, outside, start, end). The number of involved annotators ranges from one to five, roles include clinical coordinators, radiologists, radiology residents, medical and graduate students, medical informatics engineers, neurologists, neuro-radiologists, surgeons, radiological technologists and internists. Existing annotation guidelines are reported by three studies, four studies mention that instructions exist but do not provide details. 23 studies (68%) do not mention information regarding annotation guidelines.
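To make the tagging schemes concrete, the following minimal sketch converts entity spans into IOB2 tags and then into BIOES tags. The function names, the example sentence and the `FINDING` label are hypothetical and not taken from any included study; the sketch also assumes the common convention in which `S` marks a single-token entity and `E` marks the last token of a multi-token entity.

```python
from typing import List, Tuple


def spans_to_iob2(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end_exclusive, label) token spans into IOB2 tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


def iob2_to_bioes(tags: List[str]) -> List[str]:
    """Convert IOB2 tags to BIOES (assumed here: S = single token, E = last token)."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == f"I-{label}"  # does the entity go on after this token?
        if prefix == "B":
            out.append(f"B-{label}" if continues else f"S-{label}")
        else:
            out.append(f"I-{label}" if continues else f"E-{label}")
    return out


# Hypothetical example sentence with two annotated findings.
tokens = ["No", "pleural", "effusion", "or", "pneumothorax"]
spans = [(1, 3, "FINDING"), (4, 5, "FINDING")]
iob2 = spans_to_iob2(tokens, spans)
bioes = iob2_to_bioes(iob2)
```

Both schemes encode the same spans; BIOES only makes entity boundaries explicit, which some sequence-labeling models are reported to benefit from.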
Inter-annotator-agreement (IAA) is reported by 23 (68%) studies. Measures include F1 score variants ( n = 8, 24%), Cohen kappa ( n = 7, 21%), Fleiss kappa ( n = 19, 56%) and the intraclass correlation coefficient ( n = 1). IAA results are reported by 16 studies (47%) and range, for Cohen kappa, from 81% to 93.7%. Eleven studies (32%) mention the tool used for annotation, including Brat 23 , 37 , 39 , 48 , 53 , 60 , Doccano 34 , TagEditor 30 , Talen 46 and two self-developed tools 19 , 63 .
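As an illustration of one of the IAA measures listed above, the following sketch computes Cohen's kappa for two annotators who each assign one label per report. The function name and the toy labels are hypothetical; the included studies may compute agreement on a different unit (e.g., tokens or entities) or with a different tool.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same, non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single, identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical document-level labels from two annotators.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy example the raw agreement is 80%, while kappa is considerably lower because part of that agreement is expected by chance. This is one reason why raw agreement, F1-based agreement and kappa values are not directly comparable across studies.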
Five studies (15%) state that data is available upon request. One study claims availability, although there is no data present in the referenced online repository 57 . One study published its dataset in a GitHub repository 35 . One study only uses annotations provided within a dataset with credentialed access 59 . The remaining 22 studies (65%) do not mention whether data is available or not. Regarding source code availability, ten studies (29%) claim their code to be available. The remaining 24 studies (71%) do not mention whether the source code is available or not.
Various aspects related to limitations and challenges are described. The most common mentioned limitation is that studies use only data from a single institution 21 , 22 , 24 , 30 , 36 , 51 , 53 . Similarly, multiple studies mention validation on external or multi-institutional data as a future research direction 19 , 26 , 59 . Two studies mention the need of semantic enrichment or normalization of extracted information 48 , 54 .
Many studies report intentions to augment their described approaches to other report types 21 , 28 , 30 , 37 , other report sections 22 , to include other or more data sources 35 , 39 , 54 or entities 32 , 62 , body parts 46 , clinical contexts 34 or modalities 35 , 53 , 59 .
Additional limitations include the application to only a single modality or clinical area 21 , 46 , 53 , small dataset size 27 , 32 , 54 , technical limitations 27 , 63 , no negation detection 35 , 62 , few extracted entities 24 , 28 or result degradation upon evaluation on external data 19 or more recent reports 25 . Missing interpretability is mentioned by two studies 28 , 41 .
Performance measures reported in Table 4 cannot be compared due to differences in datasets, number of extracted concepts and the heterogeneity of applied performance measures. External validation performed by seven studies shows in general lower performance of the algorithm applied to external data, i.e., data from a source different from the one used for training. The largest performance drop of 35% (overall F1 score) was reported in a Bi-LSTM-based study, performing multi-label binary classification of only three entities on the document-level 62 . In contrast, Torres-Lopez et al. extracted a total of 64 entities with a performance drop of only 3.16% (F1 score), although not providing details on their model architecture. The smallest performance drop amounts to only 0.74% (Micro F1) for extracting seven entities based on a further pre-trained model 46 . However, it cannot be assumed that further pre-training increases model generalizability and therefore performance.
Upon analysis of performance, several inconsistencies between included studies impair comparability: First, there is no standardized measure or best-practice to assess model performance for information extraction. Although in general, the F1 score is most often applied and well known, there exist many variations, including micro-, macro-, exact and inexact match scores, weighted F1 score and 1-Margin F1 scores. Moreover, Zaman et al. argue that macro-averaged F1 score or overall accuracy are not suited as performance measures when class imbalances are present 27 . For the same reason, F1 score is only used to assess binary classification and not for multi-class classification by Wood et al. 19 .
While 22 studies apply some variation of cross-validation to assess model performance, 12 studies apply simple split validation methods. Singh et al. show that if data sets are small, simple split validation shows significant differences of performance measures compared to cross-validation 64 .
Specific statistical tests to compare performance of different models include DeLong’s test to compare Area under the ROC Curves 19 , 27 , the Tukey-Kramer method for multiple comparison analysis 46 and the McNemar test to compare the agreement between two models 22 . However, appropriateness of each test method remains unclear, as shown by Demner et al. 65 .
In general, equations on how performance metrics are computed should always be included in the manuscript to improve understandability, e.g., as done by 22 or 30 . To improve comparability of studies, scores for each class as well as a reasonable aggregated score over all classes should be reported.
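To illustrate why the chosen F1 variant matters, the following sketch reports per-class, macro-averaged and micro-averaged F1 scores for a single-label, multi-class setting. The function name and the toy labels are hypothetical; entity-level or multi-label evaluations, as used in several included studies, require a different counting of true and false positives.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def f1_report(gold: Sequence[str], pred: Sequence[str]) -> Tuple[Dict[str, float], float, float]:
    """Return (per-class F1, macro F1, micro F1) for single-label classification."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class receives a false positive
            fn[g] += 1  # gold class receives a false negative

    def f1(t: int, false_pos: int, false_neg: int) -> float:
        denom = 2 * t + false_pos + false_neg
        return 2 * t / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    macro = sum(per_class.values()) / len(labels)  # unweighted mean over classes
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))  # pooled counts
    return per_class, macro, micro


# Hypothetical imbalanced example: eight reports of class A, two of class B.
gold = ["A"] * 8 + ["B"] * 2
pred = ["A"] * 8 + ["A", "B"]
per_class, macro_f1, micro_f1 = f1_report(gold, pred)
```

In this toy example the micro-averaged score is 0.90, whereas the macro-averaged score is roughly 0.80 because the rare class B is weighted equally. Reporting only one aggregate can therefore hide weak performance on minority classes.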
This review identified only encoder-based architectures or pre-transformer architectures and no generative models, such as GPT-4 (released in March 2023). The majority of the described models is based on the encoder-only BERT architecture, first described by Devlin et al. 8 . We envision multiple reasons: First, while having been available since 2018 66 , generative models first needed time to be established as a new technology to be investigated and applied in the healthcare sector. Second, early generative models might have demonstrated poor performance due to their relatively small size and lack of domain-specific data for pre-training 67 . Third, poor performance might also entail model hallucinations: Farquhar et al. define hallucination as “answering unreliably or without necessary information” 68 . Hallucinations include, among others, provision of wrong answers due to erroneous training data, lying in pursuit of a reward or errors related to reasoning and generalization 68 . In contrast, encoder-only models like the BERT architecture cannot hallucinate as they provide only context-aware embeddings of input data; the actual NLP task (e.g., sequence labeling, classification or regression) is performed by a relatively simple, downstream neural network, rendering this architecture more transparent and verifiable than generative models.
An advantage of LLMs is their capability to be customized to a specific language or general domain (e.g., medicine): First, a base version of the model is trained using a large amount of unlabeled data: This process is called pre-training. The concept of transfer-learning enables researchers to further customize a pre-trained model to a more specific domain (e.g., clinical domain, another language or from a certain hospital). This is also referred to as further pre-training. The process of training the model to perform a particular NLP task (e.g., classification) based on labeled data is called fine-tuning. These definitions (pre-training, further pre-training, transfer learning and fine-tuning) tend to be confused by authors or replaced by other term variants, e.g., “supervised learning”. However, it is imperative to use clear and concise language to distinguish between the concepts mentioned above.
Seven included studies apply further pre-training as defined above. The effect of further pre-training depends on various factors, including specifications of the input model used or amount and quality of the data used for further pre-training. Interestingly, further pre-training of a pre-trained model to another language was not reported.
Opposed to the traditional further pre-training as described above, Jaiswal et al. show how BERT-based models achieve higher performance when little data is available based on contrastive pre-training 29 . The authors claim that their model achieves better results than conventional transformers when the number of annotated reports is limited.
Only two studies solve the task of information extraction based on extractive question answering 41 , 59 . Extractive question answering was already described in the original BERT paper 8 : Instead of generating a pooled embedding of the input text or one embedding per input token, a BERT model fine-tuned for question answering takes a question together with the text as input and outputs the start and end token of the text span that contains the answer to the posed question; this is also possible if no answer or multiple answers are contained within the text, as shown by Zhang et al. 69 .
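The span selection step of extractive question answering can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of any included study: the model is assumed to output one start score and one end score per token, position 0 is assumed to hold a special token whose score represents "no answer", and the function name, the `max_len` parameter and the example scores are hypothetical.

```python
from typing import List, Tuple


def best_span(start_logits: List[float], end_logits: List[float], max_len: int = 10) -> Tuple[int, int]:
    """Pick the (start, end) token pair with the highest summed score.

    Returns (0, 0) if the "no answer" score at position 0 beats every valid span.
    """
    best = (0, 0)
    best_score = start_logits[0] + end_logits[0]  # score of the "no answer" option
    for s in range(1, len(start_logits)):
        # Only spans with start <= end and at most max_len tokens are valid.
        for e in range(s, min(s + max_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Hypothetical scores for the question "Which finding is described?"
tokens = ["[CLS]", "small", "left", "pleural", "effusion", "."]
start_logits = [0.1, 0.2, 2.5, 1.0, 0.3, -1.0]
end_logits = [0.1, -0.5, 0.0, 0.4, 3.0, -0.2]
start, end = best_span(start_logits, end_logits)
answer = tokens[start:end + 1]
```

Because the output is always a span copied from the report (or "no answer"), the extracted text can be traced back to its position in the source document.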
The most common modalities for which reports of findings were used in the included studies are CT ( n = 16), MRI ( n = 15) and X-Ray ( n = 14). CT reports appear to be the most common source when using in-house data. According to data provided by the Organisation for Economic Cooperation and Development (OECD), the availability of CT scanners and MRI machines has increased steadily during the past decades. Furthermore, there has been a general upwards trend in the number of performed CT and MRI interventions worldwide 70 . CT exams are fast and cheap compared to MRI.
The most common anatomical regions studied are thorax ( n = 17) and brain ( n = 8). There might be different reasons for this distribution. First, chest X-Ray is one of the most frequently performed imaging examinations. Second, six studies used reports obtained from MIMIC datasets, including thorax X-Ray, brain MRI and babygram examinations. Two studies used thorax X-Ray reports obtained from publicly available datasets. Furthermore, a report on the annual exposure from medical imaging in Switzerland shows that the thorax region is the third most common anatomical region of CT procedures (11.8%), preceded by abdomen and thorax (16.4%) and abdomen only (17.7%) 71 .
We identified several aspects that showed different interpretations in the included studies. One of the major ambiguities discovered is the lack of a clear definition of the terms test set and validation set: Some studies use these two very distinct terms interchangeably. However, agreement is needed upon which set is used during parameter optimization of a model and which set is used for evaluation of the final model. Furthermore, studies either report number of sentences or number of documents, hindering comparability. It also remains unclear whether the stated dataset size includes documents without annotation or annotated data only. Report language is never explicitly stated.
Regarding annotation, it becomes apparent that there is no standard for the recommended number of annotators and their backgrounds, the number of reports, the number of reconciliation rounds and, especially, IAA calculation methods. All these aspects differ widely in the included papers.
Good practices observed in the included papers include reporting of descriptive annotation statistics 35 and conducting complexity analysis of the report corpus 29 , 34 : These complexity metrics include, e.g., unique n-gram counts and lexical diversity as measured with the Yule I score and the Type-Token-Ratio, as reported in ref. 46 . Wood et al. highlight the importance of splitting data on patient-level instead of report level 19 .
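Splitting on patient level can be sketched as follows: all reports of one patient are assigned to the same set, so that no patient contributes to both training and evaluation. The function name, the default 80/10/10 ratios and the seed are hypothetical choices for illustration and are not taken from Wood et al.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Report = Tuple[str, str]  # (patient_id, report_text)


def split_by_patient(
    reports: Sequence[Report],
    ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),
    seed: int = 42,
) -> Tuple[List[Report], List[Report], List[Report]]:
    """Split reports into train/validation/test sets without patient overlap."""
    by_patient: Dict[str, List[Report]] = defaultdict(list)
    for patient_id, text in reports:
        by_patient[patient_id].append((patient_id, text))

    patients = sorted(by_patient)            # deterministic order before shuffling
    random.Random(seed).shuffle(patients)

    # Note: the ratios apply to the number of patients, not the number of reports.
    n_train = int(ratios[0] * len(patients))
    n_val = int(ratios[1] * len(patients))
    groups = (
        patients[:n_train],
        patients[n_train:n_train + n_val],
        patients[n_train + n_val:],
    )
    train, val, test = ([r for pid in group for r in by_patient[pid]] for group in groups)
    return train, val, test


# Hypothetical corpus: ten patients with two reports each.
corpus = [(f"patient-{i}", f"report {j} of patient {i}") for i in range(10) for j in range(2)]
train_set, val_set, test_set = split_by_patient(corpus)
```

With a report-level split, follow-up reports of the same patient, which often repeat findings almost verbatim, could end up in both the training and the test set and inflate the measured performance.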
Last, we want to highlight interesting approaches: Fine et al. first use structured reports for fine-tuning and then apply the resulting model on unstructured reports 34 . Jaiswal et al. introduce three novel data augmentation techniques before fine-tuning their model based on contrastive learning 29 . Pérez-Díez et al. developed a randomization algorithm to substitute detected entities with synthetic alternatives to disguise undetected personal information 31 .
The mentioned challenges and limitations are manifold and diverse. Ten papers in total address the topic of generalizing to data from other institutions. Another challenge is the restricted scope of every study, be it a limited number of entities or, usually, a single modality and clinical domain. Every included study is based on a pre-defined information model and fine-tuned on annotated data. This means that by August 2023, no truly generalized approach for IE has been described in the identified literature.
Upon interpretation of the above-mentioned results, several limitations of this review can be mentioned. First, the definition of information extraction proved to be challenging. We defined information extraction as a collective term for the NLP tasks of document-level multi-label classification (including binary or multiple classes for each label), NER (including RE), as well as question answering approaches. We excluded binary classification on the document level. A narrow definition of IE would possibly only include NER and RE, whereas the widest definition would also include binary document classification. With our approach, we wanted to ensure a balanced level of task complexity.
Furthermore, the definition of an LLM was also unclear. In the protocol for this review, LLMs are defined as “deep learning models with more than one million parameters, trained on unlabeled text data” 72 . Although BiLSTM-based architectures are not trained on text, the applied context-aware word embeddings like fastText and word2vec stipulate the inclusion of these architectures into this review. An additional argument for including BiLSTM-based architectures is ELMo, a BiLSTM-based architecture with ~ 13M parameters that is referred to as one of the first LLMs. However, we decided not to include BiGRU-based architectures, as information on their parameter count was usually not available. A narrower definition would only include transformer-based architectures, having billions of parameters. This definition seems to have recently reached consensus among researchers and in industry. As of the time of submission in June 2024, LLMs tend to be defined even more narrowly, only including generative models based on autoregressive sampling 73 . This might be due to generative models currently being the most common and frequent model architecture. On the contrary, a wider definition would also potentially include BiGRU-based, CNN-based and other architectures. It also remains subject to discussion whether summarization can be regarded as information extraction; for this study, summarization was not included, potentially missing studies of interest, e.g., ref. 74 . Likewise, image-to-text report generation was excluded.
Regarding the search strategy, we decided not to include numerous model names to keep the complexity of the search term low. Instead, we initially only included the terms transformers and Bert . Eventually, only two search dimensions were used because otherwise, the number of search results would have been too small. To minimize the number of missed studies, the forward search of references of included studies was carried out, eventually leading to nine additionally included studies that were not covered by the search strategy. Nevertheless, our search strategy was not exhaustive: Studies that used terms related to transformation or structuring of reports, e.g., refs. 75 , 76 , were missed as these terms are missing in the search strategy.
No generative models and therefore no approaches based on generative models (including few-, single- or zero-shot learning) are included in the search results. This might be due to the fact that generative models have only started to become widely accessible with the publication of ChatGPT in November 2022. Only later, open-source alternatives became available. However, due to the sensitive nature of patient data, utilization of publicly serviced models, e.g., GPT-4, is restricted by data protection rules. Until the cut-off time of this review, state-of-the-art, open-source generative models, e.g., Llama 2 (70B), still required vast computational resources, restricting the possibilities of on-premise deployment within hospital infrastructures. Furthermore, early studies might so far only be published without peer-review (e.g., on arXiv), excluding them from this review, e.g., ref. 77 . As no search updates were performed for this review, arXiv papers that were later peer-reviewed were also not included, e.g., 78 . Relevant papers published in the ACL Anthology were also not included, potentially missing papers describing generative approaches, e.g., by Agrawal et al. 79 and Kartchner et al. 80 . Sources that did not mention “information extraction”, “named entity recognition” or “relation extraction” in the title or abstract and were not referred to by other papers were also not included, e.g., ref. 81 .
Given the diverse nature of the included studies alongside discrepancies in both the quality and quantity of reported data, a comprehensive analysis of the extracted information was deemed impossible. Future systematic reviews could enhance this comparison by refining the research question and subquestions to a more specific scope. However, according to the protocol for this scoping review, a purely descriptive presentation of findings was conducted.
Another potential limitation is the fact that data extraction was performed by one author (DR) only. However, prior to data extraction, two studies were extracted by two authors, and the resulting information was compared. This led to the addition of six aspects to the original data extraction table, including details on hardware specification, hyperparameters, ethical approval, timeframe of dataset and class imbalance measures.
Last, we want to highlight that this scoping review strictly adheres to the PRISMA-ScR and PRISMA-S guidelines. Our search strategy of five databases resulted in over 1200 primary search results, minimizing the risk of missing relevant studies. This risk was further minimized by carefully choosing a balanced definition of both IE and LLMs. As only peer-reviewed studies were taken into account, a certain study quality was furthermore ensured.
Due to the current rapid technical progress, we summarize the latest developments regarding LLMs in general, their application in medicine, as well as with regard to this review’s topic. We give an overview on studies published outside the scope of our review (published after August 1st 2023) as well as on the application of LLMs in clinical domains and tasks different from IE from radiology reports.
As of June 2024, the majority of recently published LLMs, be it commercial or open-source, are generative models, based on the decoder-block of the original transformer architecture. Two development strategies can be observed to increase model performance: The first strategy is about simply increasing the number of model parameters (and therefore, model size), leading also to an increased demand for training data. The second strategy, on the other hand, is about optimizing existing models based on different techniques, including model pruning, quantization or distillation, as shown by Rohanian et al. 82 . Recent models include the Gemini family (2024) 83 , the T5 family 84 , Llama 3 (2024) 85 and Mixtral (2024) 86 . Moreover, research has increasingly been focussing on developing domain-specific models, e.g., Meditron, Med-PaLM 2, or Med-Gemini for the healthcare domain 87 , 88 , 89 .
In the broad clinical domain, these recent, generative LLMs show impressive capabilities, partly outperforming clinicians in test settings regarding, e.g., medical summary generation 90 , prediction of clinical outcomes 91 and answering of clinical questions 92 . Dagdelen et al. have recently demonstrated that, in the context of structured information extraction from scientific texts, even generative models require a few hundred training examples to effectively extract and organize information using the open-source model Llama-2 93 .
For the specific topic of structured IE from radiology reports, several papers and pre-prints have been published since August 2023: In general, it becomes apparent that resource-demanding generative models seem not to show better results compared to encoder-based approaches, as shown by the following studies: When applying the open-source model Vicuna 94 to binary label 13 concepts on document-level of radiology reports, Mukherjee et al. showed only moderate to substantial agreement with existing, less resource-demanding approaches 95 . Document-level binary labeling was also investigated by Adams et al., who compared GPT-4 to a BERT-based model further pre-trained on German medical documents 75 . In this comparison, the smaller, open-source model 96 outperformed GPT-4 for five out of nine concepts. The authors also tested GPT-4 on English radiology reports, however not providing any detailed performance measures. Similarly, Hu et al. used ChatGPT as a commercial platform to extract eleven concepts from radiology reports without further fine-tuning or provision of examples 97 . The results show inferiority of ChatGPT upon comparison with a previously described approach (BERT-based multiturn question answering 98 ) as well as a rule-based approach (averaged F1 scores: 0.88, 0.91, 0.93, respectively). Mallio et al. qualitatively compared several closed-source generative LLMs for structured reporting, although lacking clear results 99 . Additionally, several key gaps remain with the application of above-mentioned generative models. For example, closed-source models continue getting larger, requiring an increasing extent of scarce hardware resources and training data. Moreover, although large generative models currently show the best performance, they are less explainable than, e.g., encoder-based architectures prevalent in this review’s results 100 .
Generative models and encoder-based models each offer unique advantages and disadvantages. Yang et al. show that generative models might excel at generalizing to external data by applying in-context learning 101 . Generative models are by design able to aggregate information, and might be therefore more suitable to extract more complex concepts. Recently, open-source models are becoming more efficient and compact, as seen in recent advancements, e.g., the Phi 3 model family 102 . However, generative models are usually computationally intensive and require substantial resources for training and deployment. While still facing issues regarding hallucination, this behavior might be improved by combining LLMs with knowledge graphs, as introduced by Gilbert et al. 103 .
On the other hand, encoder-based models, such as BERT, are highly effective at understanding and generating bidirectional contextual embeddings of input data, which makes them particularly strong in tasks requiring precise comprehension or annotation of text, such as extractive question answering or NER. They tend to be more resource-efficient during inference compared to generative models. However, encoder-based models often struggle with generating coherent text, a task where generative models excel. Additionally, while encoder-based models can be fine-tuned for specific tasks, they may not generalize as well as generative models. Moreover, research and industry currently focus on the development of generative models, as the last encoder-based architecture was published in 2021 104 . In summary, while generative models currently offer flexibility and powerful aggregation capabilities, encoder-based models provide efficiency and precision.
In this review, we provide a comprehensive overview of recent studies on LLM-based information extraction from radiology reports, published between January 2018 and August 2023. No generative model architectures for IE from radiology reports were described in literature. After August 2023, generative models have been becoming more common, however tending not to show a performance increase compared to pre-transformer and encoder-based architectures. According to the included studies, pre-transformer and encoder-based models show promising results, although comparison is hindered by different performance score calculation methods and vastly different data sets and tasks. LLMs might improve generalizability of IE methods, although external validation is performed in only seven studies. The majority of studies used pre-trained LLMs without further pre-training on their own data. So far, research has focused on IE from reports related to CT and MRI examinations and most frequently on reports related to the thorax region. We recognize a lack of publicly available datasets. Furthermore, a lack of standardization of the annotation process results in potential differences regarding data quality. The source code is made available by only ten studies, limiting reproducibility of the described methods. Most common challenges reported are missing validation on external data and augmentation of the described method to other clinical domains, report types, concepts, modalities and anatomical regions.
We conclude by highlighting the need to facilitate comparability of studies and to review generative AI-based approaches. We therefore plan to develop a reporting framework for clinical application of NLP methods. This need is confirmed by Davidson et al. who also state that available guidelines are limited 14 ; journal-specific guidelines already exist 105 . Considering the periodical publication of larger, more capable generative models, transparent and verifiable reporting of all aspects described in this review is essential to compare and identify successful approaches. We furthermore suggest future research to focus on the optimization and standardization of annotation processes to develop few-shot prompts. Currently, the correlation between annotation quality, quantity and model performance is unknown. Last, we recommend the development and publication of standardized, multilingual datasets to foster external validation of models.
This scoping review was conducted according to the JBI Manual for evidence synthesis and adheres to the PRISMA extension for scoping reviews (PRISMA-ScR). Regarding methodological details, we refer to the published protocol for this review 72 . In this section, we give an overview on the applied methodology and explain the adaptations made to the protocol. The completed PRISMA-ScR checklist is provided in Supplementary Table 1 .
The search strategy comprised three steps: First, a preliminary search was conducted by searching two databases (Google Scholar and PubMed), using keywords related to this review’s research question. Based on the results, a list of relevant search and index terms was retrieved, which in turn served as a basis for the iterative development of the full search query.
During search query development, terms covering different dimensions of the research topic were combined into candidate queries that were run on PubMed. Weighing the number of results against their relevance showed that including only two dimensions, “radiology” and “information extraction”, gave the best balance between quantity and quality of results; this combination was therefore chosen as the final search query.
Second, a systematic search was carried out using the final version of the search query. The PubMed-based query was adapted to meet the syntactical requirements of the other four databases: IEEE Xplore, ACM Digital Library, Web of Science Core Collection and Embase. The systematic search was conducted on 01/08/2023 and included all sources of evidence (SOE) since database inception. No additional limits, restrictions, or filters were applied. The full query for each database as well as a completed PRISMA-S extension checklist are shown in Supplementary Table 2 and Supplementary Table 3. Third, reference lists of included studies were manually checked for additional sources of evidence, which were included if they fulfilled all inclusion criteria. No search updates were performed.
Inclusion criteria were discussed among and agreed on by all three authors. No separation was made between exclusion and inclusion criteria; reports were included upon fulfillment of all the following six aspects:
C.01: The full-text SOE is retrievable.
C.02: The SOE was published after 31/12/2017.
C.03: The SOE is published in a peer-reviewed journal or conference proceeding.
C.04: The SOE describes original research, excluding reviews, comments, patents and white papers.
C.05: The SOE describes the application of NLP methods for the purpose of IE from free-text radiology reports.
C.06: The described approach is LLM-based (defined as deep learning models with more than one million parameters, trained on unlabeled text data).
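The six criteria above act as a conjunction: a source of evidence is included only if every one of them holds. A minimal Python sketch of that logic follows. The class, its field names and the helper functions are hypothetical illustrations, not part of the review's tooling, and each boolean field is assumed to have been judged by a human reviewer beforehand.

```python
from dataclasses import dataclass
from datetime import date

# C.02: the SOE must be published after this date (31/12/2017).
CUTOFF = date(2017, 12, 31)


@dataclass
class SourceOfEvidence:
    """Hypothetical record holding one reviewer's judgement per criterion."""
    full_text_retrievable: bool       # C.01
    published: date                   # C.02
    peer_reviewed: bool               # C.03
    original_research: bool           # C.04
    ie_from_radiology_reports: bool   # C.05
    llm_based: bool                   # C.06


def failed_criteria(soe: SourceOfEvidence) -> list[str]:
    """Return the codes of all criteria that the SOE does not fulfil."""
    checks = {
        "C.01": soe.full_text_retrievable,
        "C.02": soe.published > CUTOFF,
        "C.03": soe.peer_reviewed,
        "C.04": soe.original_research,
        "C.05": soe.ie_from_radiology_reports,
        "C.06": soe.llm_based,
    }
    return [code for code, ok in checks.items() if not ok]


def is_included(soe: SourceOfEvidence) -> bool:
    """An SOE is included only when all six criteria are fulfilled."""
    return not failed_criteria(soe)
```

Returning the list of failed criteria, rather than a bare yes/no, keeps the reason for each exclusion traceable, which fits the reporting style of a PRISMA flow diagram.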
Record screening was performed by two authors (KD, DR) using the online platform Rayyan 106. To improve alignment between reviewers regarding the inclusion criteria, a first batch of 25 records was screened individually. Two conflicting decisions were discussed and clarified, leading to the consensus that BiLSTM-based architectures might also classify as LLMs and should therefore be included. To validate this change, a second batch of 25 records was screened and compared. Three conflicting decisions helped to clarify that, when an LLM-based architecture is not explicitly stated in the title or abstract, the record should still be marked as included to maximize overall recall of relevant papers.
Upon clarification of the inclusion criteria, each remaining record (title, abstract) was screened twice. After completion of the screening process, conflicts (comprising differing decisions or records marked as “maybe”) were resolved by including all records that had been marked at least once as “included”.
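The conflict-resolution rule described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' or Rayyan's implementation: the label "excluded" and the function name are assumptions, and a record that neither reviewer marked as "included" is assumed to be dropped.

```python
# Labels "included" and "maybe" appear in the text; "excluded" is assumed.
VALID_DECISIONS = {"included", "excluded", "maybe"}


def resolve_screening(decision_a: str, decision_b: str) -> str:
    """Combine two independent screening decisions for one record.

    A record is kept if at least one reviewer marked it as "included";
    every other combination is treated as an exclusion (assumption).
    """
    for decision in (decision_a, decision_b):
        if decision not in VALID_DECISIONS:
            raise ValueError(f"unknown screening decision: {decision!r}")
    if "included" in (decision_a, decision_b):
        return "included"
    return "excluded"
```

The rule deliberately favours recall over precision at the title and abstract stage; false positives can still be removed later during full-text review.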
After screening, records were sought for full-text retrieval. Data extraction was performed by one author (DR). During the extraction phase, reports were excluded ex post when a violation of the inclusion criteria became apparent from the full text. Reference lists of included papers were screened for further reports to include. Changes to the published protocol for this review are documented in Supplementary Table 4, including a description, the reason, and the date of each change.
The complete list of extracted documents for all queried databases as well as the completed data extraction table are available in the OSF repository, see https://doi.org/10.17605/OSF.IO/RWU5M.
For record screening, the publicly available online platform rayyan.ai was used (free plan), see https://www.rayyan.ai.
Müskens, J. L. J. M., Kool, R. B., Van Dulmen, S. A. & Westert, G. P. Overuse of diagnostic testing in healthcare: a systematic review. BMJ Qual. Saf. 31 , 54–63 (2022).
Nobel, J. M., Van Geel, K. & Robben, S. G. F. Structured reporting in radiology: a systematic review to explore its potential. Eur. Radiol. 32 , 2837–2854 (2022).
Khurana, D., Koli, A., Khatter, K. & Singh, S. Natural language processing: state of the art, current trends and challenges. Multimed. Tools Appl. 82 , 3713–3744 (2023).
Jurafsky, D. & Martin, J. H. Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (Pearson Education, 2024).
Birhane, A., Kasirzadeh, A., Leslie, D. & Wachter, S. Science in the age of large language models. Nat. Rev. Phys. 5 , 277–280 (2023).
Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems , Vol. 30 (Curran Associates, Inc., 2017).
Peters, M. E. et al. Deep contextualized word representations 1802.05365 (2018).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Burstein, J., Doran, C. & Solorio, T. (eds.) Proc. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186 (Association for Computational Linguistics, Minneapolis, Minnesota, 2019).
Brown, T. et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems , vol. 33, 1877–1901 (Curran Associates, Inc., 2020).
OpenAI et al. GPT-4 Technical Report 2303.08774 (2023).
Shoeybi, M. et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism 1909.08053 (2020).
Pons, E., Braun, L. M. M., Hunink, M. G. M. & Kors, J. A. Natural language processing in radiology: a systematic review. Radiology 279 , 329–343 (2016).
Casey, A. et al. A systematic review of natural language processing applied to radiology reports. BMC Med. Inform. Decis. Mak. 21 , 179 (2021).
Davidson, E. M. et al. The reporting quality of natural language processing studies: systematic review of studies of radiology reports. BMC Med. Imaging 21 , 142 (2021).
Saha, A., Burns, L. & Kulkarni, A. M. A scoping review of natural language processing of radiology reports in breast cancer. Front. Oncol. 13 , 1160167 (2023).
Gholipour, M., Khajouei, R., Amiri, P., Hajesmaeel Gohari, S. & Ahmadian, L. Extracting cancer concepts from clinical notes using natural language processing: a systematic review. BMC Bioinform. 24 , 405 (2023).
Gorenstein, L., Konen, E., Green, M. & Klang, E. Bidirectional encoder representations from transformers in radiology: a systematic review of natural language processing applications. J. Am. Coll. Radiol. 21 , 914–941 (2024).
Wood, D. A. et al. Automated labelling using an attention model for radiology reports of MRI scans (ALARM). In Arbel, T. et al. (eds.) Proceedings of the Third Conference on Medical Imaging with Deep Learning , vol. 121 of Proceedings of Machine Learning Research , 811–826 (PMLR, 2020-07-06/2020-07-08).
Wood, D. A. et al. Deep learning to automate the labelling of head MRI datasets for computer vision applications. Eur. Radiol. 32 , 725–736 (2022).
Li, Z. & Ren, J. Fine-tuning ERNIE for chest abnormal imaging signs extraction. J. Biomed. Inform. 108 , 103492 (2020).
Lybarger, K., Damani, A., Gunn, M., Uzuner, O. Z. & Yetisgen, M. Extracting radiological findings with normalized anatomical information using a span-based BERT relation extraction model. AMIA Jt. Summits Transl. Sci. Proc. 2022 , 339–348 (2022).
Kuling, G., Curpen, B. & Martel, A. L. BI-RADS BERT and using section segmentation to understand radiology reports. J. Imaging 8 , 131 (2022).
Lau, W., Lybarger, K., Gunn, M. L. & Yetisgen, M. Event-based clinical finding extraction from radiology reports with pre-trained language model. J. Digit. Imaging 36 , 91–104 (2023).
Sugimoto, K. et al. End-to-end approach for structuring radiology reports. Stud. Health Technol. Inform. 270 , 203–207 (2020).
Zhang, Y. et al. Using recurrent neural networks to extract high-quality information from lung cancer screening computerized tomography reports for inter-radiologist audit and feedback quality improvement. JCO Clin. Cancer Inform. 7 , e2200153 (2023).
Tejani, A. S. et al. Performance of multiple pretrained BERT models to automate and accelerate data annotation for large datasets. Radiol. Artif. Intell. 4 , e220007 (2022).
Zaman, S. et al. Automatic diagnosis labeling of cardiovascular MRI by using semisupervised natural language processing of text reports. Radiol. Artif. Intell. 4 , e210085 (2022).
Liu, H. et al. Use of BERT (bidirectional encoder representations from transformers)-based deep learning method for extracting evidences in chinese radiology reports: Development of a computer-aided liver cancer diagnosis framework. J. Med. Internet Res. 23 , e19689 (2021).
Jaiswal, A. et al. RadBERT-CL: factually-aware contrastive learning for radiology report classification. In Proc. Machine Learning for Health , 196–208 (PMLR, 2021).
Torres-Lopez, V. M. et al. Development and validation of a model to identify critical brain injuries using natural language processing of text computed tomography reports. JAMA Netw. Open 5 , e2227109 (2022).
Pérez-Díez, I., Pérez-Moraga, R., López-Cerdán, A., Salinas-Serrano, J. M. & la Iglesia-Vayá, M. De-identifying Spanish medical texts - named entity recognition applied to radiology reports. J. Biomed. Semant. 12 , 6 (2021).
Lau, W., Payne, T. H., Uzuner, O. & Yetisgen, M. Extraction and analysis of clinically important follow-up recommendations in a large radiology dataset. AMIA Summits Transl. Sci. Proc. 2020 , 335–344 (2020).
Santos, T. et al. A fusion NLP model for the inference of standardized thyroid nodule malignancy scores from radiology report text. Annu. Symp. Proc. AMIA Symp. 2021 , 1079–1088 (2021).
Fink, M. A. et al. Deep learning–based assessment of oncologic outcomes from natural language processing of structured radiology reports. Radiol. Artif. Intell. 4 , e220055 (2022).
Datta, S. et al. Understanding spatial language in radiology: representation framework, annotation, and spatial relation extraction from chest X-ray reports using deep learning. J. Biomed. Inform. 108 , 103473 (2020).
Datta, S. & Roberts, K. Spatial relation extraction from radiology reports using syntax-aware word representations. AMIA Jt. Summits Transl. Sci. Proc. 2020 , 116–125 (2020).
Datta, S. & Roberts, K. A Hybrid deep learning approach for spatial trigger extraction from radiology reports. In Proc. Third International Workshop on Spatial Language Understanding , 50–55 (Association for Computational Linguistics, Online, 2020).
Zhang, H. et al. A novel deep learning approach to extract Chinese clinical entities for lung cancer screening and staging. BMC Med. Inform. Decis. Mak. 21 , 214 (2021).
Hu, D. et al. Automatic extraction of lung cancer staging information from computed tomography reports: Deep learning approach. JMIR Med. Inform. 9 , e27955 (2021).
Datta, S., Khanpara, S., Riascos, R. F. & Roberts, K. Leveraging spatial information in radiology reports for ischemic stroke phenotyping. AMIA Jt. Summits Transl. Sci. Proc. 2021 , 170–179 (2021).
Dada, A. et al. Information extraction from weakly structured radiological reports with natural language queries. Eur. Radiol. 34 , 330–337 (2023).
Eisenhauer, E. et al. New response evaluation criteria in solid tumours: revised RECIST guideline (version 1.1). Eur. J. Cancer 45 , 228–247 (2009).
Rosen, R. D. & Sapra, A. TNM Classification. In StatPearls (StatPearls Publishing, 2023).
University of California Berkeley. HIPAA PHI: definition of PHI and List of 18 Identifiers. https://cphs.berkeley.edu/hipaa/hipaa18.html# (2023).
Stanford NLP Group. Stanfordnlp/stanza. Stanford NLP (2024).
Sugimoto, K. et al. Extracting clinical terms from radiology reports with deep learning. J. Biomed. Inform. 116 , 103729 (2021).
US National Institutes of Health. National Cancer Institute. NCI Thesaurus. https://ncit.nci.nih.gov/ncitbrowser/.
Datta, S., Godfrey-Stovall, J. & Roberts, K. RadLex normalization in radiology reports. AMIA Annu. Symp. Proc. 2020 , 338–347 (2021).
Zhang, Z. et al. ERNIE: Enhanced language representation with informative entities. In Proc. 57th Annual Meeting of the Association for Computational Linguistics, 1441–1451 (Association for Computational Linguistics, Florence, Italy, 2019).
Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient Estimation of Word Representations in Vector Space 1301.3781 (2013).
Huang, X., Chen, H. & Yan, J. D. Study on structured method of Chinese MRI report of nasopharyngeal carcinoma. BMC Med. Inform. Decis. Mak. 21 , 203 (2021).
DocCheck. DocCheck Flexicon. https://flexikon.doccheck.com/de/Hauptseite (2024).
Jantscher, M. et al. Information extraction from German radiological reports for general clinical text and language understanding. Sci. Rep. 13 , 2353 (2023).
Zhang, X. et al. Extracting comprehensive clinical information for breast cancer using deep learning methods. Int. J. Med. Inform. 132 , 103985 (2019).
Johnson, A. E. W. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6 , 317 (2019).
Moody, G. B. & Mark, R. G. The MIMIC Database (1992).
Datta, S. & Roberts, K. Weakly supervised spatial relation extraction from radiology reports. JAMIA Open 6 , ooad027 (2023).
Johnson, A. E. W. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3 , 160035 (2016).
Datta, S. & Roberts, K. Fine-grained spatial information extraction in radiology as two-turn question answering. Int. J. Med. Inform. 158 , 104628 (2022).
Datta, S. et al. Rad-SpatialNet: a frame-based resource for fine-grained spatial relations in radiology reports. In Calzolari, N. et al . (eds.) Proc. Twelfth Language Resources and Evaluation Conference , 2251–2260 (European Language Resources Association, Marseille, France, 2020).
Demner-Fushman, D. et al. Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Med. Inform. Assoc. 23 , 304–310 (2016).
Mithun, S. et al. Clinical concept-based radiology reports classification pipeline for lung carcinoma. J. Digit. Imaging 36 , 812–826 (2023).
Bressem, K. K. et al. Highly accurate classification of chest radiographic reports using a deep learning natural language model pre-trained on 3.8 million text reports. Bioinformatics 36 , 5255–5261 (2021).
Singh, V. et al. Impact of train/test sample regimen on performance estimate stability of machine learning in cardiovascular imaging. Sci. Rep. 11 , 14490 (2021).
Demler, O. V., Pencina, M. J. & D’Agostino, R. B. Misuse of DeLong test to compare AUCs for nested models. Stat. Med. 31 , 2577–2587 (2012).
Radford, A., Narasimhan, K., Salimans, T. & Sutskever, I. Improving language understanding by generative pre-training (2018).
Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29 , 1930–1940 (2023).
Farquhar, S., Kossen, J., Kuhn, L. & Gal, Y. Detecting hallucinations in large language models using semantic entropy. Nature 630 , 625–630 (2024).
Zhang, Y. & Xu, Z. BERT for question answering on SQuAD 2.0 (2019).
OECD. Diagnostic technologies (2023).
Viry, A. et al. Annual exposure of the Swiss population from medical imaging in 2018. Radiat. Prot. Dosim. 195 , 289–295 (2021).
Reichenpfader, D., Müller, H. & Denecke, K. Large language model-based information extraction from free-text radiology reports: a scoping review protocol. BMJ Open 13 , e076865 (2023).
Shanahan, M., McDonell, K. & Reynolds, L. Role play with large language models. Nature 623 , 493–498 (2023).
Liang, S. et al. Fine-tuning BERT Models for Summarizing German Radiology Findings. In Naumann, T., Bethard, S., Roberts, K. & Rumshisky, A. (eds.) Proc. 4th Clinical Natural Language Processing Workshop , 30–40 (Association for Computational Linguistics, Seattle, WA, 2022).
Adams, L. C. et al. Leveraging GPT-4 for post hoc transformation of free-text radiology reports into structured reporting: a multilingual feasibility study. Radiology 307 , e230725 (2023).
Nowak, S. et al. Transformer-based structuring of free-text radiology report databases. Eur. Radiol. 33 , 4228–4236 (2023).
Košprdić, M., Prodanović, N., Ljajić, A., Bašaragin, B. & Milošević, N. From zero to hero: harnessing transformers for biomedical named entity recognition in zero- and few-shot contexts 2305.04928 (2023).
Smit, A. et al. Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT. In Webber, B., Cohn, T., He, Y. & Liu, Y. (eds.) Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP) , 1500–1519 (Association for Computational Linguistics, Online, 2020).
Agrawal, M., Hegselmann, S., Lang, H., Kim, Y. & Sontag, D. Large language models are few-shot clinical information extractors. In Goldberg, Y., Kozareva, Z. & Zhang, Y. (eds.) Proc. Conference on Empirical Methods in Natural Language Processing , 1998–2022 (Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022).
Kartchner, D., Ramalingam, S., Al-Hussaini, I., Kronick, O. & Mitchell, C. Zero-shot information extraction for clinical meta-analysis using large language models. In Demner-fushman, D., Ananiadou, S. & Cohen, K. (eds.) Proc. 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks , 396–405 (Association for Computational Linguistics, Toronto, Canada, 2023).
Jupin-Delevaux, É. et al. BERT-based natural language processing analysis of French CT reports: application to the measurement of the positivity rate for pulmonary embolism. Res. Diagn. Interv. Imaging 6 , 100027 (2023).
Rohanian, O., Nouriborji, M., Kouchaki, S. & Clifton, D. A. On the effectiveness of compact biomedical transformers. Bioinformatics 39 , btad103 (2023).
Gemini Team, Google. Gemini: a family of highly capable multimodal models. https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf (2024).
Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer 1910.10683 (2023).
Llama-3. Meta (2024).
Jiang, A. Q. et al. Mixtral of experts 2401.04088 (2024).
Chen, Z. et al. MEDITRON-70B: scaling medical pretraining for large language models 2311.16079 (2023).
Singhal, K. et al. Towards expert-level medical question answering with large language models 2305.09617 (2023).
Saab, K. et al. Capabilities of Gemini models in medicine 2404.18416 (2024).
Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30 , 1134–1142 (2024).
Jiang, L. Y. et al. Health system-scale language models are all-purpose prediction engines. Nature 619 , 357–362 (2023).
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620 , 172–180 (2023).
Dagdelen, J. et al. Structured information extraction from scientific text with large language models. Nat. Commun. 15 , 1418 (2024).
Zheng, L. et al. Judging LLM-as-a-judge with MT-bench and chatbot arena. Adv. Neural Inf. Process Syst. 36 , 46595–46623 (2023).
Mukherjee, P., Hou, B., Lanfredi, R. B. & Summers, R. M. Feasibility of using the privacy-preserving large language model Vicuna for labeling radiology reports. Radiology 309 , e231147 (2023).
Bressem, K. K. et al. MEDBERT.de: a comprehensive German BERT model for the medical domain. Expert Syst. Appl. 237 , 121598 (2024).
Hu, D., Liu, B., Zhu, X., Lu, X. & Wu, N. Zero-shot information extraction from radiological reports using ChatGPT. Int. J. Med. Inform. 183 , 105321 (2024).
Hu, D., Li, S., Zhang, H., Wu, N. & Lu, X. Using natural language processing and machine learning to preoperatively predict lymph node metastasis for non–small cell lung cancer with electronic medical records: development and validation study. JMIR Med. Inform. 10 , e35475 (2022).
Mallio, C. A., Sertorio, A. C., Bernetti, C. & Beomonte Zobel, B. Large language models for structured reporting in radiology: performance of GPT-4, ChatGPT-3.5, Perplexity and Bing. La Radiol. Med. 128 , 808–812 (2023).
Zhao, H. et al. Explainability for large language models: a survey. ACM Trans. Intell. Syst. Technol. 15 , 1–38 (2024).
Yang, H. et al. Unveiling the generalization power of fine-tuned large language models. In Proc. of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (eds Duh, K., Gomez, H. & Bethard, S.) 884–899 (Association for Computational Linguistics, Mexico City, Mexico, 2024). https://doi.org/10.18653/v1/2024.naacl-long.51 .
Abdin, M. et al. Phi-3 technical report: a highly capable language model locally on your phone 2404.14219 (2024).
Gilbert, S., Kather, J. N. & Hogan, A. Augmented non-hallucinating large language models as medical information curators. npj Digital Med. 7 , 1–5 (2024).
He, P., Liu, X., Gao, J. & Chen, W. DeBERTa: decoding-enhanced BERT with disentangled attention 2006.03654 (2021).
Kakarmath, S. et al. Best practices for authors of healthcare-related artificial intelligence manuscripts. NPJ Digit. Med. 3 , 134 (2020).
Rayyan - AI Powered Tool for Systematic Literature Reviews (2021).
Si, Y., Wang, J., Xu, H. & Roberts, K. Enhancing clinical concept extraction with contextual embeddings. J. Am. Med. Inform. Assoc. 26 , 1297–1304 (2019).
Liu, Y. et al. RoBERTa: a robustly optimized BERT pretraining approach 1907.11692 (2019).
Lee, J. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36 , 1234–1240 (2020).
Alsentzer, E. et al. Publicly Available Clinical BERT Embeddings. In Rumshisky, A., Roberts, K., Bethard, S. & Naumann, T. (eds.) Proc. 2nd Clinical Natural Language Processing Workshop , 72–78 (Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019).
Deepset. German BERT. https://huggingface.co/bert-base-german-cased (2019).
Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. 3 , 2:1–2:23 (2021).
Sanh, V., Debut, L., Chaumond, J. & Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter 1910.01108 (2020).
Cui, Y., Che, W., Liu, T., Qin, B. & Yang, Z. Pre-training with whole word masking for Chinese BERT. IEEE/ACM Trans. Audio, Speech, Lang. Process. 29 , 3504–3514 (2021).
Peng, Y., Yan, S. & Lu, Z. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. In Proc. of the 18th BioNLP Workshop and Shared Task (eds Demner-Fushman, D., Cohen, K. B., Ananiadou, S. & Tsujii, J.) 58–65 (Association for Computational Linguistics, Florence, Italy, 2019). https://doi.org/10.18653/v1/W19-5006 .
Chan, B., Schweter, S. & Möller, T. German’s next language model. In Proc. of the 28th International Conference on Computational Linguistics (eds Scott, D., Bel, N. & Zong, C.) 6788–6796 (International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020). https://doi.org/10.18653/v1/2020.coling-main.598 .
Shrestha, M. Development of a Language Model for the Medical Domain . Ph.D. thesis (Rhine-Waal University of Applied Sciences, 2021).
Sellam, T. et al. The MultiBERTs: BERT reproductions for robustness analysis. In ICLR 2022 (2022).
Wu, S. & He, Y. Enriching pre-trained language model with entity information for relation classification. In Proc. of the 28th ACM International Conference on Information and Knowledge Management , 2361–2364 (Association for Computing Machinery, New York, NY, USA, 2019). https://doi.org/10.1145/3357384.3358119 .
Beltagy, I., Lo, K. & Cohan, A. SciBERT: a pretrained language model for scientific text. In Proc. Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , (eds Inui, K., Jiang, J., Ng, V. & Wan, X.), 3615–3620 (Association for Computational Linguistics, Hong Kong, China, 2019).
Eberts, M. & Ulges, A. Span-based joint entity and relation extraction with transformer pre-training. In ECAI 2020 , 2006–2013 (IOS Press, 2020).
Yang, Z. et al. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems vol. 32 (Curran Associates, Inc., 2019).
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors. We thank Cornelia Zelger for her support during the search query definition process.
Authors and affiliations.
Institute for Patient-Centered Digital Health, Bern University of Applied Sciences, Biel/Bienne, Switzerland
Daniel Reichenpfader & Kerstin Denecke
Faculty of Medicine, University of Geneva, Geneva, Switzerland
Daniel Reichenpfader
Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
Henning Müller
Informatics Institute, HES-SO Valais-Wallis, Sierre, Switzerland
D.R. conceptualized the study, defined the methodology (incl. the search strategy), performed the database searches and managed the screening process. D.R. also performed data extraction and authored the original draft. K.D. focused on reviewing and editing the manuscript. K.D. also participated in the screening process. H.M. provided supervision and contributed to writing review.
Correspondence to Daniel Reichenpfader.
Competing interests.
The authors declare no competing interests.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplemental material, rights and permissions.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .
Cite this article.
Reichenpfader, D., Müller, H. & Denecke, K. A scoping review of large language model based approaches for information extraction from radiology reports. npj Digit. Med. 7, 222 (2024). https://doi.org/10.1038/s41746-024-01219-0
Received: 21 February 2024
Accepted: 09 August 2024
Published: 24 August 2024
DOI: https://doi.org/10.1038/s41746-024-01219-0
For many people, reaching their mid-40s may bring unpleasant signs the body isn’t working as well as it once did. Injuries seem to happen more frequently. Muscles may feel weaker.
A new study, published Wednesday in Nature Aging, shows what may be causing the physical decline. Researchers have found that molecules and microorganisms both inside and outside our bodies are going through dramatic changes, first at about age 44 and then again when we hit 60. Those alterations may be causing significant differences in cardiovascular health and immune function.
The findings come from Stanford scientists who analyzed blood and other biological samples of 108 volunteers ages 25 to 75, who continued to donate samples for several years.
“While it’s obvious that you’re aging throughout your entire life, there are two big periods where things really shift,” said the study’s senior author, Michael Snyder, a professor of genetics and director of the Center for Genomics and Personalized Medicine at Stanford Medicine. For example, “there’s a big shift in the metabolism of lipids when people are in their 40s and in the metabolism of carbohydrates when people are in their 60s.”
Lipids are fatty substances, including LDL, HDL and triglycerides, that perform a host of functions in the body, but they can be harmful if they build up in the blood.
The scientists tracked many kinds of molecules in the samples, including RNA and proteins, as well as the participants’ microbiomes.
The metabolic changes the researchers discovered indicate not that people in their 40s are burning calories more slowly but rather that the body is breaking food down differently. The scientists aren’t sure exactly what impact those changes have on health.
Previous research showed that resting energy use, or metabolic rate, didn’t change from ages 20 to 60. The new study’s findings don’t contradict that.
The changes in metabolism affect how the body reacts to alcohol or caffeine, although the health consequences aren’t yet clear. In the case of caffeine, it may result in higher sensitivity.
It’s also not known yet whether the shifts could be linked to lifestyle or behavioral factors. For example, the changes in alcohol metabolism might be because people are drinking more in their mid-40s, Snyder said.
For now, Snyder suggests people in their 40s keep a close eye on their lipids, especially LDL cholesterol.
“If they start going up, people might want to think about taking statins if that’s what their doctor recommends,” he said. Moreover, “knowing there’s a shift in the molecules that affect muscles and skin, you might want to warm up more before exercising so you don’t hurt yourself.”
Until we know better what those changes mean, the best way to deal with them would be to eat healthy foods and to exercise regularly, Snyder said.
Dr. Josef Coresh, founding director of the Optimal Aging Institute at the NYU Grossman School of Medicine, compared the new findings to the invention of the microscope.
“The beauty of this type of paper is the level of detail we can see in molecular changes,” said Coresh, a professor of medicine at the school. “But it will take time to sort out what individual changes mean and how we can tailor medications to those changes. We do know that the origins of many diseases happen in midlife when people are in their 40s, though the disease may occur decades later.”
The new study “is an important step forward,” said Dr. Lori Zeltser, a professor of pathology and cell biology at the Columbia University Vagelos College of Physicians and Surgeons. While we don’t know what the consequences of those metabolic changes are yet, “right now, we have to acknowledge that we metabolize food differently in our 40s, and that is something really new.”
The shifts the researchers found might help explain numerous age-related health changes, such as muscle loss, because “your body is breaking down food differently,” Zeltser said.
Linda Carroll is a regular health contributor to NBC News. She is coauthor of "The Concussion Crisis: Anatomy of a Silent Epidemic" and "Out of the Clouds: The Unlikely Horseman and the Unwanted Colt Who Conquered the Sport of Kings."
UChicago-led study casts new light on the origins of life on Earth
One of the major unanswered questions about the origin of life is how droplets of RNA floating around the primordial soup turned into the membrane-protected packets of life we call cells.
A new paper by researchers with the University of Chicago and the University of Houston proposes a solution.
They show how rainwater could have helped create a meshy wall around protocells 3.8 billion years ago, a critical step in the transition from tiny beads of RNA to every bacterium, plant, animal, and human that ever lived.
The paper was published Aug. 21 in Science Advances by UChicago Pritzker Molecular Engineering (PME) postdoctoral researcher Aman Agrawal and his co-authors, including PME Dean Emeritus Matthew Tirrell and Nobel Prize-winning biologist Jack Szostak, director of UChicago’s Chicago Center for the Origins of Life.
“This is a distinctive and novel observation,” said Tirrell.
The research looks at “coacervate droplets,” naturally occurring compartments of complex molecules like proteins, lipids, and RNA. (In the early 2000s, Szostak started looking at RNA as the first biological material to develop, rather than DNA.)
The droplets, which behave like drops of cooking oil in water, have long been eyed as a candidate for the first protocells. But there was a problem, Szostak found in 2014.
It wasn’t that these droplets couldn’t exchange molecules between each other, a key step in evolution; the problem was that they did it too well, and too fast. Any droplet containing a new, potentially useful pre-life mutation of RNA would exchange this RNA with the other RNA droplets within minutes, meaning they would quickly all be the same.
There would be no differentiation and no competition—meaning no evolution. And that means no life.
“If molecules continually exchange between droplets or between cells, then all the cells after a short while will look alike, and there will be no evolution because you are ending up with identical clones,” Agrawal said.
Agrawal started transferring coacervate droplets into distilled water during his PhD research at the University of Houston with Prof. Alamgir Karim, studying their behavior under an electric field. At this point, the research had nothing to do with the origin of life; it was simply a study of a fascinating material from an engineering perspective.
Karim had worked decades earlier at the University of Minnesota under one of the world’s top experts—Tirrell, who later became founding dean of the UChicago Pritzker School of Molecular Engineering. During a lunch with Agrawal and Karim, Tirrell brought up how the research into the effects of distilled water on coacervate droplets might relate to the origin of life on Earth. Tirrell asked where distilled water would have existed 3.8 billion years ago.
“I spontaneously said ‘rainwater!’ His eyes lit up and he was very excited at the suggestion,” Karim said. “So, you can say it was a spontaneous combustion of ideas or ideation!”
Tirrell brought Agrawal’s distilled water research to Szostak, who had recently joined the University of Chicago to lead a new push to understand the origins of life on Earth.
Working with RNA samples from Szostak, Agrawal found that transferring coacervate droplets into distilled water increased the time scale of RNA exchange – from mere minutes to several days. This was long enough for mutation, competition, and evolution.
Then, to make sure rain itself could work rather than distilled water, “We simply collected water from rain in Houston and tested the stability of our droplets in it, just to make sure what we are reporting is accurate,” Agrawal said.
In tests with the actual rainwater and with lab water modified to mimic the acidity of rainwater, they found the same results. The meshy walls formed, creating the conditions that could have led to life.
The rain falling over Houston in the 2020s does not have the chemical composition of the rain that would have fallen 750 million years after the Earth formed, and the same can be said for the model protocell system Agrawal tested. The new paper proves that building a meshy wall around protocells is possible and that such a wall can help compartmentalize the molecules of life, putting researchers closer than ever to finding the right set of chemical and environmental conditions that allow protocells to evolve.
“The molecules we used to build these protocells are just models until more suitable molecules can be found as substitutes,” Agrawal said. “While the chemistry would be a little bit different, the physics will remain the same.”
Life is by nature interdisciplinary, so Szostak, the director of UChicago’s Chicago Center for the Origins of Life, said it was natural to collaborate with both UChicago PME, UChicago’s interdisciplinary school of molecular engineering, and the chemical engineering department at the University of Houston.
“Engineers have been studying the physical chemistry of these types of complexes—and polymer chemistry more generally—for a long time. It makes sense that there's expertise in the engineering school,” Szostak said. “When we're looking at something like the origin of life, it's so complicated and there are so many parts that we need people to get involved who have any kind of relevant experience.”
Citation: “Did the exposure of coacervate droplets to rain make them the first stable protocells?” Agrawal et al., Science Advances, August 21, 2024. DOI: 10.1126/sciadv.adn9657
Funding: Houston Endowment Fellowship, Welch Foundation, U.S. Department of Energy
Adapted from an article published by the Pritzker School of Molecular Engineering.
This paper is structured as follows: Section 1 provides the introduction to the need for a validation format for research, and the fundamentals of validation and the factors involved in validation from various literature studies are discussed in Section 2. Section 3 presents the methodology used in framing the validation format.
4.3 The practice of combining a validation study and a research study together by using an unvalidated ... paper which contains a set of questions to assess ...
Validation studies, in which an investigator compares the accuracy of a measure with a gold standard measure, are an important way to understand and mitigate this bias. More attention is being paid to the importance of validation studies in recent years, yet they remain rare in epidemiologic research and, in our experience, they remain poorly ...
Domain identification. The first step is to articulate the domain(s) that you are endeavoring to measure. A domain or construct refers to the concept, attribute, or unobserved behavior that is the target of the study. Therefore, the domain being examined should be decided upon and defined before any item activity. A well-defined domain will provide a working knowledge of the phenomenon ...
Methods and analysis A systematic descriptive literature review of qualitative and quantitative research will be used to investigate the scope of validation practice in the rapidly growing field of health literacy assessment. This review method employs a frequency analysis to reveal potentially interpretable patterns of phenomena in a research area; in this study, patterns in types of validity ...
Consequently, validation challenges have been well observed in the earlier research, and our aim in this paper is to study the validation methods that resolve or alleviate these challenges. Gao et al. (2019) also argue that there is a deficiency in supporting tools for validating AI systems. Readily available tools are not non-existent.
The George W. Woodruff School of Mechanical Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0405, USA. * Corresponding Author, to whom correspondence should be addressed: janet ...
Surveys are among the most widely used tools in research impact evaluation. Quantitative approaches such as surveys are suggested for accountability purposes, as the most appropriate way that calls for transparency (Guthrie et al. 2013). They provide a broad overview of the status of a body of research and supply comparable, easy-to-analyze data referring to a range of researchers and/or grants.
More attention is being paid to the importance of validation studies, and several journals 12, 13 have even created submission categories for validation studies, yet they remain rare in epidemiologic research. To address this gap, we pose a series of questions that could be used on an exam about validation studies or in teaching validation ...
product studies and improvement, formulation pilot batch testing, scale-up research, exchange of innovation to business scale groups, setting up stability conditions, and managing of in-process, finished pharmaceutical formulations, qualification of equipment, master documents, and process limit [4]. Stage 2 This involves process validation phase.
Given (1) the historical context for the terms verification and validation in software and hardware standards, regulations, and guidances, and (2) the separated concepts of analytical and clinical ...
Importance of data validation. Data validation is important for several aspects of a well-conducted study: To ensure a robust dataset: The primary aim of data validation is to ensure an error-free dataset for further analysis. This is especially important if you or other researchers plan to use the dataset for future studies or to train machine ...
The criteria for the validation of qualitative research are still open to discussion. This article has two aims: first, to present a summary of concepts, emerging from the field of qualitative ...
External validation studies are an important but often neglected part of prediction model research. In this article, the second in a series on model evaluation, Riley and colleagues explain what an external validation study entails and describe the key steps involved, from establishing a high quality dataset to evaluating a model's predictive performance and clinical usefulness. A clinical ...
Background: Validity is a core topic in educational and psychological assessment. Although there are many available resources describing the concept of validity, sources of validity evidence, and suggestions about how to obtain validity evidence; there is little guidance providing specific instructions for planning and carrying out validation studies. Method: In this paper we describe (a) the ...
This study examines academic validation, or the validation of students by instructors (Rendón, 1994; Rendón & Munoz, 2011). Regardless of full- or part-time status, living ... Rendón's (1994) research offered examples of validating practices demonstrated by faculty, including: (1) demonstrating genuine and authentic concern when teaching ...
Results. Indicator validation studies should report on participation at every stage, and provide data on reasons for non-participation. Metrics of individual validity (sensitivity, specificity, area under the receiver operating characteristic curve) and population-level validity (inflation factor) should be reported, as well as the percent of survey responses that are "don't know" or ...
The emphasis on strategies that are implemented during the research process has been replaced by strategies for evaluating trustworthiness and utility that are implemented once a study is completed. In this article, we argue that reliability and validity remain appropriate concepts for attaining rigor in qualitative research.
The statistical results to validate the questionnaire have been significant, allowing us to propose this experience as a starting point to implement further studies about the development of research skills in university students from other areas of knowledge.
Keywords: Validation, Learning, Research skills, University.
1. Introduction
Peer debriefing is a form of external evaluation of the qualitative research process. Lincoln and Guba (1985, p. 308) describe the role of the peer reviewer as the "devil's advocate." It is a person who asks difficult questions about the procedures, meanings, interpretations, and conclusions of the investigation.
Joseph A. Kitchen is an assistant professor of higher education at the University of Miami in Miami, Florida. Dr. Kitchen conducts quantitative, qualitative, and mixed-methods research and his research agenda spans several areas, with a central focus on the role of college transition, outreach, and support programs and interventions in promoting equitable outcomes and college success among ...
MIT researchers have discovered how fasting impacts the regenerative abilities of intestinal stem cells, reports Ed Cara for Gizmodo. "The major finding of our current study is that refeeding after fasting is a distinct state from fasting itself," explain Prof. Ömer Yilmaz and postdocs Shinya Imada and Saleh Khawaled.
Full-text reading and screening were performed independently by 2 reviewers, and information was extracted into a pretested template for the 5 research questions. Disagreements were resolved by a domain expert. The study protocol has previously been published. Results: The search resulted in a total of 764 papers.
Validation is an important variable in a typical research and development study, especially a study that develops a product. According to Glod-Lendvai (2018), validation is to test the ...
Background Lymphatic filariasis (LF), a mosquito-borne helminth infection, is an important cause of chronic disability globally. The World Health Organization has validated eight Pacific Island countries as having eliminated lymphatic filariasis (LF) as a public health problem, but there are limited data to support an evidence-based approach to post-validation surveillance (PVS). Tonga was ...
Similarly, multiple studies mention validation on external or multi-institutional data as a future research direction 19,26,59. Two studies mention the need of semantic enrichment or normalization ...