University of Tasmania, Australia

Systematic Reviews for Health: 1. Formulate the Research Question



Step 1. Formulate the Research Question

A systematic review is based on a pre-defined, specific research question (Cochrane Handbook, 1.1). The first step in a systematic review is to determine its focus: clearly frame the question(s) the review seeks to answer (Cochrane Handbook, 2.1). Developing a good review question may take a while, but it is an important step in your review. Well-formulated questions will guide many aspects of the review process, including determining eligibility criteria, searching for studies, collecting data from included studies, and presenting findings (Cochrane Handbook, 2.1).

The research question should be clear and focused: not too vague, too narrow, or too broad.

You may like to consider some of the techniques mentioned below to help you with this process. They can be useful but are not necessary for a good search strategy.

PICO - to search for quantitative review questions

  • P (Patient/Population/Problem): the most important characteristics of the patient (e.g. age, disease/condition, gender)
  • I (Intervention): the main intervention (e.g. drug treatment, diagnostic/screening test)
  • C (Comparison, if appropriate): the main alternative (e.g. placebo, standard therapy, no treatment, gold standard)
  • O (Outcome): what you are trying to accomplish, measure, improve or affect (e.g. reduced mortality or morbidity, improved memory)

Richardson, WS, Wilson, MC, Nishikawa, J & Hayward, RS 1995, 'The well-built clinical question: A key to evidence-based decisions', ACP Journal Club, vol. 123, no. 3, pp. A12-A12.


A variant of PICO is PICOS. The S stands for Study design: it establishes which study designs are appropriate for answering the question, e.g. a randomised controlled trial (RCT). There are also PICOC (C for context) and PICOT (T for timeframe).

You may find this document on PICO / PIO / PEO useful:

  • Framing a PICO / PIO / PEO question, developed by Teesside University

SPIDER - to search for qualitative and mixed methods research studies

  • S: Sample
  • PI: Phenomenon of Interest
  • D: Design
  • E: Evaluation
  • R: Research type

Cooke, A, Smith, D & Booth, A 2012, 'Beyond PICO: The SPIDER tool for qualitative evidence synthesis', Qualitative Health Research, vol. 22, no. 10, pp. 1435-1443.


SPICE - to search for qualitative evidence

  • S: Setting (where?)
  • P: Perspective (for whom?)
  • I: Intervention (what?)
  • C: Comparison (compared with what?)
  • E: Evaluation (with what result?)

Cleyle, S & Booth, A 2006, 'Clear and present questions: Formulating questions for evidence based practice', Library hi tech , vol. 24, no. 3, pp. 355-368.

ECLIPSE - to search for health policy/management information

  • E: Expectation (improvement or information or innovation)
  • C: Client group (at whom the service is aimed)
  • L: Location (where is the service located?)
  • I: Impact (outcomes)
  • P: Professionals (who is involved in providing/improving the service?)
  • Se: Service (for which service are you looking for information?)

Wildridge, V & Bell, L 2002, 'How CLIP became ECLIPSE: A mnemonic to assist in searching for health policy/management information', Health Information & Libraries Journal, vol. 19, no. 2, pp. 113-115.

There are many more techniques available. See the guide below from the CQUniversity Library for an extensive list:

  • Question frameworks overview from Framing your research question guide, developed by CQUniversity Library

This is the specific research question used in the example:

"Is animal-assisted therapy more effective than music therapy in managing aggressive behaviour in elderly people with dementia?"

Within this question are the four PICO concepts:

  • P: elderly patients with dementia
  • I: animal-assisted therapy
  • C: music therapy
  • O: aggressive behaviour

S - Study design

This is a therapy question. The best study design to answer a therapy question is a randomised controlled trial (RCT). You may decide to include in the systematic review only studies that used an RCT; see Step 8.



  • Last Updated: Sep 3, 2024 10:18 AM
  • URL: https://utas.libguides.com/SystematicReviews


UCL LIBRARY SERVICES


Formulating a research question



Systematic reviews address clear and answerable research questions, rather than a general topic or problem of interest. Clarifying the review question leads to specifying what type of studies can best address that question and to setting out criteria for including such studies in the review; these criteria are often called inclusion criteria or eligibility criteria. The criteria could relate to the review topic, the research methods of the studies, specific populations, settings, date limits, geographical areas, types of interventions, or something else.

Six examples of types of question are listed below; the examples show different questions that a review might address based on the topic of influenza vaccination. Structuring questions in this way aids thinking about the different types of research that could address each type of question. Mnemonics can help in thinking about the criteria that research must fulfil to address the question.

Examples of review questions

  • Needs - What do people want? Example: What are the information needs of healthcare workers regarding vaccination for seasonal influenza?
  • Impact or effectiveness - What is the balance of benefit and harm of a given intervention? Example: What is the effectiveness of strategies to increase vaccination coverage among healthcare workers? What is the cost-effectiveness of interventions that increase immunisation coverage?
  • Process or explanation - Why does it work (or not work)? How does it work (or not work)?  Example: What factors are associated with uptake of vaccinations by healthcare workers?  What factors are associated with inequities in vaccination among healthcare workers?
  • Correlation - What relationships are seen between phenomena? Example: How does influenza vaccination of healthcare workers vary with morbidity and mortality among patients? (Note: correlation does not in itself indicate causation).
  • Views / perspectives - What are people's experiences? Example: What are the views and experiences of healthcare workers regarding vaccination for seasonal influenza?
  • Service implementation - What is happening? Example: What is known about the implementation and context of interventions to promote vaccination for seasonal influenza among healthcare workers?

Examples in practice: Seasonal influenza vaccination of health care workers: evidence synthesis / Lorenc et al., 2017

Example of eligibility criteria

Research question: What are the views and experiences of UK healthcare workers regarding vaccination for seasonal influenza?

Inclusion criteria:

  • Population: healthcare workers, any type, including those without direct contact with patients.
  • Context: seasonal influenza vaccination for healthcare workers.
  • Study design: qualitative data including interviews, focus groups, ethnographic data.
  • Date of publication: all.
  • Country: all UK regions.

Exclusion criteria:

  • Studies focused on influenza vaccination for the general population and pandemic influenza vaccination.
  • Studies using survey data with only closed questions; studies that only report quantitative data.

Consider the research boundaries

It is important to consider the reasons that the research question is being asked. Any research question has ideological and theoretical assumptions around the meanings and processes it is focused on. A systematic review should either specify definitions and boundaries around these elements at the outset, or be clear about which elements are undefined. 

For example, if we are interested in the topic of homework, there are likely to be pre-conceived ideas about what is meant by 'homework'. If we want to know the impact of homework on educational attainment, we need to set boundaries on the age range of children, or how educational attainment is measured. There may also be a particular setting or context: type of school, country, gender, the timeframe of the literature, or the study designs of the research.

Research question: What is the impact of homework on children's educational attainment?

Inclusion criteria:

  • Scope: homework, i.e. tasks set by school teachers for students to complete out of school time, in any format or setting.
  • Population: children aged 5-11 years.
  • Outcomes: measures of literacy or numeracy from tests administered by researchers, school or other authorities.
  • Study design: studies with a comparison control group.
  • Context: OECD countries, all settings within mainstream education.
  • Date limit: 2007 onwards.

Exclusion criteria:

  • Any context not in mainstream primary schools.
  • Non-English language studies.

Mnemonics for structuring questions

Some mnemonics can help to formulate research questions, set the boundaries of a question, and inform a search strategy.

Intervention effects

PICO: Population – Intervention – Comparison – Outcome

Variations: add 'T' for time, 'C' for context, or 'S' for study type.

Policy and management issues

ECLIPSE : Expectation – Client group – Location – Impact ‐ Professionals involved – Service

Expectation encourages reflection on what the information is needed for, i.e. improvement, innovation or information. Impact looks at what you would like to achieve, e.g. improve team communication.

  • How CLIP became ECLIPSE: a mnemonic to assist in searching for health policy/management information / Wildridge & Bell, 2002

Analysis tool for management and organisational strategy

PESTLE: Political – Economic – Social – Technological – Legal – Environmental

An analysis tool that can be used by organizations for identifying external factors which may influence their strategic development, marketing strategies, new technologies or organisational change.

  • PESTLE analysis / CIPD, 2010

Service evaluations with qualitative study designs

SPICE: Setting (context) – Perspective – Intervention – Comparison – Evaluation

Perspective relates to users or potential users. Evaluation is how you plan to measure the success of the intervention.

  • Clear and present questions: formulating questions for evidence based practice / Booth, 2006

Read more about some of the frameworks for constructing review questions:

  • Formulating the Evidence Based Practice Question: A Review of the Frameworks / Davis, 2011
  • Last Updated: Aug 2, 2024 9:22 AM
  • URL: https://library-guides.ucl.ac.uk/systematic-reviews

RMIT University


Systematic reviews.


Develop your research question

Types of questions, PICO framework, SPICE, SPIDER and ECLIPSE.


A systematic review is an in-depth attempt to answer a specific, focused question in a methodical way.

Start with a clearly defined, researchable question that accurately and succinctly sums up the review's line of inquiry.

A well-formulated review question will help determine your inclusion and exclusion criteria, the creation of your search strategy, the collection of data and the presentation of your findings.

It is important to ensure the question:

  • relates to what you really need to know about your topic
  • is answerable, specific and focused
  • strikes a suitable balance between being too broad and too narrow in scope
  • has been formulated with care so as to avoid missing relevant studies or collecting a potentially biased result set

Is the research question justified?

  • Do healthcare providers, consumers, researchers, and policy makers require this evidence for their healthcare decisions?
  • Is there a gap in the current literature? The question should be worthy of an answer.
  • Has a similar review been done before?

Question types

To help focus the question and determine the most appropriate type of evidence, consider the type of question. Is there a study design (e.g. randomized controlled trial, meta-analysis) that would provide the best answer?

Is your research question going to focus on:

  • Diagnosis : How to select and interpret diagnostic tests
  • Intervention/Therapy : How to select treatments to offer patients that do more good than harm and that are worth the efforts and costs of using them
  • Prediction/Prognosis : How to estimate the patient’s likely clinical course over time and anticipate likely complications of disease
  • Exploration/Etiology : How to identify causes for disease, including genetics

If appropriate, use a  framework  to help in the development of your research question. A framework will assist in identifying the important concepts in your question.

A good question will combine several concepts. Identifying the relevant concepts is crucial to successful development and execution of your systematic search. Your research question should provide you with a checklist for the main concepts to be included in your search strategy.

Using a framework to aid in the development of a research question can be useful. The more you understand your question the more likely you are to obtain relevant results for your review. There are a number of different frameworks available.

A technique often used in research for formulating a  clinical research question  is the PICO   model. PICO is explored in more detail in this guide. Slightly different versions of this concept are used to search for quantitative and qualitative reviews.

For quantitative reviews:

PICO = Population, Intervention, Comparison, Outcome

  • Population, Patient or Problem: Who or what is the question about? What is the problem you are looking at? Is there a specific population you need to focus on? Describe the most important characteristics of the patient, population or problem.
  • Intervention or Indicator: What treatment or changes are you looking to explore? What do you want to do with this patient? What factor may influence the prognosis of the patient?
  • Comparison or Control: Is there a comparison treatment to be considered? The comparison may be with another medication, another form of treatment, or no treatment at all. Your clinical question does not always have to include a specific comparison; a comparison may involve multiple interventions, or an intervention compared with no intervention.
  • Outcome: What are you trying to accomplish, measure, improve or affect? What are you trying to do for the patient? Relieve or eliminate the symptoms? Reduce the number of adverse events? Improve function or test scores? What results will you consider to determine if, or how well, the intervention is working?

For qualitative reviews:

PICo = Population or Problem, Interest, Context

  • Population or Problem: What are the characteristics of the population or the patient? What is the problem, condition or disease you are interested in?
  • Interest: a defined event, activity, experience or process.
  • Context: the setting or distinct characteristics.

For qualitative evidence:

SPICE = Setting, Perspective, Intervention or Exposure or Interest, Comparison, Evaluation

  • Setting: the context for the question.
  • Perspective: the users, potential users, or stakeholders of the service.
  • Intervention, Exposure or Interest: the action taken for the users, potential users, or stakeholders.
  • Comparison: the alternative actions or outcomes.
  • Evaluation: the result or measurement that will determine the success of the intervention.

  • Booth, A. (2006). Clear and present questions: Formulating questions for evidence based practice. Library Hi Tech, 24(3), 355-368.

SPIDER = Sample, Phenomenon of Interest, Design, Evaluation, Research Type

  • Sample: sample size may vary between qualitative and quantitative studies.
  • Phenomenon of Interest: includes behaviours, experiences and interventions.
  • Design: influences the strength of the study analysis and findings.
  • Evaluation: outcomes may include more subjective outcomes, such as views and attitudes.
  • Research type: qualitative, quantitative or mixed methods studies.

  • Cooke, A., Smith, D., & Booth, A. (2012). Beyond PICO: The SPIDER tool for qualitative evidence synthesis. Qualitative Health Research, 22(10), 1435-1443.

ECLIPSE = Expectation, Client group, Location, Impact, Professionals, Service

  • Expectation: improvement, information or innovation.
  • Client group: at whom the service is aimed.
  • Location: where is the service located?
  • Impact: outcomes.
  • Professionals: who is involved in providing or improving the service?
  • Service: for which service are you looking for information?

  • Wildridge, V., & Bell, L. (2002). How CLIP became ECLIPSE: A mnemonic to assist in searching for health policy/management information. Health Information & Libraries Journal, 19(2), 113-115.

Creative Commons license: CC-BY-NC.

  • Last Updated: Aug 30, 2024 4:17 PM
  • URL: https://rmit.libguides.com/systematicreviews

2. Develop a Research Question


A well-developed and answerable question is the foundation for any systematic review. When developing your question, keep the following in mind:

  • Systematic review questions typically follow a PICO-format (patient or population, intervention, comparison, and outcome)
  • Using the PICO framework can help team members clarify and refine the scope of their question. For example, if the population is breast cancer patients, is it all breast cancer patients or just a segment of them? 
  • When formulating your research question, you should also consider how it could be answered. If it is not possible to answer your question (the research would be unethical, for example), you'll need to reconsider what you're asking
  • Typically, systematic review protocols include a list of studies that will be included in the review. These studies, known as exemplars, guide the search development but also serve as proof of concept that your question is answerable. If you are unable to find studies to include, you may need to reconsider your question

Other Question Frameworks

PICO is a helpful framework for clinical research questions, but may not be the best for other types of research questions. Did you know there are at least 25 other question frameworks besides variations of PICO? Frameworks like PEO, SPIDER, SPICE, and ECLIPSE can help you formulate a focused research question. The table and example below were created by the Medical University of South Carolina (MUSC) Libraries.

The PEO question framework is useful for qualitative research topics. PEO questions identify three concepts: population, exposure, and outcome.

Research question: What are the daily living experiences of mothers with postnatal depression?

  • Population: Who is my question focused on? (Example: mothers)
  • Exposure: What is the issue I am interested in? (Example: postnatal depression)
  • Outcome: What, in relation to the issue, do I want to examine? (Example: daily living experiences)

The SPIDER question framework is useful for qualitative or mixed methods research topics focused on "samples" rather than populations. SPIDER questions identify five concepts: sample, phenomenon of interest, design, evaluation, and research type.

Research question : What are the experiences of young parents in attendance at antenatal education classes?

  • Sample: Who is the group of people being studied? (Example: young parents)
  • Phenomenon of Interest: What are the reasons for behavior and decisions? (Example: attendance at antenatal education classes)
  • Design: How has the research been collected (e.g., interview, survey)? (Example: interviews)
  • Evaluation: What is the outcome being impacted? (Example: experiences)
  • Research type: What type of research (qualitative or mixed methods)? (Example: qualitative studies)

The SPICE question framework is useful for qualitative research topics evaluating the outcomes of a service, project, or intervention. SPICE questions identify five concepts: setting, perspective, intervention/exposure/interest, comparison, and evaluation.

Research question : For teenagers in South Carolina, what is the effect of provision of Quit Kits to support smoking cessation on number of successful attempts to give up smoking compared to no support ("cold turkey")?

  • Setting: the context for the question (where). (Example: South Carolina)
  • Perspective: the users, potential users, or stakeholders of the service (for whom). (Example: teenagers)
  • Intervention / Exposure: the action taken for the users, potential users, or stakeholders (what). (Example: provision of Quit Kits to support smoking cessation)
  • Comparison: the alternative actions or outcomes (compared to what). (Example: no support or "cold turkey")
  • Evaluation: the result or measurement that will determine the success of the intervention (what is the result, how well). (Example: number of successful attempts to give up smoking with Quit Kits compared to number of successful attempts with no support)

The ECLIPSE framework is useful for qualitative research topics investigating the outcomes of a policy or service. ECLIPSE questions identify six concepts: expectation, client group, location, impact, professionals, and service.

Research question:  How can I increase access to wireless internet for hospital patients?

  • Expectation: What are you looking to improve or change? What is the information going to be used for? (Example: to increase access to wireless internet in the hospital)
  • Client group: Who is the service or policy aimed at? (Example: patients and families)
  • Location: Where is the service or policy located? (Example: hospitals)
  • Impact: What is the change in service or policy that the researcher is investigating? (Example: clients have easy access to free internet)
  • Professionals: Who is involved in providing or improving the service or policy? (Example: IT, hospital administration)
  • Service: What kind of service or policy is this? (Example: provision of free wireless internet to patients)
  • Last Updated: Jun 18, 2024 9:41 AM
  • URL: https://guides.mclibrary.duke.edu/sysreview


Systematic Review | Definition, Example & Guide

Published on June 15, 2022 by Shaun Turney . Revised on November 20, 2023.

A systematic review is a type of review that uses repeatable methods to find, select, and synthesize all available evidence. It answers a clearly formulated research question and explicitly states the methods used to arrive at the answer.

The example used throughout this article is a systematic review by Boyle and colleagues, who answered the question “What is the effectiveness of probiotics in reducing eczema symptoms and improving quality of life in patients with eczema?”

In this context, a probiotic is a health product that contains live microorganisms and is taken by mouth. Eczema is a common skin condition that causes red, itchy skin.


A review is an overview of the research that’s already been completed on a topic.

What makes a systematic review different from other types of reviews is that the research methods are designed to reduce bias . The methods are repeatable, and the approach is formal and systematic:

  • Formulate a research question
  • Develop a protocol
  • Search for all relevant studies
  • Apply the selection criteria
  • Extract the data
  • Synthesize the data
  • Write and publish a report

Although multiple sets of guidelines exist, the Cochrane Handbook for Systematic Reviews is among the most widely used. It provides detailed guidelines on how to complete each step of the systematic review process.

Systematic reviews are most commonly used in medical and public health research, but they can also be found in other disciplines.

Systematic reviews typically answer their research question by synthesizing all available evidence and evaluating the quality of the evidence. Synthesizing means bringing together different information to tell a single, cohesive story. The synthesis can be narrative ( qualitative ), quantitative , or both.


Systematic reviews often quantitatively synthesize the evidence using a meta-analysis . A meta-analysis is a statistical analysis, not a type of review.

A meta-analysis is a technique to synthesize results from multiple studies. It’s a statistical analysis that combines the results of two or more studies, usually to estimate an effect size .
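To make the idea of combining results concrete, here is a minimal sketch of a fixed-effect, inverse-variance meta-analysis in Python. The study names and numbers are hypothetical, and real reviews typically rely on dedicated software and often on random-effects models; this only illustrates how effect sizes are weighted and pooled.

```python
# Illustrative sketch only: a fixed-effect, inverse-variance meta-analysis
# with hypothetical effect sizes (e.g. mean differences) and variances.
import math

studies = [
    {"name": "Study A", "effect": -0.30, "variance": 0.02},
    {"name": "Study B", "effect": -0.10, "variance": 0.05},
    {"name": "Study C", "effect": -0.25, "variance": 0.04},
]

# Each study is weighted by the inverse of its variance, so more precise
# studies contribute more to the pooled estimate.
weights = [1.0 / s["variance"] for s in studies]
pooled_effect = sum(w * s["effect"] for w, s in zip(weights, studies)) / sum(weights)

# The variance of the pooled estimate is the reciprocal of the summed weights,
# giving a 95% confidence interval under a normal approximation.
pooled_se = math.sqrt(1.0 / sum(weights))
ci_low, ci_high = pooled_effect - 1.96 * pooled_se, pooled_effect + 1.96 * pooled_se

print(f"Pooled effect: {pooled_effect:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```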

A literature review is a type of review that uses a less systematic and formal approach than a systematic review. Typically, an expert in a topic will qualitatively summarize and evaluate previous work, without using a formal, explicit method.

Although literature reviews are often less time-consuming and can be insightful or helpful, they have a higher risk of bias and are less transparent than systematic reviews.

Similar to a systematic review, a scoping review is a type of review that tries to minimize bias by using transparent and repeatable methods.

However, a scoping review isn’t a type of systematic review. The most important difference is the goal: rather than answering a specific question, a scoping review explores a topic. The researcher tries to identify the main concepts, theories, and evidence, as well as gaps in the current research.

Sometimes scoping reviews are an exploratory preparation step for a systematic review, and sometimes they are a standalone project.


A systematic review is a good choice of review if you want to answer a question about the effectiveness of an intervention , such as a medical treatment.

To conduct a systematic review, you’ll need the following:

  • A precise question , usually about the effectiveness of an intervention. The question needs to be about a topic that’s previously been studied by multiple researchers. If there’s no previous research, there’s nothing to review.
  • If you’re doing a systematic review on your own (e.g., for a research paper or thesis ), you should take appropriate measures to ensure the validity and reliability of your research.
  • Access to databases and journal archives. Often, your educational institution provides you with access.
  • Time. A professional systematic review is a time-consuming process: it will take the lead author about six months of full-time work. If you’re a student, you should narrow the scope of your systematic review and stick to a tight schedule.
  • Bibliographic, word-processing, spreadsheet, and statistical software . For example, you could use EndNote, Microsoft Word, Excel, and SPSS.

A systematic review has many pros .

  • They minimize research bias by considering all available evidence and evaluating each study for bias.
  • Their methods are transparent , so they can be scrutinized by others.
  • They’re thorough : they summarize all available evidence.
  • They can be replicated and updated by others.

Systematic reviews also have a few cons .

  • They’re time-consuming .
  • They’re narrow in scope : they only answer the precise research question.

The 7 steps for conducting a systematic review are explained with an example.

Step 1: Formulate a research question

Formulating the research question is probably the most important step of a systematic review. A clear research question will:

  • Allow you to more effectively communicate your research to other researchers and practitioners
  • Guide your decisions as you plan and conduct your systematic review

A good research question for a systematic review has four components, which you can remember with the acronym PICO :

  • Population(s) or problem(s)
  • Intervention(s)
  • Comparison(s)
  • Outcome(s)

You can rearrange these four components to write your research question:

  • What is the effectiveness of I versus C for O in P ?

Sometimes, you may want to include a fifth component, the type of study design. In this case, the acronym is PICOT:

  • Type of study design(s)

In the probiotics example, the question components were:

  • The population of patients with eczema
  • The intervention of probiotics
  • In comparison to no treatment, placebo, or non-probiotic treatment
  • The outcome of changes in participant-, parent-, and doctor-rated symptoms of eczema and quality of life
  • Randomized control trials, a type of study design

Their research question was:

  • What is the effectiveness of probiotics versus no treatment, a placebo, or a non-probiotic treatment for reducing eczema symptoms and improving quality of life in patients with eczema?

Step 2: Develop a protocol

A protocol is a document that contains your research plan for the systematic review. This is an important step because having a plan allows you to work more efficiently and reduces bias.

Your protocol should include the following components:

  • Background information : Provide the context of the research question, including why it’s important.
  • Research objective (s) : Rephrase your research question as an objective.
  • Selection criteria: State how you’ll decide which studies to include or exclude from your review.
  • Search strategy: Discuss your plan for finding studies.
  • Analysis: Explain what information you’ll collect from the studies and how you’ll synthesize the data.

If you’re a professional seeking to publish your review, it’s a good idea to bring together an advisory committee . This is a group of about six people who have experience in the topic you’re researching. They can help you make decisions about your protocol.

It’s highly recommended to register your protocol. Registering your protocol means submitting it to a database such as PROSPERO or ClinicalTrials.gov .

Step 3: Search for all relevant studies

Searching for relevant studies is the most time-consuming step of a systematic review.

To reduce bias, it’s important to search for relevant studies very thoroughly. Your strategy will depend on your field and your research question, but sources generally fall into these four categories:

  • Databases: Search multiple databases of peer-reviewed literature, such as PubMed or Scopus. Think carefully about how to phrase your search terms and include multiple synonyms of each word. Use Boolean operators if relevant (a small illustration of combining synonyms with Boolean operators follows this list).
  • Handsearching: In addition to searching the primary sources using databases, you’ll also need to search manually. One strategy is to scan relevant journals or conference proceedings. Another strategy is to scan the reference lists of relevant studies.
  • Gray literature: Gray literature includes documents produced by governments, universities, and other institutions that aren’t published by traditional publishers. Graduate student theses are an important type of gray literature, which you can search using the Networked Digital Library of Theses and Dissertations (NDLTD) . In medicine, clinical trial registries are another important type of gray literature.
  • Experts: Contact experts in the field to ask if they have unpublished studies that should be included in your review.
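As a rough illustration of the Boolean logic behind a database search, the hypothetical Python snippet below assembles a query for the probiotics-and-eczema example by joining synonyms for each concept with OR and the concept groups with AND. The synonym lists are illustrative, not a validated search strategy; real strategies also use field tags, controlled vocabulary (e.g. MeSH), and database-specific syntax.

```python
# Hypothetical sketch: build a Boolean search string from two PICO concepts.
# The synonym lists are illustrative only, not a validated search strategy.
concepts = {
    "population": ["eczema", "atopic dermatitis"],
    "intervention": ["probiotic*", "lactobacillus"],
}

def boolean_query(concept_terms):
    """Join synonyms within a concept with OR, and concepts with AND."""
    groups = []
    for terms in concept_terms.values():
        quoted = [f'"{t}"' if " " in t else t for t in terms]  # quote phrases
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)

print(boolean_query(concepts))
# (eczema OR "atopic dermatitis") AND (probiotic* OR lactobacillus)
```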

At this stage of your review, you won’t read the articles yet. Simply save any potentially relevant citations using bibliographic software, such as Scribbr’s APA or MLA Generator .

  • Databases: EMBASE, PsycINFO, AMED, LILACS, and ISI Web of Science
  • Handsearch: Conference proceedings and reference lists of articles
  • Gray literature: The Cochrane Library, the metaRegister of Controlled Trials, and the Ongoing Skin Trials Register
  • Experts: Authors of unpublished registered trials, pharmaceutical companies, and manufacturers of probiotics

Step 4: Apply the selection criteria

Applying the selection criteria is a three-person job. Two of you will independently read the studies and decide which to include in your review based on the selection criteria you established in your protocol . The third person’s job is to break any ties.

To increase inter-rater reliability , ensure that everyone thoroughly understands the selection criteria before you begin.
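Screening agreement between the two reviewers is often summarized with a chance-corrected statistic such as Cohen's kappa (the article itself does not prescribe one); the sketch below, with made-up include/exclude decisions, shows how such a figure is calculated.

```python
# Illustrative sketch: Cohen's kappa for agreement between two screeners.
def cohens_kappa(rater_a, rater_b):
    """Observed agreement corrected for the agreement expected by chance."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(lab) / n) * (rater_b.count(lab) / n) for lab in labels
    )
    return (observed - expected) / (1 - expected)

# Hypothetical include/exclude decisions for ten abstracts.
a = ["in", "in", "out", "out", "in", "out", "out", "in", "out", "out"]
b = ["in", "out", "out", "out", "in", "out", "in", "in", "out", "out"]
print(f"kappa = {cohens_kappa(a, b):.2f}")
```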

If you’re writing a systematic review as a student for an assignment, you might not have a team. In this case, you’ll have to apply the selection criteria on your own; you can mention this as a limitation in your paper’s discussion.

You should apply the selection criteria in two phases:

  • Based on the titles and abstracts : Decide whether each article potentially meets the selection criteria based on the information provided in the abstracts.
  • Based on the full texts: Download the articles that weren’t excluded during the first phase. If an article isn’t available online or through your library, you may need to contact the authors to ask for a copy. Read the articles and decide which articles meet the selection criteria.

It’s very important to keep a meticulous record of why you included or excluded each article. When the selection process is complete, you can summarize what you did using a PRISMA flow diagram .

Next, Boyle and colleagues found the full texts for each of the remaining studies. Boyle and Tang read through the articles to decide if any more studies needed to be excluded based on the selection criteria.

When Boyle and Tang disagreed about whether a study should be excluded, they discussed it with Varigos until the three researchers came to an agreement.

Step 5: Extract the data

Extracting the data means collecting information from the selected studies in a systematic way. There are two types of information you need to collect from each study:

  • Information about the study’s methods and results . The exact information will depend on your research question, but it might include the year, study design , sample size, context, research findings , and conclusions. If any data are missing, you’ll need to contact the study’s authors.
  • Your judgment of the quality of the evidence, including risk of bias .

You should collect this information using forms. You can find sample forms in The Registry of Methods and Tools for Evidence-Informed Decision Making and the Grading of Recommendations, Assessment, Development and Evaluations Working Group .

Extracting the data is also a three-person job. Two people should do this step independently, and the third person will resolve any disagreements.

Boyle and colleagues also collected data about possible sources of bias, such as how the study participants were randomized into the control and treatment groups.

Step 6: Synthesize the data

Synthesizing the data means bringing together the information you collected into a single, cohesive story. There are two main approaches to synthesizing the data:

  • Narrative ( qualitative ): Summarize the information in words. You’ll need to discuss the studies and assess their overall quality.
  • Quantitative : Use statistical methods to summarize and compare data from different studies. The most common quantitative approach is a meta-analysis , which allows you to combine results from multiple studies into a summary result.

Generally, you should use both approaches together whenever possible. If you don’t have enough data, or the data from different studies aren’t comparable, then you can take just a narrative approach. However, you should justify why a quantitative approach wasn’t possible.

Boyle and colleagues also divided the studies into subgroups, such as studies about babies, children, and adults, and analyzed the effect sizes within each group.

Step 7: Write and publish a report

The purpose of writing a systematic review article is to share the answer to your research question and explain how you arrived at this answer.

Your article should include the following sections:

  • Abstract : A summary of the review
  • Introduction : Including the rationale and objectives
  • Methods : Including the selection criteria, search method, data extraction method, and synthesis method
  • Results : Including results of the search and selection process, study characteristics, risk of bias in the studies, and synthesis results
  • Discussion : Including interpretation of the results and limitations of the review
  • Conclusion : The answer to your research question and implications for practice, policy, or research

To verify that your report includes everything it needs, you can use the PRISMA checklist .

Once your report is written, you can publish it in a systematic review database, such as the Cochrane Database of Systematic Reviews , and/or in a peer-reviewed journal.

In their report, Boyle and colleagues concluded that probiotics cannot be recommended for reducing eczema symptoms or improving quality of life in patients with eczema.

Note: Generative AI tools like ChatGPT can be useful at various stages of the writing and research process and can help you to write your systematic review. However, we strongly advise against trying to pass AI-generated text off as your own work.


Frequently asked questions about systematic reviews

A literature review is a survey of scholarly sources (such as books, journal articles, and theses) related to a specific topic or research question. It is often written as part of a thesis, dissertation, or research paper, in order to situate your work in relation to existing knowledge. Literature reviews give an overview of knowledge on a subject, helping you identify relevant theories and methods, as well as gaps in existing research, and they are set up similarly to other academic texts, with an introduction, a main body, and a conclusion.

An annotated bibliography is a list of source references that has a short description (called an annotation) for each of the sources. It is often assigned as part of the research process for a paper.

A systematic review is secondary research because it uses existing research. You don't collect new data yourself.


Turney, S. (2023, November 20). Systematic Review | Definition, Example & Guide. Scribbr. Retrieved September 3, 2024, from https://www.scribbr.com/methodology/systematic-review/


University of Texas


Systematic Reviews & Evidence Synthesis Methods


Formulate your Research Question

Formulating a strong research question for a systematic review can be a lengthy process. While you may have an idea about the topic you want to explore, your specific research question is what will drive your review and requires some consideration. 

You will want to conduct preliminary or  exploratory searches  of the literature as you refine your question. In these searches you will want to:

  • Determine if a systematic review has already been conducted on your topic and if so, how yours might be different, or how you might shift or narrow your anticipated focus.
  • Scope the literature to determine if there is enough literature on your topic to conduct a systematic review.
  • Identify key concepts and terminology.
  • Identify seminal or landmark studies.
  • Identify key studies that you can test your search strategy against (more on that later).
  • Begin to identify databases that might be useful to your search question.

Types of Research Questions for Systematic Reviews

A narrow and specific research question is required in order to conduct a systematic review. The goal of a systematic review is to provide an evidence synthesis of ALL research performed on one particular topic. Your research question should be clearly answerable from the studies included in your review. 

Another consideration is whether the question has been answered enough to warrant a systematic review. If there have been very few studies, there won't be enough qualitative and/or quantitative data to synthesize. You then have to adjust your question... widen the population, broaden the topic, reconsider your inclusion and exclusion criteria, etc.

When developing your question, it can be helpful to consider the FINER criteria (Feasible, Interesting, Novel, Ethical, and Relevant). Read more about the FINER criteria on the Elsevier blog.

If you have a broader question or aren't certain that your question has been answered enough in the literature, you may be better served by pursuing a systematic map, also known as a scoping review. Scoping reviews are conducted to give a broad overview of a topic, to review the scope and themes of the prior research, and to identify the gaps and areas for future research.

  • What is the effectiveness of talk therapy in treating ADHD in children? (Systematic Review)
  • What treatments are available for treating children with ADHD? (Systematic Map/Scoping Review)
  • Are animal-assisted therapies as effective as traditional cognitive behavioral therapies in treating people with depressive disorders? (Systematic Review)
  • CEE Example Questions: the Collaboration for Environmental Evidence Guidelines contains Table 2.2, outlining answers sought and example questions in environmental management.

Learn More . . .

Cochrane Handbook Chapter 2  - Determining the scope of the review and the questions it will address

Frameworks for Developing your Research Question

PICO: Patient/Population, Intervention, Comparison, Outcome

PEO: Population, Exposure, Outcomes

SPIDER: Sample, Phenomenon of Interest, Design, Evaluation, Research Type

For more frameworks and guidance on developing the research question, check out:

1. Advanced Literature Search and Systematic Reviews: Selecting a Framework. City University of London Library

2. Select the Appropriate Framework for your Question. Tab "1-1" from PIECES: A guide to developing, conducting, & reporting reviews [Excel workbook ]. Margaret J. Foster, Texas A&M University.  CC-BY-3.0 license .

3. Formulating a Research Question.  University College London Library. Systematic Reviews .

4. Question Guidance.  UC Merced Library. Systematic Reviews


Video - Formulating a Research Question (4:43 minutes)

  • Last Updated: Sep 6, 2024 12:39 PM
  • URL: https://guides.lib.utexas.edu/systematicreviews

Creative Commons License

Cureus, vol. 15, no. 12, December 2023 (PMC10828625)

Ten Steps to Conduct a Systematic Review

Ernesto Calderon Martinez

1 Digital Health, Universidad Nacional Autónoma de México, Ciudad de Mexico, MEX

Jose R Flores Valdés

2 General Medicine, Universidad Autonoma de Guadalajara, Guadalajara, MEX

Jaqueline L Castillo

Jennifer V Castillo

3 General Medicine, Universidad Autónoma de Guadalajara, Guadalajara, MEX

Ronald M Blanco Montecino

4 Research, University of Texas Southwestern Medical Center, Dallas, USA

Julio E Morin Jimenez

5 General Medicine, Universidad Autónoma del Estado de México, Ciudad de Mexico, MEX

David Arriaga Escamilla

6 Internal Medicine, Universidad Justo Sierra, Ciudad de Mexico, MEX

Edna Diarte

7 Medicine, Universidad Autonoma de Sinaloa, Culiacan, MEX

This article introduces a concise 10-step guide tailored for researchers engaged in systematic reviews within the field of medicine and health, aligning with the imperative for evidence-based healthcare. The guide underscores the importance of integrating research evidence, clinical proficiency, and patient preferences. It emphasizes the need for precision in formulating research questions, utilizing tools such as PICO(S) (Population, Intervention, Comparator, Outcome, Study design), PEO (Population, Exposure, Outcome), SPICE (Setting, Perspective, Intervention/Exposure/Interest, Comparison, Evaluation), and SPIDER (Sample, Phenomenon of Interest, Design, Evaluation, Research type), and advocates for the validation of research ideas through preliminary investigations. The guide prioritizes transparency by recommending the documentation and registration of protocols on various platforms. It highlights the significance of a well-organized literature search, encouraging the involvement of experts to ensure a high-quality search strategy. The critical stages of screening titles and abstracts are navigated using different tools, each characterized by its specific advantages. This diverse approach aims to enhance the effectiveness of the systematic review process. In conclusion, this 10-step guide provides a practical framework for the rigorous conduct of systematic reviews in the domain of medicine and health. It addresses the unique challenges inherent in this field, emphasizing the values of transparency, precision, and ongoing efforts to improve primary research practices. The guide aims to contribute to the establishment of a robust evidence base, facilitating informed decision-making in healthcare.

Introduction

The necessity of evidence-based healthcare, which prioritizes the integration of top-tier research evidence, clinical proficiency, and patient preferences, is increasingly recognized [ 1 , 2 ]. Due to the extensive amount and varied approaches of primary research, secondary research, particularly systematic reviews, is required to consolidate and interpret this information with minimal bias [ 3 , 4 ]. Systematic reviews, structured to reduce bias in the selection, examination, and consolidation of pertinent research studies, are highly regarded in the research evidence hierarchy. The aim is to enable objective, repeatable, and transparent healthcare decisions by reducing systematic errors.

To guarantee the quality and openness of systematic reviews, protocols are formulated, registered, and published prior to the commencement of the review process. Platforms such as PROSPERO (International Prospective Register of Systematic Reviews) aid in the registration of systematic review protocols, thereby enhancing transparency in the review process [ 5 ]. High-standard reviews comply with stringent peer review norms, ensuring that methodologies are revealed beforehand, thus reducing post hoc alterations for objective, repeatable, and transparent outcomes [ 6 ].

Nonetheless, the practical execution of systematic reviews, particularly in the field of medicine and health, poses difficulties for researchers. To address this, a succinct 10-step guide is offered to both seasoned and novice researchers, with the goal of improving the rigor and transparency of systematic reviews.

Technical report

Step 1: structure of your topic

When developing a research question for a systematic review or meta-analysis (SR/MA), it is essential to precisely outline the objectives of the study, taking into account potential effect modifiers. The research question should concentrate on and concisely explain the scientific elements and encapsulate the aim of the project.

Instruments such as PICO(S) (Population, Intervention, Comparator, Outcome, Study design), PEO (Population, Exposure, Outcome), SPICE (Setting, Perspective, Intervention/Exposure/Interest, Comparison, Evaluation), and SPIDER (Sample, Phenomenon of Interest, Design, Evaluation, Research type) assist in structuring research questions for evidence-based clinical practice, qualitative research, and mixed-methods research [ 7 - 9 ]. A joint strategy of employing SPIDER and PICO is suggested for exhaustive searches, subject to time and resource constraints. PICO and SPIDER are the most frequently utilized tools, and the selection between them is contingent on the research's nature. The ability to frame and address research questions is crucial in evidence-based medicine. The "PICO format" extends to the "PICOTS" (Population, Intervention, Comparator, Outcome, Time, Setting) design (Table 1). Explicit delineation of these components is critical for systematic reviews, ensuring a balanced and pertinent research question with broad applicability.

This table gives a breakdown of the mnemonic for the elements required to formulate an adequate research question. Utilizing this mnemonic leads to a proper and non-biased search. Examples extracted from “The use and efficacy of oral phenylephrine versus placebo on adults treating nasal congestion over the years in a systematic review” [ 10 ].

RCT, randomized control trial; PICOTS, Population Intervention Comparator Outcome Time Setting

  • P (Population and/or Patient and/or Problem): the people in/for whom the systematic review is expected to be applied. Example: adult population >18 years and <65 years. Inclusion criteria: adults between 18 and 65 years. Exclusion criteria: elderly, pediatric, and pregnant patients.
  • I (Intervention): in systematic reviews examining the effects of treatment, this encompasses medicines, procedures, health education, public health measures, or bundles/combinations, including preventive measures such as vaccination, prophylaxis, health education tools, and packages of such interventions. In some cases the intervention is not something the investigators administer; they are merely observing the effects, so (I) can be better expressed as 'Exposure' (E). Diagnostic tests, prognostic markers, and condition prevalence can represent exposure. Example: administration of oral phenylephrine. Inclusion criteria: oral administration of phenylephrine. Exclusion criteria: IV administration of phenylephrine, nasal phenylephrine.
  • C (Comparison): the comparison of two groups; it can be people not receiving the intervention and those receiving an alternate intervention, placebo, or nothing. However, for some study designs and/or research questions, including a comparison may not be feasible. Example: placebo, standard care, or no treatment. Inclusion criteria: phenylephrine vs. placebo. Exclusion criteria: phenylephrine in combination with another medication; phenylephrine in comparison with another medication.
  • O (Outcome): the effect the intervention (I) has on the selected population (P) in comparison to the comparison (C). Most systematic reviews focus on efficacy, safety, and sometimes cost; when a systematic review focuses on diagnostic tests, the aim is to identify accuracy, reliability, and cost. Example: symptoms like nasal congestion and nasal airway resistance. Inclusion criteria: nasal congestion management. Exclusion criteria: other allergy-related symptoms.
  • T (Time Frame): the outcomes are only relevant when evaluated over a specific time frame. Example: over the years. Inclusion criteria: taking medication over some time. Exclusion criteria: one day, one week.
  • S (Study Design): a specific protocol that allows the conduct of the study, allowing the investigator to translate the conceptual research question into an operational one. Example: RCTs. Inclusion criteria: RCT. Exclusion criteria: letters to the editor, case-control trials, observational studies.

While there are various formats like SPICE and ECLIPSE, PICO continues to be favored due to its adaptability across research designs. The research question should be stated in the introduction of a systematic review, laying the groundwork for impartial interpretations. The PICOTS template is applicable to systematic reviews that tackle a variety of research questions.
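For teams that manage their protocol elements electronically, the PICOTS components can also be captured as a small structured record. The sketch below is purely illustrative: it uses Python, the field names are our own, and the values are taken from the oral phenylephrine example in Table 1.

```python
# Illustrative sketch: capturing PICOTS elements as a structured record.
# Field values follow the oral phenylephrine example in Table 1.
from dataclasses import dataclass, asdict

@dataclass
class Picots:
    population: str
    intervention: str
    comparator: str
    outcome: str
    time_frame: str
    study_design: str

question = Picots(
    population="Adults aged 18-65 years",
    intervention="Oral phenylephrine",
    comparator="Placebo, standard care, or no treatment",
    outcome="Nasal congestion and nasal airway resistance",
    time_frame="Over the years",
    study_design="Randomized controlled trials",
)

# A flat dictionary like this can be pasted into a protocol document or registry form.
print(asdict(question))
```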

Validation of the Idea

To bolster the solidity of our research, we advocate for the execution of preliminary investigations and the validation of ideas. An initial exploration, especially in esteemed databases like PubMed, is vital. This process serves several functions, including the discovery of pertinent articles, the verification of the suggested concept, the prevention of revisiting previously explored queries, and the assurance of a sufficient collection of articles for review.

Moreover, it is crucial to concentrate on topics that address significant healthcare challenges, align with global needs and values, reflect the current scientific understanding, and comply with established review methodologies. Gaining a deep understanding of the research field through relevant videos and discussions is also important for improving result retrieval. Overlooking this step risks discovering, too late, that a similar study has already been published, which could force the research to be abandoned and waste precious time on an issue that has already been thoroughly investigated.

For example, during our initial exploration using the terms “Silymarin AND Liver Enzyme Levels” on PubMed, we discovered a systematic review and meta-analysis discussing the impact of Silymarin on liver enzyme levels in humans [ 11 ]. This discovery acts as a safety net: rather than pursuing an identical idea and facing rejection, we can reformulate a more refined research question or objective, shifting the focus to a different aspect of the same idea by altering part of the PICOTS structure. We can evaluate a different population, a different comparator, or a different outcome and arrive at a completely novel idea. This strategic approach helps guarantee the relevance and uniqueness of our research within the scientific community.
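A preliminary hit count like the one described above can also be obtained programmatically. The sketch below is a minimal, hedged example that queries NCBI's public E-utilities esearch endpoint; it assumes the third-party `requests` package is installed, omits API keys, rate limiting, and error handling, and is no substitute for a properly documented search.

```python
# Minimal sketch of a preliminary PubMed search using NCBI E-utilities (esearch).
# Assumes the `requests` package is installed; no API key or error handling shown.
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_hit_count(query: str) -> int:
    """Return the number of PubMed records matching a Boolean query."""
    params = {"db": "pubmed", "term": query, "retmode": "json", "retmax": 0}
    response = requests.get(ESEARCH, params=params, timeout=30)
    response.raise_for_status()
    return int(response.json()["esearchresult"]["count"])

# The same query used in the validation example above.
print(pubmed_hit_count("Silymarin AND Liver Enzyme Levels"))
```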

Step 2: databases

This step is carried out concurrently with the rest of the process. A well-orchestrated and orderly team is essential for the primary tasks of literature searching, screening, and risk of bias evaluation by independent reviewers. If disagreements arise during the study inclusion phase, the involvement of a third independent reviewer often becomes vital for resolution. The team’s composition should strive to include individuals with a variety of skills.

The intricacy of the research question and the expected number of references dictate the team’s size. The final team structure is decided after the definitive search, with the participation of independent reviewers dependent on the number of hits obtained. It is crucial to maintain a balance of expertise among team members to avoid undue influence from a specific group of experts. Importantly, a team requires a competent leader who may not necessarily be the most senior member or a professor. The leader plays a central role in coordinating the project, ensuring compliance with the study protocol, keeping all team members updated, and promoting their active involvement.

Establishing solid selection criteria is the foundational step in a systematic review. These criteria act as the guiding principles during the screening process, ensuring a focused approach that saves time, reduces errors, and maintains transparency and reproducibility; they are a core component of every systematic review protocol. Carefully designed to align with the research question, as in Table  1 , the selection criteria cover a range of study characteristics, including design, publication date, and geographical location, as well as details of the study population, exposure and outcome measures, and methodological approach.

Concurrently, researchers must develop a comprehensive search strategy to retrieve eligible studies. A well-organized strategy combining multiple terms with Boolean operators is typically required (Figure  1 ) and involves crafting specific search queries for different online databases, such as Embase, MEDLINE, Web of Science, and Google Scholar. These searches can include singular and plural forms, common misspellings, and related terms, among others. It is crucial to strike a balance, avoiding overly broad searches that yield unnecessary results and overly narrow searches that miss relevant evidence. Collaborating with a librarian or search specialist improves the quality and reproducibility of the search, and it is important to understand the basic characteristics of the main databases (Table  2 ). The team should also record in its methodology how data will be collected and which tools will be used throughout the protocol, so that there is consensus among all members.

Principal databases from which the main articles for the whole body of the research can be gathered. This selection illustrates a range of specialities and gives researchers a variety of databases to work with.

NLM, National Library of Medicine; ICTRP, International Clinical Trials Registry Platform; LILACS, Literatura Latino-Americana e do Caribe em Ciências da Saúde

PubMed: A free search engine accessing primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics. It is maintained by the United States NLM at the National Institutes of Health.
EMBASE: A biomedical and pharmacological database containing bibliographic records with citations, abstracts, and indexing derived from biomedical articles in peer-reviewed journals. It is especially strong in its coverage of drug and pharmaceutical research.
Cochrane Library: A database of systematic reviews. It includes reliable evidence from Cochrane and other systematic reviews of clinical trials.
Google Scholar: A freely accessible web search engine that indexes the full text or metadata of scholarly literature across an array of publishing formats and disciplines.
Web of Science: A research database used for citation analysis. It provides access to multiple databases, including the Science Citation Index, the Social Sciences Citation Index, and the Arts and Humanities Citation Index.
ScienceDirect: A full-text scientific database offering journal articles and book chapters from more than 2,500 peer-reviewed journals and more than 11,000 books.
PsycINFO: An electronic bibliographic database providing abstracts and citations to the scholarly literature in the psychological, social, behavioral, and health sciences.
ICTRP: A database of clinical trials being conducted around the world. It is maintained by the World Health Organization.
ClinicalTrials.gov: A database of privately and publicly funded clinical studies conducted around the world. It is provided by the United States NLM.
LILACS: An online bibliographic database of scientific and medical publications maintained by the Latin American and Caribbean Center on Health Sciences Information.

Figure 1

Boolean operators help combine and refine a search. "AND" narrows your search so you get fewer results: it tells the database that every one of your search terms must appear in each result. "OR" broadens the search: it tells the database to return results that mention either or both of your search terms. "NOT" excludes terms: it tells the database that you want information related to the first term but not the second.

Image credits to authors of the articles (Created on  www.canva.com )
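The Boolean logic illustrated in Figure 1 can be drafted programmatically before being adapted to each database's syntax. The snippet below is an illustrative sketch only: the concept groups and terms are hypothetical placeholders, and a real strategy should be built and validated with a librarian.

```python
# Illustrative sketch: assembling a Boolean search string from concept groups.
# Terms are hypothetical examples; real strategies must be adapted to each
# database's syntax and reviewed by a search specialist.
concepts = {
    "population": ["adult*", "grown-up*"],
    "intervention": ['"oral phenylephrine"', "phenylephrine"],
    "outcome": ['"nasal congestion"', '"nasal airway resistance"'],
}

# Synonyms within a concept are combined with OR; concepts are combined with AND.
query = " AND ".join(
    "(" + " OR ".join(terms) + ")" for terms in concepts.values()
)
print(query)
# -> (adult* OR grown-up*) AND ("oral phenylephrine" OR phenylephrine) AND ...
```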

Documenting and registering the protocol early in the research process is crucial for transparency and for avoiding duplication. The protocol serves as recorded guidance, encompassing elements such as the research question, eligibility criteria, intervention details, quality assessment, and the analysis plan. Before uploading it to a registry site, such as PROSPERO, it is advisable to have the protocol reviewed by the principal investigator. The comprehensive study protocol outlines the research objectives, design, inclusion/exclusion criteria, electronic search strategy, and analysis plan, providing a framework for reviewers during the screening process. These elements correspond to the steps established earlier in the process. Registration can be done on platforms such as PROSPERO [ 5 ] for health and social care reviews or Cochrane [ 3 ] for reviews of interventions.

Step 3: search

In the process of conducting a systematic review, a well-organized literature search is a pivotal step. It is suggested to incorporate at least two to four online databases, such as Embase, MEDLINE, Web of Science, and Cochrane. As mentioned earlier, formulating search strategies for each database is crucial because of their distinct requirements. In line with AMSTAR (A Measurement Tool to Assess Systematic Reviews) guidelines, a minimum of two databases should be searched in systematic reviews/meta-analyses (SR/MA), and increasing this number improves the accuracy of the results [ 22 ]. We advise including Chinese databases, as most studies exclude databases from this region [ 9 ]. The choice of databases, such as Cochrane or the ICTRP, depends on the review question, especially in the case of clinical trials. These databases cater to various health-related aspects, and researchers should select them based on the research subject. Additionally, it is important to consider the unique search features of each database, as some may not support Boolean operators or quotation marks. Detailed search strategies for each database, including customization based on its specific attributes, are provided for guidance. In general, systematic reviews involve searching multiple databases and exploring additional sources, such as reference lists, clinical trial registries, and databases of non-indexed journals, to ensure comprehensive coverage of both published and, in some instances, unpublished literature.

It is important to note that the way records are exported also varies among databases. The goal, however, is to obtain a RIS, BibTeX (.bib), CSV, or TXT file that can be imported into the tools used in the subsequent steps.

Step 4: tools

All reference files are then uploaded into a predetermined tool, such as Rayyan, Covidence, EPPI-Reviewer, CADIMA, or DistillerSR, for the collection and management of records (Table  3 ). The next step is the removal of duplicates using a defined rule: records are recognized as duplicates if they have the same title and author published in the same year, or the same title and author published in the same journal. Tools such as Rayyan or Covidence help identify duplicates automatically. Removing duplicate records is vital for reducing the workload during title and abstract screening.
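The duplicate rule described above can be illustrated with a simplified sketch. The example below applies an exact match on a normalized title and author plus either the year or the journal; the records are invented, and production tools such as Rayyan or Covidence use more tolerant (fuzzy) matching than this.

```python
# Simplified sketch of the duplicate rule described above: records are treated as
# duplicates when they share a normalized title and author and either the same
# year or the same journal. Real tools use fuzzier matching than exact comparison.
def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def dedupe(records: list[dict]) -> list[dict]:
    seen, unique = set(), []
    for rec in records:
        key_year = (normalize(rec["title"]), normalize(rec["author"]), rec.get("year"))
        key_journal = (normalize(rec["title"]), normalize(rec["author"]), normalize(rec.get("journal", "")))
        if key_year in seen or key_journal in seen:
            continue  # already have this record
        seen.update({key_year, key_journal})
        unique.append(rec)
    return unique

records = [
    {"title": "Oral phenylephrine for nasal congestion", "author": "Smith J", "year": 2020, "journal": "Journal A"},
    {"title": "Oral Phenylephrine for Nasal Congestion ", "author": "smith j", "year": 2020, "journal": "Journal A"},
]
print(len(dedupe(records)))  # -> 1
```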

Some of the tools described above use artificial intelligence to help create keywords according to the inclusion and exclusion criteria defined previously by the researcher. These features help reduce the time needed to rule studies in or out efficiently.

Covidence: Web-based software for managing systematic review projects. Key features: streamlined screening and data extraction processes; collaboration features for team members; integration with reference management tools; real-time project tracking. Usage: systematic reviews and evidence synthesis projects. Cost: subscription-based, pricing varies. Duplicate removal: yes. Article screening: yes. Critical appraisal: yes. Reporting assistance: yes.
Rayyan: A web application designed for systematic review screening and study selection. Key features: user-friendly interface for importing, screening, and organizing studies; collaboration tools for multiple reviewers; supports a variety of file formats. Usage: screening and study selection in systematic reviews. Cost: free with limitations; premium plans available. Duplicate removal: no. Article screening: yes. Critical appraisal: no. Reporting assistance: limited.
EPPI-Reviewer: Software for managing the review process, with a focus on systematic reviews and other forms of evidence synthesis. Key features: comprehensive data extraction and synthesis capabilities; customizable review processes; integration with reference management tools. Usage: systematic reviews, evidence synthesis, and meta-analysis. Cost: subscription-based, pricing varies. Duplicate removal: yes. Article screening: yes. Critical appraisal: yes. Reporting assistance: yes.
CADIMA: A web-based systematic review software platform. Key features: customizable review workflow; collaboration tools for team members; integrated data extraction and synthesis features; real-time project tracking. Usage: systematic reviews and evidence synthesis projects. Cost: subscription-based, pricing varies. Duplicate removal: yes. Article screening: yes. Critical appraisal: yes. Reporting assistance: limited.
DistillerSR: Online systematic review software for data extraction and synthesis. Key features: streamlined data extraction and synthesis tools; collaboration features for team members; real-time progress tracking; integration with reference management tools. Usage: systematic reviews and evidence synthesis projects. Cost: subscription-based, pricing varies. Duplicate removal: yes. Article screening: yes. Critical appraisal: yes. Reporting assistance: yes.

Step 5: title and abstract screening

The systematic review process encompasses several steps, including screening titles and abstracts and applying the selection criteria. During title and abstract screening, a minimum of two reviewers independently evaluate the relevance of each reference. Tools such as Rayyan, Covidence, and DistillerSR are suggested for this phase because of their effectiveness. Decisions on whether to assess retrieved articles further are made against the selection criteria. Involving at least three reviewers is recommended to minimize the likelihood of errors and to resolve disagreements.

In the following stages of the systematic review process, the focus is on acquiring full-text articles. Numerous search engines provide links for free access to full-text articles, and in situations where this is not feasible, alternative routes such as ResearchGate are pursued for direct requests from authors. Additionally, a manual search is carried out to decrease bias, using methods like searching references from included studies, reaching out to authors and experts, and exploring related articles in PubMed and Google Scholar. This manual search is vital for identifying reports that might have been initially overlooked. The approach involves independent reviewing by assigning specific methods to each team member, with the results gathered for comparison, discussion, and minimizing bias.

Step 6: full-text screening

The second phase in the screening process is full-text screening. This involves a thorough examination of the study reports that were selected after the title and abstract screening stage. To prevent bias, it is essential that three individuals participate in the full-text screening. Two individuals will scrutinize the entire text to ensure that the initial research question is being addressed and that none of the previously determined exclusion criteria are present in the articles. They have the option to "include" or "exclude" an article. If an article is "excluded," the reviewer must provide a justification for its exclusion. The third reviewer is responsible for resolving any disagreements, which could arise if one reviewer "excludes" an article that another reviewer "includes." The articles that are "included" will be used in the systematic review.
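The reconciliation workflow described in Steps 5 and 6 can be mirrored in a few lines: decisions from two independent reviewers are merged, and any conflicts are flagged for the third reviewer. The study identifiers and decisions below are hypothetical and only illustrate the bookkeeping.

```python
# Illustrative sketch of reconciling two reviewers' include/exclude decisions.
# Hypothetical data; conflicts are flagged for the third reviewer as described above.
reviewer_a = {"study_001": "include", "study_002": "exclude", "study_003": "include"}
reviewer_b = {"study_001": "include", "study_002": "include", "study_003": "include"}

agreed, conflicts = {}, []
for study_id in reviewer_a:
    if reviewer_a[study_id] == reviewer_b[study_id]:
        agreed[study_id] = reviewer_a[study_id]
    else:
        conflicts.append(study_id)

print("Agreed decisions:", agreed)
print("For third-reviewer arbitration:", conflicts)  # -> ['study_002']
```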

The process of seeking additional references following the full-text screening in a systematic review involves identifying other potentially relevant studies that were not found in the initial literature search. This can be achieved by reviewing the reference lists of the studies that were included after the full-text screening. This step is crucial as it can help uncover additional studies that are relevant to your research question but might have been overlooked in the initial database search due to variations in keywords, indexing terms, or other factors [ 15 ]. 

A PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) chart, also referred to as a PRISMA flow diagram, is a visual tool that illustrates the steps involved in an SR/MA. These steps encompass the identification, screening, evaluation of eligibility, and inclusion of studies.

The PRISMA diagram provides a detailed overview of the information flow during the various stages of an SR/MA. It displays the count of records that were identified, included, and excluded, along with the reasons for any exclusions.

The typical stages represented on a PRISMA chart are as follows:

1) Identification: records are discovered through database searches.
2) Screening: the records are screened after any duplicates are removed.
3) Eligibility: full-text articles are evaluated for their suitability.
4) Included: the studies that are incorporated into the qualitative and quantitative synthesis.

The PRISMA chart serves as a valuable tool for researchers and readers alike, aiding in understanding the process of study selection in the review and the reasons for the exclusion of certain studies. It is usually the initial figure presented in the results section of your systematic review [ 4 ].
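Because every count in the PRISMA diagram is derived from the counts before it, it helps to track them in a single structure and check the arithmetic. The figures below are hypothetical and only illustrate the bookkeeping.

```python
# Hypothetical counts for the PRISMA stages described above; keeping them in one
# place makes it easy to populate the flow diagram and check the arithmetic.
prisma = {
    "records_identified": 1250,
    "duplicates_removed": 310,
    "records_screened": 940,          # identified - duplicates
    "records_excluded_on_title_abstract": 860,
    "full_texts_assessed": 80,        # screened - excluded
    "full_texts_excluded": 65,
    "studies_included": 15,           # assessed - excluded
}

assert prisma["records_screened"] == prisma["records_identified"] - prisma["duplicates_removed"]
assert prisma["studies_included"] == prisma["full_texts_assessed"] - prisma["full_texts_excluded"]
```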

Step 7: data extraction

As the systematic review advances, the subsequent crucial steps involve data extraction from the studies included. This process involves a structured data extraction from the full texts included, guided by a pilot-tested Excel sheet, which aids two independent reviewers in meticulously extracting detailed information from each article [ 28 ]. This thorough process offers an initial comprehension of the common characteristics within the evidence body and sets the foundation for the following analytical and interpretive synthesis. The participation of two to three independent reviewers ensures a holistic approach, including the extraction of both adjusted and non-adjusted data to account for potential confounding factors in future analyses. Moreover, numerical data extracted, such as dichotomous or continuous data in intervention reviews or information on true and false results in diagnostic test reviews, undergoes a thorough process. The extracted data might be suitable for pooled analysis, depending on sufficiency and compatibility. Difficulties in harmonizing data formats might occur, and systematic review authors might resort to communication with study authors to resolve these issues and enhance the robustness of the synthesis. This multi-dimensional data extraction process ensures a comprehensive and nuanced understanding of the included studies, paving the way for the subsequent analysis and synthesis phases.
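A pilot-tested extraction sheet is essentially a fixed set of columns agreed on in advance. The sketch below writes such a template with Python's standard csv module; the column names are typical examples rather than a prescribed standard, and a shared spreadsheet works just as well.

```python
# Sketch of a data extraction template with typical fields; the columns are
# illustrative examples, not a prescribed standard.
import csv

FIELDS = [
    "study_id", "first_author", "year", "country", "study_design",
    "population", "intervention", "comparator", "outcome_measure",
    "effect_estimate", "adjusted_estimate", "notes",
]

with open("extraction_sheet.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    # One hypothetical row; each included study gets a row filled by two reviewers.
    writer.writerow({"study_id": "S001", "first_author": "Smith", "year": 2020,
                     "study_design": "RCT"})
```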

Step 8: risk of bias assessment

To conduct a risk of bias assessment in medical research, it is crucial to follow a specific sequence: choose tools that are specifically designed for systematic reviews. These tools should have demonstrated acceptable validity and reliability, specifically address items related to methodological quality (internal validity), and ideally be based on empirical evidence of bias [ 29 ]. The tools should be chosen once the full texts have been obtained. For ease of organization, it helps to compile a list of the retrieved articles and note the design of each study, since the appropriate tool depends on the study type. The most common tools for evaluating the risk of bias can be found in Table  4 .

The table summarizes some of the different tools to appraise the different types of studies and their main characteristics.

RoB, risk of bias; RRB, risk of reporting bias; AMSTAR, A Measurement Tool to Assess Systematic Reviews; GRADE, Grading of Recommendations Assessment, Development and Evaluation; ROBINS, risk of bias in non-randomized studies; RCT, randomized controlled trials

Cochrane RoB 2 tool: Widely used in both Cochrane and other systematic reviews. It replaces the notion of assessing study quality with that of assessing the risk of bias (RoB) and considers biases arising at different stages of a trial (the randomization process, deviations from the intended intervention, missing outcome data, measurement of the outcome, and selection of the reported result). It assesses individual RCTs, and variants of the tool cover crossover RCTs and cluster RCTs.
AHRQ RRB: Evaluates the risk of reporting bias and outcome reporting bias in a systematic review.
AMSTAR 2: Assesses the methodological quality of systematic reviews, including those of both randomized and non-randomized studies of healthcare interventions. Useful in the context of real-world observational evidence.
Newcastle-Ottawa Quality Assessment Scale (case-control studies): Evaluates case-control studies and assesses the quality of non-randomized studies. Useful for evaluating the methodological quality of case-control studies; it provides a semi-quantitative measure of study quality that can inform the interpretation of findings in a systematic review.
GRADE: Used to assess the quality of evidence and the strength of recommendations in healthcare.
ROBINS: A tool used to assess the risk of bias in non-randomized studies; there are two versions (ROBINS-I and ROBINS-E). ROBINS-I assesses the risk of bias in the results of non-randomized studies that compare the health effects of two or more interventions; it evaluates estimates of the effectiveness or safety (benefit or harm) of an intervention from studies that did not use randomization to allocate interventions. ROBINS-E provides a structured approach to assessing the risk of bias in observational epidemiological studies, designed primarily for use in the context of a systematic review; it evaluates the effects of exposures (including environmental, occupational, and behavioral exposures) on human health. Both tools share many characteristics with the RoB 2 tool: they are structured into a fixed set of bias domains, with signalling questions that inform domain-level risk of bias judgements and an overall risk of bias judgement. The seven domains addressed are confounding, selection of participants, classification of interventions, deviations from intended interventions, missing data, measurement of outcomes, and selection of reported results. After all seven domains are completed, an overall judgement is made.

After choosing the suitable tool for the type of study, keep in mind that a good risk of bias assessment should be transparent and easily replicable. This requires the review protocol to include clear definitions of the biases that will be evaluated [ 30 ].

The next step in determining the risk of bias is to understand its different categories. The assessment should explicitly address the risk of selection, performance, attrition, detection, and selective outcome reporting biases. It should also allow separate risk of bias ratings by outcome, to account for outcome-specific variations in detection bias and selective outcome reporting.

Keep in mind that assessing the risk of bias based on study design and conduct rather than reporting is very important. Poorly reported studies may be judged as unclear risk of bias. Avoid presenting the risk of bias assessment as a composite score. Finally, classifying the risk of bias as "low," "medium," or "high" is a more practical way to proceed. Methods for determining an overall categorization for the study limitations should be established a priori and documented clearly.
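One way to make the a priori overall categorization concrete is to pre-specify a rule that maps domain-level judgements to a study-level rating. The sketch below uses a simple "worst domain wins" rule; this is one possible convention chosen for illustration, not a rule prescribed by RoB 2 or any other tool, and whatever rule is used must be documented in the protocol.

```python
# Illustrative sketch of an a priori rule for deriving an overall risk of bias
# rating from domain-level judgements. "Overall = worst domain" is one possible
# convention, not a prescribed algorithm; the chosen rule must be pre-specified.
SEVERITY = {"low": 0, "medium": 1, "high": 2}

def overall_risk(domain_judgements: dict) -> str:
    """Return the worst (highest) risk level across all domains."""
    return max(domain_judgements.values(), key=lambda level: SEVERITY[level])

study_1 = {
    "randomization process": "low",
    "deviations from intended interventions": "medium",
    "missing outcome data": "low",
    "measurement of the outcome": "low",
    "selection of the reported result": "low",
}
print(overall_risk(study_1))  # -> "medium"
```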

In summary, the purpose of the risk of bias assessment is to evaluate the internal validity of the studies included in the systematic review. This process helps ensure that the conclusions drawn from the review are based on high-quality, reliable evidence.

Step 9: synthesis

This step can be broken down to simplify the concept of conducting a descriptive synthesis of a systematic review:

1) Inclusion of studies: the final count of primary studies included in the review is established based on the screening process.
2) Flowchart: the flow of the systematic review process is summarized in a flowchart, including the number of references discovered, the number of abstracts and full texts screened, and the final count of primary studies included.
3) Study description: the characteristics of the included studies are detailed in a table in the main body of the manuscript, including the populations studied, types of exposures, intervention details, and outcomes.
4) Results: if a meta-analysis is not possible, the results of the included studies are described, including the direction and magnitude of the effect, the consistency of the effect across studies, and the strength of evidence for the effect.
5) Reporting bias check: reporting bias is a systematic error that can influence the results of a systematic review; it happens when the nature and direction of the results affect the dissemination of research findings. Checking for this bias is an important part of the review process.
6) Result verification: the results of the included studies should be verified for accuracy and consistency [ 36 , 37 ].

The descriptive synthesis relies primarily on words and text to summarize and explain the findings, necessitating careful planning and meticulous execution.
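Where the extracted data are sufficient and compatible for pooling, the core arithmetic of a fixed-effect (inverse-variance) meta-analysis is short enough to sketch. The effect sizes and standard errors below are hypothetical, and dedicated meta-analysis software should be used for real syntheses; the sketch only shows the pooled estimate, Cochran's Q, and I².

```python
# Minimal sketch of a fixed-effect (inverse-variance) pooled estimate and the I^2
# heterogeneity statistic. Effect sizes and standard errors are hypothetical;
# dedicated software should be used for a real meta-analysis.
effects = [0.20, 0.35, 0.15]          # e.g., log risk ratios from three studies
std_errors = [0.10, 0.12, 0.08]

weights = [1 / se**2 for se in std_errors]
pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)

# Cochran's Q and I^2 quantify between-study heterogeneity.
q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
df = len(effects) - 1
i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

print(f"Pooled effect: {pooled:.3f}, Q: {q:.2f}, I^2: {i_squared:.1f}%")
```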

Step 10: manuscript

When working on a systematic review and meta-analysis for submission, it is essential to keep the bibliographic database search current if more than six to 12 months have passed since the initial search to capture newly published articles. Guidelines like PRISMA and MOOSE provide flowcharts that visually depict the reporting process for systematic reviews and meta-analyses, promoting transparency, reproducibility, and comparability across studies [ 4 , 38 ]. The submission process requires a comprehensive PRISMA or MOOSE report with these flowcharts. Moreover, consulting with subject matter experts can improve the manuscript, and their contributions should be recognized in the final publication. A last review of the results' interpretation is suggested to further enhance the quality of the publication.

The composition process is organized into four main scientific sections: introduction, methods, results, and discussion, typically ending with a concluding section. After the manuscript, characteristics table, and PRISMA flow diagram are finalized, the team should forward the work to the principal investigator (PI) for comprehensive review and feedback. Finally, choosing an appropriate journal for the manuscript is vital, taking into account factors like impact factor and relevance to the discipline. Adherence to the author guidelines of journals is crucial before submitting the manuscript for publication.

The report emphasizes the increasing recognition of evidence-based healthcare, underscoring the integration of research evidence. The acknowledgment of the necessity for systematic reviews to consolidate and interpret extensive primary research aligns with the current emphasis on minimizing bias in evidence synthesis. The report highlights the role of systematic reviews in reducing systematic errors and enabling objective and transparent healthcare decisions. The detailed 10-step guide for conducting systematic reviews provides valuable insights for both experienced and novice researchers. The report emphasizes the importance of formulating precise research questions and suggests the use of tools for structuring questions in evidence-based clinical practice.

The validation of ideas through preliminary investigations is underscored, demonstrating a thorough approach to prevent redundancy in research efforts. The report provides a practical example of how an initial exploration of PubMed helped identify an existing systematic review, highlighting the importance of avoiding duplication. The systematic and well-coordinated team approach in the establishment of selection criteria, development of search strategies, and an organized methodology is evident. The detailed discussion on each step, such as data extraction, risk of bias assessment, and the importance of a descriptive synthesis, reflects a commitment to methodological rigor.

Conclusions

The systematic review process is a rigorous and methodical approach to synthesizing and evaluating existing research on a specific topic. The 10 steps we followed, from defining the research question to interpreting the results, ensured a comprehensive and unbiased review of the available literature. This process allowed us to identify key findings, recognize gaps in the current knowledge, and suggest areas for future research. Our work contributes to the evidence base in our field and can guide clinical decision-making and policy development. However, it is important to remember that systematic reviews are dependent on the quality of the original studies. Therefore, continual efforts to improve the design, reporting, and transparency of primary research are crucial.

The authors have declared that no competing interests exist.

Author Contributions

Concept and design:   Ernesto Calderon Martinez, Jennifer V. Castillo, Julio E. Morin Jimenez, Jaqueline L. Castillo, Edna Diarte

Acquisition, analysis, or interpretation of data:   Ernesto Calderon Martinez, Ronald M. Blanco Montecino , Jose R. Flores Valdés, David Arriaga Escamilla, Edna Diarte

Drafting of the manuscript:   Ernesto Calderon Martinez, Julio E. Morin Jimenez, Ronald M. Blanco Montecino , Jaqueline L. Castillo, David Arriaga Escamilla

Critical review of the manuscript for important intellectual content:   Ernesto Calderon Martinez, Jennifer V. Castillo, Jose R. Flores Valdés, Edna Diarte

Supervision:   Ernesto Calderon Martinez

Human Ethics

Consent was obtained or waived by all participants in this study

Animal Ethics

Animal subjects: All authors have confirmed that this study did not involve animal subjects or tissue.


How to Do a Systematic Review: A Best Practice Guide for Conducting and Reporting Narrative Reviews, Meta-Analyses, and Meta-Syntheses

Affiliations.

  • 1 Behavioural Science Centre, Stirling Management School, University of Stirling, Stirling FK9 4LA, United Kingdom; email: [email protected].
  • 2 Department of Psychological and Behavioural Science, London School of Economics and Political Science, London WC2A 2AE, United Kingdom.
  • 3 Department of Statistics, Northwestern University, Evanston, Illinois 60208, USA; email: [email protected].
  • PMID: 30089228
  • DOI: 10.1146/annurev-psych-010418-102803

Systematic reviews are characterized by a methodical and replicable methodology and presentation. They involve a comprehensive search to locate all relevant published and unpublished work on a subject; a systematic integration of search results; and a critique of the extent, nature, and quality of evidence in relation to a particular research question. The best reviews synthesize studies to draw broad theoretical conclusions about what a literature means, linking theory to evidence and evidence to theory. This guide describes how to plan, conduct, organize, and present a systematic review of quantitative (meta-analysis) or qualitative (narrative review, meta-synthesis) information. We outline core standards and principles and describe commonly encountered problems. Although this guide targets psychological scientists, its high level of abstraction makes it potentially relevant to any subject area or discipline. We argue that systematic reviews are a key methodology for clarifying whether and how research findings replicate and for explaining possible inconsistencies, and we call for researchers to conduct systematic reviews to help elucidate whether there is a replication crisis.

Keywords: evidence; guide; meta-analysis; meta-synthesis; narrative; systematic review; theory.


Systematic Reviews & Meta-Analysis

Identifying your research question.

  • Developing Your Protocol
  • Conducting Your Search
  • Screening & Selection
  • Data Extraction & Appraisal
  • Meta-Analyses
  • Writing the Systematic Review
  • Suggested Readings

The first step in performing a systematic review is to develop your research question. The guidance provided on how to develop a research question for literature reviews still applies here. The difference with a systematic review question is that you must have a clearly defined question and consider what problem you are trying to address by conducting the review. The most important point is to focus your question and design it so that it is answerable by the research you will be systematically examining.

Once you have developed your research question, it should not be changed once the review process has begun, as the review protocol needs to be formed around the question. 

Literature review question: can be broad; may highlight only particular pieces of literature, or support a particular viewpoint.
Systematic review question: requires the question to be well-defined and focused so it is possible to answer.

To help develop and focus your research question you may use one of the question frameworks below.

Methods for Refining a Research Topic

PICO questions can be useful in the health or social sciences. PICO stands for:

  • Patient, Population, or Problem : What are the characteristics of the patient(s) or population, i.e. their ages, genders, or other demographics? What is the situation, disease, etc., that you are interested in?
  • Intervention or Exposure : What do you want to do with the patient, person, or population (i.e. observe, diagnose, treat)?
  • Comparison : What is the alternative to the intervention (i.e. a different drug, a different assignment in a classroom)?
  • Outcome : What are the relevant outcomes (i.e. complications, morbidity, grades)?

Additionally, the following are variations to the PICO framework:

  • PICO(T) : The 'T' stands for Timing, where you would define the duration of treatment and the follow-up schedule that matter to patients. Consider both long- and short-term outcomes.
  • PICO(S) : The 'S' stands for Study type (e.g. randomized controlled trial); sometimes S can be used to stand for Setting or Sample Size

PPAARE is a useful question framework for patient care:

Problem -  Description of the problem related to the disease or condition

Patient - Description of the patient related to their demographics and risk factors

Action  - Description of the action related to the patient’s diagnosis, diagnostic test, etiology, prognosis, treatment or therapy, harm, prevention, patient ed.

Alternative - Description of the alternative to the action when there is one. (Not required)

Results   -   Identify the patient’s result of the action to produce, improve, or reduce the outcome for the patient

Evidence   -   Identify the level of evidence available after searching

SPIDER is a useful question framework for qualitative evidence synthesis:

Sample  - The group of participants, population, or patients being investigated. Qualitative research is not easy to generalize, which is why sample is preferred over patient.

Phenomenon of Interest - The reasons for behavior and decisions, rather than an intervention.

Design  - The research method and study design used for the research, such as interview or survey.

Evaluation  - The end result of the research or outcome measures.

Research type -   The research type; Qualitative, quantitative and/or mixed methods.

SPICE is a particularly useful method in the social sciences. It stands for

  • Setting (e.g. United States)
  • Perspective (e.g. adolescents)
  • Intervention (e.g. text message reminders)
  • Comparisons (e.g. telephone message reminders)
  • Evaluation (e.g. number of homework assignments turned in after text message reminder compared to the number of assignments turned in after a telephone reminder)

CIMO is a useful method in the social sciences or an organisational context. It stands for

  • Context - Which individuals, relationships, institutional settings, or wider systems are being studied?
  • Intervention - The effects of what event, action, or activity are being studied?
  • Mechanism - What are the mechanisms that explain the relationship between interventions and outcomes? Under what circumstances are these mechanisms activated or not activated?
  • Outcomes - What are the effects of the intervention? How will the outcomes be measured? What are the intended and unintended effects?

Has Your Systematic Review Already Been Done?

Once you have a reasonably well defined research question, it is important to check if your question has already been asked, or if there are other systematic reviews that are similar to that which you're preparing to do.

In the context of conducting a review, even if you do find one on your topic, it may be sufficiently out of date, or you may find other defensible reasons to undertake a new or updated one. In addition, locating an existing systematic review may also provide a starting point for selecting a review topic, help you refocus your question, or redirect your research toward other gaps in the literature.

You may locate existing systematic reviews or protocols on the following resources:

  • Cochrane Library: a database collection containing high-quality, independent evidence, including systematic reviews and controlled trials, to inform healthcare decision-making.
  • MEDLINE (EBSCO): produced by the U.S. National Library of Medicine, this is the premier database of biomedicine and health sciences, covering life sciences including biology, environmental science, marine biology, plant and animal science, biophysics and chemistry. Coverage: 1950-present.
  • PsycINFO: contains over 5 million citations and summaries of peer-reviewed journal articles, book chapters, and dissertations from the behavioral and social sciences in 29 languages from 50 countries. Coverage: 1872-present.

Systematic Reviews: Formulating Your Research Question

  • What Type of Review is Right for You?
  • What is in a Systematic Review
  • Finding and Appraising Systematic Reviews
  • Formulating Your Research Question
  • Inclusion and Exclusion Criteria
  • Creating a Protocol
  • Results and PRISMA Flow Diagram
  • Searching the Published Literature
  • Searching the Gray Literature
  • Methodology and Documentation
  • Managing the Process
  • Scoping Reviews

Types of Questions

Research questions should be answerable and also fill important gaps in the knowledge. Developing a good question takes time and may not fit in the traditional framework.  Questions can be broad or narrow and there are advantages and disadvantages to each type. 

Questions can be about interventions, diagnosis, screening, measuring, patient/student/customer experiences, or even management strategies. They can also be about policies. As the field of systematic reviews grows, more and more people in the humanities and social sciences are embracing systematic reviews and creating questions that fit within their fields of practice.

More information can be found here:

Thomas J, Kneale D, McKenzie JE, Brennan SE, Bhaumik S. Chapter 2: Determining the scope of the review and the questions it will address. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA (editors).  Cochrane Handbook for Systematic Reviews of Interventions  version 6.0 (updated July 2019). Cochrane, 2019. Available from  www.training.cochrane.org/handbook .

Frameworks are used to develop the question being asked. The type of framework doesn't matter as much as the question being selected.

Think of these frameworks as you would for a house or building. A framework is there to provide support and to be a scaffold for the rest of the structure. In the same way, a research question framework can also help structure your evidence synthesis question.  


Organizing Your Question

  • Formulating non-PICO questions: Although the PICO formulation should apply easily to the majority of effectiveness questions, and a great number besides, you may encounter questions that are not easily accommodated within this particular framework. Below you will find a number of acceptable alternatives:
  • Using The PICOS Model To Design And Conduct A Systematic Search: A Speech Pathology Case Study
  • 7 STEPS TO THE PERFECT PICO SEARCH Searching for high-quality clinical research evidence can be a daunting task, yet it is an integral part of the evidence-based practice process. One way to streamline and improve the research process for nurses and researchers of all backgrounds is to utilize the PICO search strategy. PICO is a format for developing a good clinical research question prior to starting one’s research. It is a mnemonic used to describe the four elements of a sound clinical foreground question (Yale University’s Cushing/Whitney Medical Library)

PICO - to search for quantitative review questions

P: Patient or Population

I: Intervention (or Exposure)

C: Comparison (or Control)

O: Outcome

Variations Include:

S: Study Design

T: Timeframe

SPICE - to search for qualitative evidence

S: Setting (where?)

P: Perspective (for whom?)

I: Intervention (what?)

C: Comparison (compared with what?)    

E: Evaluation (with what result?)

SPIDER - to search for qualitative and mixed methods research studies

S: Sample

PI: Phenomenon of Interest

D: Design

E: Evaluation    

R: Research type

ECLIPSE - to search for health policy/management information

E: Expectation (improvement or information or innovation)

C: Client group (at whom the service is aimed)    

L: Location (where is the service located?)    

I: Impact (outcomes)

P: Professionals (who is involved in providing/improving the service)

Se: Service (for which service are you looking for information)

PICO Template Questions

Try words from your topic in these templates.  Your PICO should fit only one type of question in the list.

For an intervention/therapy:

In _______(P), what is the effect of _______(I) on ______(O) compared with _______(C) within ________ (T)?

For etiology:

Are ____ (P) who have _______ (I) at ___ (Increased/decreased) risk for/of_______ (O) compared with ______ (P) with/without ______ (C) over _____ (T)?

Diagnosis or diagnostic test:

Are (is) _________ (I) more accurate in diagnosing ________ (P) compared with ______ (C) for _______ (O)?

Prevention:

For ________ (P) does the use of ______ (I) reduce the future risk of ________ (O) compared with _________ (C)?

Prognosis/Predictions

In__________ (P) how does ________ (I) compared to _______(C) influence _______ (O) over ______ (T)?

How do ________ (P) diagnosed with _______ (I) perceive ______ (O) during _____ (T)?

Template taken from Southern Illinois University- Edwardsville

Example PICO Questions

Intervention/Therapy:

In school-age children (P), what is the effect of a school-based physical activity program (I) on a reduction in the incidence of childhood obesity (O) compared with no intervention (C) within a 1 year period (T)?

In high school children (P), what is the effect of a nurse-led presentation on bullying (I) on a reduction in reported incidences of bullying (O) compared with no intervention (C) within a 6 month time frame (T)?

Etiology:

Are males 50 years of age and older (P) who have a history of 1 year of smoking or less (I) at an increased risk of developing esophageal cancer (O) compared with males age 50 and older (P) who have no smoking history (C)?

Are women ages 25-40 (P) who take oral contraceptives (I) at greater risk for developing blood clots (O) compared with women ages 25-40 (P) who use IUDs for contraception (C) over a 5 year time frame (T)?

Diagnosis/Diagnostic Test:

Is a yearly mammogram (I) more effective in detecting breast cancer (O) compared with a mammogram every 3 years (C) in women under age 50 (P)?

Is a colonoscopy combined with fecal occult blood testing (I) more accurate in detecting colon cancer (O) compared with a colonoscopy alone (C) in adults over age 50 (P)?

Prevention:

For women under age 60 (P), does the daily use of 81mg low-dose Aspirin (I) reduce the future risk of stroke (O) compared with no usage of low-dose Aspirin (C)?

For adults over age 65 (P) does a daily 30 minute exercise regimen (I) reduce the future risk of heart attack (O) compared with no exercise regimen (C)?

Prognosis/Predictions:

Does daily home blood pressure monitoring (I) influence compliance with medication regimens for hypertension (O) in adults over age 60 who have hypertension (P) during the first year after being diagnosed with the condition (T)?

Does monitoring blood glucose 4 times a day (I) improve blood glucose control (O) in people with Type 1 diabetes (P) during the first six months after being diagnosed with the condition (T)?

How do teenagers (P) diagnosed with cancer (I) perceive chemotherapy and radiation treatments (O) during the first 6 months after diagnosis (T)?

How do first-time mothers (P) of premature babies in the NICU (I) perceive bonding with their infant (O) during the first month after birth (T)?




Systematic Reviews

  • Developing a Research Question
  • Developing a Protocol
  • Literature Searching
  • Screening References
  • Data Extraction
  • Quality Assessment
  • Reporting Results
  • Related Guides
  • Getting Help

Developing A Research Question

There are several different methods researchers might use in developing a research question. The best method to use depends on the discipline and nature of the research you hope to review. Consider the following example question templates.

Variations to PICO

Using PICO can help you define and narrow your research question so that it is specific.

  • P  - Patient, population, or problem
  • I   - Intervention
  • C - Comparison or Control
  • O - Outcome

Think about whether your question is relevant to practitioners, and whether the answer will help people (doctors, patients, nurses) make better informed health care decisions.


The PICO method is used frequently, though there are some variations that exist to add other specifications to studies collected. Some variations include PICOSS, PICOT, and PICOC.

  • PICOSS: in addition to the fundamental components of PICO, additional criteria are added for study design (S) and setting (S).
  • PICOT: the (T) represents timeframe. This variation can be used to specify the length of treatment or intervention in health research.
  • PICOC: in research where there may not be a comparison, Co instead denotes the context of the population and intervention being studied.

Using SPIDER can help you define and narrow your research question so that it is specific. This is typically used in qualitative research (Cooke, Smith, & Booth, 2012).

  • S - Sample
  • PI - Phenomenon of Interest
  • D - Design
  • E - Evaluation
  • R - Research type

Yet another search measure relating to Evidence-Based Practice (EBP) is SPICE. This framework builds on PICO by considering two additional axes: perspective and setting (Booth, 2006).

  • S - Setting
  • P - Perspective
  • I - Intervention
  • C - Comparison
  • E - Evaluation

Inclusion and Exclusion Criteria

Setting inclusion and exclusion criteria is a critical step in the systematic review process.

  • Inclusion criteria determine what characteristics are needed for a study to be included in a systematic review.
  • Exclusion criteria denote what attributes disqualify a study from consideration in a systematic review.
  • Knowing what to exclude or include helps speed up the review process.

These criteria will be used at different parts of the review process, including in search statements and the screening process.

Has this review been done?

After developing the research question, it is necessary to confirm that the review has not previously been conducted (or is currently in progress).

Make sure to check for both published reviews and registered protocols (to see if the review is in progress). Do a thorough search of appropriate databases; if additional help is needed,  consult a librarian  for suggestions.



Systematic Reviews

  • Introduction to Systematic Reviews
  • Systematic review
  • Systematic literature review
  • Scoping review
  • Rapid evidence assessment / review
  • Evidence and gap mapping exercise
  • Meta-analysis
  • Systematic Reviews in Science and Engineering
  • Timescales and processes
  • Question frameworks (e.g PICO)
  • Inclusion and exclusion criteria
  • Using grey literature
  • Search Strategy
  • Subject heading searching (e.g MeSH)
  • Database video & help guides
  • Documenting your search and results
  • Data management
  • How the library can help
  • Systematic reviews A to Z


Using a framework to structure your research question

Your systematic review or systematic literature review will be defined by your research question. A well formulated question will help:

  • Frame your entire research process
  • Determine the scope of your review
  • Provide a focus for your searches
  • Help you identify key concepts
  • Guide the selection of your papers

There are different models you can use to help structure a question, which will also help with searching.

Selecting a framework

  • What if my topic doesn't fit a framework?

A model commonly used for clinical and healthcare-related questions, often, although not exclusively, used when searching for quantitatively designed studies.

Example question: Does handwashing reduce hospital acquired infections in elderly people?

Population: Any characteristics that define your patient or population group. Example: elderly people.
Intervention: What do you want to do with the patient or population? Example: handwashing.
Comparison (if relevant): What are the alternatives to the main intervention? Example: no handwashing.
Outcome: Any specific outcomes or effects of your intervention. Example: reduced infection.

Richardson, W.S., Wilson, M.C, Nishikawa, J. and Hayward, R.S.A. (1995) 'The well-built clinical question: a key to evidence-based decisions.' ACP Journal Club , 123(3) pp. A12

PEO is useful for qualitative research questions.

Example question: How does substance dependence play a role in homelessness?

P - Population: Who are the users (patients, family, practitioners or community being affected)? What are the symptoms, condition, health status, age, gender, ethnicity? What is the setting (e.g. acute care, community, mental health)? Example: homeless persons
E - Exposure: Exposure to a condition or illness, a risk factor (e.g. smoking), screening, rehabilitation, a service etc. Example: drug and alcohol addiction services
O - Outcome: Experiences, attitudes, feelings, improvement in condition, mobility, responsiveness to treatment, care, quality of life or daily living. Example: reduced homelessness

Moola S, Munn Z, Sears K, Sfetcu R, Currie M, Lisy K, Tufanaru C, Qureshi R, Mattis P & Mu P. (2015) 'Conducting systematic reviews of association (etiology): The Joanna Briggs Institute's approach'. International Journal of Evidence-Based Healthcare, 13(3), pp. 163-9. Available at: https://doi.org/10.1097/XEB.0000000000000064.

PCC is useful for both qualitative and quantitative (mixed methods) topics, and is commonly used in scoping reviews.

Example question: "What patient-led models of care are used to manage chronic disease in high income countries?"

P - Population: "Important characteristics of participants, including age and other qualifying criteria. You may not need to include this element unless your question focuses on a specific condition or cohort." Example: N/A, as this question considers chronic diseases broadly rather than a specific condition or population (such as women with chronic obstructive pulmonary disorder).

C - Concept: "The core concept examined by the scoping review should be clearly articulated to guide the scope and breadth of the inquiry. This may include details that pertain to elements that would be detailed in a standard systematic review, such as the 'interventions' and/or 'phenomena of interest' and/or 'outcomes'." Examples: chronic disease; patient-led care models.

C - Context: The setting or circumstances in which the question is asked. Example: high income countries.

Peters MDJ, Godfrey C, McInerney P, Munn Z, Tricco AC, Khalil, H. Chapter 11: Scoping Reviews (2020 version). In: Aromataris E, Munn Z (Editors). JBI Manual for Evidence Synthesis, JBI, 2020. Available from   https://synthesismanual.jbi.global  .    https://doi.org/10.46658/JBIMES-20-12

SPIDER is a model useful for qualitative and mixed-methods research questions.

Example question: What are young parents’ experiences of attending antenatal education? (Cooke et al., 2012)

S - Sample: The group you are focusing on. Example: young parents
PI - Phenomenon of Interest: The behaviour or experience your research is examining. Example: experience of antenatal classes
D - Design: How will the research be carried out? Example: interviews, questionnaires
E - Evaluation: What are the outcomes you are measuring? Example: experiences and views
R - Research type: What type of research are you undertaking? Example: qualitative

Cooke, A., Smith, D. and Booth, A. (2012) 'Beyond PICO: the SPIDER tool for qualitative evidence synthesis.' Qualitative Health Research , 22(10) pp. 1435-1443

SPICE is a model useful for qualitative and mixed-methods research questions.

Example question: How effective is mindfulness used as a cognitive therapy in a counseling service in improving the attitudes of patients diagnosed with cancer?

S - Setting: The setting or the context. Example: counseling service
P - Population or perspective: Which population or perspective will the research be conducted for/from? Example: patients diagnosed with cancer
I - Intervention: The intervention being studied. Example: mindfulness-based cognitive therapy
C - Comparison: Is there a comparison to be made? Example: no comparison
E - Evaluation: How well did the intervention work; what were the results? Example: assess patients' attitudes to see if the intervention improved their quality of life

Example question taken from: Tate, KJ., Newbury-Birch, D., and McGeechan, GJ. (2018) ‘A systematic review of qualitative evidence of  cancer patients’ attitudes to mindfulness.’ European Journal of Cancer Care , 27(2) pp. 1 – 10.

ECLIPSE is a model useful for qualitative and mixed-methods research questions, especially for questions examining particular services or professions.

Example question: Cross service communication in supporting adults with learning difficulties

E - Expectation: Purpose of the study - what are you trying to achieve? Example: how communication can be improved between services to create better care
C - Client group: Which group are you focusing on? Example: adults with learning difficulties
L - Location: Where is that group based? Example: community
I - Impact: If your research is looking for service improvement, what is it and how is it being measured? Example: better support services for adults with learning difficulties through joined-up, cross-service working
P - Professionals: What professional staff are involved? Example: community nurses, social workers, carers
SE - Service: Which service are you focusing on? Example: adult support services

You might find that your topic does not always fall into one of the models listed on this page. You can always modify a model to make it work for your topic, and either remove or incorporate additional elements.

The important thing is to ensure that you have a high quality question that can be separated into its component parts.


Annual Review of Psychology

Volume 70, 2019. Review article: How to Do a Systematic Review: A Best Practice Guide for Conducting and Reporting Narrative Reviews, Meta-Analyses, and Meta-Syntheses.

  • Andy P. Siddaway 1 , Alex M. Wood 2 , and Larry V. Hedges 3
  • Affiliations: 1 Behavioural Science Centre, Stirling Management School, University of Stirling, Stirling FK9 4LA, United Kingdom; email: [email protected] 2 Department of Psychological and Behavioural Science, London School of Economics and Political Science, London WC2A 2AE, United Kingdom 3 Department of Statistics, Northwestern University, Evanston, Illinois 60208, USA; email: [email protected]
  • Vol. 70:747-770 (Volume publication date January 2019) https://doi.org/10.1146/annurev-psych-010418-102803
  • First published as a Review in Advance on August 08, 2018
  • Copyright © 2019 by Annual Reviews. All rights reserved

Systematic reviews are characterized by a methodical and replicable methodology and presentation. They involve a comprehensive search to locate all relevant published and unpublished work on a subject; a systematic integration of search results; and a critique of the extent, nature, and quality of evidence in relation to a particular research question. The best reviews synthesize studies to draw broad theoretical conclusions about what a literature means, linking theory to evidence and evidence to theory. This guide describes how to plan, conduct, organize, and present a systematic review of quantitative (meta-analysis) or qualitative (narrative review, meta-synthesis) information. We outline core standards and principles and describe commonly encountered problems. Although this guide targets psychological scientists, its high level of abstraction makes it potentially relevant to any subject area or discipline. We argue that systematic reviews are a key methodology for clarifying whether and how research findings replicate and for explaining possible inconsistencies, and we call for researchers to conduct systematic reviews to help elucidate whether there is a replication crisis.


Duke University Libraries

Systematic Reviews for Non-Health Sciences

  • Getting started
  • Types of reviews
  • 0. Planning the systematic review
  • 1. Formulating the research question

Formulating a research question

Purpose of a framework, selecting a framework.

  • 2. Developing the protocol
  • 3. Searching, screening, and selection of articles
  • 4. Critical appraisal
  • 5. Writing and publishing
  • Guidelines & standards
  • Software and tools
  • Software tutorials
  • Resources by discipline
  • Duke Med Center Library: Systematic reviews
  • Overwhelmed? General literature review guidance


Formulating a question.

Formulating a strong research question for a systematic review can be a lengthy process. While you may have an idea about the topic you want to explore, your specific research question is what will drive your review and requires some consideration. 

You will want to conduct preliminary  or  exploratory searches  of the literature as you refine your question. In these searches you will want to:

  • Determine if a systematic review has already been conducted on your topic and if so, how yours might be different, or how you might shift or narrow your anticipated focus
  • Scope the literature to determine if there is enough literature on your topic to conduct a systematic review
  • Identify key concepts and terminology
  • Identify seminal or landmark studies
  • Identify key studies that you can test your research strategy against (more on that later)
  • Begin to identify databases that might be useful to your search question

Systematic review vs. other reviews

Systematic reviews require a narrow and specific research question. The goal of a systematic review is to provide an evidence synthesis of ALL research performed on one particular topic. So, your research question should be clearly answerable from the data you gather from the studies included in your review.

Ask yourself if your question even warrants a systematic review (has it been answered before?). If your question is more broad in scope or you aren't sure if it's been answered, you might look into performing a systematic map or scoping review instead.

Learn more about systematic reviews versus scoping reviews:

  • CEE. (2022). Section 2: Identifying the need for evidence, determining the evidence synthesis type, and establishing a Review Team. Collaboration for Environmental Evidence. https://environmentalevidence.org/information-for-authors/2-need-for-evidence-synthesis-type-and-review-team-2/
  • DistillerSR. (2022). The difference between systematic reviews and scoping reviews. DistillerSR.  https://www.distillersr.com/resources/systematic-literature-reviews/the-difference-between-systematic-reviews-and-scoping-reviews
  • Nalen, CZ. (2022). What is a scoping review? AJE.  https://www.aje.com/arc/what-is-a-scoping-review/

A well-formulated research question will help:

  • Frame your entire research process
  • Determine the scope of your review
  • Provide a focus for your searches
  • Help you identify key concepts
  • Guide the selection of your papers

There are different frameworks you can use to help structure a question.

  • PICO / PECO
  • What if my topic doesn't fit a framework?

The PICO or PECO framework is typically used in clinical and health sciences-related research, but it can also be adapted for other quantitative research.

P — Patient / Problem / Population

I / E — Intervention / Indicator / phenomenon of Interest / Exposure / Event 

C  — Comparison / Context / Control

O — Outcome

Example topic : Health impact of hazardous waste exposure

Population: people living near hazardous waste sites
Exposure: exposure to hazardous waste
Comparators: all comparators
Outcomes: all diseases/health disorders

Fazzo, L., Minichilli, F., Santoro, M., Ceccarini, A., Della Seta, M., Bianchi, F., Comba, P., & Martuzzi, M. (2017). Hazardous waste and health impact: A systematic review of the scientific literature.  Environmental Health ,  16 (1), 107.  https://doi.org/10.1186/s12940-017-0311-8

The SPICE framework is useful for both qualitative and mixed-method research. It is often used in the social sciences.

S — Setting (where?)

P — Perspective (for whom?)

I — Intervention / Exposure (what?)

C — Comparison (compared with what?)

E — Evaluation (with what result?)

Learn more : Booth, A. (2006). Clear and present questions: Formulating questions for evidence based practice.  Library Hi Tech ,  24 (3), 355-368.  https://doi.org/10.1108/07378830610692127

The SPIDER framework is useful for both qualitative and mixed-method research. It is most often used in health sciences research.

S — Sample

PI — Phenomenon of Interest

D — Design

E — Evaluation

R — Research type

Learn more : Cooke, A., Smith, D., & Booth, A. (2012). Beyond PICO: The SPIDER tool for qualitative evidence synthesis.  Qualitative Health Research, 22 (10), 1435-1443.  https://doi.org/10.1177/1049732312452938

The CIMO framework is used to understand complex social and organizational phenomena, most useful for management and business research.

C — Context (the social and organizational setting of the phenomenon)

I  — Intervention (the actions taken to address/influence the phenomenon)

M — Mechanisms (the underlying processes or mechanisms that drive change within the phenomenon)

O — Outcomes (the resulting changes that occur due to intervention/mechanisms)

Learn more : Denyer, D., Tranfield, D., & van Aken, J. E. (2008). Developing design propositions through research synthesis. Organization Studies, 29 (3), 393-413. https://doi.org/10.1177/0170840607088020

The University of Maryland Libraries maintain an exhaustive list of research question frameworks.

You might find that your topic does not always fall into one of the models listed on this page. You can always modify a model to make it work for your topic, and either remove or incorporate additional elements. Be sure to document in your review the established framework that yours is based on and how it has been modified.



1.2.2  What is a systematic review?

A systematic review attempts to collate all empirical evidence that fits pre-specified eligibility criteria in order to answer a specific research question. It uses explicit, systematic methods that are selected with a view to minimizing bias, thus providing more reliable findings from which conclusions can be drawn and decisions made (Antman 1992, Oxman 1993). The key characteristics of a systematic review are:

a clearly stated set of objectives with pre-defined eligibility criteria for studies;

an explicit, reproducible methodology;

a systematic search that attempts to identify all studies that would meet the eligibility criteria;

an assessment of the validity of the findings of the included studies, for example through the assessment of risk of bias; and

a systematic presentation, and synthesis, of the characteristics and findings of the included studies.

Many systematic reviews contain meta-analyses. Meta-analysis is the use of statistical methods to summarize the results of independent studies (Glass 1976). By combining information from all relevant studies, meta-analyses can provide more precise estimates of the effects of health care than those derived from the individual studies included within a review (see Chapter 9, Section 9.1.3 ). They also facilitate investigations of the consistency of evidence across studies, and the exploration of differences across studies.
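For orientation, the simplest form of this statistical combination is an inverse-variance weighted average of the study results. A minimal sketch, assuming each included study i contributes an effect estimate \(\hat{\theta}_i\) (for example a log odds ratio) with variance \(v_i\):

\[
w_i = \frac{1}{v_i}, \qquad \hat{\theta}_{\text{pooled}} = \frac{\sum_i w_i \hat{\theta}_i}{\sum_i w_i}, \qquad \operatorname{Var}\!\left(\hat{\theta}_{\text{pooled}}\right) = \frac{1}{\sum_i w_i}.
\]

Because the pooled variance shrinks as studies are added, the combined estimate is more precise than any single included study, which is the gain in precision described above.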


Systematic Review

  • Library Help
  • What is a Systematic Review (SR)?
  • Steps of a Systematic Review
  • Framing a Research Question
  • Developing a Search Strategy
  • Searching the Literature
  • Managing the Process
  • Meta-analysis
  • Publishing your Systematic Review

Developing a Research Question


There are many ways of framing questions depending on the topic, discipline, or type of questions.

Try to generate a few options for your initial research topic and narrow it down to a specific population, geographical location, disease, etc. You may also explore other tools to identify additional search terms.

Several frameworks are listed in the table below.

Source:

Foster, M. & Jewell, S. (Eds). (2017).  . Medical Library Association, Lanham: Rowman & Littlefield. p. 38, Table 3.


Frameworks for research questions

  • BeHEMoTh (questions about theories): Be - behavior of interest; H - health context (service/policy/intervention); E - exclusions; MoTh - models or theories. Source: Booth, A., & Carroll, C. (2015). (3), 220–235. https://doi.org/10.1111/hir.12108
  • CHIP (psychology, qualitative): Context; How; Issues; Population. Source: Shaw, R. (2010). In M. A. Forester (Ed.), (pp. 39-52). London: Sage.
  • CIMO (management, business, administration): Context; Intervention; Mechanisms; Outcomes. Source: In D. A. Buchanan & A. Bryman (Eds.), (pp. 671-689). Thousand Oaks, CA: Sage Publications Ltd.
  • CLIP (librarianship, management, policy): Client group; Location of provided service; Improvement/Information/Innovation; Professionals (who provides the service?). Source: Wildridge, V., & Bell, L. (2002). (2), 113–115. https://doi.org/10.1046/j.1471-1842.2002.00378.x
  • COPES (social work, health care, nursing): Client-Oriented; Practical; Evidence; Search. Source: Gibbs, L. (2003). Pacific Grove, CA: Brooks/Cole-Thomson Learning.
  • ECLIPSE (management, services, policy, social care): Expectation; Client; Location; Impact; Professionals; Service. Source: Wildridge, V., & Bell, L. (2002). (2), 113–115. https://doi.org/10.1046/j.1471-1842.2002.00378.x
  • PEO (qualitative): Population; Exposure; Outcome. Source: Khan, K. S., Kunz, R., Kleijnen, J., & Antes, G. (2003). London: Royal Society of Medicine Press.
  • PECODR (medicine): Patient/population/problem; Exposure; Comparison; Outcome; Duration; Results. Source: Dawes, M., Pluye, P., Shea, L., Grad, R., Greenberg, A., & Nie, J.-Y. (2007). (1), 9–16.
  • PerSPEcTiF (qualitative research): Perspective; Setting; Phenomenon of interest/Problem; Environment; Comparison (optional); Time/Timing; Findings. Source: Booth, A., Noyes, J., Flemming, K., Moore, G., Tunçalp, Ö., & Shakibazadeh, E. (2019). (Suppl 1).
  • PESICO (augmentative and alternative communication): Person; Environments; Stakeholders; Intervention; Comparison; Outcome. Source: Schlosser, R. W., & O'Neil-Pirozzi, T. (2006). , 5-10.
  • PICO (clinical medicine): Patient; Intervention; Comparison; Outcome. Source: Richardson, W. S., Wilson, M. C., Nishikawa, J., & Hayward, R. S. (1995). (3), A12-A12.
  • PICO+ (occupational therapy): Patient; Intervention; Comparison; Outcome; plus context, patient values, and preferences. Source: Bennett, S., & Bennett, J. W. (2000). (4), 171-180.
  • PICOC (social sciences): Patient; Intervention; Comparison; Outcome; Context. Source: Petticrew, M., & Roberts, H. (2006). Malden, MA: Blackwell Publishers.
  • PICOS (medicine): Patient; Intervention; Comparison; Outcome; Study type. Source: Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G., & PRISMA Group. (2009). (7), e1000097.
  • PICOT (education, health care): Patient; Intervention; Comparison; Outcome; Time. Source: Richardson, W. S., Wilson, M. C., Nishikawa, J., & Hayward, R. S. (1995). (3), A12-A12.
  • For diagnostic questions: Patient/participants/population; Index tests; Comparator/reference tests; Outcome. Source: Kim, K. W., Lee, J., Choi, S. H., Huh, J., & Park, S. H. (2015). (6), 1175-1187.
  • PIPOH (screening): Population; Intervention; Professionals; Outcomes; Health care setting/context. Source: ADAPTE Collaboration. (2009). Version 2.0.
  • Problem; Phenomenon of interest; Time (social sciences, qualitative, library science). Sources: Booth, A., Noyes, J., Flemming, K., Gerhardus, A., Wahlster, P., van der Wilt, G. J., ... & Rehfuess, E. (2016). [Technical Report]. https://doi.org/10.13140/RG.2.1.2318.0562; Booth, A., Sutton, A., & Papaioannou, D. (2016). (2nd ed.). London: Sage.
  • SPICE (library and information sciences): Setting; Perspective; Interest; Comparison; Evaluation. Source: Booth, A. (2006). (3), 355-368.
  • SPIDER (health, qualitative research): Sample; Phenomenon of interest; Design; Evaluation; Research type. Source: Cooke, A., Smith, D., & Booth, A. (2012). (10), 1435-1443.
  • Who; What - what was done? (intervention, exposure, policy, phenomenon); How - how does the what affect the who?

Further reading:

Methley, A. M., Campbell, S., Chew-Graham, C., McNally, R., & Cheraghi-Sohi, S. (2014). PICO, PICOS and SPIDER: A comparison study of specificity and sensitivity in three search tools for qualitative systematic reviews.   BMC Health Services Research, 14 (1), 579.


Assessing the certainty of the evidence in systematic reviews: importance, process, and use.


Romina Brignardello-Petersen, Gordon H Guyatt, Assessing the Certainty of the Evidence in Systematic Reviews: Importance, Process, and Use, American Journal of Epidemiology , 2024;, kwae332, https://doi.org/10.1093/aje/kwae332


When interpreting results and drawing conclusions, authors of systematic reviews should consider the limitations of the evidence included in their review. The Grading of Recommendations Assessment, Development, and Evaluation (GRADE) approach provides a framework for the explicit consideration of the limitations of the evidence included in a systematic review, and for incorporating this assessment into the conclusions. Assessments of certainty of evidence are a methodological expectation of systematic reviews. The certainty of the evidence is specific to each outcome in a systematic review, and can be rated as high, moderate, low, or very low. Because it will have an important impact on the assessment, before assessing the certainty of the evidence, reviewers must clarify the intent of their question: whether they are interested in causation or in association. Serious concerns regarding limitations in the study design, inconsistency, imprecision, indirectness, and publication bias can decrease the certainty of the evidence. Using an example, this article describes and illustrates the importance and the steps for assessing the certainty of evidence and drawing accurate conclusions in a systematic review.




Abbreviations: MA, meta-analysis; ROR, ratio of odds ratios.

eAppendix 1. Search Strategy

eAppendix 2. Data Extraction

eTable 1. The Composite Primary Outcome and Effect Estimates of Mega-Trials Identified by Our Search but Analyzed Only for a Subset of the Primary Outcome

eAppendix 3. Mega-Trials Not Included in Meta-Analyses

eTable 4. Characteristics of Mega-Trials Identified by Our Search but Had No Eligible Meta-Analysis

eTable 2. Characteristics of the Additional Identified Mega-Trials That Have Not Been Identified by Our Search

eAppendix 4. Meta-Analyses of Mega-Trials vs Smaller Trials for the Primary Outcome

eFigure 1. Agreement Between Mega-Trials and Smaller Trials for Primary Outcome: Random Effects (DerSimonian Laird)

eAppendix 5. Meta-Analyses of Mega-Trials vs Smaller Trials for All-Cause Mortality

eFigure 2. Agreement Between Mega-Trials and Smaller Trials for All-Cause Mortality: Random Effects (DerSimonian Laird)

eFigure 3. Agreement Between Smaller Trials Prior and After the Publication of the First Mega-Trial

eTable 3. Results of Uni- and Multivariable Meta-Regression

eFigure 4. Agreement Between Mega-Trials and Smaller Trials With 1/5 of the Least Weighted Mega-Trial

eFigure 5. Agreement Between Mega-Trials and Smaller Trials With 1/10 of the Least Weighted Megatrial

eFigure 6. Agreement Between Mega-Trials and Smaller Trials, Pooling the Results Using Fixed Effects

eFigure 7. Agreement Between Mega-Trials and Smaller Trials, Pooling the Results Using Random Effects – HKSJ Method

eFigure 8. Agreement Between Mega-Trials and Smaller Trials Stratified to Blinding

eFigure 9. Agreement Between Mega-Trials and Smaller Trials Stratified to Intervention Type

eFigure 10. Agreement Between Mega-Trials and Smaller Trials Stratified to Specialty

eFigure 11. Agreement Between Mega-Trials and Smaller Trials Stratified to Heterogeneity

eFigure 12. Agreement Between Trials With More Than 30,000 Participants and Smaller Trial for the Primary Outcome

eFigure 13. Agreement Between Mega-Trials When More Than One Was Present in a Meta-Analysis–Primary Outcome

eReferences.

Data Sharing Statement


Kastrati L, Raeisi-Dehkordi H, Llanaj E, et al. Agreement Between Mega-Trials and Smaller Trials: A Systematic Review and Meta-Research Analysis. JAMA Netw Open. 2024;7(9):e2432296. doi:10.1001/jamanetworkopen.2024.32296


Agreement Between Mega-Trials and Smaller Trials: A Systematic Review and Meta-Research Analysis

  • 1 Meta-Research Innovation Center at Stanford (METRICS), Stanford University, Stanford, California
  • 2 Institute of Social and Preventive Medicine (ISPM), University of Bern, Bern, Switzerland
  • 3 Graduate School for Health Sciences, University of Bern, Bern, Switzerland
  • 4 Department of Diabetes, Endocrinology, Nutritional Medicine and Metabolism, Inselspital, Bern University Hospital, University of Bern, Switzerland
  • 5 Department of Global Public Health and Bioethics, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands
  • 6 Epistudia, Bern, Switzerland
  • 7 Department of Molecular Epidemiology, German Institute of Human Nutrition Potsdam-Rehbrücke, Nuthetal, Germany
  • 8 German Centre for Diabetes Research (DZD), München-Neuherberg, Germany
  • 9 Department of Population Health Sciences, Duke University School of Medicine, Durham, North Carolina
  • 10 Community Medicine Department, Tehran University of Medical Sciences, Tehran, Iran
  • 11 Department of Internal Medicine, Lausanne University Hospital, University of Lausanne, Lausanne, Switzerland
  • 12 Division of Endocrinology and Diabetology, Department of Internal Medicine, Medical University of Graz, Graz, Austria
  • 13 Stanford Prevention Research Center, Department of Medicine, Stanford University School of Medicine, Stanford, California
  • 14 Department of Epidemiology and Population Health, Stanford University School of Medicine, Stanford, California
  • 15 Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, California
  • 16 Department of Statistics, Stanford University School of Humanities and Sciences, Stanford, California

Question   Are the results of mega-trials with 10 000 participants or more similar to meta-analysis of trials with smaller sample sizes for the primary outcome and/or all-cause mortality?

Findings   In this meta-research analysis of 82 mega-trials, meta-analyses of smaller studies showed overall comparable results with mega-trials, but smaller trials published before the mega-trials gave more favorable results than mega-trials. There were very low rates of significant results for the primary outcome and all-cause mortality for mega-trials.

Meaning   The findings of this study suggest that mega-trials need to be performed more often, given the relatively low number of mega-trials found, their low rates of significant results, and the fact that smaller trials published prior to mega-trials reported more beneficial results than mega-trials and subsequent smaller trials.

Importance   Mega-trials can provide large-scale evidence on important questions.

Objective   To explore how the results of mega-trials compare with the meta-analysis results of trials with smaller sample sizes.

Data Sources   ClinicalTrials.gov was searched for mega-trials until January 2023. PubMed was searched until June 2023 for meta-analyses incorporating the results of the eligible mega-trials.

Study Selection   Mega-trials were eligible if they were noncluster nonvaccine randomized clinical trials, had a sample size over 10 000, and had a peer-reviewed meta-analysis publication presenting results for the primary outcome of the mega-trials and/or all-cause mortality.

Data Extraction and Synthesis   For each selected meta-analysis, we extracted results of smaller trials and mega-trials included in the summary effect estimate and combined them separately using random effects. These estimates were used to calculate the ratio of odds ratios (ROR) between mega-trials and smaller trials in each meta-analysis. Next, the RORs were combined using random effects. Risk of bias was extracted for each trial included in our analyses (or when not available, assessed only for mega-trials). Data analysis was conducted from January to June 2024.

Main Outcomes and Measures   The main outcomes were the summary ROR for the primary outcome and all-cause mortality between mega-trials and smaller trials. Sensitivity analyses were performed with respect to the year of publication, masking, weight, type of intervention, and specialty.

Results   Of 120 mega-trials identified, 41 showed a significant result for the primary outcome and 22 showed a significant result for all-cause mortality. In 35 comparisons of primary outcomes (including 85 point estimates from 69 unique mega-trials and 272 point estimates from smaller trials) and 26 comparisons of all-cause mortality (including 70 point estimates from 65 unique mega-trials and 267 point estimates from smaller trials), no difference existed between the outcomes of the mega-trials and smaller trials for the primary outcome (ROR, 1.00; 95% CI, 0.97-1.04) or for all-cause mortality (ROR, 1.00; 95% CI, 0.97-1.04). For the primary outcomes, smaller trials published before the mega-trials had more favorable results than the mega-trials (ROR, 1.05; 95% CI, 1.01-1.10) and subsequent smaller trials published after the mega-trials (ROR, 1.10; 95% CI, 1.04-1.18).

Conclusions and Relevance   In this meta-research analysis, meta-analyses of smaller studies showed overall comparable results with mega-trials, but smaller trials published before the mega-trials gave more favorable results than mega-trials. These findings suggest that mega-trials need to be performed more often, given the relatively low number of mega-trials found, their low rates of significant results, and the fact that smaller trials published prior to mega-trials report more beneficial results than mega-trials and subsequent smaller trials.

Most randomized comparisons of interventions in medicine use small to modest sample sizes. The call for more mega-trials (ie, large sample trials) with over 10 000 participants has been longstanding. 1 , 2 Mega-trials have been rare, but there has been a renewed interest recently. Several mega-trials have found that certain interventions, like vitamin D supplementation, may not be as effective as previously thought. 3 , 4 Conversely, other mega-trials, such as the Second International Study of Infarct Survival (ISIS-2) Collaborative Group trial on streptokinase and aspirin after myocardial infarction 5 found favorable results with major clinical impact. Conducting mega-trials may be facilitated by the growth of interest in pragmatic (ie, practical) research, 6 , 7 new platforms for recruitment of participants, 8 and wider recognition of the limitations of small trials. Therefore, it is important to understand and compare the results of mega-trials with those of smaller trials.

Meta-analyses rarely include large trials, and small trials have traditionally been considered more susceptible to biases, including more prominent selective reporting. 9 , 10 Previous literature comparing results of meta-analyses of small trials with subsequent large trials has shown heterogeneous results. 11 - 16 Furthermore, different methods have been proposed to analyze the agreement. 17 Different event rates in the control group of the considered trials (baseline risk), differences in trial quality, and variable susceptibility to bias of the health outcomes under investigation may also generate heterogeneity. 11 Moreover, mega-trials and smaller trials may have topic- and question-specific biases that are different in the 2 groups. In previous work, there was also no clear consensus on what constitutes a large trial. Some 18 have considered the amount of evidence in each trial (inverse of variance or sample size) as a continuum, while others tried to separate trials with sufficient power (eg, 80%) to detect plausible effects, 19 and yet others used arbitrary sample size thresholds, (eg, 1000 participants). 12 , 14 To our knowledge, no comprehensive empirical examination has systematically compared the results of mega-trials with sample sizes exceeding 10 000 participants versus smaller trials.

Here, we aimed to systematically identify such mega-trials, identify which ones have been included in meta-analyses for their primary outcomes and/or for mortality outcomes, compare the results of these mega-trials against the combined results of smaller trials, and identify potential factors associated with discrepancies.

This meta-analysis was a meta-research project; because this study is not a typical meta-analysis, we followed the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) reporting guideline where applicable. 18 The original protocol was registered in Open Science Framework. Because the information we used consisted of publicly available results of RCTs, and not patient-specific data, there was no need for ethical review. We analyzed meta-analyses of clinical trials that have included mega-trial results in their analysis for calculations of a summary effect size for the primary end point of the mega-trial. Additionally, we considered data on all-cause mortality as a secondary outcome because it is the most severe and objective outcome.

Mega-trials were considered for analysis if they were noncluster, nonvaccine randomized clinical trials (RCTs) regardless of masking; had a sample size of more than 10 000 participants; had a peer-reviewed publication presenting the results of the primary end point; and were included in a meta-analysis for their primary outcome and/or all-cause mortality. We excluded cluster trials because the effective sample size is much smaller than the number of participants. We excluded vaccine trials because very large vaccine trials usually have different considerations and types of outcomes than mega-trials of other interventions.

For a meta-analysis to be included in the analysis, it had to have a systematic review design and include the results of the mega-trial along with any number of other trials in obtaining summary effect size estimates with the effect size and variance data available (or possible to calculate) for each trial from presented information.

We searched for mega-trials in ClinicalTrials.gov (last updated January 2023) and then performed PubMed searches (until June 2023) to identify the most recent meta-analyses that included the results of these mega-trials for the primary outcome of the mega-trial and for all-cause mortality. Details on the search process are in eAppendix 1 in Supplement 1 .

For each selected meta-analysis, we extracted the results of RCTs included in the summary effect size estimate that incorporated the effect size estimate of the mega-trial. We also extracted information, whenever available, on the risk of bias assessments for each included trial based on Cochrane Risk of Bias Tools (original, revised, and version 2). All data extractions (except mega-trial identification) were performed by 2 reviewers (L.K. and H.R.D.; L.K. and H.G.Q.P.; L.K. and E.L.L.; L.K. and N.S.A.; L.K. and F.K.; L.K. and R.M.; and L.K. and A.L.L.), and differences were settled by discussion. For any unsettled discrepancies, a third senior reviewer (T.M.) was invited to arbitrate. Details on data extraction appear in eAppendix 2 in Supplement 1 .

Some of the eligible meta-analyses contained results from other mega-trials that had not been detected by our search. Therefore, we described these extra identified trials and included them in our analyses. We extracted information for all mega-trials based on whether they found statistically significant or nonsignificant results and whether they were designed to show noninferiority. In several meta-analyses, some trials did not pass the 10 000-participant threshold but were substantially large enough to blur the effects. Therefore, in a sensitivity analysis, we compared the results of mega-trials vs only the smaller trials that weighted less than one-fifth of the least weighted mega-trial; in another sensitivity analysis, we compared the results of mega-trials vs smaller trials that weighted less than one-tenth of the least weighted mega-trial. We then further restricted these trials to those published only before or up to the first trial. We also explored the agreement on different thresholds, setting the threshold at a sample size of 30 000. In addition, we also compared the agreement between the mega-trials, when more than one was included in a meta-analysis. Finally, we also assessed the risk of bias for the mega-trials that had not been assessed (or had been assessed using various non-Cochrane tools [eg, Jadad scale]) using the Cochrane Risk-of-Bias Tool. 25

In each eligible meta-analysis, we combined the results from non–mega-trials using random effects (and fixed effects as sensitivity analysis) and compared them against the results of the mega-trial. In meta-analyses where several mega-trials were available, the results of the mega-trials were combined using random effects first before being compared against the results of smaller trials. Any cluster trials were considered to be non–mega-trials. 20

The odds ratio (OR) was the metric of choice. All the analyzed outcomes were dichotomous. Between-trial heterogeneity assessments used the τ² between-study variance estimator, the Q test, and the I² statistic. 21
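For reference, these are standard quantities, sketched here in general form rather than as this study's exact computations: with k trials, inverse-variance weights \(w_i = 1/v_i\) and the corresponding fixed-effect pooled estimate \(\hat{\theta}_{\text{FE}}\),

\[
Q = \sum_{i=1}^{k} w_i \left(\hat{\theta}_i - \hat{\theta}_{\text{FE}}\right)^2, \qquad I^2 = \max\!\left(0,\ \frac{Q - (k-1)}{Q}\right) \times 100\%,
\]

and the DerSimonian-Laird estimate of the between-study variance used for random-effects pooling is

\[
\hat{\tau}^2_{\text{DL}} = \max\!\left(0,\ \frac{Q - (k-1)}{\sum_i w_i - \sum_i w_i^2 / \sum_i w_i}\right),
\]

with random-effects weights \(w_i^{*} = 1/(v_i + \hat{\tau}^2_{\text{DL}})\) replacing \(w_i\) in the pooled estimate.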

We obtained the log ratio of ORs (ROR) and its variance (the sum of the variances of the logOR in the 2 groups) between the mega-trials and the smaller trials for each eligible outcome. Then, the logROR estimates were combined across each outcome using the DerSimonian-Laird random-effects calculations. 22 We also performed sensitivity analyses using the Hartung-Knapp-Sidik-Jonkman (HKSJ) method. 23 In all calculations, treatment effects in single trials and meta-analyses thereof were coined consistently so that an ROR less than 1 means a more favorable outcome for the intervention group over the control group.
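Concretely, for each eligible outcome the comparison reduces to the following (a minimal sketch, writing \(\mathrm{OR}_{\text{mega}}\) and \(\mathrm{OR}_{\text{small}}\) for the pooled odds ratios of the mega-trials and the smaller trials, and \(v_{\text{mega}}\), \(v_{\text{small}}\) for the variances of their logarithms):

\[
\log \mathrm{ROR} = \log \mathrm{OR}_{\text{mega}} - \log \mathrm{OR}_{\text{small}}, \qquad \operatorname{Var}(\log \mathrm{ROR}) = v_{\text{mega}} + v_{\text{small}},
\]

giving a 95% CI of \(\exp\!\left(\log \mathrm{ROR} \pm 1.96 \sqrt{\operatorname{Var}(\log \mathrm{ROR})}\right)\). These per-meta-analysis log RORs are then combined for each outcome (primary outcome; all-cause mortality) using the DerSimonian-Laird random-effects calculations, as stated above.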

A sensitivity analysis was performed to assess whether the results were different when non–mega-trials were included in the calculations only if they were published up until (and including) the year of publication of any mega-trials and comparing them with the results of the mega-trial. This analysis more specifically targets the research question of whether mega-trials corroborate the results of smaller trials that have been performed before them. A separate analysis also compared the results of non–mega-trials published up until the year of publication of the mega-trial vs non–mega-trials published subsequently.

Separate subgroup analyses were performed for the comparison of results in mega-trials vs other trials according to masking (open-label vs masked), intervention type, specialty (eg, cardiovascular), and per heterogeneity (low vs non-low) of the mega-trials. We also performed exploratory meta-regressions considering the same variables (masking, type of outcome, type of intervention, and specialty) and also risk of bias in the mega-trials (high vs other), risk of bias in the other trials (proportion at high risk), median number of participants in non–mega-trials, and total number of participants in non–mega-trials. We also performed exploratory tests for small study effect sizes (Egger test), 24 when there were more than 10 trials.

Analyses were conducted using Stata software version 17 (StataCorp). The threshold for significance was a 2-tailed P < .05. Data analysis occurred from January to June 2024.

A total of 180 registered completed phase 3 or 4 mega-trials that did not involve vaccines and that had 10 000 or more participants were identified through our search ( Figure 1 ). Among these, 91 were randomized, noncluster, nonvaccine mega-trials; but 35 of these 91 trials lacked an appropriate meta-analysis and 2 had no published results, leaving 51 mega-trials with an eligible meta-analysis for either primary outcome and/or all-cause mortality. Three trials were registered with more than 10 000 participants and had eligible meta-analyses; however, they randomized fewer than 10 000 participants and were excluded from our analyses. 26 - 28 Results were compared to smaller trials across 58 meta-analyses, including 35 for primary outcome 29 - 75 , 152 and 26 for all-cause mortality. 29 , 32 - 35 , 37 - 47 , 49 - 54 , 56 - 62 , 64 - 70 , 72 - 74 , 76 - 78 In 3 studies, 32 , 41 , 68 all-cause mortality was the mega-trial’s primary outcome ( Table 1 ). For 19 mega-trials that had a composite primary outcome , 30 , 32 , 33 , 39 , 42 , 45 , 46 , 48 , 53 , 55 , 56 , 59 , 61 , 62 , 66 , 68 , 69 , 71 , 152 no eligible meta-analysis was identified for the complete composite outcome; therefore, the meta-analysis of one of the subsets of the composite outcome with the highest number of events was analyzed ( Table 1 and eTable 1, eAppendix 3, and eTable 4 in Supplement 1 ).

The eligible meta-analyses included estimates from another 30 mega-trials 79 - 108 that had a randomized, noncluster design and more than 10 000 participants but had not been identified in our searches (eTable 2 in Supplement 1 ). Of these 30 studies, 26 were not registered in ClinicalTrials.gov, 79 - 84 , 86 - 94 , 96 - 101 , 103 - 107 while 2 85 , 108 had no listed location in ClinicalTrials.gov, 1 95 had no results listed in ClinicalTrials.gov, and for 1 study, 102 no reason for missingness was identified. These 30 trials, with their estimates for primary outcomes (20 trials) and all-cause mortality (22 trials), were considered in the mega-trials group in all calculations. The meta-analyses also included 1 additional mega-trial that had initially been identified by our search but had no eligible meta-analysis for the primary outcome and/or all-cause mortality; it was meta-analyzed for another outcome. 109 In total, 82 mega-trials were included across all meta-analyses for the primary outcome (69 mega-trials 29 - 75 , 79 , 80 , 84 - 87 , 89 - 94 , 97 - 100 , 102 - 104 , 108 , 109 , 152 ) and all-cause mortality (65 mega-trials 29 , 32 - 35 , 37 - 47 , 49 - 54 , 56 - 62 , 64 - 67 , 69 , 70 , 72 - 74 , 76 - 83 , 85 , 87 - 89 , 92 - 96 , 99 , 101 - 107 , 109 , 152 ).

Of the 82 mega-trials 29 - 109 , 152 included in our analyses, 64 30 , 31 , 33 - 40 , 42 - 74 , 76 - 86 , 89 - 94 , 96 - 98 , 100 , 102 - 106 , 108 , 109 investigated cardiovascular outcomes, 17 mega-trials 31 , 38 , 49 , 57 , 65 , 73 , 80 , 88 , 93 , 95 , 97 , 98 , 100 , 101 , 107 - 109 were centered around nutritional interventions, and 1 mega-trial 75 covered various other medical intervention types, such as pharmacological treatment ( Table 1 and eTable 1 and eTable 2 in Supplement 1 ). Moreover, 15 of the mega-trials were open-label, 29 , 37 , 47 , 57 , 68 , 73 , 79 - 81 , 86 , 87 , 90 , 102 , 105 , 106 while 65 mega-trials were double-blinded and 2 trials employed varying degrees of masking ( Table 1 ). Of all the mega-trials, 14 29 , 44 , 47 , 52 , 68 , 72 , 73 , 79 , 81 , 87 , 97 , 102 , 106 , 152 were judged to be at high risk of bias. A total of 32 mega-trials 29 , 30 , 35 , 37 , 39 , 40 , 43 , 45 , 51 , 54 , 55 , 58 , 60 , 64 , 69 , 71 , 73 , 76 , 78 - 80 , 82 , 85 , 87 , 90 , 92 , 96 , 101 , 102 , 105 , 106 had statistically significant results at P  < .05 for the primary outcome (30 favoring the intervention group), and only 17 29 , 33 , 43 , 47 , 48 , 50 , 58 , 61 , 69 , 76 , 79 , 80 , 82 , 86 , 99 , 101 , 106 had statistically significant results at P  < .05 for all-cause mortality (13 favoring the intervention group) ( Table 1 and eTable 1 and eTable 2 in Supplement 1 ).

A total of 35 comparisons of mega-trials vs other trials were available, 110 - 138 yielding a total of 85 point estimates coming from 69 unique mega-trials. 29 - 62 , 64 - 106 , 109 , 152 These 69 mega-trials had a median (IQR) of 15 715 (12 530-20 114) participants ( Table 2 ). The total number of smaller trials across these 35 meta-analyses was 272 (median [range], 6 [1-45] smaller trials) ( Table 2 ). Across the 35 meta-analyses, the smaller trials contributed a median (IQR) of 1639 (297-4128) participants. Of the 272 smaller trials, 133 were published before or up to the year of the first mega-trial on the respective topic. In 7 meta-analyses, 110 , 114 , 117 , 121 , 124 , 132 , 137 the cumulative sample size of all the other smaller trials exceeded the cumulative sample size of the mega-trials ( Table 2 ).

Detailed information with forest plots on all of the 35 meta-analyses 110 - 138 appears in eAppendix 4 in Supplement 1 . In the summary analysis, there was no noteworthy discrepancy between the results of the mega-trials and those of the smaller trials (summary ROR, 1.00; 95% CI, 0.97-1.04; I² = 0.0%; P for heterogeneity = .48) (eFigure 1 in Supplement 1 ). There were 2 instances in which the disagreement between the mega-trials and the respective smaller trials was beyond chance: the first 112 compared ivabradine with placebo for major adverse cardiovascular events (ROR, 1.21; 95% CI, 1.00-1.47), and the second 126 compared newer adenosine diphosphate receptor antagonists with clopidogrel for myocardial infarction (ROR, 0.83; 95% CI, 0.73-0.95).

A total of 26 comparisons of mega-trials vs other trials were available, 112 - 115 , 118 , 119 , 122 , 127 , 128 , 130 , 133 , 134 , 136 , 138 - 145 and 70 estimates coming from 65 unique mega-trials 29 , 32 - 35 , 37 - 47 , 49 - 54 , 56 - 62 , 64 - 67 , 69 , 70 , 72 - 74 , 76 - 83 , 85 , 87 - 89 , 92 - 96 , 99 , 101 - 107 , 109 , 152 were considered in these comparisons ( Table 3 ). The median (IQR) total number of participants in all of the mega-trials was 15 919 (12 524-18 857).

The total number of smaller trials in these 26 meta-analyses was 268 (median [range] per meta-analysis, 6 [1-47] smaller trials). There was a median (IQR) of 1132 (250-4038) participants from smaller trials. Of the 268 smaller trials, 117 were published before or up to the year of the first mega-trial of the respective topic. In 5 meta-analyses, 132 , 139 - 141 , 144 the cumulative number of participants in the other smaller trials exceeded the total number of participants in the mega-trials ( Table 3 ). Comprehensive details and forest plots about the 26 meta-analyses appear in eAppendix 5 in Supplement 1 .

In the summary analysis, no difference existed between the outcomes of the mega-trials and those of the smaller trials (summary ROR, 1.00; 95% CI, 0.97-1.04; I² = 0.0%; P for heterogeneity = .60) (eFigure 2 in Supplement 1 ). In one instance, testing the effects of an anti-inflammatory agent vs placebo in patients with coronary artery disease, 128 the results differed beyond chance between the mega-trials and the other smaller trials (ROR, 0.79; 95% CI, 0.65-0.97), with the mega-trial showing no effect but the meta-analysis of smaller trials showing an increased risk.

Smaller trials published before the first mega-trial showed significantly larger effects for the primary outcome than the mega-trials (ROR, 1.05; 95% CI, 1.01-1.10), with a similar direction but a nonsignificant effect for all-cause mortality (ROR, 1.03; 95% CI, 0.98-1.09) ( Figure 2 , A and B). Results of smaller trials published before the mega-trial showed significantly greater benefits than those of smaller trials published subsequently for the primary outcome (ROR, 1.10; 95% CI, 1.04-1.18), with similar results for all-cause mortality (ROR, 1.06; 95% CI, 0.98-1.15) (eFigure 3 in Supplement 1 ).

No difference was seen when results were pooled using fixed effects, when a threshold of 30 000 participants was used to define mega-trials, or when HKSJ random effects were applied. Other subgroup analyses and meta-regressions were also nonrevealing (eTable 3 and eFigures 4-13 in Supplement 1 ). No small-study effects were found in the meta-analyses for the primary outcome, and only 1 meta-analysis 140 had a significant small-study effect for all-cause mortality.

In total, we analyzed and/or described the results from 120 mega-trials. Of the 120, 41 showed a significant result for the primary outcome (33 of which favored intervention over control) and 22 showed a significant result for all-cause mortality (18 of which favored intervention over control). Of the 17 studies with noninferiority designs, 15 reached noninferiority and 2 had significantly better results in the experimental group vs the control group for the primary outcome ( Table 1 and eTable 1 and eTable 2 in Supplement 1 ).

Overall, this meta-analysis of mega-trials found that the outcomes of meta-analyses of other, smaller clinical trials aligned on average with those of the mega-trials in the clinical studies that we examined. This finding could be partly explained by the relatively large sample size of the smaller trials. However, mega-trials tended to have less favorable results than the smaller trials that preceded them, and smaller trials published after the mega-trials tended to have less favorable results than smaller trials published before the mega-trials, aligning with the mega-trials. Most mega-trials do not show statistically significant benefits for the primary outcome of interest, and statistically significant benefits for mortality are rare. Mega-trials are not available for most medical questions. Given that small trials and their meta-analyses may give unreliable, inflated estimates of benefit, mega-trials, or at least substantially large trials with sufficient power, may need to be considered and performed more frequently.

The diminished benefits in later smaller trials vs earlier smaller trials are also consistent with prior meta-research studies 146 showing that the reported effects of interventions change over time, with wider oscillations of results in early studies. It has been observed that treatment effects more frequently decrease rather than increase over time. 147 - 149 In our examined studies, the mega-trials may have corrected some inflated effects seen in the earlier trials that preceded them. The subsequent trials might then have been more aligned with what the mega-trials had shown because the mega-trials are likely to have been considered very influential.

Previous meta-research assessments have shown different levels of agreement between the results of meta-analyses of smaller trials and large clinical trials. For example, Cappelleri et al 11 reported that the results of meta-analyses of smaller studies were compatible with the results of large trials, although discrepancies were found in up to 10% of cases. However, other meta-studies on this topic 13 showed larger differences, with a discrepancy rate of up to 39%. These previous studies defined a large trial as one that had enrolled 1000 participants or more. In contrast, we used a sample size of 10 000 participants to define a mega-trial and therefore had greater power to detect differences.

This study has limitations. Several early mega-trials are not included in the ClinicalTrials.gov registry. Nevertheless, we were able to identify several of these trials because they were included in the meta-analyses of other mega-trials, and they were considered in our calculations.

Our comparative results vs smaller trials still did not include all mega-trials, because for some mega-trials retrieved in ClinicalTrials.gov, we found no relevant meta-analysis where they had been included. However, we did examine the main conclusions of these mega-trials and they also had low rates of statistically significant results. Therefore, we can conclude that mega-trials in general tend to give negative results for tested interventions.

Mega-trials may have, on average, more pragmatic designs than smaller trials. The different eligibility criteria and different populations of participants enrolled in mega-trials vs smaller trials may create differences in effect sizes. Addressing such differences in case-mix heterogeneity would require individual-level data.

Mega-trials are unlikely to be launched unless there is genuine equipoise. Nevertheless, the low rate of significant benefits, as opposed to the much higher rates of favorable results seen in typical phase 3 trials, is remarkable. 150 Previous research found more favorable results in industry-funded research. 150 , 151 Finally, our analyses depend on the accuracy and quality of data extracted from the included meta-analyses.

In this meta-research analysis, meta-analyses of smaller studies showed, in general, comparable results with mega-trials, but smaller trials published before the mega-trials gave more favorable results than the mega-trials. Mega-trials are done very sparingly to date, but it would be beneficial to add more of these trials to the clinical research armamentarium. 152 , 153

Accepted for Publication: July 12, 2024.

Published: September 6, 2024. doi:10.1001/jamanetworkopen.2024.32296

Open Access: This is an open access article distributed under the terms of the CC-BY License . © 2024 Kastrati L et al. JAMA Network Open .

Corresponding Author: John P. A. Ioannidis, MD, DSc, Meta-Research Innovation Center at Stanford (METRICS), Stanford University, 1265 Welch Rd, M/C 5411, Stanford, CA 94305 ( [email protected] ).

Author Contributions: Drs Kastrati and Ioannidis had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.

Concept and design: Kastrati, Muka, Ioannidis.

Acquisition, analysis, or interpretation of data: All authors.

Drafting of the manuscript: Kastrati, Quezada-Pinedo, Khatami, Ahanchi, Muka, Ioannidis.

Critical review of the manuscript for important intellectual content: Kastrati, Raeisi-Dehkordi, Llanaj, Quezada-Pinedo, Khatami, Llane, Meçani, Muka, Ioannidis.

Statistical analysis: Kastrati, Llanaj, Quezada-Pinedo, Khatami, Llane, Muka, Ioannidis.

Obtained funding: Ahanchi.

Administrative, technical, or material support: Quezada-Pinedo, Ahanchi.

Supervision: Ioannidis.

Conflict of Interest Disclosures: Dr Muka reported receiving grants from Merz Aesthetics; personal fees from Merz Aesthetics; and serving as cofounder and CEO at Epistudia GmbH outside the submitted work. No other disclosures were reported.

Funding/Support: This study was supported by an unrestricted gift from Sue and Bob O’Donnell to Stanford University (to Dr Ioannidis), the Swiss Government (scholarship for excellence to Dr Kastrati), University of Bern, and Insel Spital (funding to Dr Kastrati).

Role of the Funder/Sponsor: The sponsors had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Data Sharing Statement: See Supplement 2 .


Software product line testing: a systematic literature review

  • Open access
  • Published: 02 September 2024
  • Volume 29, article number 146 (2024)


  • Halimeh Agh   ORCID: orcid.org/0000-0003-0272-9092 1 ,
  • Aidin Azamnouri 1 &
  • Stefan Wagner 1 , 2  

A Software Product Line (SPL) is a software development paradigm in which a family of software products shares a set of core assets. Testing plays a vital role in both single-system development and SPL development by identifying potential faults through examining the behavior of a product or products, but it is especially challenging in SPL. There have been many research contributions in the SPL testing field; therefore, assessing the current state of research and practice is necessary to understand the progress in testing practices and to identify the gap between required techniques and existing approaches. This paper aims to survey existing research on SPL testing to provide researchers and practitioners with up-to-date evidence and issues that enable further development of the field. To this end, we conducted a Systematic Literature Review (SLR) with seven research questions in which we identified and analyzed 118 studies dating from 2003 to 2022. The results indicate that the literature proposes many techniques for specific aspects (e.g., controlling cost/effort in SPL testing); however, other aspects (e.g., regression testing and non-functional testing) are not yet adequately covered by existing research. Furthermore, most approaches are evaluated by only one empirical method, most of which are academic evaluations. This may jeopardize the adoption of approaches in industry. The results of this study can help identify gaps in SPL testing, since specific aspects of SPL Engineering have not yet been fully addressed.


1 Introduction

Software Product Line (SPL) engineering has proven to be an efficient and effective strategy to decrease implementation costs, reduce time to market, and improve the quality of derived products (Denger and Kolb 2006 ; Northrop et al. 2007 ). SPLs and Configurable Systems (Alves Pereira et al. 2020 ) are two approaches used in software engineering to manage and create software with varying levels of customization and flexibility. While both SPLs and configurable systems share the goal of offering flexibility and customization, they differ in their core approach. SPLs primarily emphasize the systematic reuse of components, architectures, and design patterns across a range of related software products. In contrast, configurable systems are single software products designed to be adaptable, enabling users to configure them to meet their unique requirements. We decided to limit the scope to SPLs to keep the review focused.

Testing is an essential part of SPL Engineering (SPLE) to identify potential faults (Pohl and Metzger 2006 ). This activity examines core assets shared among many products, product-specific parts, and the interaction among them (McGregor 2001 ). Therefore, SPL testing includes activities from the validation of initial requirements to the acceptance testing of a specific product by customers (Da Mota Silveira Neto et al. 2011 ).

As the adoption of the SPL approach by companies has grown (Weiss 2008 ), many researchers have made contributions in the SPL testing field to provide efficient and effective approaches that can satisfy specific needs of the industry (e.g., controlling the cost/effort of SPL testing). This resulted in many publications on different aspects of SPL testing. Therefore, analyzing research conducted in this field using well-known empirical methods is required to provide an overview of state-of-the-art testing practices and assess the effectiveness of the proposed approaches. To this end, Systematic Literature Reviews (SLR) and Systematic Mapping Studies (SMS) were conducted on SPL testing, but the most recent one dates back to 2014 (do Carmo Machado et al. 2014 ). While some recent research has focused on reviewing specific aspects of SPL testing, such as model-based testing of SPLs (Petry et al. 2020 ), test case prioritization for SPL (Kumar 2016 ), and combinatorial interaction testing for software product lines (Lopez-Herrejon et al. 2015 ), there has not been an SLR or SMS since 2014 that provides a comprehensive overview of the current state of SPL testing in a general context. Therefore, there is a need to update existing literature reviews (Mendes et al. 2020 ) to identify up-to-date evidence and issues that enable further development of the SPL testing field.

This paper presents an SLR to analyze interesting aspects of SPL testing that are formalized as research questions. An SLR is a rigorous and systematic method to identify, evaluate, and interpret all available research relevant to a particular research question, topic area, or phenomenon of interest (Cruzes and Dybä 2011 ). The specific aspects based on which we analyzed relevant studies are:

Characteristics of the studies focused on SPL testing.

Test levels executed throughout the SPL lifecycle.

Creating test assets by considering commonalities and variabilities.

Dealing with configuration-aware software testing.

Preserving traceability between test assets and other artifacts.

Testing non-functional requirements in an SPL.

Controlling cost/effort of SPL testing.

The SLR process was conducted from June 2022 to the end of 2022. While some of the findings derived from this SLR align with the conclusions of previous SLRs, such as the identification of existing gaps in non-functional testing for SPLs and the necessity for more robust and user-friendly testing tools, our review uncovered specific insights and unaddressed gaps in this domain that were not fully explored in prior SLRs. These include:

Variability control, referring to the disciplined management and regulation of feature variations within SPLs, alongside modeling and tracing, presents persistent challenges that require attention throughout the testing process. Variability control involves implementing strategies, such as configuration and change management, to ensure consistency and predictability in the diverse configurations of products derived from the SPL.

Novel approaches are needed for regression test selection, prioritization, and minimization, along with architecture-based regression testing, to effectively manage regression testing in SPLs.

Promoting the adoption of SPL testing practices in industrial settings necessitates addressing practical challenges, such as offering guidance for industry-specific SPL testing, and conducting industrial evaluations.

Exploring the details of test levels across the SPL lifecycle and highlighting the consequences of neglecting a particular test level can offer valuable insights for practitioners.

Studies focusing on testing SPLs rarely address traceability explicitly. Considering feature variability and configuration management, more efficient methods for modeling and representing traceability relationships are required.

The remainder of this paper is organized as follows: Sect.  2 provides background information required to understand SPL and SPL testing concepts; Sect.  3 describes how the SLR methodology has been applied; the results of the SLR are reported in Sect.  4 ; potential threats to the validity of this study and the strategies employed to mitigate them are discussed in Sect. 5 ; Sect.  6 presents a summary of the research and examines the main findings; Sect.  7 provides a survey of the related research; Sect.  8 presents concluding remarks and further research.

2 Background

This section provides a concise background on the SPL development process, variability management, and testing approaches and levels as a basis for the remainder of this article.

2.1 SPL development process

SPL is a software development paradigm to achieve economies of scale and scope by analyzing product commonalities and variabilities. As this paradigm has specific benefits such as substantial cost savings, reduction of time to market, and high productivity, many organizations, including Philips, Nokia, Cummins, and Hewlett-Packard, have adopted it (Clements and Northrop 2002 ). In SPL, a set of core assets (e.g., reference architecture and reusable components) is first developed. Specific products are then built by configuring and composing the core assets in a prescribed way with product-specific features to satisfy particular market segments (Clements and Northrop 2002 ).

The SPL development process/lifecycle can be divided into two distinct phases: Domain Engineering and Application Engineering. According to Czarnecki and Eisenecker ( 2000 , p. 20), Domain Engineering is “the activity of collecting, organizing, and storing experience in building systems or parts of systems in a particular domain in the form of reusable assets, as well as providing an adequate means for reusing these assets when building new systems.” Application Engineering is focused on deriving specific products from the core assets created during Domain Engineering; in this phase, specifics of the products are added to common parts to satisfy the particular needs of a product (Clements and Northrop 2002 ). Of these two phases, Domain Engineering demands significant resources and time. If not managed effectively, it can lead to the failure of the entire SPL (Pohl et al. 2005 , p. 9–10). Three common approaches are employed for constructing an SPL, and each of these approaches directly influences the implementation of Domain Engineering (Apel et al. 2013 ):

Proactive approaches start with a comprehensive and thorough scoping of the domain to anticipate all requirements. Subsequently, all these requirements are implemented as assets, and SPL experts typically carry out this task.

Extractive approaches follow an automated process, utilizing a set of existing product variants as input. The SPL is constructed by extracting features from these variants. Features are identified and retrieved through feature location techniques (AL-Msie’deen et al. 2013 ; Rubin and Chechik 2013 ).

Reactive approaches follow an incremental process. They take as input an existing SPL version (SPL_i) and a set of new requirements for a new product. This process results in the creation of SPL_i+1, which can produce the new product.

2.2 Variability Management in SPL

In SPL engineering, variability mechanisms are fundamental for managing diversities across products. These mechanisms, as classified by Apel et al. ( 2013 ), include annotative mechanisms, transformative mechanisms (delta-oriented), and feature-oriented mechanisms. Annotative mechanisms involve marking or annotating code to denote variability points, while transformative mechanisms, such as delta-oriented programming, describe changes required to transform one product variant into another. Feature-oriented mechanisms organize variability around features and their interactions. These variability mechanisms can be applied across all stages of the software lifecycle.
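As a toy illustration of the annotative mechanisms mentioned above (not taken from any of the reviewed studies), the following Python sketch marks code fragments with the feature they belong to and derives a product variant by keeping only the fragments of selected features; all feature and fragment names are invented.

# Each fragment is annotated with the feature it belongs to; a product variant
# is derived by keeping only the fragments whose feature is selected.
FRAGMENTS = [
    ("BASE",     "def total_price(amount):"),
    ("BASE",     "    total = amount"),
    ("DISCOUNT", "    total = total * 0.9  # included only when DISCOUNT is selected"),
    ("BASE",     "    return total"),
]

def derive_product(selected_features):
    """Compose the source of one product variant from the annotated fragments."""
    return "\n".join(code for feature, code in FRAGMENTS if feature in selected_features)

print(derive_product({"BASE"}))               # variant without the discount feature
print(derive_product({"BASE", "DISCOUNT"}))   # variant with the discount feature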

A Feature Model is commonly used in Domain Engineering to present different combinations of features. A feature model is a formal representation and graphical notation that describes the variability and relationships among features in an SPL. Feature models typically consist of features (functionalities or characteristics), feature hierarchies (representing parent-child relationships between features), and constraints (rules governing the valid combinations of features) (Pohl et al. 2005 ). Due to the presence of numerous optional features, the configuration space in feature models may exponentially increase (reaching 2^n possible configurations, where n represents the number of optional features without further constraints) (Chen and Babar 2011 ). A specific product can be derived once a complete feature configuration is established.
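To make the size of the configuration space tangible, here is a minimal Python sketch with an invented feature model of four optional features and one assumed cross-tree constraint (HD_VIDEO requires CAMERA).

from itertools import product

optional_features = ["GPS", "CAMERA", "HD_VIDEO", "BLUETOOTH"]   # n = 4 optional features

def valid(config):
    # Example cross-tree constraint: HD_VIDEO requires CAMERA.
    return not (config["HD_VIDEO"] and not config["CAMERA"])

all_configs = [dict(zip(optional_features, choice))
               for choice in product([False, True], repeat=len(optional_features))]
valid_configs = [c for c in all_configs if valid(c)]

print(len(all_configs), "raw configurations (2^n = 16)")
print(len(valid_configs), "valid configurations after applying the constraint")   # 12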

Although proactive approaches emphasize systematic upfront planning, modeling variabilities with feature and configuration models, and high asset reusability, reactive methods can also use feature models to represent variabilities introduced by new requirements. Configuration files or mechanisms are often used in reactive approaches to specifying how variabilities are configured in reaction to new requirements (Ghanam et al. 2010 ). Furthermore, extractive approaches may employ feature models to represent and visualize variabilities discovered in existing products. Configuration scripts or files may be used to document and manage variabilities found in the codebase (Parra et al. 2012 ).

2.3 Testing approaches and levels

There exist diverse approaches to software testing, including (Luo 2001 ; Jorgensen 2013 ):

Manual testing : Testers create and execute test cases manually to evaluate the behavior of a software application or system without using automated testing tools or scripts.

Automated Testing : Specialized testing tools and scripts are used to automate the execution of test cases and the verification of software applications or systems.

Functional testing : Focuses on verifying software functions according to specified requirements. This approach includes different levels of testing, including:

Unit Testing is conducted at the lowest level, focusing on the fundamental unit of software, referred to interchangeably as “unit,” “module,” or “component.”

Integration Testing takes place when two or more tested units are integrated into a larger structure. This testing assesses the interactions between components and evaluates the quality of the overall structure when the properties cannot be determined solely from its individual components.

System Testing aims to validate the comprehensive quality of the entire system, covering end-to-end functionality. This type of testing typically aligns with the functional and requirement specifications of the system. Additionally, it assesses non-functional quality attributes like reliability, security, and maintainability.

Acceptance Testing occurs when the developers deliver the completed system to the customers or users. The primary goal of acceptance testing is to give confidence that the system functions correctly rather than to uncover errors.

Non-functional testing : Focuses on evaluating the attributes of a software system that are not directly related to its functional behavior. Instead, non-functional testing assesses the system’s performance, reliability, scalability, security, usability, and other qualities that impact the overall user experience and the system’s ability to meet non-functional requirements.

Regression testing : Focuses on verifying that recent changes or updates to a software application have not introduced new defects or negatively affected existing functionality.

Model-based testing : Test cases are derived from models representing the software’s expected behavior. Different models can be used to generate test cases systematically, including graphical representations, mathematical models, or formal notations.

SPL testing is an essential activity in SPLE to identify potential faults (Pohl and Metzger 2006 ). Exhaustive testing in SPL is usually infeasible due to a combinatorial explosion in the number of products. Following Tevanlinna et al. ( 2004 ), Reuys et al. ( 2005 ), Käköla and Dueñas ( 2006 ), there are specific differences between single-system testing and SPL testing:

Testing is a part of both phases: Domain Engineering and Application Engineering. Domain testing is focused on testing domain artifacts (e.g., requirements, features, and source code); however, as domain artifacts include variability, completely testing the domain artifacts in domain testing is impossible. Application testing aims to detect remaining faults in a derived product mainly caused by unexpected interactions.

Test assets created in Domain Engineering (e.g., test cases, test scenarios, test results, and test data) are reused in Application Engineering to test instantiated products. To this end, test assets should be created by considering variability, which we call variant-rich test assets.

3 Systematic literature review methodology

To carry out this SLR, we followed guidelines for performing SLRs in software engineering (Kitchenham and Charters 2007 ). The steps followed in conducting this SLR are developing a review protocol, conducting the review, analyzing the results, reporting the results, and discussing the findings. The review protocol used in this SLR is explained in the following subsections. The protocol includes the formulation of research questions to achieve the objective (Sect.  3.1 ), identification of sources to extract the research papers, the search criteria and principles for selecting the relevant studies (Sect.  3.2 ), specifying a set of criteria to assess the quality of each study remained for data extraction (Sect.  3.3 ), and developing the template used for extracting data (Sect.  3.4 ).

3.1 Research questions

As previously stated, this study aims to investigate how the existing approaches deal with testing in SPL. To formulate research questions, we examined topics addressed by previous research on SPL testing (Pérez et al. 2009 ; Engström and Runeson 2011 ; Da Mota Silveira Neto et al. 2011 ; do Carmo Machado et al. 2014 ). Some of the research questions were completely reused from previous research – i.e., RQ1, RQ2, RQ3, RQ6, and RQ7 – and some of them were formulated by analyzing specific aspects that have not been investigated in detail in previous research – i.e., RQ4 and RQ5.

We reuse RQs to contrast and compare the newer research contributions with the results of previous SLRs. Yet, we identified two unique, interesting aspects: Because testing every potential configuration of an SPL is often impractical, it becomes essential to employ specific approaches for identifying valid and invalid configurations. We have examined the techniques utilized or proposed in RQ4 to address this issue. Maintaining traceability between test assets and other SPL artifacts offers substantial advantages, including enhanced reusability, impact analysis, and change management. Consequently, we designed RQ5 to investigate the techniques employed for preserving traceability. Answering these questions led to a detailed investigation of the identified studies to specify practical and research issues regarding SPL testing; therefore, the results of this study can support both industrial and academic activities. The research questions are as follows:

RQ1. How is the research on SPL testing characterized? This question intends to discuss the bibliometrics of the primary studies and the evidence available to adopt the proposed approaches.

RQ2 . What levels of tests are usually executed throughout the SPL lifecycle (i.e., Domain Engineering and Application Engineering)? There are different levels of tests, and each level is associated with a specific development phase, including unit, integration, system, and acceptance tests (Ammann and Offutt 2008 ; Jaring et al. 2008 ). This question aims to specify different test levels usually executed throughout the SPL lifecycle.

RQ3 . How are test assets created by considering commonalities and variabilities? The large number of variation points and variants in an SPL increases the number of possible testing combinations. Creating test assets for all combinations of functionality is almost impossible in practice; therefore, test assets must be created by considering commonality and variability so that they can be reused as much as possible. Furthermore, an undetected error in common core assets of an SPL can be spread to all instances depending on those assets (Pohl and Metzger 2006 ); therefore, creating test assets by considering commonalities and variabilities and testing common aspects as early as possible is essential. Answering this question led to investigating how testing approaches handle commonality and variability throughout creating/executing test assets.

RQ4 . How do SPL approaches deal with configuration-aware software testing? Testing all functionality combinations in an SPL is impossible and unnecessary since some combinations are invalid based on the constraints defined between configuration parameters. This question is intended to specify ways/techniques to detect valid and invalid combinations of configuration parameters.

RQ5 . How is the traceability between test assets and other artifacts of SPL preserved throughout the SPL lifecycle? The reusability of test assets is essential to manage the complexity of SPL testing; preserving traceability between test assets and requirements/implementation can enhance the reusability of test assets. In this sense, this question is intended to identify specific ways/techniques to achieve traceability between test assets and other artifacts throughout the SPL lifecycle.

RQ6 . How are Non-Functional Requirements (NFRs) tested in SPL? NFRs such as security, reliability, and performance are very important for SPLs, and ignoring these requirements can lead to negative results (e.g., economic loss) (Nguyen 2009 ). Therefore, systematically testing NFRs by considering commonalities and variabilities is an important aspect of SPLE. This question is intended to investigate how tests of NFRs are performed in an SPL.

RQ7 . What mechanisms have been used for controlling cost/effort of SPL testing? As SPL testing is more expensive than single-system testing, identifying specific techniques to reduce effort can provide the reader with an initial list of techniques identified by analyzing the selected studies. The specified list can be enriched regarding new publications about SPL testing.

3.2 Identification of relevant literature

The process of gathering and selecting primary studies has been performed in three stages: in the first stage, we investigated previously published literature reviews on SPL testing (Pérez et al. 2009 ; Engström and Runeson 2011 ; Da Mota Silveira Neto et al. 2011 ; do Carmo Machado et al. 2014 ) to identify the initial set of papers that have been published up to 2013. In the second stage, we updated the list of papers by searching for new papers published between 2013 and 2022; in this stage, we performed forward and backward snowballing (Webster and Watson 2002 ) to identify missing relevant papers. In the third stage, we applied inclusion and exclusion criteria to each potential primary study identified through stages one and two. Each of the three stages is explained in detail in the following subsections. We must note that we chose studies that could address at least one of the RQs while selecting primary studies. For instance, certain studies focusing on SPL verification were included because they could provide insights relevant to questions such as RQ4. An Excel file was created to be shared among the authors to document the various steps of the SLR process. This file Footnote 1 contains all the details about how we gathered and selected primary studies and how we extracted data from the chosen studies.

3.2.1 Analysis of existing reviews

By searching for existing SLRs or Systematic Mapping Studies (SMSs) on SPL testing, we found four SLRs (Engström and Runeson 2011 ; Da Mota Silveira Neto et al. 2011 , Pérez et al. 2009 ; do Carmo Machado et al. 2014 ). Engström and Runeson ( 2011 ) conducted an SMS to identify useful approaches and needs for future research; in this study, 64 papers published up to 2008 were surveyed. Da Mota Silveira Neto et al. ( 2011 ) performed an SMS to investigate state-of-the-art testing practices in SPL testing; this study analyzed a set of 45 publications from 1993 to 2009. Pérez et al. ( 2009 ) conducted an SLR to identify experience reports and initiatives carried out in the SPL testing area; in this study, 23 primary studies published up to 2009 were analyzed. do Carmo Machado et al. ( 2014 ) conducted an SLR by analyzing 49 studies published up to 2013. As the four studies followed a systematic process to gather and select the primary studies, we are confident that they covered all the primary studies in the SPL testing field published up to 2013. Using the list of primary studies in the four SLR/SMS, a set of 181 potentially relevant papers was identified, shown as stage 1.1 in Fig.  1 . By reading the titles and abstracts of the publications, papers that addressed none of the research questions were excluded. Furthermore, duplicated papers were removed, i.e., those included in more than one literature review. At the end of this stage, 97 studies were finally selected, shown as stage 1.2 in Fig.  1 .

Figure 1. The process of gathering and selecting primary studies

3.2.2 Gathering recent publications

In the second stage of the search process, we updated the list of primary studies by analyzing papers published between 2013 and 2022 using the following databases: IEEE Xplore, Scopus, ACM DL, Springer, and Wiley online library. To answer the stated research questions, we identified the keywords that had to be used in the search process. Variants of the terms “ software product line ”, “ software product family ”, and “ software testing ” were applied to compose the search query, as follows:

(Software Product Line OR Software Product Lines OR Software Product Family OR Software Product Families) AND (Test OR Testing) .

To evaluate the search string, we first performed a limited manual search to see whether the results of that search were among the results obtained by running the search string. The search string was adapted based on the syntax requirements of each data source used. Table  13 in Appendix A shows the forms of search strings applied to different engines and the number of papers extracted from each data source.

We obtained a set of 2,608 papers by running the search string on the search engines, shown as stage 2.1 in Fig.  1 . We excluded 161 papers as duplicates since they were retrieved from multiple search engines. Furthermore, by reading the titles and abstracts of the remaining papers, a set of 2,125 papers was identified as irrelevant since they considered testing from a single-system development perspective, not an SPL point of view. At the end of this step, we had 322 papers, shown as stage 2.2 in Fig.  1 .
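The de-duplication and title/abstract screening steps described above can be sketched as follows; the records, field names, and keyword patterns are invented for illustration and do not reproduce the authors' tooling.

import re

SPL_TERMS = re.compile(r"software product (line|family|lines|families)", re.I)
TEST_TERMS = re.compile(r"\btest(ing)?\b", re.I)

def normalise(title):
    """Crude key for duplicate detection: lowercase and strip non-alphanumerics."""
    return re.sub(r"[^a-z0-9]", "", title.lower())

records = [
    {"title": "Testing variability in software product lines", "source": "IEEE Xplore"},
    {"title": "Testing Variability in Software Product Lines.", "source": "Scopus"},   # duplicate
    {"title": "A survey of unit testing in microservices", "source": "ACM DL"},        # irrelevant
]

seen, unique = set(), []
for record in records:
    key = normalise(record["title"])
    if key not in seen:
        seen.add(key)
        unique.append(record)

relevant = [r for r in unique
            if SPL_TERMS.search(r["title"]) and TEST_TERMS.search(r["title"])]
print(len(records), "retrieved;", len(unique), "after de-duplication;",
      len(relevant), "potentially relevant")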

In the next step, we conducted both backward and forward snowballing by examining the reference lists of all the identified papers and exploring the papers that have cited these identified papers, respectively. Following this step, 70 additional papers (20 via backward snowballing and 50 via forward snowballing) were added to the previously identified set of papers, shown as stage 2.3 in Fig.  1 . At the end of stage 2, we had a set of 392 new publications, shown in Fig.  1 as stage 2.4.

3.2.3 Primary study selection strategy

By merging the results of the two previous stages, a set of 477 papers was composed, shown as stage 3.1 in Fig.  1 . Throughout the merging process, we identified 12 papers as duplicates because the year 2013 was considered in both the SLR conducted by do Carmo Machado et al. ( 2014 ) and in the automated search stage. We defined a set of inclusion and exclusion criteria to assess each potential primary study; the criteria are presented in Table  1 . These criteria were applied to the titles and abstracts of the identified papers. The first author performed this stage. However, to reduce the researcher bias, the results of this stage were validated by the second and third authors of this paper.

At this stage, we initially applied inclusion criteria to select papers meeting all of the specified criteria for inclusion. Following this, we applied exclusion criteria to exclude papers that met one or more of the specified exclusion criteria. We included only papers evaluated via at least one empirical method, including Case study, Survey, Experiment, and Observational study (Wohlin et al. 2003 ; Sjoberg et al. 2007 ; Zhang et al. 2018 ). At the end of this stage, a set of 161 papers were selected to be subject to full-text reading, depicted in Fig.  1 as stage 3.2. The analysis results of the papers, conducted based on the inclusion and exclusion criteria, are accessible within the replication package.

3.3 Quality assessment

Quality assessment of candidate studies is recommended to be performed to ensure that studies are impartially assessed for quality (Kitchenham et al. 2016 ). To this end, we used a set of quality criteria to examine the studies, shown in Table  14 in Appendix B. These criteria were reused from the criteria proposed by Dybå and Dingsøyr ( 2008 ) and cover four main aspects related to quality, including:

Reporting : Reporting of the study’s rationale, aims, and context.

Rigor : Has a thorough and appropriate approach been applied to key research methods in the study?

Credibility : Are the findings well-presented and meaningful?

Relevance : How useful are the findings to the software industry and the research community?

We used a weighting approach to examine the candidate studies in which two optional answers with their respective score were given for each question: “Yes” = 1, and “No” = 0. Then, we assigned a quality assessment score to each study by summing up the scores given to all the questions; the total quality score for each study ranged from 0 (very poor) to 11 (very good). The two authors assessed the papers, and any discrepancies were resolved by holding sessions with all the authors.

The first three criteria shown in Table  14 in Appendix B were used as the minimum quality threshold of the review to exclude non-empirical research papers. To this end, if question 1, or both of questions 2 and 3, received a “0” response, we did not continue the quality assessment process, and the paper was excluded. The results of the quality assessment for each paper are available in the replication package. Consequently, 43 papers were excluded, and 118 were selected as primary studies, shown in Fig.  1 as stage 3.3. The list of primary studies is presented in Table  15 in Appendix C.
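The scoring scheme and minimum-threshold rule described above can be expressed as a small sketch; the example answers are invented.

def quality_score(answers):
    """answers: dict mapping question number (1-11) to 1 ('Yes') or 0 ('No')."""
    return sum(answers.values())

def passes_minimum_threshold(answers):
    # A paper is excluded if question 1 is 'No', or if both questions 2 and 3 are 'No'.
    if answers[1] == 0:
        return False
    if answers[2] == 0 and answers[3] == 0:
        return False
    return True

candidate = {q: 1 for q in range(1, 12)}   # 11 criteria, initially all answered 'Yes'
candidate[7] = 0                           # one criterion not satisfied
if passes_minimum_threshold(candidate):
    print("included as primary study, quality score =", quality_score(candidate))   # 10 of 11
else:
    print("excluded before completing the quality assessment")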

The analysis of the studies based on quality assessment criteria is explained in more detail in Appendix E. In summary, concerning Reporting, most of the studies performed well. While the context description could be better in some studies, approximately 82% have clear research objectives, and all studies are based on research. On average, the studies performed reasonably well in terms of Rigor. Researchers have justified the research design in almost 62% of studies to accomplish the research’s goals. A base approach has been compared with the proposed approach in around 60% of studies, with the researchers attempting to prove that the selected controls reflect a defined population. Despite these promising findings, 32% of the studies fail in rigor. According to the credibility issue, around 95% of the studies discuss the results in relation to the research questions and highlight the study’s limitations. Most studies, however, need to establish relationships between the researcher and participants and the data collection that addresses the research problem. Regarding Relevance, about 97% of studies explicitly discuss SPL testing and how it contributes to existing knowledge, identifies new areas for research, and explains how the results can be used. Nevertheless, practitioner-based guidelines are present in about 15% of cases, indicating that more practical guidance is needed to strengthen industry adoption of SPL testing.

3.4 Data extraction and analysis

Data was extracted from each of the 118 primary studies during this stage. To this end, we used a predefined extraction form that enabled us to record the full details of the studies and be specific in answering research questions. The extraction form is shown in Table  2 . The first two authors conducted the process of reading and completing the extraction form; the data were extracted and stored in a spreadsheet after reading each paper and shared with all the authors. We followed the content structuring / theme analysis approach of Mayring ( 2014 ) to analyze the data. The types of extracted data from the extraction form already provided us with a list of themes and the corresponding extracted data for these themes. This step was deductive. In the next step, we inductively created categories in the themes to summarize them. All the authors held multiple sessions to discuss the intermediate results and resolve any potential discrepancies.

In the following sections, the data extracted from the primary studies is used to answer the research questions. An overview of the primary studies is first provided in Sect.  4.1 . Then, we answer each RQ via the extracted data.

4.1 Characteristics of the studies (RQ1)

This section discusses the bibliometrics of the primary studies, the evidence available to adopt the proposed approaches, and the results of the evaluations conducted based on the quality assessment criteria.

4.1.1 Bibliometrics

In this section, we analyze annual trends and distribution per venue type of the studies selected.

Annual trend:

The distribution of the primary studies according to publication year is shown in Fig.  2 . No publication prior to 2003 focuses on SPL testing. However, after 2003, there was at least one paper per year, except for 2004. As seen in Fig.  2 , the number of published papers in this field has generally increased over time (2003–2019). This indicates that the SPL testing field has attracted the attention of many researchers in the last few years. Furthermore, it shows increasing attention to the use of empirical methods to assess the value of proposed approaches, since we only included empirically evaluated studies in our review. As we excluded some of the papers based on the quality assessment criteria, there is no primary study published in 2004 that satisfies the minimum quality threshold of the review. Furthermore, the number of papers published in some years (e.g., 2013) was actually higher than presented in Fig.  2 ; however, some of those papers were excluded during the assessment against the quality criteria. It is worth mentioning that some studies may not have been made available by search engines by the time the search was performed (August 2022), and thus we did not consider them in this review. We have specified these studies in the replication package. For comparison, Fig.  2 also shows the overall publication trend, quantified by all DBLP entries for each year. As the figure shows, the trend in SPL testing is well above the overall trend in several years (2014, 2016, 2017, and 2019). However, it has been decreasing in recent years.

Figure 2. Distribution of primary studies by year

Distribution per venue:

Most of the primary studies were published in conferences; of the 65 conference papers, 17 (∼ 26%) were published in SPLC (Footnote 2), the most representative conference for the SPL engineering area. This indicates that SPLC is an important venue for SPL testing research and the single most common publication venue among the primary studies. Also, 31% of studies were published in journals, 7% in symposiums, and 5% in workshops.

4.1.2 Analyzing the evidence available to adopt the proposed approaches

As reported in the title or the text of the studies, case studies, experiments, and expert surveys are the specific methods that have been used for evaluating primary studies. Most of the primary studies were evaluated by conducting an experiment (∼ 58%). It is worth mentioning that five studies applied more than one evaluation method, including case study and expert survey (Bucaioni et al. 2022 ), case study and experiment (Akbari et al. 2017 ; Fragal et al. 2019 ), experiment and expert survey (Hervieu et al. 2016 ), and case study, experiment, and expert survey (Wang et al. 2017 ). Table  3 shows the primary studies that have used each type of evaluation method.

Although the studies reported that their proposed approaches were evaluated by using the mentioned empirical methods, we need to analyze the strength of the evidence available to adopt the proposed approaches. The results of this analysis can help researchers to find new topics for empirical studies, and practitioners to assess the maturity of a proposed approach. Kitchenham and Charters ( 2007 ) classified the study design into five levels, based on the evidence presented in medical research.

Alves et al. ( 2010 ) revised the classification to be applicable in their study; the revised classification is fully applicable in our review. The following hierarchy is used in our study (from weakest to strongest):

No evidence.

Evidence obtained from demonstration or working out toy examples.

Evidence obtained from expert opinions or observations.

Evidence obtained from academic studies, e.g., controlled lab experiments.

Evidence obtained from industrial studies, e.g., causal case studies.

Evidence obtained from industrial practice.

Based on the evidence evaluation scheme explained above, the results of the evaluation of how much evidence is available to adopt the proposed approaches are presented in Table  16 in Appendix D. All the studies have been evaluated with at least one kind of evaluation method. Academic studies (Lev4) are the most used evaluation method (60%), where open-source repositories are usually utilized to assess the proposed approaches. This is followed by demonstration (Lev2) (∼ 17%). Only a small number of studies have been evaluated using industrial systems or real data sets (∼ 16%) (industrial studies, Lev5), or by applying the proposed methods in industrial settings and involving industrial professionals (∼ 13%) (industrial practice, Lev6). This analysis shows an overall low level of evidence in the SPL testing field, which is in line with the results of the SLR conducted by do Carmo Machado et al. ( 2014 ).

4.2 Test levels executed throughout the SPL lifecycle (RQ2)

We divided SPL testing according to the two common phases of SPLE: Domain Engineering and Application Engineering. Based on the analysis of the studies, there are two types of testing activities that are performed during Domain Engineering: (1) developing test assets so they can be instantiated in Application Engineering, (2) applying tests to assets produced during Domain Engineering to detect faults in common core assets as soon as possible. By analyzing studies that are focused on the second activity, we identified two levels of tests usually performed in Domain Engineering; distribution of studies based on the test levels is shown in Table  4 :

Unit testing : Out of 118 studies, three studies focus only on this level of testing (Jaring et al. 2008 ; Kim et al. 2011 , 2012 ). Jaring et al. ( 2008 ) classified test levels based on the binding time of variabilities. Based on this study, unit tests are performed before variant binding; therefore, we included this study in this classification since Application Engineering is the phase in which variabilities are bound to derive a specific product. Kim et al. ( 2011 ) and Kim et al. ( 2012 ) proposed specific methods in which analysis at the code level is performed to generate test suites for testing common parts of an SPL in Domain Engineering.

Integration testing : Execution of integration tests in Domain Engineering is examined in the studies by Reis et al. ( 2007 ), Neto et al. ( 2010 ) and Akbari et al. ( 2017 ). Reis et al. ( 2007 ) proposed a model-based, automated technique for integration testing in Domain Engineering. In the proposed technique, integration test case scenarios are generated to support the testing of interactions between the components of an integrated sub-system; placeholders are also created for necessary variable parts and for all components that are not part of the integrated sub-system. Neto et al. ( 2010 ) presented a regression testing approach for SPL architectures to maintain the correctness and reliability of the architecture after modifications; as the main purpose of the approach is to verify the integration among modules and components that compose the SPL architecture, we included this study in this classification. Akbari et al. ( 2017 ) proposed a method for prioritized selection and execution of integration test cases in both Domain Engineering and Application Engineering.

Specific testing activities that are conducted in Application Engineering are: Creating specific product test assets by selecting and instantiating domain test assets, designing additional product-specific tests, and executing tests (Da Mota Silveira Neto et al. 2011 ). It is worth mentioning that some of the studies are focused on reducing the number of products that need to be tested by using specific techniques like pairwise testing (e.g., Matnei et al. 2016 ). In addition, some studies are focused on product prioritization to enhance the efficiency of SPL testing (e.g., Parejo et al. 2016 ). Once a set of configurations/products are selected/prioritized for testing, their behavior needs to be tested using a specific mechanism, e.g. executable unit tests (Parejo et al. 2016 ). Studies that are focused only on the first step (selecting/prioritizing configurations) do not usually consider a specific level of test. The testing levels usually performed in Application Engineering, as shown in Table  4 , are as follows:

Unit testing : Some of the studies considered executing unit tests in Application Engineering (Bürdek et al. 2015 ; Li et al. 2018 ; Souto and d’Amorim 2018 ; Jung et al. 2019 , 2020 ; Lochau et al. 2014 ). Bürdek et al. ( 2015 ) proposed a white-box test-suite derivation mechanism for SPLs, specifically for unit testing, in which test specifications are extended with a presence condition. A presence condition constrains the set of configurations for which a specific test case is valid; this information is used for testing configurations in Application Engineering (a toy illustration of this idea is sketched after this list). Li et al. ( 2018 ) investigated test cases generated for one product that are reused for another product of the SPL by applying two categories of structure-based criteria, control-flow and data-flow. Souto and d’Amorim ( 2018 ), Jung et al. ( 2019 ) and Jung et al. ( 2020 ) identify unit test cases to be selected for regression testing.

Integration testing : As shown in Table 4 , this level of testing has been considered by a larger number of studies (27 studies). Some studies do not explicitly mention this level of testing; however, they state that the untested parts of the framework are tested during Application Engineering (Scheidemann 2006 ; Al-Dallal and Sorenson 2008 ; Jaring et al. 2008 ). Some of the studies consider the selection of integration test cases during Application Engineering (e.g., Jung et al. 2019 ).

System/Acceptance testing : This level of testing has also been considered by a large number of studies (28 studies), as shown in Table 4 . In most studies, test models designed throughout Domain Engineering are instantiated to derive specific system test cases (e.g., Olimpiew and Gomaa 2009 ). Arrieta et al. ( 2015 ) split the lifecycle of cyber-physical system product lines into three phases: Domain Engineering, Application Engineering, and Simulation. Execution of system test cases is performed in the Simulation phase; however, as we classified the SPL lifecycle into Domain Engineering and Application Engineering, we included this study in this category.

4.3 Creating test assets by considering commonalities and variabilities (RQ3)

Creating test assets that take commonality and variability into account is essential in SPL testing: it enhances the reusability of test assets and reduces the probability of undetected errors in common core assets by testing them as early as possible. Out of 118 papers, 25 primary studies (∼ 21%) provide contributions that handle variability in a range of different manners. We conducted an exploratory analysis to identify shared characteristics among the approaches and subsequently categorized them into three categories: model-, specification-, and requirements-based approaches. The distribution of studies based on these categories is shown in Table 5 .

Model-based approaches : In model-based approaches, a set of techniques is used to design and execute tests for SPLs by leveraging formal or semi-formal models of the SPL’s variability. In the examined studies, the following methods are employed to incorporate variability into test models:

Adaptation of UML models or integrating them with the feature model to produce test models including variability : In several studies (Reuys et al. 2005 , 2006 ; Reis et al. 2007 ; Olimpiew and Gomaa 2009 ), activity diagrams are extended using specific mechanisms (e.g., stereotyping specific elements) to contain variabilities and are then used as test models to create domain test case scenarios. Ebert et al. ( 2019 ) developed a common platform in Domain Engineering that contains all elements required for producing products. This study uses the SMArDT methodology (Drave et al. 2019 ) to elaborate each functionality defined in the platform via an extended version of the activity diagram; generic test cases are then created for each functionality based on the SMArDT methodology. Reis et al. ( 2006 ) propose the ScenTED-PT technique, in which the requirements and the architecture of the system are specified by UML models supplemented with performance requirements; a test model is then created from which performance test case scenarios are derived.

Lochau et al. ( 2012a ) and Lackner et al. ( 2014 ) proposed to use the statechart modeling approach as a basis for capturing commonalities and variabilities of product implementations in an SPL; a 150% statechart model and the feature model are integrated to produce a reusable test model. The 150% statechart model contains the behavioral specification fragments of every feature without considering constraints between features, whereas a 100% statechart model is a specific instantiation of the 150% model that takes the dependencies and constraints into account (Lochau et al. 2012a ).
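
For illustration, the following minimal Python sketch derives a 100% test model from a 150% model by keeping only the transitions whose feature annotations are satisfied by a given configuration. The state machine, feature names, and annotation scheme are made-up assumptions and not the notation of the cited studies.

```python
# Minimal sketch: deriving a 100% test model from a 150% model.
# The transitions, events, and feature names are illustrative only.

# 150% model: each transition is annotated with the feature that enables it
# (None marks transitions common to all products).
transitions_150 = [
    ("Idle",     "Scanning", "start",    None),
    ("Scanning", "Paying",   "checkout", None),
    ("Paying",   "Receipt",  "payCard",  "CardPayment"),
    ("Paying",   "Receipt",  "payCash",  "CashPayment"),
    ("Receipt",  "Idle",     "printQR",  "DigitalReceipt"),
    ("Receipt",  "Idle",     "print",    None),
]

def derive_100_percent(transitions, selected_features):
    """Keep only transitions whose annotation is satisfied by the configuration."""
    return [t for t in transitions if t[3] is None or t[3] in selected_features]

if __name__ == "__main__":
    configuration = {"CardPayment", "DigitalReceipt"}  # one product of the SPL
    for src, dst, event, _ in derive_100_percent(transitions_150, configuration):
        print(f"{src} --{event}--> {dst}")
```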

Using/defining different modeling notations to capture variabilities and using them to produce test assets : In this category of model-based approaches, specific modeling notations have been used or defined to create variant-rich test models. Tuglular et al. ( 2019 ) introduced Featured Event Sequence Graphs (FESGs) to explicitly capture behavioral variability in SPLs. Gebizli and Sözer ( 2016 ) used hierarchical Markov chains to model system usage; as this model captures all possible usage scenarios for a family of systems, it is considered a reference test model. Bucaioni et al. ( 2022 ) define specific metamodels and languages to capture test variabilities, including an SPL metamodel (SPLmm), a Products metamodel (Pmm), a Weaving metamodel (Wmm) to link features and signals in Pmm to those in SPLmm, a Test case DSL (TcDSL), and a Test Script generation Transformation (TsT). Fragal et al. ( 2019 ) use Featured Finite State Machines (FFSMs) to represent the abstract behavior of an SPL; in this study, the HSI method (Luo et al. 1995 ) has been extended to generate a single configurable test suite for an SPL. Luthmann et al. ( 2019a ) extended the concept of Timed Automata (TA) with feature constraints and configurable parameters to facilitate efficient verification of real-time properties for SPLs. Lochau et al. ( 2012b ), Lachmann et al. ( 2016 ), and Lity et al. ( 2019 ) apply the principles of delta modeling (Schaefer et al. 2010 ) to state machine test models to explicitly capture behavioral commonality and variability between product variants and, subsequently, between their test assets. In delta-oriented testing techniques, one product is considered the base product and delta modules specify the changes that should be applied to the base product to produce new ones (Schaefer et al. 2010 ). Beohar and Mousavi ( 2016 ) introduce the concept of Input-Output Featured Transition Systems (IOFTSs); IOFTSs are labeled transition systems with logical constraints on the presence or absence of features and are used as test models. Lochau et al. ( 2014 ) introduced delta-oriented architecture test modeling as a means to systematically reuse common component and integration test elements across various system variants; they employed delta-oriented test artifact reuse and regression test planning to facilitate the systematic evolution of variable test elements among incrementally tested versions and/or variants of a software system.
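
The delta-modeling idea can be sketched in a few lines of Python: a base test model plus delta modules that remove and add transitions yield a product-specific test model. The base model and the delta module below are hypothetical examples, not artifacts from the cited delta-oriented approaches.

```python
# Minimal sketch of delta-oriented derivation of a product-specific test model.
# The base model and delta module are hypothetical examples.

base_model = {
    ("Idle", "start", "Scanning"),
    ("Scanning", "checkout", "Paying"),
    ("Paying", "payCash", "Done"),
}

# A delta module removes some transitions of the base model and adds new ones.
delta_card_payment = {
    "remove": {("Paying", "payCash", "Done")},
    "add": {("Paying", "payCard", "Authorizing"), ("Authorizing", "ok", "Done")},
}

def apply_deltas(base, deltas):
    """Apply delta modules in order to derive a product-specific test model."""
    model = set(base)
    for delta in deltas:
        model -= delta["remove"]
        model |= delta["add"]
    return model

if __name__ == "__main__":
    for transition in sorted(apply_deltas(base_model, [delta_card_payment])):
        print(transition)
```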

Specification-based approaches : In these approaches, specific links are defined between different configurations of an SPL and, therefore, between test cases designed for both shared and variable components of the products. Mishra ( 2006 ) uses the process algebraic specification language CSP-CASL (Roggenbach 2006 ) to formally specify the system; then, enhancement relationships are established between the specifications of products. In this way, test cases generated for the common parts are reused between products, and new test cases are generated for the differences in the specification. Uzuncaova et al. ( 2010 ) describe properties of features as first-order logic formulas in Alloy (Jackson 2012 ); by considering a product as a base, test cases are generated for the base product using Alloy Analyzer. For each new product, the test cases from previous products are reused/refined based on the differences in the specifications.

Requirement-based approaches : In these approaches, variability is considered as early as possible so that it can be used to design test cases. In several primary studies, use case modeling is the approach used for representing requirements (Nebut et al. 2006 ; Araújo et al. 2017 ; Hajri et al. 2020 ). Nebut et al. ( 2006 ) enhance use cases with parameters and contracts to represent variability at the level of requirements; test-related artifacts (e.g., test objectives, test scenarios, and behavioral test patterns) are produced based on the enhanced use cases. Araújo et al. ( 2017 ) express use case specifications in a controlled natural language by considering variabilities; the specifications are then used for generating test procedures and their inputs and outputs. Hajri et al. ( 2020 ) propose to use the Product line Use case modeling Method (PUM), which supports variability modeling in use case diagrams; by using a requirement traceability mechanism, test cases for a new product are generated by reusing/adapting existing test cases or by defining new test cases.

Kang et al. ( 2015 ) propose a method called Systematic Software Product Line Test - Data (SSPLT-D) in which a set of platform test requirements are first defined throughout Domain Engineering and then platform test scenarios, platform test cases, and platform test data are created based on test requirements. Nebut et al. ( 2003 ) propose to derive a set of behavioral test patterns from the requirement model and then use them to produce product-specific test cases.

4.4 Dealing with configuration-aware software testing (RQ4)

Dealing with configuration-aware software testing, i.e., detecting valid and invalid combinations of configuration parameters, is paramount in SPL approaches because testing all combinations of SPL functionalities would be impossible and unnecessary. In our investigation, 41 out of 118 papers (∼ 35%) have addressed this issue. These papers have employed three methods to distinguish between valid and invalid configurations; the distribution of studies based on these methods is shown in Table 6 :

Using/proposing specific approaches/algorithms/tools to produce valid configurations : Some studies utilize constraint programming, which is used for modeling and solving constraint satisfaction problems, to generate configurations that satisfy all cross-tree constraints imposed by the feature model (Hervieu et al. 2011 ; Marijan et al. 2013 ). In the same way, Kim et al. ( 2013 ) and Akbari et al. ( 2017 ) propose constraint-handling approaches to produce valid configurations; as an example, Kim et al. ( 2013 ) propose an algorithm called SPLat that dynamically prunes irrelevant configurations by handling constraints.

Using formal methods to check the cross-tree constraints defined in feature models, and thereby the relations between features, is another way to find and produce valid configurations (Lackner et al. 2014 ; Lopez-Herrejon et al. 2014 ; Beohar and Mousavi 2016 ; Parejo et al. 2016 ; Ferrer et al. 2017 , 2021 ; Akimoto et al. 2019 ; Arrieta et al. 2019 ; Jakubovski Filho et al. 2019 ; Luthmann et al. 2019b ; Ibias et al. 2022 ). For example, Lackner et al. ( 2014 ) transform a feature model into propositional formulas so that any variable assignment that satisfies the formula is a valid configuration of the product line.
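
As a toy illustration of this encoding, the sketch below expresses the constraints of a small, made-up feature model as a Boolean predicate and enumerates its satisfying assignments, i.e., its valid configurations. The cited studies hand such formulas to SAT solvers rather than enumerating assignments exhaustively.

```python
# Minimal sketch: a feature model encoded as a propositional formula.
# Any assignment satisfying the formula is a valid configuration.
# The feature model is a made-up example.
from itertools import product

FEATURES = ["Base", "Payment", "Card", "Cash", "Receipt"]

def is_valid(c):
    """Root, parent-child, or-group, and cross-tree constraints of the example model."""
    return (c["Base"]                            # root feature is mandatory
            and c["Payment"]                     # mandatory child of the root
            and (c["Card"] or c["Cash"])         # or-group under Payment
            and (not c["Card"] or c["Payment"])  # children imply their parent
            and (not c["Cash"] or c["Payment"])
            and (not c["Receipt"] or c["Card"])) # cross-tree: Receipt requires Card

valid_configurations = [
    dict(zip(FEATURES, values))
    for values in product([True, False], repeat=len(FEATURES))
    if is_valid(dict(zip(FEATURES, values)))
]

if __name__ == "__main__":
    print(len(valid_configurations), "valid configurations")
    for conf in valid_configurations:
        print(sorted(f for f, selected in conf.items() if selected))
```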

Several studies suggest the utilization of sampling algorithms and techniques to generate valid configurations (Oster et al. 2010 ; Lochau et al. 2012a ; Patel et al. 2013 ; Yu et al. 2014 ; Al-Hajjaji et al. 2016 , 2019 ; Lee and Hwang 2019 ). Combinatorial Interaction Testing (CIT) is among the commonly used sampling techniques; in CIT, design-time decisions about variability are taken into account to exclude invalid interactions between features. For example, Oster et al. ( 2010 ) and Lochau et al. ( 2012a ) propose a pairwise algorithm in which the dependencies and constraints between each pair of features are considered to generate all possible products that cover all valid pairs of features and their potential interactions. Saini et al. ( 2022 ) introduced a distance-based method for recognizing invalid configurations: in an initial phase, specific CIT algorithms are employed to generate actual configurations; desired configurations are then created by considering the availability of features in the configurations; finally, the approach distinguishes valid from invalid configurations by comparing the actual and desired configurations.
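
The following sketch conveys the flavor of greedy pairwise (2-wise) sampling: starting from a set of valid configurations (hard-coded toy data here), it repeatedly picks the configuration that covers the most not-yet-covered feature-value pairs. It is an illustrative simplification, not the algorithm of any particular cited study.

```python
# Minimal sketch of greedy pairwise (2-wise) sampling over valid configurations.
# The configurations are toy data; real approaches derive them from the
# feature model and its constraints.
from itertools import combinations

valid_configs = [
    {"Card": True,  "Cash": False, "Receipt": True,  "Loyalty": False},
    {"Card": True,  "Cash": True,  "Receipt": False, "Loyalty": True},
    {"Card": False, "Cash": True,  "Receipt": False, "Loyalty": False},
    {"Card": True,  "Cash": False, "Receipt": True,  "Loyalty": True},
]

def pairs(config):
    """All (feature, value) pairs covered by a single configuration."""
    return {frozenset(p) for p in combinations(sorted(config.items()), 2)}

def greedy_sample(configs):
    """Pick configurations until every pair covered by some valid config is covered."""
    to_cover = set().union(*(pairs(c) for c in configs))
    sample, covered = [], set()
    while covered != to_cover:
        best = max(configs, key=lambda c: len(pairs(c) - covered))
        sample.append(best)
        covered |= pairs(best)
    return sample

if __name__ == "__main__":
    for config in greedy_sample(valid_configs):
        print(sorted(f for f, selected in config.items() if selected))
```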

Additionally, several studies proposed tool support for their specific approaches. They used SAT solvers to generate configurations that satisfy the feature model constraints, which, in turn, reduces the configuration space to be tested (Henard et al. 2013 , 2014a , b ; Galindo et al. 2016 ; Hervieu et al. 2016 ; Souto and d’Amorim 2016 ; Fragal et al. 2019 ; Luthmann et al. 2019a ; Krieter et al. 2020 ; Xiang et al. 2022 ). Using or implementing a tool or toolkit to produce valid configurations has been proposed by Ensan et al. ( 2012 ), Al-Hajjaji et al. ( 2016 ), Arrieta et al. ( 2016 ), Al-Hajjaji et al. ( 2019 ) and Arrieta et al. ( 2019 ). For example, FeatureIDE has been used in the studies by Al-Hajjaji et al. ( 2016 ), Arrieta et al. ( 2016 ), Al-Hajjaji et al. ( 2019 ), and Arrieta et al. ( 2019 ); this tool supports both manual and automatic generation of valid configurations.

Runtime analysis : An alternative category of methods employs runtime analysis to differentiate intended from unintended interactions. Rather than relying on pre-established specifications to detect interactions, these methods examine runtime data to distinguish valid from invalid interactions (Reuys et al. 2006 ; Lochau et al. 2014 ; Rocha et al. 2020 ; Vidal Silva et al. 2020 ). As an example, Rocha et al. ( 2020 ) introduced an iterative technique called VarXplorer to inspect interactions as they emerge. When provided with a test case consisting of system inputs, VarXplorer generates a Feature Interaction Graph (FIG), a concise representation of all pairwise interactions among features. The FIG offers a visual depiction of the features that interact, the contextual data, and the relationships between features, including cases where one feature suppresses another. By employing an iterative approach to interaction detection, developers and testers can thoroughly analyze the FIG derived from all the test cases within a test suite.

It is worth mentioning that some studies only stated that the feature model is manually analyzed to consider feature dependencies and feature grouping constraints (Olimpiew and Gomaa 2009 ; Cabral et al. 2010 ).

4.5 Preserving traceability between test assets and other artifacts (RQ5)

One of the essential factors in SPL testing is the preservation of traceability between test assets and other artifacts throughout the SPL lifecycle: traceability enhances the reusability of test assets and helps manage the complexity of SPL testing. However, only a few papers, 14 out of 118 (∼ 12%), take preserving traceability into account. We categorized these papers according to the type of artifacts linked to the test assets; the distribution of studies based on this classification is shown in Table 7 :

Preserving traceability between requirements and test assets : In the majority of the studies, traceability is established between requirements, often represented using UML models (primarily use cases), and various test assets. These papers have utilized various methods, encompassing the gradual refinement of UML models into test models, direct mapping of requirements to test assets, annotation-based traceability, and the application of specific tools for automated tracing.

Reuys et al. ( 2005 ), Nebut et al. ( 2006 ), Reis et al. ( 2007 ) and Olimpiew and Gomaa ( 2009 ) use UML models to preserve the traceability between requirements and test case scenarios. In the same way, Reuys et al. ( 2006 ) enabled the traceability between different artifacts (use cases, use case scenarios, architecture scenarios, and test case scenarios) by refining use case scenarios into test case scenarios.

The manual definition of links between use cases and system test cases was mentioned by Hajri et al. ( 2020 ). Lackner et al. ( 2014 ), Gebizli and Sözer ( 2016 ) and Wang et al. ( 2017 ) created mapping relationships between variabilities modeled via the feature model and the test model to preserve traceability between requirements and test assets. Bucaioni et al. ( 2022 ) employed a metamodel to create a link between the product models and the SPL model. In this approach, the shared functionalities of the SPL are represented through a class diagram, and test cases are generated explicitly for these shared functionalities.

Adding annotations to test assets to specify their relationship with other artifacts is the approach proposed by Marijan et al. ( 2017 ): test cases are manually annotated with tags that relate them to one or more test requirements; this traceability information is then used to assess the quality of test cases with respect to requirements coverage.
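
A minimal sketch of such tag-based traceability is shown below: test cases carry requirement tags, and requirements coverage is computed from those tags. The test case names and requirement identifiers are made up for illustration.

```python
# Minimal sketch: annotation-based traceability between test cases and requirements.
# Test case names and requirement IDs are made-up examples.

test_cases = {
    "test_checkout_card":   {"REQ-12", "REQ-15"},  # tags link a test to requirements
    "test_checkout_cash":   {"REQ-12"},
    "test_digital_receipt": {"REQ-21"},
}

requirements = {"REQ-12", "REQ-15", "REQ-21", "REQ-30"}

def requirements_coverage(tests, reqs):
    """Return the requirements traced to at least one test case, and those that are not."""
    covered = set().union(*tests.values())
    return covered & reqs, reqs - covered

if __name__ == "__main__":
    covered, uncovered = requirements_coverage(test_cases, requirements)
    print("covered:  ", sorted(covered))
    print("uncovered:", sorted(uncovered))  # e.g. REQ-30 has no linked test case
```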

In some studies, specific tools are used for automated tracing (Reis et al. 2006 ; Lochau et al. 2012a ). Reis et al. ( 2006 ) use a tool named Mercury TestDirector to preserve the traceability between requirements specification, domain performance test case scenarios, and application performance test case scenarios. Lochau et al. ( 2012a ) employed Rhapsody ATG to enable traceability between requirement models and test artifacts in an automated manner.

Preserving traceability between configurations and test assets : The solution proposed by Mishra ( 2006 ) is the definition of enhancement relationships between specifications of systems (different configurations of the SPL) and, therefore, their test cases.

It is also worth mentioning that some studies have emphasized the importance of preserving traceability between test assets and other artifacts, but they provide no mechanism in this regard (Kang et al. 2015 ; Aduni Sulaiman et al. 2019 ).

4.6 Testing non-functional requirements in SPL (RQ6)

In addition to functional requirements, non-functional requirements (NFRs) should also be tested in SPLs, but only 3 out of 118 studies consider them. Various categories of NFRs have been addressed in these studies, including load testing and performance profiling (Reis et al. 2006 ), NFRs at the hardware-in-the-loop level (Arrieta et al. 2016 ), and real-time properties (Luthmann et al. 2019a ).

Reis et al. ( 2006 ) propose a technique which concentrates on load testing and performance profiling. They employ the Object Management Group’s UML Profile (Fomel 2002 ) to model performance aspects. Testing NFRs as a critical aspect of cyber-physical systems is investigated at the hardware-in-the-loop level by Arrieta et al. ( 2016 ); these requirements (e.g., the usage of memory and CPU) are modeled via the feature model and their coverage is considered by using selected test cases and the simulation process. Luthmann et al. ( 2019a ) present configurable parametric timed automata to extend the expressiveness of featured timed automata to enable efficient family-based verification of real-time properties (e.g., synchronization and execution time behaviors); the proposed modeling formalism aims to represent the behavioral variability of time-critical product lines and consider the minimum/maximum delay coverage.

4.7 Controlling cost/effort of SPL testing (RQ7)

As the cost/effort of SPL testing remains a significant concern within SPLE, numerous studies have proposed various techniques to address this issue. However, the lack of a standardized classification for these techniques has made it challenging to analyze them effectively. One notable exception is the extensive research conducted on product sampling techniques, which has been categorized into specific sub-techniques, including automatic selection, semi-automatic selection, and coverage (Varshosaz et al. 2018 ). In our analysis, we utilized these established categories to organize the diverse range of techniques proposed in the literature.

While reviewing the papers, we identified other approaches that offer potential solutions to managing the cost and effort associated with SPL testing. These approaches were categorized based on their primary contributions and grouped into distinct categories. Some of the identified approaches focus on the reuse of test assets, either from a core asset base or from previously tested products. Others provide varying degrees of automation, ranging from the implementation or utilization of specialized tools to the automation of specific testing processes, such as specification-based approaches.

Additionally, a subset of studies explored strategies for prioritizing the execution order of SPL configurations or products and the associated test cases. Another category of research aimed to minimize the size of the test suite required for testing a particular product, thereby reducing overall testing effort.

It is important to note that these techniques can often be combined. For example, test prioritization and minimization techniques can be used with sampling techniques to further optimize the cost and effort associated with SPL testing. Furthermore, the list of techniques may grow as new publications on SPL testing appear. In the rest of this section, the details of these five techniques are provided:

Reusing test assets : Based on the analysis of studies, test assets (e.g., test cases and test results) are reused in two ways, including:

Reusing test assets from a core asset base : In some studies, domain test scenarios containing variabilities are created in Domain Engineering; some of these scenarios are reused, and some of them are adapted based on the application requirements (Nebut et al. 2003 ; Reuys et al. 2005 , 2006 ; Reis et al. 2006 ). Some other studies are focused on reusing test cases by selecting them from a repository based on the application requirements (Arrieta et al. 2016 ; Wang et al. 2017 ; Lima et al. 2020 ) or by binding variabilities defined in abstract test cases based on specific criteria (e.g., coverage criteria) (Al-Dallal and Sorenson 2008 ; Olimpiew and Gomaa 2009 ; Lackner et al. 2014 ; Bürdek et al. 2015 ; Kang et al. 2015 ; Ebert et al. 2019 ; Fragal et al. 2019 ; Luthmann et al. 2019a ).

Reusing test assets between products : In some studies, test assets are reused between products by analyzing differences between the current product and previously tested products (Mishra 2006 ; Uzuncaova et al. 2010 ; Neto et al. 2010 ; Lochau et al. 2012b , 2014 ; Xu et al. 2013 ; Lachmann et al. 2015 , 2016 ; Beohar and Mousavi 2016 ; Fragal et al. 2017 ; Li et al. 2018 ; Ebert et al. 2019 ; Lity et al. 2019 ; Luthmann et al. 2019a ; Tuglular et al. 2019 ; Hajri et al. 2020 ). The technique usually used in these studies is the delta-oriented testing technique, based on regression testing principles and delta modeling concepts. By considering delta modules, test cases and test results from previously tested products can be reused and adapted for the new product.

Providing a specific level of automation : We found two ways by which the studies provide a particular level of automation:

Implementing/using a specific tool(s) : In 49 studies, authors claimed that their proposed approach is automatically performed using specific tools. However, the majority of these studies fail to provide any details regarding the specific tools employed for this purpose (e.g., Reis et al. 2006 ; Olimpiew and Gomaa 2009 ; Calvagna et al. 2013 ; Li et al. 2018 ; Safdar et al. 2021 ). Table  8 shows that only 19 of these studies have provided online access to their tools. It is worth noting that most of these tools are in the form of research prototypes. Instead of developing a novel tool tailored to their proposed approach, some studies utilize a set of pre-existing tools at various stages of their approach. For instance, in the case of Parejo et al. ( 2016 ), the Combinatorial tool and Feature Model Testing System (FMTS), as introduced by Ferreira et al. ( 2013 ), were employed to derive pairs and calculate solution fitness, respectively.

Using specific techniques that help automate the testing process : Specification-based testing was used in some studies (e.g., Mishra 2006 ) as a step towards automating the testing process because of its precision in describing the desired properties of the system under test using a formal language. Model-based testing is another approach that helps automate the testing process. For example, Bucaioni et al. ( 2022 ) introduced a model-based approach in which test scripts are generated from shared SPL features by model transformation.

Handling the selection of products to test : Testing all possible combinations of features is almost impossible in terms of resources and execution time (Cohen et al. 2006 ). Specific approaches have been proposed to determine a minimal set of configurations so that the correctness of the entire family can be inferred from the successful verification of this set. Through our examination of the studies, we have identified diverse techniques for choosing a subset of products. These techniques have been categorized according to the categories for product sampling provided by Varshosaz et al. ( 2018 ). The distribution of studies based on these techniques is shown in Table 9 :

Automatic selection : There are two general types of automatic selection techniques, including Greedy and Meta-heuristic search:

Greedy : Greedy algorithms (Vazirani 2001 ) approximate an optimal solution through an iterative approach. In the context of SPLs, each iteration selects the configuration that is considered best according to specific measures (e.g., requirements/feature coverage).

Meta-heuristic search : In this category, the problem of identifying a subset of products is treated as an optimization problem. Meta-heuristic algorithms target this problem by employing computational search within the configuration space to find an optimal subset of products (Varshosaz et al. 2018 ). Some studies have applied evolutionary algorithms, random search, and genetic algorithms, using an aggregation function of different objectives such as cost, number of products, number of revealed faults, pairwise coverage, and mutation score (e.g., Ensan et al. 2012 ). Other studies propose to use multi-objective algorithms (e.g., Matnei et al. 2016 ). Hyper-heuristics are another category of approaches that have been explored to solve the problem of product sampling (e.g., Strickler et al. 2016 ); a hyper-heuristic is a methodology that can help automate the configuration of heuristic algorithms and determine low-level heuristics (Jakubovski Filho et al. 2018 ). To consider user preferences throughout the selection of products as well as to exploit the benefits of hyper-heuristic approaches, a preference-based hyper-heuristic approach has been proposed by Jakubovski Filho et al. ( 2018 ); this approach is an example of algorithms proposed in the field of Preference and Search-Based Software Engineering (PSBSE) (Ferreira et al. 2017b ).
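
Many of the mono-objective, search-based studies aggregate several objectives into a single fitness value. The sketch below shows one such aggregation, a weighted sum of normalized coverage, cost, and sample size; the objectives, weights, and normalization are illustrative assumptions rather than the formulation of any specific cited study.

```python
# Minimal sketch of a weighted-sum fitness function for a configuration sample,
# as used by mono-objective search-based sampling. Objectives and weights are
# illustrative; real studies define and normalize their own objectives.

def fitness(sample, weights=(0.5, 0.3, 0.2)):
    """Aggregate pairwise coverage (maximize) with cost and size (minimize)."""
    w_cov, w_cost, w_size = weights
    coverage = sample["covered_pairs"] / sample["total_pairs"]  # in [0, 1]
    cost = sample["test_cost"] / sample["max_cost"]             # normalized cost
    size = sample["num_products"] / sample["max_products"]      # normalized sample size
    return w_cov * coverage - w_cost * cost - w_size * size

if __name__ == "__main__":
    candidate = {"covered_pairs": 180, "total_pairs": 200,
                 "test_cost": 40.0, "max_cost": 100.0,
                 "num_products": 6, "max_products": 20}
    print(f"fitness = {fitness(candidate):.3f}")
```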

Semi-automatic selection : In semi-automated selection, various factors are considered, including the desired number of generated products, the allocated sampling time, and the level of coverage, such as coverage of feature interactions. Moreover, the complete sample set or an initial set produced by other sampling techniques may serve as a starting point for the sampling process (Varshosaz et al. 2018 ). As an example, Reuling et al. ( 2015 ) propose a framework for fault-based (re-)generation of configuration samples based on feature-diagram mutation. The underlying rationale for this approach is rooted in the recognition that subsets of products generated by CIT approaches can often contain numerous redundant or less significant feature combinations. Furthermore, these approaches may overlook crucial or error-prone combinations beyond t-wise, primarily due to their black-box nature, which typically lacks consideration of domain-specific knowledge, including the fault history associated with feature combinations. The authors argue that the integration of their proposed approach with pairwise CIT sampling can potentially enhance the efficiency and effectiveness of SPL testing.

Coverage : Coverage criteria are frequently employed to ensure the quality of product sampling. One commonly utilized criterion is the coverage of feature interactions (Varshosaz et al. 2018 ). CIT techniques focus on the interactions between different features or configuration options, as these interactions often lead to defects in software systems. These techniques are classified as greedy by Cohen et al. ( 2007 ), since they select a subset of configurations where each configuration covers as many uncovered combinations as possible; however, they are categorized separately in some other studies (e.g., Cmyrev and Reissing 2014 ). We also prefer to separate this category from greedy algorithms, since these techniques are specifically focused on covering feature interactions. The studies that provide details of either a process or an algorithm for CIT are shown in Table 9 .

The most popular kind of CIT is pairwise testing (2-wise), a specialized notion of t-wise coverage; in t-wise testing, configurations are selected in a way that guarantees that all combinations of t features are tested. Kuhn et al. ( 2004 ) showed that 80% of bugs can be revealed by investigating interactions between two variables. Furthermore, for problems of large complexity, pairwise has proven to be effective, since finding inconsistencies in a model involving only two features is easier than investigating all combinations of features at once (do Carmo Machado et al. 2014 ). However, Steffens et al. ( 2012 ) revealed that interactions among three or more features commonly occur in the SPL testing field; therefore, considering higher-strength combinations can play an important role in revealing faults. To this end, some studies claimed that their proposed approach for t-wise coverage can work with any value of t (e.g., Krieter et al. 2020 ). However, high-strength (t > 3) feature interaction can lead to a large number of valid configurations and therefore complicate the problem of t-wise coverage (Qian et al. 2018 ). Selecting a specific value for t is therefore usually a trade-off between cost and the efficiency of revealing faults.
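
To illustrate why higher-strength coverage quickly becomes expensive, the short calculation below counts the t-wise feature-value combinations that must be covered for n Boolean features when constraints are ignored (C(n, t) · 2^t); the rapid growth with t underlies the cost/efficiency trade-off discussed above. The model size of 30 features is an arbitrary example.

```python
# Number of t-wise feature-value combinations for n Boolean features,
# ignoring feature-model constraints: C(n, t) * 2^t.
from math import comb

def t_wise_combinations(n_features, t):
    return comb(n_features, t) * 2 ** t

if __name__ == "__main__":
    n = 30  # illustrative model size
    for t in (2, 3, 4):
        print(f"t={t}: {t_wise_combinations(n, t):,} combinations to cover")
```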

Prioritizing configurations/test cases : Test case prioritization is focused on defining an execution order of test cases that attempts to increase their effectiveness at meeting some performance goals (Li et al. 2007 ; Catal and Mishra 2012 ). By investigating the studies, we identified two categories of work in this regard:

Several studies propose approaches for prioritizing the SPL configurations/products to be tested; these approaches are usually used as a complement to product selection/sampling techniques. In some of these studies, one or more objectives are defined for configuration prioritization (e.g., high failure rate and high overall requirement coverage) (Scheidemann 2006 ; Sánchez et al. 2014 ; Wang et al. 2014 ; Galindo et al. 2016 ; Parejo et al. 2016 ; Akimoto et al. 2019 ; Hierons et al. 2020 ; Pett et al. 2020 ; Ferrer et al. 2021 ); the results of the evaluations conducted by Parejo et al. ( 2016 ) indicate that multi-objective prioritization typically leads to faster fault detection than mono-objective prioritization. In another category of studies, the similarity between configurations with respect to feature selections is used as a criterion for product prioritization (similarity-based prioritization) (Arrieta et al. 2015 ; Al-Hajjaji et al. 2017a , 2019 ). In these approaches, configurations are prioritized based on the dissimilarity between them, so that the configuration least similar to the previously selected configurations in terms of feature selections is chosen next. Al-Hajjaji et al. ( 2017b ) propose a delta-oriented product prioritization method, since similarity-based prioritization techniques do not consider all actual differences between products; in this approach, instead of comparing products by their selected features, delta-modeling artifacts (Clarke et al. 2010 ) are used to prioritize products.
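
The dissimilarity idea can be sketched as follows: configurations are ordered so that each newly chosen configuration maximizes its minimum Hamming distance, over feature selections, to the configurations already chosen. The distance measure, the seeding rule, and the example configurations are illustrative simplifications of the cited similarity-based approaches.

```python
# Minimal sketch of similarity-based prioritization of configurations.
# Each configuration is the set of its selected features; the next configuration
# is the one least similar (largest minimum Hamming distance) to those already chosen.

configs = {
    "P1": {"Card", "Receipt"},
    "P2": {"Cash"},
    "P3": {"Card", "Cash", "Loyalty"},
    "P4": {"Card", "Receipt", "Loyalty"},
}

def hamming(a, b):
    """Number of features selected in exactly one of the two configurations."""
    return len(a ^ b)

def prioritize(configurations):
    remaining = dict(configurations)
    # Seed with the largest configuration (an arbitrary but deterministic choice).
    order = [max(remaining, key=lambda k: len(remaining[k]))]
    del remaining[order[0]]
    while remaining:
        nxt = max(remaining, key=lambda k: min(
            hamming(remaining[k], configurations[o]) for o in order))
        order.append(nxt)
        del remaining[nxt]
    return order

if __name__ == "__main__":
    print(prioritize(configs))  # e.g. ['P3', 'P1', 'P2', 'P4']
```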

Some studies focus on prioritizing test cases for products. Lima et al. ( 2020 ) propose a learning-based approach to prioritize test cases in the Continuous Integration (CI) cycles of highly configurable systems. Arrieta et al. ( 2015 ), Marijan et al. ( 2017 ), Markiegi et al. ( 2017 ), Arrieta et al. ( 2019 ) and Hajri et al. ( 2020 ) use specific criteria to prioritize test cases (e.g., fault detection capability, test execution time, or test case appearance frequency). In another category of studies, similarity-based approaches are proposed to prioritize test cases (e.g., Devroey et al. 2017 ; Lachmann et al. 2015 ; Lachmann et al. 2016 ). As an example, Devroey et al. ( 2017 ) propose an algorithm to generate and sort dissimilar tests to achieve good fault finding; to this end, a distance function is calculated based on the actions executed by the test case. Furthermore, to provide good coverage of a large number of products, test cases are also prioritized based on the products that may execute them.

Minimizing test suite : This technique is focused on minimizing the test suite size for testing a product, while preserving fault detection capability and testing coverage of the original test suite. Al-Dallal and Sorenson ( 2008 ), Stricker et al. ( 2010 ), Kim et al. ( 2012 ) and Beohar and Mousavi ( 2016 ) discuss approaches in which test cases already covered during Domain Engineering or test cases related to common parts that have already been executed in previous products are ignored. Other studies propose specific approaches to reduce redundant test executions for SPL regression testing by pruning tests that are not impacted by changes (Lachmann et al. 2016 ; Jung et al. 2019 , 2020 , 2022 ; Souto and d’Amorim 2018 ).

There are studies focused on improving the test generation process to produce a minimal set of test cases while achieving specific objectives (e.g., coverage and cost/time) (Patel et al. 2013 ; Wang et al. 2015 ; Gebizli and Sözer 2016 ; Akbari et al. 2017 ; Marijan et al. 2017 ; Aduni Sulaiman et al. 2019 ; Markiegi et al. 2019 ; Rocha et al. 2020 ). As an example, Akbari et al. ( 2017 ) propose a method in which the features in the feature model are prioritized based on the domain engineer’s decisions and the constraints between features; integration test cases are then produced by considering the specified priorities. Furthermore, there are approaches that are not directly focused on test suite minimization but help reduce the redundant execution of tests for unnecessary configurations (Kim et al. 2013 ; Souto and d’Amorim 2018 ). These approaches remove the valid configurations that are unnecessary for the execution of each test.
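
A minimal sketch of such change-based pruning is given below: each test case is traced to the features it exercises, and tests whose features are untouched by a change are skipped. The traceability data and the change set are made-up examples; the cited approaches rely on finer-grained impact analysis.

```python
# Minimal sketch of regression test selection: skip test cases that exercise
# only features untouched by the change.

test_to_features = {
    "test_checkout_card":   {"Card", "Payment"},
    "test_checkout_cash":   {"Cash", "Payment"},
    "test_digital_receipt": {"Receipt"},
    "test_loyalty_points":  {"Loyalty"},
}

changed_features = {"Payment", "Receipt"}

def select_regression_tests(traceability, changed):
    """Keep only the tests that exercise at least one changed feature."""
    return sorted(t for t, features in traceability.items() if features & changed)

if __name__ == "__main__":
    print(select_regression_tests(test_to_features, changed_features))
    # -> ['test_checkout_card', 'test_checkout_cash', 'test_digital_receipt']
```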

The distribution of studies based on the identified techniques is presented in Table  10 . As observed, the majority of the studies (∼ 62%) are focused on proposing a specific level of automation. However, many of these studies do not offer details regarding the specific tools utilized for this purpose. The second most researched category of approaches pertains to handling the selection of products to test (∼ 39%). Following this are techniques involving reusing test assets (∼ 25%), prioritizing configurations/test cases (∼ 18%), and minimizing test suite size (∼ 15%).

5 Threats to validity

In this section, we discuss the main threats associated with the validation of this study, classified according to the categorization proposed by Ampatzoglou et al. ( 2019 ). These particular threats are categorized into three categories: study selection validity, data validity, and research validity.

5.1 Study selection validity

One of the main threats to any secondary study is its inability to guarantee the inclusion of all relevant articles in the field. To mitigate this threat, a meeting involving all researchers was conducted to discuss and refine the search scope and keywords. Then, we evaluated the validity of the search string by conducting a limited manual search to see whether the results of that manual search show up in the results obtained by running the search string.

To ensure the comprehensive identification of all relevant studies in our search process, we rigorously followed the guidelines provided by Kitchenham and Charters ( 2007 ). We conducted a bibliographic search of published literature reviews in the SPL testing field. We updated the list of studies by applying a search string to multiple digital libraries and performed the backward and forward snowballing process. Therefore, we are confident that we have provided good coverage of studies in the SPL testing field.

During the primary study selection process, to minimize potential bias in applying inclusion/exclusion criteria, these criteria were clearly defined and regularly updated in our protocol. The first author applied the inclusion and exclusion criteria; to reduce researcher bias, the results of this stage were validated by the second and third authors of this paper.

Regarding quality assessment, we used a set of quality criteria to examine the studies. These criteria were reused from those proposed by Dybå and Dingsøyr ( 2008 ). Two researchers participated in the application of quality assessment criteria. We also conducted regular meetings to address and resolve any conflicts that arose during the process effectively.

5.2 Data validity

One of the main threats regarding data validity is data extraction bias. Subjective bias during the data extraction process has the potential to lead to an inconsistent interpretation of the extracted data by researchers. To mitigate this risk, two researchers collaborated during the data extraction phase and conducted resolution sessions to address any emerging ambiguities. Nevertheless, because certain studies did not provide explicit details on specific aspects of SPL testing, such as test levels, we had to make subjective interpretations based on information scattered throughout these studies.

Subjective bias may also lead to the misclassification of data in response to RQ3–RQ7. Since no predefined categories were available, we adopted an exploratory approach, scrutinizing the extracted data and identifying pertinent categories. To mitigate this potential issue, we introduced a structured data extraction form, conducted quality assessments on the chosen studies, and maintained ongoing discussions to ensure consistency in the data extraction process and category definitions. However, it is essential to acknowledge the potential influence of researcher bias on data extraction and presentation within this study.

5.3 Research validity

Research validity encompasses threats identified at all stages of our SLR.

We extensively searched secondary studies, as detailed in Sect. 3.2. This approach enabled us to identify research gaps, consider the scope and definition of RQs, and gain insights into the current state-of-the-art within the domain of SPL testing.

In our exploration of potential threats to the repeatability of this SLR, we acknowledge the complexity inherent in replicating research; specifically, other researchers may not repeat the SLR with precisely the same results. To mitigate this threat, we provided the details of the SLR methodology so that other researchers can replicate the study; furthermore, we have made all the data collected during the SLR process available online. However, as subjectivity in the analysis of studies is a major issue in conducting a literature review, we cannot guarantee that other researchers would achieve exactly the same results.

One serious threat to the validity of the SLR is the inability to generalize the study’s results to other scenarios and application domains. To handle this threat, we included only empirically evaluated studies in our analysis. However, as most evaluations do not refer to real-world practice, the results and classifications presented in this study may not fully apply to practical settings. Moreover, our SLR intentionally focused exclusively on SPLs. This deliberate choice was made to answer specific questions tailored for SPL testing. While this focus enhances the depth of our insights into SPL testing practices, it inevitably limits the applicability of our findings to the broader context of configurable systems. The decision not to include configurable systems was strategic, considering the extensive body of literature on configurable system testing, which would have required substantial additional time and effort for comprehensive analysis.

6 Discussion

In this study, we presented a systematic review of testing approaches proposed in the SPLE field. We have investigated seven RQs:

RQ1: How is the research on SPL testing characterized?

The analysis indicates that the SPL testing field has attracted significant attention from researchers in recent years, with an increase in empirically evaluated studies. Although the overall number of publications has grown, recent years have seen a decline. Most primary studies are published in conferences, with case studies, experiments, and expert surveys being the common evaluation methods. However, the strength of evidence supporting the proposed approaches varies, with academic studies (60%) being the most common, followed by demonstrations (17%). Only a small number of studies involve industrial systems or real data sets (16%) or industrial practice (13%), indicating an overall low level of evidence in the field.

RQ2 . What levels of tests are usually executed throughout the SPL lifecycle (i.e., Domain Engineering and Application Engineering)?

In Domain Engineering, testing activities include developing test assets for later use and testing assets to detect faults early. In Application Engineering, activities involve creating specific product test assets, designing additional product-specific tests, and executing tests. Some studies focus on reducing the number of products tested or prioritizing products to enhance testing efficiency. The distribution of studies based on test levels shows that in Application Engineering, integration testing and system/acceptance testing are the most commonly reported levels. In contrast, unit testing is less frequently reported in both phases. This indicates a strong focus on higher levels of testing in the SPL testing field, particularly in the Application Engineering phase.

RQ3 . How are test assets created by considering commonalities and variabilities?

Creating test assets to address commonality and variability in SPL testing is crucial for enhancing reusability and minimizing faults in core assets. Our analysis categorized these approaches into three groups: model-based, specification-based, and requirement-based.

Model-based approaches utilize formal or semi-formal models of SPL variability to design and execute tests. Specification-based approaches define specific links between different SPL configurations and test cases. Requirement-based approaches prioritize considering variability early in test case design. The distribution of studies across these categories indicates that model-based techniques are the most commonly used in the examined studies.

RQ4 . How do SPL approaches deal with configuration-aware software testing?

Dealing with configuration-aware software testing, particularly in distinguishing valid and invalid combinations of configuration parameters, is crucial in SPL approaches. Testing all possible combinations of SPL functionalities is not only impractical but also unnecessary. The studies have employed three main methods to distinguish between valid and invalid configurations: Using/proposing specific approaches, algorithms, or tools, runtime analysis, and manual analysis. The distribution of studies across these methods indicates that the majority of the studies have either proposed specific methods or algorithms or have utilized already available tools.

RQ5 . How is the traceability between test assets and other artifacts of SPL preserved throughout the SPL lifecycle?

Preservation of traceability between test assets and other artifacts is a crucial factor in SPL testing as it enhances the reusability of test assets and manages the complexity of SPL testing. However, only a few papers consider preserving traceability throughout the SPL lifecycle. The papers are categorized based on the types of artifacts associated with test assets, focusing on preserving traceability between requirements and test assets as well as between configurations and test assets. The distribution of primary studies addressing this aspect highlights that most of the studies focus on preserving traceability between requirements and test assets.

RQ6 . How are Non-Functional Requirements (NFRs) tested in SPL?

Testing NFRs in SPLs has been rarely examined by researchers, with only three studies addressing this aspect. These studies cover various categories of NFRs, such as load testing, performance profiling, NFRs at the hardware-in-the-loop level, and real-time properties.

RQ7 . What mechanisms have been used for controlling cost/effort of SPL testing?

Various techniques have been proposed to manage the cost and effort associated with SPL testing. However, the lack of a standardized classification for these techniques has made their analysis challenging. Notably, research on product sampling techniques has been extensively categorized into sub-techniques such as automatic selection, semi-automatic selection, and coverage. Beyond sampling techniques, other approaches have emerged, categorized based on their primary contributions, including reusing test assets, providing different levels of automation, handling product selection for testing, prioritizing configurations/test cases, and minimizing the test suite size.

These techniques are often combinable, as seen in the use of test prioritization and minimization techniques alongside sampling techniques to optimize testing cost and effort further. Moreover, the list of techniques continues to evolve with new publications on SPL testing. The distribution of studies reveals that the majority focus on proposing a specific level of automation (∼ 62%). However, many studies lack details on the specific tools used for this purpose. The second most researched category involves handling the selection of products to test (∼ 39%). Additionally, techniques related to reusing test assets (∼ 25%), prioritizing configurations/test cases (∼ 18%), and minimizing test suite size (∼ 15%) are also explored.

We only included studies empirically evaluated in our analysis. In this discussion, we emphasize the maturity of evaluations conducted in these studies, highlight the contributions of the studies in addressing the research questions, present the main findings, and propose research directions to address identified gaps. It is important to note that our SLR intentionally focused exclusively on SPLs. We deliberately excluded the broader context of configurable systems from our analysis to have a clear focus for our article. Therefore, all the findings and research gaps reported in this section are based on our analysis within the SPL testing area. We acknowledge that this might lead to missing synergies with contributions from the broader field of configurable systems. Still, we hope this SLR can be the basis for exploring these aspects in future work.

6.1 Overview of evaluation maturity and studies’ contributions

Proposed approaches have been evaluated using three types of evaluation methods: case studies, experiments, and expert surveys. However, there is variation in the scope and type of SPLs employed in these evaluations. Different types of SPLs have been employed, representing diverse application domains, such as embedded systems (e.g., automotive and medical systems), web-based systems, banking systems, and smartphone and vending machine SPLs. We categorized the scope of applications employed in the evaluations into three main groups: industrial systems with real data sets, SPLs sourced from online repositories (e.g., the SPLOT repository) or extracted from existing sources, and the development of a demonstrator. It is important to note that some studies utilized more than one category of applications, for instance, both industrial SPLs and SPLs available online. Approximately 60% of the studies (71 studies) conducted evaluations using SPLs available online or derived from prior research. Around 17% (20 studies) involved the development of a demonstrator for assessing the proposed approach. Only 29% (34 studies) utilized an industrial-scale SPL (industrial study or industrial practice) for evaluating their approach. This may jeopardize the adoption of the proposed approaches in industry; therefore, proposed SPL testing approaches need to be improved from an evaluation perspective.

Discussing threats to validity is crucial in research since it helps researchers and readers understand the limitations and potential challenges associated with the study. However, an analysis of the included studies reveals that only 32 primary studies (∼ 27%) extensively discussed threats to validity. In approximately 42 studies (∼ 36%), the examination of threats to validity was brief. Notably, 44 studies (∼ 37%) entirely neglected to address this crucial aspect.

Another aspect that is worth analyzing is the distribution of the studies based on their contribution to the research questions. Figure  3 represents the frequencies of studies according to the research questions addressed by them. It should be mentioned that some studies covered more than one topic; therefore, the total amount shown in Fig.  3 exceeds the total number of studies selected for final analysis. As seen in Fig.  3 , most studies address the questions RQ7 (Controlling cost/effort of SPL testing) and RQ2 (Test levels in SPL testing). Moreover, there is notable research interest in the area of configuration-aware testing (RQ4), followed by a substantial focus on variability-aware creation of test assets (RQ3). However, some aspects of SPL testing have rarely been considered and, therefore, need new solutions, including RQ5 (Traceability between test assets and other artifacts) and RQ6 (Non-functional testing).

Fig. 3 Distribution of studies by their contribution to the research questions

6.2 Main findings

We analyzed the data based on the content structuring/theme analysis approach of Mayring ( 2014 ). Initially, the data extracted from the extraction form provided us with a list of key challenges and sub-themes. In the next step, we inductively created categories within the themes to summarize them (analytical themes). The results of this analysis are shown in Table  11 . In the rest of this section, we present various gaps and concerns that necessitate further exploration and attention from both researchers and practitioners:

Variability management : Effective variability management in SPLs is crucial, yet it introduces complexities that can pose challenges to testing (Sect.  4.3 ). One facet that needs further exploration is the set of challenges associated with variability control. It demands a more in-depth investigation to identify and analyze challenges arising from the diverse features and configurations inherent in SPLs. These challenges encompass the complexities introduced by numerous potential combinations and the possibility of unforeseen interactions among variable elements. While this aspect has been examined previously, the key concern lies in the applicability of the proposed solutions and approaches in real-world scenarios. For example, one of the most investigated solutions involves selecting a subset of products for testing; however, doubts remain because unforeseen interactions between features in new products may still result in faults. Furthermore, many of the proposed approaches have only been evaluated at a proof-of-concept level, necessitating a more in-depth investigation into their suitability for industrial SPL applications.

Another crucial aspect involves examining variability modeling. This includes an analysis of the current state of variability modeling in SPL testing and an exploration of opportunities to enhance modeling techniques to address testing challenges. While model-based approaches, commonly used to create variant-rich test assets, have shown promise in SPL testing, there is still room for improvement in automating the generation of test cases and ensuring comprehensive coverage based on variability models. Utilizing model-based approaches can automate the process of transforming high-level test assets (e.g., test scenarios) and generating low-level test assets (e.g., test cases and test data).

Non-functional testing : Despite the fact that functional testing of SPLs has been extensively investigated, non-functional testing aspects need greater focus and specific methodologies (Sect.  4.6 ). This particular gap has already been acknowledged in previous literature reviews. Non-functional requirements encompass diverse dimensions, including but not limited to performance, security, usability, and scalability. While some studies have explored aspects such as real-time behaviors and performance, there remains a need for further research to comprehensively address diverse facets within this domain. Moreover, the inherent nature of non-functional requirements significantly shapes testing strategies. Considering their distinct characteristics and evaluation criteria, it is crucial to investigate how distinct testing approaches are essential for various aspects like performance testing, security testing, and usability testing.

Non-functional testing, particularly in critical areas such as performance and security, poses challenges due to its resource-intensive nature. Investigating the challenges associated with acquiring and allocating resources for thorough non-functional testing throughout the SPL lifecycle is crucial for effective quality assurance.

The complexities of seamlessly integrating non-functional testing with functional testing necessitate further exploration. Examining how the interplay between these two testing dimensions influences the overall quality assurance process will contribute valuable insights to the field.

Tool support : Given the substantial testing effort required for SPLs, the availability of tools specifically designed for SPL testing is crucial (Sect.  4.7 ). The analysis of the studies with respect to automation provided by the tools indicates that most of the tool implementations are proof-of-concept prototypes developed for validating the proposed approach. Therefore, developing more robust and user-friendly tools can significantly help practitioners in their testing efforts. This particular challenge has previously been discussed in prior literature reviews.

Some specific areas need further exploration: evaluating the effectiveness and efficiency of existing SPL testing tools, i.e., their capabilities, limitations, and areas for improvement across the various testing activities of the SPL lifecycle; analyzing how well testing tools adapt to changes in SPL configurations, including their ability to accommodate evolving feature sets, configurations, and architectural variations; and assessing the user experience and usability of SPL testing tools for the practitioners involved in SPL testing, considering factors such as ease of use, learning curve, and user satisfaction.

Regression testing : Effectively handling regression testing in SPLs, where modifications to one product can affect others, presents an intricate challenge (Sect.  4.7 ). Regression test selection/prioritization/minimization and architecture-based regression testing are potential directions for future research. Test case selection focuses on choosing a set of relevant test cases to test the modified version of the system, while test minimization aims to remove redundant/irrelevant test cases from the existing test suite. Test case prioritization aims at ordering and ranking test cases based on specific criteria such as importance and likelihood of failure. All these techniques aim to reduce the cost/effort of SPL testing after any change to products or the SPL architecture.

An important aspect is analyzing how changes and evolutions in the SPL architecture impact regression testing strategies. This investigation includes understanding the challenges of maintaining test suites across evolving SPL configurations and the need for adaptive regression testing approaches.

Additionally, exploring the benefits and challenges of implementing automated regression testing within the SPL context is crucial. This requires an analysis of efficiency gains, potential pitfalls, and strategies to optimize the effectiveness of automated regression testing in SPL scenarios.

Moreover, investigating challenges related to maintaining traceability between evolving codebase versions and regression test suites is critical. This requires exploring strategies to preserve traceability links, ensuring that regression testing aligns with the dynamic nature of SPL development.

Industrial evaluations: Encouraging the adoption of SPL testing practices in industrial settings requires addressing practical challenges (Sects. 3.3 and 4.1). This includes offering guidance tailored for industry-specific SPL testing and conducting industrial evaluations.

To enhance the industry adoption of SPL approaches, offering practical insights and recommendations is essential. This involves providing tailored guidance to help organizations navigate the unique challenges and requirements of adopting SPL testing methods in their specific industry domains. Additionally, there is a need to move beyond proof-of-concept evaluations and conduct practical assessments to verify the feasibility, scalability, and effectiveness of proposed SPL testing methods in diverse industrial contexts.

Test levels throughout the SPL lifecycle: Exploring the details of each test level throughout the SPL lifecycle and illustrating the challenges of neglecting a particular level would provide valuable insights for practitioners (Sect. 4.2). Two test levels are commonly executed during Domain Engineering: unit testing and integration testing. Although testing the common core assets of an SPL is vital to detect faults as early as possible, only a few studies have considered the execution of tests in Domain Engineering. Further investigation of how to execute a specific test level in Domain Engineering, and of the consequences of not performing it, would therefore be useful. In Application Engineering, three test levels are usually executed: unit testing, integration testing, and system/acceptance testing. The last two levels have been investigated in most of the studies. It is worth mentioning that unit testing in Application Engineering has been investigated in only a few studies published in recent years, whereas previous literature reviews did not report this test level in Application Engineering at all (e.g., Pérez et al. 2009). This indicates that there is no consensus on the test levels executed during Domain Engineering and Application Engineering.

Another aspect that needs further exploration involves examining the influence of variabilities inherent in SPLs on different test levels. This requires understanding how the presence of variable features across products affects test activities, including planning, design, and execution at each testing level. Additionally, there is a need to investigate how test levels adapt to requirements and feature set changes throughout the SPL lifecycle. This requires exploring the challenges and opportunities associated with maintaining effective testing strategies in response to the dynamic nature of evolving product configurations.
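As an illustration of what executing a unit-level test in Domain Engineering could look like, the sketch below uses pytest parameterization to exercise a shared core asset against several representative feature configurations before any concrete product is derived. The pricing function, the feature names, and the expected values are purely hypothetical.

```python
import pytest

def compute_price(base, features):
    """A shared core asset whose behavior varies with two optional features."""
    price = base
    if "discount" in features:
        price *= 0.9
    if "tax" in features:
        price *= 1.2
    return round(price, 2)

# Domain-level unit test: the core asset is exercised against representative
# configurations before any concrete product is derived in Application Engineering.
@pytest.mark.parametrize("features, expected", [
    (set(), 100.00),
    ({"discount"}, 90.00),
    ({"tax"}, 120.00),
    ({"discount", "tax"}, 108.00),
])
def test_compute_price_across_configurations(features, expected):
    assert compute_price(100, features) == expected
```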

Preserving the traceability between test assets and development artifacts: Preserving traceability between test assets and development artifacts in SPLs is particularly challenging because of the complex relationships between product variants and the shared assets (Sect. 4.5). Studies on SPL testing very rarely consider traceability explicitly. Examining the challenges of preserving traceability is therefore crucial, especially when dealing with evolving product configurations in the SPL testing environment. Researchers have proposed some methods: Reis et al. (2007) preserved the traceability between requirements and test case scenarios by using UML models and refining use case scenarios into test case scenarios, and Reuys et al. (2006) enabled traceability between artifacts. Nevertheless, more efficient approaches for modeling and representing traceability relationships, taking feature variability and configuration management into account, still need to be investigated. Furthermore, creating automated tools and techniques for establishing and continuously updating traceability links in response to the evolving nature of SPLs is a promising area for future research.
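As a rough illustration of the kind of infrastructure such research would target, the following sketch shows one possible in-memory representation of traceability links between development artifacts (e.g., requirements or features) and test assets, together with simple queries for change impact and orphaned tests. All identifiers and the update policy are illustrative assumptions, not a reconstruction of the approaches by Reis et al. (2007) or Reuys et al. (2006).

```python
from collections import defaultdict

class TraceabilityIndex:
    """Illustrative bidirectional index linking development artifacts
    (e.g., requirements or features) to the test assets derived from them."""

    def __init__(self):
        self._tests_for = defaultdict(set)      # artifact id -> test ids
        self._artifacts_for = defaultdict(set)  # test id -> artifact ids

    def link(self, artifact_id, test_id):
        self._tests_for[artifact_id].add(test_id)
        self._artifacts_for[test_id].add(artifact_id)

    def impacted_tests(self, changed_artifacts):
        """Tests traced to changed artifacts: candidates for re-execution or update."""
        impacted = set()
        for artifact in changed_artifacts:
            impacted |= self._tests_for[artifact]
        return impacted

    def orphan_tests(self):
        """Tests left without any traced artifact (e.g., after a feature was removed)."""
        return {t for t, artifacts in self._artifacts_for.items() if not artifacts}

    def remove_artifact(self, artifact_id):
        for test in self._tests_for.pop(artifact_id, set()):
            self._artifacts_for[test].discard(artifact_id)

# Hypothetical usage: requirement R1 is refined into two test case scenarios.
index = TraceabilityIndex()
index.link("R1-checkout", "TC-checkout-happy-path")
index.link("R1-checkout", "TC-checkout-invalid-card")
print(index.impacted_tests({"R1-checkout"}))
index.remove_artifact("R1-checkout")
print(index.orphan_tests())
```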

To compare findings with previous SLRs, Table  12 presents a summary of the findings from both the current study and prior literature reviews (Pérez et al. 2009 ; Engström and Runeson 2011 ; Da Mota Silveira Neto et al. 2011 ; do Carmo Machado et al. 2014 ).

7 Related work

This research aims to provide researchers and practitioners with an overview of state-of-the-art testing practices applied to SPLs and to identify the gaps between required techniques and existing approaches. Accordingly, we conducted an SLR to analyze existing approaches to SPL testing. SLRs and SMSs on SPL testing can therefore be considered related work. To the best of our knowledge, four papers have systematically analyzed approaches focused on SPL testing (Pérez et al. 2009; Engström and Runeson 2011; Da Mota Silveira Neto et al. 2011; do Carmo Machado et al. 2014).

Pérez et al. (2009) conducted an SLR to identify experience reports and initiatives carried out in the SPL testing area. In this work, primary studies were classified into seven categories: unit testing, integration testing, functional testing, SPL architecture testing, embedded system testing, testing process, and testing effort in SPL. The authors then presented a summary of each area. This SLR is similar to our work in that both investigate test levels; however, our work is broader in scope since we investigated more aspects of SPL testing.

Engström and Runeson ( 2011 ) conducted an SMS by analyzing papers published up to 2008. The authors mapped studies into seven categories based on their research focus: Test organization and process, Test management, Testability, System and acceptance testing, Integration testing, Unit testing, and Test automation. They also identified challenges in SPL testing and needs for future research. This SMS has similarities with our work regarding specific SPL aspects investigated, including testing levels and test automation. However, the research questions designed by Engström and Runeson ( 2011 ) are more general, focusing on specifying challenges and topics investigated in SPL testing.

Da Mota Silveira Neto et al. (2011) conducted an SMS to investigate state-of-the-art testing practices by analyzing a set of 45 publications dated from 1993 to 2009. Primary studies were mapped into nine categories: Testing strategy, Static and dynamic analysis, Testing levels, Regression testing, Non-functional testing, Commonality and variability testing, Variant binding time, Effort reduction, and Test measurement. Some of the research questions designed by Da Mota Silveira Neto et al. (2011) are similar to those investigated in our work (e.g., testing SPLs while considering commonalities and variabilities). However, our work is broader in scope since we analyzed 110 papers published up to 2022. Furthermore, we included only empirically evaluated studies in our review.

do Carmo Machado et al. (2014) conducted an SLR analyzing 49 studies published up to 2013; this SLR aimed to identify testing strategies that could achieve higher defect detection rates and reduced quality assurance effort. Strategies for handling the selection of products to test are investigated both in do Carmo Machado et al. (2014) and in our work. Furthermore, similar to our work, the initial set of primary studies in do Carmo Machado et al. (2014) was identified by investigating previously conducted SLRs and SMSs published up to 2009, and the authors of that SLR included only empirically evaluated studies. However, our work investigates more aspects of SPL testing (e.g., preserving traceability between test assets and other artifacts) and analyzes more studies (110 papers).

Some literature reviews have focused specifically on a single aspect of SPL testing. For example, Lopez-Herrejon et al. (2015) conducted an SMS to identify techniques that have been applied for combinatorial interaction testing of SPLs. Our work is broader in scope since we did not limit the included studies to a specific technique.

In general, the previous literature reviews and our work complement each other regarding the research questions addressed. Some aspects of SPL testing have not been considered in detail in previous reviews: techniques for preserving traceability between test artifacts and other artifacts, techniques for identifying valid and invalid configurations, and the different ways to control the cost and effort of SPL testing were not covered to an extent that makes it possible to identify the current status of research and practice for those aspects.

8 Conclusions and future work

The goal of SPLE is to improve the effectiveness and efficiency of software development by managing commonalities and variabilities among products. Testing is an essential part of SPLE for achieving the benefits of an SPL. It focuses on detecting potential faults in core assets created during Domain Engineering and in products created during Application Engineering by reusing those core assets. This paper presents the results of a systematic literature review of testing in SPLE. The SLR aimed to investigate specific aspects of SPL testing, formulated as seven research questions, to identify gaps, and to highlight aspects of SPLE that have yet to be fully addressed.

The analysis that we conducted based on 118 studies from 2003 to 2022 has uncovered a range of issues and considerations that researchers and practitioners can work on. It is shown that managing variability in SPL testing is vital but can complicate the testing process. Model-based methods show promise in generating test assets, but there is room for improvement in automating test case creation and ensuring comprehensive coverage. Non-functional testing aspects like performance, security, and usability require more attention and specific methodologies. Having the right tools is important, but most tool implementations are still in the proof-of-concept stage. Regression testing poses a complex challenge, and future research should concentrate on areas like regression test selection, prioritization, minimization, and architecture-based regression testing. Establishing benchmark datasets and standard evaluation criteria for SPL testing methods would simplify comparing and adopting various techniques.

Exploring test levels throughout the SPL lifecycle and illustrating the challenges of neglecting a particular test level would offer valuable insights. Additionally, studies focusing on testing SPLs need to address traceability explicitly. Maintaining traceability between test assets and development artifacts is especially difficult due to the intricate relationships between product variants and shared assets, so effective approaches are required. It is also worth mentioning that, when selecting studies for the final analysis, we included only empirically evaluated studies. Analyzing the evaluations conducted in these studies, we noticed that most studies were assessed with only one empirical method. Furthermore, most of the assessments do not refer to real-world practice. This indicates the need to evaluate SPL testing approaches not only in academia but also in industry.

Based on the findings of this SLR, further research in the SPL testing field can focus on the specific areas we identified throughout this research as potential points for future work (e.g., SPL regression testing). Furthermore, empirical assessment of existing techniques for the investigated aspects (e.g., selection of products to test or creating reusable test assets), in order to compare those techniques, would be helpful for both researchers and practitioners, especially if those techniques are applied to real-world and large-scale scenarios. This research could also be strengthened by examining studies published in the field of testing configurable systems. Such an analysis could investigate how techniques from this broader domain might be applied to SPL testing to address existing deficiencies in this area.

Data availability

All data generated during this study are available in the “Zenodo” repository: https://zenodo.org/doi/10.5281/zenodo.10018266 .

Replication package available at https://doi.org/10.5281/zenodo.10018266 .

SPLC stands for Software Product Line Conference.

Aduni Sulaiman R, Jawawi DN, Halim SA (2019) Derivation of test cases for model-based testing of software product line with hybrid heuristic approach. In: IRICT’19, pp 199–208. https://doi.org/10.1007/978-3-030-33582-3_19

Akbari Z, Khoshnevis S, Mohsenzadeh M (2017) A method for prioritizing integration testing in software product lines based on feature model. Int J Softw Eng Knowl Eng 27(04):575–600. https://doi.org/10.1142/S0218194017500218

Akimoto H, Isogami Y, Kitamura T, Noda N, Kishi T (2019) A prioritization method for SPL pairwise testing based on user profiles. In: APSEC’19, pp 118–125. https://doi.org/10.1109/APSEC48747.2019.00025

Al-Dallal J, Sorenson PG (2008) Testing software assets of framework-based product families during application engineering stage. J Softw 3(5):11–25

Al-Hajjaji M, Krieter S, Thüm T, Lochau M, Saake G (2016) IncLing: efficient product-line testing using incremental pairwise sampling. ACM SIGPLAN Not 52(3):144–155. https://doi.org/10.1145/3093335.2993253

Al-Hajjaji M, Krüger J, Schulze S, Leich T, Saake G (2017a) Efficient product-line testing using cluster-based product prioritization. In: AST’17, pp 16–22. https://doi.org/10.1109/AST.2017.7

Al-Hajjaji M, Lity S, Lachmann R, Thüm T, Schaefer I, Saake G (2017b) Delta-oriented product prioritization for similarity-based product-line testing. In: VACE’17, pp 34–40. https://doi.org/10.1109/VACE.2017.8

Al-Hajjaji M, Thüm T, Lochau M, Meinicke J, Saake G (2019) Effective product-line testing using similarity-based product prioritization. Softw Syst Model 18(1):499–521. https://doi.org/10.1007/s10270-016-0569-2

AL-Msie’deen RF, Seriai A, Huchard M, Urtado C, Vauttier S, Salman HE (2013) Feature location in a collection of software product variants using formal concept analysis. In: ICSR’13, pp 302–307. https://doi.org/10.1007/978-3-642-38977-1_22

Alves V, Niu N, Alves C, Valença G (2010) Requirements engineering for software product lines: a systematic literature review. Inf Softw Technol 52(8):806–820

Alves Pereira J, Acher M, Martin H, Jézéquel JM (2020) Sampling effect on performance prediction of configurable systems: A case study. In: ICPE’20, pp 277–288. https://doi.org/10.1145/3358960.3379137

Ammann P, Offutt J (2008) Introduction to software testing. Cambridge University Press, Cambridge. https://doi.org/10.1017/CBO9780511809163

Ampatzoglou A, Bibi S, Avgeriou P, Verbeek M, Chatzigeorgiou A (2019) Identifying, categorizing and mitigating threats to validity in software engineering secondary studies. Inf Softw Technol 106:201–230

Aoyama Y, Kuroiwa T, Kushiro N (2021) Executable test case generation from specifications written in natural language and test execution environment. In: CCNC’21, pp 1–6. https://doi.org/10.1109/CCNC49032.2021.9369549

Apel S, Batory D, Kästner C, Saake G (2013) Feature-oriented software product lines: concepts and implementation. Springer, Berlin

Araújo IL, Santos IS, Filho JB, Andrade RM, Neto PS (2017) Generating test cases and procedures from use cases in dynamic software product lines. In: SAC’17, pp 1296–1301. https://doi.org/10.1145/3019612.3019790

Arrieta A, Sagardui G, Etxeberria L (2015) Test control algorithms for the validation of cyber-physical systems product lines. In: SPLC’15, pp 273–282. https://doi.org/10.1145/2791060.2791095

Arrieta A, Wang S, Sagardui G, Etxeberria L (2016) Search-based test case selection of cyber-physical system product lines for simulation-based validation. In: SPLC’16, pp 297–306. https://doi.org/10.1145/2934466.2946046

Arrieta A, Wang S, Sagardui G, Etxeberria L (2019) Search-based test case prioritization for simulation-based testing of cyber-physical system product lines. J Syst Softw 149:1–34. https://doi.org/10.1016/j.jss.2018.09.055

Baller H, Lity S, Lochau M, Schaefer I (2014) Multi-objective test suite optimization for incremental product family testing. In: ICST’14, pp 303–312. https://doi.org/10.1109/ICST.2014.43

Belli F, Tuglular T, Ufuktepe E (2021) Heterogeneous modeling and testing of software product lines. In: QRS-C’21, pp 1079–1088. https://doi.org/10.1109/QRS-C55045.2021.00162

Beohar H, Mousavi MR (2016) Input–output conformance testing for software product lines. J Log Algebr Methods Program 85(6):1131–1153. https://doi.org/10.1016/j.jlamp.2016.09.007

Bharathi M, Sangeetha V (2018) Weighted rank ant colony metaheuristics optimization-based test suite reduction in combinatorial testing for improving software quality. In: ICICCS’18, pp 525–534. https://doi.org/10.1109/ICCONS.2018.8663102

Bucaioni A, Di Silvestro F, Singh I, Saadatmand M, Muccini H (2022) Model-based generation of test scripts across product variants: an experience report from the railway industry. J Softw Evol Process 34(11):e2498. https://doi.org/10.1002/smr.2498

Bürdek J, Lochau M, Bauregger S, Holzer A, Rhein AV, Apel S, Beyer D (2015) Facilitating reuse in multi-goal test-suite generation for software product lines. In: FASE’15, pp 84–99. https://doi.org/10.1007/978-3-662-46675-9_6

Cabral I, Cohen MB, Rothermel G (2010) Improving the testing and testability of software product lines. In: SPLC’10, pp 241–255. https://doi.org/10.1007/978-3-642-15579-6_17

Calvagna A, Gargantini A, Vavassori P (2013) Combinatorial testing for feature models using CitLab. In: ICSTW’13, pp 338–347. https://doi.org/10.1109/ICSTW.2013.45

Carlsson M, Gotlieb A, Marijan D (2016) Software product line test suite reduction with constraint optimization. In: ICSOFT’16, pp 68–87. https://doi.org/10.1007/978-3-319-62569-0_4

Catal C, Mishra D (2012) Test case prioritization: a systematic mapping study. Softw Qual J 21(3):445–478. https://doi.org/10.1007/s11219-012-9181-z

Chen L, Babar MA (2011) A systematic review of evaluation of variability management approaches in software product lines. Inf Softw Technol 53(4):344–362. https://doi.org/10.1016/j.infsof.2010.12.006

Clarke D, Helvensteijn M, Schaefer I (2010) Abstract delta modeling. ACM SIGPLAN Not 46(2):13–22. https://doi.org/10.1145/1942788.1868298

Clements P, Northrop L (2002) Software product lines: practices and patterns. Addison-Wesley, Boston

Cmyrev A, Reissing R (2014) Efficient and effective testing of automotive software product lines. Appl Sci Eng Prog 7(2):53–57. https://doi.org/10.14416/j.ijast.2014.05.001

Cohen MB, Dwyer MB, Shi J (2006) Coverage and adequacy in software product line testing. In: ROSATEA’06, pp 53–63. https://doi.org/10.1145/1147249.1147257

Cohen MB, Dwyer MB, Shi J (2007) Interaction testing of highly-configurable systems in the presence of constraints. In: ISSTA’07, pp 129–139. https://doi.org/10.1145/1273463.1273482

Cruzes DS, Dybä T (2011) Research synthesis in software engineering: a tertiary study. Inf Softw Technol 53(5):440–455. https://doi.org/10.1016/j.infsof.2011.01.004

Czarnecki K, Eisenecker UW (2000) Generative programming: methods, tools and applications. Addison-Wesley, New York

Da Mota Silveira Neto PA, do Carmo Machado I, McGregor JD, De Almeida ES, de Lemos Meira SR (2011) A systematic mapping study of software product lines testing. Inf Softw Technol 53(5):407–423. https://doi.org/10.1016/j.infsof.2010.12.003

Denger C, Kolb R (2006) Testing and inspecting reusable product line components: First empirical results. In: ISESE’06, pp 184–193. https://doi.org/10.1145/1159733.1159762

Devroey X, Perrouin G, Legay A, Schobbens PY, Heymans P (2017) Dissimilar test case selection for behavioural software product line testing. In: SPLC’17, pp 1–9

do Carmo Machado I, McGregor JD, Cavalcanti YC, De Almeida ES (2014) On strategies for testing software product lines: a systematic literature review. Inf Softw Technol 56(10):1183–1199. https://doi.org/10.1016/j.infsof.2014.04.002

do Nascimento Ferreira T, Kuk JN, Pozo A, Vergilio SR (2016) Product selection based on upper confidence bound MOEA/D-DRA for testing software product lines. In: CEC’16, pp 4135–4142. https://doi.org/10.1109/CEC.2016.7744315

Dominka S, Mandl M, Dübner M, Ertl D (2018) Using combinatorial testing for distributed automotive features: Applying combinatorial testing for automated feature-interaction-testing. In: CCWC’18, pp 490–495. https://doi.org/10.1109/CCWC.2018.8301632

Drave I, Hillemacher S, Greifenberg T, Kriebel S, Kusmenko E, Markthaler M, Orth P, Salman KS, Richenhagen J, Rumpe B, Schulze C (2019) SMArDT modeling for automotive software testing. Softw Pract Exp 49(2):301–328. https://doi.org/10.1002/spe.2650

Dybå T, Dingsøyr T (2008) Empirical studies of agile software development: a systematic review. Inf Softw Technol 50(9–10):833–859. https://doi.org/10.1016/j.infsof.2008.01.006

Ebert R, Jolianis J, Kriebel S, Markthaler M, Pruenster B, Rumpe B, Salman KS (2019) Applying product line testing for the electric drive system. In: SPLC’19, pp 14–24. https://doi.org/10.1145/3336294.3336318

Engström E, Runeson P (2011) Software product line testing–A systematic mapping study. Inf Softw Technol 53(1):2–13. https://doi.org/10.1016/j.infsof.2010.05.011

Ensan F, Bagheri E, Gašević D (2012) Evolutionary search-based test generation for software product line feature models. In: CAiSE’12, pp 613–628. https://doi.org/10.1007/978-3-642-31095-9_40

Ergun B, Gebizli CŞ, Sözer H (2017) FORMAT: A tool for adapting test models based on feature models. In: COMPSAC’17, pp 66–71. https://doi.org/10.1109/COMPSAC.2017.134

Ferreira JM, Vergilio SR, Quináia MA (2013) A mutation approach to feature testing of software product lines. In: SEKE’13, pp 231–237

Ferreira JM, Vergilio SR, Quinaia MA (2017a) Software product line testing based on feature model mutation. Int J Softw Eng Knowl Eng 27(05):817–839. https://doi.org/10.1142/S0218194017500309

Ferreira TN, Vergilio SR, de Souza JT (2017b) Incorporating user preferences in search-based software engineering: a systematic mapping study. Inf Softw Technol 90:55–69. https://doi.org/10.1016/j.infsof.2017.05.003

Ferrer J, Chicano F, Alba E (2017) Hybrid algorithms based on integer programming for the search of prioritized test data in software product lines. In: EvoCOP’17, pp 3–19. https://doi.org/10.1007/978-3-319-55792-2_1

Ferrer J, Chicano F, Ortega-Toro JA (2021) CMSA algorithm for solving the prioritized pairwise test data generation problem in software product lines. J Heuristics 27(1):229–249. https://doi.org/10.1007/s10732-020-09462-w

Fischer S, Linsbauer L, Egyed A, Lopez-Herrejon RE (2018) Predicting higher order structural feature interactions in variable systems. In: ICSME’18, pp 252–263. https://doi.org/10.1109/ICSME.2018.00035

Fomel S (2002) Object management group: UML profile for schedulability, performance and time specification. OMG Doc 2(03):1–101

Fragal VH, Simao A, Endo AT, Mousavi MR (2017) Reducing the concretization effort in FSM-based testing of software product lines. In: ICSTW’17, pp 329–336. https://doi.org/10.1109/ICSTW.2017.61

Fragal VH, Simao A, Mousavi MR, Turker UC (2019) Extending HSI test generation method for software product lines. Comput J 62(1):109–129. https://doi.org/10.1093/comjnl/bxy046

Galindo JA, Alférez M, Acher M, Baudry B, Benavides D (2014) A variability-based testing approach for synthesizing video sequences. In: ISSTA’14, pp 293–303. https://doi.org/10.1145/2610384.2610411

Galindo JA, Turner H, Benavides D, White J (2016) Testing variability-intensive systems using automated analysis: an application to Android. Softw Qual J 24(2):365–405. https://doi.org/10.1007/s11219-014-9258-y

Gebizli CS, Sözer H (2016) Model-based software product line testing by coupling feature models with hierarchical markov chain usage models. In: QRS-C’16, pp 278–283. https://doi.org/10.1109/QRS-C.2016.42

Ghanam Y, Andreychuk D, Maurer F (2010) Reactive variability management in agile software development. In: 2010 Agile Conference, pp 27–34. https://doi.org/10.1109/AGILE.2010.6

Hajri I, Goknil A, Pastore F, Briand LC (2020) Automating system test case classification and prioritization for use case-driven testing in product lines. Empir Softw Eng 25(5):3711–3769. https://doi.org/10.1007/s10664-020-09853-4

Henard C, Papadakis M, Perrouin G, Klein J, Traon YL (2013) Multi-objective test generation for software product lines. In: SPLC’13, pp 62–71. https://doi.org/10.1145/2491627.2491635

Henard C, Papadakis M, Perrouin G, Klein J, Heymans P, Le Traon Y (2014a) Bypassing the combinatorial explosion: using similarity to generate and prioritize t-wise test configurations for software product lines. IEEE Trans Softw Eng 40(7):650–670. https://doi.org/10.1109/TSE.2014.2327020

Henard C, Papadakis M, Traon YL (2014b) Mutation-based generation of software product line test configurations. In: SSBSE’14, pp 92–106. https://doi.org/10.1007/978-3-319-09940-8_7

Hentze M, Pett T, Sundermann C, Krieter S, Thüm T, Schaefer I (2022) Generic Solution-Space Sampling for Multi-domain Product Lines. In: GPCE’22, pp 135–147. https://doi.org/10.1145/3564719.3568695

Hervieu A, Baudry B, Gotlieb A (2011) PACOGEN: Automatic generation of pairwise test configurations from feature models. In: ISSRE’11, pp 120–129. https://doi.org/10.1109/ISSRE.2011.31

Hervieu A, Marijan D, Gotlieb A, Baudry B (2016) Practical minimization of pairwise-covering test configurations using constraint programming. Inf Softw Technol 71:129–146. https://doi.org/10.1016/j.infsof.2015.11.007

Hierons RM, Li M, Liu X, Parejo JA, Segura S, Yao X (2020) Many-objective test suite generation for software product lines. ACM Trans Softw Eng Methodol 29(1):1–46. https://doi.org/10.1145/3361146

Ibias A, Llana L, Núñez M (2022) Using ant colony optimisation to select features having associated costs. In: ICTSS’22, pp 106–122. https://doi.org/10.1007/978-3-031-04673-5_8

Jackson D (2012) Software abstractions: logic, language, and analysis. MIT Press

Jakubovski Filho HL, Ferreira TN, Vergilio SR (2018) Incorporating user preferences in a software product line testing hyper-heuristic approach. In: CEC’18, pp 1–8. https://doi.org/10.1109/CEC.2018.8477803

Jakubovski Filho HL, Ferreira TN, Vergilio SR (2019) Preference based multi-objective algorithms applied to the variability testing of software product lines. J Syst Softw 151:194–209. https://doi.org/10.1016/j.jss.2019.02.028

Jaring M, Krikhaar RL, Bosch J (2008) Modeling variability and testability interaction in software product line engineering. In: ICCBSS’08, pp 120–129. https://doi.org/10.1109/ICCBSS.2008.9

Johansen MF, Haugen Ø, Fleurey F (2011) Properties of realistic feature models make combinatorial testing of product lines feasible. In: MODELS’11, pp 638–652. https://doi.org/10.1007/978-3-642-24485-8_47

Jorgensen PC (2013) Software testing: a craftsman’s approach. Auerbach Publications

Jung P, Kang S, Lee J (2019) Automated code-based test selection for software product line regression testing. J Syst Softw 158:110419. https://doi.org/10.1016/j.jss.2019.110419

Jung P, Kang S, Lee J (2020) Efficient regression testing of software product lines by reducing redundant test executions. Appl Sci 10(23):8686. https://doi.org/10.3390/app10238686

Jung P, Kang S, Lee J (2022) Reducing redundant test executions in software product line testing—A case study. Electronics 11(7):1165. https://doi.org/10.3390/electronics11071165

Käkölä T, Dueñas JC (2006) Research issues in software product lines—Engineering and management. Springer, Heidelberg

Kang S, Baek H, Kim J, Lee J (2015) Systematic software product line test case derivation for test data reuse. In: COMPSAC’15, pp 433–440. https://doi.org/10.1109/COMPSAC.2015.174

Kim CH, Batory DS, Khurshid S (2011) Reducing combinatorics in testing product lines. In: AOSD’11, pp 57–68. https://doi.org/10.1145/1960275.1960284

Kim CH, Khurshid S, Batory D (2012) Shared execution for efficiently testing product lines. In: ISSRE’12, pp 221–230. https://doi.org/10.1109/ISSRE.2012.23

Kim CH, Marinov D, Khurshid S, Batory D, Souto S, Barros P, d’Amorim M (2013) SPLat: Lightweight dynamic analysis for reducing combinatorics in testing configurable systems. In: ESEC/FSE’13, pp 257–267. https://doi.org/10.1145/2491411.2491459

Kitchenham B, Charters S (2007) Guidelines for performing systematic literature reviews in software engineering. Technical Report, Keele University and Durham University

Kitchenham B, Budgen D, Brereton P (2016) Evidence-based software engineering and systematic reviews. CRC Press

Krieter S, Thüm T, Schulze S, Saake G, Leich T (2020) YASA: Yet another sampling algorithm. In: VaMoS’20, pp 1–10. https://doi.org/10.1145/3377024.3377042

Kuhn DR, Wallace DR, Gallo AM (2004) Software fault interactions and implications for software testing. IEEE Trans Softw Eng 30(6):418–421. https://doi.org/10.1109/TSE.2004.24

Kumar S (2016) Test case prioritization techniques for software product line: A survey. In: ICCCA, pp 884–889. https://doi.org/10.1109/CCAA.2016.7813841

Lachmann R, Lity S, Lischke S, Beddig S, Schulze S, Schaefer I (2015) Delta-oriented test case prioritization for integration testing of software product lines. In: SPLC’15, pp 81–90. https://doi.org/10.1145/2791060.2791073

Lachmann R, Lity S, Al-Hajjaji M, Fürchtegott F, Schaefer I (2016) Fine-grained test case prioritization for integration testing of delta-oriented software product lines. In: FOSD’16, pp 1–10. https://doi.org/10.1145/3001867.3001868

Lachmann R, Beddig S, Lity S, Schulze S, Schaefer I (2017) Risk-based integration testing of software product lines. In: VaMoS’17, pp 52–59. https://doi.org/10.1145/3023956

Lackner H, Thomas M, Wartenberg F, Weißleder S (2014) Model-based test design of product lines: Raising test design to the product line level. In: ICST’14, pp 51–60. https://doi.org/10.1109/ICST.2014.16

Lamancha BP, Polo M, Piattini M (2015) PROW: a pairwise algorithm with constraints, order and weight. J Syst Softw 99:1–19. https://doi.org/10.1016/j.jss.2014.08.005

Lee J, Hwang S (2019) Combinatorial test design using design-time decisions for variability. Int J Softw Eng Knowl Eng 29(08):1141–1158. https://doi.org/10.1142/S0218194019400138

Li Z, Harman M, Hierons RM (2007) Search algorithms for regression test case prioritization. IEEE Trans Softw Eng 33(4):225–237. https://doi.org/10.1109/TSE.2007.38

Li X, Wong WE, Gao R, Hu L, Hosono S (2018) Genetic algorithm-based test generation for software product line with the integration of fault localization techniques. Empir Softw Eng 23(1):1–51. https://doi.org/10.1007/s10664-016-9494-9

Lima JA, Mendonça WD, Vergilio SR, Assunção WK (2020) Learning-based prioritization of test cases in continuous integration of highly-configurable software. In: SPLC’20, pp 1–11. https://doi.org/10.1145/3382025.3414967

Lity S, Nieke M, Thüm T, Schaefer I (2019) Retest test selection for product-line regression testing of variants and versions of variants. J Syst Softw 147:46–63. https://doi.org/10.1016/j.jss.2018.09.090

Lochau M, Oster S, Goltz U, Schürr A (2012a) Model-based pairwise testing for feature interaction coverage in software product line engineering. Softw Qual J 20(3):567–604. https://doi.org/10.1007/s11219-011-9165-4

Lochau M, Schaefer I, Kamischke J, Lity S (2012b) Incremental model-based testing of delta-oriented software product lines. In: TAP’12, pp 67–82. https://doi.org/10.1007/978-3-642-30473-6_7

Lochau M, Lity S, Lachmann R, Schaefer I, Goltz U (2014) Delta-oriented model-based integration testing of large-scale systems. J Syst Softw 91:63–84. https://doi.org/10.1016/j.jss.2013.11.1096

Lopez-Herrejon RE, Javier Ferrer J, Chicano F, Haslinger EN, Egyed A, Alba E (2014) A parallel evolutionary algorithm for prioritized pairwise testing of software product lines. In: GECCO’14, pp 1255–1262. https://doi.org/10.1145/2576768.2598305

Lopez-Herrejon RE, Fischer S, Ramler R, Egyed A (2015) A first systematic mapping study on combinatorial interaction testing for software product lines. In: ICSTW’15, pp 1–10. https://doi.org/10.1109/ICSTW.2015.7107435

Luo L (2001) Software testing techniques. Institute for Software Research International Carnegie Mellon University. Pittsburgh PA 15232(19):1–19

Luo G, Petrenko A, Bochmann GV (1995) Selecting test sequences for partially-specified nondeterministic finite state machines. In: IFIP WG, pp 95–110. https://doi.org/10.1007/978-0-387-34883-4_6

Luthmann L, Gerecht T, Stephan A, Bürdek J, Lochau M (2019a) Minimum/maximum delay testing of product lines with unbounded parametric real-time constraints. J Syst Softw 149:535–553. https://doi.org/10.1016/j.jss.2018.12.028

Luthmann L, Gerecht T, Lochau M (2019b) Sampling strategies for product lines with unbounded parametric real-time constraints. Int J Softw Tools Technol Transf 21(6):613–633. https://doi.org/10.1007/s10009-019-00532-4

Marijan D, Gotlieb A, Sen S, Hervieu A (2013) Practical pairwise testing for software product lines. In: SPLC’13, pp 227–235. https://doi.org/10.1145/2491627.2491646

Marijan D, Liaaen M, Gotlieb A, Sen S, Ieva C (2017) Titan: Test suite optimization for highly configurable software. In: ICST’17, pp 524–531. https://doi.org/10.1109/ICST.2017.60

Markiegi U, Arrieta A, Sagardui G, Etxeberria L (2017) Search-based product line fault detection allocating test cases iteratively. In: SPLC’17, pp 123–132. https://doi.org/10.1145/3106195.3106210

Markiegi U, Arrieta A, Etxeberria L, Sagardui G (2019) Test case selection using structural coverage in software product lines for time-budget constrained scenarios. In: SAC’19, pp 2362–2371. https://doi.org/10.1145/3297280.3297512

Matnei Filho RA, Vergilio SR (2016) A multi-objective test data generation approach for mutation testing of feature models. J Softw Eng Res Dev 4(1):1–29. https://doi.org/10.1186/s40411-016-0030-9

Mayring P (2014) Qualitative Content Analysis: Theoretical Foundation, Basic Procedures and Software Solution. Klagenfurt. Available at Social Science Open Access Repository (SSOAR) https://nbn-resolving.de/urn:nbn:de:0168-ssoar-395173 (accessed 04 June 2024)

McGregor JD (2001) Testing a software product line. Technical Report, Carnegie Mellon University

Mendes E, Wohlin C, Felizardo K, Kalinowski M (2020) When to update systematic literature reviews in software engineering. J Syst Softw 167:110607. https://doi.org/10.1016/j.jss.2020.110607

Mishra S (2006) Specification based software product line testing: A case study. In: CS&P’06, pp 243–254

Nebut C, Pickin S, Le Traon Y, Jézéquel JM (2003) Automated requirements-based generation of test cases for product families. In: ASE’03, pp 263–266. https://doi.org/10.1109/ASE.2003.1240317

Nebut C, Traon YL, Jézéquel JM (2006) System testing of product lines: from requirements to test cases. In: Käköla T, Duenas JC (eds) Software Product lines. Springer, Berlin, Heidelberg, pp 447–477. https://doi.org/10.1007/978-3-540-33253-4_12

Neto PA, do Carmo Machado I, Cavalcanti YC, De Almeida ES, Garcia VC, de Lemos Meira SR (2010) A regression testing approach for software product lines architectures. In: SBCARS’10, pp 41–50. https://doi.org/10.1109/SBCARS.2010.14

Nguyen QL (2009) Non-functional requirements analysis modeling for software product lines. In: ICSE’09, pp 56–61. https://doi.org/10.1109/MISE.2009.5069898

Northrop L, Clements P, Bachmann F, Bergey J, Chastek G, Cohen S, Donohoe P, Jones L, Krut R, Little R (2007) A framework for software product line practice, version 5.0. Technical report, Carnegie Mellon University

Olimpiew EM, Gomaa H (2009) Reusable model-based testing. In: ICSR’09, pp 76–85. https://doi.org/10.1007/978-3-642-04211-9_8

Oster S, Markert F, Ritter P (2010) Automated incremental pairwise testing of software product lines. In: SPLC’10, pp 196–210. https://doi.org/10.1007/978-3-642-15579-6_14

Parejo JA, Sánchez AB, Segura S, Ruiz-Cortés A, Lopez-Herrejon RE, Egyed A (2016) Multi-objective test case prioritization in highly configurable systems: a case study. J Syst Softw 122:287–310. https://doi.org/10.1016/j.jss.2016.09.045

Parra C, Giral L, Infante A, Cortés C (2012) Extractive SPL adoption using multi-level variability modeling. In: SPLC’12, pp 99–106. https://doi.org/10.1145/2364412.2364429

Patel S, Gupta P, Shah V (2013) Combinatorial interaction testing with multi-perspective feature models. In: ICSTW’13, pp 321–330. https://doi.org/10.1109/ICSTW.2013.43

Pérez B, Polo M, Piatini M (2009) Software product line testing-A systematic review. In: ICSOFT’09, pp 1–8

Perrouin G, Sen S, Klein J, Baudry B, Le Traon Y (2010) Automated and scalable t-wise test case generation strategies for software product lines. In: ICST’10, pp 459–468. https://doi.org/10.1109/ICST.2010.43

Petry KL, OliveiraJr E, Zorzo AF (2020) Model-based testing of software product lines: mapping study and research roadmap. J Syst Softw 167:110608. https://doi.org/10.1016/j.jss.2020.110608

Pett T, Eichhorn D, Schaefer I (2020) Risk-based compatibility analysis in automotive systems engineering. In: MODELS’20, pp 1–10. https://doi.org/10.1145/3417990.3421263

Pohl K, Metzger A (2006) Software product line testing. Commun ACM 49(12):78–81. https://doi.org/10.1145/1183236.1183271

Pohl K, Böckle G, Van Der Linden F (2005) Software product line engineering: foundations, principles, and techniques. Springer Berlin, Heidelberg. https://doi.org/10.1007/3-540-28901-1

Qian Y, Zhang C, Wang F (2018) Selecting products for high-strength t-wise testing of software product line by multi-objective method. In: PIC’18, pp 370–378. https://doi.org/10.1109/PIC.2018.8706270

Reis S, Metzger A, Pohl K (2006) A reuse technique for performance testing of software product lines. In: SPLiT’06, pp 5–10

Reis S, Metzger A, Pohl K (2007) Integration testing in software product line engineering: a model-based technique. In: FASE’07, pp 321–335. https://doi.org/10.1007/978-3-540-71289-3_25

Reuling D, Bürdek J, Rotärmel S, Lochau M, Kelter U (2015) Fault-based product-line testing: Effective sample generation based on feature-diagram mutation. In: SPLC’15, pp 131–140. https://doi.org/10.1145/2791060.2791074

Reuys A, Kamsties E, Pohl K, Reis S (2005) Model-based system testing of software product families. In: CAiSE’05, pp 519–534. https://doi.org/10.1007/11431855_36

Reuys A, Reis S, Kamsties E, Pohl K (2006) The scented method for testing software product lines. In: SPLC’06, pp 479–520. https://doi.org/10.1007/978-3-540-33253-4_13

Rocha L, Machado I, Almeida E, Kästner C, Nadi S (2020) A semi-automated iterative process for detecting feature interactions. In: SBES’20, pp 778–787. https://doi.org/10.1145/3422392.3422418

Roggenbach M (2006) CSP-CASL—A new integration of process algebra and algebraic specification. Theor Comput Sci 354(1):42–71. https://doi.org/10.1016/j.tcs.2005.11.007

Rubin J, Chechik M (2013) A survey of feature location techniques. In: Reinhartz-Berger I, Sturm A, Clark T, Cohen S, Bettin J (eds) Domain Engineering. Springer, Berlin, Heidelberg, pp 29–58. https://doi.org/10.1007/978-3-642-36654-3_2

Safdar SA, Yue T, Ali S (2021) Recommending faulty configurations for interacting systems under test using multi-objective search. ACM Trans Softw Eng Methodol 30(4):1–36. https://doi.org/10.1145/3464939

Saini A, Rajkumar, Kumar S (2022) Software Product Line Testing—A Proposal of Distance-Based Approach. In: AISE’20, pp 187–198. https://doi.org/10.1007/978-981-16-8542-2_15

Sánchez AB, Segura S, Ruiz-Cortés A (2014) A comparison of test case prioritization criteria for software product lines. In: ICST’14, pp 41–50. https://doi.org/10.1109/ICST.2014.15

Schaefer I, Bettini L, Bono V, Damiani F, Tanzarella N (2010) Delta-oriented programming of software product lines. In: SPLC’10, pp 77–91. https://doi.org/10.1007/978-3-642-15579-6_6

Scheidemann KD (2006) Optimizing the selection of representative configurations in verification of evolving product lines of distributed embedded systems. In: SPLC’06, pp 75–84. https://doi.org/10.1109/SPLINE.2006.1691579

Shi J, Cohen MB, Dwyer MB (2012) Integration testing of software product lines using compositional symbolic execution. In: FASE’12, pp 270–284. https://doi.org/10.1007/978-3-642-28872-2_19

Sjoberg DI, Dyba T, Jorgensen M (2007) The future of empirical methods in software engineering research. In: FOSE’07, pp 358–378. https://doi.org/10.1109/FOSE.2007.30

Soe NT, Wild N, Tanachutiwat S, Lichter H (2022) Design and implementation of a test automation framework for configurable devices. In: APIT’22, pp 200–207. https://doi.org/10.1145/3512353.3512383

Souto S, d’Amorim M (2018) Time-space efficient regression testing for configurable systems. J Syst Softw 137:733–746. https://doi.org/10.1016/j.jss.2017.08.010

Souto S, d’Amorim M, Gheyi R (2017) Balancing soundness and efficiency for practical testing of configurable systems. In: ICSE’17, pp 632–642. https://doi.org/10.1109/ICSE.2017.64

Steffens M, Oster S, Lochau M, Fogdal T (2012) Industrial evaluation of pairwise SPL testing with MoSo-PoLiTe. In: VaMoS’12, pp 55–62. https://doi.org/10.1145/2110147.2110154

Stricker V, Metzger A, Pohl K (2010) Avoiding redundant testing in application engineering. In: SPLC’10, pp 226–240. https://doi.org/10.1007/978-3-642-15579-6_16

Strickler A, Lima JA, Vergilio SR, Pozo AT (2016) Deriving products for variability test of feature models with a hyper-heuristic approach. Appl Soft Comput 49:1232–1242. https://doi.org/10.1016/j.asoc.2016.07.059

Tevanlinna A, Taina J, Kauppinen R (2004) Product family testing: a survey. ACM SIGSOFT Softw Eng Notes 29(2):12–12. https://doi.org/10.1145/979743.979766

Tuglular T, Coşkun DE (2021) Behavior-driven development of software product lines. In: DSA’21, pp 230–239. https://doi.org/10.1109/DSA52907.2021.00035

Tuglular T, Beyazıt M, Öztürk D (2019) Featured event sequence graphs for model-based incremental testing of software product lines. In: COMPSAC’19, pp 197–202. https://doi.org/10.1109/COMPSAC.2019.00035

Uzuncaova E, Khurshid S, Batory D (2010) Incremental test generation for software product lines. IEEE Trans Softw Eng 36(3):309–322. https://doi.org/10.1109/TSE.2010.30

Varshosaz M, Al-Hajjaji M, Thüm T, Runge T, Mousavi MR, Schaefer I (2018) A classification of product sampling for software product lines. In: SPLC’18, pp 1–13. https://doi.org/10.1145/3233027.3233035

Vazirani VV (2001) Approximation algorithms. Springer, Berlin. https://doi.org/10.1007/978-3-662-04565-7

Vidács L, Horváth F, Mihalicza J, Vancsics B, Beszédes Á (2015) Supporting software product line testing by optimizing code configuration coverage. In: ICSTW’15, pp 1–7. https://doi.org/10.1109/ICSTW.2015.7107478

Vidal Silva C, Galindo Duarte JÁ, Benavides Cuevas DF (2020) Functional testing of conflict detection and diagnosis tools in feature model configuration: a test suite design. In: ConfWS’20, pp 17–24

Wang S, Buchmann D, Ali S, Gotlieb A, Pradhan D, Liaaen M (2014) Multi-objective test prioritization in software product line testing: an industrial case study. In: SPLC’14, pp 32–41. https://doi.org/10.1145/2648511.2648515

Wang S, Ali S, Gotlieb A (2015) Cost-effective test suite minimization in product lines using search techniques. J Syst Softw 103:370–391. https://doi.org/10.1016/j.jss.2014.08.024

Wang S, Ali S, Gotlieb A, Liaaen M (2017) Automated product line test case selection: industrial case study and controlled experiment. Softw Syst Model 16(2):417–441. https://doi.org/10.1007/s10270-015-0462-4

Webster J, Watson RT (2002) Analyzing the past to prepare for the future: writing a literature review. MIS Q 26(2):xiii–xxiii

Weiss DM (2008) The product line hall of fame. In: SPLC’08, pp 39. https://doi.org/10.1109/SPLC.2008.56

Wohlin C, Höst M, Henningsson K (2003) Empirical research methods in software engineering. Empirical methods and studies in Software Engineering-experiences. Springer, Berlin, Heidelberg, pp 7–23. https://doi.org/10.1007/978-3-540-45143-3_2

Xiang Y, Huang H, Zhou Y, Li S, Luo C, Lin Q, Yang X (2022) Search-based diverse sampling from real-world software product lines. In: ICSE’22, pp 1945–1957. https://doi.org/10.1145/3510003.3510053

Xu Z, Cohen MB, Motycka W, Rothermel G (2013) Continuous test suite augmentation in software product lines. In: SPLC’13, pp 52–61. https://doi.org/10.1145/2491627.2491650

Yan L, Hu W, Han L (2019) Optimize SPL test cases with adaptive simulated annealing genetic algorithm. In: ACM TURC’19, pp 1–7. https://doi.org/10.1145/3321408.3326676

Yu L, Duan F, Lei Y, Kacker RN, Kuhn DR (2014) Combinatorial test generation for software product lines using minimum invalid tuples. In: HASE’14, pp 65–72. https://doi.org/10.1109/HASE.2014.18

Zhang L, Tian JH, Jiang J, Liu YJ, Pu MY, Yue T (2018) Empirical research in software engineering-A literature survey. JCST 33(5):876–899. https://doi.org/10.1007/s11390-018-1864-x

This work was funded by the Ministerium für Wissenschaft, Forschung und Kunst Baden-Württemberg in the Innovation Campus Mobility of the Future, projects SWUpCar and TESSOF.

Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and Affiliations

Institute of Software Engineering, University of Stuttgart, Stuttgart, Germany

Halimeh Agh, Aidin Azamnouri & Stefan Wagner

TUM School of Communication, Information and Technology, Technical University of Munich, Heilbronn, Germany

Stefan Wagner

Corresponding authors

Correspondence to Halimeh Agh , Aidin Azamnouri or Stefan Wagner .

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by Sven Apel.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Table 17 shows the results of the evaluation based on the quality assessment criteria described in Table 14 in Appendix B. Regarding the issue of Reporting (QA1-QA3 in Table 14), most of the studies performed well; all studies are based on research, and almost 82% of them state the aims of the research clearly. However, the description of the context is poor in roughly 30% of the studies; this compromises their validity since, without sufficient information about the subjects of a study, it is difficult to judge whether the selected case is suitable for evaluating the different aspects of the proposed approach.

In terms of Rigor (QA4-QA7), the studies performed fairly well on average. In 77 studies (∼ 62%), the researchers justified the research design so that it can address the aims of the research. In 71 studies (∼ 60%), the proposed approach was compared with a baseline approach, and the researchers tried to justify that the selected controls are representative of a defined population. The way data was collected is satisfactory in 85 studies (∼ 72%), since the researchers clearly defined the selected measure(s) and justified their selection. Furthermore, the data was analyzed rigorously in 80 studies (68%), which provide sufficient data to support the findings. Although these findings are promising, 32% of the studies overall fail in rigor; this compromises their validity and usefulness since failing in rigor, a key issue in Evidence-Based Software Engineering, indicates that the empirical methods were applied in an informal way.

Regarding the issue of Credibility, 95% of the studies provide a clear statement of the findings (QA9) by discussing the findings in relation to the research questions and presenting the limitations of the study. However, most studies perform poorly in considering the relationship between the researcher(s) and the participants and how the data were collected to address the research issue (QA8); this quality attribute is considered in only 12 studies (∼ 10%). This can threaten the quality of the research, because potential bias and influence of the researcher(s) during the formulation of research questions, data collection, analysis, and selection of data for presentation are not taken into account.

In terms of Relevance, 114 studies (∼ 97%) explicitly deal with SPL testing, discuss the contributions the study makes to existing knowledge, identify new areas in which research is needed, and discuss the ways in which the research can be used (QA10). This result is in line with the nature of the research goals, described as inclusion and exclusion criteria in Sect. 3.2. However, only 18 studies (∼ 15%) present practitioner-based guidelines (QA11). This indicates that the SPL testing field needs more practical guidance to strengthen industry adoption.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Agh, H., Azamnouri, A. & Wagner, S. Software product line testing: a systematic literature review. Empir Software Eng 29 , 146 (2024). https://doi.org/10.1007/s10664-024-10516-x

Accepted : 19 June 2024

Published : 02 September 2024

DOI : https://doi.org/10.1007/s10664-024-10516-x

  • Software product lines
  • Software testing
  • Software quality
  • Systematic literature review

Use of social network analysis in health research: a scoping review protocol

BMJ Open, Volume 14, Issue 5

  • Eshleen Grewal 1 ,
  • Jenny Godley 2 , 3 , 4 ,
  • Justine Wheeler 5 ,
  • http://orcid.org/0000-0001-9008-2289 Karen L Tang 1 , 3 , 4
  • 1 Department of Medicine , University of Calgary , Calgary , Alberta , Canada
  • 2 Department of Sociology , University of Calgary , Calgary , Alberta , Canada
  • 3 Department of Community Health Sciences , University of Calgary , Calgary , Alberta , Canada
  • 4 O’Brien Institute for Public Health , University of Calgary , Calgary , Alberta , Canada
  • 5 Libraries and Cultural Resources , University of Calgary , Calgary , Alberta , Canada
  • Correspondence to Dr Karen L Tang; klktang@ucalgary.ca

Introduction Social networks can affect health beliefs, behaviours and outcomes through various mechanisms, including social support, social influence and information diffusion. Social network analysis (SNA), an approach which emerged from the relational perspective in social theory, has been increasingly used in health research. This paper outlines the protocol for a scoping review of literature that uses social network analytical tools to examine the effects of social connections on individual non-communicable disease and health outcomes.

Methods and analysis This scoping review will be guided by Arksey and O’Malley’s framework for conducting scoping reviews. A search of the electronic databases, Ovid Medline, PsycINFO, EMBASE and CINAHL, will be conducted in April 2024 using terms related to SNA. Two reviewers will independently assess the titles and abstracts, then the full text, of identified studies to determine whether they meet inclusion criteria. Studies that use SNA as a tool to examine the effects of social networks on individual physical health, mental health, well-being, health behaviours, healthcare utilisation, or health-related engagement, knowledge, or trust will be included. Studies examining communicable disease prevention, transmission or outcomes will be excluded. Two reviewers will extract data from the included studies. Data will be presented in tables and figures, along with a narrative synthesis.

Ethics and dissemination This scoping review will synthesise data from articles published in peer-reviewed journals. The results of this review will map the ways in which SNA has been used in non-communicable disease health research. It will identify areas of health research where SNA has been heavily used and where future systematic reviews may be needed, as well as areas of opportunity where SNA remains a lesser-used method in exploring the relationship between social connections and health outcomes.

  • Protocols & guidelines
  • Social Interaction
  • Social Support

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:  http://creativecommons.org/licenses/by-nc/4.0/ .

https://doi.org/10.1136/bmjopen-2023-078872


STRENGTHS AND LIMITATIONS OF THIS STUDY

This is a novel scoping review that fills an important gap—how and where social network analysis (SNA) (as a data collection and analytical tool) has been used in health research has not been systematically documented despite its increasing use in the discipline.

The breadth of the scoping review allows for a comprehensive mapping of the use of SNA to examine social connections and non-communicable disease and health outcomes, without limiting to any one population group or setting.

The use of the Arksey and O’Malley framework as well as the Levac et al recommendations to guide our scoping review will ensure that a rigorous and transparent process is undertaken.

Due to the scope of the review and the large volume of anticipated studies, only published articles in the English language will be included.

Introduction

Social connections are known to influence health. 1 People with many supportive social connections tend to be healthier and live longer than people who have fewer supportive social connections, while social isolation, or the absence of supportive social connections, is associated with the deterioration of physical and psychological health, and even death. 2–5 These associations hold even when accounting for socioeconomic status and health practices. 6 Additionally, having a low quantity of supportive social connections is associated with the development or worsening of medical conditions, such as atherosclerosis, hypertension, cardiovascular disease and cancer, potentially through chronic inflammation and changes to autonomic regulation and immune responses. 7–13 Unsupportive social connections can also have adverse effects on health due to emotional stress, which can then lead to poor health habits, psychological distress and negative physiological responses (eg, increased heart rate and blood pressure), all of which are detrimental to health over time. 14 The health of individuals is therefore connected to the people around them. 15

Social networks can influence health via five pathways. 15 16 First, networks can provide social support, to meet the needs of the individual. Dyadic relationships can provide informational, instrumental (ie, aid and assistance with tangible needs), appraisal (ie, help with decision-making) and/or emotional support; this support can be enhanced or hindered by the overall network structure. 17 In addition to the tangible aid and resources that are provided, social support—either perceived or actual—also has direct effects on mental health, well-being and feelings of self-efficacy. 18–20 Social support may also act as a buffer to stress. 16 19 The second pathway by which social networks influence health, and in particular health behaviours such as alcohol and cigarette use, physical activity, food intake patterns and healthcare utilisation, is through social influence. 16 21 That is, the attitudes and behaviour of individuals are guided and altered in response to other network members. 22 23 Social influence is difficult to disentangle from social selection from an empirical standpoint. That is, similarities in behaviour may be due to influences within a network, or alternatively, they may reflect the known phenomenon where individuals tend to form close connections with others who are like them. 22 24 The third pathway is through the promotion of social engagement and participation. Individuals derive a sense of identity, value and meaning through the roles they play (eg, parental roles, community roles, professional roles, etc) in their networks, and the opportunities for participation in social contexts. 16 The fourth pathway by which networks affect health is through transmission of communicable diseases through person-to-person contact. Finally, social networks overlap, resulting in differential access to resources and opportunities (eg, finances, information and jobs). 15 16 An individual’s structural position can result in differential health outcomes, similar to the inequities that stem from differences in social status. 16

There has been an explosion of literature in the area of social networks and health. In their bibliometric analysis, Chapman et al found that the number of studies that examine social networks and health has sextupled since 2000. 25 Similarly, the value of grants and contracts in this topic area, as awarded by the National Science Foundation and the National Institutes of Health, has increased 10-fold. 25 A turning point in the field was the HIV epidemic, where there was an urgent need to better understand its spread. 25 The exponential rise in the number of studies since then that examine social networks and health appears to reflect a widespread understanding that an individual’s health cannot be isolated from his or her social networks and context. There is, however, significant heterogeneity in what aspect of, and how, social networks are being studied. For example, many health research studies use proxies for social connectedness such as marital status or living alone status (as these variables tend to be commonly included in health surveys), without considering the quality of those social connections, and without further exploring the broader social network and their characteristics. 16 26 These proxy measures do little to describe the structure, quantity, quality or characteristics of social connections within which individuals are embedded. Another common approach in health research is to focus on social support measures and their effects on health. Individuals are asked about perceived, or received, social support (for example, through questions that ask about the availability of people who provide emotional support, informational support and/or assistance with daily tasks, with either binary or a Likert scale of responses). 27 28 While important, social support measures do not assess the structure of social networks and represent only one of many different mechanisms by which social networks influence health. 17 23

Social network analysis

Social network analysis (SNA) is a methodological tool, developed in the 1930s by social psychologists, used to study the structure and characteristics of the social networks within which individuals are embedded. 16 29 It has evolved over the past 100 years and has been used by researchers in many social science disciplines to analyse how structures of relationships impact social life. 29 30 SNA has the following key properties 3 30 31 : (1) it relies on empirical relational data (ie, data on actors (nodes) and the connections (ties) between them); (2) it uses mathematical models and graph theory to examine the structure of relationships within which individual actors are embedded; and (3) it models social action at both the group and the individual level arising from the opportunities and constraints determined by the system of relationships. The premise of SNA is that social ties are both drivers and consequences of human behaviour, and are therefore the object of study. 15 16 23 32 Social networks are comprised of nodes, representing the members within a network, connected by ties, representing relations among those individuals. 33 There are two types of SNA: egocentric network analysis and whole network analysis. Egocentric network analysis describes the characteristics of an individual’s (ie, the ‘egos’) personal network, while whole network analysis examines the structure of relationships among all the individuals in a bounded group, such as a school or classroom. 3

In egocentric network analysis, a list of ‘alters’ (ie, nodes) to whom the ego is connected, is obtained through a name generator. Name generator questions ask for a list of alters based on role relations (eg, friends or family), affect (eg, people to whom the ego feels close), interaction (eg, people with whom the ego has been in contact) or exchange (eg, people who provide social and/or financial support). 34 These are followed by name interpreters, where the ego is asked questions about the characteristics of each named alter. 35 Analyses of these data involve constructing measures that describe these egocentric networks. Such measures include network size, network density (ie, how tightly knit the network is), the strength of relationships (ie, the intensity and duration of relationships between ego and alter), network function (ie, the resources and/or support provided through the network) and the diversity of relations within the network (‘heterogeneity’). 23 36 In whole network analysis, the network boundary is determined a priori and network members are known, for example, through membership lists or rosters. 37 Each network member is surveyed, to identify the other network members with whom they are connected and/or affiliated; attributes of each member are obtained through surveying the network members themselves. Variables are constructed at the individual and network levels. Individual-level measures include the number of ties to other network members (‘degree’), types of relationships, and the strength and diversity of relationships. Network-level measures include but are not limited to: density (representing how tightly knit or ‘glued’ together the network is), reciprocity (ie, the proportion of network ties that are reciprocated), isolates (ie, nodes with no ties to other network members), centralisation (or the extent to which the network ties are focused on one node or a set of nodes), cliques and equivalence (ie, sets of nodes that have the same pattern of ties and therefore occupy the same position in the network). 33 38 The constructed measures can then be included in statistical models to explore associations between individual and/or network-level measures, and outcomes. 33 39
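As a concrete illustration of these measures (this sketch is not part of the protocol), the fragment below uses Python and the open-source networkx library on a small hypothetical directed network; the node labels, ties and choice of package are assumptions made purely for illustration.

    import networkx as nx

    # A small, hypothetical whole network: nodes are members of a bounded group,
    # directed ties represent nominations (eg, 'whom do you go to for support?').
    G = nx.DiGraph()
    G.add_edges_from([
        ("A", "B"), ("B", "A"),   # a reciprocated tie
        ("A", "C"), ("C", "D"),
        ("D", "A"), ("E", "A"),
    ])
    G.add_node("F")               # an isolate: a member with no ties

    # Network-level measures
    print("density:", nx.density(G))             # how tightly knit the network is
    print("reciprocity:", nx.reciprocity(G))     # proportion of ties that are reciprocated
    print("isolates:", list(nx.isolates(G)))     # members with no ties to others

    # Individual-level measure: degree (number of ties) for each member
    print("degree:", dict(G.degree()))

    # Egocentric view: ego 'A', the alters connected to A, and the ties among them
    ego_net = nx.ego_graph(G, "A", undirected=True)
    print("ego network size (alters):", ego_net.number_of_nodes() - 1)
    print("ego network density:", nx.density(ego_net))

The same constructed measures (density, degree, reciprocity, and so on) are what would then be carried into statistical models as individual- or network-level variables.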

Study rationale

In medicine and health research, there has traditionally been a dichotomy between the individual and the context in which the individual is situated—such as in their relationships with others. 40 As such, epidemiology of diseases has historically focused on individual-level traditional risk and protective factors—such as biological markers, genetics, lifestyle and health behaviours, and psychological conditions. 41 While criticisms of this individualistic focus abound, attempts to develop and use different approaches in medicine and research have lagged behind. 42 The use and adoption of methods, like SNA, that frame issues of health and wellness differently, has the potential to offer new insights and solutions to clinical and healthcare delivery problems, 42 by more holistically considering ‘different levels of change’ beyond the individual. 41 We seek to examine the extent to which SNA has transcended the boundaries of its disciplines of origin in the social sciences, into health research. For example, while Chapman et al have clearly shown an explosion of publications at this intersection, 25 it remains unclear whether these studies use SNA tools (which were developed specifically to interrogate the nature and characteristics of social networks), or whether they suffer from the known problem of conflation of constructs like social support, social capital and social integration. 15 43 Many studies that report the impact of ‘social networks’ on health outcomes do not use SNA methods but rather use self-reported network size (without probing the network and its structure), 44 45 social support, 46 marital status 47 48 and/or household members 47 as proxies.

We will therefore undertake a scoping review to map the use of SNA as a data collection and analytical method in health research. More specifically, the scoping review will examine how SNA has been used to study associations across social networks and individual health and well-being (including both physical and psychological health), health knowledge, health engagement, health service use and health behaviours. Scoping reviews are a knowledge synthesis approach that aims to uncover the volume, range, reach and coverage of a body of literature on a specific topic. 49 They differ from systematic reviews, another type of knowledge synthesis, in their objectives. Systematic reviews seek to answer clinical or epidemiological questions and are conducted to fill gaps in knowledge. 50 Systematic reviews are used to establish the effectiveness of an intervention or associations between specific exposures and outcomes. On the other hand, scoping reviews do not seek to provide an answer to a question, but rather, aim to create a map of the existing literature. 49 They are used to provide clarity to the concepts and definitions used in literature, examine the way in which research is conducted in a specific field or on a specific topic, and uncover knowledge gaps. 49 A scoping review, therefore, is well suited as a research method to address our research question, of mapping the ways in which SNA has been used in health research. This scoping review can identify areas (eg, specific populations and specific health outcomes) where there has been a plethora of SNA research warranting future systematic reviews. It can also identify areas within health research where the use of SNA is scarce, highlighting topics, populations or outcomes for future study.

This scoping review will be limited to studies that use SNA in exploring network components and their associations with non-communicable diseases and health and well-being outcomes, for three reasons. The first is feasibility, given the large volume of studies anticipated, based on Chapman et al ’s bibliometric study on this topic. 25 Second, the use of SNA in understanding disease transmission of communicable diseases (such as sexually transmitted infections) is well established; its application to HIV was in fact one of the catalysts, as previously mentioned, to its broader uptake in health research. 25 Third, SNA in health research has shifted from focusing on communicable diseases to focusing on non-communicable diseases and their risk factors; SNA is now being applied much more frequently to the latter conditions than the former ones. 51

Methods and analysis

The scoping review will be informed by the framework developed by Arksey and O’Malley 52 for conducting scoping reviews, as well as the additional recommendations made by Levac et al . 53 Arksey and O’Malley’s framework recommends that the review process be organised into the following five steps: identifying the research question; identifying relevant studies; study selection; charting the data; and collating, summarising and reporting the results. 52 The reporting of this review will adhere to the Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews. 54

Patient and public involvement

No patients will be involved.

Step 1: identifying the research question

A preliminary search of the literature identified a gap related to SNA and how it has been used to study the relationship between social networks and individual well-being and health outcomes. This led to the development of the research question that will guide this scoping review: how have social network analytical tools been used to study the associations between social networks and individual patient health? In this case, SNA is defined as a data analysis technique that uses either an egocentric or whole network analysis approach. For egocentric network analysis, we will include studies that involve peer nomination (ie, use of a name generator) and the collection of one or more characteristics of alters (ie, use of name interpreter(s)).

Step 2: identifying relevant studies

A search strategy will be constructed through consultation with an academic librarian (JW). The main concepts from the research question will be used for a preliminary search in Google Scholar. Additionally, the lead authors will provide the librarian with key studies that will be text-mined for relevant terms. These key studies will include a variety of populations (across different countries and age groups) and health outcomes. 55–58 Key studies will be searched in Ovid MEDLINE for appropriate subject headings. In consultation with team members, the librarian (JW) will construct a pilot search strategy. A title/abstract/keyword search will be conducted in Ovid MEDLINE against the known seed/key studies. Table 1 lists example keywords and terms relating to social networks that will be used, with the full search strategy detailed in online supplemental appendix A .
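As a hedged illustration of what text-mining key studies for relevant terms might involve (the protocol does not specify a tool), the short Python sketch below counts word frequencies across placeholder seed-study abstracts; the abstract text, stopword list and function name are assumptions for illustration only.

    import re
    from collections import Counter

    # Placeholder abstracts standing in for the seed/key studies
    seed_abstracts = [
        "Egocentric social network analysis of older adults and self-rated health ...",
        "Whole network analysis of adolescent friendship ties and physical activity ...",
    ]

    STOPWORDS = {"of", "and", "the", "a", "an", "in", "on", "to", "for", "with"}

    def candidate_terms(texts, top_n=10):
        # Count word frequencies, ignoring stopwords and very short tokens
        words = []
        for text in texts:
            words += [w for w in re.findall(r"[a-z]+", text.lower())
                      if w not in STOPWORDS and len(w) > 3]
        return Counter(words).most_common(top_n)

    print(candidate_terms(seed_abstracts))

Frequent terms surfaced this way would still be reviewed by the team and the librarian before being added to the strategy.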


Table 1 Search terms relating to social network analysis

Because this search strategy retrieves a significant number of irrelevant articles about communicable diseases, we will exclude records that contain such terms in either the title or keyword fields. Table 2 lists the terms related to communicable diseases.

Table 2 Search terms relating to communicable diseases

Of note, the search strategy will not include terms that relate to health-related outcomes of interest (outside of excluding communicable diseases). Prior literature has shown that including outcome concepts in a search strategy reduces its recall and sensitivity. 59 60 This problem is further exacerbated when only generic health terms (for example, ‘morbidity’ or ‘health status’) or specific health terms (eg, specific diseases or conditions such as ‘diabetes mellitus’) are used. 61 Because the objective of this scoping review is to examine and map the use of SNA in health research, the outcomes of interest are very broad, including: physical health and well-being, psychological health and well-being, healthcare engagement, health knowledge, health behaviours, healthcare access and use, disease prevalence and outcomes (spanning every organ system), and mortality. No search strategy could be sufficiently comprehensive to capture all possible generic and specific terms relating to this broad range of outcomes. In keeping with recommendations to minimise the number of elements in a search strategy 62 —and in particular outcome elements 63 —our search strategy will entail searching for SNA terms in health databases without specifying health outcomes.
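The trade-off described here can be stated with the standard definitions of recall (sensitivity) and precision; the formulation below is generic and is not taken from the cited studies.

    % For a search retrieving R records, r of them relevant, out of N relevant records overall:
    \[
      \text{sensitivity (recall)} = \frac{r}{N}, \qquad \text{precision} = \frac{r}{R}
    \]
    % Adding an outcome element shrinks R and can also shrink r, so recall can fall
    % even if precision improves.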

The search strategy will first be created in Medline (Ovid), then translated and adapted for the databases: (1) EMBASE (Ovid), (2) APA PsycInfo (Ovid) and (3) CINAHL (EBSCO). A search will be completed in April 2024. No date filters will be applied to the search. However, animal-only studies will be excluded. The current version of the search strategy including limits and filters, for all databases, is included in online supplemental appendix A .
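For readers unfamiliar with Ovid syntax, a hypothetical fragment of what such a strategy could look like is sketched below; it is not the authors’ actual strategy (which appears in online supplemental appendix A), and the specific terms, field codes and line numbers are illustrative only.

    1. (social network* adj3 (analys* or structur*)).ti,ab,kf.
    2. (egocentric network* or whole network* or name generator*).ti,ab,kf.
    3. 1 or 2
    4. (HIV or tuberculosis or sexually transmitted infection*).ti,kf.
    5. 3 not 4
    6. 5 not (exp animals/ not humans.sh.)

Line 4 mirrors the exclusion of communicable disease terms in the title and keyword fields, and line 6 applies a commonly used exclusion for animal-only studies.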

Step 3: study selection

The criteria that will be used to determine which studies to include are as follows:

Studies that employ SNA as a data collection and/or analysis technique, as defined above. Of note, studies that elicit only the number of friends or other social contacts, without collection of any information about these social contacts, are not considered to be SNA and are therefore not included in the scoping review.

Studies that explore the social networks of individuals in whom the health outcome is measured.

Studies must include the exploration of non-communicable health outcomes. Examples include self-rated health or other global measures of health (including measures of physical health, mental health and well-being), health practices (eg, physical activity, dietary patterns, smoking, alcohol use, substance use), sexual and reproductive health, healthcare-seeking behaviours (eg, medication adherence, acute care use, attachment to a primary care provider), health knowledge, health beliefs, healthcare engagement, non-communicable disease prevalence and mortality.

The criteria that will be used to exclude studies are as follows:

Studies that explore the social networks of organisations or healthcare providers, rather than the social networks of the individual about whom the health outcome is measured or reported.

Studies that describe or use data analysis techniques other than SNA (eg, using proxies for social networks/social support that do not include peer nomination (such as marital status or living alone status), or studies where study participants report the number of social contacts but where no other information about each social contact is collected).

Studies that focus exclusively on online social networks (eg, social media, online forums, online support groups).

Studies related to prevention, transmission or outcomes of communicable diseases.

Non-English studies, for feasibility purposes.

We will not limit studies based on the study population or country in which the SNA is conducted. Studies in paediatric and adult populations will be included. The reasons for excluding SNA studies that focus solely on social media and online networks are twofold. First, we anticipate a very large number of articles, given the broad populations and outcomes of interest, and for feasibility purposes, we have needed to narrow the research objective to in-person and/or offline social networks only. Second, there are likely inherent differences between online and offline social networks. Individuals use health-related social networking sites and online networks primarily for information seeking, for connecting with others who share a similar lived experience while maintaining some emotional distance, and for interacting with health professionals 64 ; this differs from in-person networks, which individuals turn to more for emotional and tangible or instrumental support. Friends met in online networks differ from friends met in person in other important ways. They tend to have less similarity in terms of age, gender and place of residence, 65 and the network ties more commonly arise spontaneously—that is, without common acquaintances or affiliations. 66 The social patterns and interactions among individuals and their online network contacts are also different, with entire relationships built on text-based interactions. 66 Therefore, while online social networks are an important area of study, they appear to be inherently different from offline social networks and are excluded from this scoping review.

For the first step of the screening process, after removing duplicate articles, two reviewers will independently assess the titles and abstracts of the studies to determine whether they meet the inclusion criteria. Any studies that do not meet the inclusion criteria will be excluded from the review. Studies that either one of the two reviewers feels are potentially relevant will be included in the full-text review, to ensure that no article is prematurely excluded at this stage. During the second step of the screening process, two reviewers will independently review the full texts of the studies to ensure they meet the inclusion criteria. Conflicts will be resolved by third and fourth reviewers with expertise in SNA (JG) and health outcomes (KLT). The number of studies included in each step of the screening process will be reported using the Preferred Reporting Items for Systematic reviews and Meta-Analyses diagram. 67

Step 4: charting the data

A data charting document ( online supplemental appendix B ) will be created to extract data from the studies in the review. This document will include information about the authors, year of publication, study location, study population characteristics, outcomes of interest to this scoping review, and the scales and measures used for each outcome. Data about the social network analytical method will also be extracted, including whether studies used egocentric versus whole networks, the name generator used (in egocentric network studies) or the relationship being explored, the maximum number of peer nominations allowed, the lookback period used, whether (and which) alter attributes were collected, and whether alter-to-alter tie data were collected. Data extraction will be performed by at least one reviewer, with a second reviewer separately checking and confirming the inputted data. Disagreements in data extraction will be resolved through a consensus, and through the input of reviewers with content and methods expertise (KLT, JG).
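A hedged sketch of what one extraction record in such a charting document might look like is given below (in Python, for concreteness); the field names are hypothetical and do not reproduce the authors’ actual form in online supplemental appendix B.

    charting_record = {
        "authors": "",
        "publication_year": None,
        "study_location": "",
        "population_characteristics": "",
        "outcomes_and_measures": [],        # outcomes of interest plus the scales/measures used
        "network_type": "",                 # 'egocentric' or 'whole'
        "name_generator_or_relationship": "",
        "max_peer_nominations": None,
        "lookback_period": "",
        "alter_attributes_collected": [],
        "alter_alter_ties_collected": False,
    }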

Step 5: collating, summarising and reporting results

The results of the review will be presented in the form of figures and tables and will include descriptive numerical summaries. The numerical summary will include information about the number of studies included in the review, where the studies were conducted, when they were published and characteristics of the populations, such as the sample sizes and mean age. It will also include characteristics of the SNA conducted in these studies, including the number that are whole network studies versus egocentric network studies, the data sources used and the attributes of the social connections that are collected and analysed. Results will be synthesised in text, as well as through tables and figures.
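As an illustration only (the protocol does not name an analysis tool), the descriptive numerical summary could be produced along these lines with the pandas library, assuming extraction records like the hypothetical one sketched above.

    import pandas as pd

    # Two made-up records standing in for the extracted studies
    df = pd.DataFrame([
        {"publication_year": 2015, "country": "Canada", "network_type": "egocentric", "sample_size": 120},
        {"publication_year": 2019, "country": "UK", "network_type": "whole", "sample_size": 480},
    ])

    print(len(df), "studies included")
    print(df["network_type"].value_counts())            # egocentric vs whole network studies
    print(df["publication_year"].describe())            # when the studies were published
    print(df.groupby("country")["sample_size"].mean())  # sample sizes by study location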

Ethics and dissemination

This review does not require ethics approval. Data will be extracted from published material. Once the scoping review is complete, an article will be written to convey the findings of this review, and it will be submitted for publication in a peer-reviewed journal. We anticipate the results of this review will map out the ways in which SNA has been used in health research. Specifically, this scoping review will identify areas of potential saturation where SNA has been heavily used, opportunities for future systematic reviews (where there is a large body of primary research studies requiring synthesis) and health research gaps (eg, the health outcomes where SNA has been minimally used). The scoping review will also shed light on characteristics of SNA that have been used (eg, whether egocentric networks vs whole networks are used and in what settings, and whether a broad range of social network characteristics are captured and analysed), which will serve to inform the conduct of future SNA studies in health research.

Ethics statements

Patient consent for publication.

Not applicable.

References: the numbered references cited in the text are listed in full in the published article (https://doi.org/10.1136/bmjopen-2023-078872).

Supplementary materials

Supplementary data.

This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

  • Data supplement 1
  • Data supplement 2

Contributors KLT and JG conceived of the study protocol. KLT, JG, EG and JW developed and revised the study protocol, the search strategy and the inclusion/exclusion criteria. EG and KLT drafted the protocol manuscript, and all authors provided critical revisions.

Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

Competing interests None declared.

Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

Provenance and peer review Not commissioned; externally peer reviewed.

Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.

