Value of composite reference standards in diagnostic research

  • Christiana A Naaktgeboren, PhD fellow,
  • Loes C M Bertens, PhD fellow,
  • Maarten van Smeden, PhD fellow,
  • Joris A H de Groot, assistant professor,
  • Karel G M Moons, professor,
  • Johannes B Reitsma, associate professor
  • 1 Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Universiteitsweg 100, 3584 CG Utrecht, Netherlands
  • Correspondence to: C A Naaktgeboren, c.naaktgeboren@umcutrecht.nl
  • Accepted 11 September 2013

Combining several tests is a common way to improve the final classification of disease status in diagnostic accuracy studies, but the approach is often applied and reported ambiguously. This article gives advice on the proper use and reporting of composite reference standards

A common challenge in diagnostic studies is to obtain a correct final diagnosis in all participants. Ideally, a single error-free reference test, known as a gold standard, is used to determine the final diagnosis 1 and estimate the accuracy of the test or diagnostic model under evaluation. If the reference standard does not perfectly correspond to true target disease status, estimates of the accuracy of the test or model under study (the index test), such as sensitivity, specificity, predictive values, or area under the curve, can be biased. 2 This is known as imperfect reference standard bias. One method to reduce this bias is to use a fixed rule to combine results of several imperfect tests into a composite reference standard. 3 When the combination of several component tests provides a better perspective on disease than any of the individual tests alone, accuracy estimates of the index test will be less biased than if only one imperfect test is used as the reference standard. Comparing the index test against each component test separately and then averaging the accuracy estimates is not recommended; it is better to combine the component tests, in a considered way, into a single composite reference standard.

The hallmark of composite reference standards is that each combination of test results leads to a particular final diagnosis; in its simplest form, disease present or absent. For example, in a study on the accuracy of a rapid antigen test for detecting trichomoniasis, researchers decided against using the traditional gold standard of culture because it probably misses some cases. 4 As they believed that microscopy picks up additional true cases, they instead considered patients as diseased if either microscopy or culture results were abnormal. Table 1 gives further examples.

Table 1 Examples of composite reference standards


Although the choice of component tests and the rules used to combine them affect the estimates of accuracy of the test under study, 7 little guidance exists on how to develop and define a composite reference standard. Additionally, there is a lack of consensus on how the term composite reference standard is used, and reporting of results is generally poor. To address these problems, we explain the methods behind composite reference standards and make recommendations for their development and reporting.

What is a composite reference standard?

A composite reference standard is a fixed rule used to make a final diagnosis based on the results of two or more tests, referred to as component tests. For each possible pattern of component test results (test profiles), a decision is made about whether it reflects presence or absence of the target disease.
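To make this concrete, a composite reference standard can be written as an explicit lookup table from every component test profile to a final diagnosis. The following minimal sketch (hypothetical Python, not taken from the article) uses two dichotomous tests, loosely modelled on the culture and microscopy example above, with patients classed as diseased if either test is positive.

# Hypothetical sketch: a composite reference standard as a fixed lookup table
# from component test profiles to a final diagnosis (tests and rule invented).
COMPOSITE_RULE = {
    # (culture_positive, microscopy_positive) -> final diagnosis
    (True, True): "diseased",
    (True, False): "diseased",
    (False, True): "diseased",
    (False, False): "non-diseased",
}

def composite_diagnosis(culture_positive, microscopy_positive):
    # Every possible test profile maps to exactly one prespecified diagnosis.
    return COMPOSITE_RULE[(culture_positive, microscopy_positive)]

print(composite_diagnosis(False, True))  # -> "diseased"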

Composite reference standards are appealing because of their similarity to clinical practice; they strongly resemble diagnostic rules that exist for several conditions, such as rheumatic fever and depression. Their main advantage is reproducibility of results, which is made possible by the transparency and consistency in the way that the final diagnosis is reached across participants. However, they also have disadvantages, the most glaring being the subjectivity introduced in the development of the rule.

The term “composite reference standard” is often loosely used as a catch-all term to describe any situation in which multiple reference tests are used to evaluate the accuracy of the index test. It is sometimes mistakenly used to describe differential verification, when different reference standards are used for different groups of participants (table 2). 8 9 It has also been used to describe discrepant analysis, a method in which the reference standard is re-run or re-evaluated, or a different reference standard is used, when the first one does not agree with the index test. 13 Both these approaches can lead to seriously biased estimates of accuracy and should be avoided whenever possible.

Table 2 Examples of misuse of the term composite reference standard

In the example in table 2 of a study on deep venous thrombosis, differential verification was mislabelled as a composite reference standard. The reference standard for participants with a negative index test result was clinical follow-up, while those with a positive result received the preferred reference standard, computed tomography. 11 If minor thromboembolisms that would have been picked up by computed tomography were missed during follow-up, the number of false negatives will be underestimated and the number of true negatives overestimated, thus biasing the accuracy estimates. Ethical or practical difficulties sometimes make it impossible to apply the same reference standard in all participants, but such situations should be described with the term differential verification.

Table 2 also gives an example of discrepant analysis, from an imaging study of coronary artery stenosis in which the reference standard results were re-evaluated when they did not agree with the index test results. 12 Such re-evaluation can only increase agreement between the index test and the reference standard, which in turn can only lead to overestimates of accuracy. Although discrepant analysis is highly discouraged, situations in which the reference standard is repeated, or a different reference standard is applied, in those patients where the index test and the first reference standard disagree should be termed discrepant analysis.

To avoid confusion we recommend using the term composite reference standard exclusively for situations in which, by design, all patients are intended to receive the same component tests and these component tests are interpreted and combined in a fixed way for all patients.

Developing a composite reference standard

As the choice of component tests and the rule for combining them strongly influence the accuracy of composite reference standards, 14 careful attention is required when developing the decision rule. Ideally, the combination of test results and the corresponding final diagnosis should be specified before the study to prevent data-driven decisions. However, if there is uncertainty about the best composite reference standard, a sensitivity analysis could be planned to see how sensitive the results are to the particular choice of tests or combination rule. It is also important that the composite reference standard is clinically relevant. In other words, it should detect cases that will benefit from clinical intervention rather than simply the presence of disease. 15 For clinical situations in which the true disease status cannot be defined, the composite reference standard should reflect the provisional working definition. Keeping diagnostic guidelines in mind and seeking advice from experts in the field will help ensure that the chosen standard is clinically relevant and interpretable.

Defining rules to combine component tests

Two rules exist for combining component tests into a composite reference standard. In the simplest scenario of two dichotomous component tests, participants could be considered to have the disease if either test is indicative of disease (the “any positive” rule, also known as the “or” rule). The alternative is that participants are considered to have the disease only if both tests detect disease (the “all positive” or “and” rule). If there are more than two component tests, a combination of these two rules can be used.

Under the “any positive” rule, increasing the number of component tests increases the number of participants categorised as diseased: the sensitivity of the composite reference standard rises (more diseased subjects are classified as diseased) but its specificity falls (more non-diseased subjects are classified as having the disease). The reverse is true for the “all positive” rule; sensitivity of the composite reference standard decreases while specificity increases. Table 3 gives an example of how the choice of combination rule affects the accuracy of the composite reference standard, which in turn affects the accuracy estimates of the test under study. 2

Table 3 Effect of using different rules to produce a composite reference standard on estimates of accuracy, using an example inspired by a study on the accuracy of a rapid antigen detection test for trichomoniasis 4

There is almost always a trade-off between sensitivity and specificity when considering alternative ways to combine component tests. 14 The exception is when a component test in an “any positive” rule has perfect sensitivity, which makes a composite reference standard with perfect sensitivity, or when a component test in an “all positive” rule has perfect specificity, which makes a composite standard with perfect specificity. 3 Near perfect sensitivity or specificity of a component test is often the reasoning provided for the rule chosen.
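The arithmetic behind this trade-off can be sketched as follows (hypothetical accuracy values, not those in table 3, and assuming the component tests are conditionally independent given true disease status, an assumption discussed in the next section):

# Hypothetical component test accuracies (made-up values for illustration)
se1, sp1 = 0.85, 0.98
se2, sp2 = 0.70, 0.95

# "Any positive" (or) rule: a case is missed only if BOTH tests miss it,
# but a non-diseased person is correctly negative only if BOTH tests are negative.
se_or = 1 - (1 - se1) * (1 - se2)   # 0.955: sensitivity rises
sp_or = sp1 * sp2                   # 0.931: specificity falls

# "All positive" (and) rule: the mirror image.
se_and = se1 * se2                  # 0.595: sensitivity falls
sp_and = 1 - (1 - sp1) * (1 - sp2)  # 0.999: specificity rises

print(se_or, sp_or, se_and, sp_and)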

Selection of component tests

Although it may be tempting to include numerous component tests, the gain in sensitivity or specificity of the resulting composite reference standard decreases (and the clinical interpretability may diminish) as more tests are added. This is because additional tests may fail to provide new information. In the trichomoniasis example, if another test such as polymerase chain reaction amplification is added, new true cases may be detected. 4 However, if yet another test is added, fewer additional true cases will be detected because fewer remained undetected. Eventually, all true cases are detected and additional tests will only result in false positive results, thus decreasing the specificity of the composite reference standard.

Multiple tests will be useful only if the component tests catch each other’s mistakes. For example, in a group of patients who truly have trichomoniasis, if microscopy identifies disease in the same participants as culture does, microscopy does not add any information and therefore the sensitivity of the composite reference standard will not be higher than that of culture alone. 2 When component tests make the same classifications in truly diseased or non-diseased patients more or less often than is expected by chance alone, this is referred to as conditional dependence.
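A small numerical sketch (hypothetical counts, not taken from the trichomoniasis study) illustrates why dependence matters: a second test that only repeats the errors of the first adds nothing to an “any positive” composite.

truly_diseased = 100         # hypothetical patients who truly have the disease
detected_by_culture = 75     # culture alone detects 75 of them (Se = 0.75)

# Complete conditional dependence: microscopy flags exactly the same 75 patients,
# so the "any positive" composite still detects 75 (Se stays 0.75).
se_dependent = detected_by_culture / truly_diseased

# Conditional independence: microscopy (Se = 0.60) also detects, on average,
# 60% of the 25 patients that culture missed.
extra = 0.60 * (truly_diseased - detected_by_culture)
se_independent = (detected_by_culture + extra) / truly_diseased

print(se_dependent, se_independent)  # 0.75 vs 0.9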

In some cases, conditional dependence can be avoided or reduced by choosing component tests that look at different biological aspects of the disease. 16 To avoid causing the tests to make the same mistakes, researchers should also consider blinding the observers of each component test to the results of the other component tests when knowledge of those results could influence interpretation.

Extensions to the basic composite reference standard

The basic composite reference standard categorises patients simply as diseased or non-diseased. However, multiple disease categories can also be defined, such as subtypes, stages, or degree of certainty of disease. An example is a study on tuberculosis in which people were categorised into one of four levels of disease certainty (table 4). 17

Table 4 Use of a composite reference standard to determine different categories of diagnosis for tuberculosis 17

The basic composite reference standard gives equal weight to all tests, but in clinical practice tests carry different weights. The relative importance of the component tests can be incorporated by assigning weights. For example, in the assessment of adherence to isoniazid treatment for latent tuberculosis in table 1, the most reliable test was given twice the weight of the other tests. 6
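A weighted rule can be made just as explicit as an unweighted one. The sketch below is purely illustrative (the weights and cut-off are invented, loosely inspired by the adherence example, and are not the rule used in the cited study).

def weighted_composite(test_a, test_b, test_c):
    # Hypothetical weighted rule: test_a (the most reliable test) counts double.
    score = 2 * int(test_a) + int(test_b) + int(test_c)
    return "positive" if score >= 2 else "negative"   # illustrative cut-off

print(weighted_composite(True, False, False))   # -> "positive": the reliable test alone suffices
print(weighted_composite(False, True, False))   # -> "negative": a single weaker test does not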

Missing values on component tests

As with all diagnostic accuracy studies, results may be biased when not all participants receive the intended reference standard. 8 Careful attention needs to be paid to missing values in component tests. For example, if the “any positive” rule is used and the result of component test 1 is positive, we can conclude that a patient is diseased without knowing the result of component test 2. For efficiency, researchers might consider skipping the second test in participants whose first test result is positive. 3 18 However, if component test 1 is negative, component test 2 becomes necessary for determining the diagnosis.

When a component test result that is needed under the combination rule is missing, the composite reference standard is also missing. This can bias the accuracy estimates of the index test, and mathematical methods can be used to correct, at least partly, for this bias. 19
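Whether a missing component result also makes the composite result missing therefore depends on the rule and on the observed results, as in this hypothetical sketch of an “any positive” rule with two components:

def composite_any_positive(test1, test2):
    # "Any positive" rule with missing data; None means the result is unavailable.
    results = [test1, test2]
    if any(r is True for r in results):
        return "diseased"          # one positive result settles the diagnosis
    if all(r is False for r in results):
        return "non-diseased"
    return None                    # observed tests negative, some missing: composite missing

print(composite_any_positive(True, None))    # -> "diseased"
print(composite_any_positive(False, None))   # -> None (composite reference standard missing)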

Reporting guidelines

Complete and accurate reporting of the reference standard procedure is critical to allow readers to judge the potential risk of bias in accuracy estimates. This is especially important for systematic reviews of diagnostic tests. The validity of comparing accuracy estimates between studies and pooling of estimates across studies is challenged when studies use different reference standards or when reference standards are poorly defined or reported. 20 21 We therefore recommend that in addition to using current reporting guidelines, 22 authors of diagnostic accuracy studies should include the following details about studies with composite reference standards:

The rationale behind the selection of component tests and the combination rule

The corresponding final diagnosis for each combination of test results

Whether component test results were missing and whether this resulted in a missing composite reference standard

The number of participants with each combination of test results. For continuous tests, this information should at least be provided for the optimal or most common cut-off point.

Table 5 gives a template for reporting. The availability of all of the above information will allow studies using composite reference standards to be compared with those using only one of the component tests as the reference standard.

Table 5 Template for reporting results when using a composite reference standard

Conclusions and recommendations

Combining multiple tests to define a target disease status rather than using a single imperfect test is a transparent and reproducible method for dealing with the common problem of imperfect reference standard bias. Although composite reference standards may reduce the amount of such bias, they cannot completely eliminate it because it is unlikely that a combination of imperfect tests will produce a composite standard with perfect sensitivity and specificity.

Other methods for dealing with bias resulting from imperfect reference standards are panel diagnosis and latent class analysis. 1 3 In panel diagnosis, multiple experts review relevant patient characteristics, test results, and sometimes follow-up information before coming to a consensus about the final diagnosis in each patient. Latent class analysis estimates accuracy by assuming that true disease status is unobservable and relating the results of multiple tests to it in a statistical model. 3 23 The choice of method to deal with imperfect reference standard bias will probably depend on the type, number, and accuracy of the pieces of diagnostic information available in a particular study. Results from all three methods could be presented to strengthen their face validity. Researchers who use a composite reference standard can improve the transparency and reproducibility of their results by following our recommendations on reporting.

Summary points

A composite reference standard is a predefined rule that combines the results of multiple imperfect (component) tests in order to improve the classification of disease status in a diagnostic study

The term is often misused to describe differential verification, a situation in which different reference standards are used for different groups of participants

Different sets of component tests or different rules to combine the same component tests will lead to different estimates of accuracy for the test(s) under study

When using composite reference standards, it is important to prespecify and explain the rationale for the rule, report index test results for each combination of component tests, and explain how missing component test results are dealt with

Cite this as: BMJ 2013;347:f5605

Contributors: All authors participated in the conception and design of the article, worked on the drafting of the article and revising it critically for important intellectual content, and have approved the final version to be published. CAN had the idea for the article, performed the literature search, and wrote the article. JBR is the guarantor.

Competing interests: All authors read and understood the BMJ policy on declaration of interests and declare financial support from Netherlands Organization for Scientific Research (project 918.10.615).

Provenance and peer review: Not commissioned; externally peer reviewed.

1. Rutjes AW, Reitsma JB, Coomarasamy A, Khan KS, Bossuyt PM. Evaluation of diagnostic tests when there is no gold standard. A review of methods. Health Technol Assess 2007;11:iii,ix-51.
2. Walter SD, Macaskill P, Lord SJ, Irwig L. Effect of dependent errors in the assessment of diagnostic or screening test accuracy when the reference standard is imperfect. Stat Med 2012;31:1129-38.
3. Alonzo TA, Pepe MS. Using a combination of reference tests to assess the accuracy of a new diagnostic test. Stat Med 1999;18:2987-3003.
4. Hegazy MM, El-Tantawy NL, Soliman MM, El-Sadeek ES, El-Nagar HS. Performance of rapid immunochromatographic assay in the diagnosis of Trichomoniasis vaginalis. Diagn Microbiol Infect Dis 2012;74:49-53.
5. Siba V, Horwood PF, Vanuga K, Wapling J, Sehuko R, Siba PM, et al. Evaluation of serological diagnostic tests for typhoid fever in Papua New Guinea using a composite reference standard. Clin Vaccine Immunol 2012;19:1833-7.
6. Nicolau I, Tian L, Menzies D, Ostiguy G, Pai M. Point-of-care urine tests for smoking status and isoniazid treatment monitoring in adult patients. PLoS One 2012;7:e45913.
7. Hadgu A, Dendukuri N, Wang L. Evaluation of screening tests for detecting Chlamydia trachomatis: bias associated with the patient-infected-status algorithm. Epidemiology 2012;23:72-82.
8. De Groot JA, Bossuyt PM, Reitsma JB, Rutjes AW, Dendukuri N, Janssen KJ, et al. Verification problems in diagnostic accuracy studies: consequences and solutions. BMJ 2011;343:d4770.
9. Naaktgeboren CA, de Groot JAH, van Smeeden M, Moons KGM, Reitsma JB. Evaluating diagnostic accuracy in the face of multiple reference standards. Ann Intern Med 2013;159:195-202.
10. Ewer AK, Furmston AT, Middleton LJ, Deeks JJ, Daniels JP, Pattison HM, et al. Pulse oximetry as a screening test for congenital heart defects in newborn infants: a test accuracy study with evaluation of acceptability and cost-effectiveness. Health Technol Assess 2012;16:v-184.
11. Geersing GJ, Erkens PM, Lucassen WA, Buller HR, Cate HT, Hoes AW, et al. Safe exclusion of pulmonary embolism using the Wells rule and qualitative D-dimer testing in primary care: prospective cohort study. BMJ 2012;345:e6564.
12. Kerl JM, Schoepf UJ, Zwerner PL, Bauer RW, Abro JA, Thilo C, et al. Accuracy of coronary artery stenosis detection with CT versus conventional coronary angiography compared with composite findings from both tests as an enhanced reference standard. Eur Radiol 2011;21:1895-903.
13. Hadgu A. The discrepancy in discrepant analysis. Lancet 1996;348:592-3.
14. Macaskill P, Walter SD, Irwig L, Franco EL. Assessing the gain in diagnostic performance when combining two diagnostic tests. Stat Med 2002;21:2527-46.
15. Lord SJ, Staub LP, Bossuyt PM, Irwig LM. Target practice: choosing target conditions for test accuracy studies that are relevant to clinical practice. BMJ 2011;343:d4684.
16. Gardner IA, Stryhn H, Lind P, Collins MT. Conditional dependence between tests affects the diagnosis and surveillance of animal diseases. Prev Vet Med 2000;45:107-22.
17. Vadwai V, Boehme C, Nabeta P, Shetty A, Alland D, Rodrigues C. Xpert MTB/RIF: a new pillar in diagnosis of extrapulmonary tuberculosis? J Clin Microbiol 2011;49:2540-5.
18. Hilden J. Boolean algebra, Boolean nodes. In: Kattan M, Cowen ME, eds. Encyclopedia of medical decision making. 1st ed. Sage, 2009:94-8.
19. De Groot JA, Janssen KJ, Zwinderman AH, Bossuyt PM, Reitsma JB, Moons KG. Correcting for partial verification bias: a comparison of methods. Ann Epidemiol 2011;21:139-48.
20. Whiting PF, Rutjes AW, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med 2011;155:529-36.
21. Lijmer JG, Bossuyt PM, Heisterkamp SH. Exploring sources of heterogeneity in systematic reviews of diagnostic tests. Stat Med 2002;21:1525-37.
22. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM, et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Fam Pract 2004;21:4-10.
23. Pepe MS, Janes H. Insights into latent class analysis of diagnostic test performance. Biostatistics 2007;8:474-84.

STARD 2015 guidelines for reporting diagnostic accuracy studies: explanation and elaboration

  • Jérémie F Cohen 1,2,
  • Daniël A Korevaar 1,
  • Douglas G Altman 3,
  • David E Bruns 4,
  • Constantine A Gatsonis 5,
  • Lotty Hooft 6,
  • Les Irwig 7,
  • Deborah Levine 8,9,
  • Johannes B Reitsma 10,
  • Henrica C W de Vet 11,
  • Patrick M M Bossuyt 1
  • 1 Department of Clinical Epidemiology, Biostatistics and Bioinformatics, Academic Medical Centre, University of Amsterdam, Amsterdam, The Netherlands
  • 2 Department of Pediatrics, INSERM UMR 1153, Necker Hospital, AP-HP, Paris Descartes University, Paris, France
  • 3 Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, Centre for Statistics in Medicine, University of Oxford, Oxford, UK
  • 4 Department of Pathology, University of Virginia School of Medicine, Charlottesville, Virginia, USA
  • 5 Department of Biostatistics, Brown University School of Public Health, Providence, Rhode Island, USA
  • 6 Cochrane Netherlands, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, University of Utrecht, Utrecht, The Netherlands
  • 7 Screening and Diagnostic Test Evaluation Program, School of Public Health, University of Sydney, Sydney, New South Wales, Australia
  • 8 Department of Radiology, Beth Israel Deaconess Medical Center, Boston, Massachusetts, USA
  • 9 Radiology Editorial Office, Boston, Massachusetts, USA
  • 10 Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, University of Utrecht, Utrecht, The Netherlands
  • 11 Department of Epidemiology and Biostatistics, EMGO Institute for Health and Care Research, VU University Medical Center, Amsterdam, The Netherlands
  • Correspondence to Professor Patrick M M Bossuyt; p.m.bossuyt@amc.uva.nl

Diagnostic accuracy studies are, like other clinical studies, at risk of bias due to shortcomings in design and conduct, and the results of a diagnostic accuracy study may not apply to other patient groups and settings. Readers of study reports need to be informed about study design and conduct, in sufficient detail to judge the trustworthiness and applicability of the study findings. The STARD statement (Standards for Reporting of Diagnostic Accuracy Studies) was developed to improve the completeness and transparency of reports of diagnostic accuracy studies. STARD contains a list of essential items that can be used as a checklist, by authors, reviewers and other readers, to ensure that a report of a diagnostic accuracy study contains the necessary information. STARD was recently updated. All updated STARD materials, including the checklist, are available at http://www.equator-network.org/reporting-guidelines/stard . Here, we present the STARD 2015 explanation and elaboration document. Through commented examples of appropriate reporting, we clarify the rationale for each of the 30 items on the STARD 2015 checklist, and describe what is expected from authors in developing sufficiently informative study reports.

  • Reporting quality
  • Sensitivity and specificity
  • Diagnostic accuracy
  • Research waste
  • Peer review
  • Medical publishing

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/

https://doi.org/10.1136/bmjopen-2016-012799


Introduction

Diagnostic accuracy studies are at risk of bias, not unlike other clinical studies. Major sources of bias originate in methodological deficiencies in participant recruitment, data collection, execution or interpretation of the test, or data analysis. 1 , 2 As a result, the estimates of sensitivity and specificity of the test that is compared against the reference standard can be flawed, deviating systematically from what would be obtained in ideal circumstances (see key terminology in table 1). Biased results can lead to improper recommendations about testing, negatively affecting patient outcomes or healthcare policy.


Table 1 Key STARD terminology

Diagnostic accuracy is not a fixed property of a test. A test's accuracy in identifying patients with the target condition typically varies between settings and patient groups, and depends on prior testing. 2 These sources of variation in diagnostic accuracy are relevant for those who want to apply the findings of a diagnostic accuracy study to a specific question about adopting the test in their own environment. Risk of bias and concerns about applicability are the two key components of QUADAS-2, a quality assessment tool for diagnostic accuracy studies. 3

Readers can only judge the risk of bias and applicability of a diagnostic accuracy study if they find the necessary information to do so in the study report. The published study report has to contain all the essential information to judge the trustworthiness and relevance of the study findings, in addition to a complete and informative disclosure of the study results.

Unfortunately, several surveys have shown that diagnostic accuracy study reports often fail to transparently describe core elements. 4–6 Essential information about included patients, study design and the actual results is frequently missing, and recommendations about the test under evaluation are often generous and too optimistic.

To facilitate more complete and transparent reporting of diagnostic accuracy studies, the STARD statement was developed: Standards for Reporting of Diagnostic Accuracy Studies. 7 Inspired by the Consolidated Standards of Reporting Trials (CONSORT) statement for reporting randomised controlled trials, 8 , 9 STARD contains a checklist of items that should be reported in any diagnostic accuracy study.

The STARD statement was initially released in 2003 and updated in 2015. 10 The objectives of this update were to include recent evidence about sources of bias and variability and other issues in complete reporting, and make the STARD list easier to use. The updated STARD 2015 list now has 30 essential items ( table 2 ).

Table 2 The STARD 2015 list 10

Below, we present an explanation and elaboration of STARD 2015. This is an extensive revision and update of a similar document that was prepared for the STARD 2003 version. 11 Through commented examples of appropriate reporting, we clarify the rationale for each item and describe what is expected from authors.

We are confident that these descriptions can further assist scientists in writing fully informative study reports, and help peer reviewers, editors and other readers in verifying that submitted and published manuscripts of diagnostic accuracy studies are sufficiently detailed.

STARD 2015 items: explanation and elaboration

Title or abstract.

Item 1. Identification as a study of diagnostic accuracy using at least one measure of accuracy (such as sensitivity, specificity, predictive values or AUC)

Example . ‘Main outcome measures: Sensitivity and specificity of CT colonography in detecting individuals with advanced neoplasia (i.e., advanced adenoma or colorectal cancer) 6 mm or larger’. 12

Explanation . When searching for relevant biomedical studies on a certain topic, electronic databases such as MEDLINE and Embase are indispensable. To facilitate retrieval of their article, authors can explicitly identify it as a report of a diagnostic accuracy study. This can be performed by using terms in the title and/or abstract that refer to measures of diagnostic accuracy, such as ‘sensitivity’, ‘specificity’, ‘positive predictive value’, ‘negative predictive value’, ‘area under the ROC curve (AUC)’ or ‘likelihood ratio’.

In 1991, MEDLINE introduced a specific keyword (MeSH heading) for indexing diagnostic studies: ‘Sensitivity and Specificity.’ Unfortunately, the sensitivity of using this particular MeSH heading to identify diagnostic accuracy studies can be as low as 51%. 13 As of May 2015, Embase's thesaurus (Emtree) has 38 check tags for study types; ‘diagnostic test accuracy study’ is one of them, but was only introduced in 2011.

In the example , the authors mentioned the terms ‘sensitivity’ and ‘specificity’ in the abstract. The article will now be retrieved when using one of these terms in a search strategy, and will be easily identifiable as one describing a diagnostic accuracy study.

Item 2. Structured summary of study design, methods, results and conclusions (for specific guidance, see STARD for Abstracts)

Example . See STARD for Abstracts ( manuscript in preparation; checklist will be available at http://www.equator-network.org/reporting-guidelines/stard/ ).

Explanation . Readers use abstracts to decide whether they should retrieve the full study report and invest time in reading it. In cases where access to the full study report cannot be obtained or where time is limited, it is conceivable that clinical decisions are based on the information provided in abstracts only.

In two recent literature surveys, abstracts of diagnostic accuracy studies published in high-impact journals or presented at an international scientific conference were found to be insufficiently informative, because key information about the research question, study methods, study results and the implications of the findings was frequently missing. 14 , 15

Informative abstracts help readers to quickly appraise critical elements of study validity (risk of bias) and applicability of study findings to their clinical setting (generalisability). Structured abstracts, with separate headings for objectives, methods, results and interpretation, allow readers to find essential information more easily. 16

Building on STARD 2015, the newly developed STARD for Abstracts provides a list of essential items that should be included in journal and conference abstracts of diagnostic accuracy studies (list finalised; manuscript under development) .

Item 3. Scientific and clinical background, including the intended use and clinical role of the index test

Example . ‘The need for improved efficiency in the use of emergency department radiography has long been documented. This need for selectivity has been identified clearly for patients with acute ankle injury, who generally are all referred for radiography, despite a yield for fracture of less than 15%. The referral patterns and yield of radiography for patients with knee injuries have been less well described but may be more inefficient than for patients with ankle injuries. […] The sheer volume of low-cost tests such as plain radiography may contribute as much to rising health care costs as do high-technology, low-volume procedures. […] If validated in subsequent studies, a decision rule for knee-injury patients could lead to a large reduction in the use of knee radiography and significant health care savings without compromising patient care’. 17

Explanation . In the introduction of scientific study reports, authors should describe the rationale for their study. In doing so, they can refer to previous work on the topic, remaining uncertainty and the clinical implications of this knowledge gap. To help readers in evaluating the implications of the study, authors can clarify the intended use and the clinical role of the test under evaluation, which is referred to as the index test.

The intended use of a test can be diagnosis, screening, staging, monitoring, surveillance, prognosis, treatment selection or other purposes. 18 The clinical role of the test under evaluation refers to its anticipated position relative to other tests in the clinical pathway. 19 A triage test, for example, will be used before an existing test because it is less costly or burdensome, but often less accurate as well. An add-on test will be used after existing tests, to improve the accuracy of the total test strategy by identifying false positives or false negatives of the initial test. In other cases, a new test may be used to replace an existing test.

Defining the intended use and clinical role of the test will guide the design of the study and the targeted level of sensitivity and specificity; from these definitions follow the eligibility criteria, how and where to identify eligible participants, how to perform tests and how to interpret test results. 19

Specifying the clinical role is helpful in assessing the relative importance of potential errors (false positives and false negatives) made by the index test. A triage test to rule out disease, for example, will need very high sensitivity, whereas the one that mainly aims to rule in disease will need very high specificity.

In the example , the intended use is diagnosis of knee fractures in patients with acute knee injuries, and the potential clinical role is triage test; radiography, the existing test, would only be performed in those with a positive outcome of the newly developed decision rule. The authors outline the current scientific and clinical background of the health problem studied, and their reason for aiming to develop a triage test: this would reduce the number of radiographs and, consequently, healthcare costs.

Item 4. Study objectives and hypotheses

Example (1) . ‘The objective of this study was to evaluate the sensitivity and specificity of 3 different diagnostic strategies: a single rapid antigen test, a rapid antigen test with a follow-up rapid antigen test if negative (rapid-rapid diagnostic strategy), and a rapid antigen test with follow-up culture if negative (rapid-culture)—the AAP diagnostic strategy—all compared with a 2-plate culture gold standard. In addition, […] we also compared the ability of these strategies to achieve an absolute diagnostic test sensitivity of >95%’. 20

Example (2) . ‘Our 2 main hypotheses were that rapid antigen detection tests performed in physician office laboratories are more sensitive than blood agar plate cultures performed and interpreted in physician office laboratories, when each test is compared with a simultaneous blood agar plate culture processed and interpreted in a hospital laboratory, and rapid antigen detection test sensitivity is subject to spectrum bias’. 21

Explanation . Clinical studies may have a general aim (a long-term goal, such as ‘to improve the staging of oesophageal cancer’), specific objectives (well-defined goals for this particular study) and testable hypotheses (statements that can be falsified by the study results).

In diagnostic accuracy studies, statistical hypotheses are typically defined in terms of acceptability criteria for single tests (minimum levels of sensitivity, specificity or other measures). In those cases, hypotheses generally include a quantitative expression of the expected value of the diagnostic parameter. In other cases, statistical hypotheses are defined in terms of equality or non-inferiority in accuracy when comparing two or more index tests.

A priori specification of the study hypotheses limits the chances of post hoc data-dredging with spurious findings, premature conclusions about the performance of tests or subjective judgement about the accuracy of the test. Objectives and hypotheses also guide sample size calculations. An evaluation of 126 reports of diagnostic test accuracy studies published in high-impact journals in 2010 revealed that 88% did not state a clear hypothesis. 22

In the first example , the authors' objective was to evaluate the accuracy of three diagnostic strategies; their specific hypothesis was that the sensitivity of any of these would exceed the prespecified value of 95%. In the second example , the authors explicitly describe the hypotheses they want to explore in their study. The first hypothesis is about the comparative sensitivity of two index tests (rapid antigen detection test vs culture performed in physician office laboratories); the second is about variability of rapid test performance according to patient characteristics (spectrum bias).

Item 5. Whether data collection was planned before the index test and reference standard were performed (prospective study) or after (retrospective study)

Example . ‘We reviewed our database of patients who underwent needle localization and surgical excision with digital breast tomosynthesis guidance from April 2011 through January 2013. […] The patients’ medical records and images of the 36 identified lesions were then reviewed retrospectively by an author with more than 5 years of breast imaging experience after a breast imaging fellowship’. 23

Explanation . There is great variability in the way the terms ‘prospective’ and ‘retrospective’ are defined and used in the literature. We believe it is therefore necessary to describe clearly whether data collection was planned before the index test and reference standard were performed, or afterwards. If authors define the study question before index test and reference standards are performed, they can take appropriate actions for optimising procedures according to the study protocol and for dedicated data collection. 24

Sometimes, the idea for a study originates when patients have already undergone the index test and the reference standard. If so, data collection relies on reviewing patient charts or extracting data from registries. Though such retrospective studies can sometimes reflect routine clinical practice better than prospective studies, they may fail to identify all eligible patients, and often result in data of lower quality, with more missing data points. 24 A reason for this could be, for example, that in daily clinical practice, not all patients undergoing the index test may proceed to have the reference standard.

In the example , the data were clearly collected retrospectively: participants were identified through database screening, clinical data were abstracted from patients' medical records, though images were reinterpreted.

Item 6. Eligibility criteria

Example (1). ‘Patients eligible for inclusion were consecutive adults (≥18 years) with suspected pulmonary embolism, based on the presence of at least one of the following symptoms: unexplained (sudden) dyspnoea, deterioration of existing dyspnoea, pain on inspiration, or unexplained cough. We excluded patients if they received anticoagulant treatment (vitamin K antagonists or heparin) at presentation, they were pregnant, follow-up was not possible, or they were unwilling or unable to provide written informed consent’. 25

Example (2) . ‘Eligible cases had symptoms of diarrhoea and both a positive result for toxin by enzyme immunoassay and a toxigenic C difficile strain detected by culture (in a sample taken less than seven days before the detection round). We defined diarrhoea as three or more loose or watery stool passages a day. We excluded children and adults on intensive care units or haematology wards. Patients with a first relapse after completing treatment for a previous C difficile infection were eligible but not those with subsequent relapses. […] For each case we approached nine control patients. These patients were on the same ward as and in close proximity to the index patient. Control patients did not have diarrhoea, or had diarrhoea but a negative result for C difficile toxin by enzyme immunoassay and culture (in a sample taken less than seven days previously)’. 26

Explanation . Since a diagnostic accuracy study describes the behaviour of a test under particular circumstances, a report of the study must include a complete description of the criteria that were used to identify eligible participants. Eligibility criteria are usually related to the nature and stage of the target condition and the intended future use of the index test; they often include the signs, symptoms or previous test results that generate the suspicion about the target condition. Additional criteria can be used to exclude participants for reasons of safety, feasibility and ethical arguments.

Excluding patients with a specific condition or receiving a specific treatment known to adversely affect the way the test works can lead to inflated diagnostic accuracy estimates. 27 An example is the exclusion of patients using β blockers in studies evaluating the diagnostic accuracy of exercise ECG.

Some studies have one set of eligibility criteria for all study participants; these are sometimes referred to as single-gate or cohort studies. Other studies have one set of eligibility criteria for participants with the target condition, and (an)other set(s) of eligibility criteria for those without the target condition; these are called multiple-gate or case–control studies. 28

In the first example , the eligibility criteria list presenting signs and symptoms, an age limit and exclusion based on specific conditions and treatments. Since the same set of eligibility criteria applies to all study participants, this is an example of a single-gate study.

In the second example , the authors used different eligibility criteria for participants with and without the target condition: one group consisted of patients with a confirmed diagnosis of Clostridium difficile , and one group consisted of healthy controls. This is an example of a multiple-gate study. Extreme contrasts between severe cases and healthy controls can lead to inflated estimates of accuracy. 6 , 29

Item 7. On what basis potentially eligible participants were identified (such as symptoms, results from previous tests, inclusion in registry)

Example . ‘We reviewed our database of patients who underwent needle localization and surgical excision with digital breast tomosynthesis guidance from April 2011 through January 2013’. 23

Explanation . The eligibility criteria specify who can participate in the study, but they do not describe how the study authors identified eligible participants. This can be done in various ways. 30 A general practitioner may evaluate every patient seen during office hours for eligibility. Researchers can go through registries in an emergency department to identify potentially eligible patients. In other studies, patients are only identified after having been subjected to the index test. Still other studies start with patients in whom the reference standard was performed. Many retrospective studies include participants based on searching hospital databases for patients that underwent the index test and the reference standard. 31

Differences in methods for identifying eligible patients can affect the spectrum and prevalence of the target condition in the study group, as well as the range and relative frequency of alternative conditions in patients without the target condition. 32 These differences can influence the estimates of diagnostic accuracy.

In the example , participants were identified through searching a patient database and were included if they underwent the index test and the reference standard.

Item 8. Where and when potentially eligible participants were identified (setting, location and dates)

Example . ‘The study was conducted at the Emergency Department of a university-affiliated children's hospital between January 21, 1996, and April 30, 1996’. 33

Explanation . The results of a diagnostic accuracy study reflect the performance of a test in a particular clinical context and setting. A medical test may perform differently in a primary, secondary or tertiary care setting, for example. Authors should therefore report the actual setting in which the study was performed, as well as the exact locations: names of the participating centres, city and country. The spectrum of the target condition as well as the range of other conditions that occur in patients suspected of the target condition can vary across settings, depending on which referral mechanisms are in play. 34–36

Since test procedures, referral mechanisms and the prevalence and severity of diseases can evolve over time, authors should also report the start and end dates of participant recruitment.

This information is essential for readers who want to evaluate the generalisability of the study findings, and their applicability to specific questions, for those who would like to use the evidence generated by the study to make informed healthcare decisions.

In the example , study setting and study dates were clearly defined.

Item 9. Whether participants formed a consecutive, random or convenience series

Example . ‘All subjects were evaluated and screened for study eligibility by the first author (E.N.E.) prior to study entry. This was a convenience sample of children with pharyngitis; the subjects were enrolled when the first author was present in the emergency department’. 37

Explanation . The included study participants may be either a consecutive series of all patients evaluated for eligibility at the study location and satisfying the inclusion criteria, or a subselection of these. A subselection can be purely random, produced by using a random numbers table, or less random, if patients are only enrolled on specific days or during specific office hours. In that case, included participants may not be considered a representative sample of the targeted population, and the generalisability of the study results may be jeopardised. 2 , 29

In the example, the authors explicitly described a convenience series where participants were enrolled based on their accessibility to the clinical investigator.

Item 10a. Index test, in sufficient detail to allow replication

Item 10b. Reference standard, in sufficient detail to allow replication

Example . ‘An intravenous line was inserted in an antecubital vein and blood samples were collected into serum tubes before (baseline), immediately after, and 1.5 and 4.5 h after stress testing. Blood samples were put on ice, processed within 1 h of collection, and later stored at −80°C before analysis. The samples had been through 1 thaw–freeze cycle before cardiac troponin I (cTnI) analysis. We measured cTnI by a prototype hs assay (ARCHITECT STAT high-sensitivity troponin, Abbott Diagnostics) with the capture antibody detecting epitopes 24–40 and the detection antibody epitopes 41–49 of cTnI. The limit of detection (LoD) for the high sensitivity (hs) cTnI assay was recently reported by other groups to be 1.2 ng/L, the 99th percentile 16 ng/L, and the assay 10% coefficient of variation (CV) 3.0 ng/L. […] Samples with concentrations below the range of the assays were assigned values of 1.2 […] for cTnI. […]’. 38

Explanation . Differences in the execution of the index test or reference standard are a potential source of variation in diagnostic accuracy. 39 , 40 Authors should therefore describe the methods for executing the index test and reference standard, in sufficient detail to allow other researchers to replicate the study, and to allow readers to assess (1) the feasibility of using the index test in their own setting, (2) the adequacy of the reference standard and (3) the applicability of the results to their clinical question.

The description should cover key elements of the test protocol, including details of:

the preanalytical phase, for example, patient preparation such as fasting/feeding status prior to blood sampling, the handling of the sample prior to testing and its limitations (such as sample instability), or the anatomic site of measurement;

the analytical phase, including materials and instruments and analytical procedures;

the postanalytical phase, such as calculations of risk scores using analytical results and other variables.

Between-study variability in measures of test accuracy due to differences in test protocol has been documented for a number of tests, including the use of hyperventilation prior to exercise ECG and the use of tomography for exercise thallium scintigraphy. 27 , 40

The number, training and expertise of the persons executing and reading the index test and the reference standard may also be critical. Many studies have shown between-reader variability, especially in the field of imaging. 41 , 42 The quality of reading has also been shown to be affected in cytology and microbiology by professional background, expertise and prior training to improve interpretation and to reduce interobserver variation. 43–45 Information about the amount of training of the persons in the study who read the index test can help readers to judge whether similar results are achievable in their own settings.

In some cases, a study depends on multiple reference standards. Patients with lesions on an imaging test under evaluation may, for example, undergo biopsy with a final diagnosis based on histology, whereas patients without lesions on the index test undergo clinical follow-up as reference standard. This could be a potential source of bias, so authors should specify which patient groups received which reference standard. 2 , 3

More specific guidance for specialised fields of testing, or certain types of tests, will be developed in future STARD extensions. Whenever available, these extensions will be made available on the STARD pages at the EQUATOR (Enhancing the QUAlity and Transparency Of health Research) website ( http://www.equator-network.org/ ).

In the example , the authors described how blood samples were collected and processed in the laboratory. They also report analytical performance characteristics of the index test device, as obtained in previous studies.

Item 11. Rationale for choosing the reference standard (if alternatives exist)

Example . ‘The MINI [Mini International Neuropsychiatric Inventory] was developed as a short and efficient diagnostic interview to be used in both research and clinical settings (reference supporting this statement provided by the authors) . It has good reliability and validity rates compared with other gold standard diagnostic interviews, such as the Structured Clinical Interview for DSM [Diagnostic and Statistical Manual of Mental Disorders] Disorders (SCID) and the Composite International Diagnostic Interview (references supporting this statement provided by the authors) ’. 46

Explanation . In diagnostic accuracy studies, the reference standard is used for establishing the presence or absence of the target condition in study participants. Several reference standards may be available to define the same target condition. In such cases, authors are invited to provide their rationale for selecting the specific reference standard from the available alternatives. This may depend on the intended use of the index test, the clinical relevance or practical and/or ethical reasons.

Alternative reference standards are not always in perfect agreement. Some reference standards are less accurate than others. In other cases, different reference standards reflect related but different manifestations or stages of the disease, as in confirmation by imaging (first reference standard) versus clinical events (second reference standard).

In the example , the authors selected the MINI, a structured diagnostic interview commonly used for psychiatric evaluations, as the reference standard for identifying depression and suicide risk in adults with epilepsy. As a rationale for their choice, they claimed that the MINI test was short to administer, efficient for clinical and research purposes, reliable and valid when compared with alternative diagnostic interviews.

Item 12a. Definition of and rationale for test positivity cut-offs or result categories of the index test, distinguishing prespecified from exploratory

Item 12b. Definition of and rationale for test positivity cut-offs or result categories of the reference standard, distinguishing prespecified from exploratory

Example . ‘We also compared the sensitivity of the risk-model at the specificity that would correspond to using a fixed FIT [fecal immunochemical test] positivity threshold of 50 ng/ml. We used a threshold of 50 ng/ml because this was the anticipated cut-off for the Dutch screening programme at the time of the study’. 47

Explanation . Test results in their original form can be dichotomous (positive vs negative), have multiple categories (as in high, intermediate or low risk) or be continuous (interval or ratio scale).

For tests with multiple categories, or continuous results, the outcomes from testing are often reclassified into positive (disease confirmed) and negative (disease excluded). This is performed by defining a threshold: the test positivity cut-off. Results that exceed the threshold would then be called positive index test results. In other studies, an ROC curve is derived, by calculating the sensitivity–specificity pairs for all possible cut-offs.

To evaluate the validity and applicability of these classifications, readers would like to know these positivity cut-offs or result categories, how they were determined and whether they were defined prior to the study or after collecting the data. Prespecified thresholds can be based on (1) previous studies, (2) cut-offs used in clinical practice, (3) thresholds recommended by clinical practice guidelines or (4) thresholds recommended by the manufacturer. If no such thresholds exist, the authors may be tempted to explore the accuracy for various thresholds after the data have been collected.

If the authors selected the positivity cut-off after performing the test, choosing the one that maximised test performance, there is an increased risk that the resulting accuracy estimates are overly optimistic, especially in small studies. 48 , 49 Subsequent studies may fail to replicate the findings. 50 , 51

In the example , the authors stated the rationale for their selection of cut-offs.

Item 13a. Whether clinical information and reference standard results were available to the performers or readers of the index test

Item 13b. Whether clinical information and index test results were available to the assessors of the reference standard

Example . ‘Images for each patient were reviewed by two fellowship-trained genitourinary radiologists with 12 and 8 years of experience, respectively, who were blinded to all patient information, including the final histopathologic diagnosis’. 52

Explanation . Some medical tests, such as most forms of imaging, require human handling, interpretation and judgement. These actions may be influenced by the information that is available to the reader. 1 , 53 , 54 This can lead to artificially high agreement between tests, or between the index test and the reference standard.

If the reader of a test has access to information about signs, symptoms and previous test results, the reading may be influenced by this additional information, but this may still represent how the test is used in clinical practice. 2 The reverse may also apply, if the reader does not have enough information for a proper interpretation of the index test outcome. In that case, test performance may be underestimated, and the study findings may have limited applicability. Either way, readers of the study report should know to what extent such additional information was available to test readers and may have influenced their final judgement.

In other situations, the assessors of the reference standard may have had access to the index test results. In those cases, the final classification may be guided by the index test result, and the reported accuracy estimates for the index test will be too high. 1 , 2 , 27 Tests that require subjective interpretation are particularly susceptible to this bias.

Withholding information from the readers of the test is commonly referred to as ‘blinding’ or ‘masking’. The point of this reporting item is not that blinding is desirable or undesirable, but, rather, that readers of the study report need information about blinding for the index test and the reference standard to be able to interpret the study findings.

In the example , the readers of unenhanced CT for differentiating between renal angiomyolipoma and renal cell carcinoma did not have access to clinical information, nor to the results of histopathology, the reference standard in this study.

Item 14. Methods for estimating or comparing measures of diagnostic accuracy

Example . ‘Statistical tests of sensitivity and specificity were conducted by using the McNemar test for correlated proportions. All tests were two sided, testing the hypothesis that stereoscopic digital mammography performance differed from that of digital mammography. A p-value of 0.05 was considered as the threshold for significance’. 55

Explanation . Multiple measures of diagnostic accuracy exist to describe the performance of a medical test, and their calculation from the collected data is not always straightforward. 56 Authors should report the methods used for calculating the measures that they considered appropriate for their study objectives.

Statistical techniques can be used to test specific hypotheses following from the study's objectives. In single-test evaluations, authors may want to evaluate whether the diagnostic accuracy of the test exceeds a prespecified level (eg, sensitivity of at least 95%, see Item 4).

Diagnostic accuracy studies can also compare two or more index tests. In such comparisons, statistical hypothesis testing usually involves assessing the superiority of one test over another, or its non-inferiority. 57 For such comparisons, authors should indicate which measure(s) they specified to make the comparison; these should match the study objectives and the purpose and role of the index test in the clinical pathway. Examples are the relative sensitivity, the absolute gain in sensitivity and the relative diagnostic OR. 58

In the example , the authors used McNemar's test statistic to evaluate whether the sensitivity and specificity of stereoscopic digital mammography differed from that of digital mammography in patients with elevated risk for breast cancer. In itself, the resulting p value is not a quantitative expression of the relative accuracy of the two investigated tests. Like any p value, it is influenced by the magnitude of the difference in effect and the sample size. In the example, the authors could have calculated the relative or absolute difference in sensitivity and specificity, including a 95% CI that takes into account the paired nature of the data.
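To make the suggested paired analysis concrete, the sketch below shows one common way of computing McNemar's test (with continuity correction) and an absolute difference in sensitivity with a 95% CI from paired counts among participants with the target condition. The counts and variable names are hypothetical, and the cited study's exact analysis may have differed.

```python
# Minimal sketch (illustrative, not the analysis from the cited study): McNemar's test
# and a paired difference in sensitivity with a 95% CI, among participants WITH the
# target condition according to the reference standard.
from math import sqrt
from scipy.stats import chi2, norm

# Hypothetical counts among diseased participants:
#   b = positive on test A only, c = positive on test B only,
#   both_pos / both_neg = concordant pairs
both_pos, b, c, both_neg = 60, 15, 5, 20
n = both_pos + b + c + both_neg

# McNemar's chi-square with continuity correction uses only the discordant pairs
mcnemar_stat = (abs(b - c) - 1) ** 2 / (b + c)
p_value = chi2.sf(mcnemar_stat, df=1)

# Paired (Wald) difference in sensitivity and its 95% CI
diff = (b - c) / n
se = sqrt(b + c - (b - c) ** 2 / n) / n
z = norm.ppf(0.975)
print(f"difference in sensitivity: {diff:.3f} "
      f"(95% CI {diff - z * se:.3f} to {diff + z * se:.3f}), "
      f"McNemar p = {p_value:.3f}")
```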

Item 15. How indeterminate index test or reference standard results were handled

Example . ‘Indeterminate results were considered false-positive or false-negative and incorporated into the final analysis. For example, an indeterminate result in a patient found to have appendicitis was considered to have had a negative test result’. 59

Explanation . Indeterminate results refer to those that are neither positive nor negative. 60 Such results can occur on the index test and the reference standard, and are a challenge when evaluating the performance of a diagnostic test. 60–63 The occurrence of indeterminate test results varies from test to test, but frequencies up to 40% have been reported. 62

There are many underlying causes for indeterminate test results. 62 , 63 A test may fail because of technical reasons or an insufficient sample, for example, in the absence of cells in a needle biopsy from a tumour. 43 , 64 , 65 Sometimes test results are not reported as just positive or negative, as in the case of ventilation–perfusion scanning in suspected pulmonary embolism, where the findings are classified in three categories: normal, high probability or inconclusive. 66

In itself, the frequency of indeterminate test results is an important indicator of the feasibility of the test, and a high frequency typically limits the overall clinical usefulness; therefore, authors are encouraged to always report the respective frequencies with reasons, as well as failures to complete the testing procedure. This applies to both the index test and the reference standard.

Ignoring indeterminate test results can produce biased estimates of accuracy, if these results do not occur at random. Clinical practice may guide the decision on how to handle indeterminate results.

There are multiple ways of handling indeterminate test results in the analysis when estimating accuracy and expressing test performance. 63 They can be ignored altogether, be reported but not accounted for, or be handled as a separate test result category. Handling these results as a separate category may be useful when indeterminate results occur more often in one group, for example, in those without the target condition than in those with it. It is also possible to reclassify all such results: as false positives or false negatives, depending on the reference standard result (‘worst-case scenario’), or as true positives and true negatives (‘best-case scenario’).

In the example , the authors explicitly chose a conservative approach by considering all indeterminate results from the index test as being false-negative (in those with the target condition) or false-positive (in all others), a strategy sometimes referred to as the ‘worst-case scenario’.

Item 16. How missing data on the index test and reference standard were handled

Example . ‘One vessel had missing FFR CT and 2 had missing CT data. Missing data were handled by exclusion of these vessels as well as by the worst-case imputation’. 67

Explanation . Missing data are common in any type of biomedical research. In diagnostic accuracy studies, they can occur for the index test and reference standard. There are several ways to deal with them when analysing the data. 68 Many researchers exclude participants without an observed test result. This is known as ‘complete case’ or ‘available case’ analysis. This may lead to a loss in precision and can introduce bias, especially if having a missing index test or reference standard result is related to having the target condition.

Participants with missing test results can be included in the analysis if missing results are imputed. 68–70 Another option is to assess the impact of missing test results on estimates of accuracy by considering different scenarios. For the index test, for example, in the ‘worst-case scenario’, all missing index test results are considered false-positive or false-negative depending on the reference standard result; in the ‘best-case scenario’, all missing index test results are considered true-positive or true-negative.
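The sketch below illustrates how a complete-case analysis, a ‘worst-case scenario’ and a ‘best-case scenario’ can be applied to missing index test results; the same logic applies to indeterminate results handled as described under Item 15. The data layout, counts and function name are hypothetical assumptions, not taken from the cited study.

```python
# Minimal sketch (assumed data layout, not from the cited study): sensitivity and
# specificity under complete-case, worst-case and best-case handling of missing
# index test results. None marks a missing index test result.
def accuracy(index_results, reference_results, missing_as=None):
    tp = fp = fn = tn = 0
    for idx, ref in zip(index_results, reference_results):
        if idx is None:
            if missing_as is None:
                continue                      # complete-case: drop the participant
            if missing_as == "worst":
                idx = 0 if ref == 1 else 1    # false negative if diseased, false positive if not
            elif missing_as == "best":
                idx = ref                     # agrees with the reference standard
        tp += idx == 1 and ref == 1
        fp += idx == 1 and ref == 0
        fn += idx == 0 and ref == 1
        tn += idx == 0 and ref == 0
    return tp / (tp + fn), tn / (tn + fp)

index = [1, 1, 0, None, 1, 0, None, 0]
reference = [1, 1, 1, 1, 0, 0, 0, 0]
for scenario in (None, "worst", "best"):
    sens, spec = accuracy(index, reference, missing_as=scenario)
    label = scenario or "complete-case"
    print(f"{label}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```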

In the example , the authors explicitly reported how many cases with missing index test data they encountered and how they handled these data: they excluded them, but also applied a ‘worst-case scenario’.

Item 17. Any analyses of variability in diagnostic accuracy, distinguishing prespecified from exploratory

Example . ‘To assess the performance of urinary indices or their changes over the first 24 hours in distinguishing transient AKI [acute kidney injury] from persistent AKI, we plotted the receiver-operating characteristic curves for the proportion of true positives against the proportion of false positives, depending on the prediction rule used to classify patients as having persistent AKI. The same strategy was used to assess the performance of indices and their changes over time in two predefined patient subgroups; namely, patients who did not receive diuretic therapy and patients without sepsis’. 71

Explanation . The relative proportion of false-positive or false-negative results of a diagnostic test may vary depending on patient characteristics, experience of readers, the setting and previous test results. 2 , 3 Researchers may therefore want to explore possible sources of variability in test accuracy within their study. In such analyses, investigators typically assess differences in accuracy across subgroups of participants, readers or centres.

Post hoc analyses, performed after looking at the data, carry a high risk of spurious findings; such results are especially unlikely to be confirmed by subsequent studies. Analyses that were prespecified in the protocol, before data were collected, have greater credibility. 72

In the example , the authors reported that the accuracy of the urinary indices was evaluated in two subgroups that were explicitly prespecified.

Item 18. Intended sample size and how it was determined

Example . ‘Study recruitment was guided by an expected 12% prevalence of adenomas 6 mm or larger in a screening cohort and a point estimate of 80% sensitivity for these target lesions. We planned to recruit approximately 600 participants to achieve margins of sampling error of approximately 8 percentage points for sensitivity. This sample would also allow 90% power to detect differences in sensitivity between computed tomographic colonography and optical colonoscopy of 18 percentage points or more’. 73

Explanation . Performing a sample size calculation when designing a diagnostic accuracy study helps to ensure that the accuracy estimates will have sufficient precision. Such calculations should also take into account the specific objectives and hypotheses of the study.

Readers may want to know how the sample size was determined, and whether the assumptions made in this calculation are in line with the scientific and clinical background, and the study objectives. Readers will also want to learn whether the study authors were successful in recruiting the targeted number of participants. Methods for performing sample size calculations in diagnostic research are widely available, 74–76 but such calculations are not always performed or provided in reports of diagnostic accuracy studies. 77 , 78

Many diagnostic accuracy studies are small; a systematic survey of studies published in 8 leading journals in 2002 found a median sample size of 118 participants (IQR 71–350). 77 Estimates of diagnostic accuracy from small studies tend to be imprecise, with wide CIs around them.
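For readers who want to see the arithmetic, the sketch below shows one widely used calculation: the number of participants with the target condition needed to estimate sensitivity with a given confidence interval half-width, divided by the expected prevalence to obtain a total sample size. This is only one possible approach and will not necessarily reproduce the planned sample size of the study quoted above, whose authors also considered power for a comparison of two tests.

```python
# Illustrative sketch of one common sample size calculation for estimating a single
# sensitivity with a desired confidence interval half-width (margin); this is not
# necessarily the exact method used in the cited study.
from math import ceil
from scipy.stats import norm

def n_for_sensitivity(expected_sens, margin, prevalence, conf=0.95):
    z = norm.ppf(1 - (1 - conf) / 2)
    # participants WITH the target condition needed for the desired margin
    n_diseased = (z ** 2) * expected_sens * (1 - expected_sens) / margin ** 2
    # total participants needed, given the expected prevalence
    return ceil(n_diseased), ceil(n_diseased / prevalence)

n_dis, n_total = n_for_sensitivity(expected_sens=0.80, margin=0.08, prevalence=0.12)
print(f"diseased participants needed: {n_dis}, total to recruit: {n_total}")
```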

In the example , the authors reported in detail how the intended sample size was determined in order to achieve a desired level of precision around an expected sensitivity of 80%.

Item 19. Flow of participants, using a diagram

Example . ‘Between 1 June 2008 and 30 June 2011, 360 patients were assessed for initial eligibility and invited to participate. The figure shows the flow of patients through the study, along with the primary outcome of advanced colorectal neoplasia. Patients who were excluded (and reasons for this) or who withdrew from the study are noted. In total, 229 patients completed the study, a completion rate of 64%’. 79 (See figure 1 .)


Example of flow diagram from a study evaluating the accuracy of faecal immunochemical testing for diagnosis of advanced colorectal neoplasia (adapted from Collins et al , 79 with permission).

Explanation . Estimates of diagnostic accuracy may be biased if not all eligible participants undergo the index test and the desired reference standard. 80–86 This includes studies in which not all study participants undergo the reference standard, as well as studies where some of the participants receive a different reference standard. 70 Incomplete verification by the reference standard occurs in up to 26% of diagnostic studies; it is especially common when the reference standard is an invasive procedure. 84

To allow the readers to appreciate the potential for bias, authors are invited to build a diagram to illustrate the flow of participants through the study. Such a diagram also illustrates the basic structure of the study. An example of a prototypical STARD flow diagram is presented in figure 2 .

STARD 2015 flow diagram.

By providing the exact number of participants at each stage of the study, including the number of true-positive, false-positive, true-negative and false-negative index test results, the diagram also helps to identify the correct denominator for calculating proportions such as sensitivity and specificity. The diagram should also specify the number of participants who were assessed for eligibility, the number who did not undergo the index test and/or the reference standard, and the reasons for that. This helps readers to judge the risk of bias, the feasibility of the evaluated testing strategy and the applicability of the study findings.

In the example , the authors very briefly described the flow of participants, and referred to a flow diagram in which the number of participants and corresponding test results at each stage of the study were provided, as well as detailed reasons for excluding participants ( figure 1 ).

Item 20. Baseline demographic and clinical characteristics of participants

Example . ‘The median age of participants was 60 years (range 18–91), and 209 participants (54.7%) were female. The predominant presenting symptom was abdominal pain, followed by rectal bleeding and diarrhea, whereas fever and weight loss were less frequent. At physical examination, palpation elicited abdominal pain in almost half the patients, but palpable abdominal or rectal mass was found in only 13 individuals (Table X)’. 87 (See table 3 .)

Example of baseline demographic and clinical characteristics of participants in a study evaluating the accuracy of point-of-care fecal tests for diagnosis of organic bowel disease (adapted from Kok et al , 87 with permission)

Explanation . The diagnostic accuracy of a test can depend on the demographic and clinical characteristics of the population in which it is applied. 2 , 3 , 88–92 These differences may reflect variability in the extent or severity of disease, which affects sensitivity, or in the alternative conditions that are able to generate false-positive findings, affecting specificity. 85

An adequate description of the demographic and clinical characteristics of study participants allows the reader to judge whether the study can adequately address the study question, and whether the study findings apply to the reader's clinical question.

In the example , the authors presented the demographic and clinical characteristics of the study participants in a separate table, a commonly used, informative way of presenting key participant characteristics ( table 3 ).

Item 21a. Distribution of severity of disease in those with the target condition

Item 21b. Distribution of alternative diagnoses in those without the target condition

Example . ‘Of the 170 patients with coronary disease, one had left main disease, 53 had three vessel disease, 64 two vessel disease, and 52 single vessel disease. The mean ejection fraction of the patients with coronary disease was 64% (range 37–83). The other 52 men with symptoms had normal coronary arteries or no significant lesions at angiography’. 93

Explanation . Most target conditions are not fixed states, either present or absent; many diseases cover a continuum, ranging from minute pathological changes to advanced clinical disease. Test sensitivity is often higher in studies in which more patients have advanced stages of the target condition, as these cases are often easier to identify by the index test. 28 , 85 The type, spectrum and frequency of alternative diagnoses in those without the target condition may also influence test accuracy; typically, the healthier the patients without the target condition, the less frequently one would find false-positive results of the index test. 28

An adequate description of the severity of disease in those with the target condition and of the alternative conditions in those without it allows the reader to judge both the validity of the study relative to the study question and the applicability of the study findings to the reader's clinical question.

In the example , the authors investigated the accuracy of exercise tests for diagnosing coronary artery disease. They reported the distribution of severity of disease in terms of the number of vessels involved; the more vessels involved, the more severe the coronary artery disease. Sensitivity of the exercise test was higher in those with more diseased vessels (39% for single vessel disease, 58% for two and 77% for three vessels). 91

Item 22. Time interval and any clinical interventions between index test and reference standard

Example . ‘The mean time between arthrometric examination and MR imaging was 38.2 days (range, 0–107 days)’. 94

Explanation . Studies of diagnostic accuracy are essentially cross-sectional investigations. In most cases, one wants to know how well the index test classifies patients in the same way as the reference standard, when both tests are performed in the same patients at the same time. 30 When a delay occurs between the index test and the reference standard, the target condition and alternative conditions can change; conditions may worsen or improve in the interim, owing to the natural course of the disease or to clinical interventions applied between the two tests. Such changes influence the agreement between the index test and the reference standard, which can lead to biased estimates of test performance.

The bias can be more severe if the delay differs systematically between test positives and test negatives, or between those with a high prior suspicion of having the target condition and those with a low suspicion. 1 , 2

When follow-up is used as the reference standard, readers will want to know how long the follow-up period was.

In the example , the authors reported the mean number of days, and a range, between the index test and the reference standard.

Item 23. Cross tabulation of the index test results (or their distribution) by the results of the reference standard

Example . ‘Table X shows pain over speed bumps in relation to diagnosis of appendicitis’. 95 (See table 4 .)

Example of contingency table from a study evaluating the accuracy of pain over speed bumps for diagnosis of appendicitis (adapted from Ashdown et al , 95 with permission)

Explanation . Research findings should be reproducible and verifiable by other scientists; this applies to the testing procedures, the conduct of the study and the statistical analyses.

A cross tabulation of index test results against reference standard results facilitates recalculating measures of diagnostic accuracy. It also facilitates recalculating the proportion of study group participants with the target condition, which is useful as the sensitivity and specificity of a test may vary with disease prevalence. 32 , 96 It also allows for performing alternative or additional analyses, such as meta-analysis.

Preferably, such tables should include actual numbers, not just percentages, because mistakes made by study authors in calculating estimates for sensitivity and specificity are not rare.
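As an illustration of why actual numbers matter, the following sketch recalculates common accuracy measures directly from the four cells of a 2 × 2 table; the counts shown are hypothetical, not those of the cited speed bump study.

```python
# Illustrative sketch: recalculating accuracy measures from the actual numbers in a
# 2 x 2 cross tabulation (hypothetical counts).
tp, fp, fn, tn = 33, 21, 4, 273   # true/false positives and negatives

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)                       # positive predictive value
npv = tn / (tn + fn)                       # negative predictive value
prevalence = (tp + fn) / (tp + fp + fn + tn)

print(f"sensitivity {sensitivity:.2f}, specificity {specificity:.2f}, "
      f"PPV {ppv:.2f}, NPV {npv:.2f}, prevalence {prevalence:.2f}")
```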

In the example , the authors provided a contingency table from which the number of true positives, false positives, false negatives and true negatives can be easily identified ( table 4 ).

Item 24. Estimates of diagnostic accuracy and their precision (such as 95% CIs)

Example . ‘Forty-six patients had pulmonary fibrosis at CT, and sensitivity and specificity of MR imaging in the identification of pulmonary fibrosis were 89% (95% CI 77%, 96%) and 91% (95% CI 76%, 98%), respectively, with positive and negative predictive values of 93% (95% CI 82%, 99%) and 86% (95% CI 70%, 95%), respectively’. 97

Explanation . Diagnostic accuracy studies never determine a test's ‘true’ sensitivity and specificity; at best, the data collected in the study can be used to calculate valid estimates of sensitivity and specificity. The smaller the number of study participants, the less precise these estimates will be. 98

The most frequently used expression of imprecision is to report not just the estimates—sometimes referred to as point estimates—but also 95% CIs around the estimates. Results from studies with imprecise estimates of accuracy should be interpreted with caution, as overoptimism lurks. 22
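As an illustration, the sketch below computes Wilson score 95% CIs around sensitivity and specificity, one common choice among several (exact and other asymptotic methods also exist). The counts and the method shown are illustrative assumptions, not taken from the cited study.

```python
# Minimal sketch: Wilson score 95% confidence intervals around sensitivity and
# specificity, computed from hypothetical counts.
from math import sqrt
from scipy.stats import norm

def wilson_ci(successes, n, conf=0.95):
    z = norm.ppf(1 - (1 - conf) / 2)
    p = successes / n
    centre = (p + z ** 2 / (2 * n)) / (1 + z ** 2 / n)
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / (1 + z ** 2 / n)
    return centre - half, centre + half

tp, fn = 41, 5     # hypothetical counts in those with the target condition
tn, fp = 29, 3     # hypothetical counts in those without it
print("sensitivity 95% CI:", wilson_ci(tp, tp + fn))
print("specificity 95% CI:", wilson_ci(tn, tn + fp))
```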

In the example , where MRI is the index test and CT the reference standard, the authors reported point estimates and 95% CIs around them, for sensitivity, specificity and positive and negative predictive value.

Item 25. Any adverse events from performing the index test or the reference standard

Example . ‘No significant adverse events occurred as a result of colonoscopy. Four (2%) patients had minor bleeding in association with polypectomy that was controlled endoscopically. Other minor adverse events are noted in the appendix’. 79

Explanation . Not all medical tests are equally safe, and in this, they do not differ from many other medical interventions. 99 , 100 The testing procedure can lead to complications, such as perforations with endoscopy, contrast allergic reactions in CT imaging or claustrophobia with MRI scanning.

Measuring and reporting of adverse events in studies of diagnostic accuracy provides additional information to clinicians, who may be reluctant to use a test if it produces severe or frequent adverse events. Actual application of a test in clinical practice will not be guided by the test's accuracy alone, but by several other dimensions as well, including feasibility and safety. This also applies to the reference standard.

In the example , the authors distinguished between ‘significant’ and ‘minor’ adverse events, and explicitly reported how often these were observed.

Item 26. Study limitations, including sources of potential bias, statistical uncertainty and generalisability

Example. ‘This study had limitations. First, not all patients who underwent CT colonography (CTC) were assessed by the reference standard methods. […] However, considering that the 41 patients who were eligible but did not undergo the reference standard procedures had negative or only mildly positive CTC findings, excluding them from the analysis of CTC diagnostic performance may have slightly overestimated the sensitivity of CTC (ie, partial verification bias). Second, there was a long time interval between CTC and the reference methods in some patients, predominately those with negative CTC findings. […] If anything, the prolonged interval would presumably slightly underestimate the sensitivity and NPV of CTC for non-cancerous lesions, since some “missed” lesions could have conceivably developed or increased in size since the time of CTC’. 101

Explanation . Like other clinical trials and studies, diagnostic accuracy studies are at risk of bias; they can generate estimates of the test's accuracy that do not reflect the true performance of the test, due to flaws or deficiencies in study design and analysis. 1 , 2 In addition, imprecise accuracy estimates, with wide CIs, should be interpreted with caution. Because of differences in design, participants and procedures, the findings generated by one particular diagnostic accuracy study may not be obtained in other conditions; their generalisability may be limited. 102

In the Discussion section, authors should critically reflect on the validity of their findings, address potential limitations and elaborate on why study findings may or may not be generalisable. As bias can lead to either overestimation or underestimation of the accuracy of the index test under investigation, authors should discuss the direction of potential bias, along with its likely magnitude. Readers are then informed of the likelihood that the limitations jeopardise the study's results and conclusions (see also Item 27). 103

Some journals explicitly encourage authors to report on study limitations, but many are not specific about which elements should be addressed. 104 For diagnostic accuracy studies, we highly recommend that at least potential sources of bias are discussed, as well as imprecision, and concerns related to the selection of patients and the setting in which the study was performed.

In the example , the authors identified two potential sources of bias that are common in diagnostic accuracy studies: not all test results were verified by the reference standard, and there was a time interval between index test and reference standard, allowing the target condition to change. They also discussed the magnitude of this potential bias, and the direction: whether this may have led to overestimations or underestimations of test accuracy.

Item 27. Implications for practice, including the intended use and clinical role of the index test

Example . ‘A Wells score of ≤4 combined with a negative point of care D-dimer test result ruled out pulmonary embolism in 4–5 of 10 patients, with a failure rate of less than 2%, which is considered safe by most published consensus statements. Such a rule-out strategy makes it possible for primary care doctors to safely exclude pulmonary embolism in a large proportion of patients suspected of having the condition, thereby reducing the costs and burden to the patient (for example, reducing the risk of contrast nephropathy associated with spiral computed tomography) associated with an unnecessary referral to secondary care’. 25

Explanation . To make the study findings relevant for practice, authors of diagnostic accuracy studies should elaborate on the consequences of their findings, taking into account the intended use of the test (the purpose of testing) and its clinical role (how the test will be positioned in the existing clinical pathway).

A test can be proposed for diagnostic purposes, for susceptibility, screening, risk stratification, staging, prediction, prognosis, treatment selection, monitoring, surveillance or other purposes. The clinical role of the test reflects its positioning relative to existing tests for the same purpose, within the same clinical setting: triage, add-on or replacement. 19 , 105 The intended use and the clinical role of the index test should have been described in the introduction of the paper (Item 3).

The intended use and the proposed role will guide the desired magnitude of the measures of diagnostic accuracy. For ruling out disease with an inexpensive triage test, for example, high sensitivity is required, and less-than-perfect specificity may be acceptable. If the test is supposed to rule in disease, specificity becomes much more important. 106

In the Discussion section, authors should elaborate on whether or not the accuracy estimates are sufficient for considering the test to be ‘fit for purpose’.

In the example , the authors concluded that the combination of a Wells score ≤4 and a negative point-of-care D-dimer result could reliably rule out pulmonary embolism in a large proportion of patients seen in primary care.

Other information

Item 28. Registration number and name of registry

Example . ‘The study was registered at http://www.clinicaltrials.org ( NCT00916864 )’. 107

Explanation . Registering study protocols before their initiation in a clinical trial registry, such as ClinicalTrials.gov or one of the WHO Primary Registries, ensures that existence of the studies can be identified. 108–112 This has many advantages, including avoiding overlapping or redundant studies, and allowing colleagues and potential participants to contact the study coordinators.

Additional benefits of study registration are the prospective definition of study objectives, outcome measures, eligibility criteria and data to be collected, allowing editors, reviewers and readers to identify deviations in the final study report. Trial registration also allows reviewers to identify studies that have been completed but were not yet reported.

Many journals require registration of clinical trials. A low but increasing number of diagnostic accuracy studies are also being registered. In a recent evaluation of 351 test accuracy studies published in high-impact journals in 2012, 15% had been registered. 113

Including a registration number in the study report facilitates identification of the trial in the corresponding registry. It can also be regarded as a sign of quality, if the trial was registered before its initiation.

In the example , the authors reported that the study was registered at ClinicalTrials.gov. The registration number was also provided, so that the registered record could be easily retrieved.

Item 29. Where the full study protocol can be accessed

Example . ‘The design and rationale of the OPTIMAP study have been previously published in more detail [with reference to study protocol]’. 114

Explanation . Full study protocols typically contain additional methodological information that is not provided in the final study report, because of word limits, or because it has been reported elsewhere. This additional information can be helpful for those who want to thoroughly appraise the validity of the study, for researchers who want to replicate the study and for practitioners who want to implement the testing procedures.

An increasing number of researchers share their original study protocol, often before enrolment of the first participant in the study. They may do so by publishing the protocol in a scientific journal, at an institutional or sponsor website, or as supplementary material on the journal website, to accompany the study report.

If the protocol has been published or posted online, authors should provide a reference or a link. If the study protocol has not been published, authors should state from whom it can be obtained. 115

In the example , the authors provided a reference to the full protocol, which had been published previously.

Item 30. Sources of funding and other support; role of funders

Example . ‘Funding, in the form of the extra diagnostic reagents and equipment needed for the study, was provided by Gen-Probe. The funders had no role in the initiation or design of the study, collection of samples, analysis, interpretation of data, writing of the paper, or the submission for publication. The study and researchers are independent of the funders, Gen-Probe’. 116

Explanation . Sponsorship of a study by a pharmaceutical company has been shown to be associated with results favouring the interests of that sponsor. 117 Unfortunately, sponsorship is often not disclosed in scientific articles, making it difficult to assess this potential bias. Sponsorship can consist of direct funding of the study, or of the provision of essential study materials, such as test devices.

The role of the sponsor, including the degree to which that sponsor was involved in the study, varies. A sponsor could, for example, be involved in the design of the study, but also in its conduct, analysis, reporting and the decision to publish. Authors are encouraged to be explicit about sources of funding as well as the sponsor's role(s) in the study, as this transparency helps readers to appreciate the level of independence of the researchers.

In the example , the authors were explicit about the contribution from the sponsor, and their independence in each phase of the study.

Acknowledgments

The authors thank the STARD Group for helping us in identifying essential items for reporting diagnostic accuracy studies.


JFC and DAK contributed equally to this manuscript and share first authorship.

Contributors JFC, DAK and PMMB are responsible for drafting of manuscript. DGA, DEB, CAG, LH, LI, DL, JBR and HCWdV are responsible for critical revision of manuscript.

Competing interests None declared.

Provenance and peer review Not commissioned; externally peer reviewed.

Data sharing statement No additional data are available.


USP

USP Reference Standards

Recognized globally, USP Standards accelerate the pharmaceutical drug development process and increase confidence in the accuracy of analytical results. USP Standards are built on deep science, provide a high degree of analytic rigor and are accepted by regulators around the world. USP Standards support every stage of drug development and manufacturing, saving time and resources and helping to accelerate the development of quality medicines.

The use of USP Standards (Pharmacopeial Reference Standards and Pharmacopeial Documentary Standards Method) enables companies to operate with a high level of certainty and confidence, reducing the risk of incorrect results that could lead to unnecessary batch failures, product delays, and market withdrawals.


Reference Standards Catalog

USP currently offers more than 3,500 Reference Standards—highly characterized specimens of drug substances, excipients, food ingredients, impurities, degradation products, dietary supplements, compendial reagents and performance calibrators.


Quality Solutions Sheets

A compilation of documentary standards and physical materials related to a drug substance or product – all in one place.


Featured Products & Solutions

Kinase inhibitors.

USP’s standards and solutions – well accepted by regulators worldwide – play a critical role in accelerating drug development and minimizing risks for manufacturers seeking to release and gain approval of kinase inhibitors. USP provides official Reference Standards for active pharmaceutical ingredients and impurities, Pharmaceutical Analytical Impurities (PAIs), nitrosamine impurities, excipients, and validated methods in USP−NF monographs for our portfolio of kinase inhibitor solutions.


Erlotinib Hydrochloride - COMING SOON

  • Erlotinib Hydrochloride - NEW
  • Erlotinib Related Compound A - NEW

Erlotinib Tablets - COMING SOON

Included General Chapters:

  • <11> USP Reference Standards
  • <191> Identification Tests - General
  • <197> Spectroscopic Identification Tests
  • <281> Residue on Ignition
  • USP Dissolution Performance Verification Standard-Prednisone
  • <905> Uniformity of Dosage Units
  • Sodium Tartrate Dihydrate

Included Excipients:

  • Titanium Dioxide
  • Hypromellose
  • Lactose Monohydrate
  • Anhydrous Lactose
  • Crospovidone
  • Sodium Lauryl Sulfate
  • Butylated Hydroxytoluene
  • Methyl Acetate
  • Methyl Alcohol
  • Polyvinyl Alcohol
  • Croscarmellose Sodium
  • Aleuritic Acid  
  • Refined Bleached Shellac
  • Regular Bleached Shellac
  • Microcrystalline Cellulose

Everolimus 

  • Everolimus System Suitability Mixture
  • <61> Microbiological Examination of Nonsterile Products: Microbial Enumeration Tests
  • <62> Microbiological Examination of Nonsterile Products: Tests for Specified Microorganisms
  • USP 2-Methyl-1-Propanol
  • Residual Solvent Class 1—1,1,1-Trichloroethane
  • Residual Solvent Class 1—Carbon Tetrachloride
  • Residual Solvent Class 2—N,N-Dimethylacetamide
  • Residual Solvent Class 2—N,N-Dimethylformamide
  • Residual Solvent Class 2—N-Methylpyrrolidone
  • Triethylamine
  • 3-Methyl-1-butanol
  • Alcohol Determination - Alcohol
  • Butyl Acetate
  • Dimethyl Sulfoxide
  • Ethyl Acetate
  • Ethyl Formate
  • Formic Acid
  • Glacial Acetic Acid
  • Isobutyl Acetate
  • Isopropyl Acetate
  • Methyl Ethyl Ketone
  • Propyl Acetate
  • Residual Solvent Class 1—1,1-Dichloroethene
  • Residual Solvent Class 1—1,2-Dichloroethane
  • Residual Solvent Class 2—1,2-Dichloroethene
  • Residual Solvent Class 2—1,2-Dimethoxyethane
  • Residual Solvent Class 2—2-Ethoxyethanol
  • Residual Solvent Class 2—2-Methoxyethanol
  • Residual Solvent Class 2—Acetonitrile
  • Residual Solvent Class 2—Chlorobenzene
  • Residual Solvent Class 2—Cyclohexane
  • Residual Solvent Class 2—Ethylene Glycol
  • Residual Solvent Class 2—Methylbutylketone
  • Residual Solvent Class 2—Methylcyclohexane
  • Residual Solvent Class 2—Methylene Chloride
  • Residual Solvent Class 2—Methylisobutylketone
  • Residual Solvent Class 2—Nitromethane
  • Residual Solvent Class 2—Tetrahydrofuran
  • Residual Solvent Class 2—Trichloroethylene
  • Residual Solvents Class 2—Mixture C
  • Residual Solvent Class 1—Benzene
  • Residual Solvent Class 2—1,4-Dioxane
  • Residual Solvent Class 2—Chloroform
  • Residual Solvent Class 2—Cumene
  • Residual Solvent Class 2—Hexane
  • Residual Solvent Class 2—Methanol
  • Residual Solvent Class 2—Pyridine
  • Residual Solvent Class 2—Tetralin
  • Residual Solvent Class 2—Toluene
  • Residual Solvent Class 2—Xylenes
  • Residual Solvents Mixture—Class 1
  • Residual Solvents Class 2—Mixture A
  • Residual Solvents Class 2—Mixture B
  • tert-Butylmethyl Ether RS
  • <621> Chromatography
  • <781> Optical Rotation
  • Sodium Tartrate Dihydrate
  • Gefitinib Related Compound A
  • Gefitinib Related Compound B
  • Dichloroaniline

Gefitinib Tablets

  • <857> Ultraviolet-Visible Spectroscopy

Nilotinib  

  • Nilotinib Related Compound A
  • Nilotinib Related Compound B
  • Nilotinib Related Compound C
  • Nilotinib System Suitability Mixture
  • Regorafenib
  • Regorafenib Related Compound A
  • Sorafenib Tosylate - NEW

Included General Chapters

Excipients included with this molecule

  • Aleuritic Acid
  • Ethylene Glycol
  • Diethylene Glycol
  • Propylene Glycol

Sirolimus  - COMING SOON

  • Sirolimus System Suitability Mixture - NEW

Sirolimus Tablets  - COMING SOON

  • <61> Microbiological Examination of Nonsterile Products: Microbial Enumeration Tests
  • <62> Microbiological Examination of Nonsterile Products: Tests for Specified Microorganisms
  • <281> Residue On Ignition
  • Sorafenib Tosylate
  • Sorafenib Related Compound H

Sorafenib Tablets

  • Dissolution Performance Verification Standard-Prednisone 
  • <905> Uniformity of Dosage Units

Excipients included with this molecule:

Pharmaceutical Analytical Impurities:

  • Sorafenib Diarylurea Analog
  • Sorafenib Formamide
  • Sorafenib Isopropyl Carbamate
  • 4-Chloro-3-(trifluoromethyl)aniline
  • Deschlorosorafenib
  • Ethyl [4-chloro-3-(trifluoromethyl)phenyl]carbamate

Sunitinib Malate - COMING SOON

  • Sunitinib Malate - NEW
  • Desimidazoline Sunitinib - NEW
  • Sunitinib N-Oxide - NEW
  • Desethyl Sunitinib - NEW
  • Sunitinib Amide - COMING SOON 
  • <731> Loss on Drying

Pharmaceutical Analytical Impurities

  • Desdiethylamino Sunitinib

Performance Verification Testing

A new standard for Performance Verification Testing (PVT) is now available.

Direct Oral Anticoagulants

Standards and solutions for anticoagulant drug development.

Nitrosamines

USP offers a growing catalog of Nitrosamine impurities.


Aslam RW, Bates V, Dundar Y, et al. Automated tests for diagnosing and monitoring cognitive impairment: a diagnostic accuracy review. Southampton (UK): NIHR Journals Library; 2016 Oct. (Health Technology Assessment, No. 20.77.)


Appendix 1: Measures for assessing an index test against a reference standard

The following section outlines the different methods of assessing diagnostic outcomes. It is adapted from a previous piece of work conducted by the Liverpool Review and Implementation Group (LRiG) and is reproduced here with permission. 106

The classic presentation of the results of a clinical validity study is the so-called 2 × 2 table, as shown in Table 17 .

TABLE 17

Example of a 2 × 2 table

                      Reference standard: MCI present    Reference standard: MCI absent
New test positive     a (true positive)                  b (false positive)
New test negative     c (false negative)                 d (true negative)

The number entered into cell ‘a’ is the number of patients for whom the new test correctly diagnoses MCI (as determined by the reference standard, in this case a clinical diagnosis of MCI). For these people, the new test is positive as is the reference standard; these are TPs.

The number entered into cell ‘b’ is the number of patients for whom the new test is positive (i.e. indicates the presence of MCI) but who do not, according to the reference standard (clinical diagnosis), have MCI. The new test has incorrectly diagnosed MCI; these are FPs.

The number entered into cell ‘c’ is the number of patients who are identified through the reference standard (clinical diagnosis) as having MCI but for whom the new test gave negative results. The new test has incorrectly labelled the patient as having MCI; these are FNs.

The number in cell ‘d’ is the number of patients who do not, according to the reference standard (clinical diagnosis), have MCI and who are also shown by the new test to be free from disease; these are TNs.

The numbers displayed in a 2 × 2 table are used to generate other summary measures. These are set out in Table 18 .

TABLE 18

Summary measures derived from numbers in a 2 × 2 table

In an ideal world, a test would be 100% sensitive and 100% specific. However, in reality there is often a trade-off between the two, with tests that have high sensitivity having low specificity and vice versa.

The use of a 2 × 2 table requires that the test results are dichotomous, that is, they can be divided into two groups: test positive and test negative.

  • Receiver operating characteristic curve

When an intervention test has a range of possible thresholds that could be used to divide results into test positive and test negative, the relationship between the threshold used and the performance of the test can be examined in a receiver operating characteristic curve. This is a graphical plot of the sensitivity (TP rate) against 1 – specificity or the FP rate for each threshold; examples of a receiver operating characteristic curve are shown in Figure 5 , with the associated distribution of the index tests in diseased and non-diseased populations. An ideal test would have a point in the top-left corner with 100% specificity and 100% sensitivity.

Examples of a receiver operating characteristic curve. (Image reproduced from Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy with permission.)

  • Area under a receiver operating characteristic curve

The receiver operating characteristic curve can be used to assess the degree to which sensitivity changes at different levels of specificity or vice versa. Some studies report AUC as a proportion of the total area of the graph. This is a measure of the predictive accuracy or discrimination of the diagnostic test, that is, the ability of the test to discriminate between those who have (or will develop) MCI from those who do not have (or will not develop) MCI.

The AUC can also be expressed as the probability that someone with the disease will have a higher test result than someone without the disease. It is also referred to as the c-statistic. An AUC of 1.0 indicates a perfect test, and an AUC of 0.5 (the diagonal line) indicates that the test is no better than chance (i.e. 50% probability) in predicting whether or not the disease is present. An AUC of 0.5 to 0.7 is considered poor discrimination, 0.7 to 0.8 acceptable discrimination, 0.8 to 0.9 excellent discrimination and > 0.9 exceptional discrimination.
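The probabilistic interpretation of the AUC can be computed directly, as in the sketch below, which counts the proportion of diseased/non-diseased pairs in which the diseased subject has the higher test value (ties counted as one half). The values used are hypothetical and purely illustrative.

```python
# Minimal sketch: the AUC (c-statistic) computed as the probability that a randomly
# chosen diseased subject has a higher index test result than a randomly chosen
# non-diseased subject, with ties counted as one half. Hypothetical data.
def auc_by_concordance(test_values, disease_status):
    diseased = [v for v, d in zip(test_values, disease_status) if d == 1]
    non_diseased = [v for v, d in zip(test_values, disease_status) if d == 0]
    pairs = len(diseased) * len(non_diseased)
    concordant = sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in diseased for y in non_diseased
    )
    return concordant / pairs

values = [12, 25, 31, 18, 40, 52, 47, 60]
status = [0, 0, 0, 1, 0, 1, 1, 1]
print(f"AUC (c-statistic): {auc_by_concordance(values, status):.2f}")
```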

  • Positive predictive value and negative predictive value

The PPV is the probability that subjects with a positive screening test truly have the disease.

The NPV is the probability that subjects with a negative screening test truly do not have the disease.

The PPV and NPV are clinically significant, as they give probabilities that an individual is truly MCI/early dementia positive given that they tested positive or truly MCI/early dementia negative given that they tested negative.

  • Likelihood ratio

The LR gives another measure of performance for the disease, and is described in chapter 10 of the Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy , 107 as follows:

The LR+ describes how many times more likely positive index test results were in the diseased group than in the non-diseased group. The LR+, which should be > 1 if the test is informative, is defined as:

LR+ = P(T+ | D+) / P(T+ | D–)

(where T+ is index test positive, T– is index test negative, D+ is diseased, D– is non-diseased, P means probability and | means ‘given that’ or ‘on condition that’) and is estimated as:

LR+ = sensitivity / (1 - specificity)

The LR– describes how many times less likely negative index test results were in the diseased group than in the non-diseased group. The LR–, which should be < 1 if the test is informative, is defined as:

LR– = P(T– | D+) / P(T– | D–)

and is estimated as:

LR– = (1 - sensitivity) / specificity

Included under terms of UK Non-commercial Government License.



[Reference standards in diagnostic research: problems and solutions]

Affiliation.

  • 1 Universitair Medisch Centrum Utrecht, Julius Centrum voor Gezondheidswetenschappen en Eerstelijns Geneeskunde, afd. Klinische Epidemiologie, Utrecht.
  • PMID: 25589276

The accuracy of diagnostic tests is of utmost importance as biased test results may lead to wrong decisions in clinical practice. In diagnostic accuracy research the results of a diagnostic test, model or strategy are compared to those of the reference standard, i.e. the best available method to determine whether a certain condition or disease is present or absent. Problems with the reference standard lead to biased test results. The umbrella term for this is 'verification bias'. Verification bias arises if the reference standard cannot be applied to all patients, if investigators use different reference standards or simply because there is no reference standard. Correction of these problems is often possible, and, if it is applied in a transparent and reproducible fashion it will deliver useful diagnostic information. Clinicians who use a diagnostic test should take possible verification bias into account.





What Is a Reference Standard?

Morey, Timothy E. MD; Rice, Mark J. MD; Gravenstein, Nikolaus MD

From the Department of Anesthesiology, University of Florida, Gainesville, Florida.

Accepted for publication July 15, 2014.

Funding: Funded by the Department of Anesthesiology, University of Florida, Gainesville, FL.

The authors declare no conflicts of interest.

Reprints will not be available from the authors.

Address correspondence to Timothy E. Morey, MD, Department of Anesthesiology, University of Florida, PO 100254, Gainesville, FL 32610. Address e-mail to [email protected] .

This month in the journal, Carabini and colleagues observed in “A Comparison of Hemoglobin Measured by Co-Oximetry and Central Laboratory During Major Spine Fusion Surgery” that significant differences existed in measured hemoglobin concentration from 1,832 contemporaneous blood specimens depending on whether the practitioner sent the blood sample aliquots to the central laboratory auto-analyzers for a complete blood count (CBC) or to the arterial blood gas (ABG) “stat” laboratory. 1 More specifically, the ABG hemoglobin overestimated the CBC hemoglobin by 0.4 g/dL on average, although 7% of measurements had a >1.0 g/dL difference. Moreover, there was only fair to moderate agreement in the range of hemoglobin where a “transfusion trigger” might be pulled. These results led the authors to conclude that CBC and ABG techniques “… cannot be used interchangeably … when managing a patient with critical blood loss.” 1 It also means that the two devices should not be used interchangeably as reference standards for scientific investigations.
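An agreement analysis of this kind (the mean difference between paired methods and the proportion of pairs differing by more than a clinically relevant amount) is straightforward to reproduce. The sketch below is a generic illustration on simulated data rather than the authors' analysis; the 0.4 g/dL bias and the noise levels are hypothetical values chosen only to mimic the reported figures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated paired hemoglobin measurements (g/dL); all values are hypothetical
true_hb = rng.uniform(6.0, 14.0, size=1832)
cbc = true_hb + rng.normal(0.0, 0.3, size=true_hb.size)        # central laboratory CBC
abg = true_hb + 0.4 + rng.normal(0.0, 0.3, size=true_hb.size)  # ABG co-oximeter with a 0.4 g/dL bias

diff = abg - cbc
mean_bias = diff.mean()                                        # average ABG minus CBC difference
loa = mean_bias + 1.96 * diff.std(ddof=1) * np.array([-1, 1])  # Bland-Altman limits of agreement
pct_large_diff = np.mean(np.abs(diff) > 1.0) * 100             # share of pairs differing by >1.0 g/dL

print(f"mean bias: {mean_bias:.2f} g/dL")
print(f"95% limits of agreement: {loa[0]:.2f} to {loa[1]:.2f} g/dL")
print(f"pairs differing by >1.0 g/dL: {pct_large_diff:.1f}%")
```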

Lord William Kelvin (1824–1907), a founder of thermodynamics, cogently noted, “In physical science the first essential step in the direction of learning any subject is to find principles of numerical reckoning and practicable methods for measuring some quality connected with it. I often say that when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre [sic] and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the state of Science, whatever the matter may be.” 2 This nineteenth-century observation was illustrated by Henry Marks and still serves us well today (Fig. 1). Based on the work of Carabini and colleagues in the twenty-first century, one wonders whether the method of measurement may affect the conclusions for a variety of medical investigations that may or may not use CBC or ABG hemoglobin concentrations to test a certain hypothesis. That is, do different “numerical reckonings” affect the conclusions? For example, Brown and colleagues published in this journal earlier this year an analysis of 20,930 patients to test their hypothesis that older patients have a greater risk of packed red blood cell transfusion than younger ones. 3 They accepted their hypothesis partly based on, “When patients were stratified by lowest in-hospital hemoglobin (7.00–7.99, 8.00–8.99, 9.00–9.99, and ≥10.00 g/dL), the odds of transfusion generally increased with each additional decade of age in every stratum, except for that containing patients in whom the lowest in-hospital hemoglobin did not decrease below <10 g/dL.” 3 The study is informative to the journal’s readers to discuss hemoglobin concentration triggers in the context of age, but what if the values change depending on the method of measurement? How many of those 20,930 patients had CBC or ABG hemoglobin measurements, or a mixture of both, as with Carabini and colleagues’ hospital? It is likely that a mean difference of CBC and ABG hemoglobin concentrations of 0.4–0.5 g/dL, as reported by Carabini and colleagues, affected the interpretation of this study given patient stratifications of only 1.0 g/dL. Similarly, in another study, it is difficult to understand the impact of an investigation exploring intensive treatment to maintain the hematocrit at <45% in patients with polycythemia vera where the hemoglobin assay is not detailed. 4 The methods of hemoglobin measurement are not mentioned in the studies, perhaps because the values were perceived as accurate with negligible differences between CBC and ABG labs. Now, Carabini and colleagues have shown us this belief may not be true. Moreover, the method of hemoglobin determination becomes relevant when developing new, point-of-care hemoglobin measurement technologies based on spectrophotometry, occlusion spectroscopy, pulse co-oximetry, or other future techniques as yet undescribed.
That is, the referent standard for comparison to a developing technology may play an important role in observed bias and precision, something previously termed “inappropriate reference standard” that causes “errors of imperfect reference standard or standards bias the measurement of diagnostic accuracy of the index test.” 5 In part to address inappropriate reference standard bias, it would be ideal to remove this confounding variable of a wavering referent value depending on whether CBC or ABG hemoglobin measurements were used.
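To make the stratification concern concrete, the following sketch (again with simulated, hypothetical values) counts how many patients would fall into a different 1.0 g/dL lowest-hemoglobin stratum if their values had been obtained with a method reading 0.4 g/dL higher.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical lowest in-hospital hemoglobin values (g/dL)
hb_cbc = rng.uniform(6.5, 11.5, size=20930)
hb_abg = hb_cbc + 0.4            # same patients, measured by a method reading 0.4 g/dL higher

bins = [7.0, 8.0, 9.0, 10.0]     # stratum boundaries used in the transfusion analysis

def stratum(values):
    # 0: <7.00, 1: 7.00-7.99, 2: 8.00-8.99, 3: 9.00-9.99, 4: >=10.00
    return np.digitize(values, bins)

reclassified = np.mean(stratum(hb_cbc) != stratum(hb_abg)) * 100
print(f"{reclassified:.0f}% of patients change stratum with a 0.4 g/dL shift")
```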

[Figure 1: Henry Marks’s illustration of Kelvin’s observation]

Recently, the U.S. Food and Drug Administration strengthened its accuracy requirement for hospital point-of-care glucose meters from ±20% to ±10% of reference concentrations for values >75 mg/dL and from ±15 mg/dL to ±7 mg/dL for those <75 mg/dL. 6 Reference central laboratory glucose devices have coefficients of variation of approximately 2.5%, which translates into a 95% confidence interval (CI) of about ±5%. At a glucose concentration of 70 mg/dL, this means that the reference is accurate to ±3.5 mg/dL (95% CI). When considering an error budget for a point-of-care glucose meter, fully one-half of the error budget can be immediately consumed by the innate error in the reference device. Stated in slightly different terms, the point-of-care device must be twice as accurate as the U.S. Food and Drug Administration desires to meet the new, stricter accuracy requirements because of the unknown error in the reference devices. The natural inaccuracies of our reference instruments create problems for several groups of stakeholders. First, when reading the literature, it becomes nearly impossible for the clinician to figure out the “true” accuracy of a device. The reported accuracy in a study is measured against some “gold standard,” which has a natural bias and precision (that is not and cannot be zero). Use of different reference instruments can greatly influence reported accuracy. Second, clinicians and scientists have a real dilemma when investigating the accuracy of diagnostic devices. The reference chosen can greatly influence the reported accuracy. Finally, those who develop new techniques for measuring glucose and hemoglobin may struggle to identify appropriate reference standards when testing new devices and reporting accuracy to regulatory bodies.
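The error-budget arithmetic above can be restated in a few lines of code. The sketch below simply re-expresses the numbers quoted in the preceding paragraph: the reference laboratory's coefficient of variation is converted into a 95% uncertainty interval and compared with the allowed total error at a low glucose value.

```python
# Error budget for a point-of-care glucose meter at a low glucose concentration
glucose = 70.0          # mg/dL, below the 75 mg/dL cut-off
allowed_error = 7.0     # mg/dL, stricter requirement for values <75 mg/dL
reference_cv = 0.025    # ~2.5% coefficient of variation of the central laboratory method

# 95% interval of the reference method (approximately +/- 2 standard deviations)
reference_uncertainty = 1.96 * reference_cv * glucose   # about 3.4 mg/dL

budget_consumed = reference_uncertainty / allowed_error
print(f"reference uncertainty: +/-{reference_uncertainty:.1f} mg/dL")
print(f"share of the +/-{allowed_error:.0f} mg/dL error budget consumed: {budget_consumed:.0%}")
```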

What is the answer to the question of what the true reference is? There is no perfect standard instrument for measuring a biological substance. However, as clinicians and scientists, there are a few things we can do to improve how we conduct diagnostic device accuracy research. First, investigators should choose the most accurate standard instrument available. This device may not be the most convenient or the least expensive for our limited science budgets. Moreover, it is not likely one that is used every day in the “stat lab.” Second, authors should clearly note the known bias and precision of the instrument used as the standard in the methods of manuscripts. Third, in the discussion of manuscripts, researchers should explain that reference measurements do have variation and that these differences affect the reported accuracy of new instruments. The data reported by Carabini and colleagues go a long way toward making these points.
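The third recommendation rests on the fact that variation in the reference inflates the apparent error of the new instrument. A short simulation, using hypothetical noise levels rather than any real device's specifications, demonstrates the effect.

```python
import numpy as np

rng = np.random.default_rng(42)

n = 100_000
true_value = rng.uniform(8.0, 14.0, size=n)                # hypothetical true hemoglobin (g/dL)
reference = true_value + rng.normal(0.0, 0.3, size=n)      # imperfect reference instrument
device = true_value + rng.normal(0.0, 0.4, size=n)         # new device under evaluation

error_vs_truth = np.std(device - true_value)        # what we would like to report
error_vs_reference = np.std(device - reference)      # what a comparison study actually measures

print(f"device SD vs truth:     {error_vs_truth:.2f} g/dL")
print(f"device SD vs reference: {error_vs_reference:.2f} g/dL")   # inflated by reference noise
```

Here the device's true standard deviation of 0.4 g/dL appears as roughly sqrt(0.4² + 0.3²) ≈ 0.5 g/dL when judged against the imperfect reference, which is exactly why the reference instrument's own bias and precision should be reported alongside the comparison.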

DISCLOSURES

Name: Timothy E. Morey, MD.

Contribution: This author helped write the manuscript.

Attestation: Timothy E. Morey approved the final manuscript.

Name: Mark J. Rice, MD.

Attestation: Mark J. Rice approved the final manuscript.

Name: Nikolaus Gravenstein, MD.

Attestation: Nikolaus Gravenstein approved the final manuscript.

This manuscript was handled by: Maxime Cannesson, MD, PhD.



Related excerpts

  1. Value of composite reference standards in diagnostic research

    A composite reference standard is a fixed rule used to make a final diagnosis based on the results of two or more tests, referred to as component tests. For each possible pattern of component test results (test profiles), a decision is made about whether it reflects presence or absence of the target disease. (A minimal code sketch of such a fixed rule follows after this list.)

  2. Recommendations and Best Practices for Reference Standards ...

    This commentary will focus on reference standards and key reagents, such as metabolites and internal standards used in the support of regulated bioanalysis for new chemical entities (NCEs) and new biological entities (NBEs).

  3. Reference Standards, Judges, and Comparison Subjects

    Experts generate a reference standard and serve as comparison subjects. Experts generate responses, which are combined into a reference standard (middle column).

  4. Table 10: Reference Standards: Common Practices and ... - CASSS

    According to ICHQ6A, a reference standard, or reference material, is a substance prepared for use as the standard in an assay, identification, or purity test. It should have appropriate quality for its intended use.

  5. STARD 2015 items: explanation and elaboration - BMJ Open

    In diagnostic accuracy studies, the reference standard is used for establishing the presence or absence of the target condition in study participants. Several reference standards may be available to define the same target condition.

  6. USP Reference Standards

    USP currently offers more than 3,500 Reference Standards—highly characterized specimens of drug substances, excipients, food ingredients, impurities, degradation products, dietary supplements, compendial reagents and performance calibrators.

  7. Measures for assessing an index test against a reference standard

    The number entered into cell ‘a’ is the number of patients for whom the new test correctly diagnoses MCI (as determined by the reference standard, in this case a clinical diagnosis of MCI). For these people, the new test is positive as is the reference standard; these are TPs. (This tabulation is also worked through in the code sketch after this list.)


  8. Reference standards - ScienceDirect

    Reference Standard: A reference standard is broadly defined as certified material or substance, supplied by a certifying body, which exhibits one or more properties that are sufficiently well established (and assigned) that it may be used for calibration of an apparatus, assessment of a measurement method, and assigning values to materials.

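Several excerpts above describe a composite reference standard as a fixed rule mapping each component-test profile to a final diagnosis (excerpt 1) and the 2×2 tabulation of an index test against a reference standard (excerpt 7). The minimal sketch below puts the two together; the "either component test positive counts as diseased" rule, the component tests, and the patient records are all hypothetical and serve only to illustrate the mechanics.

```python
from itertools import product

def composite_rule(test_a_positive: bool, test_b_positive: bool) -> bool:
    """Fixed rule: classify as diseased if either hypothetical component test is positive.

    This 'any positive' rule is only one possible composite; other rules assign
    a different diagnosis to each test profile.
    """
    return test_a_positive or test_b_positive

# Every possible component-test profile receives a predefined final diagnosis
for profile in product([True, False], repeat=2):
    print(profile, "->", "diseased" if composite_rule(*profile) else "not diseased")

def accuracy_vs_reference(records):
    """records: iterable of (index_test_positive, reference_positive) pairs."""
    tp = fp = fn = tn = 0
    for index_pos, ref_pos in records:
        if index_pos and ref_pos:        # cell 'a': true positives
            tp += 1
        elif index_pos and not ref_pos:  # cell 'b': false positives
            fp += 1
        elif not index_pos and ref_pos:  # cell 'c': false negatives
            fn += 1
        else:                            # cell 'd': true negatives
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)   # sensitivity, specificity

# Hypothetical patients: (index test result, component test A, component test B)
patients = [(True, True, False), (True, False, True), (False, False, False),
            (True, False, False), (False, True, False), (False, False, False)]
pairs = [(idx, composite_rule(a, b)) for idx, a, b in patients]
print(accuracy_vs_reference(pairs))
```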