Evidence-based laboratory medicine

There are various types of evidence we accept for laboratory tests and biomarkers: evidence about the analytical performance of an assay; evidence about quality control in the laboratory and quality assurance from external schemes; and evidence about issues like sensitivity and specificity in particular clinical circumstances. What we rarely have, though, is evidence that the use of a laboratory test can, for a given patient or group of patients, make a clinically relevant difference to the diagnosis or treatment. Evidence-based laboratory medicine (EBLM) has to encompass all of these types of evidence, of course, but the judgement will increasingly be made on clinical outcomes.

Whether systematic reviews will be helpful for EBLM, as they have been for treatments, is questionable, however. One description of levels of evidence commonly used for studies of diagnostic tests is shown in Table 1.1. The keys to good quality have been said to be independence, masked comparison with a reference standard and consecutive patients from an appropriate population. Lower quality comes from inappropriate populations, comparisons that are not masked, and the use of different reference standards. Until recently, we lacked any empirical or theoretical evidence about the levels of bias that any of these study architectures can impart.

Table 1.1. Levels of evidence for studies of diagnostic methods

Level  Criteria
1  An independent, masked comparison with a reference standard among an appropriate population of consecutive patients
2  An independent, masked comparison with a reference standard among non-consecutive patients, or confined to a narrow population of study patients
3  An independent, masked comparison with an appropriate population of patients, but reference standard not applied to all study patients
4  Reference standard not applied independently or masked
5  Expert opinion with no explicit critical appraisal, based on physiology, bench research or first principles

A new contribution from Holland [5] provides the missing link. The authors searched for and found 26 systematic reviews of diagnostic tests with at least five included studies. Only 11 could be used in their analysis, because the other 15 were either not systematic in their searching or did not report any sensitivity or specificity. Data from the remaining reviews were subjected to mathematical analysis, to investigate whether the presence or absence of some item of proposed study quality made a difference to the perceived value of the test.

There were 218 individual studies, only 15 of which satisfied all eight of the quality criteria examined in this analysis. Thirty per cent fulfilled at least six of the eight criteria. To evaluate bias, the authors calculated the relative diagnostic odds ratio by comparing the diagnostic performance of a test in those studies that failed to satisfy a methodological criterion with its performance in studies that did meet the criterion. Overestimation of effectiveness (positive bias) of a diagnostic test was indicated when the lower limit of the confidence interval for the relative diagnostic odds ratio was greater than 1.
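
To make the arithmetic concrete, here is a minimal sketch in Python with invented numbers (they are not taken from the review [5]): it computes a diagnostic odds ratio from sensitivity and specificity, then the relative diagnostic odds ratio comparing studies that fail a methodological criterion with those that meet it.

```python
# Illustrative sketch only: diagnostic odds ratio (DOR) and relative DOR.
# All numbers are invented for the example, not data from ref. [5].

def diagnostic_odds_ratio(sensitivity: float, specificity: float) -> float:
    """DOR = (sens / (1 - sens)) / ((1 - spec) / spec)."""
    return (sensitivity / (1 - sensitivity)) / ((1 - specificity) / specificity)

# Pooled performance in studies that fail a methodological criterion
# (e.g. unblinded interpretation) versus studies that meet it.
dor_failing = diagnostic_odds_ratio(sensitivity=0.90, specificity=0.85)
dor_meeting = diagnostic_odds_ratio(sensitivity=0.85, specificity=0.80)

# A relative DOR above 1 (with its lower confidence limit above 1) means
# the flawed studies make the test look better than the sound studies do,
# i.e. positive bias.
relative_dor = dor_failing / dor_meeting
print(f"DOR in studies failing the criterion: {dor_failing:.1f}")
print(f"DOR in studies meeting the criterion: {dor_meeting:.1f}")
print(f"Relative DOR: {relative_dor:.2f}")
```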

The results are shown in Table 1.2. Use of different reference tests, lack of blinding and lack of a description of either the test or the population in which the test was studied all led to positive bias. However, the largest source of positive bias was evaluation of a test in a group of patients already known to have the disease alongside a separate group of normal subjects - called a case-control study in the paper [5].

Table 1.2. Empirical evidence of bias in diagnostic test studies of different architecture

Study characteristic: relative diagnostic odds ratio (95% CI)

Case-control design (a group of patients already known to have the disease compared with a separate group of normal subjects): 3.0 (2.0-4.5)
Different reference tests (different reference tests used for patients with and without the disease): 2.2 (1.5-3.3)
Not blinded (interpretation of test and reference not blinded to outcomes): 1.3 (1.0-1.9)
No description of test (test not properly described): 1.7 (1.1-1.7)
No description of population (population under investigation not properly described): 1.4 (1.1-1.7)
No description of reference (reference standard not properly described): 0.7 (0.6-0.9)

The relative diagnostic odds ratio indicates the diagnostic performance of a test in studies failing to satisfy the methodological criterion relative to its performance in studies with the corresponding feature [5].

There are also pointers to good practice in the publication of articles on diagnostic tests. The authors of a most important paper [6] set out seven methodological standards (Table 1.3). They then looked at papers published in the Lancet, British Medical Journal, New England Journal of Medicine and Journal of the American Medical Association from 1978 through 1993 to see how many reports of diagnostic tests met these standards. They found 112 articles, predominantly on radiological tests and immunoassays. Few of the standards were met consistently - ranging from 51% avoiding workup bias down to 9% reporting accuracy in pertinent subgroups (Table 1.3). While there was an overall improvement over time in the number of standards that reports met, even in the most recent period studied only 24% met up to four standards, and only 6% up to six.

Most diagnostic test evaluations are structured to examine patients with a disease compared with those without the disease - a case-control design. Astonishingly few studies are performed according to the highest standard in Table 1.1, and the studies which have been published are seriously flawed, as Read et al. [6] have demonstrated. It must be questioned, therefore, whether any systematic review of diagnostic tests is worthwhile.


Just as large samples are needed to overcome the random effects of chance for treatments, so they are also needed for tests. An example is the controversy over falling sperm counts. A meta-analysis [7] collected 61 studies on sperm counts published between 1938 and 1990. Almost half of these studies (29/61) examined fewer than 50 men; the smallest included nine men and the largest 4435. Only 2% of the data on nearly 15 000 men were collected before 1970, and those early data came from small studies. Figure 1.2 shows the variability by study size. The overall mean sperm count was 77 million/ml, but small individual studies recorded means from 40 to 140 million/ml. Only the large studies estimated the overall mean correctly, and any apparent fall over time is spurious because the older studies were small.
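
The effect of study size can be illustrated with a short simulation. This is not an analysis of the data in the meta-analysis [7]; it simply assumes a hypothetical population with a true mean of 77 million/ml and an arbitrary spread, and shows that the means of small simulated studies scatter widely around the true value while large studies cluster close to it.

```python
# Illustrative simulation: small studies give unstable estimates of a mean.
# Population parameters are assumptions, not data from ref. [7].
import random

random.seed(1)
TRUE_MEAN = 77.0   # million/ml, the overall mean quoted in the text
SPREAD = 25.0      # assumed standard deviation between men

def study_mean(n_men: int) -> float:
    """Mean sperm count from one simulated study of n_men participants."""
    counts = [max(0.0, random.gauss(TRUE_MEAN, SPREAD)) for _ in range(n_men)]
    return sum(counts) / n_men

for size in (10, 50, 500, 4000):
    means = [study_mean(size) for _ in range(5)]
    formatted = ", ".join(f"{m:5.1f}" for m in means)
    print(f"n = {size:4d}: five study means = {formatted}")
```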

Table 1.3. Standards of reporting quality for studies of diagnostic tests

Spectrum composition (27% of reports met this standard)
Background: The sensitivity and specificity of a test depend on the characteristics of the population studied. Change the population and you change these indices. Since most diagnostic tests are evaluated on populations with more severe disease, the reported values for sensitivity and specificity may not be applicable to other populations with less severe disease in which the test will be used.
Criteria: For this standard to be met, the report had to contain information on any three of these four criteria: age distribution, sex distribution, summary of presenting clinical symptoms and/or disease stage, and eligibility criteria for study subjects.

Pertinent subgroups (9% of reports met this standard)
Background: Sensitivity and specificity may represent average values for a population. Unless the condition for which a test is to be used is narrowly defined, the indices may vary in different medical subgroups. For successful use of the test, separate indices of accuracy are needed for pertinent individual subgroups within the spectrum of tested patients.
Criteria: This standard was met when results for indices of accuracy were reported for any pertinent demographic or clinical subgroup (for example, symptomatic versus asymptomatic patients).

Avoidance of workup bias (51% of reports met this standard)
Background: This form of bias can occur when patients with positive or negative diagnostic test results are preferentially referred to receive verification of diagnosis by the gold standard procedure.
Criteria: For this standard to be met in cohort studies, all subjects had to be assigned to receive both the diagnostic test and the gold standard verification, either by direct procedure or by clinical follow up. In case-control studies, credit depended on whether the diagnostic test preceded or followed the gold standard procedure. If it preceded, credit was given if disease verification was obtained for a consecutive series of study subjects regardless of their diagnostic test result. If it followed, credit was given if test results were stratified according to the clinical factors which evoked the gold standard procedure.

Avoidance of review bias (43% of reports met this standard)
Background: This form of bias can be introduced if the diagnostic test or the gold standard is appraised without precautions to achieve objectivity in their sequential interpretation - like blinding in clinical trials of a treatment. It can be avoided if the test and gold standard are interpreted separately by persons unaware of the results of the other.
Criteria: For this standard to be met in either prospective cohort studies or case-control studies, a statement was required regarding the independent evaluation of the two tests.

Precision of results for test accuracy (12% of reports met this standard)
Background: The reliability of sensitivity and specificity depends on how many patients have been evaluated. Like many other measures, the point estimate should have confidence intervals around it, which are easily calculated (see the sketch after this table).
Criteria: For this standard to be met, confidence intervals or standard errors must be quoted, regardless of magnitude.

Presentation of indeterminate test results (26% of reports met this standard)
Background: Not all tests come out with a black or white, yes/no answer. Sometimes they are equivocal, or indeterminate. The frequency of indeterminate results will limit a test's applicability, or make it cost more because further diagnostic procedures are needed. The frequency of indeterminate results, and how they are used in calculations of test performance, represent critically important information about the test's clinical effectiveness.
Criteria: For this standard to be met, a study had to report all of the appropriate positive, negative or indeterminate results generated during the evaluation and whether indeterminate results had been included or excluded when indices of accuracy were calculated.

Test reproducibility (26% of reports met this standard)
Background: Tests may not always give the same result, for a whole variety of reasons of test variability or observer interpretation. The reasons for this, and its extent, should be investigated.
Criteria: For this standard to be met in tests requiring observer interpretation, at least some of the tests should have been evaluated for a summary measure of observer variability. For tests without observer interpretation, credit was given for a summary measure of instrument variability.
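
The 'precision of results' standard notes that confidence intervals are easily calculated. As a minimal sketch, using the simple normal approximation to the binomial (exact or Wilson intervals behave better at extreme values), sensitivity and its 95% confidence interval can be obtained as follows; the numbers are invented for illustration.

```python
# Minimal sketch: 95% confidence interval for sensitivity using the
# normal approximation to the binomial. Illustrative numbers only.
import math

def sensitivity_ci(true_positives: int, false_negatives: int, z: float = 1.96):
    """Return (sensitivity, lower limit, upper limit) of an approximate 95% CI."""
    n = true_positives + false_negatives   # all diseased subjects tested
    p = true_positives / n                 # point estimate of sensitivity
    se = math.sqrt(p * (1 - p) / n)        # standard error of a proportion
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

# The same point estimate (0.90) is far less precise when it comes
# from 50 patients than from 1000 patients.
for tp, fn in ((45, 5), (900, 100)):
    sens, lower, upper = sensitivity_ci(tp, fn)
    print(f"{tp}/{tp + fn}: sensitivity {sens:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```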

Figure 1.2. Mean sperm counts from individual studies in a meta-analysis [7]. Each symbol represents one study; the size of the symbol is proportional to the number of patients included. The vertical line shows the overall mean (77 million/ml) from over 15 000 men.

Table 1.4. Frequency of use of methods of assessing test accuracy (50 physicians in each category)

[Columns: Bayesian method, ROC curve, likelihood ratios. Rows: specialist physician, generalist physician, general surgeon, family practice, overall percentage. The overall figures, as given in the text, were 3% for Bayesian methods and 1% each for ROC curves and likelihood ratios.]
Expressing results

How do doctors use information about tests? A survey of US doctors [8] showed that almost none of them use measures such as sensitivity and specificity in any formal way. Researchers at Yale drew a stratified random sample of physicians across the USA in six specialties involving direct patient care (at least 40% of time spent with patients). These physicians were then contacted for a 10-minute telephone survey about their attitudes to formal methods of test use. There were 10 questions, reproduced in an appendix to the paper. A typical question was: 'Do you use test sensitivity and specificity values when you order tests or interpret test results?'

There were 300 physicians in the final sample, 50 in each specialty. Few of them used formal methods of assessing test accuracy (Table 1.4). Bayesian methods were used by 3%, and receiver-operating characteristic (ROC) and likelihood ratio data by 1% each. Although as many as 84% said they used sensitivity and specificity at some time, from adopting the use of a new test to using them when interpreting a diagnostic test result, this was almost always done in an informal way.
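
For readers unfamiliar with what these formal methods involve, here is a minimal sketch, with invented numbers, of the likelihood ratio form of Bayes' theorem: a positive likelihood ratio is derived from sensitivity and specificity and then used to convert a pre-test probability into a post-test probability.

```python
# Illustrative sketch of the likelihood ratio form of Bayes' theorem.
# Sensitivity, specificity and pre-test probabilities are invented;
# they do not come from any study cited in the text.

def positive_likelihood_ratio(sensitivity: float, specificity: float) -> float:
    """LR+ = sensitivity / (1 - specificity)."""
    return sensitivity / (1 - specificity)

def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Convert probability to odds, multiply by the LR, convert back."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

lr_plus = positive_likelihood_ratio(sensitivity=0.90, specificity=0.80)  # LR+ = 4.5
for pre in (0.10, 0.30, 0.50):
    post = post_test_probability(pre, lr_plus)
    print(f"pre-test probability {pre:.0%} -> post-test {post:.0%} after a positive result")
```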

The authors make a number of salient points. Firstly, information on test accuracy must be 'instantly available' when tests are ordered. Secondly, formal training needs to be improved. Thirdly, published information is mostly useless, because it usually fails to reflect the patient population in which the test is being used. They might also have gone further and said that we need new and better ways of expressing the results of diagnostic tests. Sensitivity and specificity, ROC curves and likelihood ratios are neither understood nor used by doctors. This remains a huge challenge.
