Reproducibility: if you can’t reproduce the result, did it actually happen?

Written by Dr Karina Bunting, Cardiac Physiologist Research Fellow, University of Birmingham

Reproducibility underpins everything we do, whether it is in clinical practice, research or in ensuring you’re weighing out the same amount of flour for another lockdown banana bread bake!

Reproducible measurements are essential for the tests we carry out as clinical scientists, as ultimately the results we report guide the management of patients. To take a topical example, imagine you carry out a coronavirus test on a patient and it comes back negative, but you do another swab a few hours later and it comes back positive, so you take another one and it comes back negative. What do you do? Do you put the patient on a coronavirus ward or on a clean ward? If you choose incorrectly, this could have harmful consequences for the patient or for other patients on the ward.

It is essential that tests are reproducible so that any change observed can be assumed to be due to actual clinical change and not an error in the measurement process. However, in scientific studies the reproducibility testing is often confined to a few sentences, tucked away at the end of the results section. And in clinical practice, how often are you interrogating how reproducible your department’s measurements are?

So firstly what does reproducibility mean and how is this different to accuracy? I’m sure we’ve all come across the bullseye analogy, but here is a reminder of how reproducibility differs from accuracy.

Figure 1. Bullseye

Bullseye (a) shows a high level of reproducibility because every time the shooter has aimed, the arrows have hit the target within a short range of each other. However, the arrows are far away from the centre of the bullseye meaning that although it’s reproducible, the shooter’s aim is not accurate. This would be the same as measuring a patient’s blood pressure and it consistently coming back as high within a narrow range of readings (153/92 mmHg, 152/91 mmHg, 155/92 mmHg, 154/90 mmHg) but the accurate blood pressure reading is normal at 120/80 mmHg.

In bullseye (b), the arrows have all hit the bullseye close to the centre suggesting the shooter is fairly accurate; however the arrows are hitting a different area each time, suggesting the reproducibility is low. This would be the same as getting blood pressure readings ranging from 115/70 mmHg to 128/85 mmHg; although the blood pressure readings are not too far away from the accurate reading of 120/80 mmHg, the readings are not reproducible and there will be uncertainty as to whether the patient is getting a lower or higher blood pressure over time due to the poor reproducibility.

In (c), the arrows are hitting the centre of the bullseye and in a similar area each time, suggesting the shooter’s aim is both accurate and reproducible. This would be the same as getting blood pressure readings all within the range of 118/79 mmHg to 122/80 mmHg; the blood pressures have a high level of reproducibility and are very close to the true value.

In (d), the arrows are all over the place and nowhere near the centre of the bullseye, suggesting that the shooter’s aim is both inaccurate and not reproducible. In the blood pressure scenario this would be like getting a series of blood pressures ranging from 90/60 mmHg to 170/95 mmHg; at this point you might question whether the machine had been dropped on the floor!
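
To put some numbers on the analogy, here is a minimal sketch using the four systolic readings from bullseye (a) and the true value of 120 mmHg quoted above. The mean error stands in for accuracy and the standard deviation for reproducibility; the function name `summarise` is hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev

def summarise(readings, true_value):
    """Return (mean error, spread) for repeated readings of one quantity."""
    error = mean(readings) - true_value  # accuracy: distance from the true value
    spread = stdev(readings)             # reproducibility: scatter of the readings
    return error, spread

# Systolic readings (mmHg) from bullseye (a); the true systolic value is 120 mmHg
systolic_a = [153, 152, 155, 154]
error, spread = summarise(systolic_a, true_value=120)
print(f"mean error = {error:.1f} mmHg, SD = {spread:.2f} mmHg")
```

A large mean error with a small standard deviation is the bullseye (a) pattern: reproducible but not accurate.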


Then, to add to the confusion, there are the three “R”s: reproducibility, repeatability and reliability. These are often used interchangeably, but they mean different things and will guide how you carry out any studies and interpret your results.

“[Reproducibility] is explicitly defined as the variation of the same measurement made on a subject, under changing conditions, but in real-life practice also includes changes in measurement, method, observer, time-frame, instrumentation, location and/or environment. Repeatability can be separately considered as the variation in repeat measurements made on the same subject under identical conditions, whereas Reliability describes the magnitude of error between measurements.”

Bunting KV et al, 2019; Journal of the American Society of Echocardiography

For more information on this, consider reading this paper by Bartlett & Frost (2008).


Setting up your own study

When setting up your own study, whether it is for a clinical audit or a research study, it is important to think carefully about some key steps before starting. In the following section I list how I would go about planning a reproducibility study/audit, and I provide an example study (the line at the end of each step).

As a cardiac clinical scientist I have used a cardiology example. For your reference, an echocardiogram is an ultrasound scan of the heart to assess the heart’s structure and function. Left ventricular ejection fraction (LVEF) is a routine measurement of pumping function, which measures the change in volume of the ventricle during systole, measured by the operator drawing around the endocardial border in diastole and systole (Figure 2).

Figure 2. Simpson’s biplane left ventricular ejection fraction measurement
  1. Determine your question/ hypothesis?

It is important you have a clear and measurable question/ hypothesis so that you can plan your study and on receiving your results come to a conclusion.

What is the inter-operator reproducibility between cardiac clinical scientists measuring heart function in patients with heart failure? OR,

There is a high level of reproducibility between cardiac clinical scientists measuring heart function in patients with heart failure.

  1. Have a think about what you will do with the results?

It’s important that you have good reason for doing the study and can clearly state how it will benefit patients/staff. You should also have an idea of what actions you will take with the results.

If there is low reproducibility between operators, I will plan training sessions to improve reproducibility.

  1. Decide on what measurement you want to assess

It is essential that this is specific and instructions are given clearly so that there is no confusion about the technique used.

Heart function measured by Simpson’s Biplane left ventricular ejection fraction (%)

  1. Decide on what population?

A specific patient population isn’t essential, but it will give you similar demographics so that any variability seen is less likely to be due to the patient population and more likely to be due to the measurement process.

In patients attending heart failure clinic for an echocardiogram

  1. Are there any patients you want to exclude?

It is best to minimise exclusions to make the study as generalizable to clinical practice as possible. However, if you believe there is a certain patient demographic that will skew the results, then they should be considered for exclusion. For example, in this case patients with significant ventricular ectopy or atrial arrhythmia have been excluded, because their measurements of heart function will vary between cardiac cycles regardless of the operator, which could lead you to under-estimate inter-operator reproducibility.

Patients with significant ventricular ectopy or atrial arrhythmia

  1. In how many patients do you want to test?

To determine how many patients you need, a sample size needs to be calculated. An adequate sample size is important so that you can make true inferences about the population generally. If you don’t have access to statistical software there are several online calculators which will do this for you, and there are usually some friendly hospital statisticians who will be happy to help you. The things you need to consider are: 1) what level of significance you would accept, which is the threshold for the p-value; a threshold of p<0.05 means that, if there were truly no difference, a result at least as extreme as yours would be expected less than 1 time in 20, so in other words it is the false positive rate you are willing to accept; 2) power, which is the probability of detecting a difference when one really exists, in other words 1 minus the false negative rate you are willing to accept (a power of 80% accepts a 20% chance of failing to detect an actual difference); 3) expected effect size, which is what we expect the difference to be between the things we are comparing, determined by your experience or what’s been observed in the literature; and finally 4) standard deviation, which is the anticipated variability in the data.

16 patients, derived from: 1) significance = 0.05; 2) power = 80%; 3) expected effect size = 5% (from experience I think there is usually a 5% difference between operators’ EF); and 4) standard deviation in the measurement = ±5%, which is the established EF standard deviation.

You can read more about sample size and design of reliability studies in this paper by Walter, Eliasziw & Donner (2010).
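
If you would like to see what an online calculator is doing behind the scenes, here is a rough sketch. The example above does not say which formula or calculator produced 16, so this is an assumption: it uses one common textbook formula (the two-sided normal approximation for comparing two means), which happens to give the same figure with these inputs. Other designs, such as a paired comparison, will give different numbers, so do check with your statistician. The function name `sample_size` is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size(alpha, power, effect_size, sd):
    """Normal-approximation sample size (per group) for comparing two means.

    Assumed formula: n = 2 * ((z_alpha + z_power) * sd / effect_size) ** 2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)          # power = 1 - false negative rate
    n = 2 * ((z_alpha + z_power) * sd / effect_size) ** 2
    return ceil(n)  # always round up to a whole patient

# Inputs from the example study: significance 0.05, power 80%,
# expected difference of 5% in EF, standard deviation of 5%
n = sample_size(alpha=0.05, power=0.80, effect_size=5, sd=5)
print(n)  # 16
```

Notice that halving the expected effect size roughly quadruples the number of patients, which is why a realistic estimate of the expected difference matters so much.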

  1. Decide whether you want to assess reproducibility or repeatability?

This will significantly affect your experimental design. A repeatability design will mean that all conditions are kept the same between measurements (same echocardiographer, same patient, same machine & same position) and the only thing changing will be the small variation in time between the LVEF measurements. With a reproducibility design, by contrast, one aspect of the conditions is changed to see how that affects the measurement; in this case, the operator.

Reproducibility- between cardiac clinical scientists within the department

  1. Decide on what aspect of reproducibility you would like to assess?

This could be anything which involves changing the environment in which the measurement is taken, to interrogate how this affects its reproducibility, for example: machines, method of measurement, operator, and time.

Between two cardiac clinical scientists (inter-operator reproducibility) taking a single LVEF measurement each on patients with heart failure.

  1. Who are you going to test in?

Again this depends on what you are trying to answer and your study design. If you want to test the department as a whole it is best to get a selection of clinical scientists with different years of experience to test the measurement in. However if you wish to compare between two operators it makes more sense to choose two with similar experience to fairly test the measurement’s reproducibility.

Between two senior cardiac clinical scientists with similar experience in echocardiography

  1. How can you ensure non-biased selection?

Statistical bias is defined as a systematic tendency in the process of data collection which can cause misleading results. To avoid bias in the selection of patients, they should either be randomly selected or consecutively selected. This is important to ensure that you are performing the test fairly and making it generalizable to everyday clinical practice. In this example it will be easier to measure the Simpson’s biplane LVEF in patients with good quality images. However, if you “cherry pick” the patients who you think have good quality images, not only is this not representative of an everyday echocardiography clinic, but you are also likely to achieve better reproducibility results, which will mislead your interpretation of how reproducible inter-operator assessment of Simpson’s biplane EF is in your department.

Consecutive patients attending heart failure clinic with no pre-exclusions to image quality.
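
As a small illustration of the two selection methods, the sketch below picks 16 patients from a clinic list either consecutively or at random. The patient identifiers and the function name `select_patients` are made up for the example; the point is simply that the choice is made by a rule, not by the operator’s judgement of image quality.

```python
import random

def select_patients(clinic_list, n, method="consecutive", seed=None):
    """Pick n patients without operator judgement entering the choice."""
    if n > len(clinic_list):
        raise ValueError("not enough patients on the clinic list")
    if method == "consecutive":
        return clinic_list[:n]  # the first n in order of attendance
    if method == "random":
        # A fixed seed makes the random draw repeatable for audit purposes
        return random.Random(seed).sample(clinic_list, n)
    raise ValueError("method must be 'consecutive' or 'random'")

# Hypothetical anonymised study IDs, in order of attendance
clinic_list = [f"P{i:03d}" for i in range(1, 41)]

consecutive = select_patients(clinic_list, 16)
randomised = select_patients(clinic_list, 16, method="random", seed=2020)
print(consecutive)
print(randomised)
```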

  1. How can you ensure blinding?

It is important that the assessors are unable to see what each other’s measurements are; otherwise they may be influenced by the other’s results causing an unfair test which is highly biased. 

The second operator will be blinded to the first operator’s results and ideally any previous LVEF measurements performed in the patient.  

  1. How can you ensure other influential environmental factors are minimised?

If you are focussing on inter-operator reproducibility, it’s important not to introduce other sources of variation that will confound your results, for example the choice of machine or measurement software.

The same machine will be used throughout the study and the order of 1st and 2nd operator will be the same between patients. The study will be carried out within a small time frame (avoiding days between data collection).

  1. How should I record my data?

Prior to beginning the study, create your database template. In this, ensure that you have all the data you want to collect for the study, including the units. The categories should be easy to understand, to avoid any ambiguity about what data to put in each column, and each should have its own unique variable name. I would recommend using Excel, as this will mean your data will already be in a format to analyse and can be transferred to statistical software easily, if available.
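
As an illustration, a template for the example study could be as simple as the header row below, saved as a CSV file that opens directly in Excel. The column names are hypothetical suggestions, not a prescribed format; note that each name is unique and carries its unit where one applies.

```python
import csv
import io

# Hypothetical variable names for the example study; "_pct" records the unit (%)
COLUMNS = ["study_id", "scan_date", "machine_id", "lvef_op1_pct", "lvef_op2_pct"]

def make_template(columns=COLUMNS):
    """Return the header row of an empty data-collection sheet as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(columns)
    return buffer.getvalue()

template = make_template()
print(template, end="")
```

Keeping each operator’s LVEF in its own column means one row per patient, which is the layout most statistical software expects for paired measurements.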

When you have your results, how do you go about interpreting them?

Interpreting your data

When you have collected your data, it’s time to interpret the results. To start with I would recommend plotting a simple graph; for example a bar chart for categorical data and a scatter plot for continuous data. The graph will give you a good idea of what’s going on before any statistical methods are applied. The choice of statistics applied will depend on your experimental design and data collected. The extensive list of statistics you can apply to your data is beyond the scope of this blog, but it is important to think carefully about what best represents your data and be aware of the disadvantages and advantages of each method.

There are several different terms used to assess how reproducible your measurements are: agreement, association/correlation, variation and bias. These mean different things and it is important to understand their meaning when describing your results or interpreting another group’s reproducibility study. Association assesses the relationship between the data. Agreement assesses the degree of consensus between measurements. Variation (or variability) describes the magnitude of the difference between repeated measurements. Bias, to add to the confusion, is used in two ways: it can tell us to what extent there is a real difference between two sets of data that has not occurred by chance (statistical significance), and it can also describe the overall magnitude and direction of the difference in the data (systematic and proportional bias).

To show a brief example of how these terms are used to interpret the overall reproducibility of your results, below in figure 3 are four sets of possible results from a similar study set-up to the example study described above (two operators measuring LVEF in a series of different patients).

Figure 3. Reproducibility assessment between two operators (taken from Bunting KV et al. JASE  2019)

The brown diagonal dotted line on each graph represents the line of equality; if all the points lie on the line of equality, operators 1 and 2 are obtaining the same measurement every time and so have a high level of agreement and perfect association. The red dotted line represents the line of best fit for the data points; if the data points all lie on the line of best fit, the measurements taken by operators 1 and 2 are highly associated with each other, but this does not necessarily mean that they agree.

In example A the points lie on both the line of equality and the line of best fit; therefore the two operators’ measurements have both a high level of agreement and a high level of association.

In example B the points are well aligned to the line of best fit, suggesting there is high association, but the line is shifted to the left of the line of equality and therefore there is low agreement. In this case operator 2 is consistently measuring LVEF higher than operator 1, which is called systematic bias.

In example C there seems to be a change in the distribution of the points according to the measurement value; as the LVEF measurement gets higher, the points are further away from the line of best fit, suggesting there is lower association between the operators’ measurements as the LVEF gets higher. This is called proportional bias. However, the points aren’t too far away from the line of equality and so the overall agreement isn’t too low.

In contrast, example D’s points are far away from both the line of equality and the line of best fit, suggesting poor agreement and association, but no significant bias as there is no tendency for the points to go in either direction.
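
The difference between association and agreement is easy to check with numbers. The sketch below uses invented LVEF values (not the data in Figure 3) that mimic the example B pattern: operator 2 reads consistently higher than operator 1. The function names are hypothetical, and the Pearson correlation is used here as one possible measure of association.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient: a measure of association."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def mean_difference(x, y):
    """Mean of (y - x): a simple indicator of systematic disagreement."""
    return mean(b - a for a, b in zip(x, y))

# Invented LVEF values (%) for seven patients, for illustration only
operator_1 = [25, 30, 35, 40, 45, 50, 55]
operator_2 = [33, 37, 44, 47, 54, 58, 62]

r = pearson_r(operator_1, operator_2)
diff = mean_difference(operator_1, operator_2)
print(f"association r = {r:.3f}, mean difference = {diff:.1f}%")
```

With these values the correlation is very close to 1 even though the two operators never agree, which is exactly why a high correlation on its own should not be reported as good reproducibility.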

There are of course other options for graphs and it will be up to you to decide what best represents your findings; for example in this case if you are looking at the difference between two groups of measurements (operator 1 and 2), a Bland and Altman plot is really useful to visualise the degree of agreement and detect systematic bias (figure 4).

Figure 4. Bland and Altman plots for agreement between two tests or operators (taken from Bunting KV et al. JASE  2019)

The “bias” is the mean difference across all observations; the closer this is to 0, the less systematic bias there is. The closer the points are to the 0 line, the better the agreement. The upper and lower limits of agreement represent the range within which you would expect 95% of the differences to lie if you were to repeat the test. Therefore, the narrower the limits of agreement, the lower the variability between repeated tests. The limits of agreement can be calculated simply using the following equation: bias ± 1.96 x standard deviation of the differences (use “+ 1.96” for the upper limit and “- 1.96” for the lower limit). Figure 4 examples A-D show the same data as Figure 3 A-D, represented as Bland and Altman plots.

In example A the points are all very close to the 0 line, so there is a high level of agreement and hence the bias is low. At the same time the limits of agreement are narrow, suggesting that the results themselves are reproducible; 95% of future differences between operators 1 and 2 would be expected to lie within the limits.

Example B has a positive bias value because operator 2 is consistently measuring higher than operator 1, and so the points are far away from the 0 line, suggesting poor agreement. However, the limits of agreement are narrow, suggesting that there will be a similar difference between the operators on future occasions, so the results themselves are consistent (they have low variability).

In example C the overall bias is low, suggesting there is no systematic bias, but the points are distributed widely above and below the 0 line, suggesting that on occasions there is poor agreement. This is why the limits of agreement are wide: there is poor consistency within the reproducibility results.

Finally, in example D the points are far away from the 0 line, suggesting poor agreement, and they are very widely distributed, which is why the limits of agreement are very wide.
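
The bias and limits of agreement are straightforward to calculate yourself. Here is a minimal sketch of the equation above, again using invented LVEF values that mimic the example B pattern (not the data in Figure 4); the function name `bland_altman` is hypothetical.

```python
from statistics import mean, stdev

def bland_altman(first, second):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    diffs = [b - a for a, b in zip(first, second)]  # operator 2 minus operator 1
    bias = mean(diffs)   # mean difference across all observations
    sd = stdev(diffs)    # standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Invented LVEF values (%) for seven patients, for illustration only
operator_1 = [25, 30, 35, 40, 45, 50, 55]
operator_2 = [33, 37, 44, 47, 54, 58, 62]

bias, lower, upper = bland_altman(operator_1, operator_2)
print(f"bias = {bias:.1f}%, limits of agreement = {lower:.1f}% to {upper:.1f}%")
```

A bias well away from 0 with narrow limits, as here, is the example B pattern: a consistent offset between operators, which points towards a difference in technique that training could address.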

In conclusion from this data: example A shows a high level of reproducibility between operators and this is consistent between patients; this is a good result and indicates no training is required but an audit should be repeated in a year or so to ensure the standard is still high. In example B operator 2 is consistently measuring higher than operator 1, which suggests that there may be a variation in how the operators have been taught to measure LVEF, so further training would be required to resolve this problem in reproducibility. In example C the data suggests that inter-operator reproducibility is lower when measuring LVEF in patients with either a very low LVEF or high LVEF, so further training is required in this demographic of patients. Finally for example D the inter-operator reproducibility is very poor across all patients, suggesting a lack of understanding of the LVEF measurement and so a significant amount of re-training would be required. 

I applied this to echocardiography in the publication listed below, but both the recommendations and the online tool I developed can be applied to any clinical measurement:

Bunting KV, Steeds RP, Slater LT, Rogers JK, Gkoutos GV, Kotecha D. A Practical Guide to Assess the Reproducibility of Echocardiographic Measurements. J Am Soc Echocardiogr. 2019 Dec;32(12):1505-1515. doi: 10.1016/j.echo.2019.08.015.

Tool: http://www.birmingham.ac.uk/echo

Further information on how reproducibility studies should be reported is available:

Kottner J, Audige L, Brorson S, Donner A, Gajewski BJ, Hrobjartsson A, et al. Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were proposed. J Clin Epidemiol. 2011;64:96-106

Hopefully this has given you a small insight into the importance of assessing reproducibility and some things to think about when planning and interpreting your next study. Happy testing!

Dr Karina V Bunting (@buntingkarina)

Cardiac Clinical Scientist (University Hospitals Birmingham, NHS Trust)

Post-Doc Research Fellow (University of Birmingham)
