## Abstract

**No reference equation applies universally: default equations should not be used without considering potential biases** http://ow.ly/29XB304sRrm

*To the Editor:*

Knowing whether a patient's lung function result is similar to what can be expected of a healthy individual is critical for the correct interpretation of pulmonary function test results. Since lung function changes with growth and ageing and differs according to sex and ethnicity, there are now >300 published reference equations available for spirometry alone. Consequently, individual pulmonary function laboratories are left with a challenging decision: which reference equation do I choose? While it is recommended that reference equations are population specific, there are obvious logistical challenges in collecting normative data from a sample that is large and representative enough for the resulting range of normal values to be unbiased.

Alternatively, laboratories can choose to use a published reference equation derived from a population similar to the one being tested. Several studies have shown that not all lung function reference equations produce the same results [1–3]. For instance, some equations are based on smaller samples of the population, while others are based on larger and representative population surveys. In some studies, spirometry curves are reviewed rigorously for quality control, while in others they are not. Some data are collected on antiquated equipment no longer available today, while others contain novel outcome measures. Furthermore, within each commercial spirometer, users have an overwhelming number of options: to use a single reference equation, or multiple reference equations combined into a single prediction set. The “stitching” of different equations results in artificial jumps in predicted values that could lead to erroneous interpretation of results and unnecessary intervention [2, 4, 5]. Most importantly, the physicians interpreting results are often unaware of which equations were used, and how the choice of equation might affect their interpretation [6–8]. Another common practice is to extrapolate reference equations beyond the age range for which they were derived, which has also been shown to bias interpretation of results [9].

The Global Lung Function Initiative (GLI) spirometry reference equations are one of the >300 available spirometry reference equation sets and come with their own advantages and disadvantages [10]. The GLI equations are based on data from >74 000 individuals from 26 countries and are endorsed by six of the major respiratory societies. Despite the name, the world's largest populations, from the Indian subcontinent, Africa and Polynesian countries, are not represented. Nonetheless, the GLI dataset is the largest collection of normative spirometry data and reflects many populations, and the equations are a major step towards standardised interpretation of pulmonary function tests around the world. The distinct advantage of these equations is the large sample of healthy individuals spanning a wide age range (3–95 years) and multiple ethnic groups. External validation of reference equations is recommended by the American Thoracic Society/European Respiratory Society (ERS), and many studies have aimed to validate the GLI equations. Several studies have found that the GLI equations fit the population well [11–15], including African children and adolescents of Bantu origin in Angola, Democratic Republic of Congo and Madagascar [16], while others have shown disagreement [17–19]. In some instances, the observed differences may be statistically significant without being clinically meaningful, while in others the observed differences may reveal important population trends that may bias interpretation of results if the GLI equations are used inappropriately.

Any time a sample of the entire population is selected, there is a random chance that the selected sample will differ from the entire population; smaller samples are more likely to be different from the population. This can be illustrated as follows. A sample will have perfect fit if the mean calculated z-score using the GLI equations is 0 with a standard deviation of 1, and 5% of observations are below the fifth percentile defined by a z-score of −1.64.
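The three criteria for fit described above can be computed directly from a set of z-scores. The following is a minimal sketch in Python, assuming the z-scores have already been calculated with the chosen reference equations; the function name `fit_summary` and the synthetic data are illustrative assumptions, not part of any GLI software.

```python
import numpy as np

LLN_Z = -1.64  # fifth-percentile cut-off quoted in the text


def fit_summary(z_scores):
    """Summarise fit: mean z, SD of z, and % of observations below the lower limit of normal."""
    z = np.asarray(z_scores, dtype=float)
    return {
        "mean": float(z.mean()),
        "sd": float(z.std(ddof=1)),
        "pct_below_lln": float(100.0 * np.mean(z < LLN_Z)),
    }


# Synthetic z-scores from a standard normal distribution (not real spirometry data),
# standing in for a very large sample that the reference equations fit well.
rng = np.random.default_rng(0)
ideal_sample = rng.standard_normal(100_000)

summary = fit_summary(ideal_sample)
print(summary)
```

For a large, well-fitting sample the summary should approach a mean of 0, a standard deviation of 1 and about 5% of observations below the lower limit of normal.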

The large Health Survey for England (3661 males, mean±sd z-score for forced expiratory volume in 1 s (FEV_{1}) of 0.04±1.03, 4.89% of observations below the fifth percentile), which was carried out on the same population as that used for the GLI equations, with the same staff, methods and quality control, provided an opportunity to assess the role of selection bias [20]. Random selection of 150 subsamples of 50–1500 records from the Health Survey for England datasets shows that the variability of the outcome decreased as the sample size increased (figure 1), the same pattern as that for the 58 datasets that made up the GLI study. However, the amplitude of differences is smaller for the Health Survey for England than for the GLI dataset, possibly reflecting greater biological and technical differences in administering pulmonary function tests, as well as in quality control, in the 58 GLI datasets. It follows that it is unrealistic to expect that a sample will fit the overall population perfectly; even with large samples (150 and 1000 subjects), the average z-scores of the sample have been shown to differ up to 0.4 units from predicted within the same population, simply due to sampling variability [20].
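The resampling exercise can be mimicked with synthetic data. The sketch below is an illustration only: it assumes a population of 3661 z-scores drawn from a standard normal distribution rather than the actual Health Survey for England records, and the helper `subsample_mean_spread` is a hypothetical name. It captures pure sampling variability, not the biological or technical differences discussed above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumption: synthetic stand-in for a well-fitting survey population, z ~ N(0, 1).
population = rng.standard_normal(3661)


def subsample_mean_spread(population, size, n_draws=150):
    """Draw n_draws random subsamples of `size` records; return min and max of their mean z-scores."""
    means = np.array([
        rng.choice(population, size=size, replace=False).mean()
        for _ in range(n_draws)
    ])
    return means.min(), means.max()


spread = {}
for size in (50, 150, 500, 1500):
    lowest, highest = subsample_mean_spread(population, size)
    spread[size] = highest - lowest
    print(f"n={size:>4}: mean z-score ranges from {lowest:+.2f} to {highest:+.2f}")
```

The range of subsample means narrows as the subsample size grows, which is the pattern described for the survey data.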

The decision as to whether the GLI, or any other reference equation, is appropriate for a particular population, given the differences observed in a selected sample of healthy individuals, ought to rest on physiological and clinical significance rather than statistical significance. With a large enough sample, even the smallest differences can be statistically significant. What matters is whether an offset or trend reflects sampling error or true biological differences between the sample population and the reference population. The greatest emphasis should be placed on whether clinical decision making will be affected by the choice of reference equation. If differences persist and are biased in a systematic way, then elucidating the physiological reasons for the discrepancies between the GLI equations and these populations is imperative. For example, data collected as part of a large population survey in Japan showed that changes in socioeconomic and general health conditions were associated with changes in body frame, such as a change in relative leg length, which has been associated with a secular trend in lung function in Japan [19]. In view of genetic determinants of pulmonary function [21–23], it is similarly a challenge to define normal lung function in mixed-ethnicity populations; possibly this played a role in the finding of an unsatisfactory fit of the GLI equations in a Tunisian population where Berbers have sub-Saharan ancestry [22]. Improving our understanding of how social and cultural changes affect both somatic and lung growth will make an important contribution to the literature.
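The point that sample size alone can drive statistical significance can be shown with a short calculation. This is a minimal sketch, assuming a simple one-sample z-test with a known standard deviation of 1 and a hypothetical mean offset of 0.05 z-score units; neither the test nor the offset is taken from the studies cited.

```python
import math


def one_sample_z_test(mean_z, n, sd=1.0):
    """Two-sided p-value for H0: true mean z-score is 0, given a sample mean and known SD."""
    statistic = mean_z / (sd / math.sqrt(n))
    p_value = math.erfc(abs(statistic) / math.sqrt(2))
    return statistic, p_value


offset = 0.05  # hypothetical offset from predicted, in z-score units
p_values = {}
for n in (100, 1_000, 10_000):
    statistic, p_values[n] = one_sample_z_test(offset, n)
    print(f"n={n:>6}: test statistic={statistic:.2f}, p={p_values[n]:.4g}")
```

The same offset moves from non-significant to highly significant purely because the sample grows, while its likely effect on interpretation of an individual result is unchanged.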

In studying whether there is a good fit of predicted values to a particular dataset, interpretation of data as a percentage of predicted should be discouraged. Using the percentage of predicted introduces an important age bias because the variance around the predicted value is not a fixed percentage of predicted [10, 24–26]. Instead, data should be expressed as z-scores, which indicate by how many standard deviations a measured value differs from the predicted, taking into account age, height, sex and ethnicity, and which are therefore unbiased. There are proportional differences in FEV_{1} and forced vital capacity (FVC) between ethnic groups, so that the FEV_{1}/FVC ratio is for practical purposes the same in healthy populations [10]. A considerable difference from the predicted ratio (defined by z-scores) may therefore reflect limited sample size or a difference in health status.
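The age bias of percentage of predicted can be illustrated with a simplified calculation. The sketch below assumes a normal distribution around the predicted value, which is a simplification of how reference equations model variability, and the predicted values and coefficients of variation are hypothetical numbers chosen only to show the effect.

```python
LLN_Z = -1.64  # fifth-percentile cut-off quoted in the text


def z_score(measured, predicted, sd):
    """Number of standard deviations by which a measured value differs from predicted."""
    return (measured - predicted) / sd


def lln_percent_predicted(cv):
    """Lower limit of normal expressed as % of predicted for a given coefficient of variation."""
    return 100.0 * (1.0 + LLN_Z * cv)


# Hypothetical predicted FEV1 (litres) and between-subject coefficient of variation.
examples = {
    "young adult": {"predicted": 4.0, "cv": 0.11},
    "elderly": {"predicted": 2.0, "cv": 0.18},
}

results = {}
for label, params in examples.items():
    sd = params["cv"] * params["predicted"]
    measured = 0.75 * params["predicted"]  # both subjects measure 75% of predicted
    results[label] = {
        "z": z_score(measured, params["predicted"], sd),
        "lln_pct": lln_percent_predicted(params["cv"]),
    }
    print(f"{label}: z={results[label]['z']:.2f}, LLN={results[label]['lln_pct']:.1f}% of predicted")
```

Under these assumed values, the same 75% of predicted falls below the lower limit of normal in one subject and within the normal range in the other, because the scatter around the predicted value is not a fixed percentage of it.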

In the event that the GLI equations do not fit a particular population, individual laboratories/countries are left with two options: revert to an older reference equation, or collect new normative data. In under-represented populations, the latter is highly advisable, both to improve our understanding of how lungs differ between populations and to provide accurate normative data for these populations.

The GLI equations are not a permanent fixture; the network of collaborators was established to build an infrastructure to maintain and expand upon the data published in 2012. Individuals with data from under-represented populations, or with contemporary data collected from groups included in the GLI-2012 population, are encouraged to share their data with the GLI community (www.ers-education.org/guidelines/global-lung-function-initiative.aspx). The ERS recently launched the ERS Research Agency [27], which will coordinate and fund respiratory research across Europe. The central aim of the Research Agency is to facilitate respiratory research through the coordination and support of the respiratory research community, and to assist in its efforts to obtain funding. The ERS Research Agency has recognised the GLI as a rich data source and now houses and maintains it. The dataset will soon be available not only to update the GLI-2012 equations but also for researchers to access for independent research questions.

There is no single available reference equation that can be applied universally to all pulmonary function laboratories globally. Each individual laboratory should carefully compare the options available for the population it is testing and choose the equations that are most appropriate. Importantly, laboratories should not use default equations arbitrarily without considering the implications and potential biases. Overall, there is an urgent need for the respiratory community to collect more normative data in under-represented populations to further improve how we interpret lung function. Regardless of which equations are used, clinical decisions should never be based solely on lung function test results. Results near the lower limit of normal should be cautiously interpreted and backed up with repeated testing and complementary laboratory, clinical and physical findings.

## Footnotes

Editorial comment in:

*Eur Respir J* 2016; 48: 1535–1537.

Conflict of interest: None declared.

- Received September 3, 2016.
- Accepted September 16, 2016.

- Copyright ©ERS 2016