## Abstract

**Background** How best to express the level of transfer factor of the lung for carbon monoxide (*T*_{LCO}) has not been properly explored.

**Methods** We used the most recent clinical data from 13 829 patients (54% male; 10% non-European ancestry; median age 60.5 years, range 20–97 years; median survival 3.5 years, range 0–20 years) to determine how best to express *T*_{LCO} function in terms of its relationship to survival.

**Results** The proportion of subjects of non-European ancestry with Global Lung Function Initiative (GLI) *T*_{LCO} z-scores above predicted was reduced, but was significantly increased between −1.5 and −3.5, suggesting the need for ethnicity-appropriate equations. Applying GLI forced vital capacity (FVC) ethnicity methodology to GLI *T*_{LCO} z-scores removed this ethnic bias and was used for all subsequent analysis. *T*_{LCO} z-scores using the GLI equations were compared with Miller's USA equations with median *T*_{LCO} z-scores being −1.43 and −1.50 for GLI and Miller equations, respectively (interquartile range −2.8 to −0.3 and −2.4 to −0.7, respectively). GLI *T*_{LCO} z-scores gave the best Cox regression model for predicting survival. A previously proposed six-tier grading system for level of lung function did not show much separation in survival risk in the less-severe grades. A new four-tier grading based on z-scores of −1.645, −3 and −5 showed better separation of risk with hazard ratio for all-cause mortality of 2.0, 3.4 and 6.6 with increasing severity.

**Conclusion** Applying GLI FVC ethnicity methodology to GLI *T*_{LCO} predictions to remove ethnic bias together with a new four-tier z-score grading best relates *T*_{LCO} function to survival.

## Abstract

**A four-tier grading of lung function defined by z-scores of −1.645, −3 and −5 is simpler and better relates to subsequent survival than previous grading systems. Using ethnicity-specific GLI T_{LCO} prediction equations is appropriate.** https://bit.ly/2PRxc0u

## Introduction

Single-breath transfer factor of the lungs for carbon monoxide (*T*_{LCO}) has been shown to be the best predictor of survival in the general population [1]. With the new Global Lung Function Initiative (GLI) prediction equations for *T*_{LCO} [2] now available, it is important to verify their validity by how well they relate to survival in large datasets [3]. How best to present lung function results has been under scrutiny in recent years. Using percentage of predicted (PP) was for many years an accepted means, but this method retains age, sex and size effects that may lead to false conclusions, such as the misconception that females were more susceptible to COPD [4]. Standardised residuals (that is, z-scores) are the preferred method for defining how far a subject's result is from their population norm [5, 6], but older subjects can never have z-scores as low as young people [7], which may lead to problems in grading the severity of an abnormality. T-scores have been proposed as an improved method for using spirometry data [8], which might overcome this latter issue because T-scores relate the subject's observed value to the subject's sex- and height-specific maximum predicted value seen in early adulthood and not to their age-related predicted value.

For grading the severity of airflow obstruction the previous American Thoracic Society (ATS)/European Respiratory Society (ERS) recommendations used cut levels of PP values [6] to give five levels of abnormality. This approach was subsequently adapted to be used with z-scores [9]. Because this approach was just for airflow obstruction, it is uncertain whether it can be applied to other lung function indices and whether five distinct levels are justified. The ATS/ERS previously recommended three grades of severity for *T*_{LCO} [6]. Grading the degree of lung function reduction is arbitrary, in that the degree of reduction is a continuum and the previously proposed five levels of grading for airflow obstruction may be implying differences that are not substantive.

We looked at a large dataset of patients (>13 800) with their survival and compared two different prediction equations for *T*_{LCO}; we used different methods for presenting the results to look for the best method for grading *T*_{LCO} and spirometry result abnormalities.

## Methods

We extracted anonymised patient data from a clinical lung function database set up in January 1996 and closed in October 2016 which stored all routine lung function tests performed at the Queen Elizabeth Hospital (University Hospitals Birmingham National Health Service Foundation Trust, Birmingham, UK). The data in this study complied with patient confidentiality and comprise the results of the most recent attendance tests for spirometry and diffusing capacity measured on the same day from 13 829 patients (54% male) referred for any reason for lung function tests. All patients had their survival registered up to 19 September 2016 in UK National Health Service data records. All tests were performed according to ATS/ERS criteria on equipment validated to conform to ATS/ERS specifications [10].

GLI prediction equations [11] were used for spirometry with patients of non-European ancestry classified as ethnicity “other”. For transfer factor the GLI equations [2] were compared with those of Miller *et al.* [12]. Initial analysis showed that GLI *T*_{LCO} predictions gave many subjects extremely low z-scores (values as low as −23.0), which was not seen for the Miller equations, and this was particularly seen in females. After contacting GLI about this, an error was discovered in their data, with the wrong sex attributed to some subjects in their dataset [13]. Revised GLI-2020 *T*_{LCO} prediction equations have been issued, which were used for all our analyses [14]. T-scores were derived as the number of standard deviations the index result was from the subject's sex- and height-specific highest GLI predicted value for that index, which was at an age in their early 20s and which differed between the different indices. We graded for lung function abnormality with the ATS/ERS six-tier system for grading airflow obstruction [6] using PP and Quanjer *et al.*’s [9] system using z-scores, and also with the ATS/ERS four-tier system suggested for *T*_{LCO} [6]. We looked to see if one grading system might work for *T*_{LCO}, forced vital capacity (FVC) and forced expiratory volume in 1 s (FEV_{1}).
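The z-score and T-score definitions used here can be illustrated with a minimal Python sketch. It assumes the LMS form in which GLI reference values are published (median M, coefficient of variation S, skewness L). The function names and the example numbers are hypothetical and are not GLI coefficients. The T-score reuses the z-score formula with L, M and S evaluated at the age of the highest predicted value, which is one reading of the definition above rather than the authors' exact computation.

```python
import math


def lms_z_score(observed, m, s, lam):
    """z-score under the LMS model: m = median, s = coefficient of variation, lam = skewness."""
    if observed <= 0 or m <= 0 or s <= 0:
        raise ValueError("observed, m and s must be positive")
    if lam == 0:
        # Limiting case of the LMS formula when the skewness parameter is zero.
        return math.log(observed / m) / s
    return ((observed / m) ** lam - 1.0) / (lam * s)


def percent_predicted(observed, m):
    """Observed value as a percentage of the predicted (median) value."""
    return 100.0 * observed / m


def t_score(observed, m_peak, s_peak, lam_peak):
    """Assumed T-score: same formula, but with parameters taken at the age of
    the subject's sex- and height-specific highest predicted value."""
    return lms_z_score(observed, m_peak, s_peak, lam_peak)


# Hypothetical older subject (illustrative numbers only, not GLI output).
observed = 6.0
z = lms_z_score(observed, m=8.0, s=0.125, lam=1.0)             # versus age-specific prediction
t = t_score(observed, m_peak=10.0, s_peak=0.1, lam_peak=1.0)   # versus young-adult peak
pp = percent_predicted(observed, m=8.0)
print(f"z-score {z:.2f}, T-score {t:.2f}, PP {pp:.1f}%")
```

In this made-up example the same observed value lies further below the young-adult peak than below the age-specific prediction, so the T-score is lower than the z-score.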

All Cox regression models were created using Stata/SE v16.1 (StataCorp, College Station, TX, USA) using lung function indices as predictors for survival with stratification for age quintiles, sex and smoking status (see supplementary material for detail). Time to event was taken to the date of death, or censored at 19 September 2016 (last entry into database) if no event occurred. Proportional hazard assumptions were tested by looking for parallel separation of log–log survival curves and with Schoenfeld residuals tests [15]. As a measure of the best fit we took the lowest Akaike information criterion (AIC), which is a measure of the relative amount of information lost by a given model [16]. We also calculated Harrell's C concordance index, where a value of 0.5 means the predicted outcomes are no better than guessing and 1.0 means the predicted outcomes are all completely accurate.
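The two fit measures described above can be sketched in plain Python. This is an illustration only and not the Stata routine used for the analysis: the function names are hypothetical, pairs with tied survival times are skipped for simplicity, and tied risk scores count as half-concordant.

```python
def aic(log_likelihood, n_parameters):
    """Akaike information criterion, 2k - 2 ln(L); lower values indicate a better fit."""
    return 2.0 * n_parameters - 2.0 * log_likelihood


def harrell_c(times, events, risk_scores):
    """Concordance index for right-censored survival data.

    times       : follow-up time for each subject
    events      : 1 (or True) if death was observed, 0 if censored
    risk_scores : model output where a higher score means a higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable only if the subject with the shorter
            # follow-up time actually died (simplification: ties in time skipped).
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: the subject who died earliest has the highest risk score.
times = [1.0, 2.0, 3.0, 4.0]
events = [1, 1, 1, 0]
perfect = harrell_c(times, events, [4.0, 3.0, 2.0, 1.0])
uninformative = harrell_c(times, events, [1.0, 1.0, 1.0, 1.0])
print(perfect, uninformative)
```

A model that ranks every comparable pair correctly reaches 1.0, and one that gives every subject the same score sits at 0.5, matching the interpretation given above.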

## Results

Out of the 13 829 patients, 54% were male, 37% had never smoked, 47% were ex-smokers and 16% were current smokers. There were 1435 (10.4%) patients of non-European ancestry. The lung function request forms gave putative diagnoses, relevant symptoms and the subject's therapy. Of these forms, 12.0% indicated COPD, 9.6% asthma, 9.0% interstitial lung disease, 4.1% bronchiectasis, 2.6% emphysema and 14.6% breathlessness. There were 7508 (54%) patients who were not on any treatment.

Patients of non-European ancestry appeared to have lower *T*_{LCO} z-scores than expected, suggesting that an adjustment for ethnicity might be appropriate. We therefore took the GLI ethnicity-specific FVC coefficients [11] and applied them to *T*_{LCO}, since any adjustment would probably need to correct for size differences (see supplementary material for detail). Figure 1 shows the percentage of subjects who were of non-European ancestry by bins of *T*_{LCO} z-score values with (pale grey) and without (dark grey) using the GLI ethnicity-specific FVC coefficients. It shows that there were significantly fewer subjects of non-European ancestry with GLI *T*_{LCO} z-scores better than predicted (z-score >0.0) and significantly more than expected in the range −1.5 to −3.5 (Chi-squared 79.5, p<0.0001). The GLI z-scores using ethnicity-specific FVC coefficients did not show this bias (Chi-squared 16.8, p=0.16). All of the *T*_{LCO} analyses presented used ethnicity-specific FVC coefficients. Alveolar volume (*V*_{A}) showed even more skewness in relation to ethnicity than did *T*_{LCO}, and the GLI FVC correction methodology did not satisfactorily correct these *V*_{A} differences (supplementary figures S1 and S2).
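The comparison of ancestry proportions across z-score bins is a contingency-table test. The sketch below shows how a Pearson Chi-squared statistic of that kind can be computed. The counts are invented for illustration and are not the study data, and the function name is hypothetical. The p-value lookup is left out to keep the example free of external libraries.

```python
def chi_squared(table):
    """Pearson Chi-squared statistic and degrees of freedom for a 2D table of counts.

    table: list of rows, e.g. one row per ancestry group and one column per z-score bin.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if ancestry were independent of z-score bin.
            expected = row_totals[i] * col_totals[j] / total
            if expected == 0:
                raise ValueError("empty row or column in table")
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical counts: rows = (non-European, European), columns = three z-score bins.
counts = [
    [30, 60, 90],
    [700, 600, 500],
]
stat, dof = chi_squared(counts)
print(f"Chi-squared {stat:.1f} on {dof} degrees of freedom")
```

A large statistic relative to the degrees of freedom indicates that the ancestry mix differs between bins, which is the pattern reported before the ethnicity-specific coefficients were applied.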

The median, interquartile range and range of values found for age, survival and lung function indices are shown in table 1. There were 4371 patients (32% of the total) who had normal FEV_{1}, FVC, FEV_{1}/FVC and *T*_{LCO} (all z-scores ≥−1.645) and of these 70% were not on any treatment. The distribution for GLI *T*_{LCO} z-scores was left-skewed as expected for patient data, whereas that for the Miller equations was not (supplementary figure S3). The distribution of *T*_{LCO} against age did not lend itself to the use of a *T*_{LCO} quotient index (that is, relating *T*_{LCO} to a sex-specific lower 1st centile value) in the way that a FEV_{1} quotient has been used (supplementary figures S4 and S5 and supplementary table S1) [7].

Cox models were prepared using severity grading by PP as previously proposed by the ATS/ERS for *T*_{LCO} [6] along with cut-points for z-scores (−1.645, −3 and −5) that were selected to give roughly similar numbers in each grade. Table 2 shows the results for these Cox regression models using the GLI and Miller equations. The models for *T*_{LCO} T-scores and for Miller z-scores did not comply with assumptions for proportional hazard and are not shown. The degree of fit for the GLI z-score model was similar to that for PP (AIC 51 392 and 51 402, respectively, and both had Harrell's C=0.66). The PP Miller model was slightly poorer, with AIC 51 472. Both PP models showed a slightly older and more male distribution in the most severe grade. A sensitivity analysis using the 12 916 subjects aged <80 years (3903 deaths) gave hazard ratios (HR) (95% confidence limits) for *T*_{LCO} z-scores of 2.0 (1.8–2.2), 3.5 (3.2–3.8) and 6.4 (5.7–7.1) for the three grades *versus* 2.0 (1.8–2.1), 3.4 (3.2–3.7) and 6.6 (6.0–7.3) for the full dataset.
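The z-score cut-points can be turned into a small grading function, sketched below. The tier numbering (0 to 3) and the function name are hypothetical. Treating a value exactly on a cut-point as belonging to the less severe tier is an assumption, chosen to be consistent with normal being defined as z-score ≥−1.645. The hazard ratios are the full-dataset values quoted above.

```python
# Cut-points from the text; tier 0 is within the normal range.
CUT_POINTS = (-1.645, -3.0, -5.0)

# Hazard ratios for all-cause mortality reported for the full dataset,
# relative to subjects with a z-score at or above the lower limit of normal.
REPORTED_HR = {1: 2.0, 2: 3.4, 3: 6.6}


def grade_z_score(z):
    """Return the tier (0 = normal, 3 = most severe) for a lung function z-score.

    Assumption: a z-score exactly equal to a cut-point falls in the less severe tier.
    """
    tier = 0
    for cut in CUT_POINTS:
        if z < cut:
            tier += 1
    return tier


for z in (0.3, -2.0, -3.5, -6.0):
    tier = grade_z_score(z)
    hr = REPORTED_HR.get(tier, 1.0)  # tier 0 is the reference group
    print(f"z = {z:+.2f} -> tier {tier}, reported HR {hr}")
```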

Figure 2 is a plot of *T*_{LCO} z-score against PP for 11 117 patients with values below predicted with the intersections of z-scores −3 and −5 and 40% predicted and 60% predicted, which shows that for a given z-score males had a lower PP value than females. The subjects in quadrants A and C are those where PP puts them in a more severe grade than z-score and quadrants B and D are those subjects where PP puts them in a less severe grade than z-score. Table 3 shows the percentages of subjects in these quadrants who were male, in the oldest tertile for age, in the tallest tertile for height and with ethnicity as “other”. For all these indicators there was a significant difference between the groupings shown in table 3 (p<0.005, Chi-squared) with quadrants A and C (where PP grades more severely than z-score) comprising more males and more subjects of older age, taller height and European ancestry. While there was a higher percentage of deaths in groups A and C compared to B and D, this was accounted for by older age and a greater proportion of male subjects. In Cox regression with the comparator group being all those with *T*_{LCO} greater than the lower limit of normal (LLN), and with age and sex accounted for by stratification, the HRs attributable to lung function in groups A and C were found to be lower than those for groups B and D. Supplementary figures S6 and S7 show that the residual standard deviation (RSD) for *T*_{LCO} in the GLI equations varies with the subject's sex, age, height and ethnicity. A smaller RSD in comparison to the magnitude of predicted (being female, shorter, of non-European ancestry and younger) will lead to a less severe grade by PP than by z-score, and *vice versa*.
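The discordance between the two grading approaches can be expressed as a comparison of severity bands, as in the sketch below. It uses only the intersections named above (z-scores of −3 and −5 against 60% and 40% predicted). The function names, the handling of values exactly on a cut-point and the example subjects are assumptions. The sketch reports the direction of disagreement rather than the quadrant letters, because the quadrant layout is defined by the figure.

```python
def z_band(z):
    """Severity band from z-score: 0 above -3, 1 between -3 and -5, 2 below -5."""
    if z >= -3.0:
        return 0
    if z >= -5.0:
        return 1
    return 2


def pp_band(pp):
    """Severity band from percent predicted: 0 at or above 60%, 1 for 40-60%, 2 below 40%."""
    if pp >= 60.0:
        return 0
    if pp >= 40.0:
        return 1
    return 2


def compare_grading(z, pp):
    """Say whether PP grades a subject more or less severely than the z-score does."""
    zb, pb = z_band(z), pp_band(pp)
    if pb > zb:
        return "PP more severe"
    if pb < zb:
        return "PP less severe"
    return "concordant"


# Hypothetical subjects (z-score, percent predicted).
for z, pp in [(-2.5, 55.0), (-3.5, 65.0), (-4.0, 50.0)]:
    print(z, pp, compare_grading(z, pp))
```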

Cox models with *T*_{LCO} were prepared using the six-tier gradations of z-scores and PP previously recommended for FEV_{1} in the context of airflow limitation [9]. The results for *T*_{LCO} z-scores and PP models are shown in table 4. The numbers of subjects in the most severe grading varied considerably, from 325 with Miller z-scores to 1649 for GLI z-scores, with 647 in the worst PP grade. There was little difference in HR for all-cause mortality between the mild, moderate and moderately severe grading, suggesting that fewer grades would be more appropriate. The results for the six-tier gradation for FVC in the whole group and for FEV_{1} in those with FEV_{1}/FVC <LLN are shown in table 5. Only the model for FVC z-score met the assumptions for proportional hazard. In all the models shown there was little difference in HR for all-cause mortality for the lower grades, suggesting that fewer tiers would be more appropriate. Table 6 shows the new proposed four-tier grading applied to FVC in the whole dataset and for FEV_{1} in those with airflow obstruction. The model for FVC was not as good as that for *T*_{LCO} with AIC 52 366 and Harrell's C=0.58 compared to 51 392 and 0.66, respectively, for the *T*_{LCO} model (table 2). The models for FVCPP and for FEV_{1}PP did not meet the assumptions of proportional hazard. Changing the thresholds to −3 and −4.5 gave slightly better fits for *T*_{LCO} and FVC, but the model for FEV_{1} did not meet the assumptions of proportional hazard (supplementary material). Only with cut levels of −3 and −5 could satisfactory models be derived for *T*_{LCO}, FVC and FEV_{1}.

## Discussion

We have shown that the GLI prediction equations for *T*_{LCO} gave the best survival prediction model for our population of patients, which was better than the survival predictions from spirometry and is in keeping with what was found for predicting survival in the general population [1]. We initially found that the original GLI equations gave many more extremely low z-scores than did the Miller equations, but this was due to an error in the original GLI data. Using revised GLI equations we found that the 10.4% of our population who were of non-European ancestry had *T*_{LCO} z-scores that were distributed lower than anticipated. There was no *a priori* reason to expect that our patients of non-European ancestry would have a higher prevalence of moderate disease than those of European ancestry, so this suggested that accounting for ethnicity would be appropriate [17]. We chose to adopt the methodology used by GLI for ethnicity with respect to FVC and apply this to *T*_{LCO}, which assumes that the ethnic difference in *T*_{LCO} relates to size issues rather than any intrinsic difference in the ability to transfer carbon monoxide to the blood. Using these ethnicity-specific coefficients removed the uneven distribution of ethnicity across the *T*_{LCO} z-score values in our patients. GLI were originally unable to propose ethnicity-specific predictions for *T*_{LCO} due to a lack of available data. While our proposal is not as satisfactory as deriving ethnicity-appropriate reference values from suitable data from all ethnicities, it does offer a sensible approach until such data are available. This will mean that all patients can be assessed against the most appropriate standard available for them as an individual and so avoid unnecessary misclassification of lung function results. 
We have only been able to examine this using the GLI category “other” for ethnicity, and future work will need to see if the other GLI ethnicity FVC coefficients can improve *T*_{LCO} predictions for those ethnicities.

For spirometry, we previously found that in the extremely old, the GLI predictions were high [18], suggesting that rigorous entry criteria for data to be used in prediction equations might lead to using values from subjects not properly representative of usual people for that age. For the GLI *T*_{LCO} equations the suggested upper age limit was 85 years [2], although coefficients are available up to the age of 90 years. Thus, our 42 patients who were older than 90 years were being assessed against their 90-year-old predicted values. This means that the z-scores of these subjects may be inappropriately low. Despite this, there was no age effect on the severity grading categories for GLI z-scores (table 4), whereas there were possible age effects within the grading system using the Miller equations and with PP values.

We have proposed a new simplified four-tier system for grading lung function from z-scores based on survival analysis. The six-tier system of Quanjer *et al.* [9] proposed for assessing airways obstruction with FEV_{1} did not show much separation in survival for the lower grades of severity (table 4). All gradings of severity are arbitrarily set on a continuous scale, and the six levels suggest a degree of separation between grades that does not appear to be justified. We think that using broader, but fewer, grades that reflect survival is a better and simpler approach. We have found marked differences in the severity groups identified by grading using z-scores *versus* PP for *T*_{LCO}. The analysis around the discordant subjects in quadrants A, B, C and D in figure 2 suggests that the z-score thresholds more correctly grade the severity with respect to all-cause mortality, since hazard for mortality related better to the z-score cut-offs than to the PP cut-offs. These discordant groups showed significant skew in age, sex and height.

This difference in the way that z-scores and PP grade lung function deficit is related to the fact that PP takes no account of the scatter of the range of values in fit and healthy people and how this scatter relates to the magnitude of the index and the sex, ethnicity, age and height of the subject. PP assumes that the degree of lung function deficit is always simply proportional to the predicted. There is no evidence or law indicating that, for example, a 50% reduction in lung function for a 25-year-old is in some way equivalent to the same reduction in a 75-year-old. Consider *T*_{LCO}, an index that has a numerically relatively large predicted value. For a 75-year-old male of average height (175 cm) the predicted value is 8.14 mmol·min^{−1}·kPa^{−1} and the 5th centile value is 5.94 mmol·min^{−1}·kPa^{−1}, which is at 73.0% of the predicted value, whereas for a 25-year-old male of this height, the LLN is at 79.3% predicted. Now consider forced expiratory flow at 25–75% of FVC, an index that has a numerically smaller predicted value with a relatively wide range of values found in normal healthy people. For the same 75-year-old man, the predicted is 2.17 L·s^{−1} and the 5th centile is 0.90 L·s^{−1}, which is at 41.4% predicted, whereas for a 25-year-old man of this height the LLN is at 63.6% predicted. So comparing degrees of lung function reduction by PP is going to be problematic, since 73% predicted and 79% predicted in one is equivalent in terms of grading lung function deficit to 41% and 64% predicted in the other. In most previous prediction equations, for example the European Community for Steel and Coal equations [5] and from the third National Health and Nutrition Examination Survey [19], the RSD from the regression equations was the same for all subjects. 
GLI have observed that the scatter of results in healthy subjects was not uniform for all subjects and this was accounted for in their predictions, so people of different age, height, sex and ethnicity may have different RSD values, as shown in supplementary figures S8 and S9. These differences in approach mean that z-scores will grade differently from PP and appear to relate better to outcome.
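The worked numbers in this paragraph can be reproduced with a few lines of Python. The first function simply expresses the LLN as a percentage of predicted, using the rounded values quoted in the text. The second is a loudly hedged approximation: it assumes a symmetric normal distribution, so that the LLN lies 1.645 RSD below predicted, which is not how GLI derives z-scores and is shown only to indicate the scale of the effect. Function names are hypothetical.

```python
def lln_percent_predicted(predicted, lln):
    """Lower limit of normal (5th centile) expressed as a percentage of predicted."""
    return 100.0 * lln / predicted


def approx_z_at_pp(pp, lln_pct):
    """Approximate z-score at a given percent predicted.

    Assumption: normal distribution with the LLN exactly 1.645 RSD below
    predicted. GLI uses skewed (LMS) distributions, so this is indicative only.
    """
    return -1.645 * (100.0 - pp) / (100.0 - lln_pct)


# Values quoted in the text for a 75-year-old male of height 175 cm.
tlco_lln_pct = lln_percent_predicted(predicted=8.14, lln=5.94)  # about 73% predicted
fef_lln_pct = lln_percent_predicted(predicted=2.17, lln=0.90)   # about 41% predicted

# The same 60% predicted sits at very different distances from predicted
# for the two indices under the normal approximation.
z_tlco = approx_z_at_pp(60.0, tlco_lln_pct)
z_fef = approx_z_at_pp(60.0, fef_lln_pct)
print(f"TLCO LLN {tlco_lln_pct:.1f}% predicted; 60% predicted is about z = {z_tlco:.1f}")
print(f"FEF25-75 LLN {fef_lln_pct:.1f}% predicted; 60% predicted is about z = {z_fef:.1f}")
```

Under these assumptions a fixed PP cut-off corresponds to a clearly abnormal z-score for one index and a value inside the normal range for the other, which is the problem the paragraph describes.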

Although others have found that T-scores for FEV_{1} were effective in determining abnormality in lung function [8], we have found that T-scores did not convincingly relate to survival in our patients. For FVC, T-scores were as good as z-scores at predicting survival, but in all the other indices T-scores violated Cox regression assumptions and gave poorer predictions. The potential advantage of T-scores over z-scores is that in the older population z-scores cannot go as low as in younger subjects because the median value falls with age with the RSD not changing in proportion. Since T-scores relate a subject's result to their predicted value at a much younger age, their values can go just as low in old age as in younger subjects. In our data this does not seem to be advantageous in predicting mortality, since healthy older subjects have a much lower T-score than z-score due to age alone and this adversely distorts the survival analysis. Also, the origin of T-scores in bone density measurement [20] is predicated on the fact that values of bone density as a young adult can pertain into later life to define a normal range. This is not the case for lung function. The spread of lower values of *T*_{LCO} against age did not favour the easy adoption of a *T*_{LCO} quotient in the way that has proven effective for FEV_{1} [7, 21–23]. This may just reflect the spread of our data and our referral pattern for younger people for lung transplantation, and so this should be further explored in other datasets.

The limitations to our study include that we could not separate patients into disease categories, since the referral information was not verifiable. Additionally, we did not know the cause of death, so we could not look for respiratory causes of mortality to enhance the analysis. Our results are only applicable for adults, and in paediatrics such survival analysis may not be appropriate. However, the z-score grading system should still be applicable and avoid sex and size effects. Our patients were classified as either “white” or “other” in terms of GLI with respect to ethnicity and it will be necessary to check from data in African American and Asian populations whether the application of the FVC methodology to *T*_{LCO} for these ethnic groups is appropriate.

We conclude that the revised GLI-2020 equations with the addition of ethnicity-specific coefficients give a better spread of z-score results compared to the Miller equations and this leads to better survival prediction. We have found that a four-tier grading system for grading *T*_{LCO} function (z-scores of −1.645, −3 and −5) was appropriately related to survival.

## Supplementary material

**Please note:** supplementary material is not edited by the Editorial Office, and is uploaded as it has been supplied by the author.

Supplementary material ERJ-02046-2020.Supplement

## Acknowledgements

We thank the staff of the Lung Function department at the Queen Elizabeth Hospital (Birmingham, UK) for their dedication and professionalism in working with our patients to obtain the best possible test results.

## Footnotes

This article has supplementary material available from erj.ersjournals.com

This article has an editorial commentary: https://doi.org/10.1183/13993003.01326-2021

Conflict of interest: M.R. Miller has nothing to disclose.

Conflict of interest: B.G. Cooper has nothing to disclose.

- Received May 28, 2020.
- Accepted March 29, 2021.

- Copyright ©The authors 2021. For reproduction rights and permissions contact permissions@ersnet.org