Epigenome-wide association study of lung function level and its change

Previous reports link differential DNA methylation (DNAme) to environmental exposures that are associated with lung function. Direct evidence on lung function DNAme is, however, limited. We undertook an agnostic epigenome-wide association study (EWAS) on pre-bronchodilation lung function and its change in adults. In a discovery–replication EWAS design, DNAme in blood and spirometry were measured twice, 6–15 years apart, in the same participants of three adult population-based discovery cohorts (n=2043). Associated DNAme markers (p<5×10−7) were tested in seven replication cohorts (adult: n=3327; childhood: n=420). Technical bias-adjusted residuals of a regression of the normalised absolute β-values on control probe-derived principle components were regressed on level and change of forced expiratory volume in 1 s (FEV1), forced vital capacity (FVC) and their ratio (FEV1/FVC) in the covariate-adjusted discovery EWAS. Inverse-variance-weighted meta-analyses were performed on results from discovery and replication samples in all participants and never-smokers. EWAS signals were enriched for smoking-related DNAme. We replicated 57 lung function DNAme markers in adult, but not childhood samples, all previously associated with smoking. Markers not previously associated with smoking failed replication. cg05575921 (AHRR (aryl hydrocarbon receptor repressor)) showed the statistically most significant association with cross-sectional lung function (FEV1/FVC: pdiscovery=3.96×10−21 and pcombined=7.22×10−50). A score combining 10 DNAme markers previously reported to mediate the effect of smoking on lung function was associated with lung function (FEV1/FVC: p=2.65×10−20). Our results reveal that lung function-associated methylation signals in adults are predominantly smoking related, and possibly of clinical utility in identifying poor lung function and accelerated decline. Larger studies with more repeat time-points are needed to identify lung function DNAme in never-smokers and in children.

, and samples with call rate < 98 % were excluded. Probes with call rate < 95 % were excluded from the analyses.

Discovery cohort: SAPALDIA -Swiss Study on Air Pollution Heart and Lung Disease in Adults
Study description SAPALDIA was initiated in 1991 to specifically study the air pollution impact on respiratory health. [5,6] It is a population-based cohort in Switzerland recruiting subjects aged 18 to 60 from population registries in eight communities, representing the three largest language groups (German, French, Italian) as well as different levels of air pollution and degrees of urbanization. Subjects underwent spirometry and answered a detailed questionnaire on respiratory health, allergies, smoking history, and lifestyle factors in the baseline (year 1991, SAPALDIA 1), follow-up (year 2002, SAPALDIA 2) and second follow-up (year 2010, SAPALDIA 3) examination. The study is in agreement with the Declaration of Helsinki. Participants provided prior written informed consent and ethical approval for the study was given by the Overall Regional Ethics Commission for Clinical Medicine (Swiss Academy of Medical Sciences) and by the respective cantonal ethical committee for each survey.

Study description
The Avon Longitudinal Study of Parents and Children (ALSPAC) recruited 14,541 pregnant women with expected delivery dates between April 1991 and December 1992. [7,8] Of these initial pregnancies, there were 14,062 live births and 13,988 children who were alive at 1 year of age. When the oldest children were approximately 7 years of age, an attempt was made to bolster the initial sample with eligible cases who had failed to join the study originally. As a result, when considering variables collected from the age of seven onwards (and potentially abstracted from obstetric notes) there are data available for more than the 14,541 pregnancies mentioned above. The study website contains details of all the data that are available through a fully searchable data dictionary (http://www.bris.ac.uk/alspac/researchers/data-access/data-dictionary).

DNA methylation data and QC and preprocessing
As part of the ARIES project (http://www.ariesepigenomics.org.uk), a sub-sample of 1,018 ALSPAC mother-child pairs had DNA methylation measured using the Infinium HumanMethylation450 BeadChip (Illumina, Inc.). DNA methylation data was measured in three samples per child, from cord blood and venous blood samples at age 7 and again at age 15 or 17 years. All DNA methylation wetlab and preprocessing analyses were performed at the University of Bristol as part of the ARIES project and has been described in detail previously. [9] In detail, peripheral blood was collected according to standard procedures, spun and frozen at -80˚C. DNA methylation analysis and data pre-processing were performed at the University of Bristol as part of the ARIES project (ariesepigenomics.org.uk). Following extraction, DNA was bisulfite converted using the Zymo EZ DNA MethylationTM kit (Zymo, Irvine, CA). Following conversion, the genome-wide methylation status of over 485,000 CpG sites was measured using the Illumina Infinium® HumanMethylation450k BeadChip assay according to the standard protocol. The arrays were scanned using an Illumina iScan and initial quality review was assessed using GenomeStudio (version 2011.1). The level of methylation is expressed as a "Beta" value (β-value), ranging from 0 (no cytosine methylation) to 1 (complete cytosine methylation). Samples from all time-points in ARIES were distributed across slides using a semi-random approach (sampling criteria were in place to ensure that all time-points were represented on each array) to minimize the possibility of confounding by batch effects. Samples failing quality control (average probe detection p-value ≥ 0.01) were repeated. As an additional quality control step genotype probes on the HumanMethylation450k were compared between samples from the same individual and against SNP-chip data to identify and remove any sample mismatches. Data were pre-processed in R (version 3.0.1) with the WateRmelon package according to the subset quantile normalization approach described by Touleimat & Tost in an attempt to reduce the non-biological differences between probes. We removed probes that had a detection P-value >0.05 for >5% of samples, probes on the X or Y chromosomes and SNPs (rs probes).
Proportions of cell types were estimated from DNA methylation data using the estimateCellCounts function in the minfi R package which is based on the method developed by Houseman et al. . [10,11] This estimated the proportion of B cells, CD8 T cells, CD4 T cells, granulocytes, eosinophils, neutrophils, NK cells and monocytes at the 7.5 year methylation time-point and at the 16.5 year methylation time-point independently. Ten surrogate variables were generated and included in models to adjust for technical batch in cross-sectional and prediction models, using the sva R package. In longitudinal models, twenty surrogate variables were used, ten from each of the two methylation time points. Asthma status at 7.5 years of age was defined from questionnaires completed by the mothers of the study children at approx. 7.5 years of age when they were asked if their study child have ever been diagnosed by a doctor with asthma. Mothers were also asked in the same questionnaire if their study child had taken any asthma medicine in the past 12 months. The mothers of the study children were asked the same questions about asthma doctor diagnosis and asthma medications in the past 12 months when study children were 14 years old.

Lung function and covariates
Spirometry was done in a research clinic at ages 8.5 and 15 years approximately by using methods described previously. [12] Lung function at 15 years was defined as the highest of 3 measures before administration of salbutamol (pre-salbutamol measures) and 15 minutes after receiving 400 mg of salbutamol (post-salbutamol measures) administered by using metered aerosol and a spacer.
Maternal education (proxy for social class) was classified for this study as 1"University education", 2"A-level" or 3"O-level or lower" based on questionnaires completed by the study mothers. We derived smoking status from questionnaires completed by the study children at approximately 16 years of age. Study participants were asked if they had ever smoked a cigarette. Those responding Yes to the smoking question were then asked about their smoking frequency, indicating either 1"I have only ever tried smoking cigarettes once or twice" , 2"I used to smoke sometimes but I never smoke cigarettes now", 3"I sometimes smoke cigarettes but I smoke less than one a week", 4"I usually smoke between one and six cigarettes a week", 5"I usually smoke between one and six cigarettes a week but not every day" and 6"I usually smoke one or more cigarettes every day".
Height and weight were measured at research clinic attendance at the same ages as lung function assessment was carried out. During clinic attendance, height was measured to the last complete mm using a Harpenden Stadiometer and weight was measured using a Tanita Body Fat Analyser (Model TBF 305; Tanita Europe Ltd, Amsterdam, The Netherlands).

Study description
The Finnish Twin Cohort (FTC) study was initiated in 1974 to study genetic and environmental factors contributing to complex diseases and behavioral risks. [13] The study participants were recruited from the FinnTwin16 cohort, [13] a population-based, longitudinal study of five consecutive birth cohorts (1975)(1976)(1977)(1978)(1979) of twins, their siblings, and their parents. The FinnTwin16 cohort was established in 1991, and the first assessments took place when the twins were 16 years of age, with four waves of follow-up when the twins were 17, 18.5, 24,[14] and 34 years, [13] on average. We studied young adult twin individuals who were selected by their responses to questions on weight and height at the age of 23-36 years to represent a wide range of intra-pair differences in body mass index (BMI).
The spirometric examinations were performed during 2004-2013 by a mass flow sensor (Vmax encore, Sensormedics, Palm Springs, CA, USA). The flow device was cleaned before calibration by the device's automatic cleaning program, after which 0 calibration was performed. Flow calibration was then performed with a 3 liter pump between flow values 0.5 l/s and 6 l/s, after which a volume calibration was performed. During spirometry, patient was sitting with the nose closed with a clip, and at least 3 maximal flow volume curves were measured, the difference of best two FEV1 values or FVC values had to be less than 150 ml, and the expiration should last at least 6 seconds. If this was not fulfilled, additional measurements were performed. The spirometric results were given according to ERS/ATS recommendation from 2005. [15] Forced vital capacity (FVC), forced expiratory volume in one second (FEV1) and FEV1/FVC ratio were measured. The 2012 multiethnic reference values were used to compare the spirometric results. [16] The study subjects provided written, informed consent. The protocol was designed and performed according to the principles of the Helsinki Declaration and was approved by the Ethics Committee of the Hospital District of Helsinki and Uusimaa.

DNA methylation data
DNA extracted from white blood cells of 308 FinnTwin16 twins (112 MZ and 42 DZ pairs) aged 23-36, was used to generate DNA methylation data by Infinium HumanMethylation450 BeadChip (Illumina, Inc.). All DNA methylation wet-lab analyses were performed at the Norwegian Genomics Consortium, Oslo, Norway, and data preprocessing analyses were performed at the University of Helsinki, Finland, as part of the ongoing FTC projects, and has been described in detail previously. [17] For the current study, 110 MZ twins (55 MZ pairs) with both DNA methylation and spirometry data available from the same time point were selected.

Replication cohort: IOWBC -Isle Of Wight Birth Cohort Study description
The IOWBC is a single-centre study designed to represent the community population. All children born on the Isle of Wight in a defined period (January 1989 to February 1990) were eligible for inclusion. The cohort was recruited through the 1509 women who gave birth to 1536 children on the IOW during the recruitment period. The children in the IOWBC have been seen on six occasions over the course of 26 years, at 1, 2, 4, 10, 18 and 26 years. [18] The focus of the IOWBC is to investigate the etiology and natural history of asthma and allergic disease manifestations in an unselected population during childhood and early adult life. Spirometry data was performed at follow-ups at age 10, 18 and 26. The spirometer used at each follow-up was Koko spirometer and software with a portable desktop device (both PDS Instrumentation, Louisville,KY, USA). [19] DNA methylation data Biologic sample collection of peripheral blood was obtained at follow-ups at the age 10 and 18 and the samples of a subgroup were used for DNA methylation typing. 817 samples were typed using the Illumina Infinium® HumanMethylation450k BeadChip assay according to the standard protocol. [19] Methylation data was processed using standard QC-pipelines (CPACOR, quantile normalization and ComBat) using Minfi and SVA R packages (R version 3.51).

Study description
The Cooperative Health Research in the Augsburg Region Study, KORA, aims to gain new insights into the causes, development and consequences of cardiovascular disease, diabetes as well as lung diseases and allergies. [20][21][22] The KORA S4 survey is an independent population-based sample from the general population living in the region of Augsburg, Southern Germany. KORA S4 was conducted in 1999/2001 and standardized examinations were applied in the survey (4261 participants). A total of 3080 subjects participated in a follow-up examination of S4 in 2006-08 (KORA F4), comprising individuals who, at that time, were aged 32-81 years. A subset of 1321 subjects, aged 44-64 years, underwent spirometry and were followed up in the KORA FF4 survey 7 years later (2013/2014). Of those, a total of 628 subjects had DNA methylation data from blood samples collected at KORA F4 and data available on spirometry from KORA F4 and KORA FF4.

DNA methylation data
DNA methylation was measured in DNA extracted from whole blood of the participants using the Infinium HumanMethylation450K BeadChip at the Helmholtz Zentrum München, Research Unit Molecular Epidemiology and Genome Analysis Centre. The bisulfite conversion and genome-wide methylation assessment were performed as previously described. [23,24] QC and preprocessing Normalization of the methylation data was conducted following the CPACOR pipeline, [4] beginning with exclusion of 65 single nucleotide polymorphism markers and background correction using the R package minfi. [10] Probes were set to NA if the detection p-value ≥0.01 or number of beads ≤3. Samples were excluded if the detection rate was ≤0.95. Quantile normalization was then performed on the signal intensities. The methylation of a given cytosine was first calculated as a β-value, the ratio of the methylated signal intensity to the sum of the methylated and unmethylated signal intensities. Following normalization, CpG sites with a detection rate below 95% were excluded. To reduce possible impact of non-biological effects, we adjusted the methylation values for technical effects prior to analysis. In detail, principal component analysis was performed on the intensities of all (non-negative, autosomal) control probes after background correction. We then modeled the methylation values of each CpG site across all samples as a function of the first 20 principal components. Residuals of these models were used as "technically adjusted" methylation values for all analyses. [

DNA methylation data
Blood samples for methylation were taken from participants at each wave of testing. DNA methylation was measured at 485,512 sites using the Infinium HumanMethylation450 BeadChip array, at the Edinburgh Clinical Research Facility Genetics Core, Western General Hospital, Edinburgh. DNA methylation data, spirometry measures and covariates were available for 449 individuals at waves 1, 2 and 3.

Replication cohort: LifeLines Cohort Study description and DNA methylation data
The LifeLines cohort study is a large Dutch population-based cohort study designed to investigate chronic diseases and healthy aging. [29,30] Detailed information about LifeLines can be obtained at the official website (http://www.lifelines.net). A subgroup of 1,656 participants of the LifeLines cohort was non-random selected based on lung function, smoking status and exposure to environmental exposures. Whole blood samples collected for DNA extraction and DNA methylation level for each CpG site was measured using the IlluminaInfinium® Human Methylation 450K array (Illumina,Inc.) at the UMCG, Groningen, The Netherlands.

Study description
The Northern Sweden Population Health Study (NSPHS) was initiated to provide a health survey of the population in a geographically remotes area and to study the medical consequences of lifestyle and genetics. [31,32] According to the Sweden Census, on 31 December 2006, of 826 eligible inhabitants (aged 15 years or older) 740 subjects agreed to participate (90%) and 656 subjects contributed complete data, resulting in a final sample of 347 (53%) women and 95 (14.5%) individuals with a traditional lifestyle. The comprehensive collection of data included genealogy, socio-demography, body size, blood samples for clinical chemistry, medical history of participants and family members, and lifestyle. Spirometry was performed in a sitting position without nose clips using a Spida 5 spirometer (MicroMedical; http://www.medisave.co.uk). Three consecutive lung function measurements per participant were performed and the maximum value per measured lung function parameter was used for further analysis. Within the scope of this article, Peak Expiratory

Study population
The discovery cohorts were three well-characterized longitudinal cohort studies which were part of the Aging Lungs in European Cohorts (ALEC) project. Briefly, two discovery cohorts were populationbased studies specifically designed to investigate respiratory health, the European Community  Figure S1: Selection of study participants in discovery cohorts and data availability in replication cohorts Figure S1: Discovery cohorts: Flow chart of the selection of study participants. Replication cohorts: Contribution of each replication cohort to the meta-analyses based on data availability.

Discovery and replication data availability
For the current analysis purposes the data availability across eight adult inception cohorts (three discovery and five replication cohorts) was maximal for a one time-point cross-sectional association analysis. Among the replication cohorts, only one adult birth cohort (LBC1936, n=449) and both childhood birth cohorts (ALSPAC, n=258 and IOWBC, n=162) had methylation information and epidemiological data available from two time points. The sample size for the replication of repeat cross-sectional associations in adults was limited to the LBC1936 data. The replication sample size for the prediction EWAS on change in lung function over follow-up time was limited to two cohorts (KORA, n= 628, and LBC1936, n=449). The data available on spirometry was obtained from two follow-ups was 7 years apart for KORA and 6 years apart for LBC1936 and DNA methylation data was obtained from blood samples collected at the time of first spirometry. The remaining replication cohorts had DNA methylation, spirometry and epidemiological data available from only one time point (LifeLines (n=1622), NSPHS (n=535) and FTC (n=93)).

Whole blood sample collection and DNA methylation typing in discovery cohorts
For discovery cohorts, biological samples of peripheral blood were collected using standard operating procedures at and normalization procedures applied in the discovery cohorts are described in table at 2.6 of this Appendix. In total 4,568 samples of 2,259 participants were typed in the discovery cohorts. The methylation data was QC-processed, normalized and correction for technical batch effects were performed separately in each cohort. Cohort specific QCsteps were performed. A call rate of 95% was applied as selection criteria on marker and sample level; we had 2,043 samples for both time points (ECRHS: n=470, NFBC1966: n=611 and SAPALDIA: n= 962). For cohort-specific EWAS analyses, we used all autosomal markers available for each time point. Given the use of two different DNAme typing arrays (850K BeadChip and 450K BeadChip), all cohort-specific EWAS marker results were meta-analysed without restricting to a set of common markers.

QC-processing of DNA methylation data
We performed the methylation data processing within each cohort separately using R packages minfi, [10] RnBeads, [33] and CPACOR. [4] After standard QC cleaning steps regarding duplicates, sex inconsistency, low sample quality (sample call rate >95%) for all three cohorts, ECRHS additionally excluded outlier samples beyond 1.5 interquartile range (IQR) and NFBC1966 excluded samples with their 1st PC score of the DNA methylation values outside the interval given by the mean +/-4SD (see n=762'003 (ECRHS2), n=832'569 (NFBC1966 (2012)), n=472'079 (SAPALDIA3)). For cohort-specific EWAS analyses, we used all autosomal markers available for each time point and cohort-specific EWAS marker results were metaanalysed without restriction to common markers.
The cohorts used spirometers of different brands and models and change in spirometers between surveys did occur. Spirometric comparability across time points was addressed within cohorts. [

Methods: Statistical analysis
In the discovery cohorts, two types of smoking score were generated to test their combined effect. Briefly, an ALEC custom SI specific for each lung function outcome (FEV 1 , FEV 1 /FVC or FVC) was generated from a subset of CpGs selecting known smoking-related markers from the ALEC discovery repeat cross-sectional EWAS 100 top association signals; second a lung-function-gene-SI including smoking associated CpGs located in 18 GWAS-identified lung function candidate genes.

Generation of smoking scores
For the discovery cohorts, two types of smoking indices (SI) were constructed according to an algorithm based on the deviation from the reference group mean of the measured DNAme across a subset of CpGs in a given participant. [44] To build smoking indices specific to each lung function outcome in the discovery cohorts, we used the technical bias-adjusted residuals of the measured DNAme. Never smokers were defined as the reference group to derive the mean and standard deviation (µc and σc) used in the index equation. For each CpG of the selected subset, the mean DNAme, μc, and its standard deviation, σc, across the group of never smokers (reference group) was computed. In detail, the smoking index was defined across a subset of a number of CpGs (N) in a given participant (SI(s)) using the DNAme measured in the biological sample (s) by summing the difference in DNAme level at a given CpG (βcs) from the mean reference DNA methylation (µc) divided by the standard deviation in the reference group (σc) and by taking the direction hyper-or hypomethylation the smoking-associated CpG into account (wc, hypermethylation =+1 and hypomethylation=-1). The direction of the effect of smoking on DNAme was derived from Joehanes et al [45].
Using this algorithm, an SI can be constructed for all participants irrespective of their smoking status. Two different subsets of CpGs were selected to test their combined effect in a SI score, the Mediation-SI (i) and the lung-functiongene-SI (ii) i) Mediation-SI: In a recent report 10 CpGs had been identified to be associated with smoking applying an independent EWAS in the LifeLines cohort study and confirmed by mediation analysis to be mediators of smoking on lung function [46]. We used these 10 CpGs for constructing a mediation SI.

Annotation of genomic loci of replicated CpGs
Annotation to genomic location was achieved by permutations testing. To determine whether enrichment occurred more often than expected by chance we drew 10000 random sets with matched SD structure as the lung funciton associated loci from the Illumina 450K arrays probes. For each set, we recorded the overlap to the genomic feature under test, creating a distribution that reflected the overlap of a random permuted set of 450K probes with the same standard squared deviation (SD) structure to the genomic feature. We obtained enrichment p-values empirically from distribution of overlaps generated by the permuted sets and the observed overlap in the lung function associated loci, and performed a Fisher exact test by cross tabulation of the mean overlaps from the permuted sets versus the observed overlap in lung function associated loci. The SD structure of the 450k probes was recorded in NFBC1966. Subsequently, the 450K probes were divided into 10 SD bins. The 10000 permuted sets were created containing the same number of probes per SD bin as the credible set of loci.
We assessed the overlap of our lung function specific marker to histone modification H3K4, H3K27me3 and the chromatin state model reported by Roadmap. [47] DNAse hypersensitivity sites as well as transcription factor binding sites data were retrieved from ENCODE project. [48] Chromosomal contact points and domains (HiC) were downloaded from GEO and concordant experiments were undertaken as previously described. [49] To assess the overlap to SNP signatures we used the NHGRI-EBI GWAS catalogue (downloaded June 2016). SNPs per trait were pruned across studies using a 1MB window. We restricted this analysis to traits with 50 or more independent variants. To define enrichment in regions relative to the transcription start site of a gene and CpG island we used Illumnina's 450K manifest file.

Pathway analyses of replicated CpGs
Genes from the 57 confirmed sentinel CpGs, replicated in independent cohorts, were selected for further analysis.

Smoking enrichment analysis
We performed an enrichment analysis to test if CpGs previously identified as being associated with smoking were statistically overrepresented among the top lung function associated CpG markers of the discovery meta-analyses EWAS. We defined a CpG marker to be a smoking CpG if it was part of a previously reported set of CpG markers (n = 18,760) associated with smoking behavior at a FDR-corrected P-value<0.05. [45] The test statistics (t-values) were retrieved from each EWAS result and tested for the enrichment of the smoking-related CpG sites. The Weighted Kolmogorov Smirnov (WKS) test was used to assess if the test statistics on a certain set of loci differed from those in random loci of the same size. [54] The WKS is a better alternative to the commonly used gene set enrichment analysis [55] for examining an enrichment of custom curated CpG marker sets, as this method may be less biased towards identifying enrichment of large genes with increased number of probes represented on the arrays.

Replication of mediation analysis in SAPALDIA
We conducted mediation analysis for each of the 10 CpGs used to compute the Mediation-SI, using R package "mediation". SAPALDIA samples at time point 2 were analyzed (n = 962). Mediation model was fitted by linear regression of technical variable adjusted residuals of the methylation level on smoking status (ever vs never-smoker)     S3, table S4 and table S5: * Presentation of CpG markers showing meta-analysis P-value < 5x10 -7 at combined meta-analysis. CpGs consistent in direction of the crosssectional associations at both time points were for FEV 1 /FVC: cg05575921, cg21161138, cg26703534 and cg25648203 (AHRR), cg03636183 (F2RL3), cg21566642, cg01940273 and cg03329539 (in vicinity of ALPPL2), cg23771366 and cg11660018 (PRSS23), cg15342087 (IER3), cg19572487 (RARA), cg24859433, and cg14753356, cg15342087 (FLOTL1) and cg04885881 (SRM); for FEV 1 : cg03636183 (F2RL3), cg11660018 (PRSS23), cg19572487 (RARA), cg23771366 (PRSS23), cg03149958 (ETV7), cg18181703 (SOCS3), cg05673882 (POLK), cg01127300, cg04813697 (PIP4K2A), cg21990700 (C1RL), cg06826457, cg25953130 (ARID5B), cg01882991 (intergenic), cg16288101 (intergenic), cg12761472 (intergenic), cg11231349 (NOS1AP); and none for FVC. † Base model (M base ) EWAS were covariate adjusted for age, age squared, height, squared deviation from the mean of height, sex and interaction terms of age, age squared, height and squared deviation of height with sex, education (low, medium, high), body mass index, spirometer type, study center as well as cell composition. ‡ Smoking CpGs defined on the reported FDR corrected P-value <0.05 for association reported with smoking status and direction of effects. [45] ¶ For combined meta-analysis: FTC was excluded from this meta-analysis, given the smaller sample size and lower mean age (30. Figure S2: Quantile-Quantile plot of cross-sectional EWAS on FEV 1 and FVC Figure S2: Quantile-Quantile plots of cross-sectional covariate-adjusted EWAS (M base *) on FEV1 and on FVC at first and second time point, all participants. Increase in numbers of signals with aging. Meta-analyses were performed without genomic control. For FEV 1 , we identified 34 CpGs at time point 2 compared to none at time point 1 to be statistically significant (inflation factor  for time point 1 ( = 1.14) and for time point 2 ( = 1.14)). For FVC, we identified three CpGs at time point 2 compared to none at time point 1 to be statistically significant (inflation factor  for time point 1 ( = 1.09) and for time point 2 ( = 1.02)). Figure S2: *Base model (M base ): EWAS were covariate adjusted for age, age squared, height, squared deviation from the mean of height, sex and interaction terms of age, age squared, height and squared deviation of height with sex, education (low, medium, high), body mass index, spirometer type, study center as well as cell composition.

Results: EWAS repeat -Repeat cross-sectional EWAS in the same study participants -combining the data from both time points
The discovery EWAS repeat was undertaken using mixed linear regression with a random intercept on the study participant to assess the data from both time points in the same model. The EWAS repeat results are presented only in the online supplement ( Figure Box1 and Table S6, this Appendix). For FEV 1 , we found seven CpGs associated either in discovery EWAS meta-analysis or in the combined meta-analysis of four cohort studies to be associated with a P<5x10 -7 . Three CpGs did formally replicate in the LCB36 replication cohort (P<0.0071, Bonferroni correction for 7 tests). All three replicated differentially methylated markers had previously been identified as smoking-related CpGs: cg05575921, in the AHRR gene replicating at P replication = 0.0033 (P combined =1.86x10 -9 ), cg21566642 in the ALPPL2 gene (2q37.1) at P replication = 0.0033 (P combined =1.86x10 -9 ) and cg04813697 in the PIP4K2A gene at P replication = 0.0006 (P combined =4.61x10 -7 ). Results (not shown) see Box3 Figure "Manhattan and Q-Q plots of EWAS repeat "). For FEV 1 /FVC, there were 23 CpG markers found to be associated at P-value < 5x 10-7 and six replicated formally at a P<0.0021 (Bonferroni correction for 23 tests). Again all replicated CpGs were known smoking-related CpGs. In addition to the five prominent differentially methylated markers (cg21566642 and cg01940273 in the ALPPL2 gene (2q37.1), cg05951221 in the AHRR gene, and cg03636183 in the gene F2RL3), we identified cg09935388 in gene GFI1 replicating at P replication =0.0001 (P combined =2.54x10 -9 ).
For FVC, only one CpG marker (cg07709627 in gene PLEKHF1) reached the genome-wide selection threshold for replication in the repeated cross-sectional analyses, but it did not replicate. This method performed poorer as fewer statistically significant CpGs were identified compared to the cross-sectional EWAS using data from the second time point. Comparing the results of the repeated cross-sectional EWAS with the results of two cross-sectional EWAS, we noted for some CpGs consistent associations for both approaches. For other CpGs it seemed that fixed effect linear approach at two time points revealed more easily consistent and statistically significant associations compared to the mixed linear models likely due to increased sample size of the replication cohort for the cross-sectional analyses. Footnote: *Base model (M base ): EWAS were covariate adjusted for age, age squared, height, squared deviation from the mean of height, sex and interaction terms of age, age squared, height and squared deviation of height with sex, education (low, medium, high), body mass index, spirometer type, study center as well as cell composition.

Results: Effect of smoking adjustment on the cross-sectional EWAS (Ms mok ):
After applying smoking adjustment, we identified generally fewer genome-wide significant associations.  Figure S4, next page).
Footnote to Figure S3: ) EWAS were covariate adjusted for age, age squared, height, squared deviation from the mean of height, sex and interaction terms of age, age squared, height and squared deviation of height with sex, education (low, medium, high), body mass index, spirometer type, study center as well as cell composition and additionally for adjusted for smoking covariates: history of smoking intensity as pack years smoked up to the time point of data collection for regressions and for smoking status (current, former and never smoker).

Results: Prediction of DNAme1 association with lung function (EWAS predict )
8.1. Table S8: Prediction EWAS (M base ) for change in FEV 1 and for FVC, in all participants.  showing meta-analysis P-value < 5x10 -7 at discovery or replication level. CpGs shown sorted by statistical significance of combined metaanalysis results. † Base model (M base ) EWAS were covariate adjusted for age, age squared, height, lung function at time point 1 (FEV 1 or FVC respectively), squared deviation from the mean of height, sex and interaction terms of age, age squared, height and squared deviation of height with sex, education (low, medium, high), body mass index, spirometer type, study center as well as cell composition. ‡ Replication was defined for association with FEV 1 if replication P-value<0.00067 (multiple testing correction for 74 tests), and with FVC if replication P-value<0.0031 (multiple testing correction for 16 tests). ¶ Smoking CpGs defined on the reported FDR corrected P-value <0.05 for association reported with smoking status and reported direction of effects for association with smoking. [45] 8.2. Figure S4: Plots of prediction EWAS on FEV 1 and FVC Figure S4: Manhattan and Quantile-Quantile plots of covariate-adjusted prediction* EWAS (M base †) in all participants A) on FEV 1 ( = 1.26); and B) on FVC ( = 1.23). Figure S4: * Prediction association of DNA methylation at first time point (DNAme 1) with annual change in lung function during follow-up †Base model (M base ): EWAS were covariate adjusted for age, age squared, height, lung function at time point 1 (FEV 1 or FVC respectively), squared deviation from the mean of height, sex and interaction terms of age, age squared, height and squared deviation of height with sex, education (low, medium, high), body mass index, spirometer type, study center as well as cell composition. 9. Results: EWAS meta-analyses in never smokers 9.1. Figure S5: Quantile-Quantile plots for cross-sectional associations, in never smokers. Figure S5: Quantile-Quantile plots of cross-sectional covariate-adjusted EWAS (M base *) at first and second time point, in never smokers A) on FEV 1 (inflation factor  for time point 1 ( = 1.09) and for time point 2 ( = 1.05));B) FEV 1 /FVC (inflation factor  for time point 1 ( = 1.11) and for time point 2 ( = 0.96)); and C) FVC (inflation factor  for time point 1 ( = 1.08) and for time point 2 ( = 1.03)). Figure S5: *Base model (M base ): EWAS were covariate adjusted for age, age squared, height, squared deviation from the mean of height, sex and interaction terms of age, age squared, height and squared deviation of height with sex, education (low, medium, high), body mass index, spirometer type, study center as well as cell composition. Table S9: Cross-sectional associations, in never smokers. 9.3. Table: EWAS repeat in never smokers:

9.2.
In the repeat cross-sectional EWAS cg14366110 was the top hit (P discovery =1.72 x10 -10 ) for association with FEV 1 /FVC, yet this association did not replicate in the LBC1936 sample (P replication = 0.454) though the direction of the effect was consistent and the P combined = 4.63 x10 -10 remained statistically epigenome-wide significant.  Footnote table EWAS repeat in never smokers: * Presentation of CpG markers showing meta-analysis P-value < 5x10 -7 in the combined meta-analysis. † Base model (M base ) EWAS were covariate adjusted for age, age squared, height, squared deviation from the mean of height, sex and interaction terms of age, age squared, height and squared deviation of height with sex, education (low, medium, high), body mass index, spirometer type, study center as well as cell composition.
‡ Smoking CpGs defined on the reported FDR corrected P-value <0.05 for association reported with smoking status and direction of effects. [45] 9.4. Table S10: Prediction EWAS, in never smokers. Footnote to table S10: * Prediction association of DNA methylation at first time point (DNAme 1) with annual change in lung function during follow-up, defined as lung function at second time point -lung function at first time point divided by the time of follow-up in years. Presentation of CpG markers showing meta-analysis P-value < 5x10 -7 at discovery or replication level. † Base model (M base ) EWAS were covariate adjusted for age, age squared, height, lung function at time point 1 (FEV 1 , FVC or FEV 1 /FVC respectively), squared deviation from the mean of height, sex and interaction terms of age, age squared, height and squared deviation of height with sex, education (low, medium, high), body mass index, spirometer type, study center as well as cell composition.

Results
: ALEC discovery meta-analysis look-up results of CpGs previously associated with respiratory traits 13.1.

Results: Two-sample Mendelian Randomization look-up in publicly available databases
We found genetic instruments for 13 of the 57 replicated CpGs and finally only for eight CpGs a two sample MR on cross-sectional lung function could be completed. Results for FEV 1 are presented in table S20. There was evidence for causal effect of cg23771366, cg11660018 (PRSS23) and cg21990700 (C1RL) and cg00073460 (ZC3H12D) on FEV 1 percent predicted and of cg00073460 (ZC3H12D) and cg24086068 (SHROOM3) on FVC. Of the four top loci associated with FEV 1 /FVC (AHRR, F2RL3, ALPPL2 and PRSS23, see table 2) we could only identify MR-genetic instruments for ALPPL2 and PRSS23 and only the association of PRSS23 with lung function could be obtained from the MRbase resource.  table 2). Both other dominant smoking-related CpGs which also remained significant after smoking adjustment (cg05575921 (AHRR) having P=2.69x10 -11 (M smok ) versus P=7.22 -50 (M base ) and cg03636183 (F2RL3) having P=1.02x10 -8 (M smok ) versus P=4.5x10 -43 (M base )) were not available in the mQTL database and thus their causal association with lung function remains to be investigated. 14.1. Table S20: Two-sample MR results for FEV 1 predicted percentage and for FVC

Results: Mediation analysis and Mediation smoking index (Mediation-SI)
15.1.  Footnote to table S21: * set of ten CpG markers previously identified to be associated with smoking and to mediate the effect of smoking on lung function [46]. Two CpGs (cg05951221 and cg06126421) were present only on 450K array and thus missing in NFBC1966 data from time point 2 and in ECRHS data from both time points. Values for Mediation-SI score for missing data for cg05951221 and cg06126421 were imputed to mean of never-smokers in each cohort. † Smoking CpGs defined on the reported FDR corrected P-value <0.05 for association reported with smoking status and reported direction of effects for association with smoking. [45] ‡ EWAS repeat : Discovery meta-analysis of repeat cross-sectional association combining data from time point1 and time point 2 using mixed linear regression with a random intercept on the study participant.    Figure S6: *Associations were adjusted for the base model (M base ): age, age squared, height, squared deviation from the mean of height, sex and interaction terms of age, age squared, height and squared deviation of height with sex, education (low, medium, high), body mass index, spirometer type, study center as well as cell composition. The M base -adjusted model explained 71.0% of the variance in the outcome. The M base -adjusted model additionally adjusted for the Mediation-SI explained 71.9% of the FEV1 variance (total adjusted R 2 = 0.719) of which 3.2% of the variance was specifically explained by the SI variable. This was comparable to the variance explained by the M base -adjusted model additionally adjusted for packyears and smoking status corresponding to the M smok model (R 2 = 0.721, and with 3.3% of the variance specifically explained by the packyears variable). Model including both smoking adjustments (M smok and additionally Mediation-SI) explained 72.4% of the FEV1 variance.  Figure S7: *Associations were adjusted for the base model (M base ): age, age squared, height, squared deviation from the mean of height, sex and interaction terms of age, age squared, height and squared deviation of height with sex, education (low, medium, high), body mass index, spirometer type, study center as well as cell composition. The M base -adjusted model explained 78.8% of the variance in the outcome. The M base -adjusted model additionally adjusted for the Mediation-SI explained 79.1% of the FVC variance (total adjusted R 2 = 0.791) of which 1.2% of the variance was specifically explained by the SI variable. This was comparable to the variance explained by the M base -adjusted model additionally adjusted for packyears and smoking status corresponding to the M smok model (adjusted R2 = 0.792, and with 0.5% of the variance specifically explained by the packyears variable). Model including both smoking adjustments (Msmok and additionally Mediation-SI) explained 79.3% of the FVC variance (total adjusted R 2 = 0.793). Table S23: Mediation-SI association with FEV1 and with FVC by smoking strata Table S23: Meta-analyses* of Mediation-SI on the cross-sectional association with lung function, FEV 1 (L) and FVC (L), time point 2, separately, and longitudinally predicting the change in lung function during follow-up, FEV 1 (ml/year) and FVC (ml/year), base model adjustment (M base †), in all study participant, ever and never smokers. 15.6. Figure S8: Mediation-SI association with FEV 1 in ever -and never smokers Figure S8: Forest plots of cohort-specific results and meta-analyses of the association of the Mediation-SI with FEV 1 and change in FEV 1 in ever -and never smokers in the discovery cohorts. Associations run applying base model adjustment (M base *).

15.5.
Footnote to Figure S8: *Base model (M base ): EWAS were covariate adjusted for age, age squared, height, squared deviation from the mean of height, sex and interaction terms of age, age squared, height and squared deviation of height with sex, education (low, medium, high), body mass index, spirometer type, study center as well as cell composition. Prediction models were additionally adjusted for FEV 1 at time point 1.

Results
: Candidate Gene based DNAme smoking index (lung-function-gene-SI) 17.1. Table S24: List of CpGs contributing to lung-function-gene-SI Table S24: List of CpGs contributing to the gene-based smoking index (lung-function-gene-SI) representing a subset of CpG markers previously identified to be associated with smoking* that were located in GWAS identified genes associated with lung function. Information on ranks of association of the selected CpGs in the discovery EWAS (M base ) with FEV 1 are presented, similar low ranks were observed for association with FEV 1 /FVC or FVC. Footnote to table S24 *Smoking CpGs defined on the reported FDR corrected P-value <0.05 for association reported with smoking status and reported direction of effects for association with smoking. [45] Result Associations of lung-function-gene-SI with FEV 1 , FEV 1 /FVC and FVC:

Locus
A smoking index (SI) based on 18 candidate CpGs located in previously identified genes to determine lung function (lung-function-gene-SI) was constructed. The construction and associations of the SI were performed as for the ALEC custom SIs. Meta-analysis results of the lung-function-genes-SI in all participants and by smoking status strata are presented below (Table S25). The effect of the lung-function-genes-SI, combining the effects of gene candidate CpGs, was not as pronounced as that of the effect of the ALEC custom SI, combining the effects of the smokingrelated CpGs. The strongest associations were observed in ever smokers for cross-sectional associations with FEV 1 and for prediction of change in FVC in ever smokers ( (SE) =-18.1ml/year (5.1), P=0.0004). Table S25: Meta-analyses of lung function gene-based candidate smoking index (SI) based on a subset of CpG markers (18 CpG markers, Table S24) previously identified to be associated with smoking [45] were located in GWAS identified genes associated with lung function. Associations* on the cross-sectional association with FEV 1 (L), Footnotes to table 25: *Cohort-specific association results for lung-function-gene -SI were meta-analysed. † Base model covariate adjustment (M base ): age, age squared, height, squared deviation from the mean of height, sex and interaction terms of age, age squared, height and squared deviation of height with sex, education (low, medium, high), body mass index, spirometer type, study center as well as cell composition. Prediction models were additionally adjusted for FEV 1 /FVC at time point 1. ‡ P-value of meta-analysis: P<0.008 was considered statistically significant, Bonferroni correction for 6 tests per lung function outcome. Order of cohorts for direction of effects: ECRHS, NFBC1966, SAPALDIA. Abbreviations: beta -coefficient of association; chr -chromosome; SE -standard error; P-value (het) stands for P-value between study heterogeneity.