Abstract
Asthma is a common condition caused by immune and respiratory dysfunction, and it is often linked to allergy. A systems perspective may prove helpful in unravelling the complexity of asthma and allergy. Our aim is to give an overview of systems biology approaches used in allergy and asthma research. Specifically, we describe recent “omic”-level findings, and examine how these findings have been systematically integrated to generate further insight.
Current research suggests that allergy is driven by genetic and epigenetic factors, in concert with environmental factors such as microbiome and diet, leading to early-life disturbance in immunological development and disruption of balance within key immuno-inflammatory pathways. Variation in inherited susceptibility and exposures causes heterogeneity in manifestations of asthma and other allergic diseases. Machine learning approaches are being used to explore this heterogeneity, and to probe the pathophysiological patterns or “endotypes” that correlate with subphenotypes of asthma and allergy. Mathematical models are being built based on genomic, transcriptomic and proteomic data to predict or discriminate disease phenotypes, and to describe the biomolecular networks behind asthma.
The use of systems biology in allergy and asthma research is rapidly growing, and has so far yielded fruitful results. However, the scale and multidisciplinary nature of this research means that it is accompanied by new challenges. Ultimately, it is hoped that systems medicine, with its integration of omics data into clinical practice, can pave the way to more precise, personalised and effective management of asthma.
With the recent influx of “big data” in asthma research, clinicians and scientists need to become familiar with analytical approaches that use systems-based methods to make sense of large datasets http://bit.ly/2oUO1tG
Introduction
Asthma is a common yet complex disease that involves immune and respiratory dysfunction, and it is often associated with allergy. The ongoing prevalence of asthma and allergy has been linked to changes in environment and lifestyle [1, 2]. But while we know of several genetic and environmental determinants of asthma, the potential interactions between these determinants remain unclear. Furthermore, asthma and allergy are umbrella terms that describe a spectrum of disease, with unexplained heterogeneity in clinical manifestations. Finally, with the development of high-throughput technologies, we may be able to unravel this complexity, but it remains challenging to process, analyse and interpret the large volumes of biological data that emerge from these technologies. All these challenges have prompted researchers to search for new methods of inquiry more suited to these research problems.
Systems biology is a relatively recent development that addresses the growing complexity of biomedical research questions. The term was coined in the 1960s to describe mathematical modelling of physiological systems [3]. Today it embodies expertise across multiple fields, including biology, mathematics, statistics, informatics and computer science. The “systems” community is diverse and as such there is no singular definition of the term “systems biology” [4]. However, it is commonly presented as the study of biomedical problems involving complex systems and their interactions, by surveying and integrating high-volume data that may cover wide spatiotemporal scales [3]. These “big datasets” typically originate from “omics”, fields of study involving high-throughput measurement of biomolecules: for instance, genomics for DNA, transcriptomics for RNA transcripts and proteomics for translated proteins (figure 1). Mathematical and computational expertise is then required to explore this high-volume data, using techniques such as dimension reduction, data- and text-mining, modified statistical analyses that account for spatiotemporal complexity and multiple testing burden, machine learning, and mathematical modelling. Further perturbation experiments may be performed, where a biological system can be disrupted (e.g. via receptor antagonists or gene knockouts) to identify functionally relevant elements of the system [5, 6]. Therefore, systems biology is by its very nature multi- and inter-disciplinary.
Figure 1. “Omics” in allergy, and their interrelationships. A depiction of the various omics that can be found in allergy and asthma research. Lines connecting the omics represent various biological relationships, associations or interactions that may exist. In systems biology, bottom-up approaches progress from the molecular scale to the macroscopic scale; vice versa for top-down approaches.
The practice of systems biology follows two approaches: an unbiased, hypothesis-free data-driven approach, where few a priori assumptions are made and models are learnt from the data; and a hypothesis-driven approach, where model design and analysis are guided by previous experiments and expert knowledge [7]. The data-driven approach is becoming increasingly popular as it can uncover new knowledge on emergent behaviour from complex data, and can be used to generate new hypotheses. In turn, hypothesis-driven studies can also be used to test those very hypotheses. For instance, data-driven approaches allow us to determine new subphenotypes of asthma, while hypothesis-driven studies will allow us to test the relationship of these subphenotypes to existing paradigms of disease (T2-driven versus non-T2). Furthermore, systems biology can be dichotomised into top-down versus bottom-up approaches, to describe the direction of enquiry in decreasing (big to small, long to short, system to components) or increasing spatiotemporal scales (small to big, etc.), respectively (figure 1) [8, 9].
On the surface, systems biology seems antithetical to the reductionist paradigm of old. However, systems-based approaches can produce new insights on how to proceed with reductionist experiments, and vice versa. In addition, there are strengths and weaknesses attached to each; while reductionist methods can oversimplify problems, their tests are more appropriate in contexts such as causal inference. Nonetheless, systems approaches are becoming indispensable to biomedical research; they allow us to better understand disease phenomena, and form the basis for precision medicine, helping us improve the screening and management of disease.
Asthma and allergy, as biomedical problems, are well suited to systems approaches. These diseases have complex pathogenesis, with multiple tiers of biological complexity, polygenicity and gene–environment interactions. Systems approaches used in asthma and allergy research include 1) discovery of disease associations within each omic field; 2) identification of relationships within and across omic fields; 3) examination of heterogeneity of disease states and phenotypes, typically by exploring the multidimensional structure of omic data via clustering or classification; 4) investigation of interconnections between system components in omic data by network analysis; and 5) mathematical modelling to model physiological systems or disease states, and to generate and test predictions (figure 2). Although the final approach is closest to the original formulation of systems biology, our review takes a high-level look at all approaches, with a focus on the first three.
Figure 2. Overview of systems-based approaches to tackling research questions in allergy and asthma. The various ways in which systems biology of allergy can be interrogated: a) discovery of disease associations within each omic field of enquiry; b) identification of relationships within and across omics; c) examination of the heterogeneity of disease states or phenotypes, typically by exploring the structure of omic data via clustering or classification; d) investigation of interconnections between system components in omic data by network analysis; and e) mathematical modelling to model physiological systems or disease states, and to generate and test predictions. Diagrams are for illustrative purposes only and do not convey real data. a) shows a simplified Manhattan plot, with the vertical axis representing negative log p-value (statistical significance), horizontal axis representing chromosome and position and each point representing a single nucleotide polymorphism; the red line is the adjusted p-value threshold, and significant loci are marked by “peaks” extending beyond the red line.
Overview of omic findings in allergy and asthma
We begin our examination from the bottom up: from the molecular level of genomes and transcriptomes to the macroscopic level of observable phenotypes. We offer high-level summaries of recent findings at each level of profiling. As allergy-related mechanisms comprise a significant portion of asthma pathogenesis, much of the discussion involves findings related to allergic diseases at large. However, there is some exploration of the omics of nonallergic asthma.
Genomics
Asthma and allergy are highly heritable, with estimated heritability ranging from 35% to 95% [10]. In the past half-century, the quantitation of genetic variation has progressed from rough “ballpark” measurements, such as restriction fragment length polymorphisms (RFLP), to precise single-nucleotide variants or polymorphisms (SNPs) interrogated en masse using DNA microarrays. More recently, there has been a move towards whole-exome and whole-genome sequencing. The complexity of genetic data analysis has grown in parallel, from candidate gene studies, to genome-wide linkage studies within pedigrees, to genome-wide association studies (GWAS) [11]. The GWAS approach is based on one-by-one association testing of thousands to millions of genetic variants across the genome with a phenotype of interest (e.g. asthma versus non-asthma status), with subsequent statistical adjustments for the multiple testing burden. The logic of adjusting for multiple testing is fundamental to many other omic-wide analyses, and is a key reason why such analyses need large sample sizes.
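The multiple testing adjustment described above can be illustrated with a minimal sketch. It assumes a Bonferroni correction, which is one common choice; the text does not specify which adjustment a given study uses. The variant names and p-values are hypothetical and serve only to show the arithmetic.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def significant_variants(p_values: dict[str, float], alpha: float = 0.05) -> list[str]:
    """Return IDs of variants whose p-value passes the Bonferroni threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return sorted(v for v, p in p_values.items() if p < threshold)


# Hypothetical p-values for four variants (illustrative only, not real data).
toy_p_values = {"rs_a": 0.001, "rs_b": 0.02, "rs_c": 0.20, "rs_d": 0.012}

# With four tests the per-test threshold is 0.05 / 4 = 0.0125,
# so only rs_a and rs_d remain significant after adjustment.
hits = significant_variants(toy_p_values)
```

Applied to one million tests, the same calculation gives a per-test threshold of 5×10⁻⁸, which is why small effects require large samples to reach significance.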
However, while some older associations from linkage and candidate gene studies have been replicated in GWAS (e.g. IL13/IL4 and IL4R), most have not. There is low concordance of significant results between these older studies and GWAS. This suggests that 1) older findings may be plagued by false positives; 2) each approach may have its own use: positional candidates from linkage studies may flag variants determining intrafamily disease risk, while GWAS flag variants determining population-wide risk; and 3) rarer or weaker gene associations may need larger sample sizes in GWAS to achieve stringent multiple testing thresholds. The last point is exacerbated by the difficulty of precise phenotyping in large population samples. However, the prevailing view in the human genetics community is that the older linkage and candidate gene studies were hampered by limitations in methodology and assumptions and frequently did not replicate in independent samples; thus, many of their findings have been discounted in favour of those provided by GWAS [12].
In the past 10 years, GWAS have identified loci shared across multiple allergic phenotypes, including asthma, allergic rhinitis/hayfever, atopic dermatitis/eczema and food allergy (supplementary table S1). These probably represent genetic contributors to general allergy, and include: the human leukocyte antigen (HLA) locus, specifically HLA-DQ/DRB1, HLA-DQA1/2 and HLA-B/C (6p21.32-33); C11orf30/LRRC32 (11q13.5); IL13/RAD50 (5q31.1); IL1RL1/IL18R1 (2q12.1); and TSLP/WDR36 (5q22.1) [11, 13–15]. Some of these have plausible biological underpinnings linked to immune function; biomolecules such as IL13, IL4, IL33 and TSLP are related to the “T2” immune response of type 2 helper T-cells (Th2) and innate lymphoid cells (ILC)2, and these are classically implicated in allergy. The HLA region encodes major histocompatibility complex class II molecules responsible for antigen presentation. Other associated genes remain uncertain in terms of pathophysiology (e.g. WDR36, CLEC16A), and require further investigation.
Due to frequent comorbidity of asthma with other allergic diseases, there is difficulty in discerning asthma-specific loci. Many loci previously thought to be unique to asthma [15] have now been found across multiple allergic diseases [16]. However, there are some loci that do appear to act specifically for certain asthma phenotypes. In particular, ORMDL3/GSDMB/LRRC3C (17q21.1) is linked to childhood-onset asthma [17–19]. Some loci (e.g. TLR1/TLR6, ADAD1/IL2) may be linked to Th17-related mechanisms of disease (supplementary table S1). Several studies and reviews have explored loci for asthma subphenotypes (e.g. aspirin-mediated and occupational asthma) and in relation to other respiratory traits (e.g. lung function, chronic obstructive pulmonary disease (COPD) and viral respiratory infections) [20–24]; however, these results have been inconsistent. Lately, there has been a focus on loci with ethnicity-specific effects: most of the aforementioned loci were identified primarily in European cohorts, and newer studies have begun exploring non-European populations. For instance, PYHIN1 is significantly associated with asthma, but only in individuals of African ancestry [18]. In addition, there is an increasing focus on using admixture to map risk loci [10].
Subsequent to these findings, significant loci and their molecular products have been targeted via numerous pharmacological approaches. Prior to the GWAS era, numerous anti-IgE (omalizumab), anti-interleukin (IL)5 (mepolizumab), anti-IL13 (lebrikizumab), anti-IL4R (dupilumab) and anti-IL2RA (daclizumab) antibodies [25–28] had been trialled with varying degrees of success. Following GWAS discovery of new loci, other therapies (anti-IL33, IL6R, thymic stromal lymphopoietin (TSLP), etc.) have been tested, again with modest results [28] (supplementary table S1). Such “biologics” are currently reserved for asthma resistant to conventional forms of treatment [28, 29]. It is probable that certain biologics (e.g. anti-T2 cytokine therapies) are only effective in individuals whose asthma is driven by specific mechanisms (e.g. T2 and not T1 or Th17/ILC3). In addition, it is possible that varying efficacy may depend on patient genetics and asthma subphenotype. Indeed, GWAS for responsiveness to asthma therapy with β2-agonist bronchodilators, leukotriene modifiers and steroids [30, 31] have identified loci distinct from disease-susceptibility loci. However, there have so far been few pharmacogenomic studies for the reverse: directly exploring the effects of GWAS-derived disease risk variants on treatment efficacy. Furthermore, many existing studies have been limited by lack of replication or inconsistent results. It remains a challenge to apply pharmacogenetic findings in practice, and they have yet to make a significant impact on current treatment and management.
Lately, there has been a push for risk scores based on genome-wide summary statistics. Despite the large number of novel associations discovered using GWAS, these collectively explain only a small proportion of the total heritability of asthma and allergy. The use of significant SNPs as a predictive tool for disease is often limited [32]. The existing criteria for genome-wide significance may not be sensitive for so-called “mid-hanging fruit” [33]: loci that are not genome-wide significant but still have an incremental effect on the phenotype. Recently, alternative strategies such as genomic or polygenic risk scores (PRS) have been employed to account for this missing signal. These use summary statistics from existing large-scale GWAS to generate additive scores from either a genome-wide assortment of SNPs, or from a selection of highly predictive SNPs. These have shown promise as predictive or risk-stratifying tools for other chronic polygenic diseases such as cardiovascular disease [32], but in asthma and allergy research, existing PRS have so far been limited to small subsets of genome-wide significant SNPs. Belsky et al. [34] derived a PRS based on 17 SNPs from an asthma GWAS [17], pruned by significance and linkage disequilibrium R² threshold; this score was predictive for earlier asthma onset, allergy, reduced lung function and risk of childhood asthma becoming persistent into adulthood. Arabkhazaeli et al. [35] developed a similar score for childhood allergy using 10 SNPs from a GWAS for adult allergy [36]. These methods, while interesting, may be less predictive than models that use a broader genome-wide selection of thousands to millions of SNPs, accounting for the known polygenic architectures of the diseases. For example, Lehto et al. [37] used a genome-wide PRS, but for affective traits, to identify possible shared genetic influences between asthma and depression.
It remains to be seen whether such findings replicate across multiple studies, and whether PRS can be used to reliably capture disease pathophysiology.
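The additive nature of a PRS can be sketched as a weighted sum of risk-allele dosages. This is a simplified illustration, not the method of any cited study: the SNP names, weights and dosages are hypothetical, and real scores additionally involve steps such as SNP selection, linkage disequilibrium pruning and standardisation.

```python
def polygenic_risk_score(dosages: dict[str, float], weights: dict[str, float]) -> float:
    """Additive score: sum over SNPs of (GWAS effect weight x risk-allele dosage).

    Only SNPs present in both the individual's genotypes and the weight
    table contribute; all others are ignored.
    """
    shared = dosages.keys() & weights.keys()
    return sum(weights[snp] * dosages[snp] for snp in shared)


# Hypothetical per-SNP weights, standing in for GWAS summary statistics.
toy_weights = {"snp1": 0.10, "snp2": 0.25, "snp3": -0.05}

# Hypothetical risk-allele counts (0, 1 or 2) for one individual;
# snp4 has no weight and therefore does not contribute.
toy_dosages = {"snp1": 2, "snp2": 1, "snp3": 0, "snp4": 1}

score = polygenic_risk_score(toy_dosages, toy_weights)  # 0.10*2 + 0.25*1 + (-0.05)*0
```

A score restricted to genome-wide significant SNPs and a genome-wide score differ only in which SNPs appear in the weight table, which is the distinction drawn in the text above.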
Transcriptomics
The transcriptome represents the entire repertoire of genes expressed in an organism or cell. Mirroring the developments in genomics, there has been a move from investigation of single-gene transcripts via traditional methods (e.g. Northern blotting), to genome-wide methods involving oligonucleotide microarrays, and most recently to RNA sequencing (RNA-seq, which involves reverse transcription to cDNA followed by deep sequencing) [38]. Transcriptomes may be determined by aligning RNA-seq reads to transcripts annotated in a reference genome, or assembling transcripts de novo, followed by quantification based on abundance of reads per transcript. Unlike the genome, the transcriptome varies across tissues and cell types, and changes dynamically during development and in response to external stimuli. Common tissue sources for transcriptomics include blood with or without cell sorting; bronchial epithelium, smooth muscle or sputum cells for asthma; nasal epithelium for allergic rhinitis; and skin for atopic dermatitis. Different cell types may feature different associations, and this provides insight into how various genes contribute to the many manifestations of allergy.
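Quantification based on abundance of reads per transcript can be illustrated with a minimal sketch. It assumes transcripts per million (TPM) as the unit, which is one common choice and is not specified in the text; the transcript names, read counts and lengths are hypothetical.

```python
def tpm(read_counts: dict[str, int], lengths_kb: dict[str, float]) -> dict[str, float]:
    """Convert raw read counts to transcripts per million (TPM).

    Counts are first divided by transcript length (reads per kilobase),
    then rescaled so that the values across all transcripts sum to 1e6.
    """
    rates = {t: read_counts[t] / lengths_kb[t] for t in read_counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("no reads mapped to any transcript")
    return {t: rate / total * 1_000_000 for t, rate in rates.items()}


# Hypothetical transcripts: B has three times the reads of A but is
# also three times as long, so both end up with the same abundance.
toy_counts = {"A": 100, "B": 300, "C": 200}
toy_lengths_kb = {"A": 1.0, "B": 3.0, "C": 1.0}

abundances = tpm(toy_counts, toy_lengths_kb)
```

The length correction matters because longer transcripts yield more reads at the same expression level; without it, transcript B would appear three times more abundant than A.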
Recent studies of allergy have identified, across multiple tissue types, differential expression of genes involved in innate and adaptive immunity, inflammatory and repair responses and epithelial integrity. Cytokines (T2-related and others), chemokines and their receptors, host defence proteins (defensins), protease inhibitors (SERPINs) and other multifunctional regulatory proteins (S100 family) are differentially expressed in allergic diseases. SERPINs control various immune and inflammatory processes, for instance by inhibiting neutrophil proteases (elastase, cathepsin G) and fibrinolytic enzymes (plasminogen activators). S100 proteins commonly serve as damage-associated molecular patterns, signals of cell stress or injury. As such, both SERPIN and S100 family proteins probably represent downstream sequelae of the immune-inflammatory responses typically seen in asthma and allergic diseases. Multiple studies have identified such changes for atopic dermatitis, in both lesional and non-lesional skin samples [39–42]; and in airway epithelial or sputum samples of asthma [43–47]. For asthma, further analyses have linked certain transcriptomic profiles to inflammatory subtypes of asthma: eosinophilic or T2-driven airway inflammation has been associated with elevated airway expression of periostin (POSTN), CLCA1, SERPINB2, CLC, CPA3 and DNASE1L3 [43, 44, 46]; while neutrophilic or Th17-linked inflammation has instead been linked to expression of IL1B, ALPL, DEFB4B, CXCR2 and other chemokines [43, 47], an expression profile that bears some similarities with psoriatic skin lesions [47]. Differences in gene expression across inflammatory phenotypes are also reflected in blood and sputum transcriptomics [47], and show some promise in being exploitable as putative biomarkers for disease subtypes [43]. 
Furthermore, there is evidence that T1 and Th17/ILC3 pathways act in partial opposition to each other, and while T2-mediated eosinophilic inflammation is responsive to steroid treatment [44, 48], it may also lead to enhanced Th17 activity and subsequent risk of neutrophilic inflammation [49]. Finally, although it may be enticing to describe T2-mediated eosinophilic inflammation as “allergic”, and T1 or Th17-mediated neutrophilic inflammation as “non-allergic”, other inflammatory profiles (paucigranulocytic, mixed) also exist, thus complicating the narrative. Nonetheless, associating gene expression profiles with specific inflammatory phenotypes may provide the next step towards improving precision in managing asthma and allergic disease.
It is notable that few of the aforementioned differentially expressed genes were identified as genome-wide significant loci in previous GWAS for asthma. As discussed earlier, this is probably the result of differential gene expression being indicative of inflammatory pathology downstream of the genetics. This is supported by the observation that several expression quantitative trait loci (eQTLs) are located around T2-related loci (IL4R, TSLP, IL13), and that unsupervised gene module analysis of airway transcriptomics has revealed consolidation of certain expressed genes into T1-driven versus T2-driven modules [46]. eQTL analyses are similar to GWAS, in that eQTLs are essentially SNPs with genome-wide significant effects on expression of nearby genes, for instance by altering the regulatory region of those genes (cis-eQTLs) [50], or altering a transcription factor for a distant gene (trans-eQTLs). Nowadays it is uncommon to see GWAS without an accompanying eQTL analysis in related tissue types. Significant loci from asthma and allergy GWAS that overlap with eQTLs in specific tissue types (usually whole blood) are shown in supplementary table S1. A limitation of whole blood eQTL analyses is that it is not clear whether the eQTL is active across all blood cell types, or only within specific blood or immune cells.
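The core of an eQTL test, the association between genotype at a SNP and expression of a gene, can be sketched as a simple linear regression of expression on allele dosage. This is an illustrative simplification with hypothetical data: actual analyses typically adjust for covariates and, as with GWAS, correct for the many SNP–gene pairs tested.

```python
def eqtl_slope(dosages: list[float], expression: list[float]) -> float:
    """Least-squares slope of expression on allele dosage.

    The slope estimates the change in expression per additional copy
    of the allele, i.e. the additive effect of the SNP on the gene.
    """
    if len(dosages) != len(expression) or len(dosages) < 2:
        raise ValueError("need two equal-length samples of size >= 2")
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("genotype does not vary across samples")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    return sxy / sxx


# Hypothetical data: six individuals carrying 0, 1 or 2 copies of an allele,
# with expression of a nearby gene rising by about one unit per copy.
toy_dosages = [0, 0, 1, 1, 2, 2]
toy_expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]

slope = eqtl_slope(toy_dosages, toy_expression)
```

In whole blood, such a slope reflects an average over all cell types present, which is the limitation noted above.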
More recently, single-cell transcriptomics have come to the fore. Scientists can now isolate single cells (e.g. by micromanipulation with capillary pipettes, fluorescence-activated cell sorting or microfluidics) [51], then investigate transcriptional differences between individual cells within a sample, rather than assuming homogeneous expression and averaging transcription across the sample. Single-cell (sc)RNA-seq presents new opportunities to explore the inner workings of the human immune system, whether it be exploring trajectories of certain types of immune cells ordered by pseudotime, mapping immune cell lineages or investigating B- and T-cell repertoires [52–54]. Of particular future interest is the potential harnessing of scRNA-seq to identify drug or vaccine targets for modifying B- and T-cell responses [53]. Most recently, using scRNA-seq, Croote et al. [55] identified that certain IgE antibodies of peanut-allergic individuals converged upon identical gene rearrangements. Chiang et al. [56] identified that a subset of Th2 cells (Th2+) in peanut-allergic individuals demonstrated functions beyond IgE isotype switching, such as expression of cytokines that contributed to local tissue inflammation (IL-3, colony-stimulating factor (CSF)2), as well as resistance to attempted suppression by regulatory (Treg) T-cells. Widespread adoption of scRNA-seq is currently limited by high cost, high computational demand of data processing and inherent challenges in subsequent statistical analyses, specifically in relation to sample normalisation, batch effects and other sources of bias [51, 54]. However, it is anticipated that these obstacles will be gradually resolved with time. There is ongoing development of statistical and systems-based approaches designed to deal with such data, especially measures of trajectory and pseudotime [57].
Epigenomics
The epigenome is the set of heritable biochemical modifications that change gene expression, but are not coded in the DNA sequence. Epigenetics functions as a bridge between genome and transcriptome, providing mechanisms by which the micro- or macroenvironment can influence gene expression within each cell, and by which transgenerational inheritance can occur after initial exposure to an epigenome-modifying environment [58, 59]. Epigenetic signals include 1) DNA methylation at CpG islands, which silences expression of adjacent genes; 2) histone modifications (acetylation, methylation and others), whose effects vary depending on type and position of modification; and 3) noncoding RNA such as microRNA (miRNA), which can silence genes by binding or degrading complementary mRNA [60]. Together, these epigenetic markers cause changes in accessibility of a local DNA segment to transcription or regulatory factors.
Low- and high-throughput detection methods exist for each type of epigenetic signal. Methylation-sensitive restriction fingerprinting and microarrays for detecting 5-methylcytosine have been used to describe the DNA “methylome”. Genome-wide histone modifications can be detected using chromatin immunoprecipitation (ChIP). Next-generation sequencing options also exist (miRNA-seq, DNase-seq, formaldehyde-assisted isolation of regulatory elements (FAIRE)-seq, ChIP-seq, 3C-seq), which function by isolating DNA fragments that are accessible or inaccessible to a factor of interest, and sequencing those fragments to determine their identity [9]. Epigenome-wide association studies can then be performed to identify epigenetic features for a given trait or disease. Finally, like the transcriptome, the epigenome is responsive to external stimuli and varies across cell types, and most epigenomic studies of allergy have so far examined blood, skin or airway samples.
There is evidence that development and maturation of T-cell lineages is partly determined by epigenetic changes [58]. Th2 differentiation is driven by STAT6 and GATA3, resulting in epigenetic changes (DNA methylation, histone acetylation) that induce Th2-related (IL4/IL13) and suppress Th1-related (TBET, IFNG, IL-12/STAT4 pathway) expression; conversely, Th1 differentiation is driven by STAT4 and TBET to elicit the opposite epigenetic changes; finally, Treg differentiation is driven by STAT5, with associated epigenetic changes in FOXP3 and the IL10 locus [58, 60]. Given the role of epigenetics in T-cell development, it is plausible that allergic disease may be linked to altered epigenetics affecting this process. Epigenetic signals have been observed across multiple tissue types in allergy. Changes to DNA methylation have been noted in loci related to Th2 function and T-cell development (IL4R, TSLP, IFNG, FOXP3, STAT5A) [59, 61–65], while other significant loci control antigen presentation, eosinophil activity, lipid metabolism and mitochondrial function [66, 67]. The relationship between histone modifications and allergy or asthma is less clear. Some studies have shown changes to global histone acetylation with reduced deacetylating-to-acetylating (HDAC-to-HAT) activity in asthmatic lungs compared to normal [68–70]; while others suggest that HDAC inhibition can improve the suppressive function of Tregs [71]. Similarly, certain miRNAs are known to influence allergy risk. For example, Okoye et al. [72] observed that miR-155 and miR-146 may be critical in determining T-cell differentiation towards Th2 versus Th1/Th17. Other relevant miRNAs are reviewed elsewhere [73, 74]. There remain too many knowledge gaps to allow us to fully use epigenetics to our advantage in managing asthma and allergy. However, investigation of the full compendium of miRNA species is progressing rapidly, and may lead to new targeted therapeutics in the future.
An important aspect of epigenetics is the link to environmental exposures. Because the development of the immune system begins in utero and continues through infancy, environmental modifiers of epigenetic signals may have a stronger impact earlier in life. Experimental and observational studies show that maternal exposures during pregnancy and exposures during early childhood can modify the child's epigenome. These exposures include changes to diet, macro- and micro-nutrition, farm environments, infections and microbes, animals, allergens, medications, pollutants, tobacco smoke and even maternal stress [60, 75, 76]. In particular, folate and vitamin B12 are methyl donors that have a global impact on DNA methylation [60]. Finally, genome associations have been identified for methylation patterns as quantitative traits (meQTLs). These include the ORMDL3/GSDMB locus, where a SNP behaves as both an eQTL and a meQTL [77], and others [66, 78, 79]. All these findings illustrate that certain perinatal exposures can act through genetics and epigenetics to influence disease risk.
The microbiome
The microbiota is the community of microbes, including commensals and pathogens, that reside within a host or environment, while the microbiome is the genomic content that represents the microbiota. The “microbiota hypothesis”, a modern reiteration of the hygiene hypothesis, suggests that perinatal microbial exposure is vital to proper development of immune functions, especially of tolerance [80–82]. Microbial exposures may modify allergy susceptibility by initiating different trajectories of immune development and function [75]. Epigenetic changes may also be involved in this process, although the exact nature of these changes remains unclear.
The primary interfaces for host–microbe interactions are the epithelial surfaces exposed to the external environment, in the skin and respiratory and gastrointestinal tracts, so most studies on allergy microbiomes involve sampling at one of these sites directly (biopsy or surface samples) or indirectly (faecal or sputum samples). The gut is home to gut-associated lymphoid tissue, and its microbiome can influence disease at other mucosal surfaces, such as the respiratory tract [83, 84]. The respiratory microbiome may exert a direct influence on local inflammatory processes leading to asthma development [85]. The environmental microbiome may drive restructuring of host microbiomes, or modify allergy risk by other means; this may be particularly relevant in relation to the protective effect of farming environments [75]. Description of the microbiome relies mostly on quantification of DNA sequences encoding the 16S ribosomal RNA (rRNA) gene, which is common to all bacteria but contains variable regions used to differentiate taxa. The gene sequence is amplified using PCR and then examined using gel electrophoresis, terminal RFLP, microarrays or sequencing. Recently, there has been a transition to deep metagenomic sequencing, which captures the genomes of all organisms present in a sample, not just the 16S rRNA gene, and can be used to infer both taxonomic composition and function of the microbial community.
Microbiome studies are complicated by the fact that host microbiomes can change with age, season, time of day, site sampled on the host's body and geography [86, 87]. However, a number of consistent findings have been established for asthma and allergy. Features of the gut microbiome associated with allergy include early-life reduction in microbial diversity; reduced populations of Bifidobacteria, Lactobacilli and Bacteroidetes; and increased coliforms and specific Firmicutes (Staphylococci, Enterococci) [83, 84, 88]. Reversing the above changes, for instance by oral administration of certain Lactobacillus and Bifidobacterium species, may offer some protection against both the initial development of allergy and further exacerbations of atopic disease [80]. Within the airway microbiome, asthma development, symptoms and exacerbation have all been associated with increased Proteobacteria populations (especially Haemophilus, Moraxella, Streptococcus and Neisseria spp.), and reduced Bacteroidetes and Fusobacteria commensals [80, 81, 84, 85, 89]. Remarkably, these associations begin during infancy: the detection of asthma-related bacteria in the first few months of life has been associated with developing allergic asthma by primary school age [81, 83]. Although it is unclear whether microbial changes represent a cause or effect of underlying immune dysfunction, there is evidence of altered gut and airway microbial communities preceding allergic sensitisation [85, 90, 91]. Ultimately, these findings suggest two independent processes at work: microbiota, especially of the gut, exerting systemic effects on immune maturation; and microbiota causing local inflammatory processes at the sites they inhabit, including those associated with asthma in the respiratory tract.
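Microbial diversity, as referred to above, is usually summarised from taxon counts within a sample. The sketch below assumes the Shannon index, which is one widely used measure; the text does not state which diversity metric the cited studies used, and the counts are hypothetical.

```python
import math


def shannon_diversity(counts: list[int]) -> float:
    """Shannon index H = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("sample contains no reads")
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


# Hypothetical read counts per taxon in two samples of 100 reads each.
even_community = [25, 25, 25, 25]    # four taxa, equally abundant
skewed_community = [97, 1, 1, 1]     # same four taxa, one dominating

h_even = shannon_diversity(even_community)      # equals ln(4), the maximum for four taxa
h_skewed = shannon_diversity(skewed_community)  # lower, reflecting reduced diversity
```

Both samples contain the same number of taxa, so the lower index in the second sample reflects loss of evenness rather than loss of richness; the two components are worth distinguishing when interpreting reports of reduced diversity.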
Other recent studies have uncovered the potential role for nonbacterial microbes, including viruses such as human rhinovirus and respiratory syncytial virus, in causing early childhood wheeze and bronchiolitis that often precedes full-blown asthma [75, 85, 92–97]. There is evidence for the role of rhinovirus (RV), specifically RV-C, in causing severe respiratory illnesses that are associated with increased asthma risk later in life. This is further supported by evidence that a genetic locus significant for childhood asthma, CDHR3, modifies the binding and replication of RV-C, and hence infection susceptibility. The pathophysiology behind the viral associations may be related to chronic airway injury due to recurrent infection, possibly interacting synergistically with allergic mechanisms, to elicit and maintain sustained inflammation. Microbe-specific systems such as the virome and the (fungal) mycobiome may also be helpful towards understanding asthma pathogenesis. Microbiome modification and control of respiratory infection risk (e.g. through vaccines or pre/probiotic supplementation) are possible avenues for future investigation.
The exposome and environmental exposures
Researchers have frequently explored the relationship between environmental exposures and disease. The “exposome” builds on this idea by encapsulating all environmental exposures that contribute to human health and disease. The environmental microbiome, for instance, is just one type of exposure; the host microbiome itself can be considered an exposure when describing microbes residing on the skin, or on luminal surfaces of hollow viscera exposed to the external environment. It is difficult to measure all exposures, let alone on a high-throughput scale, and there are other challenges related to correlation, confounding and interaction amongst different exposures [98]. Instead, most studies have so far quantified a limited set of relevant exposures via questionnaires and environmental sampling. North et al. [99] is one of the first studies to adopt an exposomic approach to examine multiple types of exposures simultaneously, in their search for associations with childhood wheeze.
The environment can contribute to asthma and allergy pathogenesis in many ways. As mentioned in the epigenomics section, these include mechanisms acting through diet and nutrition, exposures to pets and animals, allergens, pollution, tobacco smoke and other chemical exposures. For some of these, it is possible to measure and perform high-throughput analyses on proteomic and metabolomic data. Diet is one example: a protective effect against allergy has been reported for polyunsaturated fatty acids (PUFAs) found in fish oil, and for their metabolites [100]. Higher proportions of certain very-long-chain PUFAs in plasma during childhood have been linked to reduced allergic disease in later adolescence [101]. Omics-level analyses have been adopted only slowly in food research [102], and it remains controversial whether food and dietary supplements have any impact on allergy or asthma risk (given the innumerable potential confounders). However, in the future, it may be possible to scan the contents of an individual's diet in a high-throughput manner, construct a “foodome” (combining lipidomics and metabolomics), and search for de novo associations with disease. Airborne pollutants may also be explored in a similar manner.
Environmental allergens can themselves be investigated by multiple omic approaches, in relation to quantity of exposure, geography of exposure, and allergenicity of protein structures. For instance, studies have found that a low environmental load of allergen can be a risk factor for disease [103, 104]. Timing and route of allergen exposure may also be relevant: early introduction of solids, including peanuts, may be protective, but only within a specific time window [105]. Furthermore, early exposure to peanut allergen through the skin may promote sensitisation, while exposure through the gut may promote tolerance [106]. Other studies have overlaid geographical maps of exposure with maps of disease, as has been done for traffic-related air pollution and asthma [107]. Finally, it is still not clear why allergens behave as allergens, or more specifically whether there is anything inherent in the molecular structure of putative allergens that confers allergenicity. The term “allergome” is typically used to describe the proteomics-based discovery of allergenic protein structures within individual allergens (see the section on proteomics for discussion).
Occupational and chemical exposures comprise a less common, but well-known source of irritants and allergens that cause asthmatic disease. In occupational health, the term reactive airways dysfunction syndrome is often used to refer to bronchial reactivity without an initial latency or sensitisation period [108]. Examples of culprit chemicals include isocyanates, acid anhydrides, azodicarbonamide, dyes, enzymes and metals [109]. Potential mechanisms of disease are highly variable, and may involve type I (allergic) hypersensitivity versus non-allergic/irritant, or type IV (T-cell-mediated) mechanisms; “inducers” versus “inciters”; dermal versus respiratory sensitisers; and low molecular weight (hapten-like) versus high molecular weight allergens [108]. It is likely that many culprits drive disease through mixed mechanisms. An important note here is that a current lack of evidence for IgE-mediated mechanisms with a particular chemical trigger does not rule out allergy to that chemical as a cause; limitations still currently exist in the engineering of appropriate detection methods for chemical-specific IgE [110].
As alluded to previously, environmental exposures can act through interactions with host microbiome to modify disease risk [75, 83, 84]. Maternal and perinatal exposure to rural environments confers some protection, possibly due to contact with microbial products such as lipopolysaccharide, greater diversity in microbial exposure or environmental modification of host microbiota. Caesarean deliveries and perinatal use of antibiotics may increase risk for allergy, possibly by disrupting neonatal microbial colonisation. The protective effect of oral probiotics with Lactobacilli and Bifidobacteria spp. has been reported, as noted previously, and they may also provide cross-organ protection, reducing the incidence and severity of respiratory infections [84]. The use of dietary fibre in prebiotics, with subsequent fermentation into short-chain fatty acids, may protect from allergy via Toll-like receptor and G protein-coupled receptor signalling or epigenetic modifications [80, 81]. Vitamin D has potential immune and microbiome-modifying effects, and vitamin D deficiency is a suspected risk factor for allergy [111, 112]. Breastmilk contains immunoactive molecules and may alter gut microbiota composition [80]. Altogether, these findings offer a glimpse into how multiple environmental exposures may interact in a complex fashion to elicit disease.
Proteomics, metabolomics and lipidomics
The proteome is the repertoire of proteins produced by cells or tissues, reflecting the molecular effectors and metabolic consequences of cell function. Common proteomic technologies can be grouped into antibody-based (ELISA), peak-profiling mass spectrometry (MS)-based (“fingerprinting”), gel-MS based (1D/2DG, 2D-DIGE) and liquid chromatography-MS based methods [113]. The general approach is to perform coarse separation of digested proteins into “bands” or “spots”, and then further investigate each spot by MS. The MS steps are often conducted in tandem (MS/MS) to achieve higher resolution. The information gained from MS can then be used to identify the peptide or construct its amino acid sequence. Sources of proteomic samples include sites of pathology such as the airway, in the form of cellular or fluid content from bronchoalveolar lavage, induced sputum, biopsies or in vitro cell cultures; or it may involve the usual blood or urine sample [114]. An accessible type of specimen unique to asthma research is exhaled breath condensate (EBC), which provides information on volatile compounds released from the airway.
In relation to asthma and allergy research, proteomic changes often depict nonspecific pathology, such as a generally elevated inflammatory state, as well as underlying pathological mechanisms. Therefore, recent findings in proteomics mostly mirror transcriptomic changes, in that they reflect altered functions in immunity, inflammation and antiprotease activity: affected proteins include defensins, α1 antitrypsin, α2 macroglobulin, SERPINs, S100-family proteins, apolipoproteins and complement proteins [113–116]. Of particular note is a recent study by Schofield et al. [117], which combined sputum transcriptomics and proteomics with airway histology and clinical features. They found that eosinophilic phenotypes were associated with increased blood periostin and sputum haptoglobin, while neutrophilic phenotypes were associated with increased S100A9 and MMP9. Interestingly, few recent studies have identified proteome-wide significant changes to T2-related cytokines, although associations have been found within low-throughput in vivo studies in the past [118]. It is plausible that, being upstream of signalling cascades, these T2 cytokines are less apparent in proteome-wide analyses, where significant findings tend to be dominated by downstream proteins that have been amplified via the cascades.
Another important contribution of proteomics to allergy research is allergen detection and discovery [119]. Studies have investigated a compendium of epitopes for aeroallergens such as house dust mite [120–122] and plant pollen [123–125], and for food allergens in seafood and processed foods [119]; these have served both to confirm existing epitopes and to identify new ones. Findings from these studies can be applied to nonclinical settings, such as food processing and safety [119].
Metabolomics is the systems-level study of metabolites, the nonpeptide small molecules representing the substrates and end-products of cellular activity. The two main measurement technologies used in metabolomics are nuclear magnetic resonance, which provides a spectral fingerprint of a system's metabolite constituents, and MS. As in proteomics, most metabolomic studies focus on samples of blood serum, EBC and urine from asthmatic patients [126–128]. Lipidomics is a subset of metabolomics specifically dealing with lipid molecules, and lipidomic studies have shown that allergic disease is typically associated with elevation of arachidonic acid metabolites belonging to the lipoxygenase pathway, such as leukotrienes [129, 130]. Metabolomic associations with asthma involve immune and inflammatory functions, oxidative stress and hypoxia, cellular energy homeostasis and lipid metabolism pathways [127]. These associations seem to reflect general biological stress or inflammatory pathology, rather than specificity for allergy or asthma. However, predictive and discrimination models based on metabolomic findings have shown some promise [127]. Ultimately, as was the case for transcriptomics, proteomics and metabolomics are being used to identify potential biomarkers to screen for asthma and to stratify patients into asthma subphenotypes. Additionally, in line with developments towards integrating multiple omics, proteomic, metabolomic and lipidomic methods may be applied not just to host samples, but also to environmental samples.
The phenome and physiome
Phenomics is a broad term encompassing all physical or biochemical traits (phenotypes), observable in cells or individuals, that reflect states of disease or health (“physiome”). In the case of allergy and asthma, possible phenotypes include cell types based on morphology and response (immunophenotyping); clinical biomarkers, such as antibody assays and cell counts; and the extensive physical manifestations of disease, embodied in clinical history, symptoms and signs and investigation results. These traits may be quantified and described in detail, although not necessarily using high-throughput technologies. Phenotypic traits of interest may also include nondisease states, such as clinical remission, and traits that vary with age. Integration with other omics-based datasets, and incorporation of time scales into analyses, may yield further insight as to how such resilience against asthmatic disease is conferred.
Phenome-wide association analyses, where large sets of traits are screened for enrichment of allergy-related genetic loci [131, 132], have been performed in the past, but have yet to gain widespread popularity. Phenomes and phenotypes can also be analysed by machine learning, whether it be comparison of known phenotypes (via supervised classification) or construction of new phenotypes from omic or non-omic data (via unsupervised cluster analysis). This is discussed in further detail later.
The “immunome” is a subset of the physiome that is highly relevant to allergy, and where high-throughput technologies play a major role. Immunomics broadly describes the systemic quantification of immune function by examining immune cell populations and expression of immune mediators. It may use immunoglobulin [133–135] and cytokine (proteomic or transcriptomic) arrays [136, 137] to quantify immune responses such as sensitisation, in vivo or in vitro. It can also involve leukocyte immunophenotyping and high-dimensional or mass cytometry [138–141]. The immunome is complex and varies dramatically by sampled immune cell type, tissue or organ, age and timing of sampling, especially before and after sensitisation. Using these types of data, a number of recent studies have begun exploring the “core immune signatures” of newborn infants and adults alike [142–144]. Although it is well known that allergy is a T2-driven phenomenon, it is still not clear how all the components interact to generate disease, nor is it clear how heterogeneity in disease or health is explained by immunome heterogeneity. Furthermore, the nonallergic contributions of asthma are not well understood, and it is unclear how T2 and non-T2 (particularly T1 and Th17/ILC3) mechanisms interact to generate the spectrum of disease. Future studies may be able to shed light on this.
One major aspect of human physiology that has a known impact on asthma risk and progression is sex. There are clear differences between males and females in terms of the development and physiology of the immune [145] and respiratory systems [146]. During early childhood, males have a higher incidence of asthma or wheezing illness than females; this switches during and after adolescence [147]. Pregnancy in females can often exacerbate asthma symptoms, with partial resolution postpartum. It is likely that hormonal changes during these life phases influence these changes in disease risk. Current evidence suggests that ovarian hormones and oestrogen-mediated signalling promote both Th2-related and Th17-related inflammation, while testosterone is protective against Th2-mediated inflammation [147]. While existing studies already account for sex and gender differences as important covariates or cofactors, future studies may further explore sex interactions with immunomics, in particular by investigating how sex hormones modify the function of both adaptive and innate immune cells.
Integration of omics data
Following our overview of the omics, we now discuss common techniques used to integrate and interpret omics data in allergy and asthma research.
Exploring intra- and inter-omic relations
To understand disease pathogenesis, it is natural to compare findings across different omics, and construct a multi-omic model of pathophysiology that links these various elements together. This may be a simple sequential model of causality, or a complex network of interacting components. Many studies on omic associations with allergy and asthma also search for inter- and intra-omic relationships. Relationships can take the form of direct associations, where one entity behaves as a trait for another, or an interactive effect between two entities in relation to a third entity as the trait of interest. The study of these relationships is the crux of modern systems biology.
Genomics, being the most studied system in allergy and asthma, features extensively in intra-omic and inter-omic analyses. GWAS can be found not only for clinical phenotypes (e.g. presence of allergic disease) as traits, but also for expression of transcripts (eQTL analyses), epigenetic markers (meQTLs) and intermediate phenotypes such as microbial exposures and immunomes. Recently, there has been a concerted move towards integrative genomics, and genetic effects on gene expression are now a pervasive component of modern association studies, in the form of mandatory genome-wide eQTL analyses or targeted measurements of gene transcripts [74]. Also coming into vogue is the use of Mendelian randomisation, a technique that uses genomic information as instruments to infer causal links between one trait or phenomenon and another, based on the assumption that allelic genotypes are randomly assigned as they are passed from parent to offspring [148]. The traits being linked may themselves be related to gene loci or expressed genes [149].
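As a minimal sketch of the arithmetic behind Mendelian randomisation, assuming summary statistics from two association analyses are available: the Wald ratio divides a variant's effect on the outcome by its effect on the exposure, and an inverse-variance weighted (IVW) estimate pools several variants. The function names and all numerical inputs below are illustrative assumptions rather than values from any cited study, and the standard error is a first-order approximation that ignores uncertainty in the variant-exposure effect.

```python
def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Causal effect of exposure on outcome from one genetic instrument."""
    estimate = beta_outcome / beta_exposure
    # First-order approximation: treats beta_exposure as known without error.
    std_error = abs(se_outcome / beta_exposure)
    return estimate, std_error


def ivw_estimate(betas_exposure, betas_outcome, ses_outcome):
    """Fixed-effect inverse-variance weighted estimate across instruments."""
    numerator = sum(
        bx * by / se ** 2
        for bx, by, se in zip(betas_exposure, betas_outcome, ses_outcome)
    )
    denominator = sum(
        bx ** 2 / se ** 2 for bx, se in zip(betas_exposure, ses_outcome)
    )
    return numerator / denominator


# Hypothetical per-allele effects (made-up numbers for illustration only).
single_estimate, single_se = wald_ratio(0.5, 0.1, 0.02)
pooled = ivw_estimate([0.5, 0.4, 0.25], [0.1, 0.08, 0.05], [0.02, 0.02, 0.03])
```

Both estimates are only interpretable as causal if the instruments affect the outcome solely through the exposure, which is one of the assumptions the technique depends on.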
Analyses for interactive effects with other omics also feature heavily in allergy genomics. It is unlikely that genetic and environmental factors act independently in conferring risk, so modern genomic studies often include interaction terms with exposure variables. Scientists have explored interaction effects on asthma susceptibility between genetics and exposures such as air pollution and tobacco smoke [150, 151]. Another example is the impact of allergen exposure and genetics on immune cell gene expression [152]. Interaction analyses also extend beyond environmental effects. Gene–ethnicity interaction has been investigated via admixture mapping [10]. Genetic–epigenetic interactions have been reported; some genome-wide significant loci (e.g. IL4R) may interact with nearby epigenetic signals to alter disease risk [65]. While investigation of gene–gene interaction (epistasis) is of intense interest, the overwhelming number of active genes in the human genome means that such analyses have a large statistical burden and hence remain difficult. Therefore, gene–gene interaction studies are so far limited to a few selected genes or SNPs. Polygenic risk scores tend to employ additive linear models that do not reflect epistatic effects, and it remains a challenge to integrate these in an accurate manner. Finally, interactive effects may be explored by means other than interaction terms in regression models; for example, eQTL-weighted GWAS have been reported [153].
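To illustrate what an interaction term measures, the sketch below assumes a binary genotype (carrier versus non-carrier) and a binary exposure. In that saturated setting, the gene-by-exposure coefficient of a logistic regression equals the difference between the genotype log odds ratios in exposed and unexposed individuals. The function names and the counts are hypothetical and chosen only to make the arithmetic visible.

```python
import math


def log_odds_ratio(cases_carrier, controls_carrier,
                   cases_noncarrier, controls_noncarrier):
    """Log odds ratio of disease for carriers versus non-carriers."""
    return math.log(
        (cases_carrier * controls_noncarrier)
        / (controls_carrier * cases_noncarrier)
    )


def interaction_log_or(exposed_counts, unexposed_counts):
    """Gene-by-exposure interaction on the log-odds scale.

    Each argument is a 4-tuple of counts in the order expected by
    log_odds_ratio. A value of 0 means the genotype effect is the same
    in both exposure strata (no multiplicative interaction).
    """
    return log_odds_ratio(*exposed_counts) - log_odds_ratio(*unexposed_counts)


# Made-up counts for illustration only.
exposed = (60, 40, 30, 70)      # genotype odds ratio 3.5 among the exposed
unexposed = (40, 60, 35, 65)    # genotype odds ratio of about 1.24 otherwise
ratio_of_odds_ratios = math.exp(interaction_log_or(exposed, unexposed))
```

A ratio of odds ratios above 1 would suggest that the genotype effect is stronger in exposed individuals, although a real analysis would also need standard errors and adjustment for confounders.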
Given the strong links between environmental factors and asthma, interactions with environmental exposures have been explored to a degree. Importantly, prominent gene–environmental interactions have been observed with glutathione S-transferase variants impacting on susceptibility to environmental sources of oxidative stress, such as air pollutants [154], passive exposure to tobacco smoke [155] and isocyanate [156, 157], as well as subsequent asthmatic disease [154, 158]. Similar interactions may exist for respiratory infections, given that infection and the subsequent immune-inflammatory reaction are also sources of oxidative stress. In addition, microbial and pathogen exposures have been linked to differential gene expression; for instance, viral infections are associated with changes to airway epithelial transcriptomics in asthma [159, 160]. Unsurprisingly, the exposome and microbiome have also been linked to epigenetic changes, and the various exposures are intricately entwined in complex interactions. For instance, a recent study has looked at the interaction between air pollution and the allergenicity of ragweed pollen [161]. Another recent study has identified that maternal phthalate exposure may promote allergy in subsequent generations via epigenetics [162]. Other examples concerning environmental interactions with diet and microbiome have already been discussed.
Finally, a common application of integrative omics is the use of gene ontology analysis to annotate discovered genes from genomic, transcriptomic or epigenomic analyses [163]. This makes use of a pre-curated database of functional annotations for known genes, based on existing literature, to segregate discovered genes into groups or pathways with shared functions. An example is the Gene Ontology Consortium [164]. These databases of functional annotations convey phenomic information, where cell phenotypes, functions and behaviours are organised into discrete categories. In doing so, one aims to condense diverse genome-wide findings into concise summaries of biological function that may be easier to interpret when building a conceptual model of pathophysiology. Similar annotation analyses exist for proteomics [165, 166]. A limitation of such techniques is that the annotations may not always be certain, reliable or up-to-date, and can often be vague or uninformative.
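One common way to decide whether a functional category is over-represented among discovered genes is a one-sided hypergeometric test. The sketch below is a simplified illustration of that idea and not the procedure of any particular annotation tool; the function name and gene counts are assumptions, and real analyses would also correct for testing many categories.

```python
from math import comb


def enrichment_p_value(n_background, n_annotated, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random.

    n_background: genes tested in total
    n_annotated:  background genes carrying the annotation of interest
    n_selected:   genes discovered by the omic analysis
    n_overlap:    discovered genes that carry the annotation
    """
    upper = min(n_annotated, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 8 of 100 discovered genes fall in a category that
# covers 200 of 20000 background genes (1 expected by chance).
p_value = enrichment_p_value(20000, 200, 100, 8)
```

A small p-value indicates that the category appears among the discovered genes more often than random selection would predict, which is still subject to the annotation-quality caveats noted above.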
Inter- and intra-omic relationships may be explored either by low-throughput pairings, or by high-throughput assessment of larger networks [167, 168]. However, especially with the latter, it may be difficult to account for noncausal correlations or confounders. For example, despite the hygiene hypothesis, low socioeconomic status and impoverished environments remain risk factors for the development and severity of asthma [84]. This may be due to confounding factors that coexist with poverty, including urbanised environments, exposure to allergen and pollutants, dietary intake and access to healthcare. There is no doubt that modifiers of allergy risk may co-occur, but whether this represents a causal link is another matter. Methods such as Mendelian randomisation (described earlier) may be used to disentangle this, but one must be wary of violating the numerous assumptions that underlie Mendelian randomisation. In addition, given the high dimensionality of inter- and intra-omic analyses, dimension reduction and machine learning may be used instead to identify potentially robust signals of relevance to pathogenesis.
Machine learning, dimension reduction and clustering
Machine learning is a set of methods that use computing to learn and formulate solutions from supplied data, with or without explicit human input. It is already in common use with various biomedical and ecological applications [169–171]; however, it is particularly useful when dealing with complex, high-throughput and multidimensional data, especially in cases where pre-existing human knowledge may be unavailable or insufficient to decipher the data. Machine learning approaches typically involve iteration, where an algorithm repeatedly refines a model based on observed data until a metric of model quality (e.g. objective/cost/loss function) satisfies a particular threshold. Applications of machine learning in biomedicine typically involve the exploration of data structure, or generating predictive or explanatory models of biological systems.
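The iterative refinement described above can be sketched with gradient descent on a straight-line fit, where the loss function is the mean squared error and iteration stops once the loss no longer improves by more than a tolerance. This is a toy illustration under assumed settings (learning rate, tolerance and data are arbitrary choices), not a recommended modelling approach.

```python
def fit_line(xs, ys, learning_rate=0.05, tolerance=1e-12, max_iter=100_000):
    """Fit y = slope * x + intercept by gradient descent on squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    previous_loss = float("inf")
    loss = previous_loss
    for _ in range(max_iter):
        residuals = [slope * x + intercept - y for x, y in zip(xs, ys)]
        loss = sum(r * r for r in residuals) / n
        # Stop when the improvement in the loss falls below the threshold.
        if previous_loss - loss < tolerance:
            break
        previous_loss = loss
        grad_slope = 2 * sum(r * x for r, x in zip(residuals, xs)) / n
        grad_intercept = 2 * sum(residuals) / n
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept, loss


# Synthetic data generated from y = 2x + 1.
slope, intercept, final_loss = fit_line([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
```

The same loop structure (compute loss, check a stopping rule, update parameters) underlies many of the more elaborate methods discussed in this section.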
Cluster analysis and classification are methods used to subset data samples or individuals into different groups or categories, thus giving a summary of data structure. Such methods typically employ machine learning at the most fundamental level: for instance, hierarchical clustering is an iterative process where the “objective function” would be the minimisation of within-cluster dissimilarity and/or maximisation of between-cluster dissimilarity. There is usually a subtle distinction between clustering and classification: cluster analysis is a data-driven approach, where omic data is used to generate clusters in an unsupervised fashion. The clusters can then be interpreted for hypothesis generation and testing. Conversely, classification is a hypothesis-driven approach: known phenotypes or pre-curated categories are used to determine a model of classification based on training data, which can then be applied to other datasets, or examined to look for further biological associations (figure 3).
Figure 3. Data-driven versus hypothesis-driven machine learning for integration of omic data. a) Data-driven (unsupervised) cluster analysis used to generate de novo groupings, reflective of shared pathophysiology (“endotypes”); b) hypothesis-driven (supervised) classification to compare known phenotypes or endotypes, and to allow prediction of phenotype/endotype membership for additional samples. Diagrams are for illustrative purposes only and do not convey real data.
A drawback of clustering and classification (as for other applications of machine learning) is that there is little consensus or standardisation of optimal methods, although there are certainly favoured approaches for each problem. In addition, they may be intimidating for the regular clinician or biologist to adopt, and choice of method often depends on a specialist understanding of nuances in the data. As an example: when performing cluster analysis, many decisions need to be made prior to and during the procedure. This includes how to deal with missing data; select the variables or “features” for clustering; scale or normalise features; choose the algorithm to do the actual clustering; pick the number of clusters; control for overfitting; and validate or replicate results [172]. It is not necessarily clear what the best choice is for any of these decision points.
In exploring the correlation structure and confounders in a dataset, principal components analysis or similar methods can be used to transform the dataset into uncorrelated variables or “principal components”. In doing so, we can observe which of the original variables describe similar information (i.e. are highly correlated with each other), and by plotting principal components, we can visualise the data in a way that maximises variability between samples or variables. By condensing our data to a limited selection of principal components, we can reduce the number of dimensions and simplify the input features for subsequent clustering or classification [173]. Feature selection can be limited to a single omic entity, or cover multiple omics simultaneously, depending on the question asked.
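As a rough sketch of how a principal component is obtained, the code below centres the data, builds the covariance matrix and uses power iteration to find the direction of greatest variance. Power iteration is a stand-in chosen to keep the example dependency-free; statistical packages normally use a full eigen- or singular value decomposition. The function name and toy data are assumptions.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Return (loadings, variance explained, sample scores) for the first PC.

    Assumes at least two samples and non-constant data; the all-ones start
    vector must not be orthogonal to the leading component.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
        for i in range(p)
    ]
    v = [1.0] * p
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)
    )
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centred]
    return v, variance, scores


# Two perfectly correlated toy variables: the second is twice the first.
loadings, variance, scores = first_principal_component(
    [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
)
```

Because the two toy variables carry the same information, a single component captures all of the variance, which is the sense in which correlated features can be condensed before clustering or classification.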
Cluster analysis involves separating samples in a dataset into discrete groups (clusters) based on what can be learnt from data structure, without specifying training examples for each group [172, 174]. Its objective is to minimise intragroup differences and maximise intergroup differences. Measures of difference or dissimilarity may be distance- or correlation-based. Common clustering techniques include hierarchical clustering, medoid-based methods, and latent variable modelling. Cluster analysis allows the identification of homogeneous groups within a heterogeneous dataset, and simplifies analyses to comparisons between clusters rather than across entire cohorts. Clustering can also expose confounders without explicit adjustment for correlation, especially if clustering is “guided” by cosegregating omic variables. Using molecular omic-based features, cluster analysis may allow us to determine endotypes (subtypes of disease or health states) by common biomolecular interactions and pathophysiology [175]. These can be compared with known phenotypes to explore how variation in pathophysiological mechanisms is linked to variation in disease manifestations. In addition, cluster analysis can be applied to phenome data to deal with heterogeneity in phenotypes. Using “cleaner” subphenotypes for association analyses may improve the power and specificity of subsequent findings.
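A minimal sketch of agglomerative hierarchical clustering, assuming Euclidean distance and average linkage (both are arbitrary choices here, as are the function name and the toy points): each sample starts as its own cluster, and the two closest clusters are merged repeatedly until the requested number remains.

```python
import math


def average_linkage_clusters(points, n_clusters):
    """Group point indices into n_clusters by average-linkage merging."""
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(c1, c2):
        # Mean pairwise Euclidean distance between members of two clusters.
        total = sum(math.dist(points[a], points[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest dissimilarity.
        i, j = min(
            (
                (i, j)
                for i in range(len(clusters))
                for j in range(i + 1, len(clusters))
            ),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy data with two well-separated groups of three samples each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
found = average_linkage_clusters(toy_points, n_clusters=2)
```

In practice, the choices glossed over here (feature scaling, distance measure, linkage and number of clusters) are exactly the decision points that make cluster analysis difficult to standardise.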
Classification methods determine a statistical model or decision-making algorithm that allocates individuals of a training dataset into known groups (classes) [172]. The learnt model or algorithm can then be applied to other test datasets for classification into classes. Methods include regression analysis, discriminant analysis, support vector machines and partitioning or decision trees. The objective of classification varies with the method, but mainly involves achieving the “best fit”, minimising differences between predicted and actual class allocation for the training dataset, without compromising generalisability to external datasets. Classification can be used to design diagnostic or risk stratification algorithms from an omic dataset. Each sample is labelled as one of a predefined set of phenotypes (e.g. allergic versus nonallergic asthma, eosinophilic versus neutrophilic, severe versus nonsevere), then the algorithm seeks biomolecular or clinicophysiological features that best define the phenotype [176, 177]. In the absence of predefined phenotypes, clustering and classification may be combined: clusters are generated based on a training dataset, then a classifier is devised which can classify test datasets into the discovered clusters.
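The train-then-apply workflow can be sketched with a nearest-centroid classifier, used here as a deliberately simple stand-in for the regression, discriminant, support vector and tree-based methods named above. The feature values (two hypothetical sputum cell percentages) and the labels are invented toy data, and the function names are assumptions.

```python
import math


def fit_centroids(samples, labels):
    """Learn one mean feature vector (centroid) per class from training data."""
    sums, counts = {}, {}
    for features, label in zip(samples, labels):
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], features)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in sums[label]] for label in sums
    }


def predict(centroids, features):
    """Assign a new sample to the class with the closest centroid."""
    return min(
        centroids, key=lambda label: math.dist(centroids[label], features)
    )


# Invented training data: (eosinophil %, neutrophil %) per sample.
training_samples = [(30.0, 20.0), (25.0, 25.0), (2.0, 70.0), (3.0, 80.0)]
training_labels = ["eosinophilic", "eosinophilic",
                   "neutrophilic", "neutrophilic"]
model = fit_centroids(training_samples, training_labels)
predicted = predict(model, (20.0, 30.0))
```

The same two functions could be applied to cluster labels produced by an unsupervised analysis, which is one way of combining clustering with classification as described above.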
Both cluster analysis and classification have been used extensively in asthma research. Major findings from such analyses include the discovery and characterisation of different subsets of childhood and adult asthma. Childhood wheeze has been categorised, by both traditional and machine-learning approaches, into persistent atopic wheeze of early onset, transient remitting viral wheeze, and a mixed atopic/nonatopic phenotype of variable onset [178–180]. Atopic wheeze appears to be characterised by Th2 activation, early sensitisation to allergens, greater severity of respiratory disease, greater likelihood of persistence to full-fledged allergic asthma and concurrence of other atopic diseases. In terms of adult asthma, there are subtypes based on lung function [181], as well as atopic, nonatopic, mixed and other phenotypes [175]. Eosinophilic, neutrophilic and paucigranulocytic airway inflammation can be distinguished from sputum samples, and accompanying transcriptomic, proteomic and immunomic data can provide some insight into underlying pathophysiology for each phenotype [43, 176, 182–184]. Neutrophilic, Th1/Th17-dominant, and steroid-resistant asthma tend to co-occur, suggesting a common endotype. Asthma, COPD and mixed asthma/COPD phenotypes have been explored [185]. Other studies have looked at allergy phenotypes related to degree and pattern of allergic sensitisation (mono- versus poly-sensitised; early- versus late-sensitised) [186, 187].
Clustering can also be applied to omic data other than phenotype data. In Teo et al. [85], hierarchical clustering was used to generate microbiome profile groups, which categorised the infant nasopharyngeal microbiome into discrete clusters based on microbial abundance. This facilitated simpler analysis and interpretation of otherwise complex data.
Some researchers have found that membership within asthma clusters or subphenotypes changes or transitions with age [188, 189]. This highlights an ongoing challenge of subphenotyping asthma: these phenotypes or clusters are inherently unstable and may change with age, which complicates post hoc analyses. To address this, clustering can be applied in a time-dependent manner: several research groups have used techniques (e.g. latent transition analysis) that leverage longitudinal data to model transition probabilities between clusters at different time points [189–191]. Such methods reveal which asthma phenotypes are inherently stable or unstable; at a cursory glance, it appears that early-life atopy tends to correlate with entrenched asthma in later life [191, 192]. Our own laboratory recently employed a method of cluster analysis to derive trajectories representing distinct patterns of evolving composition in the nasopharyngeal microbiome, and subsequently related these to asthma outcomes [193].
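The notion of a transition probability between clusters can be illustrated by simple counting, assuming each individual has already been assigned a cluster label at two time points. This is much cruder than latent transition analysis, which estimates cluster membership and transitions jointly within a latent variable model; the function name and the labels below are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(labels_t1, labels_t2):
    """Estimate P(cluster at time 2 | cluster at time 1) from paired labels."""
    counts = defaultdict(Counter)
    for before, after in zip(labels_t1, labels_t2):
        counts[before][after] += 1
    return {
        before: {
            after: n / sum(row.values()) for after, n in row.items()
        }
        for before, row in counts.items()
    }


# Invented labels for four individuals at two ages.
age_3 = ["atopic", "atopic", "transient", "transient"]
age_8 = ["atopic", "atopic", "no wheeze", "transient"]
transitions = transition_probabilities(age_3, age_8)
```

A row in which most of the probability stays on the same label corresponds to a stable phenotype, whereas probability spread across several labels indicates an unstable one.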
Network analysis
Network analysis is the use of networks to model and investigate systems. Networks are represented by graphs consisting of nodes and edges, where nodes represent entities (e.g. biomolecule) and the edges between nodes indicate relationships between entities (e.g. correlation, transition probability, molecular interaction). Edges can be undirected (symmetrical) or directed (asymmetrical). Many types of network analyses involve use of machine learning to generate a best-fitting network for a given dataset.
Networks are used to discover and visualise how different components in a system relate to each other, whether through abstract relations or actual molecular interactions. Bayesian network analysis involves probabilistic modelling of a network, in which edges are directed and represent conditional probabilistic dependencies between nodes. This technique has been used frequently in asthma research, for instance to identify candidate genes or SNPs associated with a bronchodilator response [194]; to quantify interactions between measured pathophysiological variables related to asthma and allergy [195]; and to describe gene regulatory networks using gene expression and GWAS data [196, 197].
Gene co-expression networks can be generated based on correlation between expression levels of different genes. High correlation reflects genes that are co-expressed and hence may be co-regulated or share a common biochemical pathway. Nodes represent genes, while edges represent correlation between them. Furthermore, edges can be weighted by degree of correlation, as in weighted gene co-expression network analysis (WGCNA); and highly connected or proximate subgraphs can be interpreted as gene modules of functional importance. WGCNA has been used to identify co-expression networks underlying helper T-cell responses to house dust mite stimulation [198], transcription networks in whole blood of asthmatics [199], an IgE-signalling gene network associated with blood lipids [168], and co-methylation models that reflect asthma endotypes [77]. Modena et al. [46] identified that adults with severe asthma have differential airway expression of gene modules corresponding to various biological functions, including epithelial growth and repair, T1/T2 inflammation, neuronal and cilia functions. In particular, certain subsets of severe asthmatics exhibited high expression of a T1-associated gene module, featuring core genes STAT1 and PARP9 as well as other notable downstream proteins (e.g. interferon-γ induced chemokines).
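As a rough illustration of how a weighted co-expression network is assembled, the sketch below computes pairwise Pearson correlations between genes and raises their absolute values to a power, which is the unsigned soft-thresholding step used in WGCNA. It is not the WGCNA package itself and omits module detection; the gene names, expression values, power (beta) and reporting cut-off are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpression_edges(expression, beta=6, min_weight=0.5):
    """Return weighted edges {(gene_i, gene_j): |r| ** beta} above a cut-off."""
    genes = sorted(expression)
    edges = {}
    for i, gene_i in enumerate(genes):
        for gene_j in genes[i + 1:]:
            r = pearson(expression[gene_i], expression[gene_j])
            weight = abs(r) ** beta  # soft threshold: weak correlations shrink towards 0
            if weight >= min_weight:
                edges[(gene_i, gene_j)] = weight
    return edges


# Hypothetical expression levels of three genes across five samples.
expression = {
    "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_b": [2.1, 3.9, 6.2, 8.0, 9.8],
    "gene_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}

edges = coexpression_edges(expression)
for pair, weight in edges.items():
    print(pair, round(weight, 3))
```

Raising the correlation to a power keeps strongly co-expressed pairs close to 1 while pushing weak correlations towards 0, so only the gene_a/gene_b edge survives in this toy example.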
Other applications of network analysis exist. For example, Pillai et al. [200] used bipartite network analysis of cytokine expression to sort patients into distinct endotypes. Hinks et al. [195] constructed a network of asthmatic individuals based on similarity in clinicophysiological parameters, then used topological data analysis to assign nodes into clusters.
Finally, the term “network analysis” has been used to describe the application of genomic, transcriptomic or proteomic data to existing networks stored in databases, specifically protein–protein interaction networks, or networks representing biomolecular pathways. This is often done to generate subsets of the original interaction networks, which are then examined for biological interpretation [201, 202]. Network databases concerning other omics may be used to achieve a similar purpose (e.g. Ingenuity Pathway Analysis, InnateDB) [198, 203].
Mathematical modelling and prediction
The ultimate goal of integrative analyses is to generate models that reliably explain biological phenomena. At the simplest level, identified omic associations can be used as biomarkers: a model is built from the strongest biomarkers and then tested on an external dataset. Many examples of such an approach exist in the literature [128, 204, 205]. At a deeper level, multiple biomarkers (potentially omic-wide) may be aggregated into a risk score, such as a genomic risk score [34]. The classification and network models discussed previously are themselves mathematical models that cover multiple omic domains and are testable on external datasets. In some of these applications, the models represent abstract attributions of risk, and strive to be useful as clinical predictive tools, rather than to be accurate or comprehensive representations of pathophysiology.
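One common way to aggregate many biomarkers into a single score is a weighted sum, in which each variant's risk-allele count is multiplied by an effect-size weight. The sketch below assumes this additive form; the variant names, weights and genotype are hypothetical, and real genomic risk scores involve further steps (variant selection, weight estimation, external validation) that are not shown.

```python
def risk_score(dosages, weights):
    """Additive risk score: sum of (effect-size weight x risk-allele count)."""
    missing = set(weights) - set(dosages)
    if missing:
        raise KeyError(f"no genotype for variants: {sorted(missing)}")
    return sum(weights[variant] * dosages[variant] for variant in weights)


# Hypothetical per-variant weights (e.g. log odds ratios from a prior study).
weights = {"snp_1": 0.20, "snp_2": -0.10, "snp_3": 0.05}

# Hypothetical individual: number of risk alleles (0, 1 or 2) at each variant.
person = {"snp_1": 2, "snp_2": 1, "snp_3": 0}

score = risk_score(person, weights)
print(round(score, 3))
```

Such a score is an abstract attribution of risk: it ranks individuals but says nothing about the mechanism through which each variant acts.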
Another approach is to observe the consequences of perturbing a system, and infer normal function based on the results [6]. These perturbations may include gene knockouts in animal or cell models [206], neutralising antibodies or receptor antagonists to observe subsequent disruption of function [207, 208], or simple observation of distribution of perturbations among cases versus controls in an observational study (e.g. GWAS). Perturbations may be deliberate and controlled, targeting single genes or molecules; or they may be randomised and wholesale, in keeping with the “systems” philosophy (e.g. random mutagenesis studies in animal models). While such approaches have been employed extensively in biological research, there have been few in-depth explorations of the consequences of perturbation at multi-omic levels, at least in asthma research. Research groups with access to multi-omic datasets may be well positioned to begin exploring such questions.
Based on existing knowledge from biomarker and perturbation studies, it may be possible to generate in silico mathematical models that describe a complete biological subsystem in terms of components, interactions and functions, and then describe their perturbation during disease. Modelling biological systems in such a manner is challenging, as there are still many unknowns about their components. However, this has not stopped researchers from trying; for example, Höfer et al. [209] modelled the IL4-dependent activation of GATA3 transcription in T2 development. Multi-scale approaches have been used to describe multiple levels of biological function, from intracellular molecular processes, to cell-to-cell communication, to organ-level function. For instance, Lauzon et al. [210] formulated a model of airway hyperresponsiveness that accounted for actin–myosin mechanics, calcium signalling in airway smooth muscle (ASM) regulation, mechanical forces of airway narrowing, and time-dependent distribution of ASM contraction throughout the lung. Such approaches require knowledge of techniques that use differential equations and state diagrams; a review of these approaches is provided elsewhere [211]. However, since such models are usually generated with data from in vitro systems or animal models, it remains an ongoing challenge to test their relevance to in vivo human systems, and they should therefore be treated with caution [212]. Upcoming projects such as the Human Cell Atlas [213] seek to address some of these challenges, bridging the gap between cell biology and clinical medicine.
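For readers unfamiliar with differential-equation models, the sketch below integrates a single generic equation for a self-activating transcription factor that also responds to an external signal. It is not the published model of Höfer et al. [209] or Lauzon et al. [210]; the equation form (basal production, a Hill-type positive feedback term and first-order decay), every parameter value and the simple Euler integration scheme are assumptions made purely for illustration.

```python
def simulate(pulse_start=None, pulse_end=None, t_end=40.0, dt=0.01):
    """Euler integration of dx/dt = basal + signal(t) + feedback(x) - decay * x."""
    # Hypothetical parameters in arbitrary units.
    basal, vmax, k, hill, decay, signal = 0.01, 1.0, 0.4, 2, 1.0, 0.2
    x = 0.0  # transcription factor level, starting from zero
    for step in range(int(round(t_end / dt))):
        t = step * dt
        pulse_on = pulse_start is not None and pulse_start <= t < pulse_end
        s = signal if pulse_on else 0.0
        # Hill-type positive feedback: the factor promotes its own production.
        feedback = vmax * x ** hill / (k ** hill + x ** hill)
        x += dt * (basal + s + feedback - decay * x)
    return x


no_signal = simulate()                               # never stimulated
after_pulse = simulate(pulse_start=5, pulse_end=15)  # transient stimulation only

print(round(no_signal, 3), round(after_pulse, 3))
```

With these assumed parameters the unstimulated system settles at a low level, whereas a transient signal pushes it into a high state that persists after the signal is withdrawn. This switch-like behaviour is one reason such models are used to describe commitment decisions in cell differentiation.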
Pitfalls and challenges
Many challenges remain for systems biology. There are methodological challenges associated with statistical power, even in large consortia. This is due to the sheer scale of omic data, and the number of possible omic–omic comparisons or interactions. Next-generation technologies are becoming cheaper and more efficient, but the volume of data they generate will continue to pose a statistical challenge. Furthermore, the theory behind statistical and modelling methods still lags behind, and there is currently little consensus on the optimal systems-level pipelines (e.g. for RNA-seq). Although research groups have recently been paying more attention to measurable environmental exposures in terms of their impact on biological systems [99], the lack of environmental data, and the uncertainty about which exposures actually matter, hinder examination of gene–environment interactions. Finally, even if we have a sufficiently powered sample, there is a so-called Faustian bargain [214], where large sample sizes introduce heterogeneity in cases and controls, thus obscuring findings. There is also a similar problem of the winner's curse [215], where significant results in an ome-wide study tend to show effect sizes larger than their true values.
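The winner's curse can be demonstrated with a short simulation. In the sketch below, many hypothetical studies each estimate the same true effect with Gaussian noise, and only those passing a significance threshold are retained. The true effect, standard error, number of studies, threshold and random seed are all arbitrary assumptions.

```python
import random
import statistics


def simulate_winners_curse(true_effect=0.1, std_error=0.1,
                           n_studies=10000, z_crit=1.96, seed=1):
    """Compare the mean estimate across all studies with the mean across
    only those studies that reach statistical significance."""
    rng = random.Random(seed)
    # Each study observes the true effect plus Gaussian sampling noise.
    estimates = [rng.gauss(true_effect, std_error) for _ in range(n_studies)]
    # Keep studies whose z-statistic exceeds the two-sided threshold.
    significant = [e for e in estimates if abs(e) / std_error > z_crit]
    return statistics.mean(estimates), statistics.mean(significant)


mean_all, mean_significant = simulate_winners_curse()
print(round(mean_all, 3), round(mean_significant, 3))
```

Averaged over all studies, the estimate sits close to the true effect, but the average among significant studies is inflated, because an underpowered study can only cross the threshold when noise happens to exaggerate the effect.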
Machine learning has been the go-to tool to handle phenotypic heterogeneity [216]. However, many biologists and clinicians remain sceptical of it, with concerns about its hype or fad-like status, its opaque “black box” nature, and the perceived lack of clear, consistent or immediately-applicable results [217]. Moore et al. [181] were among the first groups to apply unsupervised cluster analysis to an adult asthma cohort and identify distinct clusters. While the clusters themselves proved useful in describing disease risk and severity profiles in the discovery population, subsequent studies attempting to replicate these clusters in other cohorts have had mixed results [218]. Numerous other studies have identified different sets of clusters based on different parameters and populations [219]. Results of machine learning methods may vary significantly depending on the nature of the input data, in terms of its quality, its relevance to the disease being studied, its depth (resolution of data: categorical versus continuous) and breadth (single versus multiple biological domains) and its balance (one domain prioritised over another versus all treated equally). The variability in research outcomes may suggest to some that machine learning methods are ultimately unreliable, but the field is still growing, and we argue that it simply illustrates the immense complexity of biomedical systems, complexity that will remain impenetrable if we limit ourselves to traditional expert-driven approaches. Unsupervised machine learning can serve as a springboard for future hypotheses: Lazic et al. [220] used unsupervised latent class analysis to identify a high-risk multiple-sensitised subgroup, whose pathophysiological origins in early life may be worth exploring in further detail with hypothesis-driven approaches. 
Ultimately, a balance of human expertise and machine learning will be necessary to make the right decisions about data input and interpretation, and to transform big data into biomedically relevant results.
Systems biology is multidisciplinary, and with this comes another challenge: communication and collaboration between the various disciplines. There is often a conflict of priorities: a clinician might be more interested in diagnosis, treatment and prognosis; an immunologist in the pathophysiology of allergy and asthma; the biostatistician in making sure that the statistics and modelling are sound; and the bioinformatician in generating clean data and writing problem-free code. There may be residual scepticism amongst some biologists or clinicians who perceive systems approaches as “data fishing” [9]. There is some evidence to suggest that multidisciplinary research projects have greater difficulty in getting funded or making a strong scientific impact [221], and this may reflect the challenge of balancing multiple priorities and conveying different perspectives to a broad audience, more so than the actual quality of the writing or research.
Multiple reviews have highlighted the ongoing inaccessibility of systems approaches to many biologists and clinicians, and have recommended the creation of biologist-friendly tools [138, 211, 222]. While this may indeed be helpful for common or simple analyses, there remains an ongoing need for specialist input in developing and using new tools. Tools are only useful if applied correctly, and a research group should not eschew specialist statistics or informatics input, simply to save costs or to keep things simple. Clearly, systems biology is itself very diverse, covering multiple avenues of inquiry. Subspecialties are likely to emerge within the field, each focusing on specific methodologies and their applications. It is likely that there will be a demand for specialists and generalists alike, and the movement of tertiary institutions towards incorporating mathematics, statistics and informatics in undergraduate biomedical courses is certainly a welcome one.
Future directions and concluding statements
The recent developments in systems biology exemplify the global drive towards systems medicine [187, 223], and more broadly, “P4” medicine: predictive, preventative, personalised and participatory [224]. Our ultimate objective is to achieve a critical level of biomedical understanding that permits development of precise and personalised interventions for individual patients. The employment of systems biology in asthma research represents the first step towards achieving this goal. Omics-level research allows us to home in on the myriad pathological changes that contribute incrementally to disease, and to attempt reversal of these changes by addressing new therapeutic targets (supplementary table S1). Omics–omics integration may enable us to connect these changes and visualise how they operate in each individual patient. Another important contribution of systems biology is the ongoing clarification of the hygiene/microbial hypothesis; the interaction between genetics and environment; and the future possibility of environmental and host microbiota modification to manage or prevent disease. Therapies may be tailored to individual patients depending on their underlying endotype or pattern of pathophysiology (e.g. anti-T2 biologic treatment for allergic asthma; combined anti-Th2/Th17 treatment for steroid-resistant disease), though the exact implementation of such precision medicine remains a work in progress.
Worldwide, there has been a push by many groups to implement systems medicine, charting a path from wet lab to dry lab to bedside. Large consortia, such as MeDALL (Mechanisms of the Development of Allergy) in Europe [187] and STELAR (Study Team for Asthma Life Research) from the UK [216], have been established specifically to record and integrate multi-omic data related to allergy and asthma, and conduct well-powered systems-based analyses. Other smaller groups are also involved in similar research via frequent cross-collaborations: these include the Childhood Asthma Study (Australia) [85], U-BIOPRED (Unbiased Biomarkers in Prediction of Respiratory Disease Outcomes; Europe) [225], Childhood Origins of Asthma (USA) [93], Copenhagen Prospective Study on Asthma in Childhood (COPSAC; Denmark) [14], Manchester Asthma and Allergy Study (UK) [186], Severe Asthma Research Program (USA) [182] and others. In the modern age of systems biology, collaboration and data sharing are virtually mandatory when it comes to uncovering complex associations such as gene–environment interactions.
Overall, systems biology has yielded fruitful outcomes in asthma research, and promises to deliver more in the future. At the moment, we are still a long way from truly personalised medicine, that is, from being able to predict with reasonable accuracy the disease or prognosis of an individual based on well-sampled data. However, we can only expect the field to grow exponentially in the years to come.
Supplementary material
Please note: supplementary material is not edited by the Editorial Office, and is uploaded as it has been supplied by the author.
Supplementary material ERJ-00844-2019.SUPPLEMENT
Footnotes
This article has supplementary material available from erj.ersjournals.com
Conflict of interest: H.H.F. Tang has nothing to disclose.
Conflict of interest: P.D. Sly has nothing to disclose.
Conflict of interest: P.G. Holt has nothing to disclose.
Conflict of interest: K.E. Holt has nothing to disclose.
Conflict of interest: M. Inouye has nothing to disclose.
Support statement: Supported by the National Health and Medical Research Council (Grant: APP1114753). Funding information for this article has been deposited with the Crossref Funder Registry.
- Received April 29, 2019.
- Accepted September 12, 2019.
- Copyright ©ERS 2020