Abstract
RNA sequencing is a powerful tool for high-resolution transcriptomic analysis and can be leveraged to better understand the molecular underpinnings of diverse lung diseases http://bit.ly/2mqBnl4
Introduction
With the evolution of high throughput sequencing technologies, the past decade has seen an exponential rise in the use of RNA sequencing (RNA-seq). RNA-seq has deepened our understanding of biological systems to unprecedented levels of resolution, identifying not only gene expression signatures but also regulatory RNA molecules that may play critical roles in disease pathogenesis. Pulmonary research has quickly incorporated this technology, from characterising the IL-17 signature of steroid-unresponsive COPD patients [1] to discovering pathogen–host interactions of Mycobacterium tuberculosis [2]. This mini-review provides an overview of core features and applications of RNA-seq to familiarise non-experts with the methodology and how it has impacted our understanding of lung pathophysiology.
Basic principles of RNA-seq
Messenger and non-coding RNA together comprise the transcriptome, a dynamic representation of the functional elements of a cell or organism's genome in a given physiological condition [3]. Earlier methods to analyse the transcriptome consisted of sequencing short segments of complementary DNA (cDNA) with Sanger-based methods or using gene expression microarrays, in which cDNA is quantified by hybridisation with known probes. These approaches were limited by time, incomplete characterisation of sequences, background noise, and reliance on reference genomes [3]. In contrast, RNA-seq applies next generation sequencing to read tens to hundreds of millions of transcripts simultaneously, capturing subtle genetic variations and known or novel RNA isoforms with or without a reference genome.
While several RNA-seq platforms are available, most share a common pathway consisting of isolating RNA, generating cDNA libraries, amplifying and sequencing cDNA, and analysing data (figure 1).
Figure 1. RNA-seq pipeline: a typical RNA sequencing experiment consists of designing the experiment, isolating RNA from the desired cells or tissue, generating cDNA libraries prior to or following fragmentation, sequencing cDNA, and processing and analysing data.
Experimental design
Biological, technical and statistical biases can be minimised by thoughtful experimental design with clearly defined objectives and by performing quality control assessments at each stage of the RNA-seq pipeline [4]. Selection of the specimen to be sequenced is guided by the question being asked; induced sputum [5], peripheral blood cells [6] and bronchoalveolar lavage fluid [7] may be useful for surveying the immune landscape and identifying potential biomarkers less invasively, for example yielding important insights into the Th2 signatures of asthmatics [6], whereas structural mechanisms may be better interrogated with nasal or bronchial brushings [8, 9], biopsies [10] or lung explants [11, 12]. Nasal brushings are being increasingly used as they are more accessible than bronchial specimens, provide high quality RNA [9] and may be used as a surrogate for bronchial epithelial cells in functional studies [13]. Isolating specific cell types from heterogeneous tissue can be achieved by flow-based strategies [14, 15] but is often constrained by marker specificity. Cultured primary cells may not retain the characteristics of parent tissue and animal- and cell line-based models may provide only limited insights into human disease, yet these may be more accessible and controlled than human samples. Each approach also poses unique technical challenges; sputum collection times can influence microbiotic diversity [16] while explanted tissue is often subject to long ischaemic times, potentially affecting RNA yield and quality [17]. RNA preservation medium can improve yield and quality, and novel techniques have been developed to increase biomarker concentration from bronchoscopy specimens [18]. Table 1 highlights some of the key features of the main respiratory specimen types.
Table 1. Main specimen types that have been used for respiratory RNA sequencing studies and technical considerations
Another consideration is whether the information sought is primarily quantitative (e.g. assessing differential gene expression in lung endothelial cells in idiopathic pulmonary arterial hypertension (PAH) [11]) or qualitative (e.g. identifying fusion genes in lung cancers [19]). This determines the need for biological replicates and appropriate sequencing coverage (portion of targeted region being sequenced) and depth (number of independent reads) [20]. Read depth is a critical determinant of sequencing accuracy and the ability to identify sequence variants. Appropriate coverage and depth may vary widely depending on the cell type or developmental stage [21]. Long reads, paired-end reads, and strand specificity are important for de novo transcript assembly and isoform classification but come at a trade-off of increased cost and reduced depth [22, 23].
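As a rough illustration of the two quantities defined above, the sketch below computes the fraction of a target region touched by at least one read (coverage) and the mean number of reads overlapping each base (one common way of expressing depth). The function name, the half-open interval convention and the toy reads are assumptions made for this example only; real pipelines derive these values from alignment files.

```python
def coverage_and_depth(target_length, reads):
    """Return (coverage fraction, mean per-base depth) for a target region.

    reads: iterable of (start, end) positions on the target, 0-based and
    half-open, i.e. a read covers bases start .. end - 1 (an assumption of
    this sketch, not a convention taken from the article).
    """
    per_base = [0] * target_length
    for start, end in reads:
        # Clip each read to the target so out-of-range reads do not fail.
        for pos in range(max(start, 0), min(end, target_length)):
            per_base[pos] += 1

    covered_bases = sum(1 for depth in per_base if depth > 0)
    coverage = covered_bases / target_length
    mean_depth = sum(per_base) / target_length
    return coverage, mean_depth


# Toy example: three short reads on a 10-base target.
coverage, mean_depth = coverage_and_depth(10, [(0, 5), (3, 8), (4, 6)])
print(f"coverage = {coverage:.2f}, mean depth = {mean_depth:.2f}")
```

In this toy case the last two bases are never sequenced, so adding more reads over the first eight bases would raise depth without improving coverage.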
RNA isolation, library preparation and sequencing
Obtaining adequate RNA while preserving integrity is critical for successful RNA-seq [20]. This can be challenging, as RNA is notoriously susceptible to degradation, especially in clinical or formalin-fixed paraffin-embedded specimens. The advantages of fresh samples must be weighed against biases introduced by batch effects. A commonly used metric of quality is the RNA Integrity Number (RIN), which reflects RNA degradation as a ratio of 28S to 18S rRNA but is less informative regarding sample purity. RIN scores of 7 or greater are often recommended for RNA-seq. Ultraviolet absorbance is a rapid method of assessing RNA quality and concentration but lacks specificity for RNA and does not directly assess degradation.
Extraction is carried out with silica gel membrane columns or organic solvents (which must be subsequently removed to prevent interference with downstream steps, including reverse transcription). Transcript RNA can be enriched from total RNA by poly(A) selection using oligo(dT) primers and depletion of ribosomal RNA (rRNA) with enzymatic digestion or probe-based elimination. A high proportion of poly(A) tails may be lost in degraded RNA; to avoid underrepresenting the transcriptome, alternative enrichment approaches should be considered, such as exome capture [17, 24]. Enriched RNA is fragmented to fit the size range of the sequencing platform.
Reverse transcription generates cDNA inserts which are then ligated to adapters; these serve as amplification signals and sample-specific barcodes. cDNA libraries are usually amplified by PCR before sequencing. Commonly used commercial sequencing platforms include Illumina, SOLiD, PacBio and Ion Torrent, which vary based on library preparation, sequencing chemistry and the characteristics of data generated, such as maximum sequence read length (e.g. PacBio can produce significantly longer reads). To minimise cost and time, high throughput sequencers can combine (i.e. “multiplex”) libraries from several experiments into a single sequencing reaction.
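To make the role of sample-specific barcodes concrete, the following minimal sketch separates pooled reads back into their samples of origin ("demultiplexing"). It assumes, purely for illustration, that the barcode is the first few bases of each read and must match exactly; the function name, barcodes and sample labels are hypothetical, and commercial platforms typically read barcodes separately and tolerate mismatches.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_length):
    """Group pooled reads by sample using an exact-match inline barcode.

    Assumption of this sketch: each read string starts with its barcode,
    followed by the cDNA insert sequence.
    """
    bins = defaultdict(list)
    for read in reads:
        barcode = read[:barcode_length]
        insert = read[barcode_length:]
        # Reads whose barcode is not recognised are kept apart, not dropped.
        sample = barcode_to_sample.get(barcode, "undetermined")
        bins[sample].append(insert)
    return dict(bins)


# Hypothetical barcodes and pooled reads from two libraries.
barcodes = {"ACGT": "sample_1", "TTAG": "sample_2"}
pooled = ["ACGTGGCCA", "TTAGCATCA", "ACGTTTTAC", "GGGGAAAAC"]
print(demultiplex(pooled, barcodes, barcode_length=4))
```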
Single-cell RNA-seq
Single-cell RNA-seq (scRNA-seq) has rapidly gained popularity to assess gene expression in individual cells. Whereas bulk RNA-seq profiles transcriptomes averaged across a pool of cells, scRNA-seq can resolve mixed populations including those in tumours, complex organs, and progenitor and immune cells. For example, scRNA-seq delineated 52 stromal cell subtypes in the lung tumour microenvironment [25] and identified aberrant alveolar epithelial subtypes associated with idiopathic pulmonary fibrosis (IPF) [15]. Recently, scRNA-seq and lineage tracing of human and mouse airway cells revealed “pulmonary ionocytes”, a novel epithelial cell critical for CFTR activity [26, 27].
ScRNA-seq employs a powerful amplification process to sequence transcripts from minute quantities of RNA. Advanced microfluidics (e.g. Drop-seq) enable separation of tens of thousands of cells into nanolitre-sized droplets [28]. Nuc-seq has further developed this technology to sequence single-nuclear RNA from frozen tissue [29]. Clustering similar cells is important for generating a more comprehensive transcriptome and reducing dimensionality [30]. Major challenges of scRNA-seq include cost, low gene expression levels and ensuring complete dissociation of cells, particularly in solid tissues. These factors limit sequencing depth and the capacity to detect rare transcripts [20, 30].
Data processing and analysis
Processing the “big data deluge” inherent to omic studies can be daunting, requiring thoughtful selection from a broad range of bioinformatic tools. After obtaining raw reads, sequencing artefacts (e.g. from adapters and very short or low complexity reads) are filtered out. The next step, aligning reads to a reference genome or transcriptome [31], is perhaps the most computationally demanding; each read must be accurately and uniquely mapped, considering sequencing errors versus true biological variation (i.e. insertions, deletions or single nucleotide polymorphisms) [32]. Sophisticated algorithms can reconstruct full transcripts, assemble de novo transcripts, quantitate gene expression, examine small RNA species and long non-coding RNAs, and detect alternative splicing, gene fusion and variants [33, 34]. Commonly used RNA-seq aligners include TopHat2, STAR and HISAT. Alignment accuracy can vary widely depending on factors including the number of intron-spanning reads and transcript annotation [35]. The percentage of mapped reads is an important metric of overall sequencing quality. While unmapped reads often reflect sequencing errors or contamination by rRNA or genomic DNA, they may also represent unannotated transcripts or true biological variants [36]. Conversely, multimapped reads can result from isoforms of identical sequences [20]. Reads are quantified and normalised by transcript length and library size.
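The final normalisation step can be sketched with transcripts per million (TPM), one widely used scheme that corrects for both transcript length and library size. The text does not specify which normalisation is meant, so TPM is chosen here only as an example; the function name, gene labels and counts are hypothetical.

```python
def transcripts_per_million(counts, lengths_bp):
    """Convert raw read counts to TPM.

    counts: dict of gene -> mapped read count
    lengths_bp: dict of gene -> transcript length in base pairs
    """
    # Step 1: correct for transcript length (reads per kilobase).
    per_kilobase = {
        gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts
    }
    # Step 2: correct for library size so that values sum to one million.
    scale = sum(per_kilobase.values()) / 1_000_000
    if scale == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: rate / scale for gene, rate in per_kilobase.items()}


# Hypothetical example: gene_B has three times the reads of gene_A but is
# also three times as long, so both receive the same TPM.
raw_counts = {"gene_A": 100, "gene_B": 300}
lengths = {"gene_A": 1000, "gene_B": 3000}
print(transcripts_per_million(raw_counts, lengths))
```

Because TPM values always sum to the same total within a sample, they describe relative abundance; the differential expression packages discussed in this review generally take raw counts as input and apply their own normalisation.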
With normalised quantitative data in hand, one can now ask: what are the differences in gene expression between states? Which pathways are activated? These questions can be answered in a multitiered way, usually beginning with differential gene expression. Genes are ranked by the strength of the signal and p-values adjusted for multiple testing using statistical packages such as DESeq, edgeR or limma/voom, which differ based on the distribution model and treatment of sample size, replicates, sequencing depth and overdispersion (i.e. observed versus theoretical dispersion). For example, DESeq2 models data using a negative binomial distribution and Bayes theorem to estimate dispersion [37], whereas limma/voom employs linear modelling and estimates the mean–variance relationship of gene counts to weight each observation [38]. These packages have been compared systematically [39, 40].
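Adjustment for multiple testing is needed because thousands of genes are tested at once. The sketch below implements the Benjamini-Hochberg false discovery rate procedure, which is the default adjustment reported by the packages named above; it is a plain re-implementation for illustration, not code from any of those packages, and the example p-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling each by m / rank and
    # enforcing that adjusted values never increase as rank decreases.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"raw p = {p:.3f}, adjusted p = {q:.3f}")
```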
Because individual genes may exert only a weak phenotypic effect, pathway enrichment analysis with databases of functionally annotated gene sets (e.g. GO, KEGG or Reactome) can be used to group differentially expressed genes based on biological processes or molecular functions. This not only highlights key regulatory pathways but reduces the dimensionality of the statistical analysis [41]. Co-expression network analysis, which identifies networks of highly correlated genes to infer gene function and gene-disease associations [42], and integration with other omic datasets including genomic, methylation, microRNA, proteomic and metabolomic studies, can complement RNA-seq to model biologically correlated behaviours [20, 43]. For example, parallel microRNA and mRNA-seq data from IPF and COPD lungs were integrated to create a gene/miRNA regulatory network, identifying MIR96 as a regulator of the p53/hypoxia lung response to environmental injury [12].
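One simple form of pathway enrichment analysis is an over-representation test: given a list of differentially expressed genes, is a particular annotated gene set represented more often than expected by chance? The sketch below uses a one-sided hypergeometric test for this. It is only one of several approaches, and the function name, parameter names and gene numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(n_background, n_in_set, n_de, n_overlap):
    """One-sided hypergeometric p-value for gene set over-representation.

    n_background: number of genes tested in the experiment
    n_in_set: number of background genes annotated to the gene set
    n_de: number of differentially expressed (DE) genes
    n_overlap: number of DE genes that belong to the gene set
    Returns the probability of seeing n_overlap or more by chance.
    """
    total_draws = comb(n_background, n_de)
    max_overlap = min(n_in_set, n_de)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_de - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total_draws


# Hypothetical numbers: 20 of 200 DE genes fall in a 100-gene pathway,
# out of 10,000 genes tested (2 would be expected by chance).
print(enrichment_p_value(10_000, 100, 200, 20))
```

When many gene sets are tested, the resulting p-values themselves require adjustment for multiple testing.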
Experimental validation
RNA-seq is a powerful tool for discovery, but must be augmented by other experimental approaches to validate findings, establish causality and elucidate mechanism. Gene signatures may be validated at the protein expression level and further investigated with cell- and animal-based functional assays. After combining RNA-seq and microarray findings to identify senescence markers in IPF lung, Schafer et al. [44] assessed p16 expression in fibrotic lung tissue via immunostaining, characterised the profibrotic secretome in senescent fibroblasts, and then found that clearance of senescent cells improves lung compliance in experimental fibrosis. EFNA1, a gene related to BMPR2 signalling found to be downregulated in human PAH endothelial cells by RNA-seq, was knocked down in vitro and in mice to recapitulate PAH phenotypes [11]. Finally, as an example of how RNA-seq targets might be translated directly to clinical therapies, the discovery of PD-1-positive progenitor cells of type 2 innate lymphoid cells (ILC2s) by scRNA-seq led to reduction of acute lung inflammation in mice treated with antibody-depleted PD-1hi ILCs [45] and highlighted the potential for PD-1 inhibitors, already widely used in cancers, to treat ILC2-mediated disorders such as allergic asthma.
Such studies underscore the capacity of RNA-seq to generate hypotheses and uncover novel therapeutic targets. It has thus emerged as the method of choice for transcriptome analysis, although challenges, including cost (typically several hundred dollars per sample, and higher for scRNA-seq, greater sequencing depths and lengths and more sensitive platforms), remain. Moreover, the volume and complexity of data require a level of bioinformatic and statistical expertise currently in short supply. Despite its enormous potential, direct clinical applications of RNA-seq remain limited. RNA-seq has been integrated with machine learning to diagnose IPF from transbronchial biopsies [46] and mild/moderate asthma from nasal brushings [9]. Other RNA-based technologies, primarily using quantitative real-time PCR and microarrays, are already employed in the clinical setting, including for the detection of respiratory RNA viral agents (e.g. severe acute respiratory syndrome and Middle East respiratory syndrome) and for lung cancer diagnosis and prognostication [47]. With the strong foothold RNA-seq provides for resolving complex gene structures and functions, future efforts should be directed not only towards gaining a comprehensive understanding of lung pathobiology, but also translating discoveries to clinically useful biomarkers and targets for precision therapies.
Footnotes
Conflict of interest: S.G. Chu has nothing to disclose.
Conflict of interest: S. Poli De Frias has nothing to disclose.
Conflict of interest: B.A. Raby reports grants from National Institutes of Health (P01HL114501), during the conduct of the study.
Conflict of interest: I.O. Rosas reports grants from National Institutes of Health (P01HL114501), during the conduct of the study.
- Received August 26, 2018.
- Accepted September 16, 2019.
- Copyright ©ERS 2020