Understanding transcriptomics: A practical guide to RNA-sequencing approaches

Introduction

When people think about the genome, DNA is usually the first thing that comes to mind. But for genetic information to become biologically meaningful, genes first need to be transcribed into RNA. The complete set of RNA molecules present in a biological sample at a given time is known as the transcriptome.

The transcriptome includes much more than messenger RNA (mRNA). Alongside protein-coding transcripts, cells also produce many types of non-coding RNA, including long non-coding RNAs (lncRNAs), microRNAs (miRNAs), circular RNAs (circRNAs), and other regulatory RNA molecules. Because RNA expression changes dynamically across conditions, transcriptomics provides a functional snapshot of cellular activity that cannot be captured from DNA alone.

Why does transcriptomics matter?

Transcriptomics helps researchers understand how genes respond under different biological conditions, disease states, or environmental stimuli.

Today, many RNA-Seq studies focus not only on mRNA, but also on important non-coding RNA classes such as lncRNA, miRNA, small RNA (sRNA), and circRNA. Because these RNA populations have very different biological characteristics, the RNA enrichment strategy — whether poly(A) selection or rRNA depletion — plays an important role in determining transcriptome coverage, sequencing efficiency, and downstream analysis quality.

RNA-Seq approaches: Choosing the right strategy

mRNA-Seq

mRNA-Seq remains the most commonly used RNA-Seq approach and focuses primarily on protein-coding transcripts. It is widely used for gene expression profiling, differential gene expression analysis, and identifying pathways associated with disease or biological processes.

For eukaryotic samples with good RNA quality and sufficient input material, poly(A) selection is the standard library preparation strategy. This method uses oligo-dT capture to enrich RNA molecules carrying poly(A) tails, helping concentrate sequencing reads on protein-coding transcripts. It works particularly well for high-quality and relatively intact RNA samples.

For degraded RNA samples — such as FFPE tissues or samples with low RNA Integrity Number (RIN) values — as well as prokaryotic RNA, which typically lacks stable poly(A) tails, rRNA depletion is often a better option. In these workflows, ribosomal RNA is removed before library preparation, and cDNA synthesis is usually performed using random priming. Because random primers bind across multiple positions along RNA molecules independently of poly(A) tails, this approach is more compatible with fragmented or non-polyadenylated RNA.

Library type – stranded or non-stranded – should be selected based on RNA quality, study goals, and the level of transcript resolution required (see “Stranded vs. Non-Stranded Libraries”).

For standard differential gene expression analysis, approximately 20 million paired-end 150 bp (PE150) reads (~6 Gb) is generally sufficient. More advanced transcriptome analyses often require at least 40 million PE150 reads (~12 Gb).
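The read counts and data volumes quoted above are linked by simple arithmetic: total bases = number of fragments × read length × reads per fragment (2 for paired-end, 1 for single-end). The sketch below is a minimal illustration, assuming that "20 million PE150 reads" means 20 million read pairs, which is the reading under which the ~6 Gb figure works out. The function name `data_volume_gb` is hypothetical.

```python
def data_volume_gb(fragments_millions: float, read_length_bp: int,
                   paired_end: bool = True) -> float:
    """Approximate sequencing output in gigabases (Gb).

    Assumption: `fragments_millions` counts sequenced fragments
    (read pairs for paired-end runs), in millions.
    """
    reads_per_fragment = 2 if paired_end else 1
    total_bases = fragments_millions * 1e6 * read_length_bp * reads_per_fragment
    return total_bases / 1e9


if __name__ == "__main__":
    print(data_volume_gb(20, 150))                    # standard DGE depth
    print(data_volume_gb(40, 150))                    # deeper transcriptome analyses
    print(data_volume_gb(20, 50, paired_end=False))   # a single-end 50 bp run
```

The same helper can be used to check a sequencing quote against a target depth before a project starts.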

Ultra-low input RNA-Seq

Many transcriptomics studies involve biologically or clinically limited material — such as small needle biopsies, rare cell populations, or low-yield clinical specimens — where total RNA quantity becomes a major experimental constraint. Ultra-low input RNA-Seq addresses this challenge by enabling transcriptome profiling from minimal amounts of starting material.

Specialized library preparation workflows, including those based on the Clontech SMARTer platform, support library construction from as little as 1–10 ng of total RNA, thereby expanding transcriptomic applications to sample-limited settings.

Because low-input workflows are technically challenging, these protocols commonly use non-stranded library preparation to simplify the workflow and maximize library yield when RNA input is limiting (see “Stranded vs. Non-Stranded Libraries”).

lncRNA-Seq

lncRNA-Seq is a specialized transcriptomic approach designed to profile long non-coding RNAs (lncRNAs), particularly in studies investigating gene regulation, epigenetic mechanisms, and biomarker discovery.

Unlike mRNAs, not all lncRNAs are polyadenylated, and even when poly(A) tails are present, they may be short or unstable. Consequently, lncRNA-Seq workflows typically rely on rRNA depletion combined with random-primed cDNA synthesis rather than poly(A) selection in order to preserve broader non-coding transcriptome coverage.

Because many lncRNAs exist as antisense transcripts or overlap with protein-coding loci, stranded library preparation is generally preferred to improve transcript annotation accuracy and facilitate correct assignment of overlapping transcripts (see “Stranded vs. Non-Stranded Libraries”).

Given the relatively low abundance and high diversity of lncRNAs, deeper sequencing is often required. Typical recommendations range from 30 million PE150 reads (~9 Gb) to 50 million PE150 reads (~15 Gb) or higher, depending on sample complexity and study objectives.

Small RNA Sequencing (sRNA-Seq)

sRNA-Seq focuses on the small RNA fraction of the transcriptome, including microRNAs (miRNAs), small interfering RNAs (siRNAs), and PIWI-interacting RNAs (piRNAs), all of which play important roles in post-transcriptional gene regulation.

This approach is widely applied for small RNA expression profiling in studies of immune regulation, inflammation, cancer biology, and chronic disease.

Because target RNA molecules are extremely short and the primary objective is typically quantitative expression profiling, sRNA-Seq workflows commonly employ non-stranded library preparation (see “Stranded vs. Non-Stranded Libraries”).

A sequencing depth of approximately 20 million single-end 50 bp (SE50) reads (~1 Gb) or greater is generally recommended, depending on sample complexity and analytical objectives.

circRNA-Seq

circRNA-Seq is a specialized approach for profiling circular RNAs (circRNAs), a structurally distinct class of RNA molecules generated through back-splicing events that form covalently closed RNA loops. The absence of free 5′ and 3′ termini makes circRNAs highly resistant to exonuclease-mediated degradation, contributing to their remarkable molecular stability.

Increasing evidence suggests that circRNAs participate in cancer biology, metabolic disorders, and other complex diseases, making them attractive candidates for biomarker discovery and regulatory RNA research.

circRNA-Seq workflows typically incorporate a two-step enrichment strategy. First, ribosomal RNA is removed through rRNA depletion. Subsequently, RNase R treatment is applied to selectively digest linear RNA molecules while preserving circular transcripts, thereby enriching the circRNA fraction prior to sequencing.

Because strand information is important for accurate characterization of back-splicing junctions, stranded library preparation is commonly preferred for circRNA-Seq workflows (see “Stranded vs. Non-Stranded Libraries”).

Recommended sequencing depth is generally >30 million PE150 reads (~9 Gb), increasing to approximately 40 million PE150 reads (~12 Gb) or higher when low-abundance circRNAs are of interest.

Whole transcriptome sequencing

Whole Transcriptome Sequencing provides a comprehensive view of the transcriptome by simultaneously profiling multiple RNA biotypes within the same sample, including mRNA, lncRNAs, and additional regulatory non-coding RNA species.

This approach is particularly well suited for hypothesis-generating studies, regulatory network analysis, and projects where the target RNA class has not been predefined.

In practice, Whole Transcriptome Sequencing commonly relies on rRNA depletion to maximize transcriptome breadth and retain diverse RNA populations beyond polyadenylated transcripts. However, because RNA biotypes differ substantially in abundance, transcript length, and biological characteristics, careful optimization of library preparation strategy, insert size selection, strand specificity, and sequencing depth remains essential for generating interpretable transcriptomic data.

Metatranscriptomics and dual RNA-sequencing

Metatranscriptomics and Dual RNA-Seq are specialized transcriptomic approaches designed for studying host–pathogen and host–microbiome interactions.

Rather than profiling the transcriptome of a single organism in isolation, these methods simultaneously capture RNA signals from both host and microbial organisms within the same biological sample. This simultaneous profiling strategy enables researchers to investigate how interacting organisms dynamically modulate transcriptional responses during infection or symbiosis.

Because host and microbial transcriptomes often differ substantially in RNA composition and abundance, experimental design for metatranscriptomic studies frequently requires tailored optimization of rRNA depletion strategy, strand specificity, and sequencing depth.

Figure 1: Different types of RNA produced in cells (Source: Salsabeel Elkholey – BioRender)

Stranded vs. non-stranded libraries

Strand specificity represents a key design consideration in RNA-Seq experiments. Stranded (strand-specific) libraries preserve information regarding the strand of origin of each RNA molecule, enabling sequencing reads to be assigned to the correct genomic strand. This becomes particularly important when analyzing overlapping transcripts, antisense RNAs, or structurally complex transcriptomes.
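To make the benefit of strand information concrete, the sketch below assigns a read to one of two overlapping genes on opposite strands. The gene names, coordinates, and the `assign_read` helper are hypothetical, and the sketch assumes the read's strand of origin has already been inferred from the library protocol. Real quantification tools handle many more cases.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Gene:
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


def assign_read(read_start: int, read_end: int, read_strand: Optional[str],
                genes: List[Gene]) -> List[str]:
    """Return the names of all genes compatible with a read.

    `read_strand=None` models a non-stranded library, where the
    strand of origin is unknown and cannot be used as a filter.
    """
    hits = []
    for gene in genes:
        overlaps = read_start <= gene.end and read_end >= gene.start
        strand_ok = read_strand is None or read_strand == gene.strand
        if overlaps and strand_ok:
            hits.append(gene.name)
    return hits


if __name__ == "__main__":
    # Hypothetical locus: an antisense lncRNA inside a coding gene.
    locus = [Gene("coding_gene", 1000, 5000, "+"),
             Gene("antisense_lncRNA", 3000, 4000, "-")]
    print(assign_read(3200, 3350, None, locus))  # ambiguous: both genes match
    print(assign_read(3200, 3350, "-", locus))   # resolved to the antisense gene
```

With a non-stranded library the read is compatible with both genes, whereas the stranded read resolves to a single transcript.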

The choice between stranded and non-stranded libraries is primarily determined by two factors: the characteristics of the target RNA and the requirements of the downstream analysis.

From a biological perspective, stranded libraries are generally preferred for lncRNA studies because many lncRNAs overlap with protein-coding genes or exist as antisense transcripts that cannot be accurately resolved without strand information. Similarly, stranded data improves the characterization of back-splicing events in circRNA studies. In contrast, small RNA sequencing workflows commonly adopt non-stranded library preparation because the target RNA molecules are very short, around 18 to 40 nt.

From a bioinformatics perspective, stranded libraries are often required for transcript-level analyses such as de novo transcriptome assembly, isoform quantification, alternative splicing analysis, and novel transcript discovery. In contrast, non-stranded libraries are often sufficient for differential gene expression analysis in well-annotated genomes, while offering lower overall cost and simpler workflows.

Non-stranded protocols are also commonly applied in ultra-low input RNA-Seq workflows where maximizing library yield is prioritized.

Figure 2: Comparison between non-stranded and stranded mRNA libraries (Image source: adapted from Hemagirri et al. (2024), Biogerontology, Vol. 25, pp. 705–737)

Single-end vs. paired-end sequencing

In RNA-Seq experiments, single-end (SE) sequencing reads only one end of each cDNA fragment, while paired-end (PE) sequencing reads both ends of the same fragment. Because paired-end sequencing generates information from both ends of a fragment, it provides better alignment accuracy and more transcript-level information, although at a higher sequencing cost and data volume.

Single-end sequencing is often sufficient for standard gene expression profiling and differential gene expression (DGE) analysis, particularly when working with well-annotated reference genomes. It is therefore commonly used as a cost-effective option for routine gene-level analysis.

Paired-end sequencing is generally preferred when more detailed transcript information is needed. Sequencing both ends of the same fragment helps improve read alignment across exon–exon junctions, repetitive regions, and structurally complex transcripts. As a result, paired-end sequencing is commonly recommended for applications such as alternative splicing analysis, isoform quantification, novel transcript discovery, gene fusion detection, and allele-specific expression analysis.

Figure 3: Comparison between single-end and paired-end sequencing (Image source: Zhernakova et al., 2013, doi.org/10.1371/journal.pgen.1003594)

Sample QC (RNA quality assessment)

RNA quality is one of the most critical determinants of RNA-Seq success. Prior to library preparation, RNA samples should be evaluated in terms of concentration, purity, and integrity.

Qubit fluorometry is widely preferred for RNA quantification because fluorescence-based detection provides more accurate concentration measurements and is less affected by contaminants. NanoDrop spectrophotometry is commonly used to assess sample purity through absorbance ratios, including A260/280 for protein contamination and A260/230 for organic compounds or residual salts.
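A purity check based on the two absorbance ratios could be sketched as follows. The cutoffs (1.8 for A260/280 and 2.0 for A260/230) are commonly quoted rules of thumb, not values taken from this guide, and should be replaced with the requirements of the library preparation kit or sequencing provider. The function name `purity_flags` is hypothetical.

```python
from typing import List


def purity_flags(a260_280: float, a260_230: float,
                 min_a260_280: float = 1.8,   # assumed rule-of-thumb cutoff
                 min_a260_230: float = 2.0    # assumed rule-of-thumb cutoff
                 ) -> List[str]:
    """Return human-readable warnings for an RNA sample's absorbance ratios."""
    flags = []
    if a260_280 < min_a260_280:
        flags.append("low A260/280: possible protein contamination")
    if a260_230 < min_a260_230:
        flags.append("low A260/230: possible organic compounds or residual salts")
    return flags


if __name__ == "__main__":
    print(purity_flags(2.05, 2.10))  # no warnings expected
    print(purity_flags(1.60, 1.20))  # both ratios flagged
```

An empty list means neither ratio fell below its cutoff. It does not replace the integrity assessment described next.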

RNA integrity is typically assessed using capillary electrophoresis platforms such as the Agilent Bioanalyzer, TapeStation, or Fragment Analyzer, which evaluate RNA size distribution and report an integrity score on a scale of 1 to 10: the RNA Integrity Number (RIN) on the Bioanalyzer, or the equivalent RINe (TapeStation) and RQN (Fragment Analyzer). Higher values indicate relatively intact RNA, whereas lower values reflect progressive RNA degradation. These measurements are therefore important for determining compatibility with different library preparation workflows.

For poly(A)-selection workflows, relatively intact RNA is generally preferred because degraded RNA can introduce strong 3′ bias resulting from preferential capture of poly(A)-proximal fragments. In contrast, rRNA depletion-based workflows are substantially more tolerant of degraded RNA and are therefore commonly used for FFPE-derived samples and other low-RIN input materials.
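The enrichment rules described in this guide can be summarized as a small decision function. This is a simplified sketch: the RIN cutoff of 7.0 is an illustrative assumption (the guide does not specify a threshold), and the function and parameter names are hypothetical.

```python
def choose_enrichment(is_eukaryote: bool, rin: float,
                      targets_non_polyadenylated: bool = False,
                      min_rin_for_polya: float = 7.0) -> str:
    """Suggest an RNA enrichment strategy from sample characteristics.

    `min_rin_for_polya` is an assumed cutoff, not a value from the guide.
    """
    if not is_eukaryote:
        # Prokaryotic RNA typically lacks stable poly(A) tails.
        return "rRNA depletion"
    if targets_non_polyadenylated:
        # e.g. lncRNA or circRNA studies that need non-poly(A) transcripts.
        return "rRNA depletion"
    if rin < min_rin_for_polya:
        # Degraded RNA (e.g. FFPE) causes 3' bias under poly(A) selection.
        return "rRNA depletion"
    return "poly(A) selection"


if __name__ == "__main__":
    print(choose_enrichment(is_eukaryote=True, rin=9.2))
    print(choose_enrichment(is_eukaryote=True, rin=3.5))
    print(choose_enrichment(is_eukaryote=False, rin=9.0))
```

In practice the threshold depends on the library kit and the study goals, so the default should be treated as a placeholder.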

Figure 4: Assessing input RNA quality

Experimental design: Controls, replicates, and batch effects

Careful experimental design is as important as sequencing technology selection for generating reliable and interpretable RNA-Seq data. Well-designed studies reduce technical variability, mitigate systematic bias, and improve statistical power for detecting biologically meaningful transcriptional changes.

Controls

Appropriate experimental controls enable monitoring of the complete workflow, from RNA extraction through library preparation and sequencing. Exogenous RNA spike-ins, such as ERCC or SIRV controls, may be included to evaluate normalization performance, library preparation efficiency, and inter-run consistency.

Batch effects and randomization

Batch effects arise when observed differences between samples are driven by technical variables — including library preparation date, reagent lot, sequencing run, or operator — rather than true biological differences.

The most effective strategy for minimizing batch effects is to randomize samples across experimental batches so that biological groups are not systematically confounded with technical processing variables. When complete randomization is not feasible, detailed documentation of batch metadata becomes essential for downstream statistical correction using tools such as ComBat or limma.
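One simple way to randomize samples without confounding group and batch is to shuffle samples within each biological group and then deal them across batches in turn. The sketch below is a minimal illustration using only the Python standard library. The sample IDs, group labels, and the `assign_batches` helper are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict


def assign_batches(samples: Dict[str, str], n_batches: int,
                   seed: int = 0) -> Dict[str, int]:
    """Map each sample ID to a 0-based batch index.

    `samples` maps sample ID -> biological group. Samples are shuffled
    within each group, then dealt round-robin so that every group is
    spread as evenly as possible across batches.
    """
    rng = random.Random(seed)  # fixed seed keeps the layout reproducible
    by_group = defaultdict(list)
    for sample_id, group in samples.items():
        by_group[group].append(sample_id)

    assignment = {}
    slot = 0  # carried across groups so batch sizes stay balanced overall
    for group in sorted(by_group):
        members = by_group[group]
        rng.shuffle(members)
        for sample_id in members:
            assignment[sample_id] = slot % n_batches
            slot += 1
    return assignment


if __name__ == "__main__":
    design = {f"ctrl_{i}": "control" for i in range(1, 7)}
    design.update({f"trt_{i}": "treated" for i in range(1, 7)})
    for sample_id, batch in sorted(assign_batches(design, n_batches=2).items()):
        print(sample_id, "-> batch", batch)
```

Recording the resulting assignment alongside other batch metadata keeps it available for downstream statistical correction.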

Biological replicates

Biological replication is fundamental for estimating within-group variability and supporting statistical significance in differential expression analysis.

A minimum of three biological replicates per condition is commonly recommended, while five to six or more replicates generally provide substantially improved statistical power, particularly for heterogeneous tissues or biologically variable samples.

Summary: Matching the RNA-Seq strategy to the biological questions

No single RNA-Seq approach is universally optimal for every study design. For studies focused primarily on protein-coding gene expression, mRNA-Seq is often the most appropriate starting point. Investigations centered on regulatory non-coding RNA biology may instead require dedicated approaches such as lncRNA-Seq, sRNA-Seq, or circRNA-Seq, depending on the RNA class of interest. For broad discovery-oriented studies aimed at comprehensively profiling RNA populations, Whole Transcriptome Sequencing provides the most extensive transcriptomic coverage.
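The recommendations in this guide can be collected into a small lookup table, for example as a starting point for a project planning script. The sketch below restates values given in the sections above. The dictionary layout, key names, and the `recommend` helper are hypothetical, and `None` marks items the guide does not specify.

```python
from typing import Optional

# Starting-point recommendations restated from the sections above.
RNA_SEQ_GUIDE = {
    "mRNA": {"approach": "mRNA-Seq", "enrichment": "poly(A) selection",
             "library": "stranded or non-stranded",
             "min_reads_millions": 20, "read_type": "PE150"},
    "lncRNA": {"approach": "lncRNA-Seq", "enrichment": "rRNA depletion",
               "library": "stranded",
               "min_reads_millions": 30, "read_type": "PE150"},
    "small RNA": {"approach": "sRNA-Seq", "enrichment": None,
                  "library": "non-stranded",
                  "min_reads_millions": 20, "read_type": "SE50"},
    "circRNA": {"approach": "circRNA-Seq",
                "enrichment": "rRNA depletion + RNase R",
                "library": "stranded",
                "min_reads_millions": 30, "read_type": "PE150"},
}

# Fallback when the target RNA class has not been predefined.
WHOLE_TRANSCRIPTOME = {"approach": "Whole Transcriptome Sequencing",
                       "enrichment": "rRNA depletion", "library": None,
                       "min_reads_millions": None, "read_type": None}


def recommend(target: Optional[str]) -> dict:
    """Return the guide's starting-point recommendation for an RNA class."""
    if target is None:
        return WHOLE_TRANSCRIPTOME
    if target not in RNA_SEQ_GUIDE:
        raise ValueError(f"unknown RNA class: {target!r}")
    return RNA_SEQ_GUIDE[target]


if __name__ == "__main__":
    print(recommend("lncRNA")["approach"])
    print(recommend(None)["approach"])
```

These entries are only defaults. Degraded or prokaryotic samples, for instance, shift mRNA-Seq toward rRNA depletion, as discussed earlier.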

Ultimately, selecting the appropriate RNA target, enrichment strategy, library type, sequencing depth, and RNA quality control framework collectively determines whether an RNA-Seq study generates robust and biologically interpretable transcriptomic data.

GeneSmart is the official distributor of Novogene in Vietnam, providing end-to-end support for RNA-Seq projects — from experimental design and sample preparation to library construction, sequencing, and downstream bioinformatics analysis. Contact GeneSmart to discuss the most suitable RNA-Seq strategy for your study design, sample types, and research objectives.

------------

GENESMART CO., LTD | Authorized distributor of 10X Genomics, Altona, Biosigma, Hamilton, IT-IS (Novacyt), Norgen Biotek, and Rainin in Vietnam.

Website: https://genesmart.vn/

Hotline: 0947 528 778

Email: [email protected]