If you are deciding between 16S/18S/ITS amplicon sequencing and shotgun metagenomic sequencing and need more guidance on choosing the right approach, read our previous blog here.
1. Introduction
Shotgun metagenomic sequencing uses modern genomic techniques to sequence all the genes in all the microorganisms present in a sample, without the need to isolate or culture individual species. This opens up a world of research, as an estimated 99% of microorganisms cannot be cultured in the laboratory. The method enables researchers to evaluate microbial diversity, estimate the abundance of microbes in the sampled environment, and investigate microbial community structure and interactions. The sequencing results can also be used to assess functional diversity in the sample analysed.
Shotgun metagenomic sequencing is faster and more reliable than traditional techniques because it removes the need to culture microbes, a step that can create conditions favoring certain species and so introduce bias into the results. It also removes other sources of bias associated with traditional techniques, including the need for multiple rounds of PCR amplification.
2. Shotgun Metagenomic Sequencing workflow
A shotgun metagenomics project workflow typically follows the same six steps: sample preparation, DNA extraction, quantification and quality control, library construction, sequencing, and bioinformatics analysis.

Figure 1: General workflow of Shotgun Metagenomic Sequencing
Sample preparation depends on the environment being sampled. Samples for these projects typically range from soil and water to faeces, swabs, and tissues. Problems can arise during sample preparation, especially with tissue samples, which carry a high risk of contamination with the host genome and can therefore yield low-quality data. Strategies to avoid this include not sampling too close to the host’s tissue and using an appropriate extraction kit. In addition, if the host’s genome is known, host reads can be removed from the sequencing data; for more on host DNA depletion approaches, see here.
Once the sample has been prepared, DNA can be extracted. The CTAB method is preferred for DNA extraction. However, depending on the sample type, a specialized kit may work better; for sludge and soil samples, for example, the PowerSoil® DNA isolation kit is highly recommended. Following extraction, the samples undergo quality control to check for degradation and to confirm that they are good enough to proceed to library preparation.
DNA is fragmented into pieces of 250 – 300 bp to construct a library with a 350 bp insert size. This library is then sequenced on the Illumina NovaSeq platform with a paired-end 150 bp strategy. Once sequenced, the reads go through data quality control, which removes three types of read:
- Reads containing adapters
- Reads containing >10% of unknown bases
- Low-quality reads
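The three filters above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual implementation: only the 10% unknown-base cutoff comes from the text, while the adapter sequence, the low-quality thresholds, and the function names are assumptions chosen for the example.

```python
from typing import List, Tuple

# Only the 10% unknown-base cutoff comes from the text above.
# The adapter sequence and the low-quality rule are assumptions for illustration.
ADAPTER = "AGATCGGAAGAGC"        # assumed: common Illumina adapter prefix
MAX_N_FRACTION = 0.10            # from the text: >10% unknown bases is removed
LOW_QUALITY_PHRED = 5            # assumed: a base at or below this score is "low quality"
MAX_LOW_QUALITY_FRACTION = 0.50  # assumed: more than half low-quality bases fails the read


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string (Phred+33 encoding by default)."""
    return [ord(symbol) - offset for symbol in quality]


def keep_read(sequence: str, quality: str) -> bool:
    """Return True if a read passes all three filters."""
    if not sequence or len(sequence) != len(quality):
        return False  # empty or malformed record
    if ADAPTER in sequence.upper():
        return False  # read contains adapter sequence
    if sequence.upper().count("N") / len(sequence) > MAX_N_FRACTION:
        return False  # too many unknown bases
    low = sum(1 for score in phred_scores(quality) if score <= LOW_QUALITY_PHRED)
    return low / len(sequence) <= MAX_LOW_QUALITY_FRACTION


def filter_reads(reads: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only the (sequence, quality) pairs that pass every filter."""
    return [read for read in reads if keep_read(*read)]


reads = [
    ("ACGTACGTAC", "IIIIIIIIII"),  # clean read
    ("ACGTNNACGT", "IIIIIIIIII"),  # 20% unknown bases
    ("ACGTACGTAC", "######IIII"),  # 60% low-quality bases
]
clean_reads = filter_reads(reads)
print(len(clean_reads))
```

For paired-end data, a pair is usually discarded as a whole if either mate fails, so in practice `keep_read` would be applied to both mates before keeping the pair.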
After filtering, the clean reads proceed to the analysis pipeline. With shotgun metagenomic sequencing data, we want to answer two types of question: who is there, and what are they doing? To answer the first, we examine taxonomic and phylogenetic diversity to see which types of microbes are present in the sample. The second addresses the role the microbes play within the sample, using gene prediction and functional annotation. To answer these questions, we first need to assemble the reads and annotate the results. Following this, a host of downstream analyses can be carried out depending on your samples and questions. These include, but are not limited to:
- Gene catalogue statistics
- Taxonomy annotations
- PCA, PCoA and NMDS Analyses
- Functional Annotation
- Antibiotic Resistance Gene Distribution
The tools and visualization plots used in the assembly-based shotgun analysis pipeline are briefly summarized below:
- Metagenome Assembly: The clean data from each sample, after quality control, was used for metagenome assembly.
- Gene Prediction: Using MetaGeneMark, gene prediction was performed on the assembled scaffolds. The predicted genes were pooled and dereplicated to create a gene catalogue, enabling the assessment of gene abundance for each sample.
- Taxonomy Annotation: The metagenomic reads were compared with the microNR database for taxonomy annotation, resulting in abundance tables at various taxonomic ranks.
- Function Annotation: The functions of coding sequences were inferred by comparing them with databases like KEGG, eggNOG, and CAZy, providing a functional profile of the metagenome.
- Antibiotic resistance gene annotation:
  - ARGs Analysis: Antibiotic resistance genes were annotated using the CARD database, revealing their abundance and species distribution.
  - MGE Analysis: Unigenes were compared against databases for insertion sequences, integrons, and plasmids to determine their abundance.
- Statistical and Comparative Analysis: Utilizing the abundance tables, analyses including clustering, ANOSIM, PCA, PCoA, and NMDS were conducted. For grouped data, metagenomeSeq and LEfSe were used for multivariate analysis and pathway comparison.

Figure 2: Analysis pipeline for Short-read-based Shotgun Metagenomic Sequencing
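The gene prediction step above mentions assessing gene abundance per sample but does not give a formula. The sketch below assumes one common convention: divide the number of reads mapped to each gene by the gene length, then scale so that the values in a sample sum to 1. The function name and input layout are hypothetical.

```python
from typing import Dict


def relative_gene_abundance(
    read_counts: Dict[str, int], gene_lengths: Dict[str, int]
) -> Dict[str, float]:
    """Length-normalized relative abundance of each gene in one sample.

    Assumed convention: (reads / length) for each gene, divided by the
    sum of (reads / length) over all genes in the sample.
    """
    per_base = {gene: read_counts[gene] / gene_lengths[gene] for gene in read_counts}
    total = sum(per_base.values())
    if total == 0:
        return {gene: 0.0 for gene in per_base}
    return {gene: value / total for gene, value in per_base.items()}


# Two genes with the same read count: the shorter gene has higher coverage,
# so it receives the larger share of the abundance.
abundance = relative_gene_abundance(
    read_counts={"geneA": 100, "geneB": 100},
    gene_lengths={"geneA": 1000, "geneB": 500},
)
print(abundance)
```

Normalizing by length keeps long genes from appearing more abundant simply because they attract more reads.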
3. Analysis results and plots visualization
3.1. Gene prediction and abundance analysis
- What it does: MetaGeneMark identifies genes (specifically, coding sequences or CDSs) in the assembled metagenomic scaffolds by locating open reading frames (ORFs). The predicted genes are then pooled (collected across all samples) and dereplicated to remove highly similar sequences. The resulting non-redundant gene catalogue is used for downstream analyses.
- Visualization plots:
- UniqGene Length Distribution: shows the range and frequency of predicted gene lengths to assess prediction quality and aid normalization.

Figure 3: UniqGene Length Distribution
- Core-pan genome analysis:
  - The pan genome is the full set of genes found across all samples in a study. It comprises core genes (shared by all samples) and accessory genes (present in some samples but not all); the core genome is typically more conserved.
  - Rarefaction curves comparing the core and pan genomes usually show less variation as the number of samples increases. The number of non-redundant genes keeps growing in the pan genome, whereas the core genome quickly saturates.
- Correlation analysis of samples: reflects the reliability of the experiment and the suitability of the chosen samples, using Spearman’s correlation coefficient. Samples from the same condition or group should be positively correlated, while samples from different groups are expected to correlate more weakly.
- Gene number analysis: a boxplot compares the number of genes across groups. In the example below:
  - DOP: has the highest median gene count, at around 910k, and the narrowest spread, indicating consistent values within this group.
  - HFD: has the middle median gene count of the three groups and the widest spread, indicating variation within this group.

Figure 4: Gene number of different groups
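The core and pan genome curves described in this section can be sketched as follows, assuming each sample is represented by the set of catalogue genes detected in it. The function names, the number of permutations, and the toy data are illustrative assumptions, not taken from the report.

```python
import random
from typing import Dict, List, Optional, Set, Tuple


def core_pan_curve(
    sample_genes: Dict[str, Set[str]], order: List[str]
) -> List[Tuple[int, int, int]]:
    """Add samples in the given order; return (n_samples, core_size, pan_size)."""
    core: Optional[Set[str]] = None
    pan: Set[str] = set()
    curve = []
    for n, sample in enumerate(order, start=1):
        genes = sample_genes[sample]
        core = set(genes) if core is None else core & genes  # shared by all so far
        pan |= genes                                          # seen in any so far
        curve.append((n, len(core), len(pan)))
    return curve


def mean_core_pan(
    sample_genes: Dict[str, Set[str]], n_permutations: int = 100, seed: int = 0
) -> List[Tuple[int, float, float]]:
    """Average the curve over random sample orders (a simple rarefaction)."""
    rng = random.Random(seed)
    names = sorted(sample_genes)
    totals = [[0, 0] for _ in names]
    for _ in range(n_permutations):
        order = names[:]
        rng.shuffle(order)
        for n, core_size, pan_size in core_pan_curve(sample_genes, order):
            totals[n - 1][0] += core_size
            totals[n - 1][1] += pan_size
    return [
        (i + 1, core / n_permutations, pan / n_permutations)
        for i, (core, pan) in enumerate(totals)
    ]


toy = {"S1": {"a", "b", "c"}, "S2": {"a", "b", "d"}, "S3": {"a", "e"}}
print(core_pan_curve(toy, ["S1", "S2", "S3"]))
print(mean_core_pan(toy, n_permutations=50))
```

The core size can only shrink or stay flat as samples are added, while the pan size can only grow, which is why the two curves diverge in the way described above.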
3.2. Taxonomy annotation
Taxonomy annotation determines the taxonomic identity (e.g., species, genus, phylum) of the microbes present in the sample. See more typical plots used for taxonomic analysis in this blog – 16S/18S/ITS Amplicon Metagenomic Sequencing.
3.3. Function annotation
- What it does: Identifies what the predicted genes do by assigning them to functional categories. In detail, the predicted coding sequences (CDSs) are compared against several major functional databases, including:
  - KEGG (Kyoto Encyclopedia of Genes and Genomes): For pathways and enzymes.
  - eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups): For orthologs (genes in different species that derive from a single gene in their common ancestor) and general functional classification.
  - CAZy (Carbohydrate-Active enZYmes): For enzymes involved in building and breaking down carbohydrates.
  - VFDB (Virulence Factor Database): Focuses on virulence factors from pathogenic bacteria, chlamydia and mycoplasma. It contains general characterization of virulence genes, their functions and pathogenic mechanisms.
  - PHI-base (Pathogen-Host Interactions database): Focuses on pathogen-host interactions. It collects, organizes and provides information on the interactions between various pathogens (e.g., bacteria, viruses, fungi, and parasites) and their hosts, including humans, other animals, and plants.
- This generates a functional profile of the microbiome: essentially a map of its metabolic capabilities and potential activities.
- Visualization plots: eggNOG is used here as a representative example; results for the other databases listed above are plotted in the same way.
- Basic annotation method: The predicted genes are compared against the known databases to find the strongest match for each gene. Functional annotation is then performed by counting the genes linked to each function, producing a table that shows how many genes with each function appear in each sample.
- Functional abundance of eggNOG: The number of genes matched to each functional category is shown below:
  - Among genes with known functions, the category L: Replication, recombination and repair has the highest number of matched genes, at approximately 150k.
  - The fewest genes fall into A: RNA processing and modification, B: Chromatin structure and dynamics, W: Extracellular structures and Z: Cytoskeleton.

Figure 5: Functional abundance of eggNOG
- Relative abundance of eggNOG: shows the relative quantification of different function categories across different sample groups.

Figure 6: Relative abundance of eggNOG
- Relative abundance cluster analysis of eggNOG: The top 35 functions by abundance in each sample were selected to draw a heat map in which:
  - X-axis: shows the different sample groups
  - Right-hand Y-axis: shows the functional categories
  - Left-hand Y-axis: clusters these functions based on their similarity.
  - E.g. Group M8 shows high abundance in “P: Inorganic ion transport and metabolism” and “G: Carbohydrate transport and metabolism”, and these two functions are clustered together, indicating that their profiles are similar.

Figure 7: Relative abundance cluster analysis of eggNOG
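The basic annotation method in this section (keep the strongest database match per gene, then count genes per function in each sample) can be sketched as below. The hit format `(gene, category, score)`, the function names, and the ranking used in `top_functions` (summed count across samples) are assumptions for illustration; the report may select its top 35 functions differently.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def best_hits(hits: Iterable[Tuple[str, str, float]]) -> Dict[str, str]:
    """Keep the highest-scoring functional category for each gene."""
    best: Dict[str, Tuple[str, float]] = {}
    for gene, category, score in hits:
        if gene not in best or score > best[gene][1]:
            best[gene] = (category, score)
    return {gene: category for gene, (category, _) in best.items()}


def count_table(
    gene_to_category: Dict[str, str], sample_genes: Dict[str, Set[str]]
) -> Dict[str, Counter]:
    """Count annotated genes per functional category in each sample."""
    return {
        sample: Counter(gene_to_category[g] for g in genes if g in gene_to_category)
        for sample, genes in sample_genes.items()
    }


def relative_abundance(table: Dict[str, Counter]) -> Dict[str, Dict[str, float]]:
    """Convert per-sample counts to fractions of annotated genes."""
    result = {}
    for sample, counts in table.items():
        total = sum(counts.values())
        result[sample] = {c: n / total for c, n in counts.items()} if total else {}
    return result


def top_functions(table: Dict[str, Counter], n: int = 35) -> List[str]:
    """Rank categories by summed count across samples (assumed criterion)."""
    totals: Counter = Counter()
    for counts in table.values():
        totals.update(counts)
    return [category for category, _ in totals.most_common(n)]


hits = [("g1", "L", 50.0), ("g1", "G", 80.0), ("g2", "L", 60.0), ("g3", "P", 40.0)]
annotation = best_hits(hits)
table = count_table(annotation, {"S1": {"g1", "g2"}, "S2": {"g2", "g3", "g4"}})
print(relative_abundance(table))
print(top_functions(table, n=2))
```

Genes without any database match (such as `g4` here) are simply left out of the counts, so the fractions describe annotated genes only.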
3.4. Taxonomy and function diversity
Taxonomic and functional diversity analyses are essential components of metagenomic studies, enabling researchers to uncover the composition and ecological roles of microbial communities. Taxonomic analysis identifies which microorganisms are present, while functional analysis reveals genes and pathways involved in community metabolism and resistance.
Downstream analyses such as alpha diversity provide insight into the richness and evenness of microbial communities, while principal component analysis (PCA) and non-metric multidimensional scaling (NMDS) reveal similarities or differences among samples.
You can read more about alpha diversity here.
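To make richness, evenness, and between-sample differences concrete, here is a small sketch using standard formulas: observed richness, the Shannon index, Pielou's evenness, and Bray-Curtis dissimilarity. The choice of these particular metrics is an assumption; the text above does not say which indices or distance measure the pipeline uses before ordination.

```python
import math
from typing import Sequence


def richness(counts: Sequence[float]) -> int:
    """Number of taxa (or functions) observed in a sample."""
    return sum(1 for c in counts if c > 0)


def shannon(counts: Sequence[float]) -> float:
    """Shannon index H = -sum(p * ln p) over observed taxa."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def pielou_evenness(counts: Sequence[float]) -> float:
    """Evenness J = H / ln(richness); 1.0 means perfectly even."""
    observed = richness(counts)
    return shannon(counts) / math.log(observed) if observed > 1 else 0.0


def bray_curtis(a: Sequence[float], b: Sequence[float]) -> float:
    """Dissimilarity between two samples: 0 = identical, 1 = nothing shared."""
    denominator = sum(a) + sum(b)
    if denominator == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / denominator


even_sample = [10, 10, 10, 10]
skewed_sample = [37, 1, 1, 1]
print(shannon(even_sample), pielou_evenness(even_sample))
print(shannon(skewed_sample), pielou_evenness(skewed_sample))
print(bray_curtis(even_sample, skewed_sample))
```

Both toy samples have the same richness, but the skewed one has lower evenness; a pairwise dissimilarity matrix built with `bray_curtis` is the kind of input that ordination methods such as PCoA or NMDS work from.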
3.5. Statistical and comparative analysis
Statistical and comparative analysis compares microbial communities across samples to find patterns, differences, or associations with conditions (e.g., disease vs healthy). This is a complex topic, so we’ll cover it in a separate blog.
3.6. Antibiotic resistance gene annotation
a. ARGs (Antibiotic Resistance Genes) Analysis
- What it does: Detects and quantifies genes that confer resistance to antibiotics based on CARD (Comprehensive Antibiotic Resistance Database).
- Visualization plot:
- Relative abundance of resistance genes: In this stacked bar plot:
  - vanT_gene_in_vanG_cluster and vanW_gene_in_vanI_cluster (both associated with vancomycin resistance) show the highest abundance (red and dark blue bars at the bottom).
  - adeF, a component of a multidrug efflux pump, is also prominent from group M1 to DH12.

Figure 8: Relative abundance of antibiotic resistance genes
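A stacked bar plot like Figure 8 needs one set of fractions per sample. The sketch below shows one way to prepare them, assuming unigene abundances and a unigene-to-ARG annotation are already available; the function names, the `top_n` cutoff, and the "Others" bucket are illustrative assumptions rather than the report's documented procedure.

```python
from collections import defaultdict
from typing import Dict


def arg_abundance(
    gene_abundance: Dict[str, float], gene_to_arg: Dict[str, str]
) -> Dict[str, float]:
    """Sum unigene abundances per annotated resistance gene in one sample."""
    totals: Dict[str, float] = defaultdict(float)
    for gene, abundance in gene_abundance.items():
        arg = gene_to_arg.get(gene)
        if arg is not None:  # unigenes without an ARG annotation are ignored
            totals[arg] += abundance
    return dict(totals)


def stacked_bar_fractions(
    arg_totals: Dict[str, float], top_n: int = 20
) -> Dict[str, float]:
    """Fractions for the top_n ARGs, with the remainder pooled as 'Others'."""
    total = sum(arg_totals.values())
    if total == 0:
        return {}
    ranked = sorted(arg_totals.items(), key=lambda item: item[1], reverse=True)
    fractions = {arg: value / total for arg, value in ranked[:top_n]}
    others = sum(value for _, value in ranked[top_n:]) / total
    if others > 0:
        fractions["Others"] = others
    return fractions


sample_abundance = {"u1": 0.4, "u2": 0.1, "u3": 0.3, "u4": 0.2}
annotation_map = {"u1": "adeF", "u2": "adeF", "u3": "vanT_gene_in_vanG_cluster"}
totals = arg_abundance(sample_abundance, annotation_map)
print(stacked_bar_fractions(totals, top_n=1))
```

Running the same two functions on every sample gives the per-sample bars; the same approach would apply to the mobile genetic element annotations.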
b. MGE (Mobile Genetic Elements) Analysis
- What it does: Detects mobile genetic elements that can spread genes (especially ARGs) between microbes, using specialized databases for insertion sequences (ISFinder), integrons (INTEGRALL), and plasmids (PLSDB or similar).
- Visualization plot:
- Relative abundance of Mobile Genetic Elements:
  - AP011957 (red bar) dominates nearly all samples (>85% in most), suggesting a major plasmid or mobile element that is consistently present and may carry resistance genes.

Figure 9: Relative abundance of Mobile Genetic Elements
References
- Novogene. (n.d.). A beginner’s guide to microbial shotgun metagenomic sequencing. https://www.novogene.com/amea-en/resources/blog/a-beginners-guide-to-microbial-shotgun-metagenomic-sequencing-2/
- Novogene Co., Ltd. (n.d.). Short-read shotgun metagenomics demo report.
------------
GENESMART CO., LTD | Authorized distributor of 10X Genomics, Altona, Biosigma, Hamilton, IT-IS (Novacyt), Norgen Biotek, and Rainin in Vietnam.











