How to Build a Single-Cell RNA-seq Analysis Pipeline with Scanpy for PBMC Clustering, Annotation, and Trajectory Discovery

Learn how to build a complete single-cell RNA-seq analysis pipeline using Scanpy, from quality control to trajectory discovery.

Introduction

Single-cell RNA sequencing (scRNA-seq) is a powerful technology that allows scientists to study gene expression in individual cells, revealing cellular heterogeneity and identifying rare cell types. In this beginner-friendly tutorial, you'll learn how to build a complete single-cell RNA-seq analysis pipeline using Scanpy, a Python toolkit designed for processing and analyzing scRNA-seq data. We'll work with the PBMC-3k dataset, a standard benchmark for testing scRNA-seq workflows, and perform essential steps like quality control, clustering, annotation, and trajectory discovery.

By the end of this tutorial, you'll have a working pipeline that you can adapt to analyze your own scRNA-seq datasets.

Prerequisites

Before starting this tutorial, you should have:

Basic knowledge of Python programming
Python installed (preferably Python 3.8 or higher)
Installed packages: scanpy, anndata, pandas, matplotlib, seaborn, numpy, and leidenalg (Scanpy's Leiden clustering in step 8 depends on it)

To install the required packages, run the following command in your terminal or command prompt:

pip install scanpy anndata pandas matplotlib seaborn numpy leidenalg

Step-by-Step Instructions

1. Import Required Libraries

We start by importing the necessary Python libraries. These libraries provide the tools needed to load, process, and visualize scRNA-seq data.

import scanpy as sc
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Why? Importing these libraries gives us access to Scanpy's powerful functions for data analysis, and other libraries for data manipulation and visualization.

2. Load the PBMC-3k Dataset

Scanpy provides a built-in function to load the PBMC-3k dataset, which is a commonly used benchmark for scRNA-seq analysis.

# Load the PBMC-3k dataset
pbmc = sc.datasets.pbmc3k()
print(pbmc)

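Why? This dataset contains gene expression counts for roughly 2,700 peripheral blood mononuclear cells (PBMCs) from a healthy donor, published by 10x Genomics. It is small enough to process on a laptop, which makes it ideal for learning scRNA-seq analysis techniques. The first call downloads the file, so it needs an internet connection.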

3. Inspect the Dataset Structure

Before proceeding, it's important to understand what data we're working with.

# Inspect the dataset structure
print(pbmc.obs)
print(pbmc.var)

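Why? The obs attribute holds one row per cell (indexed by barcode), var holds one row per gene, and the count matrix itself lives in pbmc.X. In this raw dataset obs has no annotation columns yet; later steps add per-cell columns such as QC metrics and cluster labels to it. Understanding this structure is crucial for downstream analysis.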

4. Perform Quality Control (QC) Checks

Quality control is essential to filter out low-quality cells and genes. We'll check gene counts, total counts, mitochondrial content, and ribosomal gene signals.

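# Flag mitochondrial ("MT-") and ribosomal ("RPS"/"RPL") genes by name prefix
pbmc.var['mt'] = pbmc.var_names.str.startswith('MT-')
pbmc.var['ribo'] = pbmc.var_names.str.startswith(('RPS', 'RPL'))

# Calculate QC metrics, including pct_counts_mt and pct_counts_ribo per cell
sc.pp.calculate_qc_metrics(pbmc, qc_vars=['mt', 'ribo'], percent_top=None, log1p=False, inplace=True)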

# View QC metrics
print(pbmc.obs.head())

Why? QC metrics help us identify and remove cells with low gene counts, high mitochondrial content (indicating cell damage), or other poor-quality data points.
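To make the QC columns less of a black box, here is a minimal pure-Python sketch of what the per-cell metrics mean. The gene list and counts are made-up toy values, and the column names mirror the ones Scanpy writes to obs (pct_counts_mt appears only when a boolean mt flag is passed through qc_vars). It illustrates the arithmetic, not Scanpy's actual implementation.

```python
# Toy count matrix: rows are cells, columns are genes (hypothetical values).
genes = ["MT-CO1", "MT-ND1", "CD3E", "LYZ", "MS4A1"]
counts = [
    [1, 0, 5, 0, 4],  # cell with few mitochondrial reads
    [6, 3, 1, 0, 0],  # cell dominated by mitochondrial reads
]

# Human mitochondrial gene symbols start with "MT-".
is_mt = [g.startswith("MT-") for g in genes]

qc = []
for row in counts:
    total = sum(row)                                      # total_counts
    n_genes = sum(1 for c in row if c > 0)                # n_genes_by_counts
    mt_total = sum(c for c, mt in zip(row, is_mt) if mt)  # counts in MT genes
    qc.append({
        "n_genes_by_counts": n_genes,
        "total_counts": total,
        "pct_counts_mt": 100.0 * mt_total / total,
    })

for cell in qc:
    print(cell)
```

Both toy cells have the same depth and the same number of detected genes, yet the second one spends 90% of its reads on mitochondrial genes, which is the pattern that usually marks a damaged cell.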

5. Filter Low-Quality Cells and Genes

Now, we filter out cells and genes that do not meet quality thresholds.

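# Keep cells with at least 200 detected genes and genes detected in at least 3 cells
sc.pp.filter_cells(pbmc, min_genes=200)
sc.pp.filter_genes(pbmc, min_cells=3)

# If a mitochondrial percentage was computed, also drop cells above 5%
# (a common starting cutoff; adjust it for your tissue)
if 'pct_counts_mt' in pbmc.obs:
    pbmc = pbmc[pbmc.obs['pct_counts_mt'] < 5].copy()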

# Check the filtered dataset
print(f"Number of cells: {pbmc.n_obs}")
print(f"Number of genes: {pbmc.n_vars}")

Why? Filtering ensures that our analysis focuses on high-quality data, improving the reliability of results.
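The two thresholds are easy to reason about on a small example. The sketch below applies the same idea (a minimum number of detected genes per cell, then a minimum number of cells per gene) to a toy matrix in plain Python. The matrix and the tiny thresholds are illustrative assumptions chosen so the effect is visible; the tutorial itself uses 200 and 3.

```python
# Toy count matrix: 4 cells (rows) x 4 genes (columns), hypothetical values.
counts = [
    [3, 0, 1, 0],
    [0, 0, 2, 0],
    [5, 1, 1, 0],
    [2, 0, 0, 0],
]
MIN_GENES = 2  # keep cells with at least this many detected genes
MIN_CELLS = 2  # keep genes detected in at least this many cells

# Step 1: drop cells that express too few genes.
kept_cells = [row for row in counts if sum(c > 0 for c in row) >= MIN_GENES]

# Step 2: among the remaining cells, drop genes seen in too few cells.
n_genes = len(counts[0])
kept_gene_idx = [
    j for j in range(n_genes)
    if sum(row[j] > 0 for row in kept_cells) >= MIN_CELLS
]

filtered = [[row[j] for j in kept_gene_idx] for row in kept_cells]
print(filtered)
```

Order matters: filtering cells first changes how many cells each gene is detected in, so the gene filter is evaluated on the cells that survived.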

6. Normalize the Data

Normalization is necessary to account for differences in sequencing depth between cells.

# Normalize the data
sc.pp.normalize_total(pbmc, target_sum=1e4)
sc.pp.log1p(pbmc)

# Check normalized data (X is a sparse matrix, so convert the slice to a dense array for printing)
print(pbmc.X[:5, :5].toarray())

Why? Normalization ensures that gene expression values are comparable across cells, which is essential for downstream clustering and visualization.
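As a worked example of what these two calls do, the sketch below rescales each cell so its counts sum to 10,000 and then applies log(1 + x). The two toy cells are hypothetical: they have the same expression profile, but one was sequenced ten times deeper. This reproduces the arithmetic only, not Scanpy's sparse-matrix implementation.

```python
import math

TARGET_SUM = 1e4  # same value as target_sum in the tutorial

# Two hypothetical cells with identical proportions but different depth.
cells = {
    "shallow_cell": [1, 3, 0, 6],
    "deep_cell": [10, 30, 0, 60],
}

def normalize_and_log(counts, target_sum=TARGET_SUM):
    """Scale counts to sum to target_sum, then take log(1 + x)."""
    total = sum(counts)
    scaled = [c * target_sum / total for c in counts]
    return [math.log1p(v) for v in scaled]

normalized = {name: normalize_and_log(c) for name, c in cells.items()}

for name, values in normalized.items():
    print(name, [round(v, 3) for v in values])
```

After normalization the two cells have matching values, which is the point: differences in sequencing depth no longer look like differences in expression. Zero counts stay at zero because log1p(0) is 0.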

7. Identify Highly Variable Genes

Highly variable genes are important for identifying cell types and distinguishing between clusters.

# Identify highly variable genes
sc.pp.highly_variable_genes(pbmc, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(pbmc)

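# Keep the full log-normalized matrix for later plotting and marker tests,
# then subset to highly variable genes (.copy() avoids working on a view)
pbmc.raw = pbmc.copy()
pbmc = pbmc[:, pbmc.var.highly_variable].copy()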

Why? Selecting highly variable genes reduces noise in the data and focuses the analysis on genes that are most informative for cell type identification.
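The idea behind "highly variable" can be shown with dispersion, meaning variance divided by mean. The sketch below is a deliberate simplification with made-up expression values: Scanpy's default method additionally bins genes by mean expression and normalizes the dispersion within each bin, which is omitted here.

```python
from statistics import mean, variance

# Hypothetical expression of two genes across six cells.
expression = {
    "housekeeping_gene": [5.0, 5.0, 6.0, 5.0, 4.0, 5.0],  # similar everywhere
    "marker_gene": [0.0, 0.0, 12.0, 0.0, 11.0, 7.0],      # on in some cells only
}

def dispersion(values):
    """Variance-to-mean ratio; 0.0 for genes that are never expressed."""
    m = mean(values)
    return variance(values) / m if m > 0 else 0.0

# Rank genes from most to least variable.
ranked = sorted(expression, key=lambda g: dispersion(expression[g]), reverse=True)

for gene in ranked:
    print(gene, round(dispersion(expression[gene]), 2))
```

Both genes have the same mean expression, so only the spread separates them. A gene that is switched on in a subset of cells carries information about cell identity, while a uniformly expressed gene does not.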

8. Perform Dimensionality Reduction and Clustering

We now reduce the dimensionality of the data and cluster cells to identify distinct cell populations.

# Run PCA for dimensionality reduction
sc.tl.pca(pbmc, svd_solver='arpack')

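# Compute the neighborhood graph on the top 40 principal components
# (using only a handful of PCs would discard most of the structure)
sc.pp.neighbors(pbmc, n_neighbors=10, n_pcs=40)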

# Run UMAP for visualization
sc.tl.umap(pbmc)

# Cluster the cells
sc.tl.leiden(pbmc, resolution=0.5)

# Visualize the clusters
sc.pl.umap(pbmc, color='leiden', legend_loc='on data')

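Why? PCA compresses thousands of genes into a few dozen components, the neighborhood graph connects each cell to its most similar cells in that reduced space, UMAP lays the graph out in 2D for plotting, and the Leiden algorithm groups densely connected cells into clusters.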
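The neighborhood graph is the piece that both UMAP and Leiden build on, so it helps to see one at toy scale. The sketch below constructs an exact k-nearest-neighbor graph with Euclidean distance on six hypothetical cells placed in a 2D "PC space". Scanpy's sc.pp.neighbors differs in the details (it uses approximate search and weights the edges), so read this as an illustration of the concept only.

```python
import math

# Hypothetical coordinates of six cells in a 2D reduced space.
points = {
    "cell_a": (0.0, 0.0),
    "cell_b": (0.1, 0.2),
    "cell_c": (0.2, 0.0),
    "cell_d": (5.0, 5.0),
    "cell_e": (5.1, 4.9),
    "cell_f": (4.8, 5.2),
}
K = 2  # neighbors per cell (the tutorial uses n_neighbors=10)

def k_nearest(name, k=K):
    """Return the k closest other cells by Euclidean distance."""
    distances = [
        (math.dist(points[name], xy), other)
        for other, xy in points.items()
        if other != name
    ]
    return [other for _, other in sorted(distances)[:k]]

knn_graph = {name: k_nearest(name) for name in points}

for name, neighbors in knn_graph.items():
    print(name, "->", neighbors)
```

Every edge stays inside one of the two groups, so the graph falls into two disconnected communities. A graph clustering algorithm such as Leiden looks for exactly this kind of densely connected group.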

9. Annotate Cell Types

Once we have clusters, we need to assign biological meaning to them by annotating cell types.

# Rank genes that distinguish each cluster from the rest (Wilcoxon rank-sum test)
sc.tl.rank_genes_groups(pbmc, groupby='leiden', method='wilcoxon')

# Visualize the top-ranked marker genes per cluster
sc.pl.rank_genes_groups_dotplot(pbmc, n_genes=4)

Why? Cell type annotation allows us to interpret the biological significance of our clusters and understand what cell populations are present in the sample.
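Turning marker genes into labels is usually a matter of comparing each cluster's expression against known markers. The sketch below scores clusters against a short list of well-known PBMC markers and picks the best match. The per-cluster mean values and the cluster numbers are invented for illustration, and the scoring rule (average marker expression) is an assumption of this sketch, not a Scanpy function.

```python
# Well-known PBMC marker genes per cell type (a short, non-exhaustive list).
MARKERS = {
    "CD4 T cells": ["IL7R"],
    "CD14+ monocytes": ["CD14", "LYZ"],
    "B cells": ["MS4A1"],
    "NK cells": ["GNLY", "NKG7"],
}

# Hypothetical mean log-expression of each marker in three clusters.
cluster_means = {
    "0": {"IL7R": 1.8, "CD14": 0.0, "LYZ": 0.3, "MS4A1": 0.0, "GNLY": 0.1, "NKG7": 0.2},
    "1": {"IL7R": 0.1, "CD14": 1.5, "LYZ": 3.9, "MS4A1": 0.0, "GNLY": 0.0, "NKG7": 0.1},
    "2": {"IL7R": 0.1, "CD14": 0.0, "LYZ": 0.2, "MS4A1": 2.2, "GNLY": 0.0, "NKG7": 0.0},
}

def best_label(means):
    """Pick the cell type whose markers have the highest average expression."""
    scores = {
        label: sum(means[g] for g in marker_genes) / len(marker_genes)
        for label, marker_genes in MARKERS.items()
    }
    return max(scores, key=scores.get)

annotation = {cluster: best_label(means) for cluster, means in cluster_means.items()}
print(annotation)
```

With a mapping like this in hand, the labels can be attached to the cells with something like pbmc.obs['cell_type'] = pbmc.obs['leiden'].map(annotation). Cluster numbers change between runs and parameter settings, so check the marker plots before reusing any mapping.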

10. Discover Cell Trajectories

Finally, we can explore how cells transition from one state to another using trajectory inference.

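import numpy as np

# Run PAGA to summarize the connectivity between clusters
sc.tl.paga(pbmc, groups='leiden')
sc.pl.paga(pbmc, color='leiden')

# Compute diffusion pseudotime. DPT needs a root cell; here we arbitrarily take
# the first cell of cluster '0'. Pick a biologically motivated root for real data.
pbmc.uns['iroot'] = np.flatnonzero(pbmc.obs['leiden'] == '0')[0]
sc.tl.diffmap(pbmc)
sc.tl.dpt(pbmc)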

# Visualize trajectory
sc.pl.diffmap(pbmc, color='dpt_pseudotime')

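Why? Trajectory analysis orders cells along a continuous process such as development or differentiation, using pseudotime as a proxy for how far each cell has progressed. PBMCs are largely mature circulating cells, so treat the result here as a demonstration of the workflow; the same calls are more informative on datasets that capture differentiating cells.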
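Pseudotime is easier to grasp with a much simpler stand-in than diffusion pseudotime. The sketch below does not implement DPT: it orders the nodes of a small hypothetical cell graph by their hop distance from a chosen root, using breadth-first search, and rescales the result to the range 0 to 1. The graph and node names are invented for illustration.

```python
from collections import deque

# Hypothetical neighbor graph: a path from "root" that branches at "b".
graph = {
    "root": ["a"],
    "a": ["root", "b"],
    "b": ["a", "c", "d"],
    "c": ["b"],
    "d": ["b", "e"],
    "e": ["d"],
}

def hop_pseudotime(graph, root):
    """Breadth-first hop distance from root, scaled so the farthest node is 1.0."""
    distance = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    longest = max(distance.values())
    return {node: d / longest for node, d in distance.items()}

pseudotime = hop_pseudotime(graph, "root")
for node, t in sorted(pseudotime.items(), key=lambda item: item[1]):
    print(node, t)
```

The choice of root determines the whole ordering, which is why diffusion pseudotime in Scanpy also requires a root cell. DPT replaces the hop count with distances derived from random walks on the graph, which makes it far more robust to noise than this toy version.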

Summary

In this tutorial, you've learned how to build a complete single-cell RNA-seq analysis pipeline using Scanpy. You've performed quality control, normalized data, identified variable genes, clustered cells, annotated cell types, and discovered cell trajectories. This workflow is a solid foundation for analyzing scRNA-seq data and can be adapted to various biological datasets.

With these skills, you're now ready to explore your own scRNA-seq datasets and uncover the complex cellular dynamics hidden within them.