Build a custom scRNAseq reference
Table of Contents
Building a Custom Cell Ranger Reference (GRCh38 + HSV viral genome)
Single-cell RNA-seq workflows live and die by their reference genome. Most of the time, the default Cell Ranger references are perfectly fine — until they’re not. The moment you introduce a custom construct, viral gene, reporter, or engineered transcript, you’re in “roll your own reference” territory.
This post walks through a clean, reproducible pipeline for building a custom Cell Ranger reference by extending GRCh38 with a custom gene. Along the way, I’ll explain the why behind each step and highlight a few sharp edges that can save you hours of debugging.
High-level plan
- Download canonical genome and annotation sources
- Normalize FASTA headers to Cell Ranger expectations
- Append a custom gene sequence
- Clean up GTF IDs (a quiet but critical step)
- Add a properly formatted custom GTF entry
- Build the reference with cellranger mkref
Step 0 — Metadata & structure
GENOME_NAME="GRCh38-HSV"
REF_VERSION="2024-A-custom"
SRC_DIR="reference_sources"
BUILD_DIR="build"
Why this matters:
- Cell Ranger embeds metadata into reference.json.
- Clear naming avoids confusion later when you’re running multiple references side-by-side.
- Separating sources from build artifacts keeps things tidy (and version-controllable).
Step 1 — Download the genome FASTA
FASTA_URL="http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
Why Ensembl Primary Assembly?
- Cell Ranger Compatibility: It perfectly matches the header and structure expectations of the
cellrangerpipeline. - Reduced Complexity: It avoids alternate contigs (ALTs) unless you explicitly want them, preventing multi-mapping issues.
- STAR-Friendly: It plays nicely with the STAR aligner, which is the engine running under the hood of Cell Ranger.
- Persistent Workflow: The FASTA is downloaded once and reused—meaning no silent re-downloads or version drift mid-project.
Step 2 — Download the GTF
GTF_URL="http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.primary_assembly.annotation.gtf.gz"
Why GENCODE v44?
- It’s current
- It’s curated
- Cell Ranger works best with GENCODE annotations
FASTA and GTF don’t have to come from the same provider, but they must be from compatible genome builds (GRCh38 here).
Step 3 — Fix FASTA chromosome headers
sed -E 's/^>(\S+).*/>\1 \1/'
sed -E 's/^>([0-9]+|[XY]) />chr\1 /'
sed -E 's/^>MT />chrM /'
Why this matters
- Cell Ranger / STAR expect chr1, chrX, chrM
- Ensembl FASTAs don’t always match that convention
- Mismatched names = silent mapping failures or mkref errors
Step 4 — Add the custom gene sequence
awk '/^>/ {print; next} {print toupper($0)}' "$CUSTOM_FASTA" >> "$FASTA_MOD"
Keys here:
- Uppercase sequence
- Append directly to the genome FASTA
- Header remains unchanged, so contig name = HSV
I also explicitly sanity-check:
- Number of FASTA entries
- Non-ACGT characters
- The tail of the file
Step 5 — Fix GTF IDs (quietly essential)
sed -E 's/gene_id "ENSG...";/gene_id "ENSG..."; gene_version "...";/'
This step looks boring. It is not.
GENCODE encodes versions like:
gene_id "ENSG000001234.5";
Cell Ranger expects:
gene_id "ENSG000001234";
gene_version "5";
This step normalizes gene, transcript, and exon IDs.
Step 6 — Add the custom gene annotation
Cell Ranger requires a proper GTF annotation that describes how reads mapping to that sequence should be counted.
Cell Ranger does not simply count reads mapped to a contig. Instead, it:
- Aligns reads to the genome (STAR)
- Assigns reads to exons
- Groups exons into transcripts
- Reports counts at the gene level As a result, a valid custom GTF must define a complete hierarchy:
gene → transcript → exon
For a custom HSV gene to be recognized and counted, the GTF must satisfy:
- Exactly 9 tab-separated columns
- A matching contig name between FASTA and GTF (e.g. HSV1_UL54)
- One gene entry
- At least one transcript entry
- At least one exon entry
Minimal working example
HSV1_UL54 custom gene 1 3948 . + . gene_id "HSV1_UL54"; gene_name "ICP27"; gene_type "protein_coding";
HSV1_UL54 custom transcript 1 3948 . + . gene_id "HSV1_UL54"; transcript_id "HSV1_UL54-1"; transcript_type "protein_coding";
HSV1_UL54 custom exon 1 3948 . + . gene_id "HSV1_UL54"; transcript_id "HSV1_UL54-1"; exon_id "HSV1_UL54-1.exon1";
So, add custom gtf as
cat custom/HSV.gtf >> "$GTF_MOD"
The key rule here is: A valid GTF line has exactly 9 tab-separated columns.
You can explicitly verify this with
awk -F'\t' 'NF!=9 {print NR, $0}' "$GTF_MOD"
If this prints nothing, you’re golden. This check alone saved me from a multi-hour debugging session caused by invisible whitespace.
Step 7 — Build the reference
cellranger mkref \
--genome="$GENOME_NAME" \
--ref-version="$REF_VERSION" \
--fasta="$FASTA_MOD" \
--genes="$GTF_MOD" \
--nthreads=16
If successful, you’ll see a directory like:
GRCh38-HSV/
├── fasta/
├── genes/
├── star/
└── reference.json
At this point, you can plug it straight into:
cellranger count --transcriptome=GRCh38-HSV ...
Finally, here’s the full bash script—ready to run, fully automated, and easy to reproduce later.
#!/usr/bin/env bash
set -euo pipefail
echo "=== Building custom Cell Ranger reference ==="
########################################
# STEP 0 — Metadata & directories
########################################
GENOME_NAME="GRCh38-HSV"
REF_VERSION="2024-A-custom"
SRC_DIR="reference_sources"
BUILD_DIR="build"
mkdir -p "$SRC_DIR" "$BUILD_DIR"
########################################
# STEP 1 — Download genome FASTA
########################################
FASTA_URL="http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
FASTA_RAW="$SRC_DIR/genome.fa"
if [ ! -f "$FASTA_RAW" ]; then
echo "Downloading genome FASTA..."
curl -sS "$FASTA_URL" | zcat > "$FASTA_RAW"
fi
# manual check
ls -lh "$FASTA_RAW"
########################################
# STEP 2 — Download GTF
########################################
GTF_URL="http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.primary_assembly.annotation.gtf.gz"
GTF_RAW="$SRC_DIR/genes.gtf"
if [ ! -f "$GTF_RAW" ]; then
echo "Downloading GTF..."
curl -L "$GTF_URL" -o "$GTF_RAW.gz"
gunzip -c "$GTF_RAW.gz" > "$GTF_RAW"
fi
########################################
# STEP 3 — Fix FASTA chromosome names
########################################
FASTA_MOD="$BUILD_DIR/genome.fa"
echo "Fixing FASTA headers..."
cat "$FASTA_RAW" \
| sed -E 's/^>(\S+).*/>\1 \1/' \
| sed -E 's/^>([0-9]+|[XY]) />chr\1 /' \
| sed -E 's/^>MT />chrM /' \
> "$FASTA_MOD"
# check
grep '^>' "$FASTA_MOD" | head
########################################
# STEP 4 — Add custom gene sequence
########################################
CUSTOM_DIR="custom"
CUSTOM_FASTA="$CUSTOM_DIR/HSV.fa"
echo "Adding HSV sequence from file..."
# Convert sequence to uppercase (leave header unchanged) and append
awk '/^>/ {print; next} {print toupper($0)}' "$CUSTOM_FASTA" >> "$FASTA_MOD"
# checks
tail -n 5 "$FASTA_MOD"
grep -c '^>' "$FASTA_MOD"
grep -v '^>' "$FASTA_MOD" | grep -Ev '^[ACGTN]+$'
########################################
# STEP 5 — Fix GTF IDs
########################################
GTF_MOD="$BUILD_DIR/genes.gtf"
ID="(ENS(MUS)?[GTE][0-9]+)\.([0-9]+)"
echo "Fixing GTF IDs..."
cat "$GTF_RAW" \
| sed -E 's/gene_id "'"$ID"'";/gene_id "\1"; gene_version "\3";/' \
| sed -E 's/transcript_id "'"$ID"'";/transcript_id "\1"; transcript_version "\3";/' \
| sed -E 's/exon_id "'"$ID"'";/exon_id "\1"; exon_version "\3";/' \
> "$GTF_MOD"
########################################
# STEP 6 — Add custom gene GTF
########################################
echo "Adding HSV annotation with tabs..."
cat custom/HSV.gtf >> "$GTF_MOD"
# checks
cat -T "$GTF_MOD" | tail -n 5
awk -F'\t' 'NF!=9 {print NR, $0}' "$GTF_MOD"
########################################
# STEP 7 — Build Cell Ranger reference
########################################
ml CellRanger/9.0.0
echo "Running cellranger mkref..."
cellranger mkref \
--genome="$GENOME_NAME" \
--ref-version="$REF_VERSION" \
--fasta="$FASTA_MOD" \
--genes="$GTF_MOD" \
--nthreads=16