Table of Contents

Building a Custom Cell Ranger Reference (GRCh38 + HSV viral genome)

Single-cell RNA-seq workflows live and die by their reference genome. Most of the time, the default Cell Ranger references are perfectly fine — until they’re not. The moment you introduce a custom construct, viral gene, reporter, or engineered transcript, you’re in “roll your own reference” territory.

This post walks through a clean, reproducible pipeline for building a custom Cell Ranger reference by extending GRCh38 with a custom gene. Along the way, I’ll explain the why behind each step and highlight a few sharp edges that can save you hours of debugging.

High-level plan

Download canonical genome and annotation sources
Normalize FASTA headers to Cell Ranger expectations
Append a custom gene sequence
Clean up GTF IDs (a quiet but critical step)
Add a properly formatted custom GTF entry
Build the reference with cellranger mkref

Step 0 — Metadata & structure

GENOME_NAME="GRCh38-HSV"
REF_VERSION="2024-A-custom"

SRC_DIR="reference_sources"
BUILD_DIR="build"

Why this matters:

Cell Ranger embeds metadata into reference.json.
Clear naming avoids confusion later when you’re running multiple references side-by-side.
Separating sources from build artifacts keeps things tidy (and version-controllable).

Step 1 — Download the genome FASTA

FASTA_URL="http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"

Why Ensembl Primary Assembly?

Cell Ranger Compatibility: It perfectly matches the header and structure expectations of the cellranger pipeline.
Reduced Complexity: It avoids alternate contigs (ALTs) unless you explicitly want them, preventing multi-mapping issues.
STAR-Friendly: It plays nicely with the STAR aligner, which is the engine running under the hood of Cell Ranger.
Persistent Workflow: The FASTA is downloaded once and reused—meaning no silent re-downloads or version drift mid-project.

Step 2 — Download the GTF

GTF_URL="http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.primary_assembly.annotation.gtf.gz"

Why GENCODE v44?

It’s current
It’s curated
Cell Ranger works best with GENCODE annotations

FASTA and GTF don’t have to come from the same provider, but they must be from compatible genome builds (GRCh38 here).

Step 3 — Fix FASTA chromosome headers

sed -E 's/^>(\S+).*/>\1 \1/'
sed -E 's/^>([0-9]+|[XY]) />chr\1 /'
sed -E 's/^>MT />chrM /'

Why this matters

Cell Ranger / STAR expect chr1, chrX, chrM
Ensembl FASTAs don’t always match that convention
Mismatched names = silent mapping failures or mkref errors

Step 4 — Add the custom gene sequence

awk '/^>/ {print; next} {print toupper($0)}' "$CUSTOM_FASTA" >> "$FASTA_MOD"

Keys here:

Uppercase sequence
Append directly to the genome FASTA
Header remains unchanged, so contig name = HSV

I also explicitly sanity-check:

Number of FASTA entries
Non-ACGT characters
The tail of the file

Step 5 — Fix GTF IDs (quietly essential)

sed -E 's/gene_id "ENSG...";/gene_id "ENSG..."; gene_version "...";/'

This step looks boring. It is not.

GENCODE encodes versions like:

gene_id "ENSG000001234.5";

Cell Ranger expects:

gene_id "ENSG000001234";
gene_version "5";

This step normalizes gene, transcript, and exon IDs.

Step 6 — Add the custom gene annotation

Cell Ranger requires a proper GTF annotation that describes how reads mapping to that sequence should be counted.

Cell Ranger does not simply count reads mapped to a contig. Instead, it:

Aligns reads to the genome (STAR)
Assigns reads to exons
Groups exons into transcripts
Reports counts at the gene level As a result, a valid custom GTF must define a complete hierarchy:

gene → transcript → exon

For a custom HSV gene to be recognized and counted, the GTF must satisfy:

Exactly 9 tab-separated columns
A matching contig name between FASTA and GTF (e.g. HSV1_UL54)
One gene entry
At least one transcript entry
At least one exon entry

Minimal working example

HSV1_UL54	custom	gene	1	3948	.	+	.	gene_id "HSV1_UL54"; gene_name "ICP27"; gene_type "protein_coding";
HSV1_UL54	custom	transcript	1	3948	.	+	.	gene_id "HSV1_UL54"; transcript_id "HSV1_UL54-1"; transcript_type "protein_coding";
HSV1_UL54	custom	exon	1	3948	.	+	.	gene_id "HSV1_UL54"; transcript_id "HSV1_UL54-1"; exon_id "HSV1_UL54-1.exon1";

So, add custom gtf as

cat custom/HSV.gtf >> "$GTF_MOD"

The key rule here is: A valid GTF line has exactly 9 tab-separated columns.

You can explicitly verify this with

awk -F'\t' 'NF!=9 {print NR, $0}' "$GTF_MOD"

If this prints nothing, you’re golden. This check alone saved me from a multi-hour debugging session caused by invisible whitespace.

Step 7 — Build the reference

cellranger mkref \
  --genome="$GENOME_NAME" \
  --ref-version="$REF_VERSION" \
  --fasta="$FASTA_MOD" \
  --genes="$GTF_MOD" \
  --nthreads=16

If successful, you’ll see a directory like:

GRCh38-HSV/
├── fasta/
├── genes/
├── star/
└── reference.json

At this point, you can plug it straight into:

cellranger count --transcriptome=GRCh38-HSV ...

Finally, here’s the full bash script—ready to run, fully automated, and easy to reproduce later.

#!/usr/bin/env bash
set -euo pipefail

echo "=== Building custom Cell Ranger reference ==="

########################################
# STEP 0 — Metadata & directories
########################################
GENOME_NAME="GRCh38-HSV"
REF_VERSION="2024-A-custom"

SRC_DIR="reference_sources"
BUILD_DIR="build"

mkdir -p "$SRC_DIR" "$BUILD_DIR"

########################################
# STEP 1 — Download genome FASTA
########################################
FASTA_URL="http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
FASTA_RAW="$SRC_DIR/genome.fa"

if [ ! -f "$FASTA_RAW" ]; then
  echo "Downloading genome FASTA..."
  curl -sS "$FASTA_URL" | zcat > "$FASTA_RAW"
fi

# manual check
ls -lh "$FASTA_RAW"

########################################
# STEP 2 — Download GTF
########################################
GTF_URL="http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.primary_assembly.annotation.gtf.gz"
GTF_RAW="$SRC_DIR/genes.gtf"

if [ ! -f "$GTF_RAW" ]; then
  echo "Downloading GTF..."
  curl -L "$GTF_URL" -o "$GTF_RAW.gz"
  gunzip -c "$GTF_RAW.gz" > "$GTF_RAW"
fi

########################################
# STEP 3 — Fix FASTA chromosome names
########################################
FASTA_MOD="$BUILD_DIR/genome.fa"

echo "Fixing FASTA headers..."
cat "$FASTA_RAW" \
  | sed -E 's/^>(\S+).*/>\1 \1/' \
  | sed -E 's/^>([0-9]+|[XY]) />chr\1 /' \
  | sed -E 's/^>MT />chrM /' \
  > "$FASTA_MOD"

# check
grep '^>' "$FASTA_MOD" | head

########################################
# STEP 4 — Add custom gene sequence
########################################
CUSTOM_DIR="custom"
CUSTOM_FASTA="$CUSTOM_DIR/HSV.fa"

echo "Adding HSV sequence from file..."

# Convert sequence to uppercase (leave header unchanged) and append
awk '/^>/ {print; next} {print toupper($0)}' "$CUSTOM_FASTA" >> "$FASTA_MOD"

# checks
tail -n 5 "$FASTA_MOD"
grep -c '^>' "$FASTA_MOD"
grep -v '^>' "$FASTA_MOD" | grep -Ev '^[ACGTN]+$'

########################################
# STEP 5 — Fix GTF IDs
########################################
GTF_MOD="$BUILD_DIR/genes.gtf"
ID="(ENS(MUS)?[GTE][0-9]+)\.([0-9]+)"

echo "Fixing GTF IDs..."
cat "$GTF_RAW" \
  | sed -E 's/gene_id "'"$ID"'";/gene_id "\1"; gene_version "\3";/' \
  | sed -E 's/transcript_id "'"$ID"'";/transcript_id "\1"; transcript_version "\3";/' \
  | sed -E 's/exon_id "'"$ID"'";/exon_id "\1"; exon_version "\3";/' \
  > "$GTF_MOD"

########################################
# STEP 6 — Add custom gene GTF
########################################
echo "Adding HSV annotation with tabs..."
cat custom/HSV.gtf >> "$GTF_MOD"

# checks
cat -T "$GTF_MOD" | tail -n 5
awk -F'\t' 'NF!=9 {print NR, $0}' "$GTF_MOD"

########################################
# STEP 7 — Build Cell Ranger reference
########################################
ml CellRanger/9.0.0

echo "Running cellranger mkref..."
cellranger mkref \
  --genome="$GENOME_NAME" \
  --ref-version="$REF_VERSION" \
  --fasta="$FASTA_MOD" \
  --genes="$GTF_MOD" \
  --nthreads=16