Build a custom scRNAseq reference

Table of Contents

Building a Custom Cell Ranger Reference (GRCh38 + HSV viral genome)

Single-cell RNA-seq workflows live and die by their reference genome. Most of the time, the default Cell Ranger references are perfectly fine — until they’re not. The moment you introduce a custom construct, viral gene, reporter, or engineered transcript, you’re in “roll your own reference” territory.

This post walks through a clean, reproducible pipeline for building a custom Cell Ranger reference by extending GRCh38 with a custom gene. Along the way, I’ll explain the why behind each step and highlight a few sharp edges that can save you hours of debugging.

High-level plan

  1. Download canonical genome and annotation sources
  2. Normalize FASTA headers to Cell Ranger expectations
  3. Append a custom gene sequence
  4. Clean up GTF IDs (a quiet but critical step)
  5. Add a properly formatted custom GTF entry
  6. Build the reference with cellranger mkref

Step 0 — Metadata & structure

GENOME_NAME="GRCh38-HSV"
REF_VERSION="2024-A-custom"

SRC_DIR="reference_sources"
BUILD_DIR="build"

Why this matters:

  • Cell Ranger embeds metadata into reference.json.
  • Clear naming avoids confusion later when you’re running multiple references side-by-side.
  • Separating sources from build artifacts keeps things tidy (and version-controllable).

Step 1 — Download the genome FASTA

FASTA_URL="http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"

Why Ensembl Primary Assembly?

  • Cell Ranger Compatibility: It perfectly matches the header and structure expectations of the cellranger pipeline.
  • Reduced Complexity: It avoids alternate contigs (ALTs) unless you explicitly want them, preventing multi-mapping issues.
  • STAR-Friendly: It plays nicely with the STAR aligner, which is the engine running under the hood of Cell Ranger.
  • Persistent Workflow: The FASTA is downloaded once and reused—meaning no silent re-downloads or version drift mid-project.

Step 2 — Download the GTF

GTF_URL="http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.primary_assembly.annotation.gtf.gz"

Why GENCODE v44?

  • It’s current
  • It’s curated
  • Cell Ranger works best with GENCODE annotations

FASTA and GTF don’t have to come from the same provider, but they must be from compatible genome builds (GRCh38 here).

Step 3 — Fix FASTA chromosome headers

sed -E 's/^>(\S+).*/>\1 \1/'
sed -E 's/^>([0-9]+|[XY]) />chr\1 /'
sed -E 's/^>MT />chrM /'

Why this matters

  • Cell Ranger / STAR expect chr1, chrX, chrM
  • Ensembl FASTAs don’t always match that convention
  • Mismatched names = silent mapping failures or mkref errors

Step 4 — Add the custom gene sequence

awk '/^>/ {print; next} {print toupper($0)}' "$CUSTOM_FASTA" >> "$FASTA_MOD"

Keys here:

  • Uppercase sequence
  • Append directly to the genome FASTA
  • Header remains unchanged, so contig name = HSV

I also explicitly sanity-check:

  • Number of FASTA entries
  • Non-ACGT characters
  • The tail of the file

Step 5 — Fix GTF IDs (quietly essential)

sed -E 's/gene_id "ENSG...";/gene_id "ENSG..."; gene_version "...";/'

This step looks boring. It is not.

GENCODE encodes versions like:

gene_id "ENSG000001234.5";

Cell Ranger expects:

gene_id "ENSG000001234";
gene_version "5";

This step normalizes gene, transcript, and exon IDs.

Step 6 — Add the custom gene annotation

Cell Ranger requires a proper GTF annotation that describes how reads mapping to that sequence should be counted.

Cell Ranger does not simply count reads mapped to a contig. Instead, it:

  1. Aligns reads to the genome (STAR)
  2. Assigns reads to exons
  3. Groups exons into transcripts
  4. Reports counts at the gene level As a result, a valid custom GTF must define a complete hierarchy:
gene  transcript  exon

For a custom HSV gene to be recognized and counted, the GTF must satisfy:

  1. Exactly 9 tab-separated columns
  2. A matching contig name between FASTA and GTF (e.g. HSV1_UL54)
  3. One gene entry
  4. At least one transcript entry
  5. At least one exon entry

Minimal working example

HSV1_UL54	custom	gene	1	3948	.	+	.	gene_id "HSV1_UL54"; gene_name "ICP27"; gene_type "protein_coding";
HSV1_UL54	custom	transcript	1	3948	.	+	.	gene_id "HSV1_UL54"; transcript_id "HSV1_UL54-1"; transcript_type "protein_coding";
HSV1_UL54	custom	exon	1	3948	.	+	.	gene_id "HSV1_UL54"; transcript_id "HSV1_UL54-1"; exon_id "HSV1_UL54-1.exon1";

So, add custom gtf as

cat custom/HSV.gtf >> "$GTF_MOD"

The key rule here is: A valid GTF line has exactly 9 tab-separated columns.

You can explicitly verify this with

awk -F'\t' 'NF!=9 {print NR, $0}' "$GTF_MOD"

If this prints nothing, you’re golden. This check alone saved me from a multi-hour debugging session caused by invisible whitespace.

Step 7 — Build the reference

cellranger mkref \
  --genome="$GENOME_NAME" \
  --ref-version="$REF_VERSION" \
  --fasta="$FASTA_MOD" \
  --genes="$GTF_MOD" \
  --nthreads=16

If successful, you’ll see a directory like:

GRCh38-HSV/
├── fasta/
├── genes/
├── star/
└── reference.json

At this point, you can plug it straight into:

cellranger count --transcriptome=GRCh38-HSV ...

Finally, here’s the full bash script—ready to run, fully automated, and easy to reproduce later.

#!/usr/bin/env bash
set -euo pipefail

echo "=== Building custom Cell Ranger reference ==="

########################################
# STEP 0 — Metadata & directories
########################################
GENOME_NAME="GRCh38-HSV"
REF_VERSION="2024-A-custom"

SRC_DIR="reference_sources"
BUILD_DIR="build"

mkdir -p "$SRC_DIR" "$BUILD_DIR"

########################################
# STEP 1 — Download genome FASTA
########################################
FASTA_URL="http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
FASTA_RAW="$SRC_DIR/genome.fa"

if [ ! -f "$FASTA_RAW" ]; then
  echo "Downloading genome FASTA..."
  curl -sS "$FASTA_URL" | zcat > "$FASTA_RAW"
fi

# manual check
ls -lh "$FASTA_RAW"

########################################
# STEP 2 — Download GTF
########################################
GTF_URL="http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.primary_assembly.annotation.gtf.gz"
GTF_RAW="$SRC_DIR/genes.gtf"

if [ ! -f "$GTF_RAW" ]; then
  echo "Downloading GTF..."
  curl -L "$GTF_URL" -o "$GTF_RAW.gz"
  gunzip -c "$GTF_RAW.gz" > "$GTF_RAW"
fi

########################################
# STEP 3 — Fix FASTA chromosome names
########################################
FASTA_MOD="$BUILD_DIR/genome.fa"

echo "Fixing FASTA headers..."
cat "$FASTA_RAW" \
  | sed -E 's/^>(\S+).*/>\1 \1/' \
  | sed -E 's/^>([0-9]+|[XY]) />chr\1 /' \
  | sed -E 's/^>MT />chrM /' \
  > "$FASTA_MOD"

# check
grep '^>' "$FASTA_MOD" | head

########################################
# STEP 4 — Add custom gene sequence
########################################
CUSTOM_DIR="custom"
CUSTOM_FASTA="$CUSTOM_DIR/HSV.fa"

echo "Adding HSV sequence from file..."

# Convert sequence to uppercase (leave header unchanged) and append
awk '/^>/ {print; next} {print toupper($0)}' "$CUSTOM_FASTA" >> "$FASTA_MOD"

# checks
tail -n 5 "$FASTA_MOD"
grep -c '^>' "$FASTA_MOD"
grep -v '^>' "$FASTA_MOD" | grep -Ev '^[ACGTN]+$'

########################################
# STEP 5 — Fix GTF IDs
########################################
GTF_MOD="$BUILD_DIR/genes.gtf"
ID="(ENS(MUS)?[GTE][0-9]+)\.([0-9]+)"

echo "Fixing GTF IDs..."
cat "$GTF_RAW" \
  | sed -E 's/gene_id "'"$ID"'";/gene_id "\1"; gene_version "\3";/' \
  | sed -E 's/transcript_id "'"$ID"'";/transcript_id "\1"; transcript_version "\3";/' \
  | sed -E 's/exon_id "'"$ID"'";/exon_id "\1"; exon_version "\3";/' \
  > "$GTF_MOD"

########################################
# STEP 6 — Add custom gene GTF
########################################
echo "Adding HSV annotation with tabs..."
cat custom/HSV.gtf >> "$GTF_MOD"

# checks
cat -T "$GTF_MOD" | tail -n 5
awk -F'\t' 'NF!=9 {print NR, $0}' "$GTF_MOD"

########################################
# STEP 7 — Build Cell Ranger reference
########################################
ml CellRanger/9.0.0

echo "Running cellranger mkref..."
cellranger mkref \
  --genome="$GENOME_NAME" \
  --ref-version="$REF_VERSION" \
  --fasta="$FASTA_MOD" \
  --genes="$GTF_MOD" \
  --nthreads=16