HSV expert - a RAG LLM

Building a Custom RAG Agent for HSV Research: Lessons from the Trenches

The research literature is growing exponentially, and it's easy to feel overwhelmed by the sheer volume of papers when you're trying to find relevant information. Sometimes you have an idea and wonder whether another genius has already explored it. Other times you're entering a new research field or tackling a fresh topic, and the lack of background knowledge forces you to scan papers broadly: deliberately, but without a clear direction. And sometimes you're deep in grant writing, where literature support and cross-references are critical, but exhausting to assemble.

What if there were a tool—a domain-specific expert that’s up-to-date, fast, and context-aware—that could answer scientific questions instantly? That would be a game changer for productivity and sanity.

With the rise of LLMs, Retrieval-Augmented Generation (RAG) seems like the perfect solution. It’s versatile and fits exactly what I needed. That’s the motivation behind building this HSV expert agent—to streamline literature searching while writing grants and doing research.

Why Build Custom Instead of Using Existing Frameworks?

This RAG agent is fully custom, not built on mainstream frameworks like LangChain or LlamaIndex. While those frameworks are powerful, I found that certain components needed fine-grained control and optimization that off-the-shelf tools couldn’t provide.

Most RAG systems follow the same architecture: document ingestion, chunking, embedding, retrieval, LLM generation, and UI. Instead of going over each piece in detail, I’ll focus on the real challenges I ran into while building it.

The Vision: An All-Knowing HSV Expert

My goal was to create an HSV expert agent with comprehensive domain knowledge—a tool that could keep me afloat amid the flood of publications. To do this, I scraped and preprocessed HSV-related papers from public sources.

PubMed/PMC was the obvious first step, but not every article is open access. So, I built an upload gateway allowing users to load their own PDFs. This not only broadened the dataset but also introduced the need to support multiple formats, particularly XML and PDF.
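
To give a flavor of the scraping step, here's a rough sketch of pulling HSV articles from PMC through NCBI's E-utilities. The search term, batch size, and output layout are placeholders rather than the exact values in my pipeline.

```python
import os
import time
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def search_pmc(term: str, retmax: int = 100) -> list[str]:
    """Return PMC IDs matching a search term via ESearch."""
    resp = requests.get(
        f"{EUTILS}/esearch.fcgi",
        params={"db": "pmc", "term": term, "retmax": retmax, "retmode": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]

def fetch_fulltext_xml(pmcid: str) -> str:
    """Download the full-text XML for one PMC article via EFetch."""
    resp = requests.get(
        f"{EUTILS}/efetch.fcgi",
        params={"db": "pmc", "id": pmcid, "retmode": "xml"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    os.makedirs("corpus", exist_ok=True)
    for pmcid in search_pmc('"herpes simplex virus"[Title/Abstract]', retmax=20):
        with open(f"corpus/PMC{pmcid}.xml", "w", encoding="utf-8") as fh:
            fh.write(fetch_fulltext_xml(pmcid))
        time.sleep(0.4)  # stay under NCBI's ~3 requests/second courtesy limit
```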

The XML Challenge

XML files have nested structures, making it tricky to extract information while preserving context—like whether a sentence came from the results or discussion section, or which subsection it belongs to.

My solution: recursively parse the document structure, and tag each sentence with section titles and subsections. This metadata is crucial for later retrieval and response accuracy.
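
Here's a minimal sketch of that recursive walk, assuming JATS-style XML like what PMC serves; the real parser also splits paragraphs into sentences and handles more edge cases.

```python
from xml.etree import ElementTree as ET

def walk_sections(elem, path=()):
    """Recursively yield {"section_path", "text"} records from JATS-style XML.

    `path` accumulates section/subsection titles, e.g. ("Results", "Viral entry").
    """
    title_el = elem.find("title")
    if elem.tag == "sec" and title_el is not None:
        path = path + ("".join(title_el.itertext()).strip(),)
    for child in elem:
        if child.tag == "p":
            text = " ".join("".join(child.itertext()).split())
            if text:
                yield {"section_path": path, "text": text}
        elif child.tag == "sec":
            yield from walk_sections(child, path)

def parse_article(xml_path):
    """Return all body paragraphs tagged with their section hierarchy."""
    body = ET.parse(xml_path).getroot().find(".//body")
    return list(walk_sections(body)) if body is not None else []
```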

The PDF Problem

PDFs are notoriously unstructured. Footnotes, captions, headers—they all float together. I tried tools like PyMuPDF and pdfplumber, but accuracy was lacking.

Eventually, I went with Adobe's commercial extraction service, which offers 500 free extractions per month. It delivered the most accurate section-level parsing of the tools I tried.
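
The extraction service returns a JSON listing of document elements, each with a tag path and its text. A rough sketch of folding that output into section-tagged passages looks like this; the "Path"/"Text" field names reflect the structuredData.json output I worked with and may differ across API versions.

```python
import json
import re

def sections_from_extract_json(json_path):
    """Group extracted elements into (section title, paragraph) records.

    Treats H1/H2/H3 elements as section boundaries and P elements as body text.
    """
    with open(json_path, encoding="utf-8") as fh:
        elements = json.load(fh).get("elements", [])

    records, current_section = [], "Front matter"
    for el in elements:
        path, text = el.get("Path", ""), el.get("Text", "").strip()
        if not text:
            continue
        if re.search(r"/H[1-3](\[\d+\])?$", path):
            current_section = text  # a new heading opens a new section
        elif re.search(r"/P(\[\d+\])?$", path):
            records.append({"section": current_section, "text": text})
    return records
```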

Embedding: The Heart of RAG

Picking the right embedding model is absolutely critical. While many free options exist, domain alignment is more important than popularity.

I tested two broad categories:

  • BERT-based models
  • OpenAI’s embedding models

Among the BERT options, BioBERT and PubMedBERT clearly outperformed generic BERT, which was expected given their biomedical training corpora.
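
For concreteness, the PubMedBERT embeddings can be produced roughly like this with Hugging Face transformers and mean pooling. The checkpoint shown is the PubMedBERT model on the Hugging Face Hub (since relisted as BiomedBERT); the pooling choice is my assumption for the sketch, not necessarily the exact setup behind every number below.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# PubMedBERT checkpoint on the Hugging Face Hub (now listed as BiomedBERT)
MODEL_ID = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pooled PubMedBERT embeddings, truncated to the 512-token limit."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    hidden = model(**batch).last_hidden_state        # (batch, seq, 768)
    mask = batch["attention_mask"].unsqueeze(-1)      # ignore padding tokens
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return torch.nn.functional.normalize(pooled, dim=-1)

# Cosine similarity is then a simple dot product between normalized vectors
vecs = embed(["HSV-1 establishes latency in sensory neurons.",
              "Herpes simplex virus persists in trigeminal ganglia."])
print((vecs[0] @ vecs[1]).item())
```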

The Chunk Size Dilemma

Chunk size—the number of tokens per text segment—is another critical design choice. BERT models usually max out around 512 tokens; OpenAI’s go up to thousands.

Smaller Chunks (~300 tokens):

  • Pros: More targeted embeddings, higher retrieval precision, reduced noise.
  • Cons: May need more chunks per query, increasing retrieval noise if not filtered well.

Larger Chunks (~500+ tokens):

  • Pros: Better context in a single chunk, helps with reasoning-heavy queries.
  • Cons: Risk of topic dilution and embedding drift.

Empirically, I found that 300-token chunks with PubMedBERT struck the best balance.
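
A minimal sketch of the token-window chunking described above, reusing the PubMedBERT tokenizer so chunk sizes line up with the model's own token counts; the 50-token overlap is an illustrative choice, not a measured optimum.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")

def chunk_text(text: str, chunk_tokens: int = 300, overlap: int = 50) -> list[str]:
    """Split text into ~300-token windows with a small overlap so sentences
    cut at a boundary still appear intact in at least one chunk."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    step = chunk_tokens - overlap
    chunks = []
    for start in range(0, len(ids), step):
        window = ids[start:start + chunk_tokens]
        chunks.append(tokenizer.decode(window))  # note: uncased tokenizer lowercases text
        if start + chunk_tokens >= len(ids):
            break
    return chunks
```

In the real pipeline, each chunk also keeps the section metadata attached during parsing, so retrieval can report which section a passage came from.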

Experimental Results

I tested:

  • BERT-based models: 300 vs. 500 tokens
  • OpenAI embeddings: 300, 500, 1,000 tokens

Similarity Score Performance: Surprisingly, OpenAI’s text-embedding-3-large underperformed BERT-based models on similarity scores.

BioBERT performed best, followed closely by PubMedBERT. Generic BERT lagged.

Relevance and Topic Coverage: Here, the results flipped: OpenAI embeddings outperformed BERT in capturing broader context and topic relationships.

PubMedBERT showed better performance at 300 tokens than 500, particularly for mechanistic and factual queries.

I went with PubMedBERT + 300-token chunks—it offered the most consistent and precise retrieval for HSV biomedical content.

While OpenAI's general-purpose embeddings capture broad semantics well, they're not tailored for niche biomedical language. PubMedBERT, pretrained on PubMed abstracts and PMC full-text articles, provides much sharper domain alignment.

Retrieval Strategy: Dense + Sparse + Cross-Encoder

The final architecture combines:

  • Dense retrieval (vector similarity)
  • Sparse retrieval (BM25) for keyword match
  • Reciprocal Rank Fusion (RRF) for combining dense + sparse
  • Cross-encoder re-ranking for final relevance

Pipeline Example: “What is HSV?”

  1. Dense Retrieval → Query embedded, top 50 results returned by cosine similarity.
  2. BM25 Search → Re-ranks using keyword overlap. May rescue results missed by dense embedding.
  3. RRF Fusion → Merges rankings using score = 1 / (k + rank), with k=60.
  4. Cross-Encoder → Scores each passage-query pair, ensuring the final output is semantically relevant (steps 2-4 are sketched below).
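
To make steps 2-4 concrete, here's a sketch that assumes dense retrieval has already produced a ranked candidate list; rank_bm25 and sentence-transformers stand in for whatever libraries are actually wired into the pipeline, and the cross-encoder checkpoint is just a common default.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_rerank(query: str, corpus: dict[str, str], dense_ranked: list[str],
                  top_k: int = 10) -> list[str]:
    # 2. Sparse retrieval: rank the same corpus by BM25 keyword overlap.
    ids = list(corpus)
    bm25 = BM25Okapi([corpus[i].lower().split() for i in ids])
    sparse_scores = bm25.get_scores(query.lower().split())
    sparse_ranked = [i for _, i in sorted(zip(sparse_scores, ids), reverse=True)]

    # 3. RRF fusion of the dense and sparse rankings (k = 60).
    fused = rrf_fuse([dense_ranked, sparse_ranked])[:50]

    # 4. Cross-encoder scores each (query, passage) pair for the final order.
    ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    ce_scores = ce.predict([(query, corpus[i]) for i in fused])
    return [i for _, i in sorted(zip(ce_scores, fused), reverse=True)][:top_k]
```

In the full system, the dense ranking comes from cosine similarity over the stored PubMedBERT vectors; it's passed in here to keep the sketch short.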

Why This Hybrid Works

  • Dense search handles semantic matching
  • BM25 recovers exact matches
  • RRF prevents over-reliance on any one method
  • Cross-encoder sharpens final ranking

This kind of hybrid pipeline is essential in scientific domains, where exact terminology matters just as much as semantic relevance.

Final Thoughts

This project has been a wild mix of scraping, parsing, fine-tuning, and experimentation. But the result is an HSV research assistant that can actually keep up with the firehose of literature.

Building a custom RAG agent gave me control over every component—from how documents are ingested to how they’re retrieved and presented. It’s not as turnkey as LangChain, but it’s way more precise for what I need.