Retrieval-augmented generation for docs, part 2
In my first post about retrieval-augmented generation (RAG) for documentation, I explained some RAG concepts and gave some background on large language models (LLMs) to show how RAG works.
In this post, I walk through the first part of my journey to prototype a RAG search system for documentation (the technical bits). I've written it in a more narrative style.
Before I begin
I decide on some constraints for this project:
- I will implement RAG search for a Docusaurus site, as I use Docusaurus in other projects.
- All LLMs I use will run locally, on my computer.
- I can and will use an LLM such as ChatGPT or GitHub Copilot to generate code and data.
- I’m not trying to create something production-ready, but rather limiting myself to a proof-of-concept.
Step 1: set up the environment
- I create a git repository to track all changes I make from now on.
- I add my tooling configuration file, mise.toml. It sets up Node.js and PNPM (my personal choice).
- I install Docusaurus in the repository.
- I create a bunch of fake docs to add to Docusaurus. In this case, I ask OpenAI Codex to generate fake documentation about office supplies.
- I test the Docusaurus build to make sure it works, and that the docs are visible.
With Node.js and Docusaurus in place, I continue.
Step 2: chunk the docs
Before I do fancy AI stuff like vectorizing chunks of documentation, I actually need documentation chunks. How do I do this?
The concept
I have an idea, but let’s ask ChatGPT a naïve question:
Another question first: since this is a documentation site, and the chunks will contain metadata, am I chunking MDX or am I chunking the HTML output?
ChatGPT responds:
Short answer: you should chunk the source MDX (or a normalized Markdown AST), not the rendered HTML output.
ChatGPT explains that there are three options:
- Process your Markdown or MDX documents to an abstract syntax tree (AST), and grab the semantic chunks from there.
- Preprocess the Markdown text to strip unnecessary components and chunk based on heading markers. Less precise than option 1, but still source-based.
- Chunk from the final HTML documents. Not recommended, but defensible when you do not control the source and only have the final presentation.
Option 1 is recommended. I agree; if we want the content without fluff, we should leverage the Markdown/MDX processors that already take care of cleaning the docs files for us. Notably, this means that something has to create chunks using Docusaurus libraries, or while Docusaurus is doing its build. I can work with that.
I also have to decide what metadata is available alongside each chunk of text. After reviewing my conversations with ChatGPT, I decide that a chunk needs the following information:
- ID: a unique identifier for the chunk of text. I don’t know if I really need unique IDs, but they won’t hurt my proof-of-concept. I decide that IDs correspond to a hash of the chunk content.
- Text: the actual textual content of the chunk.
- Slug: the URI of the page on which this textual content is found, relative to the site root.
- Section path: an ordered list (array) of all headers on the page under which this textual content is found.
While figuring out chunk metadata, I also decide that a chunk should be no longer than 1000 tokens, and that every heading on a page should mark a chunk boundary. Maybe I should have restricted boundaries to H2 and H3 headings, but this is okay for now.
At this point, I’m not sure which file format I’ll use to store chunks, but whatever I read seems to point to JSON as a good choice. That’s what I go with.
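To make the idea concrete before touching the real implementation, here is a rough Python sketch of the chunking logic I have in mind. Treat it purely as an illustration under my own assumptions (a crude characters-per-token estimate, plain Markdown input); the actual implementation ends up as a JavaScript Docusaurus plugin later in this post.

import hashlib
import re

def chunk_markdown(markdown: str, slug: str, max_tokens: int = 1000) -> list[dict]:
    """Split one Markdown page into chunks at every heading, with a size cap."""
    chunks: list[dict] = []
    section_path: list[str] = []
    buffer: list[str] = []

    def flush() -> None:
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({
                "id": hashlib.sha256(text.encode("utf-8")).hexdigest(),  # ID is a hash of the content
                "text": text,
                "slug": slug,
                "section_path": list(section_path),
            })
        buffer.clear()

    for line in markdown.splitlines():
        heading = re.match(r"^(#{1,6})\s+(.*)", line)
        if heading:
            flush()  # every heading starts a new chunk
            level = len(heading.group(1))
            section_path[:] = section_path[:level - 1] + [heading.group(2).strip()]
        buffer.append(heading.group(2).strip() if heading else line)
        if sum(len(part) for part in buffer) / 4 > max_tokens:  # crude estimate: ~4 characters per token
            flush()
    flush()
    return chunks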
The implementation
Time to get to work!
Before I ask an LLM to create code, I use an online tool to design a JSON schema that describes how the chunk JSON file should store its information. My reasoning is that a JSON schema not only helps me verify that my idea holds together, but also gives me a file I can pass to an LLM to improve my query.
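The schema itself is nothing fancy. Simplified, it looks roughly like this (the field names match the chunk metadata I settled on above; the exact descriptions and schema draft are my own choices):

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "RAG chunks",
  "type": "object",
  "required": ["chunks"],
  "properties": {
    "chunks": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["id", "text", "slug", "section_path"],
        "properties": {
          "id": { "type": "string", "description": "Hash of the chunk content" },
          "text": { "type": "string" },
          "slug": { "type": "string" },
          "section_path": { "type": "array", "items": { "type": "string" } }
        }
      }
    }
  }
}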
Next, I ask ChatGPT to create a summary of our conversation on RAG and embeddings. I also want to use this to improve my query.
I then start up Codex CLI from my git repository. I select a coding model, and write a query containing the following information:
- The conversation summary.
- A request to build a Docusaurus script that captures documentation text during the Docusaurus processing phase and outputs it, together with related metadata, to a JSON file.
- The JSON schema, so the LLM has some awareness of the output I’m after.
After some thinking, Codex creates the file docusaurus/plugins/docusaurus-plugin-json-exporter/index.js. This is a single-file Docusaurus plugin that takes two parameters: one for the maximum token length, and another for the output file name. Codex explains where to add the plugin in docusaurus.config.ts, and I do so:
plugins: [
[
require.resolve("./plugins/docusaurus-plugin-json-exporter"),
{
maxTokens: 1000,
outputFile: "rag-chunks.json",
},
],
],
Reading the plugin source code, I think the plugin does its own processing of Markdown/MDX to an AST, but uses the Docusaurus build process to grab metadata. Fair enough.
The result
I run a Docusaurus build, and get a good JSON file on the first try. That’s nice! Here’s a quick snippet from the file:
{
"chunks": [
{
"id": "9a58104f04ff4e1945c5b7b9be83987e256a0ecef0fea116e01e0cdc2ab6177f",
"text": "Tutorial Intro\n\nLet's discover Docusaurus in less than 5 minutes.",
"slug": "/docs/intro",
"section_path": [
"Tutorial Intro"
]
},
{
"id": "e3fabf0c06ff12af7a10c15c60ad887f245a4290e7bd4cb7d592056936c09d0c",
"text": "Getting Started\n\nGet started by creating a new site.\n\nOr try Docusaurus immediately with docusaurus.new.",
"slug": "/docs/intro",
"section_path": [
"Tutorial Intro",
"Getting Started"
]
},
...
]
}
Good for now.
Step 3: vectorize the chunks
I have documentation chunks. Time to turn these chunks into embeddings.
The concept
I understand that all chunk text needs to be passed to an embedding model for processing, then saved. On a basic level, it looks like a loop that goes through every entry in the JSON file and passes its text to a specialized LLM.
I start looking at lists of LLMs that can run locally to create embeddings. After reading some articles comparing different models, I settle on the model Qwen/Qwen3-Embedding-0.6B; it appears small enough to run quickly, and supposedly gives decent results. The choice of LLM also makes a few other decisions for me:
- I will run my local LLMs with Ollama, a popular tool that is easy enough for me to set up.
- AI-related code is overwhelmingly written in Python, so my next scripts will be written in Python.
I’m aware that I’m now entering the part of my project which feels more like magic.
The implementation
I go back to Codex CLI and start a new conversation. This time, my query contains the following information:
- The same conversation summary from ChatGPT.
- The following text:
Current status: created a chunk schema and implemented a chunk extraction script. Example chunks file in rag-chunks.json. Currently **working on local embedding step**. TASK: Generate a script that creates embeddings based on text in chunks file rag-chunks.json. SUGGESTION: use the model Qwen3-Embedding-0.6B. This model's HuggingFace page gives this code example:
- A code example, taken from the model's official page on HuggingFace. I hope that it will nudge Codex's output in the right direction.
Codex takes time to process my new query. It creates a script that reads an input file, encodes each chunk with the Qwen embedding model, and writes an output JSON file with embeddings attached for downstream indexing. Codex suggests I can run it as such:
python3 scripts/embed_chunks.py --input rag-chunks.json --output rag-embeddings.json
Codex also informs me that I need to install Python dependencies such as torch, sentence-transformers, and transformers.
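For readers curious about what the core of such a script looks like, here is a simplified sketch of the embedding loop. It is my own illustration rather than a copy of Codex's output; the file names match my setup, and everything else (batching, error handling) is left out.

import json
from sentence_transformers import SentenceTransformer

MODEL_NAME = "Qwen/Qwen3-Embedding-0.6B"

# Load the chunk file produced by the Docusaurus plugin.
with open("rag-chunks.json", encoding="utf-8") as handle:
    data = json.load(handle)
chunks = data["chunks"]

# Encode every chunk's text; normalizing makes later similarity math a simple dot product.
model = SentenceTransformer(MODEL_NAME)
vectors = model.encode([chunk["text"] for chunk in chunks], normalize_embeddings=True)

for chunk, vector in zip(chunks, vectors):
    chunk["embedding"] = vector.tolist()

# Write the enriched file for the later retrieval step.
output = {
    "model": MODEL_NAME,
    "dimension": len(vectors[0]),
    "normalized": True,
    "chunks": chunks,
}
with open("rag-embeddings.json", "w", encoding="utf-8") as handle:
    json.dump(output, handle)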
Dear reader, I don’t like Python dependencies. It took some time for me to reacquaint myself with Python and pip, so I will give you an overview of what happens:
- Create a Python “venv”. This is a directory that somewhat sandboxes Python dependencies and their running environment.
- Enter the venv.
- Use a special Python utility to figure out which requirements a script has.
- Use pip to install those requirements into the current venv.
Remember that venv, as you’ll need to enter it to run your script later.
One more thing before I try the script: I install Ollama, and have it download the model Qwen/Qwen3-Embedding-0.6B.
The result
Ollama is running. Ollama has the model Qwen3-Embedding-0.6B loaded. I’m in the correct Python venv with the right dependencies. A few script executions fail due to Python issues, but I finally get the script running in the right environment.
The script runs. Ollama starts using up my computer’s memory.
Not long after, the script finishes successfully. I’m left with a new file, rag-embeddings.json. This new file has the same data as the chunks JSON file, but adds more metadata. It looks something like this:
{
"model": "Qwen/Qwen3-Embedding-0.6B",
"dimension": 1024,
"normalized": true,
"chunks": [
{
"id": "9a58104f04ff4e1945c5b7b9be83987e256a0ecef0fea116e01e0cdc2ab6177f",
"text": "Tutorial Intro\n\nLet's discover Docusaurus in less than 5 minutes.",
"slug": "/docs/intro",
"section_path": [
"Tutorial Intro"
],
"embedding": [
0.018638374283909798,
-0.01859218440949917,
-0.011958424933254719,
...
Let’s pause and examine the new data added by the script:
- model: the script kept a record of which model was used to vectorize the data. Seems good to reduce errors later.
- dimension: the script kept track of how many dimensions each vector has. A value of 1024 means that each vector consists of 1024 numbers. Seems like a lot, but this value is apparently normal!
- normalized: all vectors were adjusted to have the same mathematical length (typically a length of 1, which makes similarity calculations a simple dot product later). The script kept track of this. I don’t know if this information is useful, but it doesn’t hurt.
- embedding: this is the heart of what I wanted. embedding is an array of floating-point numbers that, together, represent the vector for a text chunk. Since our dimension count is 1024, each chunk has an embedding consisting of 1024 numbers.
I find it cool to see all this data, but the fun turns into shock when I realize how much space the embeddings take: my original JSON file was 105 KB in size, and the file with the embeddings weighs in at 8.8 MB. A file size increase of more than 80 times is no joke!
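A quick back-of-the-envelope check makes the growth unsurprising: each embedding is 1024 floating-point numbers, and each number is serialized in JSON as a decimal string of roughly 20 characters, so every chunk carries about 20 KB of vector data on top of what is usually just a few kilobytes of text.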
The result gets me thinking about embeddings for real-world documentation. I consider how much text there is on professional documentation sites, and how much space their embeddings could take depending on the model choice.
Step 4: index the embeddings
I skip the indexing step.
Without the indexing step, each query has to load and process the embeddings JSON file. 8.8 MB doesn’t seem like too much data, and I’m eager to try out queries, so I move on.
…what actually happened is that in my eagerness to try out queries, I went straight to developing a query interface, and the LLM didn’t stop me. I later questioned the LLM that was helping me, and it confirmed that I didn’t really need an index for such a small amount of data. If I had tens of thousands of chunks, or wanted really fast queries, then I could use one.
If I were to do this exercise again, I would ask the LLM to build a script that stores the embeddings in a local index like FAISS right from the start.
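For the record, here is what a minimal FAISS-based indexing step could look like. This is a sketch I did not actually run; it assumes the faiss-cpu package and the embeddings file described above.

import json

import faiss  # pip install faiss-cpu
import numpy as np

with open("rag-embeddings.json", encoding="utf-8") as handle:
    data = json.load(handle)

# Inner product equals cosine similarity here, because the stored vectors are normalized.
vectors = np.asarray([chunk["embedding"] for chunk in data["chunks"]], dtype="float32")
index = faiss.IndexFlatIP(data["dimension"])
index.add(vectors)
faiss.write_index(index, "rag-chunks.faiss")
# At query time: load the index with faiss.read_index() and call index.search()
# with a 2-D float32 array of query embeddings to get the top-k chunk positions.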
Still, don’t do what I just did in production.
Step 5: query the data
I move on to the final step: building an application from which I can query my data.
The concept
As I’m doing my proof-of-concept on a local computer, I figure the application only needs to be a script of some kind, probably Python. The script needs to perform these actions:
- Connect to a local LLM instance, which I decided would be done through Ollama.
- Show a query interface in which I can make a request.
- Embed the query.
- Retrieve the most relevant chunks from the embeddings file by similarity.
- Call the LLM to generate an answer from my query and the context.
This step seems like it will be even more magical. I don’t understand the complexities of the math involved, and I don’t know enough Python to accurately check the generated script. However, I think there are enough examples out there that OpenAI Codex models should generate these scripts well enough. I hope the script won’t do anything unnecessary.
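To demystify the retrieval part at least a little, the core of the “find relevant chunks” step can be sketched like this. It is my own illustration, not the script Codex generates below; the function name and the top-k value are arbitrary.

import json

import numpy as np
from sentence_transformers import SentenceTransformer

def top_chunks(query: str, embeddings_path: str = "rag-embeddings.json", k: int = 5):
    with open(embeddings_path, encoding="utf-8") as handle:
        data = json.load(handle)

    # The query must be embedded with the same model that embedded the chunks.
    model = SentenceTransformer(data["model"])
    query_vector = model.encode(query, normalize_embeddings=True)

    # With normalized vectors, a dot product is the cosine similarity.
    matrix = np.asarray([chunk["embedding"] for chunk in data["chunks"]], dtype="float32")
    scores = matrix @ query_vector

    # Return the k best-scoring chunks, most similar first.
    best = np.argsort(scores)[::-1][:k]
    return [(float(scores[i]), data["chunks"][i]) for i in best]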
The implementation
I go back to Codex CLI and continue the previous conversation to keep the context. I send a simple follow-up query: how can we create a local application (or script) that can answer queries from the JSON files using a local LLM? I let Codex decide how to fill in anything I left unsaid.
Codex plans architecture and code, and generates another Python script. The new Python script is a “minimal local RAG CLI” that embeds a query, retrieves top chunks, and optionally calls a local LLM to generate an answer from the context:
- Retrieve context only (no LLM; prints the full prompt you can paste anywhere):
rag_answer.py --embeddings rag-embeddings.json --query "How do I create a new site?"
- Use a local LLM via Ollama (easiest backend):
rag_answer.py --embeddings rag-embeddings.json --llm-backend ollama --llm-model llama3.2 --query "How do I create a new site?"
- Interactive mode:
rag_answer.py --embeddings rag-embeddings.json --interactive
I take note of what Codex created.
- At first I find the ability to retrieve context only, without calling an LLM, odd, but it’s useful: it’s a good option for testing a proof-of-concept, so I’m grateful for it.
- Codex also notes that query embedding must use the same embedding model as the chunks. This is where the model name recorded in the embeddings file proves its worth.
- Codex also embedded an initial system message. Useful! I should’ve thought of that.
def build_prompt(*, query: str, context: str) -> str:
    return (
        "You are a helpful documentation assistant.\n"
        "Answer the user's question using only the provided context.\n"
        "If the context is insufficient, say what is missing and ask a clarifying question.\n"
        "Cite sources by repeating the 'Source:' label(s) you used.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )
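The prompt built this way then goes to the local model. I don’t reproduce the script’s actual backend code, but conceptually a call to Ollama’s REST API looks roughly like this sketch (the function name is mine; only the endpoint and payload fields are Ollama’s):

import json
import urllib.request

def generate_answer(prompt: str, model: str) -> str:
    # Ollama listens on port 11434 by default; stream=False returns a single JSON object.
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]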
As the new script is also Python, I inspect its dependencies. They are the same as the embedding script’s dependencies, which means I can run the new script in the venv I created earlier. That’s one less thing to worry about.
Before I run the query, though, I need to choose a general-purpose, local LLM to accept the query!
I look at the available models on Ollama’s website, searching for one that is:
- Relatively new
- Small enough to fit in local memory
- General-purpose
For this proof-of-concept, I choose ministral-3:3b. My computer has 16 GB RAM, and this one-month-old model should only take around 3-4 GB. I start Ollama, pull this model, and prepare myself.
The result
Once again, Ollama is running. Ollama has the model ministral-3:3b loaded. I’m in the correct Python venv with the right dependencies.
I run the Python script with a query:
python3 scripts/rag_answer.py --llm-backend ollama --llm-model ministral-3:3b --query "What kinds of office supplies are there?" --embeddings rag-embeddings.json
A few seconds later, I get my results:
Based on the provided context, the main categories of office supplies mentioned include:
1. **Stapler** – Specifically for stapling documents, with guidelines for staple sizes (e.g., light stacks of 2–10 sheets).
2. **Scissors** – Classified into:
- **Office scissors** (for paper and light packaging)
- **Craft scissors** (for detailed work)
- **Heavy-duty scissors** (for cardboard or plastic straps)
3. **Pencil** – Primarily for drafting, temporary notes, and erasability.
4. **Pen** – Includes:
- **Ballpoint** (reliable, low smudge)
- **Gel** (smooth, vivid ink, smudges easily)
- **Fountain** (adjustable line width, needs compatible paper)
- **Fineliner** (consistent thin lines, ideal for diagrams)
5. **Desk Organizer** – A stocking system with minimum quantities:
- Pens, pencils, highlighters, sticky notes, paper clips, binder clips, scissors, and tape.
**Missing context:**
- No details on **paper, tape, or binder clips** (e.g., types of tape or specific clip designs).
- No information on **stationery (e.g., notebooks, folders, envelopes)** or **specialized tools (e.g., staplers with automatic feeders)**.
Would you like recommendations for a specific category or further clarification on any item?
It works!
Journey’s end? Not yet.
With a whisper from the machine, the hero falls to his knees. His quest is finally over, and… oh, wait, wrong story.
I reflect on what I built. Although an LLM generated the scripts, I enjoyed what I learned along the way from the questions I had to ask. I had fun, but doubts remain.
- This doesn’t feel like the kind of project a technical writing team should implement on its own… at least, not without a dedicated engineer.
- I’m not sure where I would use RAG. Do I use it directly in a docs search engine, or do I want RAG used alongside other tools that create customer-facing files?
- This project depends on the documentation being up-to-date. The moment documentation gets put to the side, the value of RAG goes down.
- The RAG project depends on quite a few decisions, such as: choice of embedding model; choice of query model; AI tool budget; engineering budget; initial system prompt for the search system. Who decides, and who iterates?
Meanwhile, I have yet to try my scripts on real-world documentation. That will be in part three, where I also correct some design decisions.