Retrieval-augmented generation for docs, part 3
In my first and second posts about retrieval-augmented generation (RAG) for documentation, I explained some RAG concepts, then used a large language model (LLM) to prototype a RAG search tool.
In this post, I discuss what happened after creating the prototype: refinement, bug-fixing, and using real-world documentation.
Where I left off
In part two, I had successfully generated a proof-of-concept RAG setup that used documentation as its source. The proof-of-concept consisted of the following elements:
- A plugin to extract Docusaurus documentation into a chunked format.
- A script to vectorize the documentation chunks by using an LLM, turning the chunks into embeddings.
- A script to feed a natural language query to an LLM, at the same time providing the embeddings for the search.
This proof-of-concept worked for artificial documentation, but I wanted to do more before deciding my project was a success.
First, I wanted to clean up the plugin and scripts. I knew more than when I started, which should let me improve the scripts according to what I now considered important.
Second, I wanted to test the RAG system on real-world documentation. I had access to a big documentation site, so creating and using its embeddings was my target.
Fixing up the scripts
First things first: I didn’t think the scripts were ready for anything close to real-world use. When OpenAI Codex generated the scripts according to my request, it had made a lot of assumptions that I would not have thought of. I felt that I should change some things, and document the rest.
Indexing embeddings
My proof-of-concept’s workflow skipped an indexing step. Though I was okay with skipping it at first, I had read that anything but the smallest datasets would benefit from a fast lookup index.
Time to correct my skip.
The previous Codex conversation had suggested a few indexing libraries. I chose the first listed library, FAISS, solely based on its popularity.
I went back to said Codex conversation, and wrote a request to build an indexing system using FAISS. I specified that existing scripts should be made to work with this index, and let the LLM get to work.
Codex generated two scripts:
- build_faiss_index.py, to build the FAISS index from an embeddings file I created.
- verify_faiss_index.py, to verify the index was correct.
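Here is roughly what that kind of index build looks like, as a minimal sketch assuming the embeddings JSON format from part two (the file names are mine; the generated build_faiss_index.py has more options than this):

```python
# Minimal sketch of building a FAISS index from an embeddings JSON file.
# Not the generated build_faiss_index.py, just the core idea.
import json
import numpy as np
import faiss  # pip install faiss-cpu

with open("embeddings.json", "r", encoding="utf-8") as f:
    data = json.load(f)

vectors = np.array([c["embedding"] for c in data["chunks"]], dtype="float32")

# With normalized vectors, the inner product equals cosine similarity,
# so a flat inner-product index gives exact cosine search.
index = faiss.IndexFlatIP(data["dimension"])
index.add(vectors)

faiss.write_index(index, "chunks.faiss")
print(f"Indexed {index.ntotal} chunks of dimension {data['dimension']}")
```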
The first script seemed to work, but the verification script failed with a segmentation fault. I asked Codex to investigate and debug the script.
Codex’s debugging session seemed a bit awkward. It started by suggesting that “FAISS search can segfault on dimension mismatch,” and added explicit dimension checks to the file. Then it tested the script in a restricted environment multiple times, triggering impossible network requests and causing the test to hang. I wondered how necessary these tests were until it came up with a fix: when the script runs without --queries or --query-file flags, it now tests with chunk embeddings from the embeddings JSON file rather than loading the SentenceTransformers library that was making the network calls.1
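I still haven’t read the generated verification script line by line, but the fix amounts to a self-query check: reuse a handful of chunk embeddings as queries and confirm each one retrieves itself first. The sketch below is my approximation of that idea, not the script’s actual code.

```python
# Sketch of verifying a FAISS index without loading SentenceTransformers:
# use chunk embeddings from the JSON file as the test queries.
import json
import numpy as np
import faiss

with open("embeddings.json", "r", encoding="utf-8") as f:
    data = json.load(f)
index = faiss.read_index("chunks.faiss")

# Take the first few chunk embeddings as self-queries.
queries = np.array([c["embedding"] for c in data["chunks"][:5]], dtype="float32")

scores, ids = index.search(queries, 1)
for i, (score, hit) in enumerate(zip(scores[:, 0], ids[:, 0])):
    status = "OK" if hit == i else "MISMATCH"
    print(f"query {i}: nearest chunk {hit} (score {score:.4f}) -> {status}")
```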
The debug session worried me a bit, because I had not known that the verification script actually wanted queries. Maybe I should have read over the scripts more. Maybe I was wrong to trust that the LLM would explain its work like it had done earlier in the conversation. Mostly, though, I felt foolish that I had no way of reliably checking the LLM’s output in this case.
Still, the FAISS indexing scripts seemed to work. I noted my thoughts and moved on.
Documenting the work
The indexing scripts had many command-line options, which reminded me that the other Python scripts had many options as well. I didn’t want to change that too much, but I did want to understand the options better.
My next request to Codex came in two parts:
- Document the script CLI flags within the scripts themselves, so I could better understand the point of what was added.
- Ensure --help/-h flags exist, and that they clearly explain the script and the flags. Add short example commands to their output.
For the first request, Codex added Python comments to the top of each script. Each comment had textual descriptions of all supported flags. Sounds good to me!
For the second request, Codex went a bit further: it updated the script argument parsers, added a help text formatter, and added short example commands to the help output. I had not specified what example commands I wanted; Codex generated commands like I had previously used, omitting the more complex CLI flags.
I think Codex handled these requests quite well. The new textual content was descriptive enough that I understood it. I knew I could always refine generated content where needed, but I didn’t need to iterate beyond my two requests.
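For readers who haven’t seen the pattern, the help-text changes boil down to standard argparse features. Here’s a rough sketch (the flag names and example commands in the real scripts may differ):

```python
# Sketch of an argparse setup with documented flags and example commands.
import argparse

parser = argparse.ArgumentParser(
    description="Embed documentation chunks into vectors for RAG search.",
    # Keep line breaks in the epilog so the example commands stay readable.
    formatter_class=argparse.RawDescriptionHelpFormatter,
    epilog=(
        "Examples:\n"
        "  python embed_chunks.py --input chunks.json --output embeddings.json\n"
        "  python embed_chunks.py --help\n"
    ),
)
parser.add_argument("--input", required=True, help="Chunks JSON file to embed.")
parser.add_argument("--output", default="embeddings.json", help="Where to write the embeddings.")
args = parser.parse_args()
```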
Removing an unnecessary assumption
One flag I did want to change, though, was a flag related to normalization. Remember this block from the RAG embeddings file?
{
"model": "Qwen/Qwen3-Embedding-0.6B",
"dimension": 1024,
"normalized": true,
"chunks": [
{
"id": "9a58104f04ff4e1945c5b7b9be83987e256a0ecef0fea116e01e0cdc2ab6177f",
"text": "Tutorial Intro\n\nLet's discover Docusaurus in less than 5 minutes.",
"slug": "/docs/intro",
"section_path": [
"Tutorial Intro"
],
"embedding": [
0.018638374283909798,
-0.01859218440949917,
-0.011958424933254719,
...
The line "normalized": true, stood out to me because I didn’t see its purpose. I saw there were CLI flags on many scripts that, when unset, defaulted to normalizing vectors. Why did that matter?
Quick math refresher: vectors have properties such as direction and length (magnitude). A normalized vector, or unit vector, is a vector of length 1 in a normed space. When you are comparing normalized vectors, their cosine similarity and dot product are the same.
For embedding search, this equivalence means the computer can do similarity comparisons more cheaply… which is exactly what we’re using the vectors for.
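If you want to see the equivalence for yourself, a few standalone lines of NumPy (not part of my scripts) make it concrete:

```python
# Cosine similarity of raw vectors equals the dot product of their normalized versions.
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

a_unit = a / np.linalg.norm(a)  # divide by length to get a unit vector
b_unit = b / np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot_of_units = np.dot(a_unit, b_unit)

print(cosine, dot_of_units)  # both print the same value (about 0.9839)
```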
I did query the LLM to find out what would happen if I disabled normalization by using the relevant CLI flag. It answered that the RAG answer script would have to perform extra calculations. I saw no point in that, so it was time to act.
I decided that the workflow should always normalize vectors, making it mandatory end-to-end. The changes consisted of the following:
- embed_chunks.py removed the “no-normalize” path and made normalization always on.
- build_faiss_index.py, rag_answer.py, and verify_faiss_index.py were aligned to the “always normalized” assumption.
- The LLM still kept a path that checked legacy JSON files for the "normalized": true line, which I requested be removed because I would rebuild everything.
- The LLM updated internal script documentation accordingly.
Normalization was now the norm. More importantly, I no longer had that line staring at me in the JSON file.
Real-world implementation
As far as I was concerned, it was time to try my scripts on real documentation.2 I guessed that more problems would appear at this point. Boy, was I right.
Chunking failures
I added the Docusaurus chunking script to a local copy of a real-world Docusaurus site. The build failed instantly, with a micromark MDX error that didn’t tell me what documentation file caused the problem. Great.
Still in the same Codex conversation, I requested a debug option for the Docusaurus plugin. The debug option should log each documentation file as the plugin worked on it, so I could match the error to the correct MD or MDX file. I tested the plugin again, and found the problem file.
Error 1: HTML comments break the plugin. Wait, what?!
Our documentation had a regular Markdown file that contained an HTML comment block: <!-- TODO ... -->. Normally, building our content abstract syntax tree would strip HTML comments. However, the micromark parser in my plugin choked on the comment before an abstract syntax tree could be built.
Fix 1: the LLM implemented a pre-parse sanitizer. Strip HTML comments from the raw content when found, but preserve them in fenced code blocks. Keep stripping the comment nodes at the AST level just in case.
With this fix, I ran the Docusaurus build again.
Error 2: heading ID syntax breaks the plugin.
The second error was caused by a Docusaurus feature that allows writers to specify a custom ID at the end of a Markdown header. This useful feature is not standard Markdown, which explains why a Markdown parser would be unhappy.
Fix 2: the LLM implemented another pre-parse sanitizer. Strip trailing {#...} from heading lines (outside of code fences) before parsing. No need to preserve the IDs for our test case, but the possibility is there.
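The plugin itself is JavaScript, but both sanitizers are simple enough to sketch. Here’s my approximation of the idea in Python (not the plugin’s actual code, and the real version runs before handing content to the MDX parser):

```python
# Sketch of the two pre-parse sanitizers: strip HTML comments and
# trailing {#custom-id} heading IDs, but leave fenced code blocks alone.
import re

FENCE = "`" * 3  # fenced code block marker

def sanitize_markdown(text: str) -> str:
    out, in_fence = [], False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a code fence
        elif not in_fence:
            # Drop single-line HTML comments; a fuller version would also
            # handle comments spanning multiple lines.
            line = re.sub(r"<!--.*?-->", "", line)
            # Drop Docusaurus-style {#custom-id} suffixes on heading lines.
            line = re.sub(r"^(#{1,6}\s.*?)\s*\{#[^}]*\}\s*$", r"\1", line)
        out.append(line)
    return "\n".join(out)
```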
These two fixes were enough for the Docusaurus build to succeed, which meant a JSON file full of documentation chunks. This was great progress!
Chunking behavior tweaks
I remembered an earlier warning that Codex gave me: managing chunk size is an important component of a RAG strategy. The Docusaurus script chunked content up to heading level four (H4); I wanted better control over chunking in case earlier assumptions caused problems.
I asked the LLM to implement an option that sets which heading depths the plugin chunks at. It added the option maxHeadingDepth with a default value of 4, accepting only integers from 1 to 6. This option worked on the first try.
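As an illustration of what the option controls, here’s a simplified Python sketch of chunking at a maximum heading depth (the real plugin works on the parsed AST, and this sketch ignores code fences for brevity):

```python
# Split markdown into chunks at headings up to max_heading_depth (1-6).
# Deeper headings stay inside their parent chunk.
import re

def chunk_by_headings(markdown: str, max_heading_depth: int = 4) -> list[dict]:
    chunks, current = [], {"heading": None, "lines": []}
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)", line)
        if match and len(match.group(1)) <= max_heading_depth:
            if current["heading"] or current["lines"]:
                chunks.append(current)
            current = {"heading": match.group(2), "lines": []}
        else:
            current["lines"].append(line)
    chunks.append(current)
    return chunks
```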
With the means to better handle chunks, I passed the chunks JSON file to my scripts.
Memory issues
While my JSON data was clean, the embedding and query Python scripts didn’t work on the first try. They both failed in the same way. I would run them, the computer would freeze, and the scripts would eventually fail with a segmentation fault. The culprit: memory.
As a reminder, I was doing this on a machine with 16 GB RAM. LLMs are known to use a lot of memory. Even though the LLM used for the query fit in memory, working with the vectors proved to be too much for a first real-world try.
I worried about whether I had overstepped what I could do on a local machine, but I had to keep trying.
Each time I ran one of these scripts, I went back to the Codex conversation and explained the errors. I asked Codex if it was possible to fix the scripts so they could run successfully. Here are the solutions Codex implemented after some minutes of debugging:
- Add options and explanations to embed_chunks.py:
  - A flag to cap the tokenizer length and reduce memory usage.
  - A flag to run the embedder on the CPU, rather than attempting a GPU or NPU.
  - A flag to control the embedding batch size, with lower defaults.
- Modify rag_answer.py:
  - Reduce the amount of data it tries to handle in parallel.
  - Load the FAISS index in a lazy fashion, using fewer CPU threads.
  - Add checks so some code paths fail cleanly.
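I can’t reproduce Codex’s exact changes, but the shape of the memory fixes looks roughly like this (the file layout and values are illustrative; the SentenceTransformers and FAISS calls themselves are standard):

```python
# Sketch of the memory-friendly settings: CPU only, shorter sequences,
# smaller batches, and fewer FAISS threads.
import json
import faiss
from sentence_transformers import SentenceTransformer

with open("chunks.json", "r", encoding="utf-8") as f:
    texts = [c["text"] for c in json.load(f)["chunks"]]

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B", device="cpu")
model.max_seq_length = 512  # cap tokenizer length to reduce memory usage

embeddings = model.encode(
    texts,
    batch_size=8,                # smaller batches, lower peak memory
    normalize_embeddings=True,   # keep the always-normalized assumption
    show_progress_bar=True,
)

# On the query side, limit how many CPU threads FAISS uses.
faiss.omp_set_num_threads(2)
index = faiss.read_index("chunks.faiss")
```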
These changes somehow fixed the errors. I say “somehow” because I still ask myself what the real limit is. Is there a point where the data would be too much for my computer to handle? I hope implementing this doesn’t eventually come with a minimum hardware price tag.
The result
Not only were the errors fixed, but the RAG answer script worked nicely.
With a local chat LLM running in the background, and vectorized documentation, I asked a question about the documented product. The LLM thought for longer than with the proof-of-concept, but managed to generate a relevant answer to my question.
It felt underwhelming.
Although this project was a success, I couldn’t help but think of next steps and concerns. I had created a series of scripts that worked, but didn’t have the knowledge to maintain them; the system prompt needed to be tweaked; what do I do with all of this?
In the final post, I’ll go over what I took from this exercise.
Footnotes
1. Software accessing the network without you knowing is a really big deal. I wasn’t too worried here because I knew the SentenceTransformers library uses the network to try and load models. In any other circumstance, “it needs network access and we don’t know why” can be cause to panic. ↩
2. I had access to documentation from my work. If you don’t, you can always use real-world documentation from open-source projects. ↩