Retrieval-augmented generation for docs, conclusion
My first, second, and third posts about retrieval-augmented generation (RAG) for documentation were explanatory. I planned this fourth post to reflect on what I had learned. I was going to post it earlier, but I got very busy at work.
You know what, though? It’s a good thing I waited. I used LLMs so much this week that I get to add more to this post!
Reflections on RAG
In short:
- I learned a lot by reading.
- I learned some things by doing, and LLMs assisted me.
- Some elements of RAG are cool. Some are scary.
- I shouldn’t be the person implementing this.
Read the things
I learned that retrieval-augmented generation is a combination of techniques: chunking written material, embedding the chunks, embedding the user’s query, searching for the chunks most similar to that query, and having an LLM construct your personalized answer from what was retrieved.
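To make that concrete, here is a minimal sketch of such a pipeline in Python. It is not my proof-of-concept: the chunking is naive, and `embed()` is a stand-in that just hashes words so the script runs without any dependencies; a real system would call an embedding model there, and the final prompt would be sent to an LLM.

```python
# Minimal RAG pipeline sketch (illustrative, not my proof-of-concept).
import hashlib
import math


def embed(text: str, dims: int = 64) -> list[float]:
    """Stand-in embedding: hash words into a fixed-size unit vector.

    A real pipeline would call an embedding model here instead.
    """
    vec = [0.0] * dims
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def chunk(document: str, size: int = 40) -> list[str]:
    """Split a document into overlapping chunks of roughly `size` words."""
    words = document.split()
    step = max(size // 2, 1)
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Return the chunks whose embeddings are closest to the query's."""
    query_vec = embed(query)
    # Vectors are unit-normalized, so a dot product is cosine similarity.
    return sorted(
        chunks,
        key=lambda c: sum(q * x for q, x in zip(query_vec, embed(c))),
        reverse=True,
    )[:top_k]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the prompt an LLM would use to write the personalized answer."""
    joined = "\n\n".join(context)
    return f"Answer the question using only this documentation:\n\n{joined}\n\nQuestion: {query}"


if __name__ == "__main__":
    docs_text = "Replace this with the full text of your documentation."
    question = "How do I enable versioned documentation?"
    print(build_prompt(question, retrieve(question, chunk(docs_text))))
```

A real implementation would also store the chunk embeddings instead of recomputing them on every query, but the overall shape of the flow is the same.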
I knew of RAG before this exercise. I didn’t know quite how RAG worked, so I had to read about it. Reading one article would send me down a small rabbit hole of information, one article at a time, until I had a clearer picture of what was involved. Conceptually, this process was not unlike asking an LLM to construct a personalized answer; my brain was doing the construction.
There was no single article that explained RAG in a way I found comprehensive. The information felt like scattered bits and pieces, with key details left undisclosed by many writers. Is there a secret society, a cabal of engineers who refuse to share their wisdom, or are we just bad at explaining concepts from start to finish?
Still, I found my information in the end. Reading is good and teaches you things, such as “hire a technical writer” (I kid, but not really).
Do the things?
To implement RAG in a documentation system, I had two options:
- Pay someone to do it.
- Code it myself.
I was working on a proof-of-concept with no budget, so option one was out of the question. Fortunately, work gives me access to LLMs, so I asked a few LLMs to implement the code for me. These LLMs seemed very much in tune with this kind of code.
The exercise probably helped reinforce how RAG worked by making me think through the implementation. It probably taught me a bit more about Python code and data structures. However, my learning didn’t move much beyond the conceptual. If I had coded everything myself, I would have learned more about coding; since the LLMs did most of the implementation work, I learned more about how to steer LLMs to do my bidding.
I did the brainstorming. I did the data-structure design. I made the requests. However, I did not write the code, so my knowledge of RAG remains fairly theoretical. That’s the trade-off.
I’m not saying the trade-off is bad, but it’s a good thing I came into the exercise with some coding experience. It’s hard to get by with only theoretical knowledge, and relying on LLMs to do the heavy lifting is a disservice to learning. I say this while acknowledging that LLMs can be a really useful way to retrieve information!
Cool, yet scary
Cool: I liken LLMs to semantic similarity searches, but with tokens instead of words. Semantic similarity helps with searches when you don’t know the exact word used by a writer. Useful!
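Here’s a tiny illustration of what I mean, using OpenAI’s embeddings API purely as an example (this is not my proof-of-concept; any embedding model would show the same effect): two phrasings of the same idea that share almost no words still land close together in embedding space, which is what lets retrieval find a chunk even when your query uses different vocabulary than the writer did.

```python
# Illustrative only. Assumes the `openai` package and an OPENAI_API_KEY;
# any embedding model would demonstrate the same effect.
import math

from openai import OpenAI

client = OpenAI()
phrases = [
    "How do I delete my account?",          # how a reader might ask
    "Removing a user profile permanently",  # how the docs might phrase it
]
response = client.embeddings.create(model="text-embedding-3-small", input=phrases)
a, b = (item.embedding for item in response.data)

cosine = sum(x * y for x, y in zip(a, b)) / (
    math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
)
# The score is relatively high even though the phrases share no keywords.
print(f"cosine similarity: {cosine:.2f}")
```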
Scary: as with all LLMs, you don’t know exactly what will come out. You have to craft system prompts and fine-tune them forever; models change; the output isn’t deterministic,1 and you’re not in control.
I’m not the one
I can’t shake the feeling that I shouldn’t have been the one to create a RAG proof-of-concept. The knowledge was there to read, the LLM tools were there to use, the result was successful… but I can’t verify that what I created worked properly. My proof-of-concept worked when I tried using it, but that’s as far as I got. It was definitely not ready for a real-world implementation.
But even if I was the right person for the proof-of-concept, I am not the one to implement RAG for any search system.2 I have neither the software engineering experience to code such a system nor the expertise to verify LLM output. Someone else must be involved, and so documentation search showed me how much of a cross-team effort it should be.
Then came the refactor
I could have been done here, but then came a massive task: a docs refactor! You see, my team at work had planned to change our documentation to better fit our new personas. We would have to fix the navigation bar, move articles around, and fix links. At the same time, we had been told we really needed versioned documentation to support a new type of software deployment, which meant splitting our docs into two parts. No pressure.
This work needed to be done quickly. I armed myself with caffeine, GitHub Copilot, OpenAI Codex, months of planning, and a lot of tea. This entire workweek was spent on the migration, and I’m glad to report that the refactored documentation builds correctly.
The number of times I turned to LLMs got me thinking. Here’s a sample of tasks that I used them for:
- Not everyone in the team followed best practices when writing relative URLs, and some people implemented JSX components with a URL style that could not work after the refactor. I used LLMs to help me find and fix the former (augmenting my regular expressions; a sketch of that kind of scan follows this list), and I spent hours asking LLMs why links in JSX components didn’t work as expected, because I had no way of figuring it out on my own. Some time was saved, but a lot was lost refining queries and learning how to ask the right questions.
- To avoid maintaining two entirely separate documentation sites, I implemented Docusaurus’s “multi-instance” feature for docs: two separate documentation instances, built together. However, I had to ask an LLM to create a smart instance switcher for the navigation bar. The result looks good, and I think the code looks okay, but how maintainable is that switcher component?
- I had to fix bad code in plugins created by other people who had used LLMs. In one instance, multiple LLMs caught errors in a script that loads in the reader’s browser, so I asked them to be critical of those errors and fix the bugs. I went through three LLMs (GPT-5.3-codex, Claude Opus 4.6, GPT-5.2-codex) to try to make sure they policed each other’s results. I’m still not sure the end result is great.
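For the link cleanup mentioned in the first item, the scans looked roughly like the sketch below. This is a hypothetical reconstruction rather than the actual patterns, and it assumes the problem was Markdown links written without a `./`, `../`, or `/` prefix; our real URL issues were messier.

```python
# Hypothetical sketch of the kind of link scan I mean; the real patterns
# (and the real problems) in our docs were messier than this.
import re
from pathlib import Path

# Markdown links whose target is not external, anchored, absolute, or
# explicitly relative (./ or ../).
SUSPECT_LINK = re.compile(r"\[[^\]]+\]\((?!https?://|mailto:|#|\.{1,2}/|/)([^)]+)\)")

for pattern in ("*.md", "*.mdx"):
    for md_file in Path("docs").rglob(pattern):
        lines = md_file.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            for match in SUSPECT_LINK.finditer(line):
                print(f"{md_file}:{lineno}: suspicious link target -> {match.group(1)}")
```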
On reflection, I didn’t need to use LLM tools, but the number of weird little things in our documentation made it reasonable to delegate some tasks to an outside force. Some of these weird little things had been my fault, most had been done by others, and it was considered normal that I would fix them. Put another way, AI tools make it more acceptable for people to take thoughtless action.
This finding goes back to the cool and scary parts of retrieval-augmented generation: it’s great that people can do more, but there are still consequences to those actions. We have the ability to summarize information and be vibe coders, but we diminish our knowledge and fail to understand the consequences of what we create.
Another thought is that LLMs teach us how differently language can be interpreted. Someone who produces language understands their words differently than the person consuming them. This week, I got to see how slightly different wording affected LLMs; even though the exercise was successful, I still found myself using words loaded with human expectations the LLMs could not pick up on. Literal meaning trumped the occasional badly voiced intent.
RAG still depends on the wording of the original content, and on the wording of the user query. Augmentation is not omniscience, and we should never treat it that way.
Final thoughts
I’m impressed and humbled by both the RAG proof-of-concept and this week’s results. Being able to create something complex by thinking at a screen is damn-near miraculous. We’re way past punch cards!
Still, how far does this go? AI economic bubbles aside, the technology is here to stay, and is used to go beyond one’s knowledge. This week alone, what I created went beyond my understanding many times. Will the new normal consist of creating things we cannot maintain well on a shoestring budget?
While we think on these questions, I’ll be here waiting for computer parts to go down in price. Chances are I’ll be waiting awhile.