Cognitive RAG Architectures: How AI Can Borrow from Human Memory to Make Better Decisions
Written for Computational Models of Decision-making, Columbia.
Every decision you make is bounded by what you can bring to mind. You don't choose between every possible option in the universe. You choose between the ones your memory hands you in the moment. If your brain doesn't surface a relevant past experience, that experience may as well not exist as far as the choice in front of you is concerned. This is a quiet fact about decision-making that gets ignored when we talk about intelligence: the bottleneck is retrieval.
Most AI systems that need to "know things" use a method called Retrieval-Augmented Generation, or RAG. RAG works by hooking a language model up to a giant document database. When the user asks a question, the system searches the database for matching passages, then feeds those passages to the model so it can write an answer. The original paper describing this approach, by Lewis et al. in 2020, called the database a "non-parametric memory," essentially an external library the model can consult. It was a clever fix for a real problem: language models forget, hallucinate, and can't easily be updated. Hooking them up to a searchable library addresses all three at once.
But consulting a library and remembering are not the same thing, however similar they sound. Human cognition figured this out a long time ago and solved it with a specific architecture that is stratified, decay-weighted, reflective, and reconstructive. AI is finally starting to copy this architecture. This paper argues that the future of retrieval-based AI lies not in scaling the corpus but in scaling the cognitive architecture: by importing human memory's stratified structure, decay-weighted accessibility, reflective consolidation, and reconstructive recall, AI systems can make better decisions from sparse data than RAG can from massive corpora.
Introduction and History of RAG
The original RAG system is relatively minimal. Lewis et al. describe it as a model where "the parametric memory is a pre-trained seq2seq transformer, and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever." In plain terms, there's the model's built-in knowledge (parametric), and there's the library it can search (non-parametric). When a query comes in, the retriever finds the top-K matching passages, and the generator writes an answer conditioned on those passages.
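The retrieve-then-generate loop described above can be sketched in a few lines. This is a minimal illustration, not Lewis et al.'s implementation: the three-dimensional vectors are toy stand-ins for a neural encoder's embeddings, and the names `cosine`, `retrieve_top_k`, and `index` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_top_k(query_vec, index, k=2):
    """Return the k passages whose embeddings are most similar to the query."""
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy "non-parametric memory": (passage, embedding) pairs.
# Assumption: a real system would embed passages with a trained encoder.
index = [
    ("Paris is the capital of France.", [0.9, 0.1, 0.0]),
    ("The mitochondria is the powerhouse of the cell.", [0.0, 0.2, 0.9]),
    ("France borders Spain and Italy.", [0.8, 0.3, 0.1]),
]

query_vec = [1.0, 0.2, 0.0]  # pretend embedding of the user's question
passages = retrieve_top_k(query_vec, index, k=2)

# The generator would be conditioned on a prompt built from the passages.
prompt = "Context:\n" + "\n".join(passages) + "\n\nQuestion: What is the capital of France?"
```

Note what is absent: every passage lives in one undifferentiated list, and nothing about the retrieval step depends on when a passage was stored or what was retrieved before.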
This works. RAG set new state-of-the-art results on open-domain question answering when it came out. But cognitively, it's flat. There's no distinction between different kinds of memory, no sense of which memories are old or new, and no mechanism for the system to reflect on what it has retrieved and form higher-level conclusions. The implicit bet is that bigger and better corpora will keep producing better answers.
There is, however, good reason to doubt that bet. In a 2022 paper on scaling laws and model architectures, Tay et al. ran an enormous experiment, training over a hundred models across different architectures and sizes. They found that not all architectures scale the same way. Some models that performed well at small sizes hit a wall as they got bigger. The authors warn that "novel inductive biases can be indeed quite risky," and that scaling without the right architectural priors can magnify failure rather than fix it. The lesson for retrieval systems is structural. If the architecture has weak cognitive priors (that is, if it treats retrieval as simple lookup), throwing more documents at it isn't a solution. It's the same losing bet at a bigger scale.
Recent Systems Borrowing from Cognition
Three recent lines of work have started doing something different. Instead of scaling the corpus, they're scaling the architecture by importing structure from cognitive psychology.
CoALA, by Sumers et al., proposes a framework called Cognitive Architectures for Language Agents. The core move is to take the memory structure that cognitive scientists have studied for decades — distinguishing working memory from long-term memory, and further dividing long-term memory into episodic, semantic, and procedural types — and apply it to language model agents. Working memory holds what the agent is currently thinking about. Episodic memory stores past experiences. Semantic memory stores facts about the world. Procedural memory stores how to do things.
CoALA argues that "language agents choose actions via decision-making, which follows a repeated cycle." In each cycle, the agent uses retrieval and reasoning to propose, evaluate, and select an action. What memory gets surfaced into working memory at each step shapes what options the agent even considers. The stratification matters because different decisions need different kinds of memory: planning needs episodic recall of similar past situations, factual reasoning needs semantic knowledge, skill execution needs procedural memory. RAG's flat library can't make these distinctions.
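To make the stratification concrete, here is a rough sketch of a decision cycle in which the kind of task determines which long-term store feeds working memory. It is an illustration of the idea, not CoALA's specification: the `ROUTES` table, the keyword-match retrieval, and the `propose` and `evaluate` functions are all hypothetical simplifications.

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    working: list = field(default_factory=list)     # what the agent is attending to now
    episodic: list = field(default_factory=list)    # past experiences
    semantic: list = field(default_factory=list)    # facts about the world
    procedural: list = field(default_factory=list)  # how to do things

# Assumption: a fixed routing table from task kind to memory store.
ROUTES = {"planning": "episodic", "factual": "semantic", "skill": "procedural"}

def decision_cycle(memory, task_kind, cue, propose, evaluate):
    """One cycle: retrieve into working memory, propose, evaluate, select."""
    store = getattr(memory, ROUTES[task_kind])
    # Naive keyword match stands in for a real retriever.
    memory.working = [item for item in store if cue in item]
    candidates = propose(memory.working)
    return max(candidates, key=lambda option: evaluate(option, memory.working))

def propose(working):
    return ["picnic", "indoor dinner"]

def evaluate(option, working):
    # Penalize options that appear in a recalled bad experience.
    return -sum(1 for item in working if option in item and "ruined" in item)

memory = AgentMemory(
    episodic=["rain ruined the last picnic"],
    semantic=["a picnic is a meal eaten outdoors"],
)

planned = decision_cycle(memory, "planning", "picnic", propose, evaluate)
looked_up = decision_cycle(memory, "factual", "picnic", propose, evaluate)
```

The same cue and the same candidate options yield different choices depending on which store is consulted: the episodic route surfaces the bad experience and steers away from the picnic, while the semantic route surfaces only a definition and leaves the options tied.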
Coll et al. go a step further with their Human Digital Twin system, which they call "Digital Me." The system is designed to act as a conversational stand-in for a specific person, and to do that, it has to remember the way that person remembers. The authors explicitly compare their system to standard RAG and argue that "authentic HDTs require more dynamic and reflective mechanisms, similar to the functionality of the human memory system, where information is not only stored but continuously updated, structured, and selectively retrieved."
Their retrieval system scores memories along three dimensions: recency, importance, and relevance. Recency uses a decay function, modeled on the "testing effect" in cognitive psychology, which is the well-documented finding that memories you've recalled recently are easier to recall again. Importance is rated by GPT-4o, which the authors validate against 73 human evaluators and find produces "realistic, human-like ratings." Relevance uses embedding similarity, the standard RAG move. The combined score determines what gets surfaced.
This matters for decisions because not all relevant information is equally useful. A memory from ten minutes ago about your friend's mood is more decision-relevant than a memory from ten years ago, even if both technically match the query. A memory of a near-disaster carries more weight than a memory of an unremarkable lunch. Digital Me's scoring system is a decision-relevance heuristic. It's asking not what matches but what matters.
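A three-part score of this kind might look like the sketch below. The essay does not give Coll et al.'s exact formula, so everything here is an assumption made for illustration: the exponential half-life decay, the equal weights, the additive combination, and the names `Memory`, `recency`, and `combined_score`.

```python
import math
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    embedding: list            # toy vector; stand-in for a real encoder's output
    importance: float          # 0..1, e.g. a normalized rating from an LLM judge
    hours_since_access: float  # time since the memory was last recalled

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recency(hours, half_life_hours=24.0):
    """Exponential decay: 1.0 when just accessed, 0.5 after one half-life."""
    return 0.5 ** (hours / half_life_hours)

def combined_score(memory, query_vec, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of recency, importance, and relevance (assumed form)."""
    w_rec, w_imp, w_rel = weights
    return (w_rec * recency(memory.hours_since_access)
            + w_imp * memory.importance
            + w_rel * cosine(memory.embedding, query_vec))

query_vec = [1.0, 0.0]
recent = Memory("Friend seemed upset ten minutes ago", [1.0, 0.0], 0.5, 10 / 60)
old = Memory("Friend seemed upset ten years ago", [1.0, 0.0], 0.5, 10 * 365 * 24)

# Both memories match the query equally well; only recency separates them.
ranked = sorted([old, recent], key=lambda m: combined_score(m, query_vec), reverse=True)
```

With identical relevance and importance, the ten-minute-old memory ranks first, which is the "what matters, not what matches" behavior described above. Changing the weights shifts how much each dimension counts.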
Park et al., in their 2023 work on generative agents, add the final cognitive piece: reflection. Their agents don't just remember observations, but also periodically synthesize those observations into higher-level insights. The authors describe reflection as a process where the agent generates "higher-level, more abstract thoughts" that get stored alongside raw observations and can themselves be retrieved later.
The decision-making payoff is direct. In one example, an agent named Klaus is asked who he'd most like to spend time with. Without reflection, his system simply picks the person he's seen most often, which happens to be a dorm neighbor he barely talks to. With reflection, the system has synthesized memories of Klaus working on research into a higher-level inference that he's passionate about research, and a similar inference about another agent named Maria. The decision changes. As the authors put it, "with access to reflection memories, Maria answered confidently" about a related decision that required deeper synthesis. Reflection lets agents make decisions that aren't justified by any single retrieved memory, but by the pattern across many of them.
In a more recent Park et al. paper from 2024, their team interviewed over a thousand people for two hours each, then built language model agents grounded in the transcripts. The agents predicted held-out responses on the General Social Survey at 85% of the participants' own test-retest consistency. More importantly for our purposes, they predicted choices in five well-known economic games and behaviors in five social science experiments at comparable accuracy.
The 2024 Park et al. paper provides an empirical bridge from memory to decision. The same retrieval architecture that predicts what someone says they believe also predicts how they actually choose under real stakes. And the authors' ablation analysis shows that when they remove the survey questions whose answers can be looked up directly from the interview, accuracy barely drops. When they remove the questions that require inference (that is, reasoning from indirect cues in the interview), accuracy collapses. So the cognitive payoff isn't in the matching, but in the inferring. Roughly half the system's predictive power comes from the model's ability to reason over sparse, indirectly relevant data, not from finding direct matches.
That finding cuts directly against the corpus-scaling assumption baked into standard RAG. You don't need every fact about a person to predict their decisions. You need the right cognitive architecture to infer from a few facts. Sparse data plus good priors beats dense data plus weak priors. This mirrors the Tay et al. conclusion, transposed from model architectures to memory architectures.
Toward a Cognitive RAG
CoALA, Digital Me, and the generative agents work all do something RAG doesn't: they take human cognition seriously as an architectural template. But they still miss something fundamental about how human memory actually works. Digital Me's authors flag the gap, pointing out that their system "does not yet simulate intuitive or subconscious memory influences, such as spontaneous recall, intuition-driven decision-making, or the implicit emotional impact of past experiences." Conscious, explicit retrieval is roughly solved, but the rest isn't, and the rest is how most human decisions actually occur.
Three additions would push retrieval architecture closer to how human memory actually conditions choice. I'll describe them at a conceptual level. (None are implementation-ready proposals; they're sketches of what a more cognitively faithful retrieval system would need. As a summer project, I will be delving deeper into their implementations.)
Spreading activation
In human cognition, recalling one memory partially activates related memories, even ones you didn't consciously search for. The thought of a high school friend brings the smell of the cafeteria with it. The cafeteria brings your old locker combination. This happens below the threshold of awareness, even before you attempt to explicitly retrieve the memory. Standard RAG doesn't do this. Each query is independent. A user sends a question, the retriever embeds it, runs nearest-neighbor search over the document index, and returns the top-K matches, and the generator then writes an answer. For the next query, the system starts from scratch, and the retriever has no memory of what was retrieved last time. Some RAG systems do have a form of cross-query state, such as chat history, where previous messages stay in the context window. But that's the context window doing the work, not the retrieval system.
A cognitive RAG would let each retrieved memory partially activate its embedding-space neighbors, raising their retrieval scores for some bounded window of future queries. The system would carry a kind of attentional field across queries. Decisions would get conditioned on adjacent considerations the agent didn't explicitly ask for, which is how human decisions actually work.
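As a conceptual sketch only, the retriever could keep a per-memory activation value that decays on every query and gets topped up whenever a neighbor is retrieved. The class name, the `spread`, `decay`, and `neighbor_threshold` parameters, and the additive boost are all assumptions chosen for illustration; the decay gives a soft version of the bounded window.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SpreadingActivationIndex:
    """Top-K retriever whose scores carry residual activation across queries."""

    def __init__(self, memories, spread=0.5, decay=0.5, neighbor_threshold=0.7):
        self.memories = memories                  # list of (text, embedding)
        self.spread = spread                      # how strongly activation spreads
        self.decay = decay                        # fraction kept after each query
        self.neighbor_threshold = neighbor_threshold
        self.activation = [0.0] * len(memories)   # cross-query state

    def query(self, query_vec, k=1):
        base = [cosine(query_vec, vec) for _, vec in self.memories]
        total = [b + a for b, a in zip(base, self.activation)]
        top = sorted(range(len(total)), key=lambda i: total[i], reverse=True)[:k]

        # Old activation fades, then each retrieved memory primes its neighbors.
        self.activation = [a * self.decay for a in self.activation]
        for i in top:
            for j, (_, vec) in enumerate(self.memories):
                sim = cosine(self.memories[i][1], vec)
                if j != i and sim >= self.neighbor_threshold:
                    self.activation[j] += self.spread * sim
        return [self.memories[i][0] for i in top]

memories = [
    ("high school friend", [1.0, 0.0, 0.0]),
    ("smell of the cafeteria", [0.8, 0.6, 0.0]),
    ("tax forms", [0.0, 0.8, 0.6]),
]
ambiguous_query = [0.0, 1.0, 0.0]

cold = SpreadingActivationIndex(memories).query(ambiguous_query)

primed_index = SpreadingActivationIndex(memories)
primed_index.query([1.0, 0.0, 0.0])          # first think of the friend
primed = primed_index.query(ambiguous_query)  # then ask the ambiguous question
```

Asked cold, the ambiguous query retrieves the tax forms. Asked right after recalling the friend, the same query retrieves the cafeteria, because the earlier retrieval left its neighbor partially activated.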
Affective tagging
Digital Me has an importance score but no emotional valence. Human memory doesn't work that way. Memories formed in high-arousal states are stickier. Mood-congruent memories retrieve preferentially, so when you're sad, you remember sad things more easily. Decisions made under emotional load draw from a different memory pool than decisions made calmly. This is one of the most replicated findings in memory research, and it's entirely absent from RAG.
A cognitive RAG would tag each memory at encoding with a valence and arousal vector, which the language model could infer from the content. Retrieval scores would then be modulated by the agent's current emotional context. Decisions would track mood-congruent recall, which is to say, they would look more like human decisions actually look.
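One way such modulation could work is sketched below. The scoring form is an assumption, not an established model: relevance is scaled up by how closely a memory's valence matches the current mood, and high-arousal memories get a small flat bonus. The names `TaggedMemory` and `affective_score` and the two weights are hypothetical, and the valence and arousal tags are hand-assigned here rather than inferred by a language model.

```python
from dataclasses import dataclass

@dataclass
class TaggedMemory:
    text: str
    relevance: float  # precomputed query similarity, 0..1
    valence: float    # -1 (negative) .. +1 (positive), tagged at encoding
    arousal: float    # 0 (calm) .. 1 (intense), tagged at encoding

def affective_score(memory, mood_valence, mood_weight=0.5, arousal_weight=0.2):
    """Relevance boosted by mood congruence, plus a bonus for arousal."""
    # Congruence is 1.0 when valence equals the mood, 0.0 when fully opposed.
    congruence = 1.0 - abs(memory.valence - mood_valence) / 2.0
    return (memory.relevance * (1.0 + mood_weight * congruence)
            + arousal_weight * memory.arousal)

def rank(memories, mood_valence):
    return sorted(memories,
                  key=lambda m: affective_score(m, mood_valence),
                  reverse=True)

memories = [
    TaggedMemory("Failed the exam", relevance=0.7, valence=-0.8, arousal=0.6),
    TaggedMemory("Won the award", relevance=0.7, valence=0.8, arousal=0.6),
]

when_sad = rank(memories, mood_valence=-0.7)
when_happy = rank(memories, mood_valence=0.7)
```

The two memories are equally relevant and equally arousing, so the ordering is decided entirely by mood: the exam surfaces first when the agent is sad, the award when it is happy.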
Stochastic reconstructive retrieval
The deepest issue is that RAG treats retrieval as replay. The same query returns the same passages every time. But Digital Me's own authors note that human retrieval is "a reconstructive process shaped by our current knowledge, assumptions, and the context of retrieval." Each act of recall is slightly different from the last. This variability is arguably part of how the testing effect works in the first place: slight differences across retrievals strengthen slightly different memory traces, producing richer, more flexible representations over time.
A cognitive RAG would sample memories stochastically, drawing from a softmax over retrieval scores rather than just taking the top-K. The same query would produce slightly different retrievals on different occasions, and each retrieval would slightly modify the memory it touched. Decisions would gain the productive variability that lets humans avoid local optima. Retrieval would become reconstruction, not replay.
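The sampling half of this proposal is easy to sketch. The code below draws k distinct memories, one at a time, from a softmax over the retrieval scores; the `temperature` parameter and the function names are assumptions added for illustration, and the step where recall modifies the touched memory is left out.

```python
import math
import random

def softmax(scores, temperature=1.0):
    """Convert scores to probabilities; lower temperature sharpens the peak."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_memories(scores, k, temperature=1.0, rng=None):
    """Sample k distinct memory indices, weighted by softmax of their scores."""
    rng = rng or random.Random()
    remaining = list(range(len(scores)))
    chosen = []
    for _ in range(min(k, len(remaining))):
        probs = softmax([scores[i] for i in remaining], temperature)
        pick = rng.choices(remaining, weights=probs, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)  # sample without replacement
    return chosen

scores = [0.9, 0.5, 0.1]  # toy retrieval scores for three memories

# Near-zero temperature recovers ordinary top-K behavior.
near_greedy = sample_memories(scores, k=2, temperature=0.01, rng=random.Random(0))

# Higher temperature lets lower-scored memories through some of the time.
varied = sample_memories(scores, k=2, temperature=1.0, rng=random.Random(0))
```

Temperature acts as the dial between replay and reconstruction: as it approaches zero the sampler behaves like deterministic top-K, and as it rises the same query returns different memories on different occasions.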
Humans Borrowing from AI?
Interestingly, while AI systems are beginning to emulate human memory architecture, humans are doing the reverse, increasingly externalizing semantic memory to retrieval systems. The "Google effect," as cognitive psychologists call it, is observable and measurable. We increasingly remember where to find information rather than the information itself. We treat our own beliefs as retrievable rather than constructed. We outsource reflection to language models that can do it for us, less effortfully than we can.
AI is internalizing what humans do. Humans are externalizing what AI does. Both are converging on hybrid retrieval-reasoning architectures. The architectures making the best decisions, in the long run, will probably be the ones that take human memory seriously not as a target to surpass but as a design specification. The corpus was only a starting point. The architecture is the ultimate target.