Research

Cognitive Retrieval and the Architecture of Memory

Approaching long-term memory as a cognitive process

Hyperplane Labs · May 2026

Memory is not search

The default architecture for AI memory is search. Store text, embed it, retrieve the closest vectors when a query arrives. This works for factual lookup, the kind of memory that answers "what's the user's name?" But it fails for the kinds of memory that actually define intelligence: knowing what changed, understanding what matters now versus what mattered then, assembling a coherent picture from fragments scattered across time.

The failure is not a tuning problem. It is a category error. Search treats memory as a static collection to be queried. Human memory is nothing like this.

When you try to remember what a colleague said about a project two weeks ago, you don't search a transcript. You think about where you were sitting, whether it was a meeting or a hallway conversation, what decision it was connected to. The answer surfaces through context, not lookup. Sometimes you get it wrong, because reconstruction is lossy. But the process itself is what makes human memory useful: it prioritises, contextualises, and connects.

We think AI memory systems should work this way too. Not as a metaphor, but as a design principle with specific architectural consequences.

What cognitive science actually tells us

Three bodies of research shaped our thinking. They are often cited in AI memory papers, usually as a sentence of motivation before returning to engineering. We want to take them more seriously than that.

Episodic memory (Tulving, 1972) is not just "memory with timestamps." Tulving's core insight was that episodic memory is a fundamentally different system from semantic memory, with different encoding, different retrieval mechanisms, and different failure modes. Semantic memory stores facts: Paris is the capital of France. Episodic memory stores experiences: the conversation where someone told you they were moving to Paris, the tone of their voice, the context of why it came up.

Most AI memory systems collapse this distinction entirely. Everything goes into the same vector store, embedded the same way, retrieved the same way. A user preference stated casually three months ago is treated identically to a fact stated emphatically yesterday. The system has no way to reason about the experiential context of how something was learned, which means it has no way to judge how much weight to give it.

M-1 preserves episodic structure. Memories carry their conversational context, their temporal position, their relationship to other memories from the same session. This is not metadata bolted onto a vector. It is part of how retrieval works.
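A minimal sketch of what an episodic record might look like, assuming a Python representation. The field names (`session_id`, `turn_index`, `emphasis`, `topic`) are hypothetical illustrations, not M-1's actual schema, which this article does not describe. The point is that the conversational context travels with the text and its embedding, so a casual mention and a deliberate statement are distinguishable records.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class EpisodicMemory:
    """A memory that keeps the context in which it was learned (hypothetical schema)."""
    text: str
    session_id: str        # which conversation it came from
    turn_index: int        # position within that conversation
    timestamp: datetime    # when it was said
    emphasis: float        # 0.0 = passing mention, 1.0 = stated deliberately
    topic: str = ""        # what the conversation was about at the time
    embedding: tuple[float, ...] = field(default=(), repr=False)


def same_episode(a: EpisodicMemory, b: EpisodicMemory) -> bool:
    """Two memories belong to the same episode if they share a session."""
    return a.session_id == b.session_id


# A casual remark months ago versus an emphatic statement this week.
casual = EpisodicMemory(
    text="I might try vegetarian food at some point",
    session_id="s-014", turn_index=37,
    timestamp=datetime(2026, 2, 3, tzinfo=timezone.utc),
    emphasis=0.1, topic="weekend plans",
)
emphatic = EpisodicMemory(
    text="Please remember: I am vegetarian now",
    session_id="s-052", turn_index=2,
    timestamp=datetime(2026, 5, 1, tzinfo=timezone.utc),
    emphasis=0.9, topic="meal planning",
)
```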

Reconstructive recall (Bartlett, 1932) is the finding that human memory does not store and replay recordings. It reconstructs. Bartlett's famous serial reproduction experiments showed that people don't remember stories accurately. They remember the gist, fill in gaps with expectations, and produce something coherent but not faithful. This was initially seen as a flaw. Later work (Schacter et al., 2011) reframed it: reconstruction is what allows memory to be flexible, to answer questions that were never explicitly stored, to synthesise across experiences.

For AI memory, the implication is direct. A system that can only retrieve what was explicitly stored will fail at multi-session reasoning, because the answer to a cross-session question was never written down in any single place. It has to be assembled. M-1's multi-query decomposition is our implementation of this principle. A complex query is broken into parallel retrieval passes, each targeting a different aspect of the answer. The results are assembled into a coherent context. The answer is reconstructed, not looked up.
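The decompose, retrieve, and assemble loop can be sketched as follows. This is an illustration under stated assumptions, not M-1's implementation: `decompose` is a hard-coded stand-in for whatever component splits the question (in practice plausibly a language model), `retrieve` is a toy word-overlap matcher standing in for a real index, and session numbers are assumed to increase over time. The retrieval passes run concurrently via `concurrent.futures`.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy memory store of (session_number, text); a stand-in for a real index.
STORE = [
    (1, "User started learning Spanish in January"),
    (2, "User switched from Duolingo to a private tutor"),
    (3, "User passed the A2 Spanish exam in April"),
    (3, "User likes hiking on weekends"),
]


def decompose(question: str) -> list[str]:
    """Hypothetical decomposer: split a cross-session question into sub-queries."""
    if "Spanish" in question:
        return ["started Spanish", "tutor", "exam"]
    return [question]


def retrieve(sub_query: str, k: int = 2) -> list[tuple[int, str]]:
    """Toy retrieval pass: rank memories by word overlap with the sub-query."""
    words = set(sub_query.lower().split())
    scored = [(len(words & set(text.lower().split())), (sid, text)) for sid, text in STORE]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [item for score, item in ranked if score > 0][:k]


def answer_context(question: str) -> list[tuple[int, str]]:
    """Run one retrieval pass per sub-query in parallel, then assemble the results."""
    sub_queries = decompose(question)
    with ThreadPoolExecutor() as pool:
        passes = list(pool.map(retrieve, sub_queries))
    # Assemble: merge the passes, drop duplicates, and order by session (time).
    seen: set[tuple[int, str]] = set()
    assembled: list[tuple[int, str]] = []
    for hits in passes:
        for hit in hits:
            if hit not in seen:
                seen.add(hit)
                assembled.append(hit)
    return sorted(assembled, key=lambda hit: hit[0])
```

No single stored memory answers "How has my Spanish progressed?", but the three passes together recover the January start, the switch to a tutor, and the April exam, in order.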

Temporal context models (Howard & Kahana, 2002) formalise something intuitive: that memory is organised by time, not just by content. Items experienced close together in time are associated with each other. Recent items are more accessible, but older items of high personal relevance persist. The decay is not uniform; it is shaped by how significant the original experience was and how often it has been revisited.
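At the heart of the temporal context model is a slowly drifting context vector: each item is bound to whatever the context was when it was experienced, so items close together in time end up with similar contexts. A simplified sketch, assuming orthonormal (one-hot) item inputs and a fixed drift parameter β, so the update is t_i = ρ·t_{i-1} + β·e_i with ρ = √(1 − β²) keeping the context at unit length. The full model also lets a retrieved item reinstate its context, which is omitted here, and the significance-dependent persistence described above is not part of this sketch; it illustrates contiguity only.

```python
import math


def drift_contexts(n_items: int, beta: float = 0.5) -> list[list[float]]:
    """Simplified TCM-style context drift with one-hot item inputs.

    Dimension 0 holds the initial context; item i writes into dimension i + 1,
    which is orthogonal to the previous context, so norms stay at 1.
    """
    rho = math.sqrt(1 - beta ** 2)
    t = [1.0] + [0.0] * n_items
    contexts = []
    for i in range(n_items):
        t = [rho * x for x in t]  # the old context fades...
        t[i + 1] += beta          # ...and item i's input is mixed in
        contexts.append(list(t))
    return contexts


def similarity(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity because every context has unit norm."""
    return sum(x * y for x, y in zip(a, b))
```

With β = 0.6 (so ρ = 0.8), the contexts of adjacent items have similarity 0.8, and items three steps apart have 0.8³ = 0.512: association falls off geometrically with temporal distance.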

Most retrieval systems ignore this entirely, or implement a crude recency bias that downweights anything old. M-1's temporal salience scoring is more nuanced. It models the interaction between recency, relevance, and conversational significance, reflecting the actual dynamics that cognitive research describes. A memory from six months ago that was stated with emphasis in a focused conversation should outrank a passing mention from yesterday. Getting this balance right turns out to matter a great deal for temporal reasoning and knowledge update tasks, exactly the categories where simpler systems struggle.
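M-1's actual scoring function is not described here, so the following is only a hypothetical sketch of how recency, relevance, and significance might interact, assuming each candidate already carries a `relevance` score, a `significance` estimate, and a `revisits` count. The design choice it illustrates is that significance and revisits stretch the decay half-life rather than adding a flat bonus, so decay itself becomes non-uniform. All constants are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    relevance: float     # semantic similarity to the query, 0..1
    age_days: float      # time since the memory was formed
    significance: float  # emphasis/focus of the original conversation, 0..1
    revisits: int = 0    # how often it has been referred back to since


def salience(c: Candidate, base_half_life_days: float = 7.0) -> float:
    """Hypothetical salience: relevance x significance-shaped recency decay."""
    # Significant or often-revisited memories decay more slowly.
    half_life = base_half_life_days * (1 + 40 * c.significance) * (1 + c.revisits)
    retention = 0.5 ** (c.age_days / half_life)
    # Significance also weights the score directly.
    return c.relevance * retention * (0.25 + 0.75 * c.significance)


old_emphatic = Candidate("Allergic to peanuts, never suggest them", 0.8, age_days=180, significance=0.9)
recent_passing = Candidate("Had a peanut snack once", 0.8, age_days=1, significance=0.1)
```

With these constants the six-month-old emphatic memory scores roughly 0.46 against roughly 0.25 for yesterday's passing mention, while two memories of equal significance still favour the more recent one.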

Design consequences

These are not abstract principles. Each one has specific architectural consequences that distinguish M-1 from systems built on the search paradigm.

Preserving episodic structure means memories are not just text with embeddings. They carry context about when, where, and how they were formed. Retrieval uses this context, not just semantic similarity.
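One simple way to let episodic context participate in retrieval is neighbour expansion: run an ordinary similarity search, then pull in the turns that surrounded each hit in its original conversation. This is a generic technique offered as an illustration, not a description of M-1; `Turn` and `expand_episodes` are hypothetical names, and the seed hits are assumed to come from some existing similarity search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    session_id: str
    turn_index: int
    text: str


def expand_episodes(seeds: list[Turn], store: list[Turn], window: int = 1) -> list[Turn]:
    """Return each seed hit plus its neighbouring turns from the same session."""
    keep: set[Turn] = set()
    for seed in seeds:
        for turn in store:
            if (turn.session_id == seed.session_id
                    and abs(turn.turn_index - seed.turn_index) <= window):
                keep.add(turn)
    # Present the result in conversational order.
    return sorted(keep, key=lambda t: (t.session_id, t.turn_index))


store = [
    Turn("a", 0, "We discussed the Q3 launch"),
    Turn("a", 1, "Maria said the launch slips to October"),
    Turn("a", 2, "Because the vendor contract is delayed"),
    Turn("a", 3, "Then we talked about lunch"),
    Turn("b", 0, "Unrelated chat"),
]
```

If similarity search returns only Maria's remark, expansion also recovers what the launch was and why it slipped, which is the context a person would use to judge the remark.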

Reconstructive retrieval means queries are decomposed, not matched. A question that spans multiple conversations generates multiple retrieval passes, and the results are assembled and checked for coherence before the answering model sees them. The context presented to the model is constructed for the question, not pulled from a fixed index.
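A minimal sketch of an assembly step, assuming retrieved results have already been normalised into key/value facts with a day ordinal (a hypothetical representation; M-1's internal format is not described here). It orders each key's facts in time, drops duplicates of the latest value, and marks values that were later changed, so the answering model sees the conflict explicitly rather than two unexplained claims.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    key: str    # e.g. "user.city" (hypothetical key scheme)
    value: str
    day: int    # when the fact was stated, as a day ordinal


def assemble(facts: list[Fact]) -> list[str]:
    """Order retrieved facts in time and annotate conflicts for the answering model."""
    by_key: dict[str, list[Fact]] = defaultdict(list)
    for f in facts:
        by_key[f.key].append(f)
    lines = []
    for key, group in sorted(by_key.items()):
        group.sort(key=lambda f: f.day)
        latest = group[-1]
        for f in group[:-1]:
            if f.value != latest.value:  # earlier values that differ are marked as changed
                lines.append(f"{key} = {f.value} (day {f.day}; later changed)")
        lines.append(f"{key} = {latest.value} (day {latest.day}; most recent)")
    return lines
```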

Temporal awareness means scoring reflects the dynamics of human recall. Not just "how similar is this to the query" but "how recent is this, how significant was it, does it supersede something older, does it contradict something else." These signals are integrated into ranking, not applied as post-hoc filters.
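The difference between integrating a signal into ranking and applying it as a post-hoc filter can be shown in a few lines. In this hypothetical sketch, supersession is a multiplicative penalty (the value 0.3 is arbitrary): a superseded memory drops below the current one but remains retrievable, whereas a filter would delete it and make a question like "where did I used to live?" unanswerable.

```python
def rank(candidates: list[tuple[str, float, bool]], superseded_penalty: float = 0.3) -> list[str]:
    """Fold supersession into the score instead of filtering.

    Each candidate is (text, similarity, superseded). Superseded memories are
    down-weighted but stay in the ranking as historical context.
    """
    scored = [
        (sim * (superseded_penalty if superseded else 1.0), text)
        for text, sim, superseded in candidates
    ]
    return [text for _, text in sorted(scored, key=lambda s: s[0], reverse=True)]


candidates = [
    ("Lives in London", 0.9, True),   # older, superseded by the Berlin move
    ("Lives in Berlin", 0.8, False),
    ("Likes jazz", 0.2, False),
]
```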

Exabase's benchmark results on LongMemEval are consistent with this approach. The details are in their research paper (exabase.io/research). What matters from our perspective is not the specific score but what the category-level results reveal: the hardest memory tasks (multi-session synthesis, temporal reasoning, and knowledge updates) are exactly the ones where cognitive architecture makes the largest difference.

What comes next

Retrieval is the first problem, not the only one. A complete memory system needs to do more than surface the right information at the right time. It needs to actively maintain, reorganise, and improve what it knows.

Our research program at Hyperplane Labs extends the same cognitive framing to the processes that surround retrieval.

Perception is the front door: how raw input is comprehended, filtered, and encoded. Current systems ingest everything uniformly. A passing comment and a deliberate instruction get the same treatment. Human perception does not work this way. We attend to what seems important, encode it more deeply, and let the rest fade. Without this, memory systems accumulate noise at the same rate as signal, and retrieval quality degrades over time no matter how good the ranking is.
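As a purely illustrative sketch of importance-weighted encoding, the following uses a hand-written keyword heuristic to decide how deeply an utterance is encoded; the cue list, the scoring, and the 0.4 threshold are all hypothetical, and a real system would more plausibly use a learned classifier. Utterances below the threshold are simply not written to the long-term store, which is the "let the rest fade" half of the idea.

```python
import re

# Hypothetical cues that an utterance is a deliberate instruction rather than a passing comment.
DELIBERATE_CUES = re.compile(
    r"\b(always|never|remember|from now on|please|don't|do not)\b", re.IGNORECASE
)


def encoding_strength(utterance: str) -> float:
    """Score how deeply an utterance should be encoded, between 0.1 and 1.0."""
    hits = len(DELIBERATE_CUES.findall(utterance))
    return min(1.0, 0.1 + 0.3 * hits)


def encode(utterance: str, long_term: list[tuple[str, float]], threshold: float = 0.4) -> float:
    """Write strong utterances to long-term memory; weak ones are left to fade."""
    strength = encoding_strength(utterance)
    if strength >= threshold:
        long_term.append((utterance, strength))
    return strength
```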

Consolidation is what happens to memories after they are formed. In biological systems, sleep plays a role: memories are replayed, patterns are extracted, redundancies are merged, and unimportant details are allowed to decay. AI memory systems currently do none of this. They store everything forever at full fidelity, which sounds like an advantage until the memory store becomes so large and noisy that retrieval breaks down. Selective consolidation (knowing what to keep, what to compress, and what to forget) is an unsolved problem with direct practical consequences.
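A toy offline consolidation pass, to make the keep/merge/forget trade-off concrete. It assumes each memory trace carries a scalar strength, merges only exact (case-insensitive) duplicates where a real system would need semantic matching, and omits compression; the decay rate and forgetting threshold are arbitrary. Repetition strengthens a merged trace, so redundancy becomes evidence of importance rather than clutter.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    text: str
    strength: float  # 0..1


def consolidate(traces: list[Trace], decay: float = 0.8, forget_below: float = 0.2) -> list[Trace]:
    """One hypothetical consolidation pass: merge duplicates, decay, then forget weak traces."""
    merged: dict[str, Trace] = {}
    for t in traces:
        key = t.text.strip().lower()
        if key in merged:
            # Repetition strengthens the merged trace (capped at 1.0).
            merged[key].strength = min(1.0, merged[key].strength + t.strength)
        else:
            merged[key] = Trace(t.text, t.strength)  # copy, so inputs are not mutated
    kept = []
    for t in merged.values():
        t.strength *= decay
        if t.strength >= forget_below:
            kept.append(t)
    return kept
```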

Abstraction is the process of moving from specific instances to general knowledge. You remember individual meals at a restaurant, but you also form a general impression of the place. Both levels of representation are useful for different kinds of questions. Current AI memory is almost entirely instance-level: it stores individual interactions but cannot generalise across them. Building systems that form and maintain abstractions without losing the specific episodes they are derived from is a hard problem and one we think is essential for memory systems that improve with use.
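A minimal sketch of abstraction that keeps its provenance, using the restaurant example above. The `Visit` and `Impression` types are hypothetical, and averaging a rating is a deliberately simple stand-in for forming a "general impression". The key property is that the abstraction stores the ids of the episodes it was derived from, so the specific visits are never lost.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Visit:
    episode_id: str
    place: str
    rating: int  # 1-5, as the user reported it in that conversation


@dataclass
class Impression:
    place: str
    average_rating: float
    visits: int
    sources: list[str]  # episode ids, so the abstraction can be traced back


def abstract(visits: list[Visit]) -> dict[str, Impression]:
    """Form one general impression per place from individual visits."""
    by_place: dict[str, list[Visit]] = defaultdict(list)
    for v in visits:
        by_place[v.place].append(v)
    return {
        place: Impression(place, mean(v.rating for v in vs), len(vs), [v.episode_id for v in vs])
        for place, vs in by_place.items()
    }
```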

Reorganisation is what happens when new information changes what you already know. Not just adding a new fact, but restructuring existing knowledge around it. If a user says "actually, I moved to Berlin," the system should not just add this fact alongside the old one. It should understand that the user's city has changed, that previous location-dependent preferences may need revision, and that the old information is now historical context, not current state. This is harder than it sounds and most systems handle it poorly or not at all.
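The Berlin example can be sketched as an update that restructures rather than appends. This assumes a hypothetical `depends_on` map recording which beliefs rest on which attribute; how such dependencies would be discovered is exactly the hard part and is not addressed here. On a change, the old value moves to history and every dependent belief is flagged for review.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    current: dict[str, str] = field(default_factory=dict)
    history: list[tuple[str, str]] = field(default_factory=list)
    # Hypothetical dependency map: which beliefs rest on which attribute.
    depends_on: dict[str, list[str]] = field(default_factory=dict)
    needs_review: set[str] = field(default_factory=set)

    def update(self, key: str, value: str) -> None:
        """Restructure around a changed fact instead of adding it alongside the old one."""
        old = self.current.get(key)
        if old == value:
            return
        if old is not None:
            self.history.append((key, old))  # old value becomes historical context
            self.needs_review.update(self.depends_on.get(key, []))
        self.current[key] = value
```

Calling `update("city", "Berlin")` on a profile whose city is London leaves exactly one current city, keeps London in the history, and marks the location-dependent preferences for revision.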

Anticipation is proactive memory: surfacing information before it is requested, because the system has learned what tends to be relevant in a given context. This is speculative and further out, but it follows naturally from temporal context models. If the system can learn patterns in what information is needed and when, it can begin to pre-surface context, reducing latency and improving the quality of agent responses.
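A deliberately simple sketch of anticipation, assuming the system can label the current context (here just a string such as a meeting type) and log which memories ended up being used in it. The `Anticipator` class is hypothetical; it counts co-occurrences and pre-surfaces the memories most often needed in that context so far.

```python
from collections import Counter, defaultdict


class Anticipator:
    """Learn which memories tend to be needed in which context (hypothetical)."""

    def __init__(self) -> None:
        self.counts: defaultdict[str, Counter] = defaultdict(Counter)

    def observe(self, context: str, used_memory_ids: list[str]) -> None:
        """Record that these memories were used while in this context."""
        self.counts[context].update(used_memory_ids)

    def presurface(self, context: str, k: int = 2) -> list[str]:
        """Return the k memories most often needed in this context, before any query."""
        return [mid for mid, _ in self.counts[context].most_common(k)]
```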

Reflection is self-directed review: the system examining its own knowledge to identify gaps, contradictions, stale information, and decay. This is the process that keeps a memory system healthy over time. Without it, knowledge bases rot. Links go stale, facts become outdated, contradictions accumulate. Automated reflection, where the system periodically reviews and maintains its own knowledge, is a prerequisite for memory systems that operate reliably over long time horizons.
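A minimal sketch of one reflection pass, assuming each stored entry records when it was last confirmed. The `Entry` type and the 180-day staleness threshold are hypothetical. The pass flags entries that have not been confirmed recently and keys that currently hold more than one value; deciding what to do about each flag (re-ask, demote, or archive) is left out.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Entry:
    key: str
    value: str
    last_confirmed: date


def reflect(entries: list[Entry], today: date, max_age_days: int = 180) -> dict[str, list[str]]:
    """Periodic self-review: flag stale entries and live contradictions."""
    report: dict[str, list[str]] = {"stale": [], "contradictions": []}
    values_by_key: dict[str, set[str]] = defaultdict(set)
    for e in entries:
        values_by_key[e.key].add(e.value)
        if (today - e.last_confirmed).days > max_age_days:
            report["stale"].append(e.key)
    # A key holding several different values at once is a contradiction.
    report["contradictions"] = sorted(k for k, vals in values_by_key.items() if len(vals) > 1)
    return report
```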

The core thesis

The position we are building from is simple: memory is not a support function for intelligence. It is the core of it.

Perception is memory recognising. Reasoning is memory recombining. Learning is memory reorganising. These are not analogies. They are descriptions of how cognition works, and they should be descriptions of how we build AI systems that need to remember, reason, and learn.

Current approaches treat memory as a utility, a store-and-retrieve layer bolted onto a language model. We think this underestimates the problem and the opportunity. Building memory systems that take cognitive principles seriously is harder, but it produces systems that work in ways that search-based approaches cannot.

M-1 is the first output of this research program, developed in collaboration with Exabase, where it is deployed in production. It is a retrieval system, and retrieval is just the beginning.
