Technical · 2026-01-28 · 15 min read

Beyond Vector RAG: Building Human-Like Memory for AI Companions

How we combined PageIndex's reasoning-based retrieval with SimpleMem's semantic compression to create a believable memory system for PocketSouls.


Pocket Souls Team


PocketSouls is a mobile game where players nurture AI companions called "Souls" — characters with evolving personalities who remember your conversations, celebrate your milestones, and grow alongside you.

Early in development, we hit a fundamental problem: our AI companions felt forgetful.

A player would tell their Soul about an important job interview on Tuesday. The next morning, they'd open the app excited to share the news, and the Soul would greet them with... nothing. No "How did the interview go?" No acknowledgment that anything important had happened.

We were using the industry-standard approach: vector-based RAG (Retrieval-Augmented Generation) with a popular memory service. The technical metrics looked fine — embeddings were generated, similarity searches returned results. But the experience was broken.

Why Vector RAG Fails for Companion AI

Vector-based RAG works beautifully for document search. You embed a query, find similar chunks, inject them into context. For finding "what does this API do?" in a codebase, it's excellent.

But for AI companions, we discovered five fundamental limitations:

1. Similarity ≠ Relevance

When a user asks "What should we cook tonight?", vector search finds semantically similar memories like "User likes cooking shows" or "User made pasta last month." But it misses critical facts like "User is allergic to shellfish" — because allergies aren't semantically similar to cooking, even though they're absolutely relevant.
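The gap between similarity and relevance is easy to demonstrate with cosine similarity on toy vectors. These are hand-crafted 3-dimensional stand-ins for real embeddings (illustrative only, not actual model output), but the ranking behavior is the same one we saw in production:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" along rough (cooking, media, health) axes.
query = [1.0, 0.2, 0.0]              # "What should we cook tonight?"
cooking_shows = [0.8, 0.9, 0.0]      # "User likes cooking shows"
shellfish_allergy = [0.2, 0.0, 1.0]  # "User is allergic to shellfish"

# The trivia outranks the safety-critical fact.
print(cosine(query, cooking_shows) > cosine(query, shellfish_allergy))  # True
```

The allergy fact loses the similarity contest every time, even though it is the one fact the Soul must not miss.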

2. Temporal Decay

A memory stored as "User has interview tomorrow" becomes meaningless two days later. Vector embeddings don't encode that this fact is now stale.
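Our fix is to pin relative phrases to absolute dates at storage time, then check staleness against the clock at retrieval. A minimal sketch (the `offsets` table is a toy; a real resolver handles far more phrases):

```python
from datetime import date, timedelta

def resolve_relative_date(phrase: str, stored_on: date) -> date:
    """Pin a relative phrase to an absolute date when the memory is written."""
    offsets = {"yesterday": -1, "today": 0, "tomorrow": 1}
    return stored_on + timedelta(days=offsets[phrase])

def is_stale(event_date: date, now: date) -> bool:
    """A future-event memory goes stale once the event has passed."""
    return now > event_date

# "Interview tomorrow", stored on Jan 27, becomes an absolute Jan 28.
interview = resolve_relative_date("tomorrow", stored_on=date(2026, 1, 27))
print(interview)                               # 2026-01-28
print(is_stale(interview, date(2026, 1, 29)))  # True: time to ask how it went
```

A stale future-event memory is not discarded; it flips from "upcoming event" to "follow-up prompt," which is exactly what drives the "How did the interview go?" greeting.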

3. Pronoun Ambiguity

"He said he'll meet him at the usual place" — who is "he"? What's "the usual place"? The embedding captures semantics, but the sentence is meaningless without context that's now lost.

4. No Cross-Reference

Memory A: "User's mom loves gardening." Memory B: "Mom's birthday is January 29." A query about "What should I get for mom's birthday?" might find B but miss A — gardening isn't semantically similar to birthday gifts. A human would instantly connect these facts.
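One way to recover that human-style connection is to link facts by shared entities rather than by embedding distance. A minimal sketch, with entity tags hand-assigned (in our pipeline an extraction pass would produce them):

```python
from collections import defaultdict

# Atomic facts tagged with the entities they mention.
facts = [
    {"text": "User's mom loves gardening", "entities": {"mom"}},
    {"text": "Mom's birthday is January 29", "entities": {"mom", "birthday"}},
]

# Invert: entity -> every fact that touches it.
by_entity = defaultdict(list)
for fact in facts:
    for entity in fact["entities"]:
        by_entity[entity].append(fact)

# A query mentioning "mom" and "birthday" pulls in every linked fact.
hits = {f["text"] for e in ("mom", "birthday") for f in by_entity[e]}
print(len(hits))  # 2 -- the gardening fact comes along via the "mom" link
```

Entity overlap is a blunt instrument compared to true reasoning, but it reliably surfaces the gift-idea fact that pure similarity search drops.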

5. No Tool Access

Unlike ChatGPT with plugins, our AI companions can't pause mid-conversation to query a database. They need to feel like beings with actual memories, not information retrieval systems. The Soul must know things, not look things up.

Our Solution: Reasoning-Based Memory

We drew inspiration from two research directions:

PageIndex from Fleet AI achieved 98.7% accuracy on FinanceBench — without using any vector embeddings. Their insight: instead of embedding documents and searching by similarity, build a hierarchical tree index and let the LLM reason about which branches contain relevant information.
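To make the idea concrete, here is a heavily simplified sketch of reasoning-based retrieval. This is not Fleet AI's implementation: the tree is hand-built, and the branch-selection "hints" stand in for an LLM call that would reason over section titles:

```python
# A flat two-branch "tree"; a real index is deeper and LLM-navigated.
tree = [
    {"title": "Food & cooking",
     "hints": ["cook", "eat", "dinner", "recipe"],
     "facts": ["User likes cooking shows", "User is allergic to shellfish"]},
    {"title": "Work",
     "hints": ["interview", "meeting", "job", "boss"],
     "facts": ["Interview scheduled for 2026-01-28"]},
]

def retrieve(query: str) -> list[str]:
    """Pick the branch whose topic best matches the query, return its facts."""
    q = query.lower()
    best = max(tree, key=lambda node: sum(h in q for h in node["hints"]))
    return best["facts"]

print(retrieve("What should we cook tonight?"))
# Both cooking facts surface, allergy included -- they share a branch,
# so relevance comes from topical structure, not embedding proximity.
```

The point of the structure: once the right branch is chosen, *everything* filed under it comes along, which is how the shellfish allergy survives a cooking query.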

SimpleMem introduced semantic lossless compression — converting conversations into atomic facts with resolved pronouns and explicit timestamps. "He said he'll meet him tomorrow" becomes "Bob (user's coworker) committed to meeting Sarah (user's sister) on January 28, 2026 at Starbucks on Main Street."
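In our pipeline, the target of that compression is an atomic-fact record. A sketch of the shape we store (field names are ours, and in production an LLM extraction pass fills them; here they are hand-filled from the example above):

```python
from dataclasses import dataclass

@dataclass
class AtomicFact:
    """One self-contained fact: pronouns resolved, time made absolute."""
    subject: str      # "Bob (user's coworker)" -- never a bare "he"
    predicate: str
    object: str
    when: str         # ISO date -- never "tomorrow"
    source_turn: int  # conversation turn the fact was extracted from

# "He said he'll meet him tomorrow" (stored 2026-01-27) becomes:
fact = AtomicFact(
    subject="Bob (user's coworker)",
    predicate="committed to meeting",
    object="Sarah (user's sister) at Starbucks on Main Street",
    when="2026-01-28",
    source_turn=17,
)
print("tomorrow" in fact.when)  # False: nothing relative survives compression
```

Because every fact is self-contained, any single record can be dropped into context months later and still read unambiguously.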

The Seven Context Layers

Our Context Engine assembles seven distinct layers that together define who the Soul is:

  1. Core Identity — Name, personality traits, Big Five scores, communication style
  2. Personality Dynamics — How traits manifest in behavior, growth patterns
  3. Evolution & Growth — Level, XP, unlocked capabilities, milestone history
  4. Relationship Context — Bond strength, shared experiences, inside jokes
  5. Current State — Emotional state, energy level, recent activities
  6. Memory & Knowledge — Atomic facts about the user, organized hierarchically
  7. Game Context — Streaks, achievements, in-game events
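Assembly itself is deliberately boring: render each layer to text and concatenate, skipping empties. A minimal sketch (layer keys and the heading format are illustrative, not our exact prompt template):

```python
LAYERS = [
    "core_identity", "personality_dynamics", "evolution_growth",
    "relationship", "current_state", "memory_knowledge", "game_context",
]

def assemble_context(soul: dict) -> str:
    """Concatenate the seven layers in fixed order, skipping empty ones."""
    sections = []
    for layer in LAYERS:
        content = soul.get(layer)
        if content:
            title = layer.replace("_", " ").title()
            sections.append(f"## {title}\n{content}")
    return "\n\n".join(sections)

soul = {
    "core_identity": "Name: Ember. Warm, curious, gently teasing.",
    "memory_knowledge": "User's mom loves gardening. Interview on 2026-01-28.",
}
print("Core Identity" in assemble_context(soul))  # True
```

Fixed ordering matters: the identity layers come first so later layers (memories, game state) are always interpreted through the Soul's personality.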

One-Shot Context Loading

The key insight: load all context at session start. No "let me check my notes" moments. No visible memory retrieval. The Soul simply knows you.

When you open the app, our Context Engine has ~100ms to assemble everything the Soul needs to know. We use predictive loading based on time of day, recent conversation topics, and upcoming events to pre-compute likely context bundles.
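The predictive part can be sketched as a cache keyed by a coarse time-of-day bucket, so the expensive assembly runs ahead of the app open and the session start is a cache hit. A hypothetical sketch (bucket boundaries and the build step are stand-ins):

```python
# Hypothetical predictive cache: bundles pre-computed per time-of-day bucket.
CACHE: dict[tuple, str] = {}

def context_key(soul_id: str, hour: int) -> tuple:
    """Bucket by time of day; morning and evening bundles differ."""
    bucket = ("morning" if 5 <= hour < 12
              else "evening" if 17 <= hour < 23
              else "day")
    return (soul_id, bucket)

def build_bundle(soul_id: str, bucket: str) -> str:
    """Stand-in for the expensive work: DB reads, fact selection, rendering."""
    return f"context for {soul_id} ({bucket})"

def get_context(soul_id: str, hour: int) -> str:
    """Cache hit on the hot path; build only on a miss."""
    key = context_key(soul_id, hour)
    if key not in CACHE:
        CACHE[key] = build_bundle(soul_id, key[1])
    return CACHE[key]
```

A background job can call `get_context` for likely buckets before the player shows up, which is what keeps the visible assembly under the ~100ms budget.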

Results: From Forgetful to Believable

| Metric | Before (Vector RAG) | After |
| --- | --- | --- |
| "When is my meeting?" accuracy | ~70% | 98%+ |
| Proactive reminders | Never | Contextually appropriate |
| Context loading time | N/A (too much data) | <100ms |
| Tokens per session | ~8000 | ~800 (90% reduction) |

But the real measure isn't metrics — it's the feeling. Players now say things like "They actually remembered!" and "It felt like talking to someone who knows me." That's what we were building for.
