Knowledge base documents
Migrate your document corpus — PDFs, DOCX, Markdown, plain text — into Sonzai's knowledge graph. The extractor builds a deduplicated graph of entities and relationships that agents search during conversations.
What you're migrating
If you already have a RAG pipeline backed by a vector database (Pinecone, Weaviate, Qdrant, pgvector, Chroma), your "documents" are probably PDFs, MDs, or DOCX files that you chunked and embedded. Sonzai takes those same source files and builds a knowledge graph from them — entities, relationships, types — rather than an opaque vector store. Agents search the graph during conversations.
Two paths in:
- Upload files — the happy path. Sonzai parses, chunks, extracts entities and relationships, and builds the graph. Use for PDF / DOCX / MD / TXT.
- Insert facts directly — if you already have structured entities (e.g. a product catalog as JSON), skip the LLM extraction and push them via the knowledge facts API. Not covered in depth on this page; see Knowledge Base.
1. File upload: the basics
Endpoint: POST /api/v1/projects/{projectId}/knowledge/documents (multipart, one file per request).
- Accepted types: `.pdf`, `.docx`, `.md`, `.markdown`, `.txt`.
- Max size: 50 MB per file.
- Dedup: identical content (SHA-256 match) returns `409 Conflict` with the existing `document_id`. You can safely re-run without creating duplicates.
- Async: the response returns immediately with `status: "queued"`. Extraction runs in the background.
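Because dedup keys off the file's content hash, you can predict duplicates locally before spending an upload. A minimal sketch, assuming SHA-256 of the raw bytes matches the dedup rule described above (the exact hash Sonzai stores isn't exposed):

```python
import hashlib

def content_key(data: bytes) -> str:
    """SHA-256 hex digest of the raw file bytes, mirroring the
    content match the dedup check is described as using."""
    return hashlib.sha256(data).hexdigest()

def unique_payloads(files: dict[str, bytes]) -> dict[str, bytes]:
    """Drop byte-identical files before uploading: renamed copies of the
    same PDF would otherwise each round-trip just to hit 409 Conflict."""
    seen: set[str] = set()
    unique: dict[str, bytes] = {}
    for name, data in files.items():
        key = content_key(data)
        if key not in seen:
            seen.add(key)
            unique[name] = data
    return unique

files = {
    "a.pdf": b"%PDF-1.7 ...",
    "copy-of-a.pdf": b"%PDF-1.7 ...",  # same bytes, different name
    "b.md": b"# Notes",
}
print(sorted(unique_payloads(files)))  # ['a.pdf', 'b.md']
```

This is purely a local optimization; even without it, the server-side dedup keeps re-runs safe.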
```ts
import fs from "fs";
import { Sonzai } from "@sonzai-labs/agents";

const sonzai = new Sonzai({ apiKey: process.env.SONZAI_API_KEY! });
const PROJECT_ID = "proj_xyz";

async function uploadOne(path: string) {
  const fileName = path.split("/").pop()!;
  const fileData = fs.readFileSync(path);
  const doc = await sonzai.knowledge.uploadDocument(
    PROJECT_ID,
    fileName,
    fileData,
    "application/pdf", // or text/markdown, text/plain, etc.
  );
  console.log(doc.document_id, doc.status);
  return doc;
}
```

2. Batch-uploading a corpus
A typical migration is "walk this directory, upload everything, retry transient errors, skip duplicates". Sonzai's dedup via SHA-256 makes this safe to run repeatedly.
```python
import os, mimetypes, time
from pathlib import Path

from sonzai import Sonzai
from sonzai.errors import ConflictError  # raised on 409 duplicate

sonzai = Sonzai(api_key=os.environ["SONZAI_API_KEY"])
PROJECT_ID = "proj_xyz"

ACCEPTED = {".pdf", ".docx", ".md", ".markdown", ".txt"}
MAX_BYTES = 50 * 1024 * 1024

def migrate_corpus(root: str):
    uploaded, skipped, failed = 0, 0, []
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix.lower() not in ACCEPTED:
            continue
        size = path.stat().st_size
        if size > MAX_BYTES:
            print(f"skip (too large): {path} ({size/1e6:.1f} MB)")
            failed.append(str(path))
            continue
        try:
            sonzai.knowledge.upload_document(
                PROJECT_ID,
                file_name=path.name,
                file_data=path.read_bytes(),
                content_type=(mimetypes.guess_type(str(path))[0]
                              or "application/octet-stream"),
            )
            uploaded += 1
        except ConflictError:
            skipped += 1  # already uploaded on a previous run — safe to ignore
        except Exception as e:
            print(f"failed: {path}: {e}")
            failed.append(str(path))
            time.sleep(1)  # gentle backoff on unknown errors
    print(f"uploaded={uploaded} skipped(dup)={skipped} failed={len(failed)}")
    return uploaded, skipped, failed
```

The first run uploads everything; subsequent runs become cheap no-ops because identical content yields `409 Conflict`, which you treat as "already done".
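The bare `except Exception` above lumps transient errors (network hiccups, 5xx) together with permanent failures. One hedged refinement is a small retry wrapper with exponential backoff; the attempt counts and delays here are assumptions, not SDK behavior, and a `ConflictError` should still be allowed through immediately since a duplicate isn't transient:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn(), retrying on any exception with exponential backoff.

    Re-raises the last exception once `attempts` calls have all failed.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise AssertionError("unreachable")

# Demo with a locally flaky function standing in for upload_document:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient")
    return "doc_123"

print(with_retries(flaky, sleep=lambda _: None))  # doc_123, after two retries
```

In the batch loop you would wrap only the `upload_document(...)` call, keeping the `ConflictError` handler outside the wrapper so duplicates short-circuit without retries.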
3. Verify
List documents and watch their status transition from queued to completed:
```shell
curl -s "https://api.sonz.ai/api/v1/projects/proj_xyz/knowledge/documents?limit=50" \
  -H "Authorization: Bearer $SONZAI_API_KEY" | jq '.documents[] | {document_id, file_name, status}'
```

Then confirm the extractor produced nodes in the graph:
```shell
curl -s "https://api.sonz.ai/api/v1/projects/proj_xyz/knowledge/nodes?limit=20&sort_by=created_at&sort_order=desc" \
  -H "Authorization: Bearer $SONZAI_API_KEY" | jq '.nodes[] | {node_id, node_type, label}'
```

When everything you uploaded shows `status: "completed"` and the node list has the entities you expected, the migration is done.
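Rather than eyeballing the document list, you can poll until nothing is still pending. A sketch against a generic `list_statuses()` callable, standing in for whatever client call returns the `status` fields (the `"processing"` status name is an assumption; the docs above only show `queued` and `completed`):

```python
import time
from typing import Callable

def wait_for_completion(list_statuses: Callable[[], list[str]],
                        timeout: float = 600.0, interval: float = 5.0,
                        sleep=time.sleep, clock=time.monotonic) -> list[str]:
    """Poll until no document is still pending; return the final statuses."""
    deadline = clock() + timeout
    while True:
        statuses = list_statuses()
        if not any(s in ("queued", "processing") for s in statuses):
            return statuses  # everything completed (or terminally failed)
        if clock() >= deadline:
            raise TimeoutError(f"still pending after {timeout}s: {statuses}")
        sleep(interval)

# Demo with a simulated extraction queue that finishes on the third poll:
polls = iter([
    ["queued", "queued"],
    ["completed", "processing"],
    ["completed", "completed"],
])
print(wait_for_completion(lambda: next(polls), sleep=lambda _: None))
# ['completed', 'completed']
```

The injectable `sleep`/`clock` parameters are just there to make the loop easy to test; in real use the defaults are fine.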
Migrating from a vector database
If your source is an existing RAG stack, the cleanest path is to re-upload the original source files, not the chunked text. The chunks you indexed were optimized for cosine similarity, not for graph extraction.
- Pinecone / Weaviate / Qdrant. Most teams store the source file paths or blob URLs in metadata. Walk that list, fetch each file, upload to Sonzai.
- Chroma / pgvector / FAISS. Same approach — you indexed documents, find the originals, upload.
- No originals (only chunk text stored). Concatenate the chunks back into a single `.md` or `.txt` file per document, upload that. Less accurate entity extraction than the original PDF, but functional.
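For the no-originals case, the reassembly is just "group chunks by source document, sort by position, join". A sketch assuming your vector store metadata carries a `doc_id` and `chunk_index` per chunk (field names vary by stack):

```python
from collections import defaultdict

def rebuild_documents(chunks: list[dict]) -> dict[str, str]:
    """chunks: [{"doc_id": ..., "chunk_index": ..., "text": ...}, ...]
    Returns one plain-text/Markdown body per doc_id, chunks in order."""
    by_doc: dict[str, list[tuple[int, str]]] = defaultdict(list)
    for c in chunks:
        by_doc[c["doc_id"]].append((c["chunk_index"], c["text"]))
    return {
        doc_id: "\n\n".join(text for _, text in sorted(parts))
        for doc_id, parts in by_doc.items()
    }

chunks = [
    {"doc_id": "handbook", "chunk_index": 1, "text": "Second chunk."},
    {"doc_id": "handbook", "chunk_index": 0, "text": "First chunk."},
]
print(rebuild_documents(chunks)["handbook"])  # "First chunk.\n\nSecond chunk."
```

Write each value out as `<doc_id>.md` and feed the files through the same batch-upload loop as any other corpus.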
Don't try to port the vectors themselves — Sonzai's graph isn't vector-first and the embeddings wouldn't mean anything here.
Migrating from Notion / Confluence / Google Docs
Use each platform's export to Markdown, then upload the resulting .md files. All three export cleanly to Markdown with per-page files plus an images/ folder. Sonzai ignores the images and extracts from the Markdown.
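After exporting, you'll have a tree of per-page `.md` files plus asset folders. A small sketch for collecting just the Markdown before handing it to the batch uploader (the `images/` folder naming matches the export layout described above, but layouts vary by platform and export version):

```python
import tempfile
from pathlib import Path

def collect_markdown(export_root: str) -> list[Path]:
    """Gather exported .md files, skipping anything under an images/ folder."""
    return sorted(
        p for p in Path(export_root).rglob("*.md")
        if "images" not in {part.lower() for part in p.parts}
    )

# Demo against a throwaway directory shaped like a workspace export:
root = Path(tempfile.mkdtemp())
(root / "Page One.md").write_text("# Page One")
(root / "images").mkdir()
(root / "images" / "stray.md").write_text("asset leftovers")
print([p.name for p in collect_markdown(str(root))])  # ['Page One.md']
```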
```shell
# Notion: Settings & Members → Export all workspace content → Markdown & CSV
# Confluence: Space Settings → Content Tools → Export → HTML, then pandoc to md
# Google Docs: File → Download → Markdown (.md), per document
```

Tips
- `project_id` vs `agent_id`. The knowledge base is project-scoped, not agent-scoped. All agents in a project share the same KB. Find your project ID in the Sonzai workbench or via `GET /api/v1/projects`.
- Rate limiting. Upload in sequence rather than in parallel for the first few hundred files. The extraction queue is backpressure-aware; hammering it doesn't speed anything up and increases the odds of transient errors.
- Large PDFs (100+ pages) work but are slower to process. Consider splitting into chapter-level PDFs if search precision matters more than document identity.
- Subsequent updates. If a document changes, upload the new version — you'll get a new `document_id`. Delete the old one via `DELETE /api/v1/projects/{id}/knowledge/documents/{documentId}` if you don't want both versions contributing to the graph.
- Pair with structured CSV. See CRM / CSV for how `structured_import` can resolve user inventory rows against the KB you just built.
What's next
- Knowledge Base — how agents query the graph during conversations.
- Knowledge graph (org-scope) — sharing a KB across projects.
- CRM / CSV — using the KB as a resolution target for structured user facts.