Knowledge base documents

Migrate your document corpus — PDFs, DOCX, Markdown, plain text — into Sonzai's knowledge graph. The extractor builds a deduplicated graph of entities and relationships that agents search during conversations.

What you're migrating

If you already have a RAG pipeline backed by a vector database (Pinecone, Weaviate, Qdrant, pgvector, Chroma), your "documents" are probably PDFs, MDs, or DOCX files that you chunked and embedded. Sonzai takes those same source files and builds a knowledge graph from them — entities, relationships, types — rather than an opaque vector store. Agents search the graph during conversations.

Two paths in:

  1. Upload files — the happy path. Sonzai parses, chunks, extracts entities and relationships, and builds the graph. Use for PDF / DOCX / MD / TXT.
  2. Insert facts directly — if you already have structured entities (e.g. a product catalog as JSON), skip the LLM extraction and push them via the knowledge facts API. Not covered in depth on this page; see Knowledge Base.

1. File upload: the basics

Endpoint: POST /api/v1/projects/{projectId}/knowledge/documents (multipart, one file per request).

  • Accepted types: .pdf, .docx, .md, .markdown, .txt.
  • Max size: 50 MB per file.
  • Dedup: identical content (SHA-256 match) returns 409 Conflict with the existing document_id. You can safely re-run without creating duplicates.
  • Async: the response returns immediately with status: "queued". Extraction runs in the background.
import fs from "fs";
import { Sonzai } from "@sonzai-labs/agents";

const sonzai = new Sonzai({ apiKey: process.env.SONZAI_API_KEY! });
const PROJECT_ID = "proj_xyz";

async function uploadOne(path: string) {
  const fileName = path.split("/").pop()!;
  const fileData = fs.readFileSync(path);
  const doc = await sonzai.knowledge.uploadDocument(
    PROJECT_ID,
    fileName,
    fileData,
    "application/pdf",   // or text/markdown, text/plain, etc.
  );
  console.log(doc.document_id, doc.status);
  return doc;
}

2. Batch-uploading a corpus

A typical migration is "walk this directory, upload everything, retry transient errors, skip duplicates". Sonzai's dedup via SHA-256 makes this safe to run repeatedly.

import os, mimetypes, time
from pathlib import Path
from sonzai import Sonzai
from sonzai.errors import ConflictError  # raised on 409 duplicate

sonzai = Sonzai(api_key=os.environ["SONZAI_API_KEY"])
PROJECT_ID = "proj_xyz"
ACCEPTED = {".pdf", ".docx", ".md", ".markdown", ".txt"}
MAX_BYTES = 50 * 1024 * 1024

def migrate_corpus(root: str):
    uploaded, skipped, failed = 0, 0, []
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix.lower() not in ACCEPTED:
            continue
        size = path.stat().st_size
        if size > MAX_BYTES:
            print(f"skip (too large): {path} ({size/1e6:.1f} MB)")
            failed.append(str(path))
            continue
        try:
            sonzai.knowledge.upload_document(
                PROJECT_ID,
                file_name=path.name,
                file_data=path.read_bytes(),
                content_type=(mimetypes.guess_type(str(path))[0]
                              or "application/octet-stream"),
            )
            uploaded += 1
        except ConflictError:
            skipped += 1  # already uploaded on a previous run — safe to ignore
        except Exception as e:
            print(f"failed: {path}: {e}")
            failed.append(str(path))
            time.sleep(1)  # gentle backoff on unknown errors

    print(f"uploaded={uploaded} skipped(dup)={skipped} failed={len(failed)}")
    return uploaded, skipped, failed
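
A minimal driver for the function above; the ./corpus default is just an example, so point it at wherever your exported files live:

if __name__ == "__main__":
    import sys

    # Corpus root comes from the first CLI argument, falling back to ./corpus
    migrate_corpus(sys.argv[1] if len(sys.argv) > 1 else "./corpus")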

The first run uploads everything; subsequent runs are cheap no-ops, because identical content returns a 409 Conflict that you treat as "already done".

3. Verify

List documents and watch their status transition from queued to completed:

curl -s "https://api.sonz.ai/api/v1/projects/proj_xyz/knowledge/documents?limit=50" \
  -H "Authorization: Bearer $SONZAI_API_KEY" | jq '.documents[] | {document_id,file_name,status}'

Then confirm the extractor produced nodes in the graph:

curl -s "https://api.sonz.ai/api/v1/projects/proj_xyz/knowledge/nodes?limit=20&sort_by=created_at&sort_order=desc" \
  -H "Authorization: Bearer $SONZAI_API_KEY" | jq '.nodes[] | {node_id, node_type, label}'

When everything you uploaded shows status: "completed" and the node list has the entities you expected, the migration is done.
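
For a large batch it is easier to poll programmatically than to re-run curl by hand. The sketch below hits the same list endpoint; it assumes the whole batch fits in one page of results (limit=50), so add pagination if you uploaded more files, and handle any status other than "queued" and "completed" according to what your responses actually show.

import os, time
import requests

BASE = "https://api.sonz.ai/api/v1"
PROJECT_ID = "proj_xyz"
HEADERS = {"Authorization": f"Bearer {os.environ['SONZAI_API_KEY']}"}

def wait_for_extraction(poll_seconds: int = 15):
    while True:
        resp = requests.get(
            f"{BASE}/projects/{PROJECT_ID}/knowledge/documents",
            headers=HEADERS,
            params={"limit": 50},
        )
        resp.raise_for_status()
        docs = resp.json()["documents"]
        pending = [d for d in docs if d["status"] != "completed"]
        print(f"{len(docs) - len(pending)}/{len(docs)} completed")
        if not pending:
            return docs
        time.sleep(poll_seconds)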

Migrating from a vector database

If your source is an existing RAG stack, the cleanest path is to re-upload the original source files, not the chunked text. The chunks you indexed were optimized for cosine similarity, not for graph extraction.

  • Pinecone / Weaviate / Qdrant. Most teams store the source file paths or blob URLs in metadata. Walk that list, fetch each file, upload to Sonzai.
  • Chroma / pgvector / FAISS. Same approach: track down the original documents you indexed and upload them.
  • No originals (only chunk text stored). Concatenate the chunks back into a single .md or .txt file per document and upload that (see the sketch below). Less accurate entity extraction than the original PDF, but functional.

Don't try to port the vectors themselves — Sonzai's graph isn't vector-first and the embeddings wouldn't mean anything here.
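
If you are in the "no originals" case, the reassembly step is only a few lines. The sketch below assumes you can export a mapping of document name to its ordered chunk texts from your store; how you produce that mapping depends entirely on which database you used.

from pathlib import Path

def rebuild_documents(chunks_by_doc: dict[str, list[str]], out_dir: str = "rebuilt") -> Path:
    """Write one .md file per original document by joining its chunks in order."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for doc_name, chunks in chunks_by_doc.items():
        # Blank lines between chunks keep paragraph boundaries visible to the extractor.
        (out / f"{Path(doc_name).stem}.md").write_text("\n\n".join(chunks))
    return out

# Then point the batch uploader from section 2 at the rebuilt directory:
# migrate_corpus("rebuilt")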

Migrating from Notion / Confluence / Google Docs

Export from each platform and upload the resulting .md files. Notion and Google Docs export Markdown directly; Confluence exports HTML, which pandoc converts cleanly (see the sketch after the steps below). Exports arrive as one file per page plus an images/ folder; Sonzai ignores the images and extracts from the Markdown.

# Notion: Settings & Members → Export all workspace content → Markdown & CSV
# Confluence: Space Settings → Content Tools → Export → HTML, then pandoc to md
# Google Docs: File → Download → Markdown (.md), per document
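
For the Confluence step, here is a sketch of the pandoc pass; it assumes pandoc is installed locally and that the export unpacks to a directory of per-page HTML files:

import subprocess
from pathlib import Path

def confluence_html_to_md(export_dir: str, out_dir: str = "confluence-md") -> Path:
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for html in Path(export_dir).rglob("*.html"):
        # Convert each exported page to GitHub-flavored Markdown, one .md per page
        subprocess.run(
            ["pandoc", str(html), "-f", "html", "-t", "gfm",
             "-o", str(out / f"{html.stem}.md")],
            check=True,
        )
    return out

Then upload the resulting directory with the batch uploader from section 2.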

Tips

  • project_id vs agent_id. The knowledge base is project-scoped, not agent-scoped. All agents in a project share the same KB. Find your project ID in the Sonzai workbench or via GET /api/v1/projects.
  • Rate limiting. Upload in sequence rather than in parallel for the first few hundred files. The extraction queue is backpressure-aware; hammering it doesn't speed anything up and increases the odds of transient errors.
  • Large PDFs (100+ pages) work but are slower to process. Consider splitting into chapter-level PDFs if search precision matters more than document identity.
  • Subsequent updates. If a document changes, upload the new version — you'll get a new document_id. Delete the old one via DELETE /api/v1/projects/{id}/knowledge/documents/{documentId} if you don't want both versions contributing to the graph; a sketch of this flow follows the list.
  • Pair with structured CSV. See CRM / CSV for how structured_import can resolve user inventory rows against the KB you just built.
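
A sketch of that update flow. The upload call mirrors the batch example in section 2; the delete is a plain HTTP call against the documented endpoint, since this page does not show an SDK delete helper, and replace_document is just an illustrative name:

import os
from pathlib import Path

import requests
from sonzai import Sonzai

sonzai = Sonzai(api_key=os.environ["SONZAI_API_KEY"])
BASE = "https://api.sonz.ai/api/v1"
PROJECT_ID = "proj_xyz"

def replace_document(old_document_id: str, new_path: str):
    new_file = Path(new_path)
    # Upload the new version first, so there is never a moment with neither version in the graph.
    new_doc = sonzai.knowledge.upload_document(
        PROJECT_ID,
        file_name=new_file.name,
        file_data=new_file.read_bytes(),
        content_type="application/pdf",  # adjust to the real file type
    )
    # Then remove the stale version so it stops contributing to the graph.
    requests.delete(
        f"{BASE}/projects/{PROJECT_ID}/knowledge/documents/{old_document_id}",
        headers={"Authorization": f"Bearer {os.environ['SONZAI_API_KEY']}"},
    ).raise_for_status()
    return new_doc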
