Sintropia · Technical Reference

The RAG Architecture

The full system design behind every Knowledge AI product we build — how the pieces fit together, why each decision was made, and what it costs to run in production.

— System diagram

Every layer, explained.

This is the complete request flow — from the moment a user types a question to the moment a grounded answer arrives. Each layer has a specific job. Removing any one of them degrades the quality of the output in a measurable way.

Ingestion (runs once, on file upload)
Source: PDF / XLSX (your documents)
Parser: PyMuPDF / openpyxl (extract text + tables)
Chunker: Semantic Chunks (rows → sentences for XLSX)
Metadata Tag: Source Tagger (source_type · filename · row)
Embedding: Gemini Embed (3072-dim vectors)
Index (stored once): sqlite-vec + FTS5 (vector + full-text index)
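The "rows → sentences" chunking and source tagging steps can be sketched roughly as follows. This is a minimal illustration, not the production ingestion code: the `Chunk` dataclass, the `rows_to_chunks` helper, the sentence template, and the sample data are all hypothetical, and rows are passed in as plain lists rather than read through openpyxl.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str          # sentence-like rendering of one spreadsheet row
    source_type: str   # tag used later for citations
    filename: str
    row: int           # 1-based row number in the sheet


def rows_to_chunks(header, rows, filename):
    """Turn each data row into one sentence-like chunk with source tags."""
    chunks = []
    # Assumption: row 1 holds the header, so data rows start at row 2.
    for row_number, values in enumerate(rows, start=2):
        parts = [
            f"{col} is {val}"
            for col, val in zip(header, values)
            if val not in (None, "")
        ]
        if not parts:
            continue  # skip fully empty rows
        chunks.append(
            Chunk(
                text="; ".join(parts) + ".",
                source_type="spreadsheet",
                filename=filename,
                row=row_number,
            )
        )
    return chunks


header = ["Invoice", "Client", "Total"]
rows = [["INV-001", "Acme", 1200], [None, None, None], ["INV-002", "Globex", 830]]
chunks = rows_to_chunks(header, rows, "invoices.xlsx")
```

Rendering rows as sentences gives the embedding model natural-language text to work with, while the `row` tag keeps each answer traceable to a specific line of the spreadsheet.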
Query runtime
Input: User Prompt (natural language, no source specified)
↓ HTTP POST
REST API: FastAPI Endpoint (Python · uvicorn · CORS · rate limiting)
Cache Layer: Exact-match Cache (in-memory · TTL eviction)
cache hit → return cached response
cache miss ↓
Orchestrator: Classifier + Fallback Router (keyword / regex · detects query type · assigns confidence)
Structured path (HIGH confidence: numbers · dates · totals)
Generation: Text-to-SQL (Gemini Flash writes query)
Database: SQLite (exact · deterministic)
Result: Structured Answer (precise · source: spreadsheet)
Ambiguous path (LOW confidence: no source named)
Both paths fire: Parallel Execution (structured + unstructured simultaneously)
Synthesis (both results): Gemini Flash decides (discards irrelevant sources · cites source_type)
Unstructured path (HIGH confidence: concepts · clauses · text)
Runs in parallel:
Keyword: BM25 (FTS5 · exact terms)
Semantic: KNN (sqlite-vec · cosine)
Fusion: RRF Merge (Reciprocal Rank Fusion · top-20)
Reranker: Rerank Pass (Flash scores → top-5 chunks)
Generation: Gemini 2.5 Flash LLM (synthesizes context + prompt → final answer · cites source)
↓ Streamed response
Output: Final User Response (streamed SSE · answer + source citation)
Est. cost at 1k queries/day
Gemini Flash: $3–8 / mo
Embed API: $1–3 / mo
VPS (Railway): $6
Total / month: ~$15
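The exact-match cache layer in the diagram above (in-memory, TTL eviction) could look something like the sketch below. The class name, the key normalization, and the injectable clock are assumptions made for illustration; the source only states that the cache is exact-match, in-memory, and TTL-evicted.

```python
import time


class ExactMatchCache:
    """In-memory cache keyed on the normalized query text, with lazy TTL eviction."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested without sleeping
        self._store = {}            # key -> (expires_at, response)

    @staticmethod
    def _key(query):
        # Assumption: case and whitespace differences should share one entry.
        return " ".join(query.lower().split())

    def get(self, query):
        key = self._key(query)
        entry = self._store.get(key)
        if entry is None:
            return None             # cache miss
        expires_at, response = entry
        if self.clock() >= expires_at:
            del self._store[key]    # entry outlived its TTL: evict and report a miss
            return None
        return response

    def put(self, query, response):
        self._store[self._key(query)] = (self.clock() + self.ttl, response)


# Usage with a fake clock so the expiry is deterministic.
now = [0.0]
cache = ExactMatchCache(ttl_seconds=60, clock=lambda: now[0])
cache.put("What is the refund policy?", "30 days")
hit = cache.get("  what is the REFUND policy? ")
now[0] = 61.0
expired = cache.get("What is the refund policy?")
```

A hit skips the classifier, retrieval, and generation steps entirely, which is why the cache sits directly behind the endpoint.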

— Design decisions

Why each layer exists.

Every component was chosen deliberately — the result of optimizing for low infrastructure cost, high retrieval precision, and production reliability at the scale a small or medium business actually needs.

01

SQLite instead of a vector database

Managed vector databases such as Pinecone, Weaviate, and Qdrant typically cost $70–$500/month and add an external dependency. SQLite with sqlite-vec runs in-process, fits on a $6 VPS, and starts in milliseconds. For knowledge bases under a few million chunks, which covers the small-business use cases this architecture targets, retrieval latency is comparable in practice. Zero infra overhead.

sqlite-vec · In-process · $6/mo VPS · vs $70–500/mo cloud DB
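To make "in-process" concrete, here is a small sketch of the keyword half of the index using only Python's built-in `sqlite3` module and the FTS5 extension. The table layout, the `keyword_search` helper, and the sample rows are illustrative assumptions, not the production schema. The vector half (sqlite-vec) is left out so the example runs without extra packages, and it assumes the local SQLite build includes FTS5.

```python
import sqlite3

# The whole index lives inside the application process: no server, no network hop.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE VIRTUAL TABLE chunks USING fts5("
    "text, source_type UNINDEXED, filename UNINDEXED)"
)
db.executemany(
    "INSERT INTO chunks (text, source_type, filename) VALUES (?, ?, ?)",
    [
        ("Customers may request a refund within 30 days.", "pdf", "policy.pdf"),
        ("Late deliveries are credited at 5 percent per week.", "pdf", "terms.pdf"),
        ("Invoice is INV-001; Client is Acme; Total is 1200.", "spreadsheet", "invoices.xlsx"),
    ],
)


def keyword_search(conn, query, limit=5):
    """Return (rowid, filename, text) rows ranked by FTS5's built-in BM25."""
    # bm25() yields lower (more negative) values for better matches,
    # so ascending order puts the best match first.
    return conn.execute(
        "SELECT rowid, filename, text FROM chunks "
        "WHERE chunks MATCH ? ORDER BY bm25(chunks) LIMIT ?",
        (query, limit),
    ).fetchall()


hits = keyword_search(db, "refund")
```

In a deployment the connection would point at a file on the VPS volume instead of `:memory:`, and the same database file would also hold the vector table.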
02

Hybrid BM25 + KNN instead of vector-only

Pure semantic search fails for exact queries: product codes, invoice numbers, names, dates. Pure keyword search fails for conceptual queries: "what's our refund policy", "how do we handle late deliveries." Hybrid search covers both. Reciprocal Rank Fusion (RRF) merges the two result lists into a single ranking that is typically more accurate than either strategy alone.

BM25 · FTS5 · KNN cosine · RRF fusion · Best of both worlds
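The fusion step can be illustrated with a short sketch of Reciprocal Rank Fusion, where each document scores the sum of 1 / (k + rank) over the lists it appears in. The function name and the chunk IDs are hypothetical, and `k=60` is the constant commonly used for RRF rather than a value confirmed by the source; `top_n=20` mirrors the top-20 cut in the diagram.

```python
def rrf_merge(rankings, k=60, top_n=20):
    """Fuse several ranked lists of chunk IDs into one list using RRF."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more the higher it ranks, in every list it appears in.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


bm25_hits = ["c7", "c2", "c9"]   # keyword ranking, best first
knn_hits = ["c2", "c4", "c7"]    # semantic ranking, best first
fused = rrf_merge([bm25_hits, knn_hits])
# fused == ["c2", "c7", "c4", "c9"]
```

Because RRF uses only ranks, it needs no calibration between BM25 scores and cosine distances, which live on unrelated scales. Chunks found by both searches (`c2`, `c7`) rise above chunks found by only one.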
03

Three-branch router instead of a single path

Structured data (spreadsheets, tables) should go through SQL for exact answers. Unstructured data (PDFs, text) should go through RAG. Ambiguous queries run both paths and let the LLM decide which result to use. Most RAG systems ignore structured data entirely, so questions about totals, dates, and counts come back wrong or not at all.

Text-to-SQL · RAG pipeline · Parallel fallback · 100% answer coverage
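A rough sketch of the three-branch routing idea follows. The regex patterns, the margin threshold, and the stub functions `run_sql_path` and `run_rag_path` are all assumptions for illustration; the source only says the classifier is keyword/regex based and that low-confidence queries fire both paths in parallel.

```python
import asyncio
import re

# Hypothetical signal words; a real deployment would tune these per knowledge base.
STRUCTURED = re.compile(r"\b(total|sum|average|how many|how much|count|invoice|\d{4})\b", re.I)
UNSTRUCTURED = re.compile(r"\b(policy|clause|contract|explain|why|terms)\b", re.I)


def classify(query, margin=2):
    """Return 'structured', 'unstructured', or 'ambiguous' from keyword evidence."""
    s = len(STRUCTURED.findall(query))
    u = len(UNSTRUCTURED.findall(query))
    if s - u >= margin:
        return "structured"      # high confidence: numbers, dates, totals
    if u - s >= margin:
        return "unstructured"    # high confidence: concepts, clauses, text
    return "ambiguous"           # low confidence: run both paths


async def run_sql_path(query):
    return f"sql:{query}"        # stub standing in for Text-to-SQL + SQLite


async def run_rag_path(query):
    return f"rag:{query}"        # stub standing in for hybrid retrieval + generation


async def route(query):
    label = classify(query)
    if label == "structured":
        return label, [await run_sql_path(query)]
    if label == "unstructured":
        return label, [await run_rag_path(query)]
    # Fire both paths concurrently; a synthesis step would then pick between them.
    results = await asyncio.gather(run_sql_path(query), run_rag_path(query))
    return label, list(results)


label, results = asyncio.run(route("What about Acme?"))
```

Running the two paths concurrently means the ambiguous branch costs roughly the latency of the slower path rather than the sum of both.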
04

System prompt as first-class engineering

The system prompt is not configuration; it is the product. It defines the agent's role, tone, citation behavior, what to say when it doesn't know, and when to escalate. A well-engineered system prompt on a small model can outperform a poorly engineered one on a large model. It is the most critical decision in the stack.

Role · Tone · Citation rules · Handoff signal · Most critical decision

— What it actually costs

Infrastructure that fits any budget.

State-of-the-art AI should not require enterprise infrastructure budgets. The breakdown below shows the real cost of running this architecture at 1,000 queries per day.

Queries per day: 1,000
Gemini Flash generation: $5
Embedding API: $2
Railway VPS: $6
Total / month: ~$13
This architecture (~$13/mo) vs Pinecone + GPT-4 (~$180/mo) vs Enterprise RAG platform (~$500+/mo)
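For other usage levels, the breakdown above can be extrapolated with a back-of-the-envelope model. The sketch below assumes that API costs scale linearly with query volume and that the VPS price stays fixed; neither assumption is stated in the source, and the function name is hypothetical. Real bills also depend on cache hit rate, prompt length, and provider pricing.

```python
BASELINE_QUERIES_PER_DAY = 1_000
GENERATION_USD = 5.0   # Gemini Flash generation per month at the baseline
EMBEDDING_USD = 2.0    # embedding API per month at the baseline
VPS_USD = 6.0          # Railway VPS per month, assumed flat


def estimate_monthly_cost(queries_per_day):
    """Linear extrapolation from the 1,000 queries/day breakdown (assumption)."""
    scale = queries_per_day / BASELINE_QUERIES_PER_DAY
    return round(VPS_USD + scale * (GENERATION_USD + EMBEDDING_USD), 2)


baseline = estimate_monthly_cost(1_000)    # 13.0, matching the ~$13 total above
heavy = estimate_monthly_cost(10_000)      # 76.0 under the linear assumption
```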

Build cost (one-time) — this is where Sintropia adds value. The infrastructure is cheap. The engineering — document audit, ingestion pipeline, retrieval calibration, system prompt design, deployment, interface — is a 4–8 week engagement priced in MXN, accessible to businesses of any size.

— Proof of concept

Arthur was built from this template.

Arthur — the conversational agent running on this site — is a live deployment of this exact architecture. Same ingestion pipeline, same hybrid search, same three-branch router, same streaming SSE endpoint. The only thing that changed is the knowledge base (4 markdown files about Sintropia's services and pricing) and the system prompt (Arthur's role and personality). It was built in under two weeks.

This is what the template delivers. Not a demo — a production system answering real questions, qualifying real leads, and sending real emails. The same system, pointed at your documents, becomes your agent.

Architecture component
What Arthur uses

FastAPI · sqlite-vec + FTS5 · Gemini Flash · gemini-embedding-2 · BM25 + KNN hybrid search · RRF merge · SSE streaming · Railway volume for SQLite · Resend for emails

What was customized
The knowledge + the prompt

4 markdown files: services, pricing, methodology, FAQ — all in Spanish for the Mexican market. System prompt defines Arthur's warm tone, handoff signal, and the rule that HANDOFF:true requires real project details before firing.

Build time
Under 2 weeks

From empty repository to live production agent with bilingual support, streaming responses, email capture, semantic cache, rate limiting, and security hardening.

Running cost
~$15 / month

Gemini Flash generation + embedding API + Railway VPS + Resend email. The infrastructure bill for a production AI agent serving a real business is less than a Netflix subscription.


Build yours on this architecture.

Same template. Your knowledge base. Your agent. In weeks, not quarters — at a price that fits a real business budget.

hello@sintropia.io