RAG Architecture — Sintropia

— System diagram

Every layer, explained.

This is the complete request flow — from the moment a user types a question to the moment a grounded answer arrives. Each layer has a specific job. Removing any one of them degrades the quality of the output in a measurable way.

Ingestion — runs once / on file upload

Source

PDF / XLSX

Your documents

Parser

PyMuPDF / openpyxl

Extract text + tables

Chunker

Semantic Chunks

Rows → sentences for XLSX

Metadata Tag

Source Tagger

source_type · filename · row

Embedding

Gemini Embed

3072-dim vectors

Index (stored once)

sqlite-vec + FTS5

Vector + full-text index
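
The "rows → sentences" chunking and metadata tagging steps above can be sketched roughly as follows. This is a minimal illustration, not the production pipeline: the `Chunk` dataclass, the `rows_to_chunks` helper, the "column: value" sentence format, and the sample invoice data are all assumptions. Only the metadata fields (source_type, filename, row) come from the diagram.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source_type: str  # metadata fields named in the diagram
    filename: str
    row: int


def rows_to_chunks(header, rows, filename, source_type="xlsx"):
    """Turn each spreadsheet row into one sentence-like chunk with metadata."""
    chunks = []
    # Assumption: row 1 of the sheet is the header, so data starts at row 2.
    for row_number, row in enumerate(rows, start=2):
        parts = [
            f"{col}: {val}"
            for col, val in zip(header, row)
            if val not in (None, "")
        ]
        if not parts:
            continue  # skip fully empty rows
        chunks.append(
            Chunk(
                text="; ".join(parts) + ".",
                source_type=source_type,
                filename=filename,
                row=row_number,
            )
        )
    return chunks


# Hypothetical sample data for illustration only.
header = ["invoice", "customer", "total"]
rows = [["A-101", "Acme", 1200], [None, None, None], ["A-102", "Globex", 450]]
chunks = rows_to_chunks(header, rows, "invoices.xlsx")
```

Each chunk keeps its original row number, so a later answer can cite the exact spreadsheet row it came from.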

Query runtime

Input

User Prompt

Natural language — no source specified

HTTP POST

REST API

FastAPI Endpoint

Python · uvicorn · CORS · rate limiting

Cache Layer

Exact-match Cache

In-memory · TTL eviction

cache hit →

Return cached response

cache miss ↓

Orchestrator

Classifier + Fallback Router

Keyword / regex · detects query type · assigns confidence

Structured path

HIGH confidence
numbers · dates · totals

Generation

Text-to-SQL

Gemini Flash writes query

Database

SQLite

Exact · deterministic

Result

Structured Answer

Precise · source: spreadsheet

Ambiguous path

LOW confidence
no source named

Both paths fire

Parallel Execution

Structured + unstructured simultaneously

both

Synthesis

Gemini Flash decides

Discards irrelevant sources · cites source_type

Unstructured path

HIGH confidence
concepts · clauses · text

Runs in parallel

Keyword

BM25

FTS5 · exact terms

Semantic

KNN

sqlite-vec · cosine

Fusion

RRF Merge

Reciprocal Rank Fusion · top-20

Reranker

Rerank Pass

Flash scores → top-5 chunks

Context

Extracted Text

High-precision · source: pdf

Generation

Gemini 2.5 Flash LLM

Synthesizes context + prompt → final answer · cites source

Streamed response

Output

Final User Response

Streamed SSE · answer + source citation
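
The exact-match cache layer in the runtime flow above (in-memory, TTL eviction) could look something like this. It is a sketch under assumptions: the class name, the 300-second default TTL, and the key normalization (lowercasing and collapsing whitespace) are illustrative choices, not details from the diagram.

```python
import time


class ExactMatchCache:
    """In-memory cache keyed by the normalized prompt, with TTL eviction."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}

    @staticmethod
    def _key(prompt):
        # Assumption: prompts differing only in case or spacing count as equal.
        return " ".join(prompt.lower().split())

    def get(self, prompt):
        key = self._key(prompt)
        entry = self._store.get(key)
        if entry is None:
            return None  # cache miss
        expires_at, response = entry
        if self.clock() >= expires_at:
            del self._store[key]  # evict the expired entry lazily on read
            return None
        return response  # cache hit

    def set(self, prompt, response):
        self._store[self._key(prompt)] = (self.clock() + self.ttl, response)
```

On a hit the endpoint returns the stored response immediately. On a miss the request continues to the orchestrator and the final answer is written back with `set`.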

Est. cost · 1k queries/day

Gemini Flash / mo: $3–8

Embed API / mo: $1–3

VPS (Railway): $6

Total / month: ~$15

— Design decisions

Why each layer exists.

Every component was chosen deliberately — the result of optimizing for low infrastructure cost, high retrieval precision, and production reliability at the scale a small or medium business actually needs.

01

SQLite instead of a vector database

Pinecone, Weaviate, and Qdrant cost $70–$500/month and add an external dependency. SQLite with sqlite-vec runs in-process, fits on a $6 VPS, and starts in milliseconds. For knowledge bases under a few million chunks, which covers most small-business use cases, retrieval speed and quality are comparable in practice. Zero infra overhead.

sqlite-vec | In-process | $6/mo VPS | vs $70–500/mo cloud DB
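
To make "in-process" concrete, here is a small sketch using only Python's standard `sqlite3` module. It does not use sqlite-vec: as a plainly labeled stand-in, embeddings are stored as JSON text and nearest neighbors are found by brute-force cosine similarity in Python. The table layout, the three-dimensional toy vectors, and the `knn` helper are all made up for illustration.

```python
import json
import math
import sqlite3

# The whole database lives inside this process: no server, no network hop.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT, embedding TEXT)"
)

# Toy 3-dimensional vectors standing in for real 3072-dim embeddings.
docs = [
    ("Refunds are issued within 30 days.", [0.9, 0.1, 0.0]),
    ("Late deliveries get a 10% credit.", [0.1, 0.9, 0.0]),
    ("Office hours are 9 to 6.", [0.0, 0.1, 0.9]),
]
conn.executemany(
    "INSERT INTO chunks (text, embedding) VALUES (?, ?)",
    [(text, json.dumps(vec)) for text, vec in docs],
)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn(query_vec, k=2):
    """Brute-force nearest neighbors; a vector extension would do this in SQL."""
    rows = conn.execute("SELECT text, embedding FROM chunks").fetchall()
    scored = [(cosine(query_vec, json.loads(emb)), text) for text, emb in rows]
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]


top = knn([0.8, 0.2, 0.0], k=2)
```

The point of the sketch is the deployment shape: the index is a single file (or memory region) opened by the API process itself, so there is no separate service to provision or pay for.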

02

Hybrid BM25 + KNN instead of vector-only

Pure semantic search fails for exact queries — product codes, invoice numbers, names, dates. Pure keyword search fails for conceptual queries — "what's our refund policy", "how do we handle late deliveries." Hybrid search covers both. RRF fusion produces a ranked list more accurate than either strategy alone.

BM25 · FTS5 | KNN cosine | RRF fusion | Best of both worlds
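
Reciprocal Rank Fusion itself is only a few lines. The sketch below merges a keyword ranking and a semantic ranking by summing 1 / (k + rank) for each chunk across the lists. The constant `k=60` is a commonly used default and an assumption here, as are the function name and the chunk IDs. The `top_n=20` cutoff mirrors the "top-20" in the diagram.

```python
def rrf_merge(rankings, k=60, top_n=20):
    """Fuse several ranked lists of chunk IDs with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); higher ranks contribute more.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]


# Hypothetical chunk IDs returned by each retriever, best first.
bm25_hits = ["c3", "c1", "c7"]
knn_hits = ["c1", "c9", "c3"]
merged = rrf_merge([bm25_hits, knn_hits])
# merged == ["c1", "c3", "c9", "c7"]
```

Chunks found by both retrievers (`c1`, `c3`) rise above chunks found by only one, which is the behavior the hybrid design relies on. RRF uses only ranks, so the BM25 and cosine scores never need to be put on a common scale.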

03

Three-branch router instead of a single path

Structured data (spreadsheets, tables) should go through SQL for exact answers. Unstructured data (PDFs, text) should go through RAG. Ambiguous queries run both and let the LLM decide. Most RAG systems ignore structured data entirely, which means many of the answers a business needs come back wrong or missing.

Text-to-SQL | RAG pipeline | Parallel fallback | 100% answer coverage
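
A keyword / regex router with a confidence margin might be sketched like this. The patterns, the `route` function, and the threshold of 2 are illustrative assumptions, and a real deployment would tune them against its own queries. Only the three outcomes (structured, unstructured, both) come from the design above.

```python
import re

# Hypothetical signal patterns; each match counts as one vote.
STRUCTURED = [
    r"\b(total|sum|average|how many|count)\b",
    r"\b\d{4}\b",  # a four-digit number, for example a year
    r"\b(invoice|amount|date)\b",
]
UNSTRUCTURED = [
    r"\b(policy|clause|contract|explain|why)\b",
    r"\bhow do (we|i)\b",
]


def route(prompt, threshold=2):
    """Return 'structured', 'unstructured', or 'both' for a user prompt."""
    text = prompt.lower()
    s = sum(bool(re.search(p, text)) for p in STRUCTURED)
    u = sum(bool(re.search(p, text)) for p in UNSTRUCTURED)
    if s - u >= threshold:
        return "structured"  # high confidence: Text-to-SQL path
    if u - s >= threshold:
        return "unstructured"  # high confidence: hybrid retrieval path
    return "both"  # low confidence: run both paths in parallel
```

A prompt such as "What was the total invoice amount in 2024?" collects three structured votes and goes to SQL, while "Tell me about Acme" matches nothing and falls back to running both paths.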

04

System prompt as first-class engineering

The system prompt is not configuration; it is the product. It defines the agent's role, tone, citation behavior, what to say when it doesn't know, and when to escalate. A well-engineered system prompt on a small model can outperform a poorly engineered one on a large model. It is the most critical decision in the stack.

Role · Tone | Citation rules | Handoff signal | Most critical decision

— What it actually costs

Infrastructure that fits any budget.

State-of-the-art AI should not require enterprise infrastructure budgets. The figures below show the estimated monthly cost of running this architecture at 1,000 queries per day.

Queries per day: 1,000 (range: 100 to 10,000)

Gemini Flash generation: $5

Embedding API: $2

Railway VPS: $6

Total / month: ~$13

This architecture: ~$13/mo

vs Pinecone + GPT-4: ~$180/mo

vs Enterprise RAG platform: ~$500+/mo

Build cost (one-time) — this is where Sintropia adds value. The infrastructure is cheap. The engineering — document audit, ingestion pipeline, retrieval calibration, system prompt design, deployment, interface — is a 4–8 week engagement priced in MXN, accessible to businesses of any size.

— Proof of concept

Arthur was built from this template.

Arthur — the conversational agent running on this site — is a live deployment of this exact architecture. Same ingestion pipeline, same hybrid search, same three-branch router, same streaming SSE endpoint. The only thing that changed is the knowledge base (4 markdown files about Sintropia's services and pricing) and the system prompt (Arthur's role and personality). It was built in under two weeks.

This is what the template delivers. Not a demo — a production system answering real questions, qualifying real leads, and sending real emails. The same system, pointed at your documents, becomes your agent.

Architecture component

What Arthur uses

FastAPI · sqlite-vec + FTS5 · Gemini Flash · gemini-embedding-2 · BM25 + KNN hybrid search · RRF merge · SSE streaming · Railway volume for SQLite · Resend for emails

What was customized

The knowledge + the prompt

4 markdown files: services, pricing, methodology, FAQ — all in Spanish for the Mexican market. System prompt defines Arthur's warm tone, handoff signal, and the rule that HANDOFF:true requires real project details before firing.

Build time

Under 2 weeks

From empty repository to live production agent with bilingual support, streaming responses, email capture, semantic cache, rate limiting, and security hardening.

Running cost

~$15 / month

Gemini Flash generation + embedding API + Railway VPS + Resend email. The infrastructure bill for a production AI agent serving a real business is less than a Netflix subscription.
