Olis API Architecture & Backend Developer Guide
Table of Contents
- Architecture Overview
- System Components
- API Endpoints
- Authentication & Authorization
- RAG Pipeline Deep Dive
- Configuration Reference
- Development Guide
- Performance & Monitoring
- Deployment
- Troubleshooting
Architecture Overview
High-Level Architecture
The Olis API Server is a FastAPI-based RAG (Retrieval-Augmented Generation) system that provides intelligent question answering with document context.
Technology Stack
| Component | Technology | Purpose |
|---|---|---|
| API Framework | FastAPI | High-performance async API server |
| Vector Database | Milvus | Semantic search with embeddings |
| Full-Text Search | Elasticsearch | BM25 keyword search |
| Cache/Memory | Redis | Session state, thread memory |
| LLM Provider | OpenAI / Ollama | Answer generation |
| Embeddings | HuggingFace Transformers | Document & query embeddings |
| Reranker | BGE Reranker v2-m3 | Result re-ranking |
| Authentication | JWT | Token-based auth |
| Background Tasks | FastAPI BackgroundTasks | Async memory updates |
System Components
1. Application State (AppResources)
The API maintains application-wide state through the `AppResources` class, initialized in the `lifespan()` context manager and available via dependency injection.
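A minimal sketch of this pattern, assuming hypothetical client attributes and a `get_resources` helper (only `AppResources` and `lifespan()` are named in this guide):

```python
from contextlib import asynccontextmanager
from fastapi import FastAPI, Request

class AppResources:
    """Holds shared clients (Milvus, Elasticsearch, Redis, embedder, LLM)."""
    def __init__(self):
        self.milvus = None
        self.es = None
        self.redis = None

    async def startup(self):
        # Connect shared clients here (once per process)
        ...

    async def shutdown(self):
        # Close shared clients here
        ...

@asynccontextmanager
async def lifespan(app: FastAPI):
    resources = AppResources()
    await resources.startup()
    app.state.resources = resources  # application-wide state
    yield
    await resources.shutdown()

app = FastAPI(lifespan=lifespan)

def get_resources(request: Request) -> AppResources:
    # Dependency: inject the shared state into route handlers
    return request.app.state.resources
```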
2. Multi-Tenant Organization Manager
Hard isolation per organization:
- Each org gets a dedicated Milvus collection: `{base}_orgid`
- Each org gets a dedicated ES index: `{base}_orgid`
- Resources are cached per org to avoid reconnection overhead
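A minimal sketch of the per-org cache (the class and method names are illustrative; the naming follows the `{base}_orgid` convention above):

```python
class OrgResourceManager:
    """Caches per-organization Milvus/ES handles for hard isolation."""
    def __init__(self, base_collection: str, base_index: str):
        self.base_collection = base_collection  # e.g. MILVUS_COLLECTION
        self.base_index = base_index            # e.g. ES_INDEX
        self._cache: dict[str, dict] = {}

    def resources_for(self, org_id: str) -> dict:
        # Reuse cached handles so repeat requests skip reconnection
        if org_id not in self._cache:
            self._cache[org_id] = {
                "collection": f"{self.base_collection}_{org_id}",
                "index": f"{self.base_index}_{org_id}",
            }
        return self._cache[org_id]
```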
3. RAG Pipeline Components
a. Hybrid Retriever
- Vector Search (Milvus): Semantic similarity using embeddings
- BM25 Search (Elasticsearch): Keyword/lexical matching
- Weighted Fusion: Configurable weights (`WEIGHT_VECTOR`, `WEIGHT_BM25`); see the sketch after this list
- Reranking: Optional BGE reranker for precision
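A minimal sketch of weighted fusion, assuming min-max normalization of each retriever's scores (the actual normalization strategy may differ):

```python
def fuse(vector_hits: dict[str, float], bm25_hits: dict[str, float],
         w_vec: float = 0.5, w_bm25: float = 0.5, top_k: int = 5) -> list[str]:
    """Merge doc_id -> score maps from both retrievers into one ranking."""
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {d: (s - lo) / span for d, s in scores.items()}

    v, b = normalize(vector_hits), normalize(bm25_hits)
    fused = {d: w_vec * v.get(d, 0.0) + w_bm25 * b.get(d, 0.0)
             for d in set(v) | set(b)}
    # RETRIEVER_TOP_K controls the final cut after fusion
    return sorted(fused, key=fused.get, reverse=True)[:top_k]
```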
b. Query Processor
- Query decomposition into sub-queries (optional)
- Query expansion and reformulation
- Temporal filtering and date extraction
c. LLM Generator
- Supports OpenAI (
gpt-4o-mini) and Ollama (local models) - Configurable temperature, top-p, max tokens
- Structured output with citations
d. Memory System
- Session Memory: Long-term user context (10 mins TTL)
- Thread Memory: Conversation history per thread (5 mins TTL)
- Stored in Redis with automatic expiration
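A minimal sketch of the TTL-based storage (the key naming scheme is an assumption):

```python
import json
import redis

r = redis.Redis(host="redis", port=6379, db=0)

SESSION_TIMEOUT = 600  # 10 min: session (user context)
THREAD_TIMEOUT = 300   # 5 min: thread (conversation history)

def save_session(user_id: str, context: dict) -> None:
    # setex writes value and TTL atomically; Redis expires the key for us
    r.setex(f"session:{user_id}", SESSION_TIMEOUT, json.dumps(context))

def save_thread(thread_id: str, history: list[dict]) -> None:
    r.setex(f"thread:{thread_id}", THREAD_TIMEOUT, json.dumps(history))
```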
API Endpoints
Base URL
1. Health Check
2. Intent Detection
Analyze user input to detect intent before RAG pipeline execution.
3. RAG Query/Prediction
Main endpoint for question answering with document retrieval. Request flow:
- Extract JWT, validate user
- Create or retrieve thread ID
- Normalize query text
- Run RAG pipeline (see RAG Pipeline Deep Dive)
- Filter results by RBAC
- Update session memory (background task)
- Return structured answer with citations
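A condensed sketch of this flow. `_filter_docs_by_rbac` is named elsewhere in this guide; the request model and the other helpers (`validate_jwt`, `new_thread_id`, `normalize`, `run_retrieval`, `generate_answer`, `update_session_memory`) are illustrative:

```python
from fastapi import BackgroundTasks, Depends
from pydantic import BaseModel

class QueryRequest(BaseModel):
    query: str
    thread_id: str | None = None

@app.post("/predict")
async def predict(body: QueryRequest, background: BackgroundTasks,
                  user: dict = Depends(validate_jwt),
                  res: AppResources = Depends(get_resources)):
    thread_id = body.thread_id or new_thread_id()     # create or retrieve thread
    query = normalize(body.query)                     # normalize query text
    docs = await run_retrieval(query, res)            # hybrid retrieval (+ rerank)
    docs = _filter_docs_by_rbac(docs, user)           # RBAC before generation
    answer = await generate_answer(query, docs, res)  # LLM answer with citations
    background.add_task(update_session_memory, user, query, answer)
    return {"answer": answer, "thread_id": thread_id}
```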
4. Document Upsert/Ingestion
Insert or update documents in the knowledge base. Ingestion flow:
- Validate JWT and extract org_id
- Get org-specific ingester (creates if needed)
- Chunk documents (default: 1000 chars, 150 overlap)
- Generate embeddings (HuggingFace model)
- Store in Milvus (vectors) and Elasticsearch (full-text)
- Apply org_id stamp for hard isolation
Request fields:
- `title`: Document title
- `content`: Full text content
- `metadata.source`: Source system (confluence, slack, github, etc.)
- `metadata.owner_id`: Document owner email
- `metadata.acl`: Access control list (pipe-delimited)
- `metadata.created_at`: ISO date string
- `metadata.*`: Any additional metadata
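An illustrative request built from the fields above (the base URL and exact payload shape are assumptions; the `/upsert` path appears in the Troubleshooting section):

```python
import requests

payload = {
    "title": "Onboarding Runbook",
    "content": "Full text of the document ...",
    "metadata": {
        "source": "confluence",
        "owner_id": "[email protected]",
        "acl": "|user:[email protected]|role:engineer|group:product|",
        "created_at": "2026-01-15",
        "team": "platform",  # any extra metadata.* field
    },
}
resp = requests.post(
    "http://localhost:8000/upsert",
    json=payload,
    headers={"Authorization": "Bearer <jwt>"},
)
resp.raise_for_status()
```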
5. Test Prediction (No Auth Required)
Simplified prediction endpoint for testing (auth bypass). Request: same as `/predict`.
Response: same as `/predict`, plus a `prediction_time` field.
⚠️ WARNING: This endpoint bypasses JWT auth. Only enable in development!
Authentication & Authorization
JWT Authentication
Token Format: Bearer token in the `Authorization` header
Validation steps:
- Signature verification with `JWT_SECRET`
- Expiration check (`exp`)
- Issuer validation (`iss`)
- Token type must be `access`
Error responses:
- `401`: Missing/invalid/expired token
- `503`: `JWT_SECRET` not configured
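A sketch of these checks using PyJWT (the claim name `type` for the token type is an assumption; error mapping simplified):

```python
import os
import jwt  # PyJWT
from fastapi import HTTPException

def validate_token(token: str) -> dict:
    secret = os.environ.get("JWT_SECRET")
    if not secret:
        raise HTTPException(503, "JWT_SECRET not configured")
    try:
        payload = jwt.decode(
            token,
            secret,
            algorithms=[os.environ.get("JWT_ALGORITHM", "HS256")],
            issuer=os.environ.get("JWT_ISSUER", "oauth-app"),
            # exp is verified by default when present
        )
    except jwt.InvalidTokenError:
        raise HTTPException(401, "Missing/invalid/expired token")
    if payload.get("type") != "access":  # token type must be "access"
        raise HTTPException(401, "Wrong token type")
    return payload
```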
Role-Based Access Control (RBAC)
Documents include ACL metadata for fine-grained access control. ACL format (pipe-delimited keyset):
- Deny always wins: if `|deny:[email protected]|` is present, access is blocked
- Owner access: if `owner_id` matches the user email, granted
- Public access: if `|user:_all_|` is present, granted to all
- User-specific: if `|user:[email protected]|` matches, granted
- Role-based: if `|role:engineer|` is present and the user has that role, granted
- Group-based: if `|group:product|` is present and the user is in that group, granted
- Default deny: if there is no ACL or no match, only the owner can access
- Post-retrieval filtering: `_filter_docs_by_rbac(docs, user_info)`
- Applied after retrieval, before LLM generation; a sketch follows
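A sketch of the rule evaluation. The function name `_filter_docs_by_rbac` comes from this guide; the internals and the shape of `user_info` are assumptions:

```python
def _allowed(doc: dict, user: dict) -> bool:
    meta = doc.get("metadata", {})
    acl = meta.get("acl") or ""
    email = user["email"]
    if f"|deny:{email}|" in acl:                           # deny always wins
        return False
    if meta.get("owner_id") == email:                      # owner access
        return True
    if "|user:_all_|" in acl:                              # public access
        return True
    if f"|user:{email}|" in acl:                           # user-specific
        return True
    if any(f"|role:{r}|" in acl for r in user.get("roles", [])):
        return True                                        # role-based
    if any(f"|group:{g}|" in acl for g in user.get("groups", [])):
        return True                                        # group-based
    return False                                           # default deny

def _filter_docs_by_rbac(docs: list[dict], user_info: dict) -> list[dict]:
    return [d for d in docs if _allowed(d, user_info)]
```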
RAG Pipeline Deep Dive
Pipeline Execution Flow
Step-by-Step Breakdown
1. Query Normalization
2. Retrieval Phase
Vector Retrieval (Milvus):
- Embed the query using the same model as documents
- Cosine similarity search in vector space
- Top-K results: `VECTOR_RETRIEVER_TOP_K` (default: 8)
BM25 Retrieval (Elasticsearch):
- Lexical keyword matching
- TF-IDF scoring
- Top-K results: `BM25_RETRIEVER_TOP_K` (default: 8)
3. Reranking (Optional)
If `USE_RERANKER=true`:
- Model: `BAAI/bge-reranker-v2-m3`
- Re-score top results for precision
- Select top `RERANK_TOP_K` (default: 5)
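A minimal sketch using the sentence-transformers CrossEncoder API (the score scale depends on the model's activation, so treat the threshold comparison as illustrative):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query: str, docs: list[str], top_k: int = 5,
           threshold: float = 0.5) -> list[str]:
    # Cross-encoder scores each (query, doc) pair jointly for precision
    scores = reranker.predict([(query, d) for d in docs])
    ranked = sorted(zip(docs, scores), key=lambda p: p[1], reverse=True)
    # Drop results below RERANKER_SCORE_THRESHOLD, then keep RERANK_TOP_K
    return [d for d, s in ranked if s >= threshold][:top_k]
```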
4. RBAC Filtering
Applies `_filter_docs_by_rbac()` to drop documents the user cannot access (see Role-Based Access Control above).
5. Context Preparation
The generator receives:
- `context`: Concatenated document chunks with source markers
- `sources`: Dict of source documents with metadata
- `query`: Current query
- `original_query`: Original user query (before decomposition)
- `subqueries`: List of decomposed queries (if enabled)
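A sketch of assembling that payload, assuming docs carry `id`, `content`, and `metadata` fields (the source-marker format is an assumption):

```python
def build_generation_payload(docs: list[dict], query: str,
                             original_query: str, subqueries: list[str]) -> dict:
    # Concatenate chunks with a marker so citations can point back to sources
    context = "\n\n".join(
        f"[source: {d['metadata']['source']}]\n{d['content']}" for d in docs
    )
    return {
        "context": context,
        "sources": {d["id"]: d["metadata"] for d in docs},
        "query": query,
        "original_query": original_query,
        "subqueries": subqueries,
    }
```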
6. LLM Generation
- Model: `RAG_GENERATOR_MODEL` (default: `gpt-4o-mini`)
- Temperature: `RAG_LLM_TEMP` (default: 0.0)
- Top-P: `RAG_LLM_TOP_P` (default: 0.2)
- Max tokens: configured via `OLLAMA_NUM_PREDICT` or model default
7. Post-Processing
8. Fast Path (Empty Context)
If `SKIP_LLM_ON_EMPTY_CONTEXT=true` and no docs are retrieved, the pipeline skips LLM generation and returns immediately.
Query Decomposition (Advanced)
If `USE_QUERY_DECOMPOSITION=true`:
- Decompose the complex query into `QUERY_DECOMPOSITION_N` sub-queries (default: 5)
- Retrieve docs for each sub-query independently
- Select top `SUBQUERY_TOP_K` docs per sub-query (default: 5)
- Merge and deduplicate results
- Reason over combined results (if `SUBQUERY_REASONING_ENABLED=true`)
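A sketch of the decomposition loop (the `decompose` and `retrieve` helpers are illustrative):

```python
def decomposed_retrieve(query: str, n: int = 5, per_subquery_k: int = 5):
    subqueries = decompose(query, n=n)      # LLM splits the complex query
    seen: set[str] = set()
    merged: list[dict] = []
    for sq in subqueries:
        for doc in retrieve(sq)[:per_subquery_k]:  # SUBQUERY_TOP_K per sub-query
            if doc["id"] not in seen:              # deduplicate across sub-queries
                seen.add(doc["id"])
                merged.append(doc)
    return subqueries, merged
```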
Configuration Reference
Environment Variables
Database Configuration
| Variable | Default | Description |
|---|---|---|
| `MILVUS_URI` | `http://milvus:19530` | Milvus connection URI |
| `MILVUS_COLLECTION` | `docs_collection` | Base collection name |
| `ES_URI` | `http://elasticsearch:9200` | Elasticsearch URI |
| `ES_INDEX` | `docs_bm25` | Base index name |
LLM Configuration
| Variable | Default | Description |
|---|---|---|
| `RAG_LLM_PROVIDER` | `openai` | LLM provider (openai or ollama) |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `OLLAMA_MODEL` | `llama3.1` | Ollama model name |
| `OLLAMA_NUM_PREDICT` | null | Max tokens to generate |
| `OLLAMA_NUM_CTX` | null | Context window size |
| `RAG_GENERATOR_MODEL` | `gpt-4o-mini` | OpenAI model name |
| `RAG_LLM_TEMP` | 0.0 | LLM temperature |
| `RAG_LLM_TOP_P` | 0.2 | Top-p sampling |
| `OPENAI_API_KEY` | - | OpenAI API key |
Retrieval Configuration
| Variable | Default | Description |
|---|---|---|
| `VECTOR_RETRIEVER_TOP_K` | 8 | Vector search top-K |
| `BM25_RETRIEVER_TOP_K` | 8 | BM25 search top-K |
| `RETRIEVER_TOP_K` | 5 | Final top-K after fusion |
| `WEIGHT_VECTOR` | 0.5 | Vector search weight |
| `WEIGHT_BM25` | 0.5 | BM25 search weight |
| `USE_RERANKER` | false | Enable reranking |
| `RERANK_TOP_K` | 5 | Reranker top-K |
| `RERANKER_SCORE_THRESHOLD` | 0.5 | Min reranker score |
| `SKIP_LLM_ON_EMPTY_CONTEXT` | true | Fast path when no docs |
Query Processing
| Variable | Default | Description |
|---|---|---|
| `USE_QUERY_DECOMPOSITION` | true | Enable query decomposition |
| `QUERY_DECOMPOSITION_N` | 5 | Number of sub-queries |
| `SUBQUERY_TOP_K` | 5 | Docs per sub-query |
| `SUBQUERY_REASONING_ENABLED` | true | Multi-hop reasoning |
Memory Configuration
| Variable | Default | Description |
|---|---|---|
| `REDIS_URL` | - | Redis connection URL |
| `REDIS_HOST` | `redis` | Redis host |
| `REDIS_PORT` | 6379 | Redis port |
| `REDIS_DB` | 0 | Redis database number |
| `REDIS_PASSWORD` | - | Redis password |
| `REDIS_SSL` | false | Use SSL for Redis |
| `SESSION_TIMEOUT` | 600 | Session TTL (seconds) |
| `MAX_SESSION_TOKENS` | 500 | Max tokens in session |
| `THREAD_TIMEOUT` | 300 | Thread TTL (seconds) |
Document Ingestion
| Variable | Default | Description |
|---|---|---|
| `CHUNK_SIZE` | 1000 | Chunk size (characters) |
| `CHUNK_OVERLAP` | 150 | Overlap between chunks (characters) |
| `USE_SEMANTIC_SPLIT` | false | Semantic chunking |
| `USE_RAPTOR` | true | Enable RAPTOR hierarchical indexing |
| `EMBED_MODEL_NAME` | `sentence-transformers/msmarco-distilbert-base-v4` | Embedding model |
| `EMBED_DIMENSION` | 768 | Embedding dimension |
Authentication
| Variable | Default | Description |
|---|---|---|
| `JWT_SECRET` | - | JWT signing secret |
| `JWT_ISSUER` | `oauth-app` | JWT issuer claim |
| `JWT_ALGORITHM` | `HS256` | JWT signing algorithm |
Performance
| Variable | Default | Description |
|---|---|---|
| `PERF_LOG` | false | Enable performance logging |
| `MAX_RETRIEVAL_THREADS` | 10 | Concurrent retrieval threads |
| `MAX_INGESTION_THREADS` | 10 | Concurrent ingestion threads |
Development Guide
Local Setup
Prerequisites
- Python 3.10+
- Docker & Docker Compose
- Poetry (Python dependency manager)
1. Clone Repository
2. Install Dependencies
3. Start Infrastructure Services
4. Set Environment Variables
.env:
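A minimal example drawn from the Configuration Reference; hostnames assume the Compose services are exposed on localhost, and the key values are placeholders:

```
# .env — example values for local development
MILVUS_URI=http://localhost:19530
ES_URI=http://localhost:9200
REDIS_HOST=localhost
REDIS_PORT=6379
RAG_LLM_PROVIDER=openai
RAG_GENERATOR_MODEL=gpt-4o-mini
OPENAI_API_KEY=<your-key>
JWT_SECRET=<random-32+-char-string>
JWT_ISSUER=oauth-app
```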
5. Run Server
6. Verify
Testing
Integration Tests
Unit Tests
Load Testing
Debugging
Enable Debug Logging
Performance Logging
Agent Debugging
Custom debug logging is available via `_debug_log()`. The output path is controlled by `DEBUG_LOG_PATH` (default: `.cursor/debug.log`).
Performance & Monitoring
Performance Metrics
The RAG pipeline emits detailed performance metrics:
- `retrieval_ms`: Time for hybrid retrieval
- `llm_ms`: Time for LLM generation
- `total_ms`: End-to-end latency
- `docs_count`: Number of retrieved documents
- `context_chars`: Context size sent to LLM
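A sketch of emitting those fields when `PERF_LOG` is enabled (logger setup and function boundaries are illustrative):

```python
import json
import logging
import time

log = logging.getLogger("perf")

def run_with_metrics(query, retrieve, generate):
    t0 = time.perf_counter()
    docs = retrieve(query)
    t1 = time.perf_counter()
    answer, context = generate(query, docs)
    t2 = time.perf_counter()
    log.info(json.dumps({
        "retrieval_ms": round((t1 - t0) * 1000),
        "llm_ms": round((t2 - t1) * 1000),
        "total_ms": round((t2 - t0) * 1000),
        "docs_count": len(docs),
        "context_chars": len(context),
    }))
    return answer
```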
Optimization Tips
1. Tune Retrieval Parameters
2. Use Fast Path
3. Optimize Context Size
4. Disable Query Decomposition
5. Use Ollama for Local LLM
Caching Strategy
- Session Memory: 10-minute TTL, stores user context
- Thread Memory: 5-minute TTL, stores conversation history
- Org Resources: Cached indefinitely per org_id
Approximate Redis memory footprint:
- Session: ~10KB per user
- Thread: ~5KB per thread
- Estimate: 100 concurrent users (one thread each) ≈ 100 × (10KB + 5KB) ≈ 1.5MB
Deployment
Docker Deployment
1. Build Image
2. Run Container
3. Docker Compose
Production Considerations
1. Security
- Never expose `/testpredict` in production (bypasses auth)
- Use a strong `JWT_SECRET` (32+ chars, random)
- Enable HTTPS/TLS for all connections
- Lock down CORS: `allow_origins=["https://yourdomain.com"]`
2. Scaling
- Horizontal Scaling: Run multiple API instances behind load balancer
- Stateless Design: All state in Redis, safe to scale
- Database Scaling:
  - Milvus: Standalone → Cluster mode
  - Elasticsearch: Single node → Cluster
  - Redis: Single instance → Redis Cluster/Sentinel
3. Monitoring
- Health Checks: `/healthz` endpoint for load balancer probes
- Metrics: Export performance logs to DataDog/Prometheus
- Alerting: Monitor `total_ms` > 5000ms and error rates
4. Backup
- Milvus: Regular snapshots of `/var/lib/milvus`
- Elasticsearch: Snapshot and restore API
- Redis: RDB/AOF persistence enabled
Troubleshooting
Common Issues
1. “JWT_SECRET not configured”
Symptom: All API calls return 503.
Solution: Set `JWT_SECRET` in the server environment.
2. Milvus Connection Failed
Symptom: Server fails to start with a Milvus error.
Check: `MILVUS_URI` is reachable from the API process and the Milvus service is healthy.
3. Empty RAG Results
Symptom: All queries return empty answers.
Debug:
- Ingest documents via `/upsert`
- Check that ACLs match user permissions
- Verify embeddings are generated correctly
4. Slow Query Performance
Symptom: `total_ms` > 5000ms.
Profile: Enable `PERF_LOG=true` and compare `retrieval_ms` vs `llm_ms` to locate the bottleneck.
5. Redis Connection Errors
Symptom: “Error connecting to Redis”.
Check: `REDIS_HOST`, `REDIS_PORT`, `REDIS_PASSWORD`, and `REDIS_SSL` settings.
API Design Patterns
Dependency Injection
FastAPI’s dependency injection provides clean resource access.
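For example, reusing the `get_resources` dependency from the AppResources sketch (the response shape is illustrative):

```python
from fastapi import Depends

@app.get("/healthz")
async def healthz(res: AppResources = Depends(get_resources)):
    # Shared clients are injected per request; handlers hold no globals
    return {"status": "ok"}
```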
Background Tasks
Expensive operations run asynchronously.
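For example, a trimmed variant of the `/predict` handler (helper names illustrative):

```python
from fastapi import BackgroundTasks

@app.post("/predict")
async def predict(body: QueryRequest, background: BackgroundTasks):
    answer = await run_pipeline(body.query)
    # Scheduled after the response is sent, so memory writes add no latency
    background.add_task(update_session_memory, body.query, answer)
    return {"answer": answer}
```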
Lifespan Context Manager
Resources are initialized once at startup and cleaned up at shutdown (see the AppResources sketch under System Components).
Advanced Topics
Custom Prompts
Modify prompts in `backend/variables/prompts.py`.
Multi-Tenant Isolation
Each organization gets isolated resources (see Multi-Tenant Organization Manager under System Components).
Streaming Responses
For real-time LLM streaming, wrap the token generator in a streaming response; a minimal sketch follows.
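A minimal sketch (the route path and the `generate_tokens` helper are hypothetical):

```python
from fastapi.responses import StreamingResponse

@app.post("/predict/stream")
async def predict_stream(body: QueryRequest):
    async def token_stream():
        # Yield LLM tokens as they arrive instead of waiting for the full answer
        async for token in generate_tokens(body.query):
            yield token
    return StreamingResponse(token_stream(), media_type="text/plain")
```

Additional Resources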
- FastAPI Docs: https://fastapi.tiangolo.com/
- Milvus Docs: https://milvus.io/docs
- LangChain Docs: https://python.langchain.com/
- OpenAI API: https://platform.openai.com/docs
- Ollama: https://ollama.ai/
Support & Contact
For questions or issues:
- GitHub Issues: olis-monorepo/issues
- Email: [email protected]
- Slack: #olis-backend
Last Updated: 2026-02-10
Version: 1.0.0
Author: Olis Backend Team