Olis API Architecture & Backend Developer Guide

Table of Contents

  1. Architecture Overview
  2. System Components
  3. API Endpoints
  4. Authentication & Authorization
  5. RAG Pipeline Deep Dive
  6. Configuration Reference
  7. Development Guide
  8. Performance & Monitoring
  9. Deployment
  10. Troubleshooting

Architecture Overview

High-Level Architecture

The Olis API Server is a FastAPI-based Retrieval-Augmented Generation (RAG) system that answers user questions using retrieved document context.

Technology Stack

Component | Technology | Purpose
API Framework | FastAPI | High-performance async API server
Vector Database | Milvus | Semantic search with embeddings
Full-Text Search | Elasticsearch | BM25 keyword search
Cache/Memory | Redis | Session state, thread memory
LLM Provider | OpenAI / Ollama | Answer generation
Embeddings | HuggingFace Transformers | Document & query embeddings
Reranker | BGE Reranker v2-m3 | Result re-ranking
Authentication | JWT | Token-based auth
Background Tasks | FastAPI BackgroundTasks | Async memory updates

System Components

1. Application State (AppResources)

The API maintains application-wide state through the AppResources class:
class AppResources:
    retriever: object              # Hybrid retriever (Milvus + ES)
    generator: object              # LLM handler
    answer_service: AnswerService  # Core RAG service
    user_chains: Dict              # Per-user conversation chains
    session_memory: SessionMemory  # User session memory
    thread_memory: ThreadMemory    # Conversation thread memory
    ingester: DocumentIngester     # Document ingestion pipeline
    org_manager: OrgResourceManager # Multi-tenant org isolation
Lifecycle: Initialized in lifespan() context manager, available via dependency injection.

2. Multi-Tenant Organization Manager

Hard isolation per organization:
  • Each org gets dedicated Milvus collection: {base}_orgid
  • Each org gets dedicated ES index: {base}_orgid
  • Resources cached per org to avoid reconnection overhead
# Example: Getting org-specific resources
org_res = resources.org_manager.get(org_id, init_ingester=True)
retriever = org_res["retriever"]
answer_service = org_res["answer_service"]
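The caching pattern behind get() can be sketched roughly as follows (an illustrative sketch only; the real OrgResourceManager lives in the backend and its names and construction details may differ):

from typing import Any, Callable, Dict

class OrgResourceCache:
    """Illustrative per-org resource cache (not the actual OrgResourceManager)."""

    def __init__(self, factory: Callable[[str], Dict[str, Any]]):
        self._factory = factory  # builds the org-scoped retriever/answer_service/etc.
        self._cache: Dict[str, Dict[str, Any]] = {}

    def get(self, org_id: str) -> Dict[str, Any]:
        # Build the org's resources once, then reuse them to avoid reconnecting
        # to Milvus/Elasticsearch on every request.
        if org_id not in self._cache:
            self._cache[org_id] = self._factory(org_id)
        return self._cache[org_id]

# Usage: the factory derives the org-suffixed collection/index names.
cache = OrgResourceCache(lambda org_id: {
    "collection": f"docs_collection_{org_id}",
    "index": f"docs_bm25_{org_id}",
})
print(cache.get("acme")["collection"])  # -> docs_collection_acme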

3. RAG Pipeline Components

a. Hybrid Retriever

  • Vector Search (Milvus): Semantic similarity using embeddings
  • BM25 Search (Elasticsearch): Keyword/lexical matching
  • Weighted Fusion: Configurable weights (WEIGHT_VECTOR, WEIGHT_BM25)
  • Reranking: Optional BGE reranker for precision

b. Query Processor

  • Query decomposition into sub-queries (optional)
  • Query expansion and reformulation
  • Temporal filtering and date extraction

c. LLM Generator

  • Supports OpenAI (gpt-4o-mini) and Ollama (local models)
  • Configurable temperature, top-p, max tokens
  • Structured output with citations

d. Memory System

  • Session Memory: Longer-lived user context across queries (10-minute TTL)
  • Thread Memory: Conversation history per thread (5-minute TTL)
  • Stored in Redis with automatic expiration (see the sketch below)
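A minimal sketch of how such TTL-based memory can be kept in Redis (key names are illustrative; the 300-second TTL mirrors THREAD_TIMEOUT, and the real SessionMemory/ThreadMemory classes may structure this differently):

import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def save_thread_memory(thread_id: str, messages: list, ttl_seconds: int = 300) -> None:
    # SETEX writes the value and the TTL in one call; Redis expires the key automatically.
    r.setex(f"thread:{thread_id}", ttl_seconds, json.dumps(messages))

def load_thread_memory(thread_id: str) -> list:
    raw = r.get(f"thread:{thread_id}")
    return json.loads(raw) if raw else []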

API Endpoints

Base URL

http://localhost:8002

1. Health Check

GET /healthz
Response:
{
  "status": "ok"
}

2. Intent Detection

Analyze user input to detect intent before RAG pipeline execution.
POST /query
Headers:
Authorization: Bearer <JWT_TOKEN>
Content-Type: application/json
Request Body:
{
  "text": "I need to review the compliance procedures before contacting the client",
  "user_info": {
    "user_name": "[email protected]",
    "real_name": "John Doe"
  }
}
Response:
{
  "intent": "compliance_check",
  "confidence": 0.95,
  "requires_rag": true,
  "input": "I need to review...",
  "user_info": {
    "user_name": "[email protected]",
    "real_name": "John Doe",
    "org_id": "uuid-here",
    "current_roles": ["analyst"],
    "current_groups": []
  }
}
Use Case: Pre-filter queries, route to appropriate handlers, avoid unnecessary RAG calls.
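For illustration, the endpoint can be exercised with a plain HTTP client (the token and query text are placeholders):

import requests

resp = requests.post(
    "http://localhost:8002/query",
    headers={"Authorization": "Bearer <JWT_TOKEN>"},
    json={
        "text": "I need to review the compliance procedures before contacting the client",
        "user_info": {"user_name": "[email protected]", "real_name": "John Doe"},
    },
    timeout=30,
)
resp.raise_for_status()
body = resp.json()
print(body["intent"], body["confidence"], body["requires_rag"])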

3. RAG Query/Prediction

Main endpoint for question answering with document retrieval.
POST /predict
Headers:
Authorization: Bearer <JWT_TOKEN>
Content-Type: application/json
Request Body:
{
  "query": "What are the security implementation challenges in Olis?",
  "user_info": {
    "user_name": "[email protected]",
    "real_name": "Dev User"
  },
  "thread_id": "optional-existing-thread-id"
}
Response:
{
  "answer": {
    "callout_answer": "The main security challenges in Olis include...",
    "positive_list": [
      "Implementing end-to-end encryption for sensitive data",
      "Establishing secure authentication flows"
    ],
    "negative_list": [],
    "info_list": [
      "Security audit scheduled for Q2",
      "RBAC implementation in progress"
    ],
    "keywords": ["security", "encryption", "authentication", "RBAC"]
  },
  "sources": {
    "doc_1": {
      "title": "Security Implementation Plan",
      "source": "internal-wiki",
      "page_content": "...",
      "metadata": { "owner_id": "[email protected]", "acl": "|user:_all_|" }
    }
  },
  "query": "What are the security implementation challenges in Olis?",
  "original_query": "What are the security implementation challenges in Olis?",
  "input": "What are the security implementation challenges in Olis?",
  "thread_id": "uuid-thread-id"
}
Flow:
  1. Extract JWT, validate user
  2. Create or retrieve thread ID
  3. Normalize query text
  4. Run RAG pipeline (see RAG Pipeline Deep Dive)
  5. Filter results by RBAC
  6. Update session memory (background task)
  7. Return structured answer with citations
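Because the response echoes a thread_id, a client can send it back on follow-up calls so thread memory carries the conversation. An illustrative client snippet (token and query text are placeholders):

import requests

BASE = "http://localhost:8002"
HEADERS = {"Authorization": "Bearer <JWT_TOKEN>"}

first = requests.post(f"{BASE}/predict", headers=HEADERS, timeout=60, json={
    "query": "What are the security implementation challenges in Olis?",
    "user_info": {"user_name": "[email protected]"},
}).json()

follow_up = requests.post(f"{BASE}/predict", headers=HEADERS, timeout=60, json={
    "query": "Who is responsible for addressing them?",
    "user_info": {"user_name": "[email protected]"},
    "thread_id": first["thread_id"],  # reuse the thread so prior context is available
}).json()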

4. Document Upsert/Ingestion

Insert or update documents in the knowledge base.
POST /upsert
Headers:
Authorization: Bearer <JWT_TOKEN>
Content-Type: application/json
Request Body:
{
  "user_email": "[email protected]",
  "documents": {
    "doc_id_1": {
      "title": "Product Requirements Document",
      "content": "Full document text here...",
      "metadata": {
        "source": "confluence",
        "created_at": "2026-01-15",
        "owner_id": "[email protected]",
        "acl": "|user:_all_|role:pm|role:engineering|"
      }
    },
    "doc_id_2": {
      "title": "API Design Spec",
      "content": "...",
      "metadata": { "source": "github", "acl": "|role:engineering|" }
    }
  }
}
Response:
{
  "status": "success",
  "user": "[email protected]",
  "num_chunks": "42"
}
Processing:
  1. Validate JWT and extract org_id
  2. Get org-specific ingester (creates if needed)
  3. Chunk documents (default: 1000 chars, 150 overlap)
  4. Generate embeddings (HuggingFace model)
  5. Store in Milvus (vectors) and Elasticsearch (full-text)
  6. Apply org_id stamp for hard isolation
Supported Document Fields:
  • title: Document title
  • content: Full text content
  • metadata.source: Source system (confluence, slack, github, etc.)
  • metadata.owner_id: Document owner email
  • metadata.acl: Access control list (pipe-delimited)
  • metadata.created_at: ISO date string
  • metadata.*: Any additional metadata
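Putting the fields above together, a minimal client-side ingestion call might look like this (document content and token are placeholders):

import requests

payload = {
    "user_email": "[email protected]",
    "documents": {
        "doc_id_1": {
            "title": "Product Requirements Document",
            "content": "Full document text here...",
            "metadata": {
                "source": "confluence",
                "owner_id": "[email protected]",
                "acl": "|user:_all_|role:pm|role:engineering|",
            },
        },
    },
}
resp = requests.post(
    "http://localhost:8002/upsert",
    headers={"Authorization": "Bearer <JWT_TOKEN>"},
    json=payload,
    timeout=120,
)
print(resp.json())  # e.g. {"status": "success", "user": "...", "num_chunks": "..."}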

5. Test Prediction (No Auth Required)

Simplified prediction endpoint for testing (auth bypass).
POST /testpredict
Request: Same as /predict
Response: Same as /predict, plus a prediction_time field
⚠️ WARNING: This endpoint bypasses JWT auth. Only enable it in development!

Authentication & Authorization

JWT Authentication

Token Format: Bearer token in Authorization header
Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
Required JWT Claims:
{
  "email": "[email protected]",
  "role": "engineer",
  "org_id": "uuid-org-id",
  "type": "access",
  "exp": 1738425600,
  "iat": 1738422000,
  "iss": "oauth-app",
  "jti": "unique-token-id"
}
Configuration:
JWT_SECRET=your-secret-key-here
JWT_ISSUER=oauth-app
JWT_ALGORITHM=HS256
Token Validation:
  • Signature verification with JWT_SECRET
  • Expiration check (exp)
  • Issuer validation (iss)
  • Token type must be access
Error Responses:
  • 401: Missing/invalid/expired token
  • 503: JWT_SECRET not configured
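For reference, validation along these lines can be reproduced with PyJWT; this is a sketch of the checks listed above, not the server's exact code:

import jwt  # PyJWT

def validate_access_token(token: str, secret: str, issuer: str = "oauth-app") -> dict:
    # Raises jwt.ExpiredSignatureError / jwt.InvalidTokenError on failure,
    # which the API surfaces as 401 responses.
    claims = jwt.decode(
        token,
        secret,
        algorithms=["HS256"],
        issuer=issuer,                        # validates the iss claim
        options={"require": ["exp", "iat"]},  # expiration is checked automatically
    )
    if claims.get("type") != "access":
        raise jwt.InvalidTokenError("token type must be 'access'")
    return claims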

Role-Based Access Control (RBAC)

Documents include ACL metadata for fine-grained access control.
ACL Format (pipe-delimited keyset):
|user:[email protected]|role:engineer|role:pm|group:product|deny:[email protected]|
Access Rules:
  1. Deny always wins: If |deny:[email protected]| present, access blocked
  2. Owner access: If owner_id matches user email, granted
  3. Public access: If |user:_all_| present, granted to all
  4. User-specific: If |user:[email protected]| matches, granted
  5. Role-based: If |role:engineer| and user has role, granted
  6. Group-based: If |group:product| and user in group, granted
  7. Default deny: If no ACL or no match, only owner can access
Implementation:
def _doc_allowed_for_user(doc: Any, user_info: Dict[str, Any]) -> bool:
    meta = doc.metadata
    user = (user_info.get("user_name") or "").lower()
    owner = (meta.get("owner_id") or "").lower()
    acl = (meta.get("acl") or "").lower()

    # Deny always wins
    if f"|deny:{user}|" in acl:
        return False

    # Owner check
    if owner and owner == user:
        return True

    # Public check
    if "|user:_all_|" in acl:
        return True

    # User/role/group checks...
    # (See main.py:_doc_allowed_for_user for the full implementation)
    return False  # default deny when no rule matches
Filtering:
  • Post-retrieval filtering: _filter_docs_by_rbac(docs, user_info)
  • Applied after retrieval, before LLM generation
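The post-retrieval filter is then little more than a pass over the retrieved docs using the check above (a sketch; see main.py for the actual helper):

from typing import Any, Dict, List

def _filter_docs_by_rbac(docs: List[Any], user_info: Dict[str, Any]) -> List[Any]:
    # Keep only the documents the user may see before anything reaches the LLM.
    return [doc for doc in docs if _doc_allowed_for_user(doc, user_info)]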

RAG Pipeline Deep Dive

Pipeline Execution Flow

At a high level the pipeline runs: query normalization → hybrid retrieval (Milvus + Elasticsearch) → optional reranking → RBAC filtering → context preparation → LLM generation → post-processing. Each step is detailed below.

Step-by-Step Breakdown

1. Query Normalization

query_text = _normalize_query_value(input.query)
# Handles string or dict input, extracts query value

2. Retrieval Phase

Vector Retrieval (Milvus):
  • Embed query using same model as documents
  • Cosine similarity search in vector space
  • Top-K results: VECTOR_RETRIEVER_TOP_K (default: 8)
BM25 Retrieval (Elasticsearch):
  • Lexical keyword matching
  • TF-IDF scoring
  • Top-K results: BM25_RETRIEVER_TOP_K (default: 8)
Hybrid Fusion:
# Weighted combination
score = (vector_score * WEIGHT_VECTOR) + (bm25_score * WEIGHT_BM25)
# Default weights: 0.5 / 0.5
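A slightly fuller sketch of the same fusion over two ranked result sets (illustrative, not the retriever's exact code):

def fuse(vector_hits: dict, bm25_hits: dict,
         w_vector: float = 0.5, w_bm25: float = 0.5, top_k: int = 5) -> list:
    """vector_hits / bm25_hits map doc_id -> score (assumed already normalized)."""
    fused = {}
    for doc_id in set(vector_hits) | set(bm25_hits):
        fused[doc_id] = (vector_hits.get(doc_id, 0.0) * w_vector
                         + bm25_hits.get(doc_id, 0.0) * w_bm25)
    # Highest combined score first, truncated to the final top-K.
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]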
Performance Timing:
retrieval_ms = int((time.monotonic() - t0) * 1000)
_perf_emit("retrieval_done", {"retrieval_ms": retrieval_ms, "docs_count": len(docs)})

3. Reranking (Optional)

If USE_RERANKER=true:
  • Model: BAAI/bge-reranker-v2-m3
  • Re-score top results for precision
  • Select top RERANK_TOP_K (default: 5)
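To reproduce the re-scoring step outside the server, a cross-encoder call along these lines works (a sketch using sentence-transformers; the backend's reranker wrapper may differ):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query: str, doc_texts: list, top_k: int = 5, threshold: float = 0.5) -> list:
    # Score every (query, document) pair, then keep the best-scoring documents.
    scores = reranker.predict([(query, text) for text in doc_texts])
    ranked = sorted(zip(doc_texts, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, score in ranked if score >= threshold][:top_k]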

4. RBAC Filtering

docs = _filter_docs_by_rbac(docs, user_info)
# Remove docs user doesn't have permission to access

5. Context Preparation

context = answer_service.prepare_context(
    inputs={
        "retrieved_docs": docs,
        "query": query_text,
        "original_query": query_text,
        "subqueries": [query_text],
    },
    user_info=user_info,
    chat_history=None,  # Added from thread memory if available
)
Context Structure:
  • context: Concatenated document chunks with source markers
  • sources: Dict of source documents with metadata
  • query: Current query
  • original_query: Original user query (before decomposition)
  • subqueries: List of decomposed queries (if enabled)
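The context string itself is essentially the retrieved chunks concatenated with [Source: doc_N] markers. A simplified sketch of that concatenation (prepare_context in the answer service does the real work):

def build_context(docs: list) -> tuple[str, dict]:
    # Concatenate chunks with [Source: doc_N] markers and collect a sources dict.
    parts, sources = [], {}
    for i, doc in enumerate(docs, start=1):
        key = f"doc_{i}"
        parts.append(f"[Source: {key}]\n{doc.page_content}")
        sources[key] = {
            "title": doc.metadata.get("title"),
            "source": doc.metadata.get("source"),
            "page_content": doc.page_content,
            "metadata": doc.metadata,
        }
    return "\n\n".join(parts), sources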

6. LLM Generation

raw_answer = answer_service.generate_answer_with_prepared_context(
    context=context,
    answer_prompt=AnswerWithKeywordsPrompt,
)
Prompt Structure:
You are Olis, an intelligent assistant that answers questions using provided documents.

Context:
[Document chunks with [Source: doc_1] markers]

Question: {query}

Instructions:
- Provide a comprehensive answer using the context
- Include specific examples and details
- Cite sources using [Source: doc_id] format
- Structure your answer with:
  - callout_answer: Main answer summary
  - positive_list: Specific points/facts supporting the answer
  - negative_list: Caveats/limitations
  - info_list: Additional relevant information
  - keywords: Key terms from the answer

Format your response as JSON.
LLM Configuration:
  • Model: RAG_GENERATOR_MODEL (default: gpt-4o-mini)
  • Temperature: RAG_LLM_TEMP (default: 0.0)
  • Top-P: RAG_LLM_TOP_P (default: 0.2)
  • Max tokens: Configured via OLLAMA_NUM_PREDICT or model default

7. Post-Processing

# Flatten answer + keywords structure
payload = answer_service.flatten_ak(context)

# Coalesce citation keys ([Source: doc_1] -> sources dict)
payload = answer_service.coalesce_citation_keys(payload)

# Remove unused sources not cited in answer
payload = answer_service.prune_sources_by_citations(payload)

# Parse JSON string answers to objects
payload = _maybe_parse_answer(payload)

8. Fast Path (Empty Context)

If SKIP_LLM_ON_EMPTY_CONTEXT=true and no docs retrieved:
{
  "answer": {
    "callout_answer": "No information could be found for the question based on the provided sources.",
    "positive_list": [],
    "negative_list": [],
    "info_list": [],
    "keywords": []
  },
  "sources": {},
  "query": "..."
}
This skips the LLM call entirely, saving cost and latency when no relevant docs are found.

Query Decomposition (Advanced)

If USE_QUERY_DECOMPOSITION=true:
  1. Decompose complex query into QUERY_DECOMPOSITION_N sub-queries (default: 5)
  2. Retrieve docs for each sub-query independently
  3. Select top SUBQUERY_TOP_K docs per sub-query (default: 5)
  4. Merge and deduplicate results
  5. Reason over combined results (if SUBQUERY_REASONING_ENABLED=true)
Example:
Query: "What security challenges does Olis face and who is addressing them?"

Sub-queries:
1. "What are the security challenges in Olis?"
2. "Who is responsible for security in Olis?"
3. "What is the current security implementation status?"
4. "What security features are planned?"
5. "What security risks have been identified?"

Configuration Reference

Environment Variables

Database Configuration

Variable | Default | Description
MILVUS_URI | http://milvus:19530 | Milvus connection URI
MILVUS_COLLECTION | docs_collection | Base collection name
ES_URI | http://elasticsearch:9200 | Elasticsearch URI
ES_INDEX | docs_bm25 | Base index name

LLM Configuration

Variable | Default | Description
RAG_LLM_PROVIDER | openai | LLM provider (openai or ollama)
OLLAMA_BASE_URL | http://localhost:11434 | Ollama server URL
OLLAMA_MODEL | llama3.1 | Ollama model name
OLLAMA_NUM_PREDICT | null | Max tokens to generate
OLLAMA_NUM_CTX | null | Context window size
RAG_GENERATOR_MODEL | gpt-4o-mini | OpenAI model name
RAG_LLM_TEMP | 0.0 | LLM temperature
RAG_LLM_TOP_P | 0.2 | Top-p sampling
OPENAI_API_KEY | - | OpenAI API key

Retrieval Configuration

Variable | Default | Description
VECTOR_RETRIEVER_TOP_K | 8 | Vector search top-K
BM25_RETRIEVER_TOP_K | 8 | BM25 search top-K
RETRIEVER_TOP_K | 5 | Final top-K after fusion
WEIGHT_VECTOR | 0.5 | Vector search weight
WEIGHT_BM25 | 0.5 | BM25 search weight
USE_RERANKER | false | Enable reranking
RERANK_TOP_K | 5 | Reranker top-K
RERANKER_SCORE_THRESHOLD | 0.5 | Min reranker score
SKIP_LLM_ON_EMPTY_CONTEXT | true | Fast path when no docs

Query Processing

Variable | Default | Description
USE_QUERY_DECOMPOSITION | true | Enable query decomposition
QUERY_DECOMPOSITION_N | 5 | Number of sub-queries
SUBQUERY_TOP_K | 5 | Docs per sub-query
SUBQUERY_REASONING_ENABLED | true | Multi-hop reasoning

Memory Configuration

Variable | Default | Description
REDIS_URL | - | Redis connection URL
REDIS_HOST | redis | Redis host
REDIS_PORT | 6379 | Redis port
REDIS_DB | 0 | Redis database number
REDIS_PASSWORD | - | Redis password
REDIS_SSL | false | Use SSL for Redis
SESSION_TIMEOUT | 600 | Session TTL (seconds)
MAX_SESSION_TOKENS | 500 | Max tokens in session
THREAD_TIMEOUT | 300 | Thread TTL (seconds)

Document Ingestion

Variable | Default | Description
CHUNK_SIZE | 1000 | Chunk size (characters)
CHUNK_OVERLAP | 150 | Overlap between chunks
USE_SEMANTIC_SPLIT | false | Semantic chunking
USE_RAPTOR | true | Enable RAPTOR hierarchical indexing
EMBED_MODEL_NAME | sentence-transformers/msmarco-distilbert-base-v4 | Embedding model
EMBED_DIMENSION | 768 | Embedding dimension

Authentication

Variable | Default | Description
JWT_SECRET | - | JWT signing secret
JWT_ISSUER | oauth-app | JWT issuer claim
JWT_ALGORITHM | HS256 | JWT signing algorithm

Performance

Variable | Default | Description
PERF_LOG | false | Enable performance logging
MAX_RETRIEVAL_THREADS | 10 | Concurrent retrieval threads
MAX_INGESTION_THREADS | 10 | Concurrent ingestion threads
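All of these are plain environment variables; a configuration module that reads a few of them with the defaults above might look like this (illustrative, not the backend's actual config code):

import os

VECTOR_RETRIEVER_TOP_K = int(os.getenv("VECTOR_RETRIEVER_TOP_K", "8"))
WEIGHT_VECTOR = float(os.getenv("WEIGHT_VECTOR", "0.5"))
WEIGHT_BM25 = float(os.getenv("WEIGHT_BM25", "0.5"))
USE_RERANKER = os.getenv("USE_RERANKER", "false").lower() == "true"
SESSION_TIMEOUT = int(os.getenv("SESSION_TIMEOUT", "600"))  # seconds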

Development Guide

Local Setup

Prerequisites

  • Python 3.10+
  • Docker & Docker Compose
  • Poetry (Python dependency manager)

1. Clone Repository

git clone <repo-url>
cd olis-monorepo/apps/api-server

2. Install Dependencies

poetry install
# or
pip install -r requirements.txt

3. Start Infrastructure Services

docker-compose up -d
# Starts: Milvus, Elasticsearch, Redis

4. Set Environment Variables

cp .env.example .env
# Edit .env with your configuration
Minimal .env:
# Database
MILVUS_URI=http://localhost:19530
ES_URI=http://localhost:9200
REDIS_HOST=localhost
REDIS_PORT=6379

# LLM
RAG_LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...
RAG_GENERATOR_MODEL=gpt-4o-mini

# Auth
JWT_SECRET=your-secret-key-change-in-production
JWT_ISSUER=olis-api

5. Run Server

# Development with hot reload
uvicorn src.api_server.main:app --reload --port 8002

# Production
gunicorn src.api_server.main:app -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8002

6. Verify

curl http://localhost:8002/healthz
# Expected: {"status":"ok"}

Testing

Integration Tests

# Run integration tests (requires running server)
RUN_INTEGRATION_TESTS=1 pytest tests/integration_tests/test_fastapi.py -v
Example Test:
import requests

def test_predict_endpoint():
    url = "http://localhost:8002/predict"
    headers = {"Authorization": f"Bearer {get_test_token()}"}
    data = {
        "query": "What are the Olis features?",
        "user_info": {"user_name": "[email protected]"}
    }
    response = requests.post(url, json=data, headers=headers)
    assert response.status_code == 200
    assert "answer" in response.json()

Unit Tests

pytest tests/unit_tests/ -v

Load Testing

# Using locust
locust -f tests/load_tests/locustfile.py --host=http://localhost:8002

Debugging

Enable Debug Logging

# In main.py
logging.basicConfig(level=logging.DEBUG)

Performance Logging

# Enable performance metrics
export PERF_LOG=true

# Server will emit:
# PERF {"event":"rag_start","ts":1738425600000,"request_id":"..."}
# PERF {"event":"retrieval_done","ts":1738425601000,"retrieval_ms":342,"docs_count":5}
# PERF {"event":"rag_end","ts":1738425603000,"total_ms":3124}

Agent Debugging

Custom debug logging is available via _debug_log():
_debug_log(
    "main.py:lifespan",
    "retrieval_config",
    {"top_k": config.RETRIEVER_TOP_K, "use_reranker": config.USE_RERANKER},
    "H1"  # Hypothesis ID
)
Logs written to: DEBUG_LOG_PATH (default: .cursor/debug.log)

Performance & Monitoring

Performance Metrics

The RAG pipeline emits detailed performance metrics:
{
  "event": "rag_end",
  "request_id": "uuid",
  "query_hash": "abc123def4",
  "retrieval_ms": 342,
  "context_ms": 12,
  "llm_ms": 2456,
  "post_ms": 23,
  "total_ms": 2833,
  "docs_count": 5,
  "context_chars": 4532,
  "model": "gpt-4o-mini",
  "num_predict": 500,
  "num_ctx": 4096
}
Key Metrics:
  • retrieval_ms: Time for hybrid retrieval
  • llm_ms: Time for LLM generation
  • total_ms: End-to-end latency
  • docs_count: Number of retrieved documents
  • context_chars: Context size sent to LLM
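Since every PERF line is a JSON object after the PERF prefix, latency summaries can be pulled straight from the server logs; a small illustrative parser:

import json

def summarize_perf(log_path: str) -> None:
    totals = []
    with open(log_path) as f:
        for line in f:
            if "PERF " not in line:
                continue
            event = json.loads(line.split("PERF ", 1)[1])
            if event.get("event") == "rag_end":
                totals.append(event["total_ms"])
    if totals:
        totals.sort()
        print(f"requests={len(totals)}  p50={totals[len(totals) // 2]}ms  max={totals[-1]}ms")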

Optimization Tips

1. Tune Retrieval Parameters

# Reduce top-K for faster retrieval
VECTOR_RETRIEVER_TOP_K=5
BM25_RETRIEVER_TOP_K=5
RETRIEVER_TOP_K=3

# Disable reranker if not needed
USE_RERANKER=false

2. Use Fast Path

# Skip LLM when no docs found
SKIP_LLM_ON_EMPTY_CONTEXT=true

3. Optimize Context Size

# Limit doc/context characters
MAX_DOC_CHARS=1000
MAX_CONTEXT_CHARS=4000

4. Disable Query Decomposition

# Faster but less comprehensive
USE_QUERY_DECOMPOSITION=false

5. Use Ollama for Local LLM

# Avoid OpenAI API latency
RAG_LLM_PROVIDER=ollama
OLLAMA_MODEL=llama3.1
OLLAMA_NUM_PREDICT=300  # Faster generation

Caching Strategy

  • Session Memory: 10-minute TTL, stores user context
  • Thread Memory: 5-minute TTL, stores conversation history
  • Org Resources: Cached indefinitely per org_id
Redis Memory Usage:
  • Session: ~10KB per user
  • Thread: ~5KB per thread
  • Estimate: 100 concurrent users = ~1.5MB

Deployment

Docker Deployment

1. Build Image

cd apps/api-server
docker build -t olis-api:latest .

2. Run Container

docker run -d \
  -p 8002:8002 \
  -e MILVUS_URI=http://milvus:19530 \
  -e ES_URI=http://elasticsearch:9200 \
  -e REDIS_HOST=redis \
  -e OPENAI_API_KEY=sk-... \
  -e JWT_SECRET=production-secret \
  --name olis-api \
  olis-api:latest

3. Docker Compose

version: '3.8'
services:
  olis-api:
    build: .
    ports:
      - "8002:8002"
    environment:
      MILVUS_URI: http://milvus:19530
      ES_URI: http://elasticsearch:9200
      REDIS_HOST: redis
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      JWT_SECRET: ${JWT_SECRET}
    depends_on:
      - milvus
      - elasticsearch
      - redis

  milvus:
    image: milvusdb/milvus:latest
    ports:
      - "19530:19530"
    volumes:
      - milvus_data:/var/lib/milvus

  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    ports:
      - "9200:9200"
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    volumes:
      - es_data:/usr/share/elasticsearch/data

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data

volumes:
  milvus_data:
  es_data:
  redis_data:

Production Considerations

1. Security

  • Never expose /testpredict in production (bypasses auth)
  • Use strong JWT_SECRET (32+ chars, random)
  • Enable HTTPS/TLS for all connections
  • Lock down CORS: allow_origins=["https://yourdomain.com"]

2. Scaling

  • Horizontal Scaling: Run multiple API instances behind load balancer
  • Stateless Design: All state in Redis, safe to scale
  • Database Scaling:
    • Milvus: Standalone → Cluster mode
    • Elasticsearch: Single node → Cluster
    • Redis: Single instance → Redis Cluster/Sentinel

3. Monitoring

  • Health Checks: /healthz endpoint for load balancer
  • Metrics: Export performance logs to DataDog/Prometheus
  • Alerting: Monitor total_ms > 5000ms, error rates

4. Backup

  • Milvus: Regular snapshots of /var/lib/milvus
  • Elasticsearch: Snapshot and restore API
  • Redis: RDB/AOF persistence enabled

Troubleshooting

Common Issues

1. “JWT_SECRET not configured”

Symptom: All API calls return 503
Solution:
export JWT_SECRET=your-secret-key
# Restart server

2. Milvus Connection Failed

Symptom: Server fails to start with a Milvus error
Check:
# Verify Milvus is running (the HTTP health endpoint is served on port 9091; 19530 is the gRPC port)
curl http://localhost:9091/healthz

# Check logs
docker logs milvus-standalone
Solution:
# Restart Milvus
docker-compose restart milvus

# Verify connection
export MILVUS_URI=http://localhost:19530

3. Empty RAG Results

Symptom: All queries return empty answers
Debug:
# Check if documents exist
from pymilvus import connections, utility
connections.connect(uri="http://localhost:19530")
stats = utility.get_query_segment_info("docs_collection")
print(f"Docs in collection: {stats}")
Solution:
  • Ingest documents via /upsert
  • Check ACLs match user permissions
  • Verify embeddings are generated correctly

4. Slow Query Performance

Symptom: total_ms > 5000ms
Profile:
export PERF_LOG=true
# Check which phase is slow:
# - retrieval_ms > 1000? Optimize Milvus/ES
# - llm_ms > 3000? Use smaller model or reduce context
# - context_ms > 500? Too many docs
Optimize:
# Reduce retrieval size
RETRIEVER_TOP_K=3

# Disable query decomposition
USE_QUERY_DECOMPOSITION=false

# Use faster LLM
RAG_GENERATOR_MODEL=gpt-4o-mini
OLLAMA_NUM_PREDICT=300

5. Redis Connection Errors

Symptom: “Error connecting to Redis”
Check:
redis-cli -h localhost -p 6379 ping
# Expected: PONG
Solution:
# Check Redis is running
docker ps | grep redis

# Verify connection params
echo $REDIS_HOST $REDIS_PORT

# Test connection
python -c "import redis; r=redis.Redis(host='localhost', port=6379); print(r.ping())"

API Design Patterns

Dependency Injection

FastAPI’s dependency injection provides clean resource access:
@app.post("/predict")
async def predict(
    input: QueryInput,
    resources: AppResources = Depends(get_resources),
    current_user: Dict[str, Any] = Depends(get_current_user),
):
    # resources and current_user automatically injected
    pass

Background Tasks

Expensive operations run asynchronously:
background_tasks.add_task(
    _update_memory_background,
    resources.session_memory,
    user_id,
    memory
)
# Response returns immediately, memory updated in background

Lifespan Context Manager

Resources initialized once at startup, cleaned up at shutdown:
@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: Initialize databases, LLM, etc.
    app.state.resources = AppResources(...)

    yield  # Server runs here

    # Shutdown: Close connections
    connections.disconnect("default")

Advanced Topics

Custom Prompts

Modify prompts in backend/variables/prompts.py:
AnswerWithKeywordsPrompt = """
You are Olis, an AI assistant that provides accurate answers based on provided context.

Context:
{context}

Question: {query}

Provide a JSON response with:
- callout_answer: Main answer (2-3 sentences)
- positive_list: Key points (list)
- negative_list: Caveats (list)
- info_list: Additional info (list)
- keywords: Relevant keywords (list)
"""

Multi-Tenant Isolation

Each organization gets isolated resources:
# Org A: docs_collection_orgaid, docs_bm25_orgaid
# Org B: docs_collection_orgbid, docs_bm25_orgbid

# Hard isolation at database level
org_res = resources.org_manager.get(org_id)
retriever = org_res["retriever"]  # Only sees org's data

Streaming Responses

For real-time LLM streaming:
from fastapi.responses import StreamingResponse

@app.post("/predict-stream")
async def predict_stream(input: QueryInput):
    async def generate():
        async for chunk in llm.astream(prompt):
            yield f"data: {chunk}\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")

Support & Contact

For questions or issues:
Last Updated: 2026-02-10
Version: 1.0.0
Author: Olis Backend Team