Retrieval & RAG

Ekklesia uses a hybrid retrieval strategy that combines dense (vector) search and sparse (full-text) search across three source types, fused with Reciprocal Rank Fusion. The corpus lives entirely in PostgreSQL with pgvector.

Hybrid search pipeline

flowchart TD
    Q[query string] --> EMBED["embed_query (gemini-embedding-001, 768-dim)"]
    Q --> FTS["websearch_to_tsquery (PostgreSQL FTS)"]
    EMBED --> DENSE1[dense: bible_passages]
    EMBED --> DENSE2[dense: commentary_chunks]
    EMBED --> DENSE3[dense: cultural_entries]
    FTS --> SPARSE1[sparse: bible_passages]
    FTS --> SPARSE2[sparse: commentary_chunks]
    FTS --> SPARSE3[sparse: cultural_entries]
    DENSE1 --> RRF
    DENSE2 --> RRF
    DENSE3 --> RRF
    SPARSE1 --> RRF
    SPARSE2 --> RRF
    SPARSE3 --> RRF
    RRF["Reciprocal Rank Fusion (rrf_k=60)"] --> RESULTS[top_k RetrievalResult]

All queries within a single hybrid_search() call run concurrently via asyncio.gather. A shared asyncio.Lock serialises access to the SQLAlchemy session, which is not safe for concurrent use.
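The fan-out can be sketched as follows. The inner search coroutine is a hypothetical stand-in for the real dense/sparse query functions, which share one SQLAlchemy session behind the lock:

```python
import asyncio

async def hybrid_search_sketch(query: str) -> list[list[str]]:
    # one lock per hybrid_search() call guards the shared session
    session_lock = asyncio.Lock()

    async def search(kind: str, table: str) -> list[str]:
        # hypothetical stand-in for a dense or sparse sub-search
        async with session_lock:  # serialise session access
            await asyncio.sleep(0)  # placeholder for the DB round-trip
            return [f"{kind}:{table}:{query}"]

    tables = ["bible_passages", "commentary_chunks", "cultural_entries"]
    coros = [search(kind, table) for kind in ("dense", "sparse") for table in tables]
    # all six sub-searches are scheduled concurrently; the lock serialises
    # only the session-touching critical sections
    return await asyncio.gather(*coros)

results = asyncio.run(hybrid_search_sketch("manna"))
```

gather preserves submission order, so the six result lists come back in a stable dense-then-sparse order regardless of completion timing.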

Source types

Source type   Table               Text column   Reference column
bible         bible_passages      text          reference
commentary    commentary_chunks   chunk_text    reference
cultural      cultural_entries    text          title

Each source type can be included or excluded per call. The Research agent queries all three; the Exegesis agent queries bible and commentary only.
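The routing in the table above can be captured as a small mapping. The names here are illustrative, not the actual module's:

```python
from typing import Literal

SourceType = Literal["bible", "commentary", "cultural"]

# (table, text column, reference column) per source type, from the table above
SOURCES: dict[SourceType, tuple[str, str, str]] = {
    "bible": ("bible_passages", "text", "reference"),
    "commentary": ("commentary_chunks", "chunk_text", "reference"),
    "cultural": ("cultural_entries", "text", "title"),
}

def tables_for(source_types: list[SourceType]) -> list[str]:
    # resolve which tables a hybrid_search() call will touch
    return [SOURCES[s][0] for s in source_types]

exegesis_tables = tables_for(["bible", "commentary"])
```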

Embeddings are stored as Vector(768) columns. pgvector's <=> operator returns cosine distance; the queries report similarity as 1 minus that distance:

SELECT id, reference, text, 1 - (embedding <=> CAST(:qvec AS vector)) AS score
FROM bible_passages
WHERE embedding IS NOT NULL
ORDER BY embedding <=> CAST(:qvec AS vector)
LIMIT :limit
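The <=> operator returns cosine distance, so the score column above is 1 minus that distance. The same arithmetic in plain Python, for illustration only; pgvector computes this server-side:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """What pgvector's <=> operator computes: 1 minus the cosine of the angle."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

q = [1.0, 0.0]
doc = [1.0, 0.0]                      # identical direction
score = 1.0 - cosine_distance(q, doc)  # what the SELECT reports as score
```

Identical vectors score 1.0; orthogonal vectors score 0.0; opposite vectors score -1.0, which is why ORDER BY ascending distance and ORDER BY descending score agree.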

HNSW indexes are configured for cosine similarity with m=16, ef_construction=64:

Index(
    "ix_bible_passages_embedding",
    bible_passages.c.embedding,
    postgresql_using="hnsw",
    postgresql_with={"m": 16, "ef_construction": 64},
    postgresql_ops={"embedding": "vector_cosine_ops"},
)

Each table has a computed TSVECTOR column and a GIN index:

-- computed column (persisted)
to_tsvector('english', pericope_title || ' ' || text)

-- query
SELECT ..., ts_rank(search_vector, websearch_to_tsquery('english', :q)) AS score
FROM bible_passages
WHERE search_vector @@ websearch_to_tsquery('english', :q)
ORDER BY score DESC
LIMIT :limit

websearch_to_tsquery accepts web-search-style syntax (unquoted words are ANDed together, "or" means OR, - negates a term, and "..." matches a phrase) and never raises a syntax error on malformed input, so agent queries can be passed through as ordinary prose. The query string is bound as a parameter, so there is no SQL injection risk.

Reciprocal Rank Fusion

RRF (retrieval/hybrid.py: _rrf_fuse) combines ranked result lists without requiring comparable scores:

score(doc) = Σ  1 / (rrf_k + rank_i)
             i

where rank_i is doc's 1-based rank in result list i; lists that do not contain doc contribute nothing. Default rrf_k=60. Documents that rank near the top of multiple lists accumulate the highest fused scores. The final list is deduplicated by (source_type, id), with contributions summed across lists before truncating to top_k.
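A self-contained sketch of the fusion step. The (source_type, id) tuples here are illustrative keys; the real _rrf_fuse operates on RetrievalResult objects:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[tuple[str, str]]],
             rrf_k: int = 60, top_k: int = 10) -> list[tuple[str, str]]:
    """Fuse ranked lists of (source_type, id) keys via Reciprocal Rank Fusion."""
    scores: defaultdict[tuple[str, str], float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, key in enumerate(ranked, start=1):
            # duplicate keys accumulate score across lists (the dedup step)
            scores[key] += 1.0 / (rrf_k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)[:top_k]

dense = [("bible", "gen-1"), ("commentary", "c-9")]
sparse = [("bible", "gen-1"), ("cultural", "k-3")]
fused = rrf_fuse([dense, sparse])
```

Here ("bible", "gen-1") appears at rank 1 in both lists, scoring 2/61, and outranks the single-list hits at 1/62 each. No score normalisation is needed, which is the point of RRF: dense cosine scores and ts_rank values are never compared directly.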

Embedding model

File: src/ekklesia/retrieval/embedding.py

async def embed_query(text: str) -> list[float]:
    with logfire.span("embed_query", model="gemini-embedding-001"):
        # POST to Gemini embedContent API
        # Returns list[float] of length 768

The same gemini-embedding-001 model is used at both ingestion time and query time, ensuring the vector space is consistent.

Embedding dimension: 768
Model: gemini-embedding-001
Fallback: If the Gemini API is unavailable during tests, conftest.py provides an autouse mock returning [0.1] * 768.

Database schema

erDiagram
    bible_passages {
        string id PK
        string translation
        string book
        smallint book_id
        smallint start_chapter
        smallint start_verse
        smallint end_chapter
        smallint end_verse
        string reference
        string pericope_title
        text text
        smallint verse_count
        string testament
        vector_768 embedding
        tsvector search_vector
    }
    bible_verses {
        int id PK
        smallint book_id
        string book
        smallint chapter
        smallint verse
        text text
        string translation
    }
    commentary_chunks {
        string id PK
        string source
        string book
        smallint start_chapter
        smallint start_verse
        string reference
        text chunk_text
        string parent_id FK
        smallint position_in_parent
        vector_768 embedding
        tsvector search_vector
    }
    commentary_parents {
        string id PK
        string source
        string book
        string reference
        text full_text
    }
    cultural_entries {
        string id PK
        string source
        string title
        text text
        array related_references
        string category
        vector_768 embedding
        tsvector search_vector
    }
    lexicon_entries {
        string strongs_number PK
        string language
        string lemma
        string transliteration
        text definition
        text kjv_usage
        int occurrences
    }
    cross_references {
        int id PK
        string from_verse_id
        string to_verse_id
        smallint to_verse_end
        int votes
    }
    commentary_chunks }o--|| commentary_parents : parent_id

Ingestion

Ingestion populates the corpus from data files in data/. Each script runs independently and is idempotent (uses upsert semantics).
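Upsert-based idempotency means a re-run converges to the same rows rather than duplicating them. In miniature (pure Python standing in for ON CONFLICT ... DO UPDATE):

```python
def upsert(table: dict[str, dict], rows: list[dict]) -> None:
    """Insert-or-update by primary key: re-running with the same rows is a no-op."""
    for row in rows:
        table[row["id"]] = row  # ON CONFLICT (id) DO UPDATE, in spirit

corpus: dict[str, dict] = {}
batch = [{"id": "gen-1", "text": "In the beginning..."}]
upsert(corpus, batch)
upsert(corpus, batch)  # second run changes nothing
```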

Order of operations

1. ingest_lexicon.py            — no dependencies
2. ingest_cross_references.py   — no dependencies
3. ingest_bible.py              — calls Gemini embedding API (batch)
4. ingest_commentary.py         — calls Gemini embedding API (batch)

Lexicon and cross-reference data do not require embeddings and can be ingested without a Gemini API key.

Running ingestion (Docker)

# Start the stack first
docker compose up -d db api

# Wait for db to be healthy, then run each script
docker compose exec api python scripts/ingest_lexicon.py
docker compose exec api python scripts/ingest_cross_references.py
docker compose exec api python scripts/ingest_bible.py
docker compose exec api python scripts/ingest_commentary.py

Re-ingestion after embedding migration

After the 384→768 migration (b2f4c8e1d937), all existing embeddings are set to NULL. The Alembic migration uses:

ALTER COLUMN embedding TYPE vector(768) USING NULL::vector(768)

The embedding scripts must be re-run after upgrading to repopulate all three embedding columns and restore semantic search.