Retrieval & RAG¶
Ekklesia uses a hybrid retrieval strategy that combines dense (vector) search and sparse (full-text) search across three source types, fused with Reciprocal Rank Fusion. The corpus lives entirely in PostgreSQL with pgvector.
Hybrid search pipeline¶
```mermaid
flowchart TD
    Q[query string] --> EMBED["embed_query (gemini-embedding-001, 768-dim)"]
    Q --> FTS["websearch_to_tsquery (PostgreSQL FTS)"]
    EMBED --> DENSE1[dense: bible_passages]
    EMBED --> DENSE2[dense: commentary_chunks]
    EMBED --> DENSE3[dense: cultural_entries]
    FTS --> SPARSE1[sparse: bible_passages]
    FTS --> SPARSE2[sparse: commentary_chunks]
    FTS --> SPARSE3[sparse: cultural_entries]
    DENSE1 --> RRF
    DENSE2 --> RRF
    DENSE3 --> RRF
    SPARSE1 --> RRF
    SPARSE2 --> RRF
    SPARSE3 --> RRF
    RRF["Reciprocal Rank Fusion (rrf_k=60)"] --> RESULTS[top_k RetrievalResult]
```
All queries within a single hybrid_search() call run concurrently via asyncio.gather. A shared asyncio.Lock serialises access to the SQLAlchemy session, since an AsyncSession does not support concurrent use by multiple tasks.
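The pattern can be reproduced with a toy session object. Names here are illustrative; only the asyncio.gather plus asyncio.Lock combination mirrors the real code:

```python
import asyncio


class FakeSession:
    """Stand-in for a SQLAlchemy AsyncSession, which must never be
    used by two tasks at the same time."""

    def __init__(self) -> None:
        self._busy = False

    async def execute(self, stmt: str) -> str:
        assert not self._busy, "concurrent session use detected"
        self._busy = True
        await asyncio.sleep(0.01)  # simulate a database round-trip
        self._busy = False
        return f"rows for {stmt}"


async def run_subqueries(session: FakeSession, statements: list[str]) -> list[str]:
    lock = asyncio.Lock()

    async def run(stmt: str) -> str:
        async with lock:  # serialise access to the shared session
            return await session.execute(stmt)

    # gather launches every sub-query task at once; the lock makes the
    # session calls take turns while other awaits can still overlap.
    return await asyncio.gather(*(run(s) for s in statements))


results = asyncio.run(run_subqueries(FakeSession(), ["dense:bible", "sparse:bible"]))
```

Without the lock, the second `execute` would begin while the first is mid-flight and trip the assertion, which is exactly the failure mode the real lock prevents.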
Source types¶
| Source type | Table | Text column | Reference column |
|---|---|---|---|
| bible | bible_passages | text | reference |
| commentary | commentary_chunks | chunk_text | reference |
| cultural | cultural_entries | text | title |
Each source type can be included or excluded per call. The Research agent queries all three; the Exegesis agent queries bible and commentary only.
Dense retrieval (vector search)¶
Embeddings are stored as Vector(768) columns. Similarity is measured by cosine distance using the <=> operator:
```sql
SELECT id, reference, text, 1 - (embedding <=> CAST(:qvec AS vector)) AS score
FROM bible_passages
WHERE embedding IS NOT NULL
ORDER BY embedding <=> CAST(:qvec AS vector)
LIMIT :limit
```
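Issued from Python, the statement binds the query vector as an ordinary string parameter; pgvector parses the bracketed literal form when it is cast to vector. A sketch (the real query builder may differ):

```python
from sqlalchemy import text

DENSE_SQL = text("""
    SELECT id, reference, text,
           1 - (embedding <=> CAST(:qvec AS vector)) AS score
    FROM bible_passages
    WHERE embedding IS NOT NULL
    ORDER BY embedding <=> CAST(:qvec AS vector)
    LIMIT :limit
""")


def vector_literal(qvec: list[float]) -> str:
    # pgvector accepts '[0.1,0.2,...]' as input for a CAST to vector
    return "[" + ",".join(str(x) for x in qvec) + "]"


async def dense_search(session, qvec: list[float], limit: int = 10):
    rows = await session.execute(
        DENSE_SQL, {"qvec": vector_literal(qvec), "limit": limit}
    )
    return rows.mappings().all()
```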
HNSW indexes are configured for cosine similarity with m=16, ef_construction=64:
```python
Index(
    "ix_bible_passages_embedding",
    bible_passages.c.embedding,
    postgresql_using="hnsw",
    postgresql_with={"m": 16, "ef_construction": 64},
    postgresql_ops={"embedding": "vector_cosine_ops"},
)
```
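Query-time recall for HNSW scans is governed by pgvector's hnsw.ef_search setting (default 40); raising it trades speed for recall. This knob is not part of the code above and is shown only as a sketch:

```python
from sqlalchemy import text

# pgvector GUC: size of the candidate list kept during an HNSW scan.
# Default is 40; larger values improve recall at the cost of latency.
SET_EF_SEARCH = text("SET hnsw.ef_search = 100")


async def widen_hnsw_recall(session) -> None:
    # Applies to subsequent queries on this session/connection.
    await session.execute(SET_EF_SEARCH)
```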
Sparse retrieval (full-text search)¶
Each table has a computed TSVECTOR column and a GIN index:
```sql
-- computed column (persisted)
to_tsvector('english', pericope_title || ' ' || text)

-- query
SELECT ..., ts_rank(search_vector, websearch_to_tsquery('english', :q)) AS score
FROM bible_passages
WHERE search_vector @@ websearch_to_tsquery('english', :q)
ORDER BY score DESC
LIMIT :limit
```
websearch_to_tsquery parses natural-language search syntax (implicit AND between words, or for alternation, negation with -, phrases with "...") and never raises a syntax error on malformed input, so agent queries can use ordinary prose without SQL injection risk.
Reciprocal Rank Fusion¶
RRF (retrieval/hybrid.py: _rrf_fuse) combines ranked result lists without requiring comparable scores: each document's fused score is the sum of 1 / (rrf_k + rank) over every list in which it appears, with rank 1-based. Default rrf_k=60. Documents that appear at high rank across multiple lists accumulate the highest fused scores. The final list is deduplicated by (source_type, id).
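A minimal sketch of the fusion step (the real implementation lives in retrieval/hybrid.py; this version keeps only the scoring logic):

```python
from collections import defaultdict


def rrf_fuse(ranked_lists: list[list[tuple]], rrf_k: int = 60) -> list[tuple]:
    """Fuse ranked result lists with Reciprocal Rank Fusion.

    Each inner list holds (source_type, id) keys in rank order. A
    document's fused score is sum(1 / (rrf_k + rank)) over every list
    it appears in (rank is 1-based), so agreement across lists wins.
    """
    scores: dict[tuple, float] = defaultdict(float)
    for results in ranked_lists:
        for rank, key in enumerate(results, start=1):
            scores[key] += 1.0 / (rrf_k + rank)
    # Dict keys deduplicate by (source_type, id); sort best-first.
    return sorted(scores, key=scores.__getitem__, reverse=True)
```

For example, a document ranked 2nd in the dense list and 1st in the sparse list scores 1/62 + 1/61, beating a document that tops only one list.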
Embedding model¶
File: src/ekklesia/retrieval/embedding.py
```python
async def embed_query(text: str) -> list[float]:
    with logfire.span("embed_query", model="gemini-embedding-001"):
        # POST to the Gemini embedContent API;
        # returns a list[float] of length 768
        ...
```
The same gemini-embedding-001 model is used at both ingestion time and query time, ensuring the vector space is consistent.
- Model: gemini-embedding-001
- Embedding dimension: 768
- Fallback: if the Gemini API is unavailable during tests, conftest.py provides an autouse mock returning [0.1] * 768.
Database schema¶
```mermaid
erDiagram
    bible_passages {
        string id PK
        string translation
        string book
        smallint book_id
        smallint start_chapter
        smallint start_verse
        smallint end_chapter
        smallint end_verse
        string reference
        string pericope_title
        text text
        smallint verse_count
        string testament
        vector_768 embedding
        tsvector search_vector
    }
    bible_verses {
        int id PK
        smallint book_id
        string book
        smallint chapter
        smallint verse
        text text
        string translation
    }
    commentary_chunks {
        string id PK
        string source
        string book
        smallint start_chapter
        smallint start_verse
        string reference
        text chunk_text
        string parent_id FK
        smallint position_in_parent
        vector_768 embedding
        tsvector search_vector
    }
    commentary_parents {
        string id PK
        string source
        string book
        string reference
        text full_text
    }
    cultural_entries {
        string id PK
        string source
        string title
        text text
        array related_references
        string category
        vector_768 embedding
        tsvector search_vector
    }
    lexicon_entries {
        string strongs_number PK
        string language
        string lemma
        string transliteration
        text definition
        text kjv_usage
        int occurrences
    }
    cross_references {
        int id PK
        string from_verse_id
        string to_verse_id
        smallint to_verse_end
        int votes
    }
    commentary_chunks }o--|| commentary_parents : parent_id
```
Ingestion¶
Ingestion populates the corpus from data files in data/. Each script runs independently and is idempotent (uses upsert semantics).
Order of operations¶
1. ingest_lexicon.py — no dependencies
2. ingest_cross_references.py — no dependencies
3. ingest_bible.py — calls Gemini embedding API (batch)
4. ingest_commentary.py — calls Gemini embedding API (batch)
Lexicon and cross-reference data do not require embeddings and can be ingested without a Gemini API key.
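The upsert semantics can be sketched with SQLAlchemy's PostgreSQL ON CONFLICT support. The table definition below is a minimal stand-in (only three of lexicon_entries' columns), not the project's actual model:

```python
from sqlalchemy import Column, MetaData, String, Table, Text
from sqlalchemy.dialects.postgresql import insert

metadata = MetaData()

# Simplified stand-in for the real lexicon_entries table.
lexicon_entries = Table(
    "lexicon_entries", metadata,
    Column("strongs_number", String, primary_key=True),
    Column("lemma", String),
    Column("definition", Text),
)


def upsert_stmt(rows: list[dict]):
    stmt = insert(lexicon_entries).values(rows)
    # On re-ingestion, overwrite the existing row instead of failing,
    # which is what makes the script safe to run repeatedly.
    return stmt.on_conflict_do_update(
        index_elements=["strongs_number"],
        set_={
            "lemma": stmt.excluded.lemma,
            "definition": stmt.excluded.definition,
        },
    )
```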
Running ingestion (Docker)¶
```bash
# Start the stack first
docker compose up -d db api

# Wait for db to be healthy, then run each script
docker compose exec api python scripts/ingest_lexicon.py
docker compose exec api python scripts/ingest_cross_references.py
docker compose exec api python scripts/ingest_bible.py
docker compose exec api python scripts/ingest_commentary.py
```
Re-ingestion after embedding migration¶
After the 384→768 migration (b2f4c8e1d937), the Alembic upgrade sets all existing embeddings to NULL. All three embedding scripts must then be re-run to restore semantic search capability.