toranb/encoder-search

Data Engineering

Get JSON for each translation

  curl https://bolls.life/static/translations/NASB.json > nasb.json
  curl https://bolls.life/static/translations/NIV.json > niv.json
  curl https://bolls.life/static/translations/ESV.json > esv.json
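
Before generating the datasets it is worth a quick sanity check of what came down. This sketch assumes Jason is available for JSON decoding; swap in whatever JSON library the project actually uses.

  "nasb.json"
  |> File.read!()
  |> Jason.decode!()
  |> List.first()
  |> IO.inspect(label: "first verse record")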

Generate each CSV dataset from the JSON

  Example.Dataset.gen_bible("nasb")
  Example.Dataset.gen_bible("niv")
  Example.Dataset.gen_bible("esv")

Clean up the NASB CSV, then the NIV and ESV, with these vim substitutions

  %s/\[\([^]]*\)\]/\1/g
  %s/’/'/g
  %s/‘/'/g
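
The first substitution strips square brackets while keeping the text inside them; the other two normalize curly apostrophes to plain quotes. If you would rather script the cleanup than run the vim commands by hand, a rough Elixir equivalent looks like this:

  clean = fn line ->
    line
    |> String.replace(~r/\[([^\]]*)\]/, "\\1")  # drop the brackets, keep their contents
    |> String.replace(["’", "‘"], "'")          # normalize curly apostrophes
  end

  clean.("He said, ‘[Selah]’")
  #=> "He said, 'Selah'"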

Generate the context-length 100 chunks for all three translations

  Example.Prep.generate("nasb", encoder: true)
  Example.Prep.generate("niv", encoder: true)
  Example.Prep.generate("esv", encoder: true)

Combine them all into a single CSV

  mkdir combined
  cd combined
  cp ../nasbtraining.csv .
  cp ../nivtraining.csv .
  cp ../esvtraining.csv .
  cat *.csv > pretraining.csv

Shuffle and dedupe the pretraining data

  Example.Prep.shuffle()
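
Conceptually this step dedupes and reorders the combined CSV. A sketch of the same idea, writing to a hypothetical shuffled.csv:

  "combined/pretraining.csv"
  |> File.read!()
  |> String.split("\n", trim: true)
  |> Enum.uniq()
  |> Enum.shuffle()
  |> Enum.join("\n")
  |> then(&File.write!("combined/shuffled.csv", &1))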

Training

  Example.Encoder.scheduled(280, 0.00017)
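
The arguments to scheduled/2 are presumably a training budget (steps or epochs) and a peak learning rate, with "scheduled" implying a learning-rate schedule. Purely to illustrate the idea, not the project's actual schedule, a linear warmup followed by cosine decay can be written as:

  # Hypothetical warmup + cosine-decay schedule, just to illustrate what a
  # "scheduled" learning rate means; the project's actual schedule may differ.
  defmodule ScheduleSketch do
    def lr(step, base_lr, warmup_steps, total_steps) do
      if step < warmup_steps do
        base_lr * step / warmup_steps
      else
        progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
        0.5 * base_lr * (1 + :math.cos(:math.pi() * progress))
      end
    end
  end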

Evaluation

  Example.Scoring.evaluate()
  Example.Scoring.comprehensive_eval()

Prepare the NLT dataset

  curl https://bolls.life/static/translations/NLT.json > nlt.json
  Example.Dataset.gen_bible("nlt")
  ## vim find/replace
  %s/<br>/ /g
  %s/<i>\(.\{-}\)<\/i>/\1/g
  %s/’/'/g
  %s/‘/'/g

Seed the database from a given text

  Example.Utils.seed("nlt")
  Example.Utils.add_verse_embeddings()
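
add_verse_embeddings/0 presumably runs each verse through the trained encoder and writes the vector into a pgvector column. Purely as a generic illustration (not the project's encoder), a Bumblebee text-embedding serving with a hypothetical off-the-shelf model produces a vector you could store the same way:

  # Generic embedding sketch with a hypothetical Hugging Face model; the project
  # uses its own trained encoder instead.
  repo = {:hf, "sentence-transformers/all-MiniLM-L6-v2"}
  {:ok, model_info} = Bumblebee.load_model(repo)
  {:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
  serving = Bumblebee.Text.text_embedding(model_info, tokenizer)

  %{embedding: embedding} = Nx.Serving.run(serving, "In the beginning God created the heavens and the earth.")
  Nx.to_flat_list(embedding)  # a plain list, ready for a pgvector column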

Index verses for BM25 search

  Example.Verse.index_verses()
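
For reference, BM25 ranks a verse by summing, over the query terms, an IDF weight times a saturated term-frequency factor. A sketch of the per-term score (not the project's implementation):

  defmodule Bm25Sketch do
    @k1 1.2
    @b 0.75

    # tf: term frequency in the document, df: documents containing the term,
    # n: total documents, dl: document length, avgdl: average document length
    def term_score(tf, df, n, dl, avgdl) do
      idf = :math.log(1 + (n - df + 0.5) / (df + 0.5))
      idf * (tf * (@k1 + 1)) / (tf + @k1 * (1 - @b + @b * dl / avgdl))
    end
  end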

Run the app and search

  iex -S mix phx.server

Run the BEIR Benchmark

  # uncomment last index for ilike perf
  vim priv/repo/migrations/20250718010521_add_bm25_stats.exs
  mix ecto.reset
  # setup python venv
  uv python install 3.13
  uv venv
  source .venv/bin/activate
  uv sync
  python beir.py
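
NDCG@10, the headline relevance number below, discounts each relevant result by its rank and normalizes against the ideal ordering. A reference sketch of the metric (not the benchmark's code):

  defmodule NdcgSketch do
    # rels is the list of graded relevance labels in ranked order
    def dcg(rels) do
      rels
      |> Enum.with_index(1)
      |> Enum.map(fn {rel, rank} -> (:math.pow(2, rel) - 1) / :math.log2(rank + 1) end)
      |> Enum.sum()
    end

    def ndcg_at_k(rels, k) do
      ideal = rels |> Enum.sort(:desc) |> Enum.take(k) |> dcg()
      if ideal == 0, do: 0.0, else: dcg(Enum.take(rels, k)) / ideal
    end
  end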

Benchmark results

================================================================================
BENCHMARK RESULTS COMPARISON
================================================================================
Dataset: nfcorpus | Total Queries: 324
--------------------------------------------------------------------------------
Metric                    | BM25            | ILIKE           | Improvement
--------------------------------------------------------------------------------
NDCG@10 (relevance)       | 0.2940          | 0.1105          |         +166.1%
QPS (throughput)          | 4.86            | 225.36          |          -97.8%
Avg Latency (ms)          | 205.94          | 4.44            |        -4541.1%
Precision@10              | 0.2806          | 0.1747          |          +60.6%
Recall@10                 | 0.1302          | 0.0454          |        +0.0849pp
Recall % (found relevant) | 66.7            | 28.1            |          +38.6pp
--------------------------------------------------------------------------------
Total Time                | 66.73           | 1.44            | (seconds)
================================================================================

KEY TAKEAWAYS:
  ✓ BM25 provides 166.1% better relevance ranking
  ✓ BM25 retrieves more relevant documents within the top 10 (Precision@10)
  ✓ BM25 finds a larger proportion of all relevant documents (Recall@10)
  ⚠ BM25 is 97.8% slower, but much more relevant
  ✓ BM25 finds relevant results for 38.6 percentage points more queries (66.7% vs 28.1%)

About

Nx-based encoder + search with pgvector
