toranb/encoder-search

Data Engineering

Get JSON for each translation

  curl https://bolls.life/static/translations/NASB.json > nasb.json
  curl https://bolls.life/static/translations/NIV.json > niv.json
  curl https://bolls.life/static/translations/ESV.json > esv.json
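
Before generating the datasets it is worth a quick sanity check of what came down. This sketch assumes Jason is available for JSON decoding; swap in whatever JSON library the project actually uses.

  "nasb.json"
  |> File.read!()
  |> Jason.decode!()
  |> List.first()
  |> IO.inspect(label: "first verse record")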

Generate each CSV dataset from the JSON

  Example.Dataset.gen_bible("nasb")
  Example.Dataset.gen_bible("niv")
  Example.Dataset.gen_bible("esv")

Clean up the NASB CSV, then the NIV and ESV, with these vim substitutions

  %s/\[\([^]]*\)\]/\1/g
  %s/’/'/g
  %s/‘/'/g
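
The first substitution strips square brackets while keeping the text inside them; the other two normalize curly apostrophes to plain quotes. If you would rather script the cleanup than run the vim commands by hand, a rough Elixir equivalent looks like this:

  clean = fn line ->
    line
    |> String.replace(~r/\[([^\]]*)\]/, "\\1")  # drop the brackets, keep their contents
    |> String.replace(["’", "‘"], "'")          # normalize curly apostrophes
  end

  clean.("He said, ‘[Selah]’")
  #=> "He said, 'Selah'"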

Generate the context-length 100 chunks for all three translations

  Example.Prep.generate("nasb", encoder: true)
  Example.Prep.generate("niv", encoder: true)
  Example.Prep.generate("esv", encoder: true)

Combine them all into a single CSV

  mkdir combined
  cd combined
  cp ../nasbtraining.csv .
  cp ../nivtraining.csv .
  cp ../esvtraining.csv .
  cat *.csv > pretraining.csv

Shuffle and dedupe the pretraining data

  Example.Prep.shuffle()
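
Conceptually this step dedupes and reorders the combined CSV. A sketch of the same idea, writing to a hypothetical shuffled.csv:

  "combined/pretraining.csv"
  |> File.read!()
  |> String.split("\n", trim: true)
  |> Enum.uniq()
  |> Enum.shuffle()
  |> Enum.join("\n")
  |> then(&File.write!("combined/shuffled.csv", &1))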

Training

  Example.Encoder.scheduled(280, 0.00017)
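
The arguments to scheduled/2 are presumably a training budget (steps or epochs) and a peak learning rate, with "scheduled" implying a learning-rate schedule. Purely to illustrate the idea, not the project's actual schedule, a linear warmup followed by cosine decay can be written as:

  # Hypothetical warmup + cosine-decay schedule, just to illustrate what a
  # "scheduled" learning rate means; the project's actual schedule may differ.
  defmodule ScheduleSketch do
    def lr(step, base_lr, warmup_steps, total_steps) do
      if step < warmup_steps do
        base_lr * step / warmup_steps
      else
        progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
        0.5 * base_lr * (1 + :math.cos(:math.pi() * progress))
      end
    end
  end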

Evaluation

  Example.Scoring.evaluate()
  Example.Scoring.comprehensive_eval()

Prepare the NLT dataset

  curl https://bolls.life/static/translations/NLT.json > nlt.json
  Example.Dataset.gen_bible("nlt")
  ## vim find/replace
  %s/<br>/ /g
  %s/<i>\(.\{-}\)<\/i>/\1/g
  %s/’/'/g
  %s/‘/'/g

Seed the database from a given text

  Example.Utils.seed("nlt")
  Example.Utils.add_verse_embeddings()
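
add_verse_embeddings/0 presumably runs each verse through the trained encoder and writes the vector into a pgvector column. Purely as a generic illustration (not the project's encoder), a Bumblebee text-embedding serving with a hypothetical off-the-shelf model produces a vector you could store the same way:

  # Generic embedding sketch with a hypothetical Hugging Face model; the project
  # uses its own trained encoder instead.
  repo = {:hf, "sentence-transformers/all-MiniLM-L6-v2"}
  {:ok, model_info} = Bumblebee.load_model(repo)
  {:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
  serving = Bumblebee.Text.text_embedding(model_info, tokenizer)

  %{embedding: embedding} = Nx.Serving.run(serving, "In the beginning God created the heavens and the earth.")
  Nx.to_flat_list(embedding)  # a plain list, ready for a pgvector column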

Index verses for BM25 search

  Example.Verse.index_verses()
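
For reference, BM25 ranks a verse by summing, over the query terms, an IDF weight times a saturated term-frequency factor. A sketch of the per-term score (not the project's implementation):

  defmodule Bm25Sketch do
    @k1 1.2
    @b 0.75

    # tf: term frequency in the document, df: documents containing the term,
    # n: total documents, dl: document length, avgdl: average document length
    def term_score(tf, df, n, dl, avgdl) do
      idf = :math.log(1 + (n - df + 0.5) / (df + 0.5))
      idf * (tf * (@k1 + 1)) / (tf + @k1 * (1 - @b + @b * dl / avgdl))
    end
  end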

Run the app and search

  iex -S mix phx.server

Run the BEIR Benchmark

  # uncomment last index for ilike perf
  vim priv/repo/migrations/20250718010521_add_bm25_stats.exs
  mix ecto.reset
  # setup python venv
  uv python install 3.13
  uv venv
  source .venv/bin/activate
  uv sync
  python beir.py
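
NDCG@10, the headline relevance number below, discounts each relevant result by its rank and normalizes against the ideal ordering. A reference sketch of the metric (not the benchmark's code):

  defmodule NdcgSketch do
    # rels is the list of graded relevance labels in ranked order
    def dcg(rels) do
      rels
      |> Enum.with_index(1)
      |> Enum.map(fn {rel, rank} -> (:math.pow(2, rel) - 1) / :math.log2(rank + 1) end)
      |> Enum.sum()
    end

    def ndcg_at_k(rels, k) do
      ideal = rels |> Enum.sort(:desc) |> Enum.take(k) |> dcg()
      if ideal == 0, do: 0.0, else: dcg(Enum.take(rels, k)) / ideal
    end
  end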

Benchmark results

================================================================================
BENCHMARK RESULTS COMPARISON
================================================================================
Dataset: nfcorpus | Total Queries: 324
--------------------------------------------------------------------------------
Metric                    | BM25            | ILIKE           | Improvement
--------------------------------------------------------------------------------
NDCG@10 (relevance)       | 0.2940          | 0.1105          |         +166.1%
QPS (throughput)          | 4.86            | 225.36          |          -97.8%
Avg Latency (ms)          | 205.94          | 4.44            |        -4541.1%
Precision@10              | 0.2806          | 0.1747          |          +60.6%
Recall@10                 | 0.1302          | 0.0454          |        +0.0849pp
Recall % (found relevant) | 66.7            | 28.1            |          +38.6pp
--------------------------------------------------------------------------------
Total Time                | 66.73           | 1.44            | (seconds)
================================================================================

KEY TAKEAWAYS:
  ✓ BM25 provides 166.1% better relevance ranking
  ✓ BM25 retrieves more relevant documents within the top 10 (Precision@10)
  ✓ BM25 finds a larger proportion of all relevant documents (Recall@10)
  ⚠ BM25 is 97.8% slower, but much more relevant
  ✓ BM25 finds relevant results for 38.6 percentage points more queries (66.7% vs 28.1%)

About

Nx-based encoder + search with pgvector
