Skip to content
Open
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
96 changes: 46 additions & 50 deletions src/python/knnPerfTest.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@
import re
import subprocess
import sys
import argparse

import benchUtil
import constants
Expand All @@ -33,60 +34,47 @@
#
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this still run as is, or do we need to update the gradle task as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the task so that it accepts arguments by passing -Pargs="(script args go here)" to gradlew.

# you may want to modify the following settings:

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also update the doc string above with instructions on how to run this script with args?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

DO_PROFILING = False

# e.g. to compile KnnIndexer:
#
# javac -d build -cp /l/trunk/lucene/core/build/libs/lucene-core-10.0.0-SNAPSHOT.jar:/l/trunk/lucene/join/build/libs/lucene-join-10.0.0-SNAPSHOT.jar src/main/knn/*.java src/main/WikiVectors.java src/main/perf/VectorDictionary.java
#

NOISY = True

# TODO
# - can we expose greediness (global vs local queue exploration in KNN search) here?

# test parameters. This script will run KnnGraphTester on every combination of these parameters
PARAMS = {
# "ndoc": (10_000_000,),
#'ndoc': (10000, 100000, 200000, 500000),
#'ndoc': (10000, 100000, 200000, 500000),
#'ndoc': (2_000_000,),
#'ndoc': (1_000_000,),
"ndoc": (500_000,),
#'ndoc': (50_000,),
#'maxConn': (32, 64, 96),
"maxConn": (64,),
#'maxConn': (32,),
#'beamWidthIndex': (250, 500),
"beamWidthIndex": (250,),
#'beamWidthIndex': (50,),
#'fanout': (20, 100, 250)
"fanout": (50,),
#'quantize': None,
#'quantizeBits': (32, 7, 4),
"numMergeWorker": (12,),
"numMergeThread": (4,),
"numSearchThread": (0,),
#'numMergeWorker': (1,),
#'numMergeThread': (1,),
"encoding": ("float32",),
# 'metric': ('angular',), # default is angular (dot_product)
# 'metric': ('mip',),
#'quantize': (True,),
"quantizeBits": (32,),
# "quantizeBits": (1,),
# "overSample": (5,), # extra ratio of vectors to retrieve, for testing approximate scoring, e.g. quantized indices
#'fanout': (0,),
"topK": (100,),
# "bp": ("false", "true"),
#'quantizeCompress': (True, False),
"quantizeCompress": (True,),
# "indexType": ("flat", "hnsw"), # index type, only works with singlt bit
"queryStartIndex": (0,), # seek to this start vector before searching, to sample different vectors
# "forceMerge": (True, False),
#'niter': (10,),
}

def str2bool(v):
if v.lower() == 'true':
return True
elif v.lower() == 'false':
return False
else:
raise argparse.ArgumentTypeError("Expected boolean value(s).")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you include the value v that was erroneous in the exception message?


def parse_args():
parser = argparse.ArgumentParser(description="Run KNN benchmark with configurable parameters.")

parser.add_argument("--ndoc", type=int, nargs="+", default=[500_000], help="Number of documents")
parser.add_argument("--topK", type=int, nargs="+", default=[100], help="Top K results to retrieve")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, today we can set topK to a list of multiple values and the benchy will iterate. With this switch to argparse, does that still work? We would just pass multiple things for each e.g. --topK 10 50 100?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right! Combination values are passed separated by spaces and are then converted to a list.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I suppose nargs="+" does that? Would it only accept space as the multiple arg separator, or can you also use something else like a comma?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, since we provide default values, would nargs="*" make more sense here?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make sure to document somewhere clearly (maybe in example runs?) to pass multiple values with spaces.

parser.add_argument("--maxConn", type=int, nargs="+", default=[64], help="Max connections in the graph")
parser.add_argument("--beamWidthIndex", type=int, nargs="+", default=[250], help="Beam width at index time")
parser.add_argument("--fanout", type=int, nargs="+", default=[50], help="Fanout parameter")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For params that take multiple values (nargs="+"), are we validating that the same number of values were passed everywhere? Each value corresponds to a specific run. For for e.g., if I pass --fanout 50 100 150, it will fire three runs with fanout = 50, 100 and 150 respectively. To do that, we need other arguments to also have three values.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without that, you could fall back to using the default, but would you assume that provided values are for the first set of runs? I think that's confusing, and a mismatch in no. of values is likely just a user error.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the tool currently takes all combinations of multiple inputs (I think/thought)? So three values for one param and two for another would make six runs. Let's leave that be for now?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the tool currently takes all combinations of multiple inputs (I think/thought)? So three values for one param and two for another would make six runs. Let's leave that be for now?

Right, I didn't realize that! Guess I never used multiple args in the benchmark setup before. Yes, in that case, different no. of args is fine.

parser.add_argument("--quantizeBits", type=int, nargs="+", default=[32], help="Quantization bits")
parser.add_argument("--quantizeCompress", type=str2bool, nargs="+", default=[True], help="Enable quantize compression")
parser.add_argument("--numMergeWorker", type=int, nargs="+", default=[12], help="Number of merge workers")
parser.add_argument("--numMergeThread", type=int, nargs="+", default=[4], help="Number of merge threads")
parser.add_argument("--encoding", type=str, nargs="+", default=["float32"], help="Encoding type")
parser.add_argument("--queryStartIndex", type=int, nargs="+", default=[0], help="Query start index")
parser.add_argument("--numSearchThread", type=int, nargs="+", default=[0], help="Number of search threads")
parser.add_argument("--dim", type=int, default=768, help="Vector dimensionality")
parser.add_argument("--docVectors", type=str, default=f"{constants.BASE_DIR}/data/cohere-wikipedia-docs-768d.vec", help="Path to document vectors")
parser.add_argument("--queryVectors", type=str, default=f"{constants.BASE_DIR}/data/cohere-wikipedia-queries-768d.vec", help="Path to query vectors")
parser.add_argument("--parentJoin", type=str, nargs="+", default=[], help="Path to parent join metadata file")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't need multiple arguments, does it need to be nargs+?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's processed under the same iterator that goes over values. I agree that it shouldn't imply that it can be a list. We can add it to values as a list later or process it separately.

parser.add_argument("--profile", action="store_true", help="Enable Java profiling")
parser.add_argument("--quiet", action="store_true", help="Suppress benchmark output")

return parser.parse_args()

def advance(ix, values):
for i in reversed(range(len(ix))):
Expand All @@ -102,9 +90,11 @@ def advance(ix, values):


def run_knn_benchmark(checkout, values):
indexes = [0] * len(values.keys())
indexes[-1] = -1
args = []
DO_PROFILING = values.pop("profile")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this used to be global scope, does it matter that it's now local to this method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that it's supposed to be used elsewhere outside of this method. It was just like PARAMS except that it was handled a bit differently.

I just realized that I forgot to change the naming given that these arguments are no longer constants too.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK that's great -- moving a variable from global to local scope when it's already only used in that one local scope is rote refactoring.

I just realized that I forgot to change the naming given that these arguments are no longer constants too.

Ahh you mean the ALL_CAPS styling? Yeah let's fix that in another rev?

NOISY = not values.pop("quiet")
if not values["parentJoin"]:
del values["parentJoin"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid this by making the argument optional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is there just to avoid passing empty parentJoin argument (which is the default value too) to knn.KnnGraphTester. If the user doesn't specify a metadata file, the argument will be omitted.


# dim = 100
# doc_vectors = constants.GLOVE_VECTOR_DOCS_FILE
# query_vectors = '%s/luceneutil/tasks/vector-task-100d.vec' % constants.BASE_DIR
Expand All @@ -123,13 +113,17 @@ def run_knn_benchmark(checkout, values):
# query_vectors = '/d/electronics_query_vectors.bin'

# Cohere dataset
dim = 768
doc_vectors = f"{constants.BASE_DIR}/data/cohere-wikipedia-docs-{dim}d.vec"
query_vectors = f"{constants.BASE_DIR}/data/cohere-wikipedia-queries-{dim}d.vec"
dim = values.pop("dim")
doc_vectors = values.pop("docVectors")
query_vectors = values.pop("queryVectors")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

YAY! I'm so tired of editing this source for our runs...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, 100%

# doc_vectors = f"/lucenedata/enwiki/{'cohere-wikipedia'}-docs-{dim}d.vec"
# query_vectors = f"/lucenedata/enwiki/{'cohere-wikipedia'}-queries-{dim}d.vec"
# parentJoin_meta_file = f"{constants.BASE_DIR}/data/{'cohere-wikipedia'}-metadata.csv"

indexes = [0] * len(values.keys())
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this is sort of confusing -- maybe add a comment what this indexes array is all about? It's sort of the iterator state when iterating through all combinations of the incoming args? Kinda like a car odometer...

indexes[-1] = -1
args = []

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we somewhere print Usage: python src/python/knnPerfTest.py ... line, if the user invokes w/o necessary args like docs/query source files?

If not, can we add that, and could you also give some juicy examples showing off the odometer iterator aspect, like mixing multiple topK with multiple other things (force merge or not, maxConn, etc.)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far the script adopts default values for all parameters. For the arguments doc/query vector files, it defaults to the downloaded sources from running src/python/initial_setup.py -download. We can of course let the doc/query vector files arguments be positional arguments and require them with no set defaults.

Let me know if you think that that would be a good design. I also just added a usage example that shows how multiple arguments are passed. Thanks for the feedback!

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah actually I love that it runs all defaults -- it's a great out of the box experience. I guess if the user messes something else up (not sure what?) we'd print a Usage line...? Can add this later...

jfr_output = f"{constants.LOGS_DIR}/knn-perf-test.jfr"

cp = benchUtil.classPathToString(benchUtil.getClassPath(checkout) + (f"{constants.BENCH_BASE_DIR}/build",))
Expand Down Expand Up @@ -284,6 +278,8 @@ def print_fixed_width(all_results, columns_to_skip):


if __name__ == "__main__":
args = parse_args()
PARAMS = vars(args)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nightly KNN benchmarks (src/python/runNightlyKnn.py) I think invokes functions in this source -- can you take a quick peek and see whether these changes might break the nightly benchy? It's hard to test nightly benchy (I run on nightly benchy box and iterate to fix issues), but maybe we could take a first cut here to try not to break it ... thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it's only using the print_fixed_width(...) method, which is not changed at all in this PR. I hope I'm not missing anything 😄

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super! I think you are right! It has more dependency on KnnGraphTester.java (a recent awesome change that added better CPU accounting had broken the nightly build... hopefully soon fixed).

# Where the version of Lucene is that will be tested. Now this will be sourced from gradle.properties
LUCENE_CHECKOUT = getLuceneDirFromGradleProperties()
run_knn_benchmark(LUCENE_CHECKOUT, PARAMS)
Loading