Add argparse support for knnPerfTest.py #413

yaser-aj · 2025-06-19T07:01:01Z

This resolves #387. Let me know if anything needs adjustments.

mikemccand

Looks great! Thank you @yaser-aj!

mikemccand · 2025-06-20T12:06:29Z

src/python/knnPerfTest.py

+    elif v.lower() == 'false':
+        return False
+    else:
+        raise argparse.ArgumentTypeError("Expected boolean value(s).")


Can you include the value v that was erroneous in the exception message?
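
For illustration, a minimal sketch of what that could look like. Only the 'false' branch and the ArgumentTypeError appear in the hunk above; the 'true' branch, the isinstance shortcut, and the exact message wording are assumptions, and the name str2bool is borrowed from the later commit message.

```python
import argparse


def str2bool(v):
    # Pass real booleans through unchanged (assumed convenience, not in the hunk).
    if isinstance(v, bool):
        return v
    if v.lower() == "true":
        return True
    elif v.lower() == "false":
        return False
    else:
        # Echo the offending value so the user can see what was rejected.
        raise argparse.ArgumentTypeError(f"Expected boolean value(s), got {v!r}.")
```

Used as type=str2bool on an argument, argparse would then report the rejected value in its own error line.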

mikemccand · 2025-06-20T12:07:15Z

src/python/knnPerfTest.py

-  indexes = [0] * len(values.keys())
-  indexes[-1] = -1
-  args = []
+  DO_PROFILING = values.pop("profile")


Hmm this used to be global scope, does it matter that it's now local to this method?

I don't think that it's supposed to be used elsewhere outside of this method. It was just like PARAMS except that it was handled a bit differently.

I just realized that I forgot to change the naming given that these arguments are no longer constants too.

OK that's great -- moving a variable from global to local scope when it's already only used in that one local scope is rote refactoring.

> I just realized that I forgot to change the naming given that these arguments are no longer constants too.

Ahh you mean the ALL_CAPS styling? Yeah let's fix that in another rev?

mikemccand · 2025-06-20T12:07:44Z

src/python/knnPerfTest.py

-  query_vectors = f"{constants.BASE_DIR}/data/cohere-wikipedia-queries-{dim}d.vec"
+  dim = values.pop("dim")
+  doc_vectors = values.pop("docVectors")
+  query_vectors = values.pop("queryVectors")


YAY! I'm so tired of editing this source for our runs...

mikemccand · 2025-06-20T12:08:48Z

src/python/knnPerfTest.py

  # query_vectors = f"/lucenedata/enwiki/{'cohere-wikipedia'}-queries-{dim}d.vec"
  # parentJoin_meta_file = f"{constants.BASE_DIR}/data/{'cohere-wikipedia'}-metadata.csv"

+  indexes = [0] * len(values.keys())


Hmm this is sort of confusing -- maybe add a comment what this indexes array is all about? It's sort of the iterator state when iterating through all combinations of the incoming args? Kinda like a car odometer...
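
The odometer analogy could be sketched roughly as follows. This is not the script's actual loop: advance and all_combinations are hypothetical names, the forceMerge key is made up for the example, and the sketch assumes values maps each parameter name to a non-empty list. Starting the last index at -1, as the hunk does, lets the first advance land on the all-zeros combination.

```python
def advance(indexes, values):
    """Tick the odometer once; return False after the last combination."""
    keys = list(values)
    # Work from the rightmost "wheel" to the left, carrying on wrap-around.
    for pos in range(len(indexes) - 1, -1, -1):
        if indexes[pos] < len(values[keys[pos]]) - 1:
            indexes[pos] += 1
            return True
        indexes[pos] = 0  # this wheel wrapped; carry into the next one
    return False


def all_combinations(values):
    """Collect one dict per combination of the per-parameter value lists."""
    keys = list(values)
    indexes = [0] * len(keys)
    indexes[-1] = -1  # so the first advance() yields [0, 0, ..., 0]
    combos = []
    while advance(indexes, values):
        combos.append({key: values[key][i] for key, i in zip(keys, indexes)})
    return combos


runs = all_combinations({"topK": [10, 50], "forceMerge": [True, False]})
print(len(runs))  # 4
```

The rightmost parameter changes fastest, the same way the last digit of an odometer does.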

mikemccand · 2025-06-20T12:11:45Z

src/python/knnPerfTest.py

+    parser = argparse.ArgumentParser(description="Run KNN benchmark with configurable parameters.")
+
+    parser.add_argument("--ndoc", type=int, nargs="+", default=[500_000], help="Number of documents")
+    parser.add_argument("--topK", type=int, nargs="+", default=[100], help="Top K results to retrieve")


Hmm, today we can set topK to a list of multiple values and the benchy will iterate. With this switch to argparse, does that still work? We would just pass multiple things for each e.g. --topK 10 50 100?

That's right! Combination values are passed separated by spaces and are then converted to a list.

Nice, I suppose nargs="+" does that? Would it only accept space as the multiple arg separator, or can you also use something else like a comma?

Also, since we provide default values, would nargs="*" make more sense here?

Let's make sure to document somewhere clearly (maybe in example runs?) to pass multiple values with spaces.
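
A small sketch of how argparse treats these cases, using a stand-in parser with only --topK rather than the PR's full parser. As far as argparse itself goes, values for nargs="+" or nargs="*" arrive as separate command-line tokens, so the shell's whitespace is the separator; a comma-joined token reaches type=int as a single string and is rejected unless a custom type splits it.

```python
import argparse


def make_parser(nargs):
    parser = argparse.ArgumentParser(prog="knnPerfTest.py")
    parser.add_argument("--topK", type=int, nargs=nargs, default=[100])
    return parser


plus = make_parser("+")
star = make_parser("*")

# Flag omitted: the default list is used with either setting.
assert plus.parse_args([]).topK == [100]
assert star.parse_args([]).topK == [100]

# Space-separated values become a list of ints.
assert plus.parse_args(["--topK", "10", "50", "100"]).topK == [10, 50, 100]

# Bare flag: "*" accepts it and yields an empty list (not the default).
assert star.parse_args(["--topK"]).topK == []

# "+" rejects a bare flag, and a comma-joined token fails int().
# In both cases argparse prints usage to stderr and exits with status 2.
for parser, argv in ((plus, ["--topK"]), (plus, ["--topK", "10,50"])):
    try:
        parser.parse_args(argv)
    except SystemExit as exit_info:
        assert exit_info.code == 2
    else:
        raise AssertionError("expected argparse to reject " + repr(argv))
```

One thing to weigh with "*": a bare --topK produces an empty list rather than the default, which would presumably mean zero runs once the combinations are enumerated.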

mikemccand · 2025-06-20T12:13:31Z

src/python/knnPerfTest.py

+  indexes = [0] * len(values.keys())
+  indexes[-1] = -1
+  args = []
+


Do we somewhere print Usage: python src/python/knnPerfTest.py ... line, if the user invokes w/o necessary args like docs/query source files?

If not, can we add that, and could you also give some juicy examples showing off the odometer iterator aspect, like mixing multiple topK with multiple other things (force merge or not, maxConn, etc.)?

So far the script adopts default values for all parameters. For the arguments doc/query vector files, it defaults to the downloaded sources from running src/python/initial_setup.py -download. We can of course let the doc/query vector files arguments be positional arguments and require them with no set defaults.

Let me know if you think that that would be a good design. I also just added a usage example that shows how multiple arguments are passed. Thanks for the feedback!

Yeah actually I love that it runs all defaults -- it's a great out of the box experience. I guess if the user messes something else up (not sure what?) we'd print a Usage line...? Can add this later...
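
On the Usage line: argparse generates one from the declared arguments, shows it with -h, and also prints it before exiting on a parse error, so an explicit print may not be needed. A sketch with a stand-in parser (two of the PR's arguments; the prog value is an assumption):

```python
import argparse

parser = argparse.ArgumentParser(
    prog="knnPerfTest.py",
    description="Run KNN benchmark with configurable parameters.",
)
parser.add_argument("--topK", type=int, nargs="+", default=[100], help="Top K results to retrieve")
parser.add_argument("--fanout", type=int, nargs="+", default=[50], help="Fanout parameter")

# format_usage() returns the same usage text argparse writes to stderr on a parse error.
usage = parser.format_usage()
print(usage)
```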

mikemccand · 2025-06-20T12:15:18Z

src/python/knnPerfTest.py


 if __name__ == "__main__":
+  args = parse_args()
+  PARAMS = vars(args)


I think the nightly KNN benchmarks (src/python/runNightlyKnn.py) invoke functions in this source -- can you take a quick peek and see whether these changes might break the nightly benchy? It's hard to test nightly benchy (I run on nightly benchy box and iterate to fix issues), but maybe we could take a first cut here to try not to break it ... thanks!

Looks like it's only using the print_fixed_width(...) method, which is not changed at all in this PR. I hope I'm not missing anything 😄

Super! I think you are right! It has more dependency on KnnGraphTester.java (a recent awesome change that added better CPU accounting had broken the nightly build... hopefully soon fixed).

yaser-aj · 2025-06-20T18:23:00Z

Done with the tweaks! Let me know if anything else needs adjustment.

vigyasharma · 2025-06-20T23:58:46Z

src/python/knnPerfTest.py

+    parser = argparse.ArgumentParser(description="Run KNN benchmark with configurable parameters.")
+
+    parser.add_argument("--ndoc", type=int, nargs="+", default=[500_000], help="Number of documents")
+    parser.add_argument("--topK", type=int, nargs="+", default=[100], help="Top K results to retrieve")


Nice, I suppose nargs="+" does that? Would it only accept space as the multiple arg separator, or can you also use something else like a comma?

vigyasharma · 2025-06-21T00:00:15Z

src/python/knnPerfTest.py

+    parser.add_argument("--dim", type=int, default=768, help="Vector dimensionality")
+    parser.add_argument("--docVectors", type=str, default=f"{constants.BASE_DIR}/data/cohere-wikipedia-docs-768d.vec", help="Path to document vectors")
+    parser.add_argument("--queryVectors", type=str, default=f"{constants.BASE_DIR}/data/cohere-wikipedia-queries-768d.vec", help="Path to query vectors")
+    parser.add_argument("--parentJoin", type=str, nargs="+", default=[], help="Path to parent join metadata file")


This doesn't take multiple values; does it need to be nargs="+"?

It's processed under the same iterator that goes over values. I agree that it shouldn't imply that it can be a list. We can add it to values as a list later or process it separately.

vigyasharma · 2025-06-21T00:00:32Z

src/python/knnPerfTest.py

+  if not values["parentJoin"]:
+    del values["parentJoin"]


Can we avoid this by making the argument optional?

This is there just to avoid passing empty parentJoin argument (which is the default value too) to knn.KnnGraphTester. If the user doesn't specify a metadata file, the argument will be omitted.
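
One possible shape for this, sketched with a stand-in parser: give --parentJoin a default of None and drop unset entries before building the tester's argument list. drop_unset is a hypothetical helper, meta.csv is a made-up path, and how the script actually hands arguments to knn.KnnGraphTester is not shown here.

```python
import argparse


def drop_unset(values):
    # Keep only the arguments that actually carry a value.
    return {name: value for name, value in values.items() if value is not None}


parser = argparse.ArgumentParser()
parser.add_argument("--topK", type=int, nargs="+", default=[100])
parser.add_argument("--parentJoin", type=str, default=None, help="Path to parent join metadata file")

# No metadata file given: the key is omitted entirely.
values = drop_unset(vars(parser.parse_args([])))
assert "parentJoin" not in values

# Metadata file given: it is passed through as a single string.
values = drop_unset(vars(parser.parse_args(["--parentJoin", "meta.csv"])))
assert values["parentJoin"] == "meta.csv"
```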

vigyasharma · 2025-06-21T00:02:49Z

src/python/knnPerfTest.py

+    parser.add_argument("--topK", type=int, nargs="+", default=[100], help="Top K results to retrieve")
+    parser.add_argument("--maxConn", type=int, nargs="+", default=[64], help="Max connections in the graph")
+    parser.add_argument("--beamWidthIndex", type=int, nargs="+", default=[250], help="Beam width at index time")
+    parser.add_argument("--fanout", type=int, nargs="+", default=[50], help="Fanout parameter")


For params that take multiple values (nargs="+"), are we validating that the same number of values were passed everywhere? Each value corresponds to a specific run. For example, if I pass --fanout 50 100 150, it will fire three runs with fanout = 50, 100 and 150 respectively. To do that, we need other arguments to also have three values.

Without that, you could fall back to using the default, but would you assume that provided values are for the first set of runs? I think that's confusing, and a mismatch in no. of values is likely just a user error.

Hmm, the tool currently takes all combinations of multiple inputs (I think/thought)? So three values for one param and two for another would make six runs. Let's leave that be for now?

> the tool currently takes all combinations of multiple inputs (I think/thought)? So three values for one param and two for another would make six runs. Let's leave that be for now?

Right, I didn't realize that! Guess I never used multiple args in the benchmark setup before. Yes, in that case, different no. of args is fine.
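
To make the run count concrete, a sketch using itertools.product as a stand-in for the script's own iteration (the script does not necessarily use it, and the parameter values below are made up):

```python
import itertools

values = {
    "fanout": [50, 100, 150],  # three values
    "topK": [10, 100],         # two values
    "maxConn": [64],           # left at a single value
}

# One run per element of the Cartesian product of the value lists.
runs = [dict(zip(values, combo)) for combo in itertools.product(*values.values())]
print(len(runs))  # 6
```

Parameters left at a single value do not multiply the count, so mismatched list lengths are not an error under this model.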

vigyasharma · 2025-06-21T00:06:39Z

src/python/knnPerfTest.py

+    parser = argparse.ArgumentParser(description="Run KNN benchmark with configurable parameters.")
+
+    parser.add_argument("--ndoc", type=int, nargs="+", default=[500_000], help="Number of documents")
+    parser.add_argument("--topK", type=int, nargs="+", default=[100], help="Top K results to retrieve")


Also, since we provide default values, would nargs="*" make more sense here?

vigyasharma · 2025-06-21T00:07:30Z

src/python/knnPerfTest.py

-  query_vectors = f"{constants.BASE_DIR}/data/cohere-wikipedia-queries-{dim}d.vec"
+  dim = values.pop("dim")
+  doc_vectors = values.pop("docVectors")
+  query_vectors = values.pop("queryVectors")


github-actions · 2025-07-08T00:10:15Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

vigyasharma

Thanks for making these changes @yaser-aj! I'm looking forward to running these benchmarks without having to modify the source script. Perhaps once you've finalized these changes, we can also update the instructions in the README?

Also, it would be helpful to see the actual command we need to run, and its benchmarking output, on the PR (maybe also include a basic version of the command in the doc string?). You have the right defaults in place; let's confirm that the script picks them up and that we don't have to pass args for values we don't want to change.

vigyasharma · 2025-07-08T05:59:32Z

src/python/knnPerfTest.py

 # change the parameters below and then run (you can still manually run this file, but using gradle command
 # below will auto recompile if you made any changes to java files in luceneutils)
 # ./gradlew runKnnPerfTest
 #


Will this still run as is, or do we need to update the gradle task as well?

I updated the task so that it accepts arguments by passing -Pargs="(script args go here)" to gradlew.

vigyasharma · 2025-07-08T06:00:57Z

src/python/knnPerfTest.py

 # ./gradlew runKnnPerfTest
 #
 # you may want to modify the following settings:



Let's also update the doc string above with instructions on how to run this script with args?

yaser-aj · 2025-07-11T18:28:47Z

I've updated the instructions and added support for passing arguments through gradlew. Let me know if anything else needs tweaking!

github-actions · 2025-07-26T00:10:29Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

(none) added 6 commits June 19, 2025 07:28

added argparse support

d79897e

added numSearchThread argument

06ff824

added parentJoin type support

76fccac

fixed empty parentJoin argument

726c43b

added arguments for dim, doc_vectors, and query_vectors

ed4fb74

fixed issues and refactored code

18b513c

mikemccand reviewed Jun 20, 2025


(none) added 4 commits June 20, 2025 18:14

include erroneous argument value v in str2bool exception

a32fc08

changed argument variable names to avoid confusion with constants

4e7c6c3

explain indexes iterator in comments

5339bf1

added usage example

8a3da61

vigyasharma reviewed Jun 21, 2025


github-actions bot added the Stale label Jul 8, 2025

vigyasharma requested changes Jul 8, 2025


github-actions bot removed the Stale label Jul 9, 2025

yaser-aj added 7 commits July 11, 2025 17:23

added argparse support

90c4789

replaced "+" with "*"

78d15cb

pass arguments for knnPerfTest through gradle

30311ec

changed doc string instructions

0337f5d

updated instructions in README.md

c18706e

removed old commented variables

a362acf

changed parentJoin to a single string value

4029ab8

github-actions bot added the Stale label Jul 26, 2025
