Skip to content

Conversation

yaser-aj
Copy link
Contributor

This resolves #387. Let me know if anything needs adjustments.

Copy link
Owner

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thank you @yaser-aj!

elif v.lower() == 'false':
return False
else:
raise argparse.ArgumentTypeError("Expected boolean value(s).")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you include the value v that was erroneous in the exception message?

indexes = [0] * len(values.keys())
indexes[-1] = -1
args = []
DO_PROFILING = values.pop("profile")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this used to be global scope, does it matter that it's now local to this method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that it's supposed to be used elsewhere outside of this method. It was just like PARAMS except that it was handled a bit differently.

I just realized that I forgot to change the naming given that these arguments are no longer constants too.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK that's great -- moving a variable from global to local scope when it's already only used in that one local scope is rote refactoring.

I just realized that I forgot to change the naming given that these arguments are no longer constants too.

Ahh you mean the ALL_CAPS styling? Yeah let's fix that in another rev?

query_vectors = f"{constants.BASE_DIR}/data/cohere-wikipedia-queries-{dim}d.vec"
dim = values.pop("dim")
doc_vectors = values.pop("docVectors")
query_vectors = values.pop("queryVectors")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

YAY! I'm so tired of editing this source for our runs...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, 100%

# query_vectors = f"/lucenedata/enwiki/{'cohere-wikipedia'}-queries-{dim}d.vec"
# parentJoin_meta_file = f"{constants.BASE_DIR}/data/{'cohere-wikipedia'}-metadata.csv"

indexes = [0] * len(values.keys())
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this is sort of confusing -- maybe add a comment what this indexes array is all about? It's sort of the iterator state when iterating through all combinations of the incoming args? Kinda like a car odometer...

parser = argparse.ArgumentParser(description="Run KNN benchmark with configurable parameters.")

parser.add_argument("--ndoc", type=int, nargs="+", default=[500_000], help="Number of documents")
parser.add_argument("--topK", type=int, nargs="+", default=[100], help="Top K results to retrieve")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, today we can set topK to a list of multiple values and the benchy will iterate. With this switch to argparse, does that still work? We would just pass multiple things for each e.g. --topK 10 50 100?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right! Combination values are passed separated by spaces and are then converted to a list.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I suppose nargs="+" does that? Would it only accept space as the multiple arg separator, or can you also use something else like a comma?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, since we provide default values, would nargs="*" make more sense here?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make sure to document somewhere clearly (maybe in example runs?) to pass multiple values with spaces.

indexes = [0] * len(values.keys())
indexes[-1] = -1
args = []

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we somewhere print Usage: python src/python/knnPerfTest.py ... line, if the user invokes w/o necessary args like docs/query source files?

If not, can we add that, and could you also give some juicy examples showing off the odometer iterator aspect, like mixing multiple topK with multiple other things (force merge or not, maxConn, etc.)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far the script adopts default values for all parameters. For the arguments doc/query vector files, it defaults to the downloaded sources from running src/python/initial_setup.py -download. We can of course let the doc/query vector files arguments be positional arguments and require them with no set defaults.

Let me know if you think that that would be a good design. I also just added a usage example that shows how multiple arguments are passed. Thanks for the feedback!

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah actually I love that it runs all defaults -- it's a great out of the box experience. I guess if the user messes something else up (not sure what?) we'd print a Usage line...? Can add this later...


if __name__ == "__main__":
args = parse_args()
PARAMS = vars(args)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nightly KNN benchmarks (src/python/runNightlyKnn.py) I think invokes functions in this source -- can you take a quick peek and see whether these changes might break the nightly benchy? It's hard to test nightly benchy (I run on nightly benchy box and iterate to fix issues), but maybe we could take a first cut here to try not to break it ... thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it's only using the print_fixed_width(...) method, which is not changed at all in this PR. I hope I'm not missing anything 😄

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super! I think you are right! It has more dependency on KnnGraphTester.java (a recent awesome change that added better CPU accounting had broken the nightly build... hopefully soon fixed).

@yaser-aj
Copy link
Contributor Author

Done with the tweaks! Let me know if anything else needs adjustment.

parser = argparse.ArgumentParser(description="Run KNN benchmark with configurable parameters.")

parser.add_argument("--ndoc", type=int, nargs="+", default=[500_000], help="Number of documents")
parser.add_argument("--topK", type=int, nargs="+", default=[100], help="Top K results to retrieve")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I suppose nargs="+" does that? Would it only accept space as the multiple arg separator, or can you also use something else like a comma?

parser.add_argument("--dim", type=int, default=768, help="Vector dimensionality")
parser.add_argument("--docVectors", type=str, default=f"{constants.BASE_DIR}/data/cohere-wikipedia-docs-768d.vec", help="Path to document vectors")
parser.add_argument("--queryVectors", type=str, default=f"{constants.BASE_DIR}/data/cohere-wikipedia-queries-768d.vec", help="Path to query vectors")
parser.add_argument("--parentJoin", type=str, nargs="+", default=[], help="Path to parent join metadata file")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't need multiple arguments, does it need to be nargs+?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's processed under the same iterator that goes over values. I agree that it shouldn't imply that it can be a list. We can add it to values as a list later or process it separately.

Comment on lines 98 to 99
if not values["parentJoin"]:
del values["parentJoin"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid this by making the argument optional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is there just to avoid passing empty parentJoin argument (which is the default value too) to knn.KnnGraphTester. If the user doesn't specify a metadata file, the argument will be omitted.

parser.add_argument("--topK", type=int, nargs="+", default=[100], help="Top K results to retrieve")
parser.add_argument("--maxConn", type=int, nargs="+", default=[64], help="Max connections in the graph")
parser.add_argument("--beamWidthIndex", type=int, nargs="+", default=[250], help="Beam width at index time")
parser.add_argument("--fanout", type=int, nargs="+", default=[50], help="Fanout parameter")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For params that take multiple values (nargs="+"), are we validating that the same number of values were passed everywhere? Each value corresponds to a specific run. For for e.g., if I pass --fanout 50 100 150, it will fire three runs with fanout = 50, 100 and 150 respectively. To do that, we need other arguments to also have three values.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without that, you could fall back to using the default, but would you assume that provided values are for the first set of runs? I think that's confusing, and a mismatch in no. of values is likely just a user error.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the tool currently takes all combinations of multiple inputs (I think/thought)? So three values for one param and two for another would make six runs. Let's leave that be for now?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the tool currently takes all combinations of multiple inputs (I think/thought)? So three values for one param and two for another would make six runs. Let's leave that be for now?

Right, I didn't realize that! Guess I never used multiple args in the benchmark setup before. Yes, in that case, different no. of args is fine.

parser = argparse.ArgumentParser(description="Run KNN benchmark with configurable parameters.")

parser.add_argument("--ndoc", type=int, nargs="+", default=[500_000], help="Number of documents")
parser.add_argument("--topK", type=int, nargs="+", default=[100], help="Top K results to retrieve")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, since we provide default values, would nargs="*" make more sense here?

query_vectors = f"{constants.BASE_DIR}/data/cohere-wikipedia-queries-{dim}d.vec"
dim = values.pop("dim")
doc_vectors = values.pop("docVectors")
query_vectors = values.pop("queryVectors")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, 100%

Copy link

github-actions bot commented Jul 8, 2025

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jul 8, 2025
Copy link
Collaborator

@vigyasharma vigyasharma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making these changes @yaser-aj ! I'm looking forward to running these benchmarks without having to modify the source script. Perhaps once you've finalized these changes, we can also update the instructions in README?

Also, it would be helpful to see what the actual command we need to run and its benchmarking output on the PR (maybe also include a basic version of the command in the doc string)? You have the right defaults in place, let's confirm that the script picks them up and we don't have to pass args for values we don't want to change.

# change the parameters below and then run (you can still manually run this file, but using gradle command
# below will auto recompile if you made any changes to java files in luceneutils)
# ./gradlew runKnnPerfTest
#
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this still run as is, or do we need to update the gradle task as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the task so that it accepts arguments by passing -Pargs="(script args go here)" to gradlew.

# ./gradlew runKnnPerfTest
#
# you may want to modify the following settings:

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also update the doc string above with instructions on how to run this script with args?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@github-actions github-actions bot removed the Stale label Jul 9, 2025
@yaser-aj
Copy link
Contributor Author

I've updated the instructions and added support for passing arguments through gradlew. Let me know if anything else needs tweaking!

Copy link

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jul 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add argparse support for knnPerfTest.py
3 participants