-
Notifications
You must be signed in to change notification settings - Fork 136
Add argparse support for knnPerfTest.py #413
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add argparse support for knnPerfTest.py #413
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Thank you @yaser-aj!
src/python/knnPerfTest.py
Outdated
elif v.lower() == 'false': | ||
return False | ||
else: | ||
raise argparse.ArgumentTypeError("Expected boolean value(s).") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you include the value v
that was erroneous in the exception message?
src/python/knnPerfTest.py
Outdated
indexes = [0] * len(values.keys()) | ||
indexes[-1] = -1 | ||
args = [] | ||
DO_PROFILING = values.pop("profile") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm this used to be global scope, does it matter that it's now local to this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that it's supposed to be used elsewhere outside of this method. It was just like PARAMS
except that it was handled a bit differently.
I just realized that I forgot to change the naming given that these arguments are no longer constants too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK that's great -- moving a variable from global to local scope when it's already only used in that one local scope is rote refactoring.
I just realized that I forgot to change the naming given that these arguments are no longer constants too.
Ahh you mean the ALL_CAPS styling? Yeah let's fix that in another rev?
query_vectors = f"{constants.BASE_DIR}/data/cohere-wikipedia-queries-{dim}d.vec" | ||
dim = values.pop("dim") | ||
doc_vectors = values.pop("docVectors") | ||
query_vectors = values.pop("queryVectors") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
YAY! I'm so tired of editing this source for our runs...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, 100%
# query_vectors = f"/lucenedata/enwiki/{'cohere-wikipedia'}-queries-{dim}d.vec" | ||
# parentJoin_meta_file = f"{constants.BASE_DIR}/data/{'cohere-wikipedia'}-metadata.csv" | ||
|
||
indexes = [0] * len(values.keys()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm this is sort of confusing -- maybe add a comment what this indexes
array is all about? It's sort of the iterator state when iterating through all combinations of the incoming args? Kinda like a car odometer...
src/python/knnPerfTest.py
Outdated
parser = argparse.ArgumentParser(description="Run KNN benchmark with configurable parameters.") | ||
|
||
parser.add_argument("--ndoc", type=int, nargs="+", default=[500_000], help="Number of documents") | ||
parser.add_argument("--topK", type=int, nargs="+", default=[100], help="Top K results to retrieve") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, today we can set topK
to a list of multiple values and the benchy will iterate. With this switch to argparse
, does that still work? We would just pass multiple things for each e.g. --topK 10 50 100
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right! Combination values are passed separated by spaces and are then converted to a list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, I suppose nargs="+"
does that? Would it only accept space as the multiple arg separator, or can you also use something else like a comma?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, since we provide default values, would nargs="*"
make more sense here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make sure to document somewhere clearly (maybe in example runs?) to pass multiple values with spaces.
indexes = [0] * len(values.keys()) | ||
indexes[-1] = -1 | ||
args = [] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we somewhere print Usage: python src/python/knnPerfTest.py ...
line, if the user invokes w/o necessary args like docs/query source files?
If not, can we add that, and could you also give some juicy examples showing off the odometer iterator aspect, like mixing multiple topK
with multiple other things (force merge or not, maxConn
, etc.)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far the script adopts default values for all parameters. For the arguments doc/query vector files, it defaults to the downloaded sources from running src/python/initial_setup.py -download
. We can of course let the doc/query vector files arguments be positional arguments and require them with no set defaults.
Let me know if you think that that would be a good design. I also just added a usage example that shows how multiple arguments are passed. Thanks for the feedback!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah actually I love that it runs all defaults -- it's a great out of the box experience. I guess if the user messes something else up (not sure what?) we'd print a Usage line...? Can add this later...
src/python/knnPerfTest.py
Outdated
|
||
if __name__ == "__main__": | ||
args = parse_args() | ||
PARAMS = vars(args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nightly KNN benchmarks (src/python/runNightlyKnn.py
) I think invokes functions in this source -- can you take a quick peek and see whether these changes might break the nightly benchy? It's hard to test nightly benchy (I run on nightly benchy box and iterate to fix issues), but maybe we could take a first cut here to try not to break it ... thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it's only using the print_fixed_width(...) method, which is not changed at all in this PR. I hope I'm not missing anything 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super! I think you are right! It has more dependency on KnnGraphTester.java
(a recent awesome change that added better CPU accounting had broken the nightly build... hopefully soon fixed).
Done with the tweaks! Let me know if anything else needs adjustment. |
src/python/knnPerfTest.py
Outdated
parser = argparse.ArgumentParser(description="Run KNN benchmark with configurable parameters.") | ||
|
||
parser.add_argument("--ndoc", type=int, nargs="+", default=[500_000], help="Number of documents") | ||
parser.add_argument("--topK", type=int, nargs="+", default=[100], help="Top K results to retrieve") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, I suppose nargs="+"
does that? Would it only accept space as the multiple arg separator, or can you also use something else like a comma?
src/python/knnPerfTest.py
Outdated
parser.add_argument("--dim", type=int, default=768, help="Vector dimensionality") | ||
parser.add_argument("--docVectors", type=str, default=f"{constants.BASE_DIR}/data/cohere-wikipedia-docs-768d.vec", help="Path to document vectors") | ||
parser.add_argument("--queryVectors", type=str, default=f"{constants.BASE_DIR}/data/cohere-wikipedia-queries-768d.vec", help="Path to query vectors") | ||
parser.add_argument("--parentJoin", type=str, nargs="+", default=[], help="Path to parent join metadata file") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't need multiple arguments, does it need to be nargs+
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's processed under the same iterator that goes over values
. I agree that it shouldn't imply that it can be a list. We can add it to values as a list later or process it separately.
src/python/knnPerfTest.py
Outdated
if not values["parentJoin"]: | ||
del values["parentJoin"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid this by making the argument optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is there just to avoid passing empty parentJoin argument (which is the default value too) to knn.KnnGraphTester. If the user doesn't specify a metadata file, the argument will be omitted.
src/python/knnPerfTest.py
Outdated
parser.add_argument("--topK", type=int, nargs="+", default=[100], help="Top K results to retrieve") | ||
parser.add_argument("--maxConn", type=int, nargs="+", default=[64], help="Max connections in the graph") | ||
parser.add_argument("--beamWidthIndex", type=int, nargs="+", default=[250], help="Beam width at index time") | ||
parser.add_argument("--fanout", type=int, nargs="+", default=[50], help="Fanout parameter") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For params that take multiple values (nargs="+"), are we validating that the same number of values were passed everywhere? Each value corresponds to a specific run. For for e.g., if I pass --fanout 50 100 150
, it will fire three runs with fanout = 50, 100 and 150 respectively. To do that, we need other arguments to also have three values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without that, you could fall back to using the default, but would you assume that provided values are for the first set of runs? I think that's confusing, and a mismatch in no. of values is likely just a user error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, the tool currently takes all combinations of multiple inputs (I think/thought)? So three values for one param and two for another would make six runs. Let's leave that be for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the tool currently takes all combinations of multiple inputs (I think/thought)? So three values for one param and two for another would make six runs. Let's leave that be for now?
Right, I didn't realize that! Guess I never used multiple args in the benchmark setup before. Yes, in that case, different no. of args is fine.
src/python/knnPerfTest.py
Outdated
parser = argparse.ArgumentParser(description="Run KNN benchmark with configurable parameters.") | ||
|
||
parser.add_argument("--ndoc", type=int, nargs="+", default=[500_000], help="Number of documents") | ||
parser.add_argument("--topK", type=int, nargs="+", default=[100], help="Top K results to retrieve") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, since we provide default values, would nargs="*"
make more sense here?
query_vectors = f"{constants.BASE_DIR}/data/cohere-wikipedia-queries-{dim}d.vec" | ||
dim = values.pop("dim") | ||
doc_vectors = values.pop("docVectors") | ||
query_vectors = values.pop("queryVectors") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, 100%
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making these changes @yaser-aj ! I'm looking forward to running these benchmarks without having to modify the source script. Perhaps once you've finalized these changes, we can also update the instructions in README
?
Also, it would be helpful to see what the actual command we need to run and its benchmarking output on the PR (maybe also include a basic version of the command in the doc string)? You have the right defaults in place, let's confirm that the script picks them up and we don't have to pass args for values we don't want to change.
# change the parameters below and then run (you can still manually run this file, but using gradle command | ||
# below will auto recompile if you made any changes to java files in luceneutils) | ||
# ./gradlew runKnnPerfTest | ||
# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this still run as is, or do we need to update the gradle task as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the task so that it accepts arguments by passing -Pargs="(script args go here)"
to gradlew
.
# ./gradlew runKnnPerfTest | ||
# | ||
# you may want to modify the following settings: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also update the doc string above with instructions on how to run this script with args?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
I've updated the instructions and added support for passing arguments through |
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
This resolves #387. Let me know if anything needs adjustments.