[VLM] Offline scenario, performance-only mode of the reference implementation #2381
base: master
…perf-inference into wangshangsam/vlm-sut-prototype
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
    min_duration: Annotated[
        timedelta,
        Field(
            description="The minimum testing duration.",
        ),
    ] = timedelta(seconds=5)
Can we change this to float? Currently it is a timedelta. If I try to enter `--settings.min_duration 60`, I get:
Invalid value for _pydantic_settings_min_duration: Input should be a valid timedelta, "day" identifier in duration not correctly formatted
Hmmm, this is not supposed to happen. The string format that this flag accepts is defined in https://docs.pydantic.dev/2.0/usage/types/datetime/:
timedelta fields will accept values of type:
- str; the following formats are accepted:
  - [-][DD ][HH:MM]SS[.ffffff]
  - [±]P[DD]DT[HH]H[MM]M[SS]S (ISO 8601 format for timedelta)
I see, so the input will be something like: --settings.min_duration 00:01:00
00:01:00 can work, but what I was trying to say is that it's supposed to be able to take a single integer as well (in which case it would mean "seconds").
When I entered 60, it gave me that error.
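One possible workaround for the behaviour discussed in this thread is to normalize the command-line string before it reaches the timedelta field. The sketch below uses only the standard library; `parse_duration` is a hypothetical helper name, not something in this PR, and it assumes only two input shapes matter: a bare number meaning seconds, and `HH:MM:SS`. Negative values and the ISO 8601 form are deliberately not handled.

```python
from datetime import timedelta


def parse_duration(text: str) -> timedelta:
    """Hypothetical helper: turn '60', '1.5' or '00:01:00' into a timedelta."""
    text = text.strip()
    if ":" not in text:
        # A bare number is interpreted as seconds, as the thread above expects.
        return timedelta(seconds=float(text))
    parts = text.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected HH:MM:SS, got {text!r}")
    hours, minutes, seconds = int(parts[0]), int(parts[1]), float(parts[2])
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)
```

With this in place, `parse_duration("60")` and `parse_duration("00:01:00")` both yield a one-minute duration, so either spelling could be passed on the command line.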
    settings.use_token_latencies = True
    return settings
Here, when we use settings.use_token_latencies = True, the constraints become:
/// Token latency parameters
uint64_t server_ttft_latency = 100000000;
uint64_t server_tpot_latency = 100000000;
If we use False, the constraint is:
/// \brief The latency constraint for the Server scenario.
uint64_t server_target_latency_ns = 100000000;
We may need to add more flags to let the user choose the constraints and their values.
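As a rough illustration of the extra flags suggested here, the sketch below sets only the constraint that is active for the chosen mode. `FakeTestSettings` is a stand-in dataclass, not loadgen's real settings object; its field names are copied from the C++ quoted in this comment, and the attribute names exposed by the actual Python bindings may differ. `apply_latency_flags` and its parameters are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeTestSettings:
    """Stand-in for loadgen's settings; defaults mirror the quoted C++ (100 ms)."""
    use_token_latencies: bool = False
    server_target_latency_ns: int = 100_000_000
    server_ttft_latency: int = 100_000_000
    server_tpot_latency: int = 100_000_000


def apply_latency_flags(
    settings: FakeTestSettings,
    use_token_latencies: bool,
    *,
    target_latency_ns: Optional[int] = None,
    ttft_latency_ns: Optional[int] = None,
    tpot_latency_ns: Optional[int] = None,
) -> FakeTestSettings:
    """Override only the constraints relevant to the selected mode."""
    settings.use_token_latencies = use_token_latencies
    if use_token_latencies:
        # Token-latency mode: TTFT and TPOT are the constraints in force.
        if ttft_latency_ns is not None:
            settings.server_ttft_latency = ttft_latency_ns
        if tpot_latency_ns is not None:
            settings.server_tpot_latency = tpot_latency_ns
    elif target_latency_ns is not None:
        # Otherwise the single end-to-end target latency applies.
        settings.server_target_latency_ns = target_latency_ns
    return settings
```

Leaving a flag unset keeps the default, so existing invocations would behave as before.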
…the client, event loop and the event loop thread
This is the first PR towards the VLM reference implementation for the v6.0 round.
This PR currently supports the Offline scenario + performance-only mode. Server scenario and accuracy mode will be introduced through subsequent PRs.
The `issue_query` implementation adopted the purely asyncio-based design from the DSR1 reference implementation, but the code here is simpler, mostly because we only access the inference endpoint through OpenAI APIs.
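To make the asyncio-based design a little more concrete, here is a minimal sketch of the general pattern: an event loop running on a background thread, with `issue_query` scheduling one coroutine per sample and returning immediately. This is not the PR's code. `AsyncSUT`, `fake_endpoint`, and the `completed` list are hypothetical names, and `fake_endpoint` stands in for the real OpenAI API call.

```python
import asyncio
import threading


async def fake_endpoint(sample):
    """Stand-in for an async request to the inference endpoint."""
    await asyncio.sleep(0.01)
    return f"response-{sample}"


class AsyncSUT:
    """Hypothetical sketch: owns an event loop that runs on its own thread."""

    def __init__(self, send_request):
        self._send_request = send_request
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()
        self._futures = []
        self.completed = []

    def issue_query(self, samples):
        # Called from the caller's thread: schedule the work and return at once.
        for sample in samples:
            future = asyncio.run_coroutine_threadsafe(self._handle(sample), self._loop)
            self._futures.append(future)

    async def _handle(self, sample):
        response = await self._send_request(sample)
        # A real SUT would report the response back to loadgen here.
        self.completed.append((sample, response))

    def flush_queries(self):
        # Block until every scheduled request has finished.
        for future in self._futures:
            future.result()
        self._futures.clear()

    def close(self):
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()
```

A caller would construct `AsyncSUT(fake_endpoint)`, call `issue_query([...])`, then `flush_queries()` and `close()`. Because all requests are in flight on one loop, no thread pool is needed for concurrency.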