-
Notifications
You must be signed in to change notification settings - Fork 1.6k
POC: datafusion-cli instrumented object store #17266
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- A WIP/POC of instrumenting the object store backing datafusion-cli operations
Thank you @BlakeOrth -- I tried this out and it looks very nice 👌
I have some ideas on how to make it configurable and easy to see. BRB |
@alamb Awesome, I'm looking forward to hearing your feedback. Just so you can understand where my head was at for this (currently very rough) implementation in terms of configuration:
Any feedback on the above, or general code structure/implementation is obviously welcome as well, so let me know your thoughts. This code is obviously a minimally functional example, but I'd rather incorporate feedback early than need to re-work a bunch of stuff! |
Yes, I agree with this
I think it would be ok to always use the "Instrumented Object REgistry" and then only pass back an instrumented object store if the profiling was enabled
That is perfect!
how about something like this (maybe you can implement one of these, perhaps summary, as the intial PR and we can add the others afterwards) # Prints an object store summary at the end of all results, after the query executed
\profile objectstore summary # prints one line per object store request at the end of all results, after the query executed
\profile objectstore trace # prints one line per object store request when it happens (potentially in the middle of results)
# this can be useful to understand how the requests happen over time
\profile objectstore inline
Indeed -- makes total sense I |
In my mind I was going to have time to work on this myself, but I fear that is not likely to be the case for a while (and I will be out for the next week or so on vacation, though I will be reviewing PRs as much as possible). If you are willing to help push this forward that would be most appreciated |
I'm happy to keep driving this effort as long as it sounds like it's moving in the right direction. My interpretation of your above comments is that we currently are moving along at least mostly the right path, so I should be able to generally make progress. (And do your best to enjoy your vacation! The code will be here when you return) I just pushed some changes the implement the summary output. I get the feeling we're about to learn quite a lot... DataFusion CLI v49.0.1
> CREATE EXTERNAL TABLE nyc_taxi_rides
STORED AS PARQUET LOCATION 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/';
0 row(s) fetched.
Elapsed 2.587 seconds.
List Summary:
count: 1
Get Summary:
count: 288
duration min: 0.058361s
duration max: 0.374491s
duration avg: 0.122724s
size min: 8
size max: 44247
size avg: 18870
size sum: 5434702
List Summary:
count: 1
> select count(*) from 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/' where vendor_id='CMT';
+-----------+
| count(*) |
+-----------+
| 505603754 |
+-----------+
1 row(s) fetched.
Elapsed 56.057 seconds.
Get Summary:
count: 1126
duration min: 0.062342s
duration max: 1.831455s
duration avg: 0.397414s
size min: 47
size max: 112
size avg: 69
size sum: 78422
List Summary:
count: 4 (caveat: this is a debug build, not
I think this is an interesting thought and I'll take some time to explore this as a solution. Are you thinking the |
Yes indeed -- I think this will be super valuable
Yes that is what I was thinking I agree the implementation would be more complicated, but it would be easier to use (people wouldn't have to run datafusion-cli with a flag and then also have to enable/disable profiling) |
Which issue does this PR close?
N/A - POC to recieve initial implementation feedback related to
Rationale for this change
What changes are included in this PR?
Are these changes tested?
No - This is a POC
cc @alamb