-
Notifications
You must be signed in to change notification settings - Fork 62
[RFC] Sampling APIs #206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Sampling APIs #206
Conversation
rfc_samplers.md
Outdated
understood. | ||
|
||
|
||
Option 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer Option 1 over Option 2. My reasons:
- To me, good OO-design is when an object has some internal state, and its methods are for either performing actions on that state, or retrieving some part of that state. The
ClipSampler
in Option 2 is not that. - I think it's unlikely that users will create a
ClipSampler
and call more than one sampling strategy on it in the same Python process. That is, I think users will kick off distinct training runs, where each training run changes the sampling strategy. - I think this problem is most straight-forwardly modeled as functions.
rfc_samplers.md
Outdated
num_frames_per_clip=4, | ||
step_between_frames=2, # often called "dilation" | ||
# sampler-specific params: | ||
prioritize_keyframes=False, # to implement "fast" sampling. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see this as a parameter or a different top-level function. As to name, I think we would want something more about the perceivable qualities to users: accuracy versus speed. We could implement that by prioritizing keyframes or not.
rfc_samplers.md
Outdated
# sampler-specific params: | ||
clip_range_start_seconds=3, | ||
clip_range_end_seconds=4, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think @ahmadsharif1's point about variable FPS is relevant here. Clearly, get_event_timed_clips()
is all about PTS as "time" is in the name. If we respect PTS for it (and I think we should), that takes care of variable FPS videos.
However, get_uniformly_random_clips()
and get_evenly_spaced_clips()
is less clear. Do we want to be uniformly random or evenly spaced across all of the frames, or across the time duration of the video? I think both behaviors can be reasonable for a user to want. That may need to be a parameter option, or different functions entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The time-based equivalent of get_evenly_spaced_clips()
is get_evenly_timed_clips()
.
For the random one, I agree we need to think more. There is also the "step_betwee_frames" parameter which right now is only index-based. We may want it to be time-based as well, and some users may want it to be time-based on an index-based random sampler (and vice-versa).
rfc_samplers.md
Outdated
|
||
- should the returned `clips` be...? | ||
- a List[FrameBatch] where the FrameBatch.data is 4D | ||
- a FrameBatch where the FrameBatch.data is 5D |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we try to survey use-cases here? My default is that we should favor returning tensors over lists. But if we discover that users may want different dimensions, we can't return tensors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flushing out some comments
|
||
|
||
def samplers.clips_at_random_timestamps( | ||
decoder, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a pure performance POV, passing a SimpleVideoDecoder could potentially be slow because it scans the entire file.
Also, do we plan to handle multiple streams or audio here? SimpleVideoDecoder doesn't handle those
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, do we plan to handle multiple streams or audio here? SimpleVideoDecoder doesn't handle those
This is out of scope for this design as we don't have a multi-stream decoder yet.
*, | ||
num_clips, | ||
num_frames_per_clip=1, | ||
seconds_between_frames=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: should this be seconds_between_frames or seconds_between_timestamps?
A frame is an interval, not a single point in time. I think for time-based sampling it makes sense to talk about single timestamps, not intervals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The output of this function is a sequence of clips where a clip is a batch of frames. A frame is a well-defined concept within the scope of these samplers, so I don't think we need to talk about timestamps here. One may ask: the timestamps of what? The clips? The frames? The ...?.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this more (refer to chat we had):
For time-based samplers the user has to give us the clip duration AND frames per clip both.
Your implementation will first determine the clip_starts in terms of seconds. To get the clip_ends you simply add the clip_duration (passed by the user) to the clip_starts.
Then to get the frames, you can subdivide the [clip_start, clip_end) interval into frames_per_clip
segments and ask for the frames at the beginning of each segment using get_frame_displayed_at
.
I think that's how this and the at_regular_timestamps
below should work.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think users don't think in terms of clip duration. To them, a "clip" is something they pass to their model for training, which means that all clips must have a constant number of frames. So the time duration of a clip is not important to them, but the number of frames is.
What I think users think about when sampling from a video:
- How many frames are in each clip?
- How do I determine where each clips starts?
- Inside of each clip, how far apart should each frame be?
In this RFC, that last question is answered by seconds_between_frames
and num_indices_between_frames
. Specifying the distance between frames (in indices or time) means we don't need to specify the clip duration.
In terms of usability, I think specifying how far apart individual frames should be scales easier than clip duration. That is, when video length can vary by order of magnitudes, I think it's harder to specify a clip duration that makes sense across that spectrum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @scotts
However to make the naming consistent I think it should be called seconds_between_samples or something like that. A frame is an interval not a snapshot in time. What the user wants is seconds between samples chosen and number of samples.
Note that if you specify two of {seconds_between_samples, num_samples_in_clip, clip_duration} the third is constrained to a single value. So specifying any two is enough. I suggested the user give us clip_duration and num_samples_in_clip, but the more intuitive may be num_samples_in_clip and seconds_between_samples
The last step is to get the frames that are displayed at those samples.
(I used samples instead of frames because two samples could land in the same frame, in which case we will be returning two copies of the same frame)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To give an extreme example -- if the whole video is a single frame of 10 seconds, and if the user specifies seconds_between_frames=1, you would end up returning copies of the same frame in the clip. And there would not be 1 second between frames.
There would be 1 second between the samples (which are instants in time) the sampler chose -- and those samples just happen to land on the same frame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's discuss more today about the right parametrization (clip_duration vs seconds_between_frames).
Regarding using the term frame vs sample vs timestamp: let's try to leave that for a separate, higher-level discussion.
*, | ||
num_clips_per_second, | ||
num_frames_per_clip=1, | ||
seconds_between_frames=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as nit above
) -> FrameBatch | ||
``` | ||
|
||
Discussion / Notes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you talk about error scenarios here?
What if there aren't enough frames in the video? Do we return partial results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted this RFC to focus on the interface more than on lower-level details, as there are enough discussion to have on the interface already. I'd suggest to leave these discussions for the implementation PRs.
When there are not enough frames some libs "wrap around", and some return partial results. We'll need to decide which one[s] we want to support.
num_clips_per_second, | ||
num_frames_per_clip=1, | ||
seconds_between_frames=None, | ||
clip_range_start_seconds: float = 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why isn't this -inf?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we want or need to expose the fact that some streams may have negative start times. This is pretty unintuitive. If 0 isn't correct enough, I'd suggest to just use None
of the default of both clip_range_start_seconds
and clip_range_end_seconds
seconds_between_frames=None, | ||
) -> FrameBatch | ||
|
||
def samplers.clips_at_regular_indices( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this and clips_at_regular_timestamps(), both clips and frames within the cips are at regular indices/timestamps respectively, right? Is there a name that would convey to the user that information?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name already convey the fact that clips are at regular indices/timestamps.
For frames within a clip, I don't believe this is necessary to convey. This is not a key aspect of these samplers, and it's only captured in a single, optional parameter. We don't convey that in the random sampler either.
def samplers.clips_at_regular_timestamps( | ||
decoder, | ||
*, | ||
num_clips_per_second, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find num_clips_per_second
confusing to reason about because we need to think about fractions of a clip. That is, because we've fixed the unit to be 1 second ("per second"), if we want a clip to span multiple seconds, we'll need to specify what fraction of the clip is taken up by 1 second. For example, if a user wants a clip to start every 3 seconds, I believe they would need to say num_clips_per_second=1/3
. That is, each second has 1/3 of a clip, so we need 3 seconds before we get a full clip.
I have been thinking of how to specify the start in time as a period instead of a rate. So more intuitive to me is clips_start_period_seconds
. In order to achieve the above, I would say clips_start_period_seconds=3
. A clip starts every 3 seconds. Granted, if I want multiple clips in one second, we have to provide values less than 1 again: clips_start_period_seconds=1/3
would be starting a clip every 1/3 of a second.
Personally, I find it more intuitive to think of fractions of a second than fractions of a clip. And I find it more intuitive to think in terms of a period instead of a rate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be more consistent if the user specifies both values as periods -- seconds_between_clip_starts and seconds_between_samples (or seconds_between_frames)
OR
the user specifies both values as a frequency -- samples_per_second and clip_starts_per_second
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
num_clips_per_seconds
has 2 clear advantages over clips_start_period_seconds
IMHO:
- It is what existing samplers use. It reduces migration friction.
- It is immediately obvious what it means, while that's not the case for
clips_start_period_seconds
Personally, I find it more intuitive to think of fractions of a second than fractions of a clip
I share the same view, and that's precisely why I tend to think num_clips_per_seconds
is easier to grasp: to me, a fraction of a second implies a "per_second" parameter.
Note that I don't think this premise is quite exact:
if we want a clip to span multiple seconds, we'll need to specify what fraction of the clip is taken up by 1 second
The time spanned by a clip isn't just determined by this parameter, it also depend on the number of frames within that clip (and the framerate).
Happy to chat more!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is immediately obvious what it means, while that's not the case for
clips_start_period_seconds
The reason I made this comment is that num_clips_per_second
was not immediately obvious to me. :) I had to start writing the comment, and doing unit analysis, to piece it together. If we can come up with a name better than clip_start_period_seconds
that allows users to specify a clip period, then that's great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I like seconds_between_clip_starts
and seconds_between_samples_within_clip
(both are periods, not frequencies).
No description provided.