[RFC] Sampling APIs #206

NicolasHug · 2024-09-10T18:09:18Z

No description provided.

scotts · 2024-09-14T03:50:27Z

rfc_samplers.md

+  understood.
+
+
+Option 1


I prefer Option 1 over Option 2. My reasons:

To me, good OO-design is when an object has some internal state, and its methods are for either performing actions on that state, or retrieving some part of that state. The ClipSampler in Option 2 is not that.

I think it's unlikely that users will create a ClipSampler and call more than one sampling strategy on it in the same Python process. That is, I think users will kick off distinct training runs, where each training run changes the sampling strategy.

I think this problem is most straight-forwardly modeled as functions.

scotts · 2024-09-14T03:52:27Z

rfc_samplers.md

+    num_frames_per_clip=4,
+    step_between_frames=2,  # often called "dilation"
+    # sampler-specific params:
+    prioritize_keyframes=False,  # to implement "fast" sampling.


I can see this as a parameter or a different top-level function. As to name, I think we would want something more about the perceivable qualities to users: accuracy versus speed. We could implement that by prioritizing keyframes or not.

scotts · 2024-09-14T04:05:10Z

rfc_samplers.md

+    # sampler-specific params:
+    clip_range_start_seconds=3,
+    clip_range_end_seconds=4,
+)


I do think @ahmadsharif1's point about variable FPS is relevant here. Clearly, get_event_timed_clips() is all about PTS as "time" is in the name. If we respect PTS for it (and I think we should), that takes care of variable FPS videos.

However, get_uniformly_random_clips() and get_evenly_spaced_clips() is less clear. Do we want to be uniformly random or evenly spaced across all of the frames, or across the time duration of the video? I think both behaviors can be reasonable for a user to want. That may need to be a parameter option, or different functions entirely.

The time-based equivalent of get_evenly_spaced_clips() is get_evenly_timed_clips().

For the random one, I agree we need to think more. There is also the "step_betwee_frames" parameter which right now is only index-based. We may want it to be time-based as well, and some users may want it to be time-based on an index-based random sampler (and vice-versa).

scotts · 2024-09-14T04:07:30Z

rfc_samplers.md

+
+- should the returned `clips` be...?
+  - a List[FrameBatch] where the FrameBatch.data is 4D
+  - a FrameBatch where the FrameBatch.data is 5D


Can we try to survey use-cases here? My default is that we should favor returning tensors over lists. But if we discover that users may want different dimensions, we can't return tensors.

ahmadsharif1

flushing out some comments

ahmadsharif1 · 2024-09-17T17:55:08Z

rfc_samplers.md

+
+
+def samplers.clips_at_random_timestamps(
+    decoder,


From a pure performance POV, passing a SimpleVideoDecoder could potentially be slow because it scans the entire file.

Also, do we plan to handle multiple streams or audio here? SimpleVideoDecoder doesn't handle those

Also, do we plan to handle multiple streams or audio here? SimpleVideoDecoder doesn't handle those

This is out of scope for this design as we don't have a multi-stream decoder yet.

ahmadsharif1 · 2024-09-17T17:56:21Z

rfc_samplers.md

+    *,
+    num_clips,
+    num_frames_per_clip=1,
+    seconds_between_frames=None,


Nit: should this be seconds_between_frames or seconds_between_timestamps?

A frame is an interval, not a single point in time. I think for time-based sampling it makes sense to talk about single timestamps, not intervals.

The output of this function is a sequence of clips where a clip is a batch of frames. A frame is a well-defined concept within the scope of these samplers, so I don't think we need to talk about timestamps here. One may ask: the timestamps of what? The clips? The frames? The ...?.

Thinking about this more (refer to chat we had):

For time-based samplers the user has to give us the clip duration AND frames per clip both.

Your implementation will first determine the clip_starts in terms of seconds. To get the clip_ends you simply add the clip_duration (passed by the user) to the clip_starts.

Then to get the frames, you can subdivide the [clip_start, clip_end) interval into frames_per_clip segments and ask for the frames at the beginning of each segment using get_frame_displayed_at.

I think that's how this and the at_regular_timestamps below should work.

WDYT?

I think users don't think in terms of clip duration. To them, a "clip" is something they pass to their model for training, which means that all clips must have a constant number of frames. So the time duration of a clip is not important to them, but the number of frames is.

What I think users think about when sampling from a video:

How many frames are in each clip?

How do I determine where each clips starts?

Inside of each clip, how far apart should each frame be?

In this RFC, that last question is answered by seconds_between_frames and num_indices_between_frames. Specifying the distance between frames (in indices or time) means we don't need to specify the clip duration.

In terms of usability, I think specifying how far apart individual frames should be scales easier than clip duration. That is, when video length can vary by order of magnitudes, I think it's harder to specify a clip duration that makes sense across that spectrum.

I agree with @scotts

However to make the naming consistent I think it should be called seconds_between_samples or something like that. A frame is an interval not a snapshot in time. What the user wants is seconds between samples chosen and number of samples.

Note that if you specify two of {seconds_between_samples, num_samples_in_clip, clip_duration} the third is constrained to a single value. So specifying any two is enough. I suggested the user give us clip_duration and num_samples_in_clip, but the more intuitive may be num_samples_in_clip and seconds_between_samples

The last step is to get the frames that are displayed at those samples.

(I used samples instead of frames because two samples could land in the same frame, in which case we will be returning two copies of the same frame)

To give an extreme example -- if the whole video is a single frame of 10 seconds, and if the user specifies seconds_between_frames=1, you would end up returning copies of the same frame in the clip. And there would not be 1 second between frames.

There would be 1 second between the samples (which are instants in time) the sampler chose -- and those samples just happen to land on the same frame.

Let's discuss more today about the right parametrization (clip_duration vs seconds_between_frames).

Regarding using the term frame vs sample vs timestamp: let's try to leave that for a separate, higher-level discussion.

ahmadsharif1 · 2024-09-17T17:57:06Z

rfc_samplers.md

+    *,
+    num_clips_per_second,
+    num_frames_per_clip=1,
+    seconds_between_frames=None,


same as nit above

ahmadsharif1 · 2024-09-17T18:03:56Z

rfc_samplers.md

+) -> FrameBatch
+```
+
+Discussion / Notes


Can you talk about error scenarios here?

What if there aren't enough frames in the video? Do we return partial results?

I wanted this RFC to focus on the interface more than on lower-level details, as there are enough discussion to have on the interface already. I'd suggest to leave these discussions for the implementation PRs.
When there are not enough frames some libs "wrap around", and some return partial results. We'll need to decide which one[s] we want to support.

ahmadsharif1 · 2024-09-17T18:04:17Z

rfc_samplers.md

+    num_clips_per_second,
+    num_frames_per_clip=1,
+    seconds_between_frames=None,
+    clip_range_start_seconds: float = 0,


why isn't this -inf?

I'm not sure we want or need to expose the fact that some streams may have negative start times. This is pretty unintuitive. If 0 isn't correct enough, I'd suggest to just use None of the default of both clip_range_start_seconds and clip_range_end_seconds

ahmadsharif1 · 2024-09-17T18:05:56Z

rfc_samplers.md

+    seconds_between_frames=None,
+) -> FrameBatch
+
+def samplers.clips_at_regular_indices(


For this and clips_at_regular_timestamps(), both clips and frames within the cips are at regular indices/timestamps respectively, right? Is there a name that would convey to the user that information?

The name already convey the fact that clips are at regular indices/timestamps.
For frames within a clip, I don't believe this is necessary to convey. This is not a key aspect of these samplers, and it's only captured in a single, optional parameter. We don't convey that in the random sampler either.

scotts · 2024-09-20T21:03:04Z

rfc_samplers.md

+def samplers.clips_at_regular_timestamps(
+    decoder,
+    *,
+    num_clips_per_second,


I find num_clips_per_second confusing to reason about because we need to think about fractions of a clip. That is, because we've fixed the unit to be 1 second ("per second"), if we want a clip to span multiple seconds, we'll need to specify what fraction of the clip is taken up by 1 second. For example, if a user wants a clip to start every 3 seconds, I believe they would need to say num_clips_per_second=1/3. That is, each second has 1/3 of a clip, so we need 3 seconds before we get a full clip.

I have been thinking of how to specify the start in time as a period instead of a rate. So more intuitive to me is clips_start_period_seconds. In order to achieve the above, I would say clips_start_period_seconds=3. A clip starts every 3 seconds. Granted, if I want multiple clips in one second, we have to provide values less than 1 again: clips_start_period_seconds=1/3 would be starting a clip every 1/3 of a second.

Personally, I find it more intuitive to think of fractions of a second than fractions of a clip. And I find it more intuitive to think in terms of a period instead of a rate.

I think it would be more consistent if the user specifies both values as periods -- seconds_between_clip_starts and seconds_between_samples (or seconds_between_frames)

OR

the user specifies both values as a frequency -- samples_per_second and clip_starts_per_second

num_clips_per_seconds has 2 clear advantages over clips_start_period_seconds IMHO:

It is what existing samplers use. It reduces migration friction.

It is immediately obvious what it means, while that's not the case for clips_start_period_seconds

Personally, I find it more intuitive to think of fractions of a second than fractions of a clip

I share the same view, and that's precisely why I tend to think num_clips_per_seconds is easier to grasp: to me, a fraction of a second implies a "per_second" parameter.

Note that I don't think this premise is quite exact:

if we want a clip to span multiple seconds, we'll need to specify what fraction of the clip is taken up by 1 second

The time spanned by a clip isn't just determined by this parameter, it also depend on the number of frames within that clip (and the framerate).

Happy to chat more!

It is immediately obvious what it means, while that's not the case for clips_start_period_seconds

The reason I made this comment is that num_clips_per_second was not immediately obvious to me. :) I had to start writing the comment, and doing unit analysis, to piece it together. If we can come up with a name better than clip_start_period_seconds that allows users to specify a clip period, then that's great.

FWIW, I like seconds_between_clip_starts and seconds_between_samples_within_clip (both are periods, not frequencies).

RFC Sampling APIs

d7a515e

facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Sep 10, 2024

typo

2708d3e

scotts reviewed Sep 14, 2024

View reviewed changes

Update

93bb693

ahmadsharif1 reviewed Sep 17, 2024

View reviewed changes

scotts reviewed Sep 20, 2024

View reviewed changes

NicolasHug closed this Oct 24, 2024

[RFC] Sampling APIs #206

[RFC] Sampling APIs #206

Uh oh!

Conversation

NicolasHug commented Sep 10, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ahmadsharif1 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

NicolasHug Sep 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ahmadsharif1 Sep 20, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ahmadsharif1 Sep 20, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ahmadsharif1 Sep 20, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

NicolasHug Sep 23, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

NicolasHug Sep 17, 2024 •

edited

Loading

ahmadsharif1 Sep 20, 2024 •

edited

Loading

ahmadsharif1 Sep 20, 2024 •

edited

Loading

ahmadsharif1 Sep 20, 2024 •

edited

Loading

NicolasHug Sep 23, 2024 •

edited

Loading