Skip to content

Conversation

NicolasHug
Copy link
Contributor

No description provided.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Sep 10, 2024
rfc_samplers.md Outdated
understood.


Option 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer Option 1 over Option 2. My reasons:

  1. To me, good OO-design is when an object has some internal state, and its methods are for either performing actions on that state, or retrieving some part of that state. The ClipSampler in Option 2 is not that.
  2. I think it's unlikely that users will create a ClipSampler and call more than one sampling strategy on it in the same Python process. That is, I think users will kick off distinct training runs, where each training run changes the sampling strategy.
  3. I think this problem is most straight-forwardly modeled as functions.

rfc_samplers.md Outdated
num_frames_per_clip=4,
step_between_frames=2, # often called "dilation"
# sampler-specific params:
prioritize_keyframes=False, # to implement "fast" sampling.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see this as a parameter or a different top-level function. As to name, I think we would want something more about the perceivable qualities to users: accuracy versus speed. We could implement that by prioritizing keyframes or not.

rfc_samplers.md Outdated
# sampler-specific params:
clip_range_start_seconds=3,
clip_range_end_seconds=4,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think @ahmadsharif1's point about variable FPS is relevant here. Clearly, get_event_timed_clips() is all about PTS as "time" is in the name. If we respect PTS for it (and I think we should), that takes care of variable FPS videos.

However, get_uniformly_random_clips() and get_evenly_spaced_clips() is less clear. Do we want to be uniformly random or evenly spaced across all of the frames, or across the time duration of the video? I think both behaviors can be reasonable for a user to want. That may need to be a parameter option, or different functions entirely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The time-based equivalent of get_evenly_spaced_clips() is get_evenly_timed_clips().

For the random one, I agree we need to think more. There is also the "step_betwee_frames" parameter which right now is only index-based. We may want it to be time-based as well, and some users may want it to be time-based on an index-based random sampler (and vice-versa).

rfc_samplers.md Outdated

- should the returned `clips` be...?
- a List[FrameBatch] where the FrameBatch.data is 4D
- a FrameBatch where the FrameBatch.data is 5D
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we try to survey use-cases here? My default is that we should favor returning tensors over lists. But if we discover that users may want different dimensions, we can't return tensors.

Copy link
Contributor

@ahmadsharif1 ahmadsharif1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flushing out some comments



def samplers.clips_at_random_timestamps(
decoder,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a pure performance POV, passing a SimpleVideoDecoder could potentially be slow because it scans the entire file.

Also, do we plan to handle multiple streams or audio here? SimpleVideoDecoder doesn't handle those

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do we plan to handle multiple streams or audio here? SimpleVideoDecoder doesn't handle those

This is out of scope for this design as we don't have a multi-stream decoder yet.

*,
num_clips,
num_frames_per_clip=1,
seconds_between_frames=None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: should this be seconds_between_frames or seconds_between_timestamps?

A frame is an interval, not a single point in time. I think for time-based sampling it makes sense to talk about single timestamps, not intervals.

Copy link
Contributor Author

@NicolasHug NicolasHug Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The output of this function is a sequence of clips where a clip is a batch of frames. A frame is a well-defined concept within the scope of these samplers, so I don't think we need to talk about timestamps here. One may ask: the timestamps of what? The clips? The frames? The ...?.

Copy link
Contributor

@ahmadsharif1 ahmadsharif1 Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this more (refer to chat we had):

For time-based samplers the user has to give us the clip duration AND frames per clip both.

Your implementation will first determine the clip_starts in terms of seconds. To get the clip_ends you simply add the clip_duration (passed by the user) to the clip_starts.

Then to get the frames, you can subdivide the [clip_start, clip_end) interval into frames_per_clip segments and ask for the frames at the beginning of each segment using get_frame_displayed_at.

I think that's how this and the at_regular_timestamps below should work.

WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think users don't think in terms of clip duration. To them, a "clip" is something they pass to their model for training, which means that all clips must have a constant number of frames. So the time duration of a clip is not important to them, but the number of frames is.

What I think users think about when sampling from a video:

  • How many frames are in each clip?
  • How do I determine where each clips starts?
  • Inside of each clip, how far apart should each frame be?

In this RFC, that last question is answered by seconds_between_frames and num_indices_between_frames. Specifying the distance between frames (in indices or time) means we don't need to specify the clip duration.

In terms of usability, I think specifying how far apart individual frames should be scales easier than clip duration. That is, when video length can vary by order of magnitudes, I think it's harder to specify a clip duration that makes sense across that spectrum.

Copy link
Contributor

@ahmadsharif1 ahmadsharif1 Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @scotts

However to make the naming consistent I think it should be called seconds_between_samples or something like that. A frame is an interval not a snapshot in time. What the user wants is seconds between samples chosen and number of samples.

Note that if you specify two of {seconds_between_samples, num_samples_in_clip, clip_duration} the third is constrained to a single value. So specifying any two is enough. I suggested the user give us clip_duration and num_samples_in_clip, but the more intuitive may be num_samples_in_clip and seconds_between_samples

The last step is to get the frames that are displayed at those samples.

(I used samples instead of frames because two samples could land in the same frame, in which case we will be returning two copies of the same frame)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To give an extreme example -- if the whole video is a single frame of 10 seconds, and if the user specifies seconds_between_frames=1, you would end up returning copies of the same frame in the clip. And there would not be 1 second between frames.

There would be 1 second between the samples (which are instants in time) the sampler chose -- and those samples just happen to land on the same frame.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss more today about the right parametrization (clip_duration vs seconds_between_frames).

Regarding using the term frame vs sample vs timestamp: let's try to leave that for a separate, higher-level discussion.

*,
num_clips_per_second,
num_frames_per_clip=1,
seconds_between_frames=None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as nit above

) -> FrameBatch
```

Discussion / Notes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you talk about error scenarios here?

What if there aren't enough frames in the video? Do we return partial results?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted this RFC to focus on the interface more than on lower-level details, as there are enough discussion to have on the interface already. I'd suggest to leave these discussions for the implementation PRs.
When there are not enough frames some libs "wrap around", and some return partial results. We'll need to decide which one[s] we want to support.

num_clips_per_second,
num_frames_per_clip=1,
seconds_between_frames=None,
clip_range_start_seconds: float = 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why isn't this -inf?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we want or need to expose the fact that some streams may have negative start times. This is pretty unintuitive. If 0 isn't correct enough, I'd suggest to just use None of the default of both clip_range_start_seconds and clip_range_end_seconds

seconds_between_frames=None,
) -> FrameBatch

def samplers.clips_at_regular_indices(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this and clips_at_regular_timestamps(), both clips and frames within the cips are at regular indices/timestamps respectively, right? Is there a name that would convey to the user that information?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name already convey the fact that clips are at regular indices/timestamps.
For frames within a clip, I don't believe this is necessary to convey. This is not a key aspect of these samplers, and it's only captured in a single, optional parameter. We don't convey that in the random sampler either.

def samplers.clips_at_regular_timestamps(
decoder,
*,
num_clips_per_second,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find num_clips_per_second confusing to reason about because we need to think about fractions of a clip. That is, because we've fixed the unit to be 1 second ("per second"), if we want a clip to span multiple seconds, we'll need to specify what fraction of the clip is taken up by 1 second. For example, if a user wants a clip to start every 3 seconds, I believe they would need to say num_clips_per_second=1/3. That is, each second has 1/3 of a clip, so we need 3 seconds before we get a full clip.

I have been thinking of how to specify the start in time as a period instead of a rate. So more intuitive to me is clips_start_period_seconds. In order to achieve the above, I would say clips_start_period_seconds=3. A clip starts every 3 seconds. Granted, if I want multiple clips in one second, we have to provide values less than 1 again: clips_start_period_seconds=1/3 would be starting a clip every 1/3 of a second.

Personally, I find it more intuitive to think of fractions of a second than fractions of a clip. And I find it more intuitive to think in terms of a period instead of a rate.

Copy link
Contributor

@ahmadsharif1 ahmadsharif1 Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be more consistent if the user specifies both values as periods -- seconds_between_clip_starts and seconds_between_samples (or seconds_between_frames)

OR

the user specifies both values as a frequency -- samples_per_second and clip_starts_per_second

Copy link
Contributor Author

@NicolasHug NicolasHug Sep 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

num_clips_per_seconds has 2 clear advantages over clips_start_period_seconds IMHO:

  • It is what existing samplers use. It reduces migration friction.
  • It is immediately obvious what it means, while that's not the case for clips_start_period_seconds

Personally, I find it more intuitive to think of fractions of a second than fractions of a clip

I share the same view, and that's precisely why I tend to think num_clips_per_seconds is easier to grasp: to me, a fraction of a second implies a "per_second" parameter.

Note that I don't think this premise is quite exact:

if we want a clip to span multiple seconds, we'll need to specify what fraction of the clip is taken up by 1 second

The time spanned by a clip isn't just determined by this parameter, it also depend on the number of frames within that clip (and the framerate).

Happy to chat more!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is immediately obvious what it means, while that's not the case for clips_start_period_seconds

The reason I made this comment is that num_clips_per_second was not immediately obvious to me. :) I had to start writing the comment, and doing unit analysis, to piece it together. If we can come up with a name better than clip_start_period_seconds that allows users to specify a clip period, then that's great.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I like seconds_between_clip_starts and seconds_between_samples_within_clip (both are periods, not frequencies).

@NicolasHug NicolasHug closed this Oct 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants