rfc_samplers.md

Feature space
-------------

At a high level, a sampler returns a sequence of "clips", where a clip is a batch
of (not necessarily contiguous) frames. Based on an analysis of the sampling
options of other libraries (torchvision, torchmultimodal, and various internal
libraries), we want to support the following use-cases:


**Clip location** (and number thereof)

- Get `num_clips_per_video` random clips: typically called the "Random" sampler.
- Get `num_clips_per_video` equally-spaced clips: the "Uniform" sampler. Spacing is
  index-based.
- Get `num_clips_per_second` clips per second [within the [`start_seconds`,
  `stop_seconds`] sampling range]: the "Periodic" sampler.

**Clip content**

- A clip must contain `num_frames_per_clip` frames, spaced `dilation` frames apart.

Notes:

- There are 3 sampling strategies defining the clip locations: Random, Uniform
  and Periodic. The names are prone to confusion: "Random" is in fact uniformly
  random, while "Uniform" isn't random at all. "Uniform" is periodic, and
  "Periodic" is uniformly spaced too. (See the index-based sketch after these
  notes.)
- The parametrization of the *Clip Content* dimension is limited and
  reduces to just `num_frames_per_clip` and the index-based `dilation`.
- The *Clip Content* dimension is never random. `num_frames_per_clip` is always
  passed by the user. This makes sense: training typically requires fixed-size
  inputs, so `num_frames_per_clip` is expected to be constant. The content of a
  clip (the spacing between frames) is always regular, never random.
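
Below is a minimal, index-based sketch of the three clip-location strategies.
It is purely illustrative and not part of the proposal; all names and the
index/fps conversion are assumptions made for the example.

```py
import torch

def random_clip_starts(num_frames, num_clips_per_video):
    # "Random" sampler: clip start indices drawn uniformly at random.
    return torch.randint(0, num_frames, (num_clips_per_video,))

def uniform_clip_starts(num_frames, num_clips_per_video):
    # "Uniform" sampler: equally-spaced clip start indices over the video.
    return torch.linspace(0, num_frames - 1, num_clips_per_video).long()

def periodic_clip_starts(duration_seconds, num_clips_per_second, average_fps):
    # "Periodic" sampler: one clip every 1 / num_clips_per_second seconds,
    # mapped back to indices with an average fps for illustration only.
    starts_seconds = torch.arange(0, duration_seconds, 1 / num_clips_per_second)
    return (starts_seconds * average_fps).long()
```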


Proposal
--------

```py
from torchcodec.decoders import SimpleVideoDecoder
from torchcodec import samplers

decoder = SimpleVideoDecoder("./cool_vid")


def samplers.clips_at_random_timestamps(
decoder,
Contributor:

From a pure performance POV, passing a SimpleVideoDecoder could potentially be slow because it scans the entire file.

Also, do we plan to handle multiple streams or audio here? SimpleVideoDecoder doesn't handle those

Contributor Author:

> Also, do we plan to handle multiple streams or audio here? SimpleVideoDecoder doesn't handle those

This is out of scope for this design as we don't have a multi-stream decoder yet.

*,
num_clips,
num_frames_per_clip=1,
seconds_between_frames=None,
Contributor:

Nit: should this be seconds_between_frames or seconds_between_timestamps?

A frame is an interval, not a single point in time. I think for time-based sampling it makes sense to talk about single timestamps, not intervals.

Contributor Author (@NicolasHug, Sep 17, 2024):

The output of this function is a sequence of clips, where a clip is a batch of frames. A frame is a well-defined concept within the scope of these samplers, so I don't think we need to talk about timestamps here. One may ask: the timestamps of what? The clips? The frames? The...?

Contributor (@ahmadsharif1, Sep 20, 2024):

Thinking about this more (referring to the chat we had):

For time-based samplers the user has to give us the clip duration AND frames per clip both.

Your implementation will first determine the clip_starts in terms of seconds. To get the clip_ends you simply add the clip_duration (passed by the user) to the clip_starts.

Then to get the frames, you can subdivide the [clip_start, clip_end) interval into frames_per_clip segments and ask for the frames at the beginning of each segment using get_frame_displayed_at.

I think that's how this and the at_regular_timestamps below should work.

WDYT?
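
A sketch of the approach described in this comment, purely for illustration:
`get_frame_displayed_at` is used as named above; everything else is a made-up
helper, not proposed API.

```py
def clip_from_start(decoder, clip_start, clip_duration, frames_per_clip):
    # clip_end = clip_start + clip_duration; subdivide [clip_start, clip_end)
    # into frames_per_clip segments and take the frame at each segment start.
    segment = clip_duration / frames_per_clip
    return [
        decoder.get_frame_displayed_at(clip_start + i * segment)
        for i in range(frames_per_clip)
    ]
```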

Contributor:

I think users don't think in terms of clip duration. To them, a "clip" is something they pass to their model for training, which means that all clips must have a constant number of frames. So the time duration of a clip is not important to them, but the number of frames is.

What I think users think about when sampling from a video:

- How many frames are in each clip?
- How do I determine where each clip starts?
- Inside of each clip, how far apart should each frame be?

In this RFC, that last question is answered by seconds_between_frames and num_indices_between_frames. Specifying the distance between frames (in indices or time) means we don't need to specify the clip duration.

In terms of usability, I think specifying how far apart individual frames should be scales better than specifying a clip duration. That is, when video lengths can vary by orders of magnitude, it's harder to pick a clip duration that makes sense across that whole spectrum.

Contributor (@ahmadsharif1, Sep 20, 2024):

I agree with @scotts

However, to make the naming consistent I think it should be called seconds_between_samples or something like that. A frame is an interval, not a snapshot in time. What the user wants is the seconds between the samples chosen and the number of samples.

Note that if you specify two of {seconds_between_samples, num_samples_in_clip, clip_duration}, the third is constrained to a single value, so specifying any two is enough. I suggested the user give us clip_duration and num_samples_in_clip, but the more intuitive pair may be num_samples_in_clip and seconds_between_samples.

The last step is to get the frames that are displayed at those samples.

(I used samples instead of frames because two samples could land in the same frame, in which case we will be returning two copies of the same frame)
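
Illustrative arithmetic for the constraint mentioned above. Whether the last
sample's interval counts toward the duration is a convention choice; here it
does not, and all names are just placeholders:

```py
seconds_between_samples = 0.5
num_samples_in_clip = 8
# Given any two of the three quantities, the third is fixed:
clip_duration = (num_samples_in_clip - 1) * seconds_between_samples  # 3.5 seconds
```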

Contributor:

To give an extreme example -- if the whole video is a single frame of 10 seconds, and if the user specifies seconds_between_frames=1, you would end up returning copies of the same frame in the clip. And there would not be 1 second between frames.

There would be 1 second between the samples (which are instants in time) the sampler chose -- and those samples just happen to land on the same frame.

Contributor Author:

Let's discuss more today about the right parametrization (clip_duration vs seconds_between_frames).

Regarding using the term frame vs sample vs timestamp: let's try to leave that for a separate, higher-level discussion.

) -> FrameBatch

def samplers.clips_at_regular_indices(
Contributor:

For this and clips_at_regular_timestamps(), both the clips and the frames within the clips are at regular indices/timestamps respectively, right? Is there a name that would convey that information to the user?

Contributor Author:

The name already conveys the fact that clips are at regular indices/timestamps.
For frames within a clip, I don't believe this is necessary to convey. It's not a key aspect of these samplers, and it's only captured in a single, optional parameter. We don't convey it in the random sampler either.

decoder,
*,
num_clips,
num_frames_per_clip=1,
num_indices_between_frames=1, # in indices (dilation)
) -> FrameBatch

def samplers.clips_at_regular_timestamps(
decoder,
*,
num_clips_per_second,
Contributor:

I find num_clips_per_second confusing to reason about because we need to think about fractions of a clip. That is, because we've fixed the unit to be 1 second ("per second"), if we want a clip to span multiple seconds, we'll need to specify what fraction of the clip is taken up by 1 second. For example, if a user wants a clip to start every 3 seconds, I believe they would need to say num_clips_per_second=1/3. That is, each second has 1/3 of a clip, so we need 3 seconds before we get a full clip.

I have been thinking of how to specify the start in time as a period instead of a rate. So more intuitive to me is clips_start_period_seconds. In order to achieve the above, I would say clips_start_period_seconds=3. A clip starts every 3 seconds. Granted, if I want multiple clips in one second, we have to provide values less than 1 again: clips_start_period_seconds=1/3 would be starting a clip every 1/3 of a second.

Personally, I find it more intuitive to think of fractions of a second than fractions of a clip. And I find it more intuitive to think in terms of a period instead of a rate.
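
The two parametrizations under discussion are simply reciprocals; a tiny
illustration (names and values are only for the example):

```py
num_clips_per_second = 1 / 3                             # a third of a clip starts each second...
clips_start_period_seconds = 1 / num_clips_per_second    # ...i.e. a clip starts every 3 seconds
```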

Contributor (@ahmadsharif1, Sep 20, 2024):

I think it would be more consistent if the user specifies both values as periods -- seconds_between_clip_starts and seconds_between_samples (or seconds_between_frames)

OR

the user specifies both values as frequencies -- samples_per_second and clip_starts_per_second

Contributor Author (@NicolasHug, Sep 23, 2024):

num_clips_per_second has 2 clear advantages over clips_start_period_seconds IMHO:

- It is what existing samplers use. It reduces migration friction.
- It is immediately obvious what it means, while that's not the case for clips_start_period_seconds.

> Personally, I find it more intuitive to think of fractions of a second than fractions of a clip

I share the same view, and that's precisely why I tend to think num_clips_per_second is easier to grasp: to me, a fraction of a second implies a "per_second" parameter.

Note that I don't think this premise is quite exact:

> if we want a clip to span multiple seconds, we'll need to specify what fraction of the clip is taken up by 1 second

The time spanned by a clip isn't determined by this parameter alone; it also depends on the number of frames within that clip (and the framerate).

Happy to chat more!

Contributor:

> It is immediately obvious what it means, while that's not the case for clips_start_period_seconds

The reason I made this comment is that num_clips_per_second was not immediately obvious to me. :) I had to start writing the comment, and doing unit analysis, to piece it together. If we can come up with a name better than clip_start_period_seconds that allows users to specify a clip period, then that's great.

Contributor:

FWIW, I like seconds_between_clip_starts and seconds_between_samples_within_clip (both are periods, not frequencies).

num_frames_per_clip=1,
seconds_between_frames=None,
Contributor:

same as nit above

clip_range_start_seconds: float = 0,
Contributor:

why isn't this -inf?

Contributor Author:

I'm not sure we want or need to expose the fact that some streams may have negative start times. This is pretty unintuitive. If 0 isn't correct enough, I'd suggest just using None as the default for both clip_range_start_seconds and clip_range_end_seconds.

clip_range_end_seconds: float = float("inf"),
) -> FrameBatch
```
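
A hypothetical usage sketch of the proposed interface. The sampler functions
above are a proposal, not an existing API, and the parameter values below are
arbitrary:

```py
from torchcodec.decoders import SimpleVideoDecoder
from torchcodec import samplers

decoder = SimpleVideoDecoder("./cool_vid")

# 16 clips at uniformly random timestamps, 8 frames each.
random_clips = samplers.clips_at_random_timestamps(
    decoder, num_clips=16, num_frames_per_clip=8
)

# One clip every 2 seconds within [10s, 60s), 8 frames per clip.
regular_clips = samplers.clips_at_regular_timestamps(
    decoder,
    num_clips_per_second=0.5,
    num_frames_per_clip=8,
    clip_range_start_seconds=10,
    clip_range_end_seconds=60,
)
# Each result is assumed to be a FrameBatch with 5D data, as discussed below.
```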

Discussion / Notes
------------------
Contributor:

Can you talk about error scenarios here?

What if there aren't enough frames in the video? Do we return partial results?

Contributor Author:

I wanted this RFC to focus on the interface more than on lower-level details, as there are enough discussions to have on the interface already. I'd suggest leaving these questions for the implementation PRs.
When there are not enough frames, some libs "wrap around" and some return partial results. We'll need to decide which one[s] we want to support.
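
Purely illustrative, the two behaviors mentioned above when a requested clip
would run past the end of the video (no policy is decided in this RFC):

```py
num_frames_in_video = 100                 # valid indices are 0..99
frame_indices = [98, 99, 100, 101]        # a clip that overshoots the end

wrapped = [i % num_frames_in_video for i in frame_indices]       # "wrap around": [98, 99, 0, 1]
partial = [i for i in frame_indices if i < num_frames_in_video]  # partial result: [98, 99]
```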


The sampler is agnostic to decoding options: the decoder object is passed to the
sampler by the user. This is because we don't want to duplicate the decoder
parametrization in the samplers, and also because it's not up to the sampler to
decide on some frame properties like their shape.

`clips_at_regular_indices` and `clips_at_regular_timestamps` are duals of one
another, and have slightly different parametrizations: `num_clips` vs
`num_clips_per_second`. This respects the existing parametrization of the
surveyed libraries.

The samplers are explicitly time-based or index-based, as reflected in their function names.

We believe that in general, a time-based sampler is closer to the user's mental
model. It is also potentially more "correct" in the sense that it accounts for
variable framerates. This is why the "random" sampler is explicitly implemented
as a time-based sampler. Note that existing "random" samplers are implemented
as index-based, which is likely a result of the existing decoders not having
fine-grained pts-based APIs, rather than a deliberate design choice. We may
implement a "random" index-based sampler in the future if there are feature
requests.

The "dilation" parameter has different semantic depending on whether the sampler
is time-based or index-based. It is named either `seconds_between_frames` or
`num_indices_between_frames`. We don't allow `num_indices_between_frames` to be
used within a time-based sampler. The rationale is that if a user uses a
time-based API, it means they care about variable framerates for the clip start
locations, and we infer they also care about variable framerate for the clip
content. We may allow mix-matching in the future depending on user feedback,
either via additional functions, or via mutually exclusive parameters.
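
An illustrative sketch (not part of the proposal) of how the "dilation"
parameter selects frames within a clip under the two APIs, starting from an
arbitrary clip start:

```py
num_frames_per_clip = 4

# Index-based: frames at indices start, start + d, start + 2d, ...
start_index, num_indices_between_frames = 100, 3
indices = [start_index + i * num_indices_between_frames for i in range(num_frames_per_clip)]
# -> [100, 103, 106, 109]

# Time-based: frames displayed at start, start + s, start + 2s, ... seconds
start_seconds, seconds_between_frames = 4.0, 0.25
timestamps = [start_seconds + i * seconds_between_frames for i in range(num_frames_per_clip)]
# -> [4.0, 4.25, 4.5, 4.75]
```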

The term "clip" is being used to denote a sequence of frames in presentation
order. The frames aren't necessarily consecutive. We should add this term to the
glossary. I don't feel too strongly about using the name "clip", but it's used
in all existing libraries, and I don't know of an alternative that would be
universally understood.

Eventually we'll want to implement a "fast and approximate" version of the
random sampler, as some users are happy to sacrifice entropy for speed:
basically, prioritize key-frames in a smart way. This isn't part of this
proposal, to limit the scope of discussions. When we do, this sampler will likely
be exposed as a parameter to the existing random sampler(s), or as a new
function with the same parametrization as the random sampler.

The output of the samplers is a sequence of clips, where a clip is a batch of
frames. Such output can be represented as:
- a `List[FrameBatch]` where the `FrameBatch.data` tensor is 4D
- a `FrameBatch` where the `FrameBatch.data` tensor is 5D. This is the current
  proposal. This can be done as long as `num_frames_per_clip` is not random
  (it is currently never random). See the shape sketch below.
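
Illustrative only: what the proposed 5D layout might look like, assuming
channels-first frames and a (num_clips, num_frames_per_clip, C, H, W) ordering;
the exact ordering isn't specified by this RFC. Tiny dims keep the example cheap:

```py
import torch

num_clips, num_frames_per_clip, C, H, W = 4, 8, 3, 32, 32
data = torch.empty(num_clips, num_frames_per_clip, C, H, W)   # 5D FrameBatch.data
assert data[0].shape == (num_frames_per_clip, C, H, W)        # one clip: a 4D batch of frames
```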