[RFC] Sampling APIs #206
========================
Feature space
-------------
At a high level, a sampler returns a sequence of "clips", where a clip is a batch
of (not necessarily contiguous) frames. Based on an analysis of the sampling
options of other libraries (torchvision, torchmultimodal, and various internal
libraries), we want to support the following use-cases:
**Clip location** (and number thereof)
- Get `num_clips_per_video` random clips: typically called a "Random" sampler.
- Get `num_clips_per_video` equally-spaced clips: "Uniform" sampler. Spacing is
  index-based.
- Get `num_clips_per_second` clips per second, within the [`start_seconds`,
  `stop_seconds`] sampling range: "Periodic" sampler.
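To make the three location strategies concrete, here is a small index-arithmetic sketch. This is not torchcodec code; the helper names, the seeding, and the exact rounding policy are illustrative assumptions.

```python
import random


def random_clip_starts(num_frames, num_clips_per_video, seed=0):
    # "Random" sampler: clip starts drawn uniformly at random over the frame indices.
    rng = random.Random(seed)
    return sorted(rng.randrange(num_frames) for _ in range(num_clips_per_video))


def uniform_clip_starts(num_frames, num_clips_per_video):
    # "Uniform" sampler: clip starts equally spaced in index space.
    return [
        round(i * (num_frames - 1) / (num_clips_per_video - 1))
        for i in range(num_clips_per_video)
    ]


def periodic_clip_starts(start_seconds, stop_seconds, num_clips_per_second):
    # "Periodic" sampler: clip starts equally spaced in time, within the
    # [start_seconds, stop_seconds] sampling range.
    period = 1 / num_clips_per_second
    starts, t = [], start_seconds
    while t < stop_seconds:
        starts.append(t)
        t += period
    return starts
```

For example, `uniform_clip_starts(100, 5)` yields `[0, 25, 50, 74, 99]`, and `periodic_clip_starts(0, 2, 2)` yields `[0, 0.5, 1.0, 1.5]`.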
**Clip content**
- A clip must be `num_frames_per_clip` frames, spaced `dilation` frames apart.
Notes:
- There are 3 sampling strategies defining the clip locations: Random, Uniform,
  and Periodic. The names are prone to confusion: "Random" is in fact uniformly
  random, while "Uniform" isn't random; "Uniform" is also periodic, and
  "Periodic" is uniformly spaced too.
- The parametrization of the *Clip content* dimension is limited and reduces to
  just `num_frames_per_clip` and the index-based `dilation`.
- The *Clip content* dimension is never random. `num_frames_per_clip` is always
  passed by the user. This makes sense: training typically requires fixed-size
  inputs, so `num_frames_per_clip` is expected to be constant. The content of a
  clip (the spacing between frames) is always regular, never random.
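A clip's content is thus fully determined by its start, `num_frames_per_clip`, and the dilation. A hypothetical helper makes this explicit:

```python
def clip_frame_indices(clip_start, num_frames_per_clip, dilation=1):
    # The content of a clip is regular, never random: start at clip_start
    # and take every `dilation`-th frame, num_frames_per_clip times.
    return [clip_start + i * dilation for i in range(num_frames_per_clip)]
```

For example, `clip_frame_indices(10, 4, dilation=2)` returns `[10, 12, 14, 16]`.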
Proposal
--------
```py
from torchcodec.decoders import SimpleVideoDecoder
from torchcodec import samplers

decoder = SimpleVideoDecoder("./cool_vid")


def samplers.clips_at_random_timestamps(
    decoder,
    *,
    num_clips,
    num_frames_per_clip=1,
    seconds_between_frames=None,
) -> FrameBatch


def samplers.clips_at_regular_indices(
    decoder,
    *,
    num_clips,
    num_frames_per_clip=1,
    num_indices_between_frames=1,  # in indices (dilation)
) -> FrameBatch


def samplers.clips_at_regular_timestamps(
    decoder,
    *,
    num_clips_per_second,
    num_frames_per_clip=1,
    seconds_between_frames=None,
    clip_range_start_seconds: float = 0,
    clip_range_end_seconds: float = float("inf"),
) -> FrameBatch
```

**Review comments**

*On `seconds_between_frames`:*

> **Reviewer:** Nit: should this be `seconds_between_frames` or `seconds_between_timestamps`? A frame is an interval, not a single point in time. I think for time-based sampling it makes sense to talk about single timestamps, not intervals.

> **Author:** The output of this function is a sequence of clips where a clip is a batch of frames. A frame is a well-defined concept within the scope of these samplers, so I don't think we need to talk about timestamps here. One may ask: the timestamps of what? The clips? The frames?

> **Reviewer:** Thinking about this more (refer to the chat we had): for time-based samplers, the user has to give us both the clip duration AND the frames per clip. The implementation would first determine the clip starts in terms of seconds. To get the clip ends, you simply add the clip duration (passed by the user) to the clip starts. Then, to get the frames, you can subdivide the [clip_start, clip_end) interval. WDYT?

> **Author:** I think users don't think in terms of clip duration. To them, a "clip" is something they pass to their model for training, which means that all clips must have a constant number of frames. So the time duration of a clip is not important to them, but the number of frames is. In this RFC, that last question is answered by `seconds_between_frames`. In terms of usability, I think specifying how far apart individual frames should be scales more easily than clip duration: when video length can vary by orders of magnitude, it's harder to specify a clip duration that makes sense across that spectrum.

> **Reviewer:** I agree with @scotts. However, to make the naming consistent, I think it should be called `seconds_between_samples` or something like that. A frame is an interval, not a snapshot in time. What the user wants is the seconds between the samples chosen, and the number of samples. Note that if you specify two of {`seconds_between_samples`, `num_samples_in_clip`, `clip_duration`}, the third is constrained to a single value, so specifying any two is enough. I suggested the user give us `clip_duration` and `num_samples_in_clip`, but the more intuitive pair may be `num_samples_in_clip` and `seconds_between_samples`. The last step is to get the frames that are displayed at those samples. (I used "samples" instead of "frames" because two samples could land in the same frame, in which case we will be returning two copies of the same frame.)

> **Reviewer:** To give an extreme example: if the whole video is a single frame of 10 seconds and the user specifies `seconds_between_frames=1`, you would end up returning copies of the same frame in the clip, and there would not be 1 second between frames. There would be 1 second between the samples (which are instants in time) the sampler chose, and those samples just happen to land on the same frame.

> **Author:** Let's discuss more today about the right parametrization (`clip_duration` vs `seconds_between_frames`). Regarding the term frame vs sample vs timestamp: let's try to leave that for a separate, higher-level discussion.

*On the name `clips_at_regular_indices`:*

> **Reviewer:** For this and `clips_at_regular_timestamps()`, both the clips and the frames within the clips are at regular indices/timestamps respectively, right? Is there a name that would convey that information to the user?

> **Author:** The name already conveys the fact that clips are at regular indices/timestamps.

*On `num_clips_per_second`:*

> **Reviewer:** I have been thinking of how to specify the start in time as a period instead of a rate. Personally, I find it more intuitive to think of fractions of a second than fractions of a clip, and more intuitive to think in terms of a period instead of a rate.

> **Reviewer:** I think it would be more consistent if the user specifies both values as periods -- `seconds_between_clip_starts` and `seconds_between_samples` (or `seconds_between_frames`) -- OR both values as a frequency -- `samples_per_second` and `clip_starts_per_second`.

> **Author:** I share the same view. Note that I don't think this premise is quite exact: the time spanned by a clip isn't just determined by this parameter, it also depends on the number of frames within that clip (and the framerate). Happy to chat more!

*On `seconds_between_frames` in `clips_at_regular_timestamps`:*

> **Reviewer:** Same as the nit above.

*On `clip_range_start_seconds`:*

> **Reviewer:** Why isn't this `-inf`?

> **Author:** I'm not sure we want or need to expose the fact that some streams may have negative start times. This is pretty unintuitive.
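To illustrate how `clips_at_regular_indices` combines the two dimensions (clip location and clip content), here is a pure index-arithmetic sketch. It is not the proposed implementation, and the bounds handling and rounding are assumptions:

```python
def clips_at_regular_indices_sketch(
    num_frames, *, num_clips, num_frames_per_clip=1, num_indices_between_frames=1
):
    # Index span covered by one clip, so the last clip stays in bounds.
    span = (num_frames_per_clip - 1) * num_indices_between_frames + 1
    last_start = num_frames - span
    # Clip starts equally spaced in index space.
    starts = [
        round(i * last_start / max(num_clips - 1, 1)) for i in range(num_clips)
    ]
    # Each clip is num_frames_per_clip frames spaced by the dilation.
    return [
        [s + j * num_indices_between_frames for j in range(num_frames_per_clip)]
        for s in starts
    ]
```

For example, with 10 frames, `num_clips=3` and `num_frames_per_clip=2`, this yields the clips `[[0, 1], [4, 5], [8, 9]]`.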
Discussion / Notes
------------------

> **Reviewer:** Can you talk about error scenarios here? What if there aren't enough frames in the video? Do we return partial results?

> **Author:** I wanted this RFC to focus on the interface more than on lower-level details, as there are enough discussions to have on the interface already. I'd suggest leaving these for the implementation PRs.
The sampler is agnostic to decoding options: the decoder object is passed to the
sampler by the user. This is because we don't want to duplicate the decoder
parametrization in the samplers, and also because it's not up to the sampler to
decide on frame properties like their shape.
`clips_at_regular_indices` and `clips_at_regular_timestamps` are duals of one
another, with slightly different parametrizations: `num_clips` vs
`num_clips_per_second`. This is to respect the existing parametrization of the
surveyed libraries.
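The two parametrizations are interconvertible given a finite sampling range. A sketch of the conversion (the rounding policy is an assumption, not part of the proposal):

```python
import math


def num_clips_from_rate(
    num_clips_per_second, clip_range_start_seconds, clip_range_end_seconds
):
    # A rate over a finite range implies a total count; round up so that a
    # trailing partial period still yields a clip.
    duration = clip_range_end_seconds - clip_range_start_seconds
    return math.ceil(num_clips_per_second * duration)
```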
The samplers are explicitly time-based or index-based, as indicated by their
function names.
We believe that, in general, a time-based sampler is closer to the user's mental
model. It is also potentially more "correct" in the sense that it accounts for
variable framerates. This is why the "random" sampler is explicitly implemented
as a time-based sampler. Note that existing "random" samplers are implemented
as index-based, which is likely a result of the existing decoders not having
fine-grained pts-based APIs, rather than a deliberate design choice. We may
implement a "random" index-based sampler in the future if there are feature
requests.
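The following toy example shows why this matters. On a variable-framerate video (the pts list is hypothetical, dense at the start and sparse afterwards), equally-spaced indices cluster in the dense region, while equally-spaced timestamps cover the whole duration:

```python
import bisect

# Hypothetical pts (in seconds) for a variable-framerate video:
# frames every 0.1s at first, then every 1s.
pts = [0.0, 0.1, 0.2, 0.3, 0.4, 1.4, 2.4, 3.4, 4.4, 5.4]


def sample_index_based(pts, num_clips):
    # Equally spaced in index space.
    step = (len(pts) - 1) / (num_clips - 1)
    return [pts[round(i * step)] for i in range(num_clips)]


def sample_time_based(pts, num_clips):
    # Equally spaced in time; each target maps to the frame displayed at it.
    duration = pts[-1]
    targets = [i * duration / (num_clips - 1) for i in range(num_clips)]
    return [pts[max(bisect.bisect_right(pts, t) - 1, 0)] for t in targets]
```

With `num_clips=3`, the index-based version picks the frames at `[0.0, 0.4, 5.4]` (the middle pick lands inside the dense burst), while the time-based version picks `[0.0, 2.4, 5.4]`.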
The "dilation" parameter has different semantics depending on whether the
sampler is time-based or index-based; it is accordingly named either
`seconds_between_frames` or `num_indices_between_frames`. We don't allow
`num_indices_between_frames` to be used with a time-based sampler. The rationale
is that if a user uses a time-based API, they care about variable framerates for
the clip start locations, and we infer they also care about variable framerates
for the clip content. We may allow mixing and matching in the future depending
on user feedback, either via additional functions or via mutually exclusive
parameters.
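As a sketch of the two dilation semantics (the helpers are hypothetical): at a constant framerate they select the same frames, and they only diverge when the framerate varies.

```python
import bisect

# Hypothetical constant-framerate pts (2 fps).
pts = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]


def clip_by_index_dilation(pts, start_index, num_frames_per_clip, dilation):
    # Index-based dilation: skip a fixed number of frames between picks.
    return [pts[start_index + i * dilation] for i in range(num_frames_per_clip)]


def clip_by_time_dilation(pts, start_seconds, num_frames_per_clip, seconds_between_frames):
    # Time-based dilation: step by a fixed duration, then find the frame
    # displayed at each target timestamp.
    targets = [
        start_seconds + i * seconds_between_frames
        for i in range(num_frames_per_clip)
    ]
    return [pts[max(bisect.bisect_right(pts, t) - 1, 0)] for t in targets]
```

Here both `clip_by_index_dilation(pts, 0, 3, 2)` and `clip_by_time_dilation(pts, 0.0, 3, 1.0)` yield the frames at `[0.0, 1.0, 2.0]`, precisely because the framerate is constant.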
The term "clip" is used to denote a sequence of frames in presentation order;
the frames aren't necessarily consecutive. We should add this term to the
glossary. I don't feel too strongly about the name "clip", but it's used in all
existing libraries, and I don't know of an alternative that would be universally
understood.
Eventually, we'll want to implement a "fast and approximate" version of the
random sampler, as some users are happy to sacrifice entropy for speed:
basically, prioritize key-frames in a smart way. This isn't part of this
proposal, to limit the scope of discussions. That sampler will likely be exposed
as a parameter of the existing random sampler(s), or as a new function with the
same parametrization as the random sampler.
The output of the samplers is a sequence of clips, where a clip is a batch of
frames. Such output can be represented as:
- a `List[FrameBatch]`, where each `FrameBatch.data` tensor is 4D, or
- a `FrameBatch` where the `FrameBatch.data` tensor is 5D. This is the current
  proposal. It is possible as long as `num_frames_per_clip` is not random
  (it is currently never random).
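A shape-level sketch of the two representations (the frame dimensions are made up for illustration):

```python
num_clips, num_frames_per_clip = 4, 8
C, H, W = 3, 224, 224  # hypothetical frame shape (channels, height, width)

# Option 1: List[FrameBatch], one 4D data tensor per clip.
list_of_4d_shapes = [(num_frames_per_clip, C, H, W)] * num_clips

# Option 2 (current proposal): a single FrameBatch with a 5D data tensor.
# Stacking is only possible because num_frames_per_clip is constant across clips.
stacked_5d_shape = (num_clips, num_frames_per_clip, C, H, W)
```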
> **Reviewer:** From a pure performance POV, passing a `SimpleVideoDecoder` could potentially be slow because it scans the entire file. Also, do we plan to handle multiple streams or audio here? `SimpleVideoDecoder` doesn't handle those.

> **Author:** This is out of scope for this design, as we don't have a multi-stream decoder yet.