rfc_samplers.md
===============
**DISCLAIMER** This RFC is currently lacking (at least):

- a "time-based" equivalent of dilation
- a "time-based" equivalent of random uniform sampling.

Main design principles
----------------------

- The existing feature space of the different libraries (torchvision, torchmm,
  internal stuff) is supported. In particular, the "Random", "Uniform" and
  "Periodic" samplers are supported, hopefully with more descriptive names.
- The sampler is agnostic to decoding options: the decoder object is passed to
  the sampler by the user.
- Explicit **non-goal**: supporting arbitrary strategies that we haven't
  observed or heard usage of. For example, in all existing libraries the clip
  content is never random; it is determined by just 2 parameters:
  `num_frames_per_clip` and `step_between_frames` (dilation). We allow for
  future extensibility, but we're not explicitly enabling additional
  clip-content strategies.
- The term "clip" denotes a sequence of frames in presentation order. The
  frames aren't necessarily consecutive (when `step_between_frames > 1`). We
  should add this term to the glossary. I don't feel too strongly about the
  name "clip", but it's used in all existing libraries, and I don't know of an
  alternative that would be universally understood.
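
To make the "clip" terminology concrete, here is a small sketch. The helper name is hypothetical, not part of any proposed API:

```python
# Hypothetical helper (illustration only, not part of the proposed API):
# a clip is a sequence of frame indices in presentation order, spaced by
# step_between_frames (dilation), so the frames need not be consecutive.
def clip_frame_indices(start_index, num_frames_per_clip, step_between_frames):
    return [start_index + i * step_between_frames
            for i in range(num_frames_per_clip)]

# A 4-frame clip starting at frame 10 with dilation 2:
print(clip_frame_indices(10, num_frames_per_clip=4, step_between_frames=2))
# [10, 12, 14, 16]
```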


Option 1
--------

> **Reviewer comment (Contributor):** I prefer Option 1 over Option 2. My
> reasons:
>
> 1. To me, good OO design is when an object has some internal state, and its
>    methods either perform actions on that state or retrieve some part of it.
>    The `ClipSampler` in Option 2 is not that.
> 2. I think it's unlikely that users will create a `ClipSampler` and call more
>    than one sampling strategy on it in the same Python process. That is, I
>    think users will kick off distinct training runs, where each training run
>    changes the sampling strategy.
> 3. I think this problem is most straightforwardly modeled as functions.

- One function per sampling strategy.
- Note: this is not 100% stateless; the decoder object is seeked, so there are
  side effects.

```py
from torchcodec.decoders import SimpleVideoDecoder
from torchcodec import samplers

video = "cool_vid"
decoder = SimpleVideoDecoder(video)

clips = samplers.get_uniformly_random_clips(
    decoder,
    num_clips=12,
    # clip content params:
    num_frames_per_clip=4,
    step_between_frames=2,  # often called "dilation"
    # sampler-specific params:
    prioritize_keyframes=False,  # to implement "fast" sampling.
    # Might need to be a separate function.
)

clips = samplers.get_evenly_spaced_clips(  # often called "Uniform"
    decoder,
    num_clips=12,
    # clip content params:
    num_frames_per_clip=4,
    step_between_frames=2,
)

clips = samplers.get_evenly_timed_clips(  # often called "Periodic"
    decoder,
    num_clips_per_second=3,
    # clip content params:
    num_frames_per_clip=4,
    step_between_frames=2,
    # sampler-specific params:
    clip_range_start_seconds=3,
    clip_range_end_seconds=4,
)
```

> **Reviewer comment (Contributor), on `prioritize_keyframes`:** I can see this
> as a parameter or a different top-level function. As to the name, I think we
> would want something more about the qualities perceivable to users: accuracy
> versus speed. We could implement that by prioritizing keyframes or not.

> **Reviewer comment (Contributor), on passing `decoder`:** From a pure
> performance POV, passing a `SimpleVideoDecoder` could potentially be slow
> because it scans the entire file. Also, do we plan to handle multiple streams
> or audio here? `SimpleVideoDecoder` doesn't handle those.

> **Author reply:** This is out of scope for this design as we don't have a
> multi-stream decoder yet.

> **Reviewer comment (Contributor), on `get_evenly_timed_clips`:** I do think
> @ahmadsharif1's point about variable FPS is relevant here. Clearly,
> `get_evenly_timed_clips()` is all about PTS, as "time" is in the name. If we
> respect PTS for it (and I think we should), that takes care of variable-FPS
> videos. However, `get_uniformly_random_clips()` and
> `get_evenly_spaced_clips()` are less clear. Do we want to be uniformly random
> or evenly spaced across all of the frames, or across the time duration of the
> video? I think both behaviors can be reasonable for a user to want. That may
> need to be a parameter option, or different functions entirely.

> **Author reply:** The time-based equivalent of `get_evenly_spaced_clips()` is
> `get_evenly_timed_clips()`. For the random one, I agree we need to think
> more. There is also the `step_between_frames` parameter, which right now is
> only index-based. We may want it to be time-based as well, and some users may
> want a time-based step on an index-based random sampler (and vice versa).
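
To illustrate the index-based strategies above, here is a hedged sketch of how clip start indices could be chosen. The helper names and the start-index math are assumptions for illustration, not the proposed implementation:

```python
import random

def _last_valid_start(total_frames, num_frames_per_clip, step_between_frames):
    # Last start index such that the whole clip still fits in the video.
    return total_frames - (num_frames_per_clip - 1) * step_between_frames - 1

def uniformly_random_starts(total_frames, num_clips, num_frames_per_clip,
                            step_between_frames, seed=None):
    # "Random" sampler: starts drawn uniformly over valid start indices.
    last = _last_valid_start(total_frames, num_frames_per_clip,
                             step_between_frames)
    rng = random.Random(seed)
    return [rng.randint(0, last) for _ in range(num_clips)]

def evenly_spaced_starts(total_frames, num_clips, num_frames_per_clip,
                         step_between_frames):
    # "Uniform" sampler: starts spread evenly over valid start indices.
    last = _last_valid_start(total_frames, num_frames_per_clip,
                             step_between_frames)
    if num_clips == 1:
        return [0]
    return [round(i * last / (num_clips - 1)) for i in range(num_clips)]

# For a 100-frame video, 4 frames per clip, dilation 2: valid starts are 0..93.
starts = evenly_spaced_starts(100, num_clips=12, num_frames_per_clip=4,
                              step_between_frames=2)
print(starts[0], starts[-1])  # 0 93
```

Note how both strategies are index-based here; the open question in the comments above is whether they should also (or instead) operate over the time duration of the video.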

Option 2
--------

- One `ClipSampler` object where each sampling strategy is a method.
- Bonus: the underlying `decoder` object can be re-initialized by the sampler
  in-between calls, possibly improving seek time?

```py
from torchcodec.sampler import ClipSampler

sampler = ClipSampler(
    # Here: parameters that apply to all sampling strategies
    decoder,
    num_frames_per_clip=4,
    step_between_frames=2,
)
# One method per sampling strategy
sampler.get_uniformly_random_clips(num_clips=12, prioritize_keyframes=True)
sampler.get_evenly_spaced_clips(num_clips=12)
sampler.get_evenly_timed_clips(
    num_clips_per_second=3,
    clip_range_start_seconds=3,
    clip_range_end_seconds=4,
)
```
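
A minimal sketch of how the Option 2 object could hold the shared clip-content state. The stand-in decoder and the method body are assumptions for illustration, not the torchcodec API:

```python
class FakeDecoder:
    """Stand-in for SimpleVideoDecoder, exposing only a frame count."""
    def __init__(self, num_frames):
        self.num_frames = num_frames

class ClipSampler:
    # Clip-content parameters live on the object; each strategy is a method.
    def __init__(self, decoder, num_frames_per_clip=4, step_between_frames=2):
        self._decoder = decoder
        self._num_frames_per_clip = num_frames_per_clip
        self._step = step_between_frames

    def _clip_at(self, start):
        # Frame indices of one clip, spaced by the dilation step.
        return [start + i * self._step for i in range(self._num_frames_per_clip)]

    def get_evenly_spaced_clips(self, num_clips):
        # Spread clip starts evenly over the valid start-index range.
        last = (self._decoder.num_frames
                - (self._num_frames_per_clip - 1) * self._step - 1)
        if num_clips == 1:
            starts = [0]
        else:
            starts = [round(i * last / (num_clips - 1))
                      for i in range(num_clips)]
        return [self._clip_at(s) for s in starts]

sampler = ClipSampler(FakeDecoder(100))
clips = sampler.get_evenly_spaced_clips(num_clips=3)
print(clips[0], clips[-1])  # [0, 2, 4, 6] [93, 95, 97, 99]
```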


Questions
---------

- Should the returned `clips` be...?
  - a `List[FrameBatch]` where the `FrameBatch.data` is 4D
  - a `FrameBatch` where the `FrameBatch.data` is 5D. This is OK as long as
    `num_frames_per_clip` is not random (it is currently never random).

> **Reviewer comment (Contributor):** Can we try to survey use-cases here? My
> default is that we should favor returning tensors over lists. But if we
> discover that users may want different dimensions, we can't return tensors.
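
To make the 4D-vs-5D question concrete, a small shape sketch (numpy stands in for the real tensor type, and the dimensions are made up):

```python
import numpy as np

num_clips, frames_per_clip, C, H, W = 12, 4, 3, 8, 8

# Candidate 1: List[FrameBatch], each FrameBatch.data being 4D (T, C, H, W).
clips_as_list = [np.zeros((frames_per_clip, C, H, W))
                 for _ in range(num_clips)]

# Candidate 2: one FrameBatch whose data is 5D (num_clips, T, C, H, W).
# Stacking only works because frames_per_clip is identical across clips,
# which is why a random num_frames_per_clip would rule this option out.
clips_as_5d = np.stack(clips_as_list)
print(clips_as_5d.shape)  # (12, 4, 3, 8, 8)
```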