@@ -58,49 +58,12 @@ struct AudioFramesOutput {
 // FRAME TENSOR ALLOCATION APIs
 // --------------------------------------------------------------------------
 
-// Note [Frame Tensor allocation and height and width ]
+// Note [Frame Tensor allocation]
 //
 // We always allocate [N]HWC tensors. The low-level decoding functions all
 // assume HWC tensors, since this is what FFmpeg natively handles. It's up to
 // the high-level decoding entry-points to permute that back to CHW, by calling
 // maybePermuteHWC2CHW().
-//
-// TODO: Rationalize the comment below with refactoring.
-//
-// Also, importantly, the way we figure out the the height and width of the
-// output frame tensor varies, and depends on the decoding entry-point. In
-// *decreasing order of accuracy*, we use the following sources for determining
-// height and width:
-// - getHeightAndWidthFromResizedAVFrame(). This is the height and width of the
-//   AVframe, *post*-resizing. This is only used for single-frame decoding APIs,
-//   on CPU, with filtergraph.
-// - getHeightAndWidthFromOptionsOrAVFrame(). This is the height and width from
-//   the user-specified options if they exist, or the height and width of the
-//   AVFrame *before* it is resized. In theory, i.e. if there are no bugs within
-//   our code or within FFmpeg code, this should be exactly the same as
-//   getHeightAndWidthFromResizedAVFrame(). This is used by single-frame
-//   decoding APIs, on CPU with swscale, and on GPU.
-// - getHeightAndWidthFromOptionsOrMetadata(). This is the height and width from
-//   the user-specified options if they exist, or the height and width form the
-//   stream metadata, which itself got its value from the CodecContext, when the
-//   stream was added. This is used by batch decoding APIs, for both GPU and
-//   CPU.
-//
-// The source of truth for height and width really is the (resized) AVFrame: it
-// comes from the decoded ouptut of FFmpeg. The info from the metadata (i.e.
-// from the CodecContext) may not be as accurate. However, the AVFrame is only
-// available late in the call stack, when the frame is decoded, while the
-// CodecContext is available early when a stream is added. This is why we use
-// the CodecContext for pre-allocating batched output tensors (we could
-// pre-allocate those only once we decode the first frame to get the info frame
-// the AVFrame, but that's a more complex logic).
-//
-// Because the sources for height and width may disagree, we may end up with
-// conflicts: e.g. if we pre-allocate a batch output tensor based on the
-// metadata info, but the decoded AVFrame has a different height and width.
-// it is very important to check the height and width assumptions where the
-// tensors memory is used/filled in order to avoid segfaults.
-
 torch::Tensor allocateEmptyHWCTensor(
     const FrameDims& frameDims,
     const torch::Device& device,