**(internal)** Enables [OffHeapColumnVector](OffHeapColumnVector.md) (`true`) or [OnHeapColumnVector](OnHeapColumnVector.md) (`false`) in [ColumnarBatch](vectorized-query-execution/ColumnarBatch.md)
124
+
125
+
Default: `false`
126
+
127
+
Use [SQLConf.offHeapColumnVectorEnabled](SQLConf.md#offHeapColumnVectorEnabled) for the current value
128
+
129
+
Used when:
130
+
131
+
* `RowToColumnarExec` is requested to `doExecuteColumnar`
132
+
* `DefaultCachedBatchSerializer` is requested to `vectorTypes` and `convertCachedBatchToColumnarBatch`
133
+
* `ParquetFileFormat` is requested to [vectorTypes](datasources/parquet/ParquetFileFormat.md#vectorTypes) and [buildReaderWithPartitionValues](datasources/parquet/ParquetFileFormat.md#buildReaderWithPartitionValues)
134
+
* `ParquetPartitionReaderFactory` is [created](datasources/parquet/ParquetPartitionReaderFactory.md#enableOffHeapColumnVector)
**(internal)** Enables [OffHeapColumnVector](OffHeapColumnVector.md) in [ColumnarBatch](vectorized-query-execution/ColumnarBatch.md) (`true`) or not (`false`). When `false`, [OnHeapColumnVector](OnHeapColumnVector.md) is used instead.
1193
-
1194
-
Default: `false`
1195
-
1196
-
Use [SQLConf.offHeapColumnVectorEnabled](SQLConf.md#offHeapColumnVectorEnabled) method to access the current value.
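
As a minimal sketch of what this property controls, the snippet below models the choice between the two vector implementations with a plain key-value configuration map. The names `VectorKind`, `OffHeap`, `OnHeap`, `OffHeapKey` and `vectorKindFor` are hypothetical and are not part of Spark; only the property name and its `false` default come from the description above.

```scala
// Hypothetical model (not Spark's API): the kind of column vector backing a ColumnarBatch.
sealed trait VectorKind
case object OffHeap extends VectorKind // stands in for OffHeapColumnVector
case object OnHeap extends VectorKind  // stands in for OnHeapColumnVector

// Property name as documented above; its documented default is `false`.
val OffHeapKey = "spark.sql.columnVector.offheap.enabled"

// Resolve the vector kind from a configuration map, falling back to the default.
def vectorKindFor(conf: Map[String, String]): VectorKind =
  if (conf.getOrElse(OffHeapKey, "false").toBoolean) OffHeap else OnHeap
```

With an empty map the sketch falls back to `OnHeap`, mirroring the documented default; only an explicit `true` selects `OffHeap`.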
docs/datasources/parquet/ParquetPartitionReaderFactory.md
+42 -4 (42 additions & 4 deletions)
@@ -1,6 +1,6 @@
1
1
# ParquetPartitionReaderFactory
2
2
3
-
`ParquetPartitionReaderFactory` is a [FilePartitionReaderFactory](../FilePartitionReaderFactory.md).
3
+
`ParquetPartitionReaderFactory` is a [FilePartitionReaderFactory](../FilePartitionReaderFactory.md) for [ParquetScan](ParquetScan.md#createReaderFactory) for batch queries.
4
4
5
5
## Creating Instance
6
6
@@ -18,6 +18,13 @@
18
18
19
19
* `ParquetScan` is requested to [create a PartitionReaderFactory](ParquetScan.md#createReaderFactory)
`ParquetPartitionReaderFactory` uses [spark.sql.columnVector.offheap.enabled](../../configuration-properties.md#spark.sql.columnVector.offheap.enabled) configuration property when requested for the following:
24
+
25
+
* [Create a Vectorized Reader](#createParquetVectorizedReader) (and create a [VectorizedParquetRecordReader](VectorizedParquetRecordReader.md#useOffHeap))
26
+
* [Build a Columnar Reader](#buildColumnarReader) (and `convertAggregatesRowToBatch`)
In the end, `buildColumnarReader` returns a [PartitionReader](../../connector/PartitionReader.md) that returns [ColumnarBatch](../../vectorized-query-execution/ColumnarBatch.md)es (when [requested for records](../../connector/PartitionReader.md#get)).
53
60
54
-
## <span id="buildReader"> Building PartitionReader
61
+
## <span id="buildReader"> Building Partition Reader
55
62
56
63
```scala
57
64
buildReader(
@@ -95,9 +102,38 @@ createVectorizedReader(
95
102
96
103
In the end, `createVectorizedReader` requests the [VectorizedParquetRecordReader](VectorizedParquetRecordReader.md) to [initBatch](VectorizedParquetRecordReader.md#initBatch) (with the [partitionSchema](#partitionSchema) and the [partitionValues](../PartitionedFile.md#partitionValues) of the given [PartitionedFile](../PartitionedFile.md)) and returns it.
97
104
98
-
`createVectorizedReader` is used when:
105
+
---
106
+
107
+
`createVectorizedReader` is used when `ParquetPartitionReaderFactory` is requested for the following:
108
+
109
+
* [Build a partition reader (for a file)](#buildReader) (with [enableVectorizedReader](#enableVectorizedReader) enabled)
110
+
* [Build a columnar partition reader (for a file)](#buildColumnarReader)
`createParquetVectorizedReader` creates a [VectorizedParquetRecordReader](VectorizedParquetRecordReader.md) (with [capacity](#capacity)).
124
+
125
+
`createParquetVectorizedReader` creates a [RecordReaderIterator](../RecordReaderIterator.md) (for the `VectorizedParquetRecordReader`).
99
126
100
-
* `ParquetPartitionReaderFactory` is requested to [buildReader](#buildReader) and [buildColumnarReader](#buildColumnarReader)
127
+
`createParquetVectorizedReader` prints out the following DEBUG message to the logs (with the [partitionSchema](#partitionSchema) and the given `partitionValues`):
128
+
129
+
```text
130
+
Appending [partitionSchema] [partitionValues]
131
+
```
132
+
133
+
In the end, `createParquetVectorizedReader` returns the `VectorizedParquetRecordReader`.
134
+
135
+
??? note "Unused RecordReaderIterator?"
136
+
    It appears that the `RecordReaderIterator` is created but not used. _Feeling confused_.
101
137
102
138
## <span id="buildReaderBase"> buildReaderBase
103
139
@@ -116,6 +152,8 @@ buildReaderBase[T](
116
152
117
153
`buildReaderBase`...FIXME
118
154
155
+
---
156
+
119
157
`buildReaderBase` is used when:
120
158
121
159
* `ParquetPartitionReaderFactory` is requested to [createRowBaseReader](#createRowBaseReader) and [createVectorizedReader](#createVectorizedReader)
docs/datasources/parquet/VectorizedParquetRecordReader.md
+8 -4 (8 additions & 4 deletions)
@@ -62,21 +62,24 @@ void initBatch(
62
62
63
63
`initBatch` creates a [batch schema](../../types/index.md) that is [sparkSchema](SpecificParquetRecordReaderBase.md#sparkSchema) and the input `partitionColumns` schema (if available).
64
64
65
-
`initBatch` requests [OffHeapColumnVector](../../OffHeapColumnVector.md#allocateColumns) or [OnHeapColumnVector](../../OnHeapColumnVector.md#allocateColumns) to allocate column vectors per the input `memMode`, i.e. [OFF_HEAP](#OFF_HEAP) or [ON_HEAP](#ON_HEAP) memory modes, respectively. `initBatch` records the allocated column vectors as the internal [WritableColumnVectors](#columnVectors).
65
+
`initBatch` requests [OffHeapColumnVector](../../OffHeapColumnVector.md#allocateColumns) or [OnHeapColumnVector](../../OnHeapColumnVector.md#allocateColumns) to allocate column vectors per the input `memMode` (i.e., [OFF_HEAP](#OFF_HEAP) or [ON_HEAP](#ON_HEAP) memory modes, respectively). `initBatch` records the allocated column vectors as the internal [WritableColumnVectors](#columnVectors).
66
66
67
-
!!! note
67
+
!!! note "spark.sql.columnVector.offheap.enabled"
68
68
    [OnHeapColumnVector](../../OnHeapColumnVector.md) is used based on [spark.sql.columnVector.offheap.enabled](../../configuration-properties.md#spark.sql.columnVector.offheap.enabled) configuration property.
69
69
70
-
`initBatch` creates a [ColumnarBatch](../../vectorized-query-execution/ColumnarBatch.md) (with the [allocated WritableColumnVectors](#columnVectors)) and records it as the internal [ColumnarBatch](#columnarBatch).
70
+
`initBatch` creates a [ColumnarBatch](#columnarBatch) (with the [allocated WritableColumnVectors](#columnVectors)).
71
71
72
-
`initBatch` does some additional maintenance to the [columnVectors](#columnVectors).
72
+
`initBatch` does some additional maintenance to the [WritableColumnVectors](#columnVectors).
73
+
74
+
---
73
75
74
76
`initBatch` is used when:
75
77
76
78
* `VectorizedParquetRecordReader` is requested to [resultBatch](#resultBatch)
77
79
* `ParquetFileFormat` is requested to [build a data reader (with partition column values appended)](ParquetFileFormat.md#buildReaderWithPartitionValues)
78
80
* `ParquetPartitionReaderFactory` is requested to [createVectorizedReader](ParquetPartitionReaderFactory.md#createVectorizedReader)
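
The allocation step that `initBatch` performs (build a batch schema, then allocate one column vector per field in the requested memory mode) can be sketched roughly as follows. This is a simplified model under stated assumptions, not Spark's implementation: `MemMode`, `Field`, `ColumnVec`, `batchSchema`, `allocateColumns` and this `initBatch` signature are all hypothetical stand-ins for the real classes described above.

```scala
// Hypothetical stand-ins for Spark's memory modes and writable column vectors.
sealed trait MemMode
case object OFF_HEAP extends MemMode
case object ON_HEAP extends MemMode

final case class Field(name: String)
final case class ColumnVec(field: Field, mode: MemMode, capacity: Int)

// Batch schema = the data (spark) schema followed by the partition columns, if available.
def batchSchema(sparkSchema: Seq[Field], partitionColumns: Option[Seq[Field]]): Seq[Field] =
  sparkSchema ++ partitionColumns.getOrElse(Seq.empty)

// Allocate one vector per field, all in the requested memory mode.
def allocateColumns(capacity: Int, schema: Seq[Field], mode: MemMode): Seq[ColumnVec] =
  schema.map(field => ColumnVec(field, mode, capacity))

// Model of initBatch: returns the allocated vectors that would back the batch.
def initBatch(
    mode: MemMode,
    capacity: Int,
    sparkSchema: Seq[Field],
    partitionColumns: Option[Seq[Field]]): Seq[ColumnVec] =
  allocateColumns(capacity, batchSchema(sparkSchema, partitionColumns), mode)
```

For example, `initBatch(OFF_HEAP, 4096, Seq(Field("id")), Some(Seq(Field("date"))))` yields two off-heap vectors in this model, with the data column first and the partition column appended after it.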
79
81
82
+
<!---
80
83
## Review Me
81
84
82
85
`VectorizedParquetRecordReader` uses <<OFF_HEAP, OFF_HEAP>> memory mode when [spark.sql.columnVector.offheap.enabled](../../configuration-properties.md#spark.sql.columnVector.offheap.enabled) internal configuration property is enabled (`true`).
@@ -212,3 +215,4 @@ NOTE: `getCurrentValue` is part of the Hadoop https://hadoop.apache.org/docs/r2.
212
215
* `NewHadoopRDD` is requested to compute a partition (`compute`)
213
216
214
217
* `RecordReaderIterator` is requested for the [next internal row](../RecordReaderIterator.md#next)
docs/vectorized-query-execution/ColumnarBatch.md
+3 -2 (3 additions & 2 deletions)
@@ -20,12 +20,13 @@ tags:
20
20
21
21
`ColumnarBatch` is created when:
22
22
23
+
* `ArrowConverters` utility is requested to `fromBatchIterator`
23
24
* `RowToColumnarExec` unary physical operator is requested to `doExecuteColumnar`
24
25
* [InMemoryTableScanExec](../physical-operators/InMemoryTableScanExec.md) leaf physical operator is requested for a [RDD[ColumnarBatch]](../physical-operators/InMemoryTableScanExec.md#columnarInputRDD)
25
26
* `MapInPandasExec` unary physical operator is requested to `doExecute`
26
-
* `OrcColumnarBatchReader` and `VectorizedParquetRecordReader` are requested to `initBatch`
27
+
* `OrcColumnarBatchReader` is requested to `initBatch`
27
28
* `PandasGroupUtils` utility is requested to `executePython`
28
-
* `ArrowConverters` utility is requested to `fromBatchIterator`
29
+
* `VectorizedParquetRecordReader` is requested to [init a batch](../datasources/parquet/VectorizedParquetRecordReader.md#initBatch)