Commit 3c2f1dd

ParquetReadSupport, ParquetPartitionReaderFactory and Vectorized Parquet Decoding
1 parent 11b2cc2 commit 3c2f1dd

5 files changed (+74 −38 lines changed)


docs/datasources/parquet/ParquetPartitionReaderFactory.md

Lines changed: 37 additions & 1 deletion
@@ -89,7 +89,30 @@ createRowBaseReader(
   file: PartitionedFile): RecordReader[Void, InternalRow]
 ```
 
-`createRowBaseReader` [buildReaderBase](#buildReaderBase) (for the given [PartitionedFile](../PartitionedFile.md) and [createRowBaseParquetReader](#createRowBaseParquetReader)).
+`createRowBaseReader` [buildReaderBase](#buildReaderBase) for the given [PartitionedFile](../PartitionedFile.md) with the [createRowBaseParquetReader](#createRowBaseParquetReader) factory.
+
+### <span id="createRowBaseParquetReader"> createRowBaseParquetReader
+
+```scala
+createRowBaseParquetReader(
+  partitionValues: InternalRow,
+  pushed: Option[FilterPredicate],
+  convertTz: Option[ZoneId],
+  datetimeRebaseSpec: RebaseSpec,
+  int96RebaseSpec: RebaseSpec): RecordReader[Void, InternalRow]
+```
+
+`createRowBaseParquetReader` prints out the following DEBUG message to the logs:
+
+```text
+Falling back to parquet-mr
+```
+
+`createRowBaseParquetReader` creates a [ParquetReadSupport](ParquetReadSupport.md) (with the [enableVectorizedReader](ParquetReadSupport.md#enableVectorizedReader) flag disabled).
+
+`createRowBaseParquetReader` creates a [RecordReaderIterator](../RecordReaderIterator.md) with a new `ParquetRecordReader`.
+
+In the end, `createRowBaseParquetReader` returns the `ParquetRecordReader`.
 
 ## <span id="createVectorizedReader"> Creating Vectorized Parquet RecordReader
 
@@ -157,3 +180,16 @@ buildReaderBase[T](
 `buildReaderBase` is used when:
 
 * `ParquetPartitionReaderFactory` is requested to [createRowBaseReader](#createRowBaseReader) and [createVectorizedReader](#createVectorizedReader)
+
+## Logging
+
+Enable `ALL` logging level for `org.apache.spark.sql.execution.datasources.v2.parquet.ParquetPartitionReaderFactory` logger to see what happens inside.
+
+Add the following lines to `conf/log4j2.properties`:
+
+```text
+logger.ParquetPartitionReaderFactory.name = org.apache.spark.sql.execution.datasources.v2.parquet.ParquetPartitionReaderFactory
+logger.ParquetPartitionReaderFactory.level = all
+```
+
+Refer to [Logging](../../spark-logging.md).
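
Putting the steps above together, the row-based fallback can be sketched as follows. This is a simplified, hypothetical reconstruction, not the actual `ParquetPartitionReaderFactory` code: the `ParquetReadSupport` constructor signature assumes Spark 3.2+, and `partitionValues` handling and reader initialization are omitted.

```scala
import java.time.ZoneId

import org.apache.parquet.filter2.compat.FilterCompat
import org.apache.parquet.filter2.predicate.FilterPredicate
import org.apache.parquet.hadoop.ParquetRecordReader
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.util.RebaseDateTime.RebaseSpec
import org.apache.spark.sql.execution.datasources.RecordReaderIterator
import org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport

def rowBaseParquetReader(
    pushed: Option[FilterPredicate],
    convertTz: Option[ZoneId],
    datetimeRebaseSpec: RebaseSpec,
    int96RebaseSpec: RebaseSpec): ParquetRecordReader[InternalRow] = {
  // "Falling back to parquet-mr": a ParquetReadSupport with the
  // enableVectorizedReader flag disabled
  val readSupport = new ParquetReadSupport(
    convertTz, enableVectorizedReader = false, datetimeRebaseSpec, int96RebaseSpec)
  // Wrap any pushed-down filter for parquet-mr
  val reader = pushed match {
    case Some(filter) =>
      new ParquetRecordReader[InternalRow](readSupport, FilterCompat.get(filter, null))
    case None =>
      new ParquetRecordReader[InternalRow](readSupport)
  }
  // RecordReaderIterator closes the underlying reader at task completion
  new RecordReaderIterator(reader)
  reader
}
```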
docs/datasources/parquet/ParquetReadSupport.md

Lines changed: 31 additions & 35 deletions

@@ -1,52 +1,48 @@
 # ParquetReadSupport
 
-`ParquetReadSupport` is a concrete `ReadSupport` (from Apache Parquet) of [UnsafeRows](../../UnsafeRow.md).
+`ParquetReadSupport` is a `ReadSupport` (Apache Parquet) of [UnsafeRows](../../UnsafeRow.md) for non-[Vectorized Parquet Decoding](../../vectorized-decoding/index.md).
 
-`ParquetReadSupport` is <<creating-instance, created>> exclusively when `ParquetFileFormat` is requested for a [data reader](ParquetFileFormat.md#buildReaderWithPartitionValues) (with no support for [Vectorized Parquet Decoding](../../vectorized-decoding/index.md) and so falling back to parquet-mr).
+`ParquetReadSupport` is the value of the `parquet.read.support.class` Hadoop configuration property for the following:
 
-[[parquet.read.support.class]]
-`ParquetReadSupport` is registered as the fully-qualified class name for [parquet.read.support.class](ParquetFileFormat.md#parquet.read.support.class) Hadoop configuration when `ParquetFileFormat` is requested for a [data reader](ParquetFileFormat.md#buildReaderWithPartitionValues).
+* [ParquetFileFormat](ParquetFileFormat.md#buildReaderWithPartitionValues)
+* [ParquetScan](ParquetScan.md#createReaderFactory)
 
-[[creating-instance]]
-[[convertTz]]
-`ParquetReadSupport` takes an optional Java `TimeZone` to be created.
+## Creating Instance
 
-[[logging]]
-[TIP]
-====
-Enable `ALL` logging level for `org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport` logger to see what happens inside.
+`ParquetReadSupport` takes the following to be created:
 
-Add the following line to `conf/log4j2.properties`:
+* <span id="convertTz"> `ZoneId` (optional)
+* <span id="enableVectorizedReader"> `enableVectorizedReader` flag
+* <span id="datetimeRebaseSpec"> DateTime `RebaseSpec`
+* <span id="int96RebaseSpec"> INT96 `RebaseSpec`
 
-```
-log4j.logger.org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport=ALL
-```
+`ParquetReadSupport` is created when:
 
-Refer to <<spark-logging.md#, Logging>>.
-====
+* `ParquetFileFormat` is requested to [buildReaderWithPartitionValues](ParquetFileFormat.md#buildReaderWithPartitionValues) (with [enableVectorizedReader](ParquetFileFormat.md#enableVectorizedReader) disabled)
+* `ParquetPartitionReaderFactory` is requested to [createRowBaseParquetReader](ParquetPartitionReaderFactory.md#createRowBaseParquetReader)
 
-=== [[init]] Initializing ReadSupport -- `init` Method
+## Logging
 
-[source, scala]
-----
-init(context: InitContext): ReadContext
-----
+Enable `ALL` logging level for `org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport` logger to see what happens inside.
 
-NOTE: `init` is part of the `ReadSupport` Contract to...FIXME.
+Add the following lines to `conf/log4j2.properties`:
 
-`init`...FIXME
+```text
+logger.ParquetReadSupport.name = org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport
+logger.ParquetReadSupport.level = all
+```
 
-=== [[prepareForRead]] `prepareForRead` Method
+Refer to [Logging](../../spark-logging.md).
 
-[source, scala]
-----
-prepareForRead(
-  conf: Configuration,
-  keyValueMetaData: JMap[String, String],
-  fileSchema: MessageType,
-  readContext: ReadContext): RecordMaterializer[UnsafeRow]
-----
+<!---
+## Review Me
+
+`ParquetReadSupport` is <<creating-instance, created>> exclusively when `ParquetFileFormat` is requested for a [data reader](ParquetFileFormat.md#buildReaderWithPartitionValues) (with no support for [Vectorized Parquet Decoding](../../vectorized-decoding/index.md) and so falling back to parquet-mr).
 
-NOTE: `prepareForRead` is part of the `ReadSupport` Contract to...FIXME.
+[[parquet.read.support.class]]
+`ParquetReadSupport` is registered as the fully-qualified class name for [parquet.read.support.class](ParquetFileFormat.md#parquet.read.support.class) Hadoop configuration when `ParquetFileFormat` is requested for a [data reader](ParquetFileFormat.md#buildReaderWithPartitionValues).
 
-`prepareForRead`...FIXME
+[[creating-instance]]
+[[convertTz]]
+`ParquetReadSupport` takes an optional Java `TimeZone` to be created.
+-->
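
Concretely, the registration mentioned above amounts to a single Hadoop `Configuration` call. A minimal standalone sketch, simplified from what `ParquetFileFormat` and `ParquetScan` do before handing the configuration over to parquet-mr:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.ParquetInputFormat
import org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport

val hadoopConf = new Configuration()

// ParquetInputFormat.READ_SUPPORT_CLASS is the
// "parquet.read.support.class" configuration key
hadoopConf.set(
  ParquetInputFormat.READ_SUPPORT_CLASS,
  classOf[ParquetReadSupport].getName)

assert(hadoopConf.get("parquet.read.support.class") ==
  "org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport")
```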

docs/datasources/parquet/VectorizedParquetRecordReader.md

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 # VectorizedParquetRecordReader
 
-`VectorizedParquetRecordReader` is a [SpecificParquetRecordReaderBase](SpecificParquetRecordReaderBase.md) for [parquet](index.md) data source for [Vectorized Parquet Decoding](../../vectorized-decoding/index.md).
+`VectorizedParquetRecordReader` is a [SpecificParquetRecordReaderBase](SpecificParquetRecordReaderBase.md) for the [Parquet Data Source](index.md) for [Vectorized Parquet Decoding](../../vectorized-decoding/index.md).
 
 ## Creating Instance

docs/datasources/parquet/index.md

Lines changed: 4 additions & 0 deletions

@@ -11,6 +11,10 @@ Parquet is the default data source format based on the [spark.sql.sources.defaul
 
 Parquet data source uses `spark.sql.parquet` prefix for [parquet-specific configuration properties](../../configuration-properties.md).
 
+## Vectorized Parquet Decoding
+
+Parquet Data Source uses [VectorizedParquetRecordReader](VectorizedParquetRecordReader.md) for [Vectorized Parquet Decoding](../../vectorized-decoding/index.md) (and [ParquetReadSupport](ParquetReadSupport.md) otherwise).
+
 ## Demo
 
 ```scala
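// The Vectorized Parquet Decoding switch, as a quick demo (a hedged sketch:
// assumes a SparkSession in `spark` and a Parquet file at the made-up path
// /tmp/demo.parquet).

// Vectorized decoding (VectorizedParquetRecordReader) is on by default
spark.conf.set("spark.sql.parquet.enableVectorizedReader", true)
spark.read.parquet("/tmp/demo.parquet").show()

// Disabling the property falls back to parquet-mr (ParquetReadSupport)
spark.conf.set("spark.sql.parquet.enableVectorizedReader", false)
spark.read.parquet("/tmp/demo.parquet").show()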

docs/vectorized-decoding/index.md

Lines changed: 1 addition & 1 deletion
@@ -8,4 +8,4 @@ Quoting [SPARK-12854 Vectorize Parquet reader](https://issues.apache.org/jira/br
 
 Vectorized Parquet Decoding is used exclusively when `ParquetFileFormat` is requested for a [data reader](../datasources/parquet/ParquetFileFormat.md#buildReaderWithPartitionValues) with the [spark.sql.parquet.enableVectorizedReader](../configuration-properties.md#spark.sql.parquet.enableVectorizedReader) property enabled (`true`) and the read schema using [AtomicTypes](../types/AtomicType.md) data types only.
 
-Vectorized Parquet Decoding uses [VectorizedParquetRecordReader](../datasources/parquet/VectorizedParquetRecordReader.md) for vectorized decoding.
+Vectorized Parquet Decoding uses [VectorizedParquetRecordReader](../datasources/parquet/VectorizedParquetRecordReader.md) for vectorized decoding (and [ParquetReadSupport](../datasources/parquet/ParquetReadSupport.md) otherwise).
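
The gating condition above can be illustrated with a small, hypothetical helper (not the actual Spark check, which lives in `ParquetFileFormat`; this sketch approximates "AtomicTypes only" as "no nested types"):

```scala
import org.apache.spark.sql.types._

// Rough stand-in for the "AtomicTypes only" requirement: a read schema
// qualifies for vectorized decoding only when no field is nested.
def atomicOnly(schema: StructType): Boolean =
  schema.fields.forall { f =>
    f.dataType match {
      case _: ArrayType | _: MapType | _: StructType => false
      case _ => true
    }
  }

val flat   = StructType(Seq(StructField("id", LongType), StructField("name", StringType)))
val nested = StructType(Seq(StructField("tags", ArrayType(StringType))))

assert(atomicOnly(flat))    // eligible for VectorizedParquetRecordReader
assert(!atomicOnly(nested)) // falls back to parquet-mr (ParquetReadSupport)
```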
