docs/SparkSession.md (4 additions, 2 deletions)

`leafNodeDefaultParallelism: Int`

`leafNodeDefaultParallelism` is the value of [spark.sql.leafNodeDefaultParallelism](configuration-properties.md#spark.sql.leafNodeDefaultParallelism) if defined or `SparkContext.defaultParallelism` ([Spark Core]({{ book.spark_core }}/SparkContext#defaultParallelism)).

---

`leafNodeDefaultParallelism` is used when:

* [SparkSession.range](SparkSession.md#range) operator is used
* `RangeExec` leaf physical operator is [created](physical-operators/RangeExec.md#numSlices)
* `CommandResultExec` physical operator is requested for the `RDD[InternalRow]`
* `LocalTableScanExec` physical operator is requested for the [RDD](physical-operators/LocalTableScanExec.md#rdd)
* `FilePartition` is requested for [maxSplitBytes](datasources/FilePartition.md#maxSplitBytes)
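
As a quick illustration of the fallback, here is a minimal sketch (the `local[4]` master and the override value are illustrative, not defaults):

```scala
import org.apache.spark.sql.SparkSession

// local[4] gives SparkContext.defaultParallelism = 4
val spark = SparkSession.builder()
  .master("local[4]")
  .config("spark.sql.leafNodeDefaultParallelism", 2)
  .getOrCreate()

// The property takes precedence, so range uses 2 partitions;
// with the property unset, it would fall back to
// SparkContext.defaultParallelism (4 here).
println(spark.range(100).rdd.getNumPartitions)
```
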
docs/datasources/FilePartition.md (maxSplitBytes)

* [spark.sql.files.minPartitionNum](../configuration-properties.md#spark.sql.files.minPartitionNum) (default: [Default Parallelism of Leaf Nodes](../SparkSession.md#leafNodeDefaultParallelism))

---

`maxSplitBytes` calculates the total size of all the files (in the given `PartitionDirectory`ies) with [spark.sql.files.openCostInBytes](../configuration-properties.md#spark.sql.files.openCostInBytes) overhead added (to the size of every file).

??? note "PartitionDirectory"

    `PartitionDirectory` is a collection of `FileStatus`es ([Apache Hadoop]({{ hadoop.api }}/org/apache/hadoop/fs/FileStatus.html)) along with partition values (if there are any).

`maxSplitBytes` calculates how many bytes to allow per partition (`bytesPerCore`), which is the total size of all the files divided by the [spark.sql.files.minPartitionNum](../configuration-properties.md#spark.sql.files.minPartitionNum) configuration property.

In the end, `maxSplitBytes` is [spark.sql.files.maxPartitionBytes](../configuration-properties.md#spark.sql.files.maxPartitionBytes) unless the maximum of [spark.sql.files.openCostInBytes](../configuration-properties.md#spark.sql.files.openCostInBytes) and `bytesPerCore` is even smaller.
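
In other words, `maxSplitBytes` is `min(maxPartitionBytes, max(openCostInBytes, bytesPerCore))`. A minimal sketch of the calculation (the file sizes and the `minPartitionNum` value are illustrative; the other two values are the defaults):

```scala
val maxPartitionBytes = 128L * 1024 * 1024 // spark.sql.files.maxPartitionBytes
val openCostInBytes   = 4L * 1024 * 1024   // spark.sql.files.openCostInBytes
val minPartitionNum   = 8                  // spark.sql.files.minPartitionNum

// Three files of 900 MB, 400 MB and 100 MB
val fileSizes = Seq(900L, 400L, 100L).map(_ * 1024 * 1024)

// Every file carries the open-cost overhead
val totalBytes   = fileSizes.map(_ + openCostInBytes).sum
val bytesPerCore = totalBytes / minPartitionNum // ~176.5 MB

// min(128 MB, max(4 MB, ~176.5 MB)) = 128 MB
val maxSplitBytes = math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
```
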
`maxSplitBytes` is used when:

* `FileSourceScanExec` physical operator is requested to [create an RDD for scanning](../physical-operators/FileSourceScanExec.md#createReadRDD) (and creates a [FileScanRDD](../rdds/FileScanRDD.md))
* `FileScan` is requested for [partitions](FileScan.md#partitions)

docs/physical-operators/FileSourceScanExec.md (20 additions, 14 deletions)

```text
createReadRDD(
  readFile: (PartitionedFile) => Iterator[InternalRow],
  selectedPartitions: Array[PartitionDirectory],
  fsRelation: HadoopFsRelation): RDD[InternalRow]
```

`createReadRDD` prints out the following INFO message to the logs (with [maxSplitBytes](../datasources/FilePartition.md#maxSplitBytes) hint and [openCostInBytes](../configuration-properties.md#spark.sql.files.openCostInBytes)):

```text
Planning scan with bin packing, max size: [maxSplitBytes] bytes, open cost is considered as scanning [openCostInBytes] bytes.
```

`createReadRDD` determines whether [Bucketing](../bucketing.md) is enabled or not (based on [spark.sql.sources.bucketing.enabled](../configuration-properties.md#spark.sql.sources.bucketing.enabled)) for bucket pruning.

??? note "Bucket Pruning"

    **Bucket Pruning** is an optimization to filter out data files from scanning (based on [optionalBucketSet](#optionalBucketSet)).

With [Bucketing](../bucketing.md) disabled or [optionalBucketSet](#optionalBucketSet) undefined, all files are included in scanning.
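
For reference, the property can be inspected or toggled at runtime (a quick sketch; `spark` is an existing `SparkSession`):

```scala
spark.conf.get("spark.sql.sources.bucketing.enabled") // "true" by default
spark.conf.set("spark.sql.sources.bucketing.enabled", false) // disable bucketed scans
```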

`createReadRDD` [splits files](../datasources/PartitionedFileUtil.md#splitFiles) to be scanned (in the given `selectedPartitions`), possibly applying bucket pruning (with [Bucketing](../bucketing.md) enabled). `createReadRDD` uses the following:

* [isSplitable](../datasources/FileFormat.md#isSplitable) property of the [FileFormat](../datasources/FileFormat.md) of the [HadoopFsRelation](#relation)
* [maxSplitBytes](../datasources/FilePartition.md#maxSplitBytes) hint

`createReadRDD` sorts the split files (by length in reverse order, from the largest to the smallest).
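
The sorted splits are then packed into partitions, multiple splits per partition as long as they stay under `maxSplitBytes` ("Next Fit Decreasing"). A simplified sketch of the packing over split sizes (the names are illustrative, not Spark's internal API):

```scala
// Packs split sizes (sorted in decreasing order) into partitions,
// closing the current partition once the next split would not fit.
def packSplits(
    splitSizes: Seq[Long],
    maxSplitBytes: Long,
    openCostInBytes: Long): Seq[Seq[Long]] = {
  val partitions = Seq.newBuilder[Seq[Long]]
  var current = Vector.empty[Long]
  var currentSize = 0L

  def closePartition(): Unit = {
    if (current.nonEmpty) partitions += current
    current = Vector.empty
    currentSize = 0L
  }

  splitSizes.foreach { size =>
    if (currentSize + size > maxSplitBytes) closePartition()
    currentSize += size + openCostInBytes // every split carries the open cost
    current :+= size
  }
  closePartition()
  partitions.result()
}
```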

In the end, `createReadRDD` creates a [FileScanRDD](../rdds/FileScanRDD.md) with the following:

Property | Value
---------|------
[readFunction](../rdds/FileScanRDD.md#readFunction) | Input `readFile` function
[readSchema](../rdds/FileScanRDD.md#readSchema) | [requiredSchema](#requiredSchema) with [partitionSchema](../datasources/HadoopFsRelation.md#partitionSchema) of the input [HadoopFsRelation](../datasources/HadoopFsRelation.md)

docs/rdds/FileScanRDD.md (9 additions, 5 deletions)

`FileScanRDD` is the [input RDD](../physical-operators/FileSourceScanExec.md#inputRDD) of [FileSourceScanExec](../physical-operators/FileSourceScanExec.md) leaf physical operator (for [Whole-Stage Java Code Generation](../whole-stage-code-generation/index.md)).

??? note "RDD"

    Find out more on `RDD` abstraction in [The Internals of Apache Spark]({{ book.spark_core }}/rdd/RDD).

`FileScanRDD` takes the following to be created:

* <span id="readFunction"> Read Function of [PartitionedFile](../datasources/PartitionedFile.md)s to [InternalRow](../InternalRow.md)s (`(PartitionedFile) => Iterator[InternalRow]`)
* <span id="filePartitions"> File Blocks as `FilePartition`s (`Seq[FilePartition]`)
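
The read function has the shape below; a trivial sketch (the no-op body is illustrative, real read functions decode the file's records):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.PartitionedFile

// A no-op read function: a real one opens the underlying file
// and decodes its records into InternalRows.
val readFunction: PartitionedFile => Iterator[InternalRow] =
  (_: PartitionedFile) => Iterator.empty
```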

`FileScanRDD` is created when:

* [FileSourceScanExec](../physical-operators/FileSourceScanExec.md) physical operator is requested to [createBucketedReadRDD](../physical-operators/FileSourceScanExec.md#createBucketedReadRDD) and [createNonBucketedReadRDD](../physical-operators/FileSourceScanExec.md#createNonBucketedReadRDD) (when `FileSourceScanExec` operator is requested for the [input RDD](../physical-operators/FileSourceScanExec.md#inputRDD) when [WholeStageCodegenExec](../physical-operators/WholeStageCodegenExec.md) physical operator is executed)