Commit 8e7f8ac
spark.sql.files.minPartitionNum, maxSplitBytes hint and File-Based Data Scanning
1 parent 3c2f1dd commit 8e7f8ac

10 files changed: +80 −43 lines

docs/SQLConf.md

Lines changed: 0 additions & 4 deletions
@@ -462,10 +462,6 @@ Used when:
 
 [spark.sql.files.minPartitionNum](configuration-properties.md#spark.sql.files.minPartitionNum)
 
-Used when:
-
-* `FilePartition` utility is requested for [maxSplitBytes](datasources/FilePartition.md#maxSplitBytes)
-
 ## <span id="filesOpenCostInBytes"> filesOpenCostInBytes
 
 [spark.sql.files.openCostInBytes](configuration-properties.md#spark.sql.files.openCostInBytes)

docs/SparkSession.md

Lines changed: 4 additions & 2 deletions
@@ -419,10 +419,12 @@ leafNodeDefaultParallelism: Int
 
 `leafNodeDefaultParallelism` is the value of [spark.sql.leafNodeDefaultParallelism](configuration-properties.md#spark.sql.leafNodeDefaultParallelism) if defined or `SparkContext.defaultParallelism` ([Spark Core]({{ book.spark_core }}/SparkContext#defaultParallelism)).
 
+---
+
 `leafNodeDefaultParallelism` is used when:
 
-* `SparkSession` is requested to [range](SparkSession.md#range)
+* [SparkSession.range](SparkSession.md#range) operator is used
 * `RangeExec` leaf physical operator is [created](physical-operators/RangeExec.md#numSlices)
 * `CommandResultExec` physical operator is requested for the `RDD[InternalRow]`
 * `LocalTableScanExec` physical operator is requested for the [RDD](physical-operators/LocalTableScanExec.md#rdd)
-* `FilePartition` utility is used to `maxSplitBytes`
+* `FilePartition` is requested for [maxSplitBytes](datasources/FilePartition.md#maxSplitBytes)
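To see the fallback in action, a minimal sketch (the `local[4]` master and the override value `2` are arbitrary choices for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[4]") // SparkContext.defaultParallelism would be 4
  .config("spark.sql.leafNodeDefaultParallelism", "2") // explicit override
  .getOrCreate()

// SparkSession.range with no explicit numPartitions goes through
// leafNodeDefaultParallelism (via RangeExec's numSlices)
println(spark.range(100).rdd.getNumPartitions) // 2 with the override above
```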

docs/configuration-properties.md

Lines changed: 15 additions & 9 deletions
@@ -137,7 +137,7 @@ Used when:
 
 **spark.sql.files.maxPartitionBytes**
 
-Maximum number of bytes to pack into a single partition when reading files. Effective only for file-based sources (e.g., Parquet, JSON, ORC)
+Maximum number of bytes to pack into a single partition when reading files for file-based data sources (e.g., [Parquet](datasources/parquet/index.md))
 
 Default: `128MB` (like `parquet.block.size`)
 
@@ -147,6 +147,20 @@ Used when:
 
 * `FilePartition` is requested for [maxSplitBytes](datasources/FilePartition.md#maxSplitBytes)
 
+## <span id="spark.sql.files.minPartitionNum"><span id="FILES_MIN_PARTITION_NUM"> files.minPartitionNum
+
+**spark.sql.files.minPartitionNum**
+
+Hint about the minimum number of partitions for file-based data sources (e.g., [Parquet](datasources/parquet/index.md))
+
+Default: [spark.sql.leafNodeDefaultParallelism](SparkSession.md#leafNodeDefaultParallelism)
+
+Use [SQLConf.filesMinPartitionNum](SQLConf.md#filesMinPartitionNum) for the current value
+
+Used when:
+
+* `FilePartition` is requested for [maxSplitBytes](datasources/FilePartition.md#maxSplitBytes)
+
 ## <span id="spark.sql.files.openCostInBytes"><span id="FILES_OPEN_COST_IN_BYTES"> files.openCostInBytes
 
 **spark.sql.files.openCostInBytes**

@@ -1278,14 +1292,6 @@ Default: `0`
 
 Use [SQLConf.maxRecordsPerFile](SQLConf.md#maxRecordsPerFile) method to access the current value.
 
-## <span id="spark.sql.files.minPartitionNum"> spark.sql.files.minPartitionNum
-
-The suggested (not guaranteed) minimum number of split file partitions for file-based data sources such as Parquet, JSON and ORC.
-
-Default: (undefined)
-
-Use [SQLConf.filesMinPartitionNum](SQLConf.md#filesMinPartitionNum) method to access the current value.
-
 ## <span id="spark.sql.inMemoryColumnarStorage.compressed"> spark.sql.inMemoryColumnarStorage.compressed
 
 When enabled, Spark SQL will automatically select a compression codec for each column based on statistics of the data.
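Both scan-related properties are session-scoped and can be tried out from `spark-shell` (a sketch; the values and the dataset path are illustrative only):

```scala
// Cap a scan partition at 64 MB instead of the default 128 MB
spark.conf.set("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)

// Hint at (at least) 16 partitions for the next file-based scan
spark.conf.set("spark.sql.files.minPartitionNum", 16)

// Both feed into FilePartition.maxSplitBytes when planning the scan
val q = spark.read.parquet("/tmp/demo.parquet") // hypothetical dataset
println(q.rdd.getNumPartitions)
```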

docs/connector/Batch.md

Lines changed: 2 additions & 2 deletions
@@ -20,13 +20,13 @@ Used when:
 
 * `BatchScanExec` is requested for a [PartitionReaderFactory](../physical-operators/BatchScanExec.md#readerFactory)
 
-### <span id="planInputPartitions"> planInputPartitions
+### <span id="planInputPartitions"> Planning Input Partitions
 
 ```java
 InputPartition[] planInputPartitions()
 ```
 
-[InputPartition](InputPartition.md)s to scan this data source
+[InputPartition](InputPartition.md)s to scan this data source with
 
 See:
 
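For illustration, a bare-bones `Batch` (in Scala; the single-partition layout and the `SinglePartition` class are made up for this sketch, and a real connector would carry split metadata in its `InputPartition`s):

```scala
import org.apache.spark.sql.connector.read.{Batch, InputPartition, PartitionReaderFactory}

// Carries no split metadata; real connectors put file/offset info here
case class SinglePartition(index: Int) extends InputPartition

class SinglePartitionBatch(readerFactory: PartitionReaderFactory) extends Batch {

  // One InputPartition per RDD partition of the scan
  override def planInputPartitions(): Array[InputPartition] =
    Array[InputPartition](SinglePartition(0))

  override def createReaderFactory(): PartitionReaderFactory = readerFactory
}
```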

docs/datasources/FilePartition.md

Lines changed: 11 additions & 4 deletions
@@ -8,15 +8,22 @@ maxSplitBytes(
   selectedPartitions: Seq[PartitionDirectory]): Long
 ```
 
-`maxSplitBytes` reads the following properties:
+---
+
+`maxSplitBytes` can be adjusted based on the following configuration properties:
 
 * [spark.sql.files.maxPartitionBytes](../configuration-properties.md#spark.sql.files.maxPartitionBytes)
 * [spark.sql.files.openCostInBytes](../configuration-properties.md#spark.sql.files.openCostInBytes)
 * [spark.sql.files.minPartitionNum](../configuration-properties.md#spark.sql.files.minPartitionNum) (default: [Default Parallelism of Leaf Nodes](../SparkSession.md#leafNodeDefaultParallelism))
 
-`maxSplitBytes` uses the given `selectedPartitions` to calculate `totalBytes` based on the size of the files with [spark.sql.files.openCostInBytes](../configuration-properties.md#spark.sql.files.openCostInBytes) added (for each file).
+---
+
+`maxSplitBytes` calculates the total size of all the files (in the given `PartitionDirectory`ies) with [spark.sql.files.openCostInBytes](../configuration-properties.md#spark.sql.files.openCostInBytes) overhead added (to the size of every file).
+
+??? note "PartitionDirectory"
+    `PartitionDirectory` is a collection of `FileStatus`es ([Apache Hadoop]({{ hadoop.api }}/org/apache/hadoop/fs/FileStatus.html)) along with partition values (if there are any).
 
-`maxSplitBytes` calculates `bytesPerCore` to be `totalBytes` divided by [filesMinPartitionNum](../SQLConf.md#filesMinPartitionNum).
+`maxSplitBytes` calculates how many bytes to allow per partition (`bytesPerCore`), which is the total size of all the files divided by the [spark.sql.files.minPartitionNum](../configuration-properties.md#spark.sql.files.minPartitionNum) configuration property.
 
 In the end, `maxSplitBytes` is [spark.sql.files.maxPartitionBytes](../configuration-properties.md#spark.sql.files.maxPartitionBytes) unless
 the maximum of [spark.sql.files.openCostInBytes](../configuration-properties.md#spark.sql.files.openCostInBytes) and `bytesPerCore` is even smaller.

@@ -25,5 +32,5 @@ the maximum of [spark.sql.files.openCostInBytes](../configuration-properties.md#
 
 `maxSplitBytes` is used when:
 
-* `FileSourceScanExec` physical operator is requested to [createReadRDD](../physical-operators/FileSourceScanExec.md#createReadRDD) (and creates a [FileScanRDD](../rdds/FileScanRDD.md))
+* `FileSourceScanExec` physical operator is requested to [create an RDD for scanning](../physical-operators/FileSourceScanExec.md#createReadRDD) (and creates a [FileScanRDD](../rdds/FileScanRDD.md))
 * `FileScan` is requested for [partitions](FileScan.md#partitions)
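The calculation can be modeled standalone (a sketch of the logic described above; the default values mirror the configuration properties, and the file sizes are made up):

```scala
// Standalone model of FilePartition.maxSplitBytes
def maxSplitBytes(
    fileSizes: Seq[Long],
    maxPartitionBytes: Long = 128L * 1024 * 1024, // spark.sql.files.maxPartitionBytes
    openCostInBytes: Long = 4L * 1024 * 1024,     // spark.sql.files.openCostInBytes
    minPartitionNum: Long = 8                     // spark.sql.files.minPartitionNum
): Long = {
  // Every file pays the open-cost overhead on top of its size
  val totalBytes = fileSizes.map(_ + openCostInBytes).sum
  val bytesPerCore = totalBytes / minPartitionNum
  // maxPartitionBytes unless max(openCostInBytes, bytesPerCore) is even smaller
  math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
}

// e.g., ten 10MB files stay well below the 128MB cap
println(maxSplitBytes(Seq.fill(10)(10L * 1024 * 1024)))
```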

docs/features/index.md

Lines changed: 3 additions & 2 deletions
@@ -1,11 +1,12 @@
 # Features
 
-The following are the features of Spark SQL that help place it in the top of the modern SQL execution engines:
+The following are the features of Spark SQL that help place it at the top of modern distributed SQL query processing engines:
 
 * [Adaptive Query Execution](../adaptive-query-execution/index.md)
 * [Catalog Plugin API](../connector/catalog/index.md)
 * [Columnar Execution](../columnar-execution/index.md)
 * [Dynamic Partition Pruning](../dynamic-partition-pruning/index.md)
+* [File-Based Data Scanning](../file-based-data-scanning/index.md)
 * [Variable Substitution](../variable-substitution.md)
 * [Whole-Stage Code Generation](../whole-stage-code-generation/index.md)
-* _many others_ (listed in the menu on the left)
+* _others_ (listed in the menu on the left)

docs/file-based-data-scanning/index.md

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+# File-Based Data Scanning
+
+Spark SQL uses [FileScanRDD](../rdds/FileScanRDD.md) for table scans of File-Based Data Sources (e.g., [parquet](../datasources/parquet/index.md)).
+
+The number of partitions in data scanning is based on the following:
+
+* [maxSplitBytes hint](../datasources/FilePartition.md#maxSplitBytes)
+* [Whether FileFormat is splitable or not](../datasources/FileFormat.md#isSplitable)
+* [Number of split files](../datasources/PartitionedFileUtil.md#splitFiles)
+* Bucket Pruning
+
+File-Based Data Scanning can be [bucketed or not](../physical-operators/FileSourceScanExec.md#bucketedScan).
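As a quick check (assuming a `SparkSession` in scope as `spark` and a parquet dataset at a hypothetical path), the resulting partitioning is visible on the query's RDD:

```scala
val scan = spark.read.parquet("/tmp/events") // hypothetical dataset
// Reflects maxSplitBytes, FileFormat.isSplitable and the split files
println(scan.rdd.getNumPartitions)
```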

docs/physical-operators/FileSourceScanExec.md

Lines changed: 20 additions & 14 deletions
@@ -121,29 +121,35 @@ createReadRDD(
   fsRelation: HadoopFsRelation): RDD[InternalRow]
 ```
 
-!!! note "FIXME: Review Me"
+`createReadRDD` prints out the following INFO message to the logs (with the [maxSplitBytes](../datasources/FilePartition.md#maxSplitBytes) hint and [openCostInBytes](../configuration-properties.md#spark.sql.files.openCostInBytes)):
 
-`createReadRDD` calculates the maximum size of partitions (`maxSplitBytes`) based on the following properties:
-
-* [spark.sql.files.maxPartitionBytes](../configuration-properties.md#spark.sql.files.maxPartitionBytes)
+```text
+Planning scan with bin packing, max size: [maxSplitBytes] bytes,
+open cost is considered as scanning [openCostInBytes] bytes.
+```
 
-* [spark.sql.files.openCostInBytes](../configuration-properties.md#spark.sql.files.openCostInBytes)
+`createReadRDD` determines whether [Bucketing](../bucketing.md) is enabled or not (based on [spark.sql.sources.bucketing.enabled](../configuration-properties.md#spark.sql.sources.bucketing.enabled)) for bucket pruning.
 
-`createReadRDD` sums up the size of all the files (with the extra [spark.sql.files.openCostInBytes](../configuration-properties.md#spark.sql.files.openCostInBytes)) for the given `selectedPartitions` and divides the sum by the "default parallelism" (i.e. number of CPU cores assigned to a Spark application) that gives `bytesPerCore`.
+??? note "Bucket Pruning"
+    **Bucket Pruning** is an optimization to filter out data files from scanning (based on [optionalBucketSet](#optionalBucketSet)).
 
-The maximum size of partitions is then the minimum of [spark.sql.files.maxPartitionBytes](../configuration-properties.md#spark.sql.files.maxPartitionBytes) and the bigger of [spark.sql.files.openCostInBytes](../configuration-properties.md#spark.sql.files.openCostInBytes) and the `bytesPerCore`.
+    With [Bucketing](../bucketing.md) disabled or [optionalBucketSet](#optionalBucketSet) undefined, all files are included in scanning.
 
-`createReadRDD` prints out the following INFO message to the logs:
+`createReadRDD` [splits files](../datasources/PartitionedFileUtil.md#splitFiles) to be scanned (in the given `selectedPartitions`), possibly applying bucket pruning (with [Bucketing](../bucketing.md) enabled). `createReadRDD` uses the following:
 
-```text
-Planning scan with bin packing, max size: [maxSplitBytes] bytes, open cost is considered as scanning [openCostInBytes] bytes.
-```
+* [isSplitable](../datasources/FileFormat.md#isSplitable) property of the [FileFormat](../datasources/FileFormat.md) of the [HadoopFsRelation](#relation)
+* [maxSplitBytes](../datasources/FilePartition.md#maxSplitBytes) hint
 
-For every file (as Hadoop's `FileStatus`) in every partition (as `PartitionDirectory` in the given `selectedPartitions`), `createReadRDD` [gets the HDFS block locations](#getBlockLocations) to create [PartitionedFiles](../datasources/PartitionedFile.md) (possibly split per the maximum size of partitions if the [FileFormat](../datasources/HadoopFsRelation.md#fileFormat) of the [HadoopFsRelation](#fsRelation) is [splittable](../datasources/FileFormat.md#isSplitable)). The partitioned files are then sorted by number of bytes to read (aka _split size_) in decreasing order (from the largest to the smallest).
+`createReadRDD` sorts the split files (by length in reverse order).
 
-`createReadRDD` "compresses" multiple splits per partition if together they are smaller than the `maxSplitBytes` ("Next Fit Decreasing") that gives the necessary partitions (file blocks as [FilePartitions](../rdds/FileScanRDD.md#FilePartition)).
+In the end, `createReadRDD` creates a [FileScanRDD](../rdds/FileScanRDD.md) with the following:
 
-In the end, `createReadRDD` creates a [FileScanRDD](../rdds/FileScanRDD.md) (with the given `(PartitionedFile) => Iterator[InternalRow]` read function and the partitions).
+Property | Value
+---------|------
+[readFunction](../rdds/FileScanRDD.md#readFunction) | Input `readFile` function
+[filePartitions](../rdds/FileScanRDD.md#filePartitions) | [Partitions](../datasources/FilePartition.md#getFilePartitions)
+[readSchema](../rdds/FileScanRDD.md#readSchema) | [requiredSchema](#requiredSchema) with [partitionSchema](../datasources/HadoopFsRelation.md#partitionSchema) of the input [HadoopFsRelation](../datasources/HadoopFsRelation.md)
+[metadataColumns](../rdds/FileScanRDD.md#metadataColumns) | [metadataColumns](#metadataColumns)
 
 ### <span id="dynamicallySelectedPartitions"> Dynamically Selected Partitions
 
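The packing of split files into FilePartitions can be sketched standalone ("Next Fit Decreasing"; `Split` and the types are simplified stand-ins for this illustration):

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified stand-in for a PartitionedFile
case class Split(path: String, length: Long)

// Next Fit Decreasing: sort by length in reverse order, then pack greedily,
// closing a partition once adding a split would exceed maxSplitBytes
def getFilePartitions(
    splits: Seq[Split],
    maxSplitBytes: Long,
    openCostInBytes: Long): Seq[Seq[Split]] = {
  val partitions = ArrayBuffer.empty[Seq[Split]]
  var current = ArrayBuffer.empty[Split]
  var currentSize = 0L

  splits.sortBy(-_.length).foreach { split =>
    if (current.nonEmpty && currentSize + split.length > maxSplitBytes) {
      partitions += current.toSeq
      current = ArrayBuffer.empty[Split]
      currentSize = 0L
    }
    current += split
    currentSize += split.length + openCostInBytes // each split pays the open cost
  }
  if (current.nonEmpty) partitions += current.toSeq
  partitions.toSeq
}
```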

docs/rdds/FileScanRDD.md

Lines changed: 9 additions & 5 deletions
@@ -2,18 +2,22 @@
 
 `FileScanRDD` is the [input RDD](../physical-operators/FileSourceScanExec.md#inputRDD) of [FileSourceScanExec](../physical-operators/FileSourceScanExec.md) leaf physical operator (for [Whole-Stage Java Code Generation](../whole-stage-code-generation/index.md)).
 
-!!! note "The Internals of Apache Spark"
-    Find out more on `RDD` abstraction in [The Internals of Apache Spark]({{ book.spark_core }}/rdd/RDD.html).
+??? note "RDD"
+    Find out more on `RDD` abstraction in [The Internals of Apache Spark]({{ book.spark_core }}/rdd/RDD).
 
 ## Creating Instance
 
 `FileScanRDD` takes the following to be created:
 
 * <span id="sparkSession"> [SparkSession](../SparkSession.md)
-* <span id="readFunction"> Read Function that takes a [PartitionedFile](../datasources/PartitionedFile.md) and gives [internal binary rows](../InternalRow.md) back (`(PartitionedFile) => Iterator[InternalRow]`)
-* <span id="filePartitions"> File Blocks as `FilePartition`s (`Seq[FilePartition]`)
+* <span id="readFunction"> Read Function of [PartitionedFile](../datasources/PartitionedFile.md)s to [InternalRow](../InternalRow.md)s (`(PartitionedFile) => Iterator[InternalRow]`)
+* <span id="filePartitions"> [FilePartition](../datasources/FilePartition.md)s
+* <span id="readSchema"> Read [Schema](../types/StructType.md)
+* <span id="metadataColumns"> Metadata Columns
 
-`FileScanRDD` is created when [FileSourceScanExec](../physical-operators/FileSourceScanExec.md) physical operator is requested to [createBucketedReadRDD](../physical-operators/FileSourceScanExec.md#createBucketedReadRDD) and [createNonBucketedReadRDD](../physical-operators/FileSourceScanExec.md#createNonBucketedReadRDD) (when `FileSourceScanExec` operator is requested for the [input RDD](../physical-operators/FileSourceScanExec.md#inputRDD) when [WholeStageCodegenExec](../physical-operators/WholeStageCodegenExec.md) physical operator is executed).
+`FileScanRDD` is created when:
+
+* [FileSourceScanExec](../physical-operators/FileSourceScanExec.md) physical operator is requested to [createBucketedReadRDD](../physical-operators/FileSourceScanExec.md#createBucketedReadRDD) and [createNonBucketedReadRDD](../physical-operators/FileSourceScanExec.md#createNonBucketedReadRDD) (when `FileSourceScanExec` operator is requested for the [input RDD](../physical-operators/FileSourceScanExec.md#inputRDD) when [WholeStageCodegenExec](../physical-operators/WholeStageCodegenExec.md) physical operator is executed)
 
 ## Configuration Properties
 
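The read function has the shape below (a trivial sketch that yields no rows; a real one comes from the underlying file format's reader):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.PartitionedFile

// Conforms to (PartitionedFile) => Iterator[InternalRow]
val readFunction: PartitionedFile => Iterator[InternalRow] =
  (_: PartitionedFile) => Iterator.empty
```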

mkdocs.yml

Lines changed: 4 additions & 1 deletion
@@ -174,6 +174,8 @@ nav:
     - EstimationUtils: cost-based-optimization/EstimationUtils.md
   - Dynamic Partition Pruning:
     - dynamic-partition-pruning/index.md
+  - File-Based Data Scanning:
+    - file-based-data-scanning/index.md
   - Join Queries:
     - Joins: joins.md
     - Broadcast Joins: spark-sql-joins-broadcast.md

@@ -811,6 +813,8 @@ nav:
     - UnsafeHashedRelation: UnsafeHashedRelation.md
     - UnsafeRow: UnsafeRow.md
     - UnsafeRowSerializerInstance: tungsten/UnsafeRowSerializerInstance.md
+  - RDDs:
+    - FileScanRDD: rdds/FileScanRDD.md
   - SQL:
     - sql/index.md
     - AbstractSqlParser: sql/AbstractSqlParser.md

@@ -1059,7 +1063,6 @@ nav:
   - Caching and Persistence: caching-and-persistence.md
   - User-Friendly Names of Cached Queries in web UI: caching-webui-storage.md
   - Checkpointing: checkpointing.md
-  - FileScanRDD: rdds/FileScanRDD.md
   - Logging: spark-logging.md
   - Performance Tuning and Debugging:
     - Debugging Query Execution: debugging-query-execution.md
