[SPARK-32284][SQL] Avoid expanding too many CNF predicates in partition pruning #29075

gengliangwang · 2020-07-12T13:28:38Z

What changes were proposed in this pull request?

After #28805, predicates are converted into CNF for partition pruning. However, the CNF result can be very long and the Hive metastore will fail to execute it.
For example, the following partition filter:

(p0 = '1' AND p1 = '1') OR (p0 = '2' AND p1 = '2') OR (p0 = '3' AND p1 = '3') OR (p0 = '4' AND p1 = '4') OR (p0 = '5' AND p1 = '5') OR (p0 = '6' AND p1 = '6') OR (p0 = '7' AND p1 = '7') OR (p0 = '8' AND p1 = '8') OR (p0 = '9' AND p1 = '9') OR (p0 = '10' AND p1 = '10') OR (p0 = '11' AND p1 = '11') OR (p0 = '12' AND p1 = '12') OR (p0 = '13' AND p1 = '13') OR (p0 = '14' AND p1 = '14') OR (p0 = '15' AND p1 = '15') OR (p0 = '16' AND p1 = '16') OR (p0 = '17' AND p1 = '17') OR (p0 = '18' AND p1 = '18') OR (p0 = '19' AND p1 = '19') OR (p0 = '20' AND p1 = '20')

will be converted into a long query (130K characters) in the Hive metastore, which fails with the following error:

javax.jdo.JDOException: Exception thrown when executing query : SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MPartition' AS NUCLEUS_TYPE,A0.CREATE_TIME,A0.LAST_ACCESS_TIME,A0.PART_NAME,A0.PART_ID,A0.PART_NAME AS NUCORDER0 FROM PARTITIONS A0 LEFT OUTER JOIN TBLS B0 ON A0.TBL_ID = B0.TBL_ID LEFT OUTER JOIN DBS C0 ON B0.DB_ID = C0.DB_ID WHERE B0.TBL_NAME = ? AND C0."NAME" = ? AND ((((((A0.PART_NAME LIKE '%/p1=1' ESCAPE '\' ) OR (A0.PART_NAME LIKE '%/p1=2' ESCAPE '\' )) OR (A0.PART_NAME LIKE '%/p1=3' ESCAPE '\' )) OR ((A0.PART_NAME LIKE '%/p1=4' ESCAPE '\' ) O ...
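To make the blow-up concrete, here is a minimal sketch of naive CNF conversion on a toy expression tree. The `Expr`, `Leaf`, `And`, and `Or` types and the `toCnf` and `disjunction` helpers are hypothetical names for illustration only; they are not Spark's Catalyst classes, and Spark's own conversion differs from this. The sketch only shows that distributing OR over AND for a disjunction of n two-term conjunctions yields 2^n clauses when nothing is simplified.

```scala
// Toy expression tree; illustrative only, NOT Spark's Catalyst classes.
object NaiveCnfSketch {
  sealed trait Expr
  final case class Leaf(sql: String) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr
  final case class Or(left: Expr, right: Expr) extends Expr

  // Returns the CNF as a list of clauses; each clause is a list of OR-ed leaves.
  def toCnf(e: Expr): List[List[Leaf]] = e match {
    case l: Leaf   => List(List(l))
    case And(a, b) => toCnf(a) ++ toCnf(b)
    // Distribute OR over AND: every left clause is paired with every right clause.
    case Or(a, b)  => for (x <- toCnf(a); y <- toCnf(b)) yield x ++ y
  }

  // Builds (p0 = 1 AND p1 = 1) OR ... OR (p0 = n AND p1 = n).
  def disjunction(n: Int): Expr = {
    val terms: Seq[Expr] = (1 to n).map(i => And(Leaf(s"p0 = '$i'"), Leaf(s"p1 = '$i'")))
    terms.reduceLeft[Expr]((a, b) => Or(a, b))
  }
}
```

With this toy model, `NaiveCnfSketch.toCnf(NaiveCnfSketch.disjunction(n))` has 2^n clauses of n leaves each, so the clause count doubles with every extra disjunct.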

To mitigate the regression introduced by the previous improvement #28805:

We should push down the convertible original queries as they are, instead of converting all predicates into CNF.
We can skip grouping expressions so that we can stop the CNF conversion when the predicates become too long.
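The idea of stopping the conversion once the result grows too long can be sketched as follows. This is only an illustration under stated assumptions: the toy `Expr` types, the `maxClauses` parameter, and the `toCnf` helper are hypothetical and do not reflect the code in this PR. The conversion returns `None` as soon as a step would exceed the budget, so a caller could fall back to the original predicate.

```scala
// Toy types and a hypothetical clause budget; NOT the implementation in this PR.
object BoundedCnfSketch {
  sealed trait Expr
  final case class Leaf(sql: String) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr
  final case class Or(left: Expr, right: Expr) extends Expr

  // Returns the CNF clauses, or None once more than maxClauses would be produced.
  def toCnf(e: Expr, maxClauses: Int): Option[List[List[Leaf]]] = e match {
    case l: Leaf => Some(List(List(l)))
    case And(a, b) =>
      (toCnf(a, maxClauses), toCnf(b, maxClauses)) match {
        case (Some(x), Some(y)) if x.size + y.size <= maxClauses => Some(x ++ y)
        case _ => None
      }
    case Or(a, b) =>
      (toCnf(a, maxClauses), toCnf(b, maxClauses)) match {
        // The product is computed as Long so that large clause counts cannot overflow.
        case (Some(x), Some(y)) if x.size.toLong * y.size <= maxClauses =>
          Some(for (l <- x; r <- y) yield l ++ r)
        case _ => None
      }
  }
}
```

A caller would treat `None` as "keep the original predicate unchanged" rather than as an error.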

Why are the changes needed?

Mitigating potential regressions in partition pruning from #28805.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test

gengliangwang · 2020-07-12T13:29:29Z

cc @AngersZhuuuu @cloud-fan

dongjoon-hyun

Although this aims to fix the issue of too many predicates in HMS, "Avoid pushing down too many predicated in partition pruning" sounds ambiguous as a PR title. Can we have a more specific title describing what the PR code does?

gengliangwang · 2020-07-12T17:07:27Z

@dongjoon-hyun Thanks for the suggestion. I have updated the title.

SparkQA · 2020-07-12T18:28:41Z

Test build #125715 has finished for PR 29075 at commit ccba836.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-07-12T21:09:16Z

Thank you for updating, @gengliangwang. Shall we adjust this test case name accordingly as well?

test("SPARK-32284: Avoid pushing down too many predicates in partition pruning") {

BTW, 20 in the test case looks like a reasonably small number in the Spark world. Could you use more functional wording to describe the change? For example, this PR does not impose a limit on the number of predicates (such that 10 is allowed but 20 is not). Apache Spark will still hit the HMS issue after this PR when we have a very long SQL query with too many predicates. So this PR doesn't achieve "Avoid pushing down too many predicates in partition pruning". Instead, it looks like it is mitigating the regression due to the previous improvement PR.

We should push down the convertible original queries as they are, instead of converting all predicates into CNF
We can skip grouping expressions so that we can stop the CNF conversion when the predicates become too long.

maropu · 2020-07-13T01:09:01Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/PruneHiveTablePartitions.scala

    case op @ PhysicalOperation(projections, filters, relation: HiveTableRelation)
      if filters.nonEmpty && relation.isPartitioned && relation.prunedPartitions.isEmpty =>
-      val predicates = CNFWithGroupExpressionsByReference(filters.reduceLeft(And))
+      val predicates = CNFConversion(filters.reduceLeft(And))


nit: conjunctiveNormalForm(filters.reduceLeft(And), identity)?

maropu · 2020-07-13T01:24:39Z

...re/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala

+    val extraPartitionFilters =
+      remainingFilterInCnf.filter(f => f.references.subsetOf(partitionSet))
+
+    (ExpressionSet(partitionFilters ++ extraPartitionFilters), remainingFilters)


    val (extraPartitionFilters, otherFilters) =
      remainingFilterInCnf.partition(f => f.references.subsetOf(partitionSet))
    (ExpressionSet(partitionFilters ++ extraPartitionFilters), otherFilters)

?

In that way, otherFilters can be very long, which leads to longer codegen... I am avoiding that on purpose. Let me add a comment here.
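The difference between the two approaches can be sketched with a toy model. The `Pred` type, the `partitionSet` value, and both helper names below are hypothetical stand-ins, not Spark's classes. With `filter`, only the partition-only CNF clauses are extracted and the caller keeps the original, shorter remaining filters; with `partition`, the leftover CNF clauses become the data filters, and there can be many of them.

```scala
object PartitionFilterSketch {
  // Toy predicate: a label plus the columns it references (hypothetical type).
  final case class Pred(sql: String, references: Set[String])

  val partitionSet: Set[String] = Set("p0", "p1")

  // Approach in the diff: keep only the clauses that reference partition columns alone.
  def extraPartitionFilters(cnf: Seq[Pred]): Seq[Pred] =
    cnf.filter(_.references.subsetOf(partitionSet))

  // Suggested alternative: also return the leftover CNF clauses as data filters.
  def splitAll(cnf: Seq[Pred]): (Seq[Pred], Seq[Pred]) =
    cnf.partition(_.references.subsetOf(partitionSet))

  // CNF of (p0 = 1 AND a = 1) OR (p0 = 2 AND a = 2), written out by hand.
  val cnf: Seq[Pred] = Seq(
    Pred("p0 = 1 OR p0 = 2", Set("p0")),
    Pred("p0 = 1 OR a = 2", Set("p0", "a")),
    Pred("a = 1 OR p0 = 2", Set("a", "p0")),
    Pred("a = 1 OR a = 2", Set("a"))
  )
}
```

In this example the first approach yields one extra partition filter and leaves the single original predicate as the data filter, while the second approach replaces that one predicate with three CNF clauses.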

AngersZhuuuu · 2020-07-13T01:34:39Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/PruneHiveTablePartitions.scala

 import org.apache.spark.sql.catalyst.rules.Rule
 import org.apache.spark.sql.execution.datasources.DataSourceStrategy
+import org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions.CNFConversion
 import org.apache.spark.sql.internal.SQLConf


This import is not necessary.

AngersZhuuuu · 2020-07-13T01:42:55Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/PruneHiveTablePartitions.scala

+    val remainingFilterInCnf = remainingFilters.flatMap(CNFConversion)
+    val extraPartitionFilters = remainingFilterInCnf.filter(f =>
+      !f.references.isEmpty && f.references.subsetOf(partitionColumnSet))
+    ExpressionSet(partitionFilters ++ extraPartitionFilters)


I am confused: it seems CNFConversion won't change references. Don't you need to call splitConjunctivePredicates on each expression in remainingFilterInCnf to extract more predicates?

The filters here are already processed with splitConjunctivePredicates in PhysicalOperation.unapply. That's why the original code before #28805 doesn't call splitConjunctivePredicates either.
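For readers unfamiliar with the helper being discussed, splitting a predicate into its top-level conjuncts can be sketched on a toy tree as below. The `Expr` types and the `splitConjuncts` name are hypothetical; this is a simplified illustration of the idea, not Spark's splitConjunctivePredicates itself.

```scala
// Toy expression tree; illustrative only, NOT Spark's Catalyst classes.
object SplitConjunctsSketch {
  sealed trait Expr
  final case class Leaf(sql: String) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr
  final case class Or(left: Expr, right: Expr) extends Expr

  // Flattens nested ANDs into a sequence; anything that is not an AND stays whole.
  def splitConjuncts(e: Expr): Seq[Expr] = e match {
    case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
    case other     => Seq(other)
  }
}
```

Because only top-level ANDs are flattened, an OR subtree is returned as a single element, which is why filters that have already been split do not need a second pass.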

SparkQA · 2020-07-13T07:05:02Z

Test build #125746 has finished for PR 29075 at commit df08390.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-07-13T14:03:37Z

Test build #125758 has finished for PR 29075 at commit 6fe106c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-07-13T23:22:06Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

-   * @param condition to be converted into CNF.
+   * @param condition Condition to be converted into CNF.
+   * @param groupExpsFunc A method for grouping intermediate results so that the final result can be
+   *                      shorter.


nit: A method to group expressions for reducing the size of pushed down predicates and corresponding codegen?

gengliangwang · 2020-07-15T08:39:25Z

#29101 looks like a better solution to me. Closing this one now.

avoid pushing down too many predicated in partition pruning

ccba836

probot-autolabeler bot added the SQL label Jul 12, 2020

dongjoon-hyun reviewed Jul 12, 2020

gengliangwang changed the title ~~[SPARK-32284][SQL] Avoid pushing down too many predicated in partition pruning~~ [SPARK-32284][SQL] Avoid expanding too many CNF predicates in partition pruning Jul 12, 2020

maropu reviewed Jul 13, 2020

AngersZhuuuu reviewed Jul 13, 2020

gengliangwang added 2 commits July 13, 2020 11:17

address comments

df08390

revise method names and comments

6fe106c

maropu reviewed Jul 13, 2020

maropu mentioned this pull request Jul 15, 2020

[SPARK-32302][SQL] Partially push down disjunctive predicates through Join/Partitions #29101

Closed

gengliangwang closed this Jul 15, 2020
