PARQUET-2249: Introduce IEEE 754 total order #221
Conversation
Thanks @JFinis! First read looks pretty good to me. I just have a few small suggestions.
Looks great to me, thank you for writing this up
Are there any plans to get this or #196 merged? Either of these solutions would still be useful today.
Yes, I would like to merge this, as the problem still persists. I guess what's missing now is mostly two open source implementations. I currently do not have the time allocated for that much open source work, sadly.
I can implement support for this in a branch in DuckDB if that helps get this in. |
@Mytherin I can help with the PoC implementation with parquet-java and/or parquet-cpp but I cannot commit the time. |
I can take a stab at a Rust implementation, but it would help if this PR could merge in recent changes from master.
I'll update in the next few days.
The parquet-rs PoC is largely ready to go, at least as far as producing files that use the new ordering. Then all that would be required is implementation in parquet-java and demonstration of interoperability.
I'm glad to provide the PoC for the C++ version in the coming days.
@mapleFU Thanks! Then I will focus on parquet-java.
I've implemented this for DuckDB in my branch here. Attached is a Parquet file that has the orders defined.
PoC from parquet-java is now alive: apache/parquet-java#3191
I have rebased my changes onto latest master and applied all suggestions. I have also split this PR into a styling change and the actual change, as suggested by @wgtmac. I don't know how to make this PR depend on the other PR (I can't put branches from my fork as base branches of this PR), so as long as the other PR isn't shipped, this PR still contains both commits. So, how do we continue from here? Do we have enough implementations now to start a vote? Or what is still missing?
Thanks @JFinis! I think what's needed now is to have the PoC engines each write a file with the new ordering (perhaps …). After that, I'd probably kick off a thread on the dev list to see if there are any last concerns before holding a vote.
@orlp I actually had …
So in the end, this boils down to: are we willing to special-case float to make filtering in the presence of NaNs more efficient, or do we go with a more streamlined implementation without special fields that only make sense for a single data type, at the cost of worse filtering performance when NaNs are present? The consensus leaned toward the latter, and over time I have come to agree with this. It's just a way more minimalistic and clean design.

To your second comment about making the sign bit significant: I am working on an engine myself that does not care about sign bits in NaNs, but I still consider having them in Parquet okay. Our engine, when reading, will simply normalize the NaN bit out (or just keep it as is). When writing, we will simply never write negative NaNs (or just write whatever bit pattern we have). However we do it, we of course have to make sure that the statistics we write are in line with the bit patterns we write, and when reading, we have to make sure the same is true. Therefore, I don't see the danger you mention, IMHO. Of course, each writer that wants to conform to the Parquet spec would have to handle this correctly, so either normalize the sign bit out or consider it when computing min/max; otherwise it's an incorrect writer. The writing engine cannot just do whatever it does internally; it has to do whatever the Parquet spec mandates. Again, the engine I am working on handles NaNs internally differently as well, but of course our Parquet read/write code does whatever Parquet mandates, and not what we otherwise do internally.

So basically your danger boils down to: "There is a chance that a sloppily programmed writer gets this wrong." That is indeed a risk. But IMHO, there are way more complex things in Parquet to get wrong, so this shouldn't be our bar. On the upside, what we gain by considering the sign bit for NaNs is: …
Your own proposal suggests an ordering that would be different for min and max. While this might help with squeezing the most filtering possibilities out of NaNs, it adds even more special-handling logic, and I guess a sloppily written writer implementation could get this wrong as well. I guess, in the end, whatever we do will be light years better than the current state. We just have to agree on doing something. So far, reaching that agreement was hard, but it looks like this version of the proposal has gotten the most agreement so far. I do agree though that the solution you propose is absolutely also a possibility with its own merits.
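To make the writer-side normalization idea from the comment above concrete, here is a minimal sketch; the function name and the choice of canonical NaN are mine, not from the PR:

```rust
// Hypothetical writer-side normalization: collapse every NaN to the
// canonical positive quiet NaN before values and statistics are written,
// so the written data and the written min/max can never disagree on
// NaN sign bits.
fn normalize_for_write(x: f64) -> f64 {
    if x.is_nan() {
        f64::from_bits(0x7FF8_0000_0000_0000) // canonical quiet +NaN
    } else {
        x
    }
}
```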
I consider IEEE 754 total order to be as arbitrary and special-cased as the NaN-ignoring order. Either way we have to do something we're not doing right now for floats.
Computing the min/max while ignoring NaNs is also efficient and branch-free if done correctly. And I hope we agree the computation of the statistic is far more important than the singular comparison to decide whether or not to skip an entire page.
My proposal is also fully defined without room for interpretation (apart from the exact NaN pattern used for the full-NaN page, we can figure that out).
Yes, absolutely. NaNs aren't as rare as you might think, depending on the use case, and need special handling regardless, because they aren't ordered naturally. The 'large value' conundrum you mention is a red herring; those are naturally handled the same by every data system. But not every data system makes the same choices when it comes to handling NaNs. By taking them out of the ordering and presenting them separately through a NaN-count statistic, each system can apply its own NaN semantics.
Let's not forget that my proposal very closely matches what Parquet already requires today.
I agree @orlp, all your arguments are valid and it's good to have the discussion. However, they all basically already came up in the initial discussion, so they are merely another educated opinion rather than new points we didn't consider before, and therefore aren't likely to sway existing opinions on the topic; or maybe they do. As mentioned, I'm happy with both approaches and willing to continue with either; I just want the main problem to be fixed, somehow. I mostly challenged your proposal to give you and others an overview of where the discussion has gone so far. While I like the conceptual simplicity of total ordering, I'm not a firm believer that it is the only possible solution to the problem. And I do recognize that your proposal of sorting NaN differently for min and max wasn't in my initial proposal and is probably a good idea, so if we went with NaN counts, there are things that could be improved upon in the initial proposal before continuing. So from my side, I'm happy to continue either way. But I guess we now should at least get an informal vote going to see where the majority opinion resides on this.
From my perspective both solutions are significantly better than the status quo, and solve the core problem, which is that floating point statistics cannot be used fully even when there are no NaN values present, because the reader always has to assume NaN values might be present in the data. Which of these solutions is adopted is not important to me, but given that changes to the standard take a long time to propagate, it would be better if one is adopted sooner rather than later. At this point it is over two years since the original proposal, and pages of discussion have already been had over this. It would be great to make some progress here.
Personally I can see pros and cons for each approach and agree either is better than the status quo. I'll just point out that if speed is of the essence, then the current proposal already has three open source implementations and AFAICT is in a state where it can be voted on and adopted quickly. Kicking off another round of discussions will likely delay roll-out by at least a few months.
I like this idea. If there are PMC members who would veto this PR, it's better to know now rather than after a few more months of discussion. If vetoed, we can regroup and try again with …
I don't believe either proposal requires a lot of implementation effort.
I don't think this is a fair voting procedure if the options are stated as "approve this or veto". Since everyone agrees that the status quo is bad, almost everyone will prefer this over vetoing. At the very least the voting options must include the alternative proposal as well.
There is actually a problem with the singular NaN count for data systems which use IEEE 754 total ordering (such as datafusion): they would need two counts for efficient page filtering in the face of NaNs, one for positive NaNs and one for negative NaNs.
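A minimal sketch of why one count is not enough for such engines; all names here are assumptions for illustration, not spec fields:

```rust
// Under IEEE 754 total order, -NaN sorts below everything and +NaN above
// everything, so a single combined nan_count cannot tell a totalOrder-based
// reader which end of the value range the NaNs occupy.
struct PageStats {
    min: f64,            // bounds over non-NaN values only (NaN-count proposal)
    max: f64,
    pos_nan_count: u64,
    neg_nan_count: u64,
}

// May the page contain a value above t (t non-NaN), totalOrder semantics?
fn may_contain_greater(s: &PageStats, t: f64) -> bool {
    s.pos_nan_count > 0 || s.max > t
}

// May the page contain a value below t (t non-NaN), totalOrder semantics?
fn may_contain_less(s: &PageStats, t: f64) -> bool {
    s.neg_nan_count > 0 || s.min < t
}
```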
To echo what has been said before, and reiterated by @JFinis: the total order mechanism is a relatively straightforward fix that is not only easy to implement but also easy to explain to users, if it even needs explaining. A NaN just becomes a very big value, and a negative NaN a very small value. Yes, this means a NaN may limit the ability to prune using statistics, but this is no different from an abnormally large or small value. IMO this is pretty intuitive. That's not to say the NaN-counts proposal is without merit; I could definitely see it being useful for engines that order NaNs differently. However, it is more complex for both users and Parquet readers to reason about. IMO we have a proposal with relatively broad consensus and multiple implementations; I think it is worthwhile bringing it to a broader audience in the form of a vote. I don't see adopting total ordering as a one-way door; we can always add a NaN-count mechanism later.
I do see it as a one-way door, because no vendor is going to adopt a new ordering later if it means all readers which do not (yet) support the new ordering will downgrade from "suboptimal pruning" (the new status quo if this is adopted) to "no pruning" (the current status quo). Realistically speaking, we only have one chance to go from the status quo to a good solution; adopting a suboptimal solution now will likely leave us in a local maximum for the remainder of Parquet's lifetime.
```rust
x_int ^= (((x_int >> 63) as u64) >> 1) as i64;
y_int ^= (((y_int >> 63) as u64) >> 1) as i64;
```
Let's add a comment explaining those lines?
Suggested change:

```rust
// Turn sign+mantissa into 2's complement representation
x_int ^= (((x_int >> 63) as u64) >> 1) as i64;
y_int ^= (((y_int >> 63) as u64) >> 1) as i64;
```
That's assuming the comment is right. Please double-check :)
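For reference, here is a self-contained way to sanity-check that claim; this is my sketch, not code from the PR:

```rust
// The XOR converts the raw sign-magnitude float bits into two's complement,
// so plain integer comparison afterwards matches IEEE 754 totalOrder.
fn total_order_key(x: f64) -> i64 {
    let mut x_int = x.to_bits() as i64;
    // For negatives (sign bit set) this XORs with 0x7FFF..., flipping the
    // 63 low bits; for positives it XORs with 0 and leaves them unchanged.
    x_int ^= (((x_int >> 63) as u64) >> 1) as i64;
    x_int
}

fn main() {
    let neg_nan = f64::from_bits(0xFFF8_0000_0000_0000);
    let mut v = [f64::NAN, 1.0, 0.0, -0.0, f64::NEG_INFINITY, neg_nan, -1.5];
    v.sort_by_key(|x| total_order_key(*x));
    // Expected totalOrder: -NaN, -inf, -1.5, -0.0, 0.0, 1.0, NaN
    println!("{v:?}");
}
```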
I don't think that's a big problem. It just means that if the system needs to include either -NaN or +NaN in a query, any page that has a non-zero NaN count cannot be pruned.
I am only a guest here and no PMC member, so I don't know how to continue the process. Could someone pick this up and do a vote? I agree with @orlp that since there are two possible solutions for the problem on the table, we could also do a vote taking both approaches into account. Or two separate votes, one for each? Or just a single vote? Or something completely different? I am happy with any of those. As mentioned, I don't know how things are done here. Please, someone more official than I, pick this up and do whatever is prudent here. I am happy to work in the latest suggestions and rebase, or switch back and improve the other PR, but before doing so, I would like to get a better understanding of whether the PR at hand has a chance to make it in the first place.
I wasn't suggesting a formal vote on total order. Rather, I was thinking of an informal poll, as @JFinis had suggested, to fail fast on this PR if it lacked PMC support. But I see your point. What I want to avoid is a protracted discussion that after several months ends with the majority opinion of those involved being to proceed with total order, only to then be shot down in an actual vote. The most vocal opinions recently have been from non-PMC members. Also, any vote involving …
I believe it's worse than that. Consider a page …
I can try to scan through the two proposals again, but I think maybe we should send a note about progress made with the current proposal and the alternative that has been brought up. If the current proposal … We generally haven't had too many complaints about NaNs in general in Parquet, so I'm inclined toward the simplest thing possible here from a specification perspective (but if there are open datasets on which we can evaluate the impact on prunability, settling the argument with data would be useful).
I've posted a brief update on the Parquet dev ML: https://lists.apache.org/thread/lzh0dvrvnsy8kvflvl61nfbn6f9js81s |
I don't think the two proposals are mutually exclusive. Here is an existence proof of a way they could both coexist: …
Of course, the spec would be less complicated if we included only one of them. So my suggestion is: let's get this proposal up for a vote, and then write up a follow-on proposal for being more efficient in the presence of NaNs.
It would be incompatible with all existing query engines whose Parquet readers implemented this PR and use it to prune. Suppose I write a query engine with a Parquet reader, and want to do pruning. My query is "find all rows where column …"
Now if we later change the semantics like you propose, this query engine would incorrectly miss rows if it doesn't understand the new field. So the only option would be to completely replace the ordering introduced in this PR with another ordering which includes this information.
In my opinion this situation would be better than the current state of affairs. I still don't see the proposals as mutually exclusive (though the current proposal could be improved). Is there a formal proposal (aka a PR with proposed wording) for the alternative NaN handling? @orlp are you willing to make such a proposal if it doesn't exist?
I don't have the time or energy to make a fully formal proposal, no. That said, if I may informally propose something: I think the current order we have in the spec today is fine. I would honestly only clarify what should be done in the event a full-NaN page is encountered. I think in that case it makes the most sense to write min = -inf and max = inf, as in the sketch below.

Then if this ordering (which we already have! - can't get easier than that) is simply combined with the addition of two new statistics (pos_nan_count and neg_nan_count), readers have everything they need. I assume the addition of a completely new statistic is backwards compatible, if we don't change any semantics of the existing ordering (only clarifying what should be done for the full-NaN page)? Then the statistics can be implemented in an incredibly simple, branchless (apart from the post-processing) way:

```rust
fn float_stats(data: &[f64]) -> (f64, f64, usize, usize) {
    let mut min = f64::INFINITY;
    let mut max = f64::NEG_INFINITY;
    let mut pos_nan_count = 0usize;
    let mut neg_nan_count = 0usize;
    for &x in data {
        // NaN comparisons are always false, so NaNs never enter min/max.
        min = if x < min { x } else { min };
        max = if x > max { x } else { max };
        pos_nan_count += (x.is_nan() && x.is_sign_positive()) as usize;
        neg_nan_count += (x.is_nan() && x.is_sign_negative()) as usize;
    }
    // Post-processing normalization.
    if min == 0.0 { min = -0.0; } // widen zero bounds: min becomes -0.0 ...
    if max == 0.0 { max = 0.0; }  // ... and max becomes +0.0
    if pos_nan_count + neg_nan_count == data.len() {
        // Full-NaN page: write the widest possible bounds.
        min = f64::NEG_INFINITY;
        max = f64::INFINITY;
    }
    (min, max, pos_nan_count, neg_nan_count)
}
```
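For what it's worth, a quick sanity check of the sketch above; the example (and the `float_stats` name in the rendering) is mine:

```rust
fn main() {
    // f64::NAN is a positive quiet NaN, so it lands in pos_nan_count.
    let (min, max, pos, neg) = float_stats(&[1.0, f64::NAN, -0.5, 0.0]);
    assert_eq!((min, max, pos, neg), (-0.5, 1.0, 1, 0));

    // The zero normalization widens the bounds to cover both zero signs.
    let (zmin, zmax, _, _) = float_stats(&[0.0]);
    assert!(zmin.is_sign_negative() && zmax.is_sign_positive());
}
```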
I'm wondering if counts are actually necessary. I see a need for null counts, but with NaN I think all we really care about is the lack or presence of them, rather than the number. Perhaps two boolean fields would suffice.
I think #196 could be revived if need be. |
My personal opinion would be not to revive NaN counts. Can we maybe first establish whether we have a majority for anything? I feel like we're being startled by a singular opinion here, while we already had broad consensus on switching from nan_counts to total order; that's why I created this second PR in the first place. It feels like going in circles without adding new insights. One technical comment:

I don't think it's fully fine. It's not a total order, so it makes …
@emkornfield @Fokko @gszadovszky @julienledem @rdblue |
This commit is a combination of the following PRs:

* Introduce IEEE 754 total order apache#221
* Add nan_count to handle NaNs in statistics apache#196

Both these PRs try to solve the same problems; read the description of the respective PRs for explanation. This PR is the result of an extended discussion in which it was repeatedly brought up that another possible solution to the problem could be the combination of the two approaches. Please refer to this discussion on the mailing list and in the two PRs for details. The mailing list discussion can be found here: https://lists.apache.org/thread/lzh0dvrvnsy8kvflvl61nfbn6f9js81s

The contents of this PR are basically a straightforward combination of the two approaches:

* IEEE total order is introduced as a new order for floating point types
* nan_count and nan_counts fields are added

Legacy writers may not write nan_count(s) fields, so readers have to handle them being absent. Also, legacy writers may have included NaNs in min/max bounds, so readers have to handle that as well. As there are no legacy writers writing IEEE total order, nan_count(s) are defined to be mandatory if this order is used, so readers can assume their presence when this order is used.

This commit removes `nan_pages` from the ColumnIndex, which the nan_counts PR mandated. We don't need them anymore, as IEEE total order solves this: as this is a new order and there are thus no legacy writers and readers, we have the freedom to define that for only-NaN pages using this order, we can actually write NaNs into min & max bounds, and readers can assume that isNaN(min) signals an only-NaN page.
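A sketch of the reader-side consequence of that last rule (the function name is mine): for columns using the new order, detecting an only-NaN page reduces to a single check.

```rust
// Combined proposal, as described above: for columns ordered by
// IEEE754TotalOrder, writers put NaN into min/max only for only-NaN pages,
// so min alone identifies such pages and lets non-NaN predicates skip them.
fn is_only_nan_page(min_stat: f64) -> bool {
    min_stat.is_nan()
}
```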
@JFinis has created a combined, updated proposal that adds total order and nan_count here. Please take a look if you are interested.
This commit adds a new column order `IEEE754TotalOrder`, which can be used for floating point types (`FLOAT`, `DOUBLE`, `FLOAT16`). The advantage of the new order is a well-defined ordering between -0, +0, and the various possible bit patterns of NaNs. Thus, every single possible bit pattern of a floating point value has a well-defined order now, so there are no possibilities where two implementations might apply different orders when the new column order is used.
With the default column order, there were many problems w.r.t. NaN values, which led to reading engines not being able to use statistics of floating point columns for scan pruning even in the case where no NaNs were in the data set. The problems are discussed in detail in the next section.
This solution to the problem is the result of the extended discussion in #196, which ended with the consensus that IEEE 754 total ordering is the best approach to solve the problem in a simple manner without introducing special fields for floating point columns (such as `nan_counts`, which was proposed in that PR). Please refer to the discussion in that PR for all the details on why this solution was chosen over various design alternatives.

Note that this solution is fully backward compatible and should break neither old readers nor writers, as a new column order is added. Legacy writers can continue not writing this new order and instead writing the default type-defined order. Legacy readers should avoid using any statistics on columns that have a column order they do not understand, and therefore should just not use the statistics for columns ordered using the new order.
The remainder of this message explains in detail what the problems are and how the proposed solution fixes them.
Problem Description
Currently, the way NaN values are to be handled in statistics inhibits most scan pruning once NaN values are present in `DOUBLE` or `FLOAT` columns. Concretely, the following problems exist:
Statistics don't tell whether NaNs are present
As NaN values are not to be incorporated in min/max bounds, a reader cannot know whether NaN values are present. This might seem to be not too problematic, as most queries will not filter for NaNs. However, NaN is ordered in most database systems. For example, Postgres, DB2, and Oracle treat NaN as greater than any other value, while MSSQL and MySQL treat it as less than any other value. An overview of what different systems are doing can be found in apache/arrow-rs#264. The gist of it is that different systems with different semantics exist w.r.t. NaNs, and most of the systems do order NaNs, either less than or greater than all other values.
For example, if the semantics of the reading query engine mandate that NaN is to be treated as greater than all other values, the predicate `x > 1.0` should include NaN values. If a page has max = 0.0, the engine would not be able to skip the page, as the page might contain NaNs which would need to be included in the query result.
Likewise, the predicate `x < 1.0` should include NaN if NaN is treated as less than all other values by the reading engine. Again, a page with min = 2.0 couldn't be skipped in this case by the reader.
Thus, even if a user doesn't query for NaN explicitly, they might use other predicates that need to filter or retain NaNs in the semantics of the reading engine, so the fact that we currently can't know whether a page or row group contains NaN is a bigger problem than it might seem at first sight.
Currently, any predicate that needs to retain NaNs cannot use min and max bounds in Parquet and therefore cannot be used for scan pruning at all. And as stated, that can be many seemingly innocuous greater-than or less-than predicates in most database systems. Conversely, it would be nice if Parquet would enable scan pruning in these cases, regardless of whether the reader and writer agree upon whether NaN is smaller, greater, or incomparable to all other values.
Note that the problem exists especially if the Parquet file doesn't include any NaNs, so this is not only a problem in the edge case where NaNs are present; it is a problem in the way more common case of NaNs not being present.
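To spell out the reader's dilemma in the two predicate examples above, here is a minimal sketch (names and signature are mine) of the only skip test a "NaN is greater than everything" engine can safely implement against today's statistics:

```rust
// Status quo: NaNs are excluded from max and their presence is unknowable,
// so an engine whose semantics say "NaN > everything" can never prove a
// page irrelevant for the predicate x > t.
fn can_skip_today(max_stat: f64, t: f64, nans_possible: bool) -> bool {
    // nans_possible is effectively always true today, so this never skips.
    !nans_possible && max_stat <= t
}
```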
Handling NaNs in a ColumnIndex
There is currently no well-defined way to write a spec-conforming `ColumnIndex` once a page has only NaN (and possibly null) values. NaN values should not be included in min/max bounds, but if a page contains only NaN values, then there is no other value to put into the min/max bounds. However, bounds in a `ColumnIndex` are non-optional, so we have to put something in there. The spec does not describe what engines should do in this case. Parquet-mr takes the safe route and does not write a column index once NaNs are present. But this is a huge pessimization, as a single page containing NaNs will prevent writing a column index for the column chunk containing that page, so even pages in that chunk that don't contain NaNs will not be indexed.
It would be nice if there was a defined way of writing the ColumnIndex when NaNs (and especially only-NaN pages) are present.
Handling only-NaN pages & column chunks
Note: Hereinafter, whenever the term only-NaN is used, it refers to a page or column chunk whose only non-null values are NaNs. E.g., an only-NaN page is allowed to have a mixture of null values and NaNs, or only NaNs, but no non-NaN non-null values.
The `Statistics` objects stored in page headers and in the file footer have a similar, albeit smaller, problem: `min_value` and `max_value` are optional here, so it is easier to not include NaNs in the min/max in case of an only-NaN page or column chunk: simply omit these optional fields. However, this brings a semantic ambiguity with it, as it is now unclear whether the min/max value wasn't written because there were only NaNs, or simply because the writing engine decided to omit them for whatever other reason, which is allowed by the spec as the fields are optional.
Consequently, a reader cannot know whether missing `min_value` and `max_value` means "only NaNs, you can skip this page if you are looking for only non-NaN values" or "no stats written, you have to read this page, as it is undefined what values it contains".
It would be nice if we could handle NaNs in a way that would allow scan pruning for these only-NaN pages.
Solution
IEEE 754 total order solves all the mentioned problems. As NaNs now have a defined place in the ordering, they can be incorporated into min and max bounds. In fact, in contrast to the default ordering, they do not need any special casing anymore, so all the remarks about how readers and writers should special-handle NaNs and -0/+0 no longer apply to the new ordering.
As NaNs are incorporated into min and max, a reader can now see whether NaNs are contained through the statistics. Thus, a reading engine just has to map its NaN semantics to the NaN semantics of total ordering. For example, if the semantics of the reading engine treat all NaNs (also -NaNs) as greater than all other values, a reading engine having a predicate `x > 5.0` (which should include NaNs) may not filter any pages / row groups if either min or max is (+/-)NaN.

Only-NaN pages can now also be included in the column index, as they are no longer a special case.
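As a concrete illustration of that mapping, here is a sketch (my code, not spec text) of the page-skip test for `x > t` in an engine that treats every NaN as greater than all other values:

```rust
// Statistics use IEEE 754 total order: +NaN is the largest and -NaN the
// smallest value, so any NaN present in the page shows up in min or max.
fn can_skip_page(min_stat: f64, max_stat: f64, t: f64) -> bool {
    if min_stat.is_nan() || max_stat.is_nan() {
        return false; // NaNs may be present, and they would match x > t
    }
    max_stat <= t // no NaNs in the page: an ordinary bound check suffices
}
```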
In conclusion, all mentioned problems are solved by using IEEE 754 total ordering.