DRAFT: Extension types #451

pitrou · 2024-09-19T10:56:43Z

Rationale for this change

What changes are included in this PR?

Do these changes have PoC implementations?

emkornfield · 2024-09-24T06:57:57Z

LogicalTypes.md

+When a reader encounters an extension type in a Parquet schema, it should try
+to match it by name to its known extension types. If it does not recognize
+the extension type, then it should read it as the underlying physical type
+and should not try to interpret the column's statistics. It may however


min/max statistics, others should be valid?

Oops, yes, you're right.

perhaps including column index?
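
For illustration, the fallback rule discussed in this thread could be sketched as below. This is a minimal sketch, not code from the PR: `KNOWN_EXTENSIONS`, `ColumnStats`, and `usable_stats` are hypothetical names, and the extension names are made up. It assumes the refinement agreed above, that only min/max are off limits for an unrecognized extension type while order-independent statistics stay valid. Whether the column index should be treated the same way is still the open question raised above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical registry of the extension type names this reader understands.
KNOWN_EXTENSIONS = {"example.ip_address"}


@dataclass
class ColumnStats:
    """Simplified stand-in for a column chunk's statistics."""
    null_count: Optional[int] = None
    min_value: Optional[bytes] = None
    max_value: Optional[bytes] = None


def usable_stats(extension_name: Optional[str], stats: ColumnStats) -> ColumnStats:
    """Return the statistics a reader may safely interpret for a column."""
    if extension_name is None or extension_name in KNOWN_EXTENSIONS:
        # Plain column, or an extension whose ordering the reader knows.
        return stats
    # Unknown extension: the reader cannot know the sort order, so it drops
    # min/max and keeps only statistics that do not depend on ordering.
    return ColumnStats(null_count=stats.null_count)
```

With this shape, a reader that does not recognize the extension still reads the values as the underlying physical type and can still use the null count, for example to skip all-null pages.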

emkornfield · 2024-09-24T07:00:41Z

Generally seems reasonable to me.

wgtmac · 2024-09-24T13:50:37Z

LogicalTypes.md

+When a reader encounters an extension type in a Parquet schema, it should try
+to match it by name to its known extension types. If it does not recognize
+the extension type, then it should read it as the underlying physical type
+and should not try to interpret the column's statistics. It may however


perhaps including column index?

wgtmac · 2024-09-24T14:09:46Z

src/main/thrift/parquet.thrift

+ *
+ * If the extension type is not parametric, then `serialization` is empty.
+ */
+struct ExtensionTypeDescription {


Why choose a dedicated ExtensionTypeDescription struct over list<KeyValue>? I'm afraid that a binary-typed field may invite misuse by users.

What would the list<KeyValue> contain and where would it reside? I'm not following you.

struct ExtensionTypeDescription { 1: optional list<KeyValue> metadata }

And specify the required keys for each extension type, pretty much like what Arrow does.

This does not make sense, does it? The keys will always be the same, so why not reify them in the Thrift spec as the PR currently does?

Or are you thinking about extension-specific parameter keys as in https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html ?

Note we would still need the extension name, so this would be:

struct ExtensionTypeDescription {
  1: required string name
  2: optional list<KeyValue> parameters
}

Or are you thinking about extension-specific parameter keys as in https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html ?

Yes, I mean something like this.
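
To make the two shapes in this thread easier to compare, here is a rough Python model of each. It is a sketch under assumptions, not the Thrift definition itself: the class names, the `missing_keys` helper, and the `example.tensor` keys are hypothetical, and the `name` field on the opaque variant is inferred from the comment that the extension name is still needed.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class OpaqueDescription:
    """Shape with an opaque payload: parameters are packed into bytes."""
    name: str
    serialization: bytes = b""  # empty when the type is not parametric


@dataclass
class KeyValueDescription:
    """Shape suggested in review: a name plus string key-value parameters."""
    name: str
    parameters: Dict[str, str] = field(default_factory=dict)


def missing_keys(desc: KeyValueDescription, required: Iterable[str]) -> List[str]:
    """List the required parameter keys that the description does not carry."""
    return sorted(set(required) - set(desc.parameters))


# Hypothetical parametric extension that declares two required keys.
desc = KeyValueDescription("example.tensor", {"shape": "[2, 3]"})
print(missing_keys(desc, ["shape", "dim_names"]))
```

The key-value shape lets a generic reader validate required keys without knowing how to decode the payload, whereas the opaque shape leaves the encoding entirely to each extension type.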

abellgithub · 2025-10-16T12:39:49Z

What makes something appropriate as an extension type rather than a basic supported type? GEOMETRY is currently supported natively, which seems to be simply a binary blob with a metadata CRS string. Is the problem that adding such types is cumbersome?

emkornfield · 2025-10-16T14:15:05Z

What makes something appropriate as an extension type rather than a basic supported type? GEOMETRY is currently supported natively, which seems to be simply a binary blob with a metadata CRS string. Is the problem that adding such types is cumbersome?

Yes, mainly the fact that adding new types is cumbersome, so that cost has to be weighed against how useful we expect the type to be to the broader community. Other questions that come into play are whether stats are important and how well an extension type would work in that context (I forget whether this design addresses that issue). IMO, GEOMETRY might have been considered for an extension type if we had this facility.

As a data point, in the Arrow project most new types have been added as extension types, I believe.

abellgithub · 2025-10-16T18:18:06Z

Are you just providing a name for an introduced type? The examples don't show using any special handling -- IP address as FIXED_LEN_BYTE_ARRAY(16) and f64tensor as JSON -- and there is no example to help explain leaf vs. non-leaf handling. Is there some more complex vision? If so an example would be helpful.

emkornfield · 2025-10-23T02:50:03Z

Are you just providing a name for an introduced type? The examples don't show using any special handling -- IP address as FIXED_LEN_BYTE_ARRAY(16) and f64tensor as JSON -- and there is no example to help explain leaf vs. non-leaf handling. Is there some more complex vision? If so an example would be helpful.

I think this needs to be fleshed out some more for leaf/non-leaf handling and possible custom statistics for aggregate types (e.g. point cloud data). I don't think f64tensor is supposed to be JSON, just its metadata.
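
As a rough illustration of what the name buys a reader in the leaf case, the IP address example could look like the sketch below. It is only an assumption-laden sketch: `read_ip_value` is a hypothetical helper, and it assumes the 16 bytes hold an address in packed network byte order, which the examples in the PR may or may not specify.

```python
import ipaddress
from typing import Union


def read_ip_value(raw: bytes, reader_knows_extension: bool) -> Union[bytes, ipaddress.IPv6Address]:
    """Interpret one FIXED_LEN_BYTE_ARRAY(16) value of an IP address column."""
    if len(raw) != 16:
        raise ValueError("expected exactly 16 bytes")
    if not reader_knows_extension:
        # Fallback: expose the underlying physical value unchanged.
        return raw
    # A reader that recognizes the extension name can return a domain value.
    return ipaddress.IPv6Address(raw)
```

The stored bytes are identical in both cases; only the interpretation changes, which is why non-leaf extensions and custom statistics need a separate worked example.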

pitrou force-pushed the extension_types branch from 8b0da72 to 042a207 on September 19, 2024, 11:01