Add array_transform function #17289

Draft
wants to merge 1 commit into
base: main

@timsaucer (Contributor) commented Aug 22, 2025

Which issue does this PR close?

Rationale for this change

There are many use cases where a column contains arrays and you want to transform every element of each array. The current workaround is to unnest the column and then aggregate it back, which is poor for both ergonomics and performance. This PR adds a function, array_transform, that takes a scalar function and applies it to every element in an array.

This PR is narrowly scoped as a first proof of concept. It does not address aggregation, as #15882 requests, and it is limited to cases where every other argument passed to the inner function is a scalar value.

What changes are included in this PR?

Adds array_transform and unit tests.

Are these changes tested?

A unit test is provided that covers both low-level testing of the invocation and a full test demonstrating the function in operation with a dataframe.

Here is an example taken from the test that is included in the PR:

let udf = array_transform_udf(datafusion_functions::math::abs(), 0);
let df = df.select([col("a"), udf.call(vec![col("a")]).alias("abs(a[])")])?;

This produces the following dataframe, which shows the original and the transformed data:

+-------------+-----------+
| a           | abs(a[])  |
+-------------+-----------+
| [1, -2, 3]  | [1, 2, 3] |
| [-4, 5]     | [4, 5]    |
| [-6, 7, -8] | [6, 7, 8] |
+-------------+-----------+
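
To make the element-wise behaviour concrete, here is a minimal standard-library sketch that reproduces the table above without unnesting. It is not the implementation in this PR: `ListColumn` and `transform_elements` are hypothetical names, and the layout (one flat value buffer plus row offsets) is only an assumed simplification of how list arrays are commonly stored.

```rust
/// Hypothetical stand-in for a list column: every element lives in one flat
/// buffer, and `offsets[i]..offsets[i + 1]` is the slice belonging to row `i`.
#[derive(Debug, Clone, PartialEq)]
struct ListColumn {
    offsets: Vec<usize>,
    values: Vec<i64>,
}

impl ListColumn {
    /// Build the flat representation from nested rows.
    fn from_rows(rows: &[Vec<i64>]) -> Self {
        let mut offsets = vec![0];
        let mut values = Vec::new();
        for row in rows {
            values.extend_from_slice(row);
            offsets.push(values.len());
        }
        ListColumn { offsets, values }
    }

    /// Rebuild nested rows, mainly for printing and testing.
    fn to_rows(&self) -> Vec<Vec<i64>> {
        self.offsets
            .windows(2)
            .map(|w| self.values[w[0]..w[1]].to_vec())
            .collect()
    }
}

/// Apply `f` to every element. Row boundaries do not change, so the offsets
/// are reused as-is and no unnest/aggregate round trip is needed.
fn transform_elements(col: &ListColumn, f: impl Fn(i64) -> i64) -> ListColumn {
    ListColumn {
        offsets: col.offsets.clone(),
        values: col.values.iter().map(|&v| f(v)).collect(),
    }
}

fn main() {
    let a = ListColumn::from_rows(&[vec![1, -2, 3], vec![-4, 5], vec![-6, 7, -8]]);
    let abs_a = transform_elements(&a, i64::abs);
    println!("{:?}", abs_a.to_rows());
}
```

The point of the sketch is that a scalar function only ever sees the flat values, which is why the transformed column keeps the same row lengths as the input.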

Are there any user-facing changes?

No

Still to do before ready to merge

  • Add additional documentation describing how all the pieces of this work fit together
  • Create a plan for expanding beyond the requirement that all other arguments be scalar values
  • Create a plan for addressing the aggregation case or open an issue for something like array_aggregate
  • Address how it can be used with SQL commands instead of only dataframe operations
  • Potentially move the integrated test to a different location - dataframe may not be the right place to test a function

@github-actions github-actions bot added the core Core DataFusion crate label Aug 22, 2025
@timsaucer timsaucer self-assigned this Aug 22, 2025
@timsaucer (Contributor, Author)

Note to self: to support non-scalar values for arguments other than the array being transformed, see if we can use a run-end encoded array. That would avoid any data duplication: we would only need to build one new array from the lengths of the lists in the array being transformed.
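
A rough sketch of that idea, using only the standard library rather than Arrow's actual run-end encoded array type: given the list offsets and one argument value per row, the run ends are the list end offsets, so each per-row value is stored once and still lines up with every flattened element. `RunEndEncoded` and `align_with_elements` are hypothetical names, and skipping empty lists is an assumption made here so that run ends stay strictly increasing.

```rust
/// Hypothetical run-end encoded view of a per-row argument. Run `k` covers
/// the flattened element indices `run_ends[k - 1]..run_ends[k]` (from 0 for
/// the first run) and all of them share `values[k]`.
struct RunEndEncoded<T> {
    run_ends: Vec<usize>,
    values: Vec<T>,
}

/// Build the encoding from list offsets (assumed to start at 0) and one
/// value per row. No value is repeated per element.
fn align_with_elements<T: Clone>(offsets: &[usize], per_row: &[T]) -> RunEndEncoded<T> {
    assert_eq!(offsets.len(), per_row.len() + 1, "one offset pair per row");
    let mut run_ends = Vec::new();
    let mut values = Vec::new();
    for (i, v) in per_row.iter().enumerate() {
        let (start, end) = (offsets[i], offsets[i + 1]);
        // Empty lists contribute no elements, so they get no run.
        if end > start {
            run_ends.push(end);
            values.push(v.clone());
        }
    }
    RunEndEncoded { run_ends, values }
}

impl<T> RunEndEncoded<T> {
    /// Look up the argument value for flattened element `index`.
    fn get(&self, index: usize) -> Option<&T> {
        // Find the first run whose (exclusive) end lies beyond `index`.
        let run = self.run_ends.partition_point(|&end| end <= index);
        self.values.get(run)
    }
}

fn main() {
    // Lists of length 3, 2 and 3, with one extra argument value per row.
    let encoded = align_with_elements(&[0, 3, 5, 8], &[10, 20, 30]);
    for i in 0..8 {
        println!("element {i} -> {:?}", encoded.get(i));
    }
}
```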

Also consider having a .new() function that takes only the scalar function and assumes it applies to argument index 0, then adding a .with_argument_index() method to change it.
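
One possible shape for that constructor pair, sketched with a plain function pointer standing in for the scalar function. `ArrayTransformSketch` and its `apply` method are hypothetical and exist only to illustrate the builder pattern; only the `new` / `with_argument_index` names come from the note above.

```rust
/// Hypothetical wrapper: `func` stands in for the inner scalar function and
/// `argument_index` says which argument holds the array to transform.
#[derive(Clone, Copy)]
struct ArrayTransformSketch {
    func: fn(i64) -> i64,
    argument_index: usize,
}

impl ArrayTransformSketch {
    /// Common case: the array is assumed to be the first argument.
    fn new(func: fn(i64) -> i64) -> Self {
        Self { func, argument_index: 0 }
    }

    /// Builder-style override for when the array is at another position.
    fn with_argument_index(mut self, argument_index: usize) -> Self {
        self.argument_index = argument_index;
        self
    }

    /// Transform the elements of the selected argument, or return `None`
    /// if the index is out of range for the supplied arguments.
    fn apply(&self, args: &[Vec<i64>]) -> Option<Vec<i64>> {
        let array = args.get(self.argument_index)?;
        Some(array.iter().map(|&v| (self.func)(v)).collect())
    }
}

fn main() {
    let default_index = ArrayTransformSketch::new(i64::abs);
    let second_arg = default_index.with_argument_index(1);
    let args = [vec![-1, 2], vec![-3, -4]];
    println!("{:?}", default_index.apply(&args));
    println!("{:?}", second_arg.apply(&args));
}
```

Keeping the index out of `new` means the common call site only names the function, while the less common case stays explicit.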

Development

Successfully merging this pull request may close these issues.

Implement method to apply scalar or aggregate function to Array elements