GH-32609: [Python] Add type annotations to PyArrow #47609
base: main
Conversation
Force-pushed from a0ce53c to 9c881b4, then from b564265 to 127e741.
Hey @rok, I come bearing unsolicited suggestions 😉
A lot of this comes from 2 recent PRs that have had me battling the current stubs.
`python/pyarrow-stubs/compute.pyi` (outdated):

```python
def field(*name_or_index: str | tuple[str, ...] | int) -> Expression: ...
def scalar(value: bool | float | str) -> Expression: ...
```
Based on `arrow/python/pyarrow/_compute.pyx` lines 2859 to 2869 in 13c2615:

```python
@staticmethod
def _scalar(value):
    cdef:
        Scalar scalar

    if isinstance(value, Scalar):
        scalar = value
    else:
        scalar = lib.scalar(value)

    return Expression.wrap(CMakeScalarExpression(scalar.unwrap()))
```
The Expression version (`pc.scalar`) should accept the same types as `pa.scalar`, right?
Ran into it the other day here where I needed to add a cast.
I'm not sure what you are suggesting. Do you mean:
```diff
diff --git i/python/pyarrow-stubs/compute.pyi w/python/pyarrow-stubs/compute.pyi
index df660e0c0c..f005c5f552 100644
--- i/python/pyarrow-stubs/compute.pyi
+++ w/python/pyarrow-stubs/compute.pyi
@@ -84,7 +84,7 @@ _R = TypeVar("_R")
 def field(*name_or_index: str | tuple[str, ...] | int) -> Expression: ...
-def scalar(value: bool | float | str) -> Expression: ...
+def scalar(value: Any) -> Expression: ...
```
Hmm, yeah I guess Any is what you have there so that could work.
But I think it would be more helpful to use something like this to start:
https://github.com/rok/arrow/blob/6a310149ed305d7e2606066f5d0915e9c23310f4/python/pyarrow-stubs/_stubs_typing.pyi#L50
```python
PyScalar: TypeAlias = (bool | int | float | Decimal | str | bytes |
                       dt.date | dt.datetime | dt.time | dt.timedelta)
```

Then the snippet from (#47609 (comment)) seems to imply `pa.Scalar` is valid as well.
So maybe this would document it more clearly?
```python
def scalar(value: PyScalar | lib.Scalar[Any] | None) -> Expression: ...
```

Context for the next thread:

```python
def name(self) -> str: ...
@property
def num_kernels(self) -> int: ...
```
I wonder if the overloads can be generated instead of written out and maintained manually.
Took me a while to discover this without it being in the stubs 😅
```python
@property
def kernels(self) -> list[ScalarKernel | VectorKernel | ScalarAggregateKernel | HashAggregateKernel]: ...
```
I know this isn't accurate for `Function` itself, but it's the type returned by `FunctionRegistry.get_function`.
If you wanted to be a bit fancier, maybe add some Generics into the mix?
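One way that could look, sketched standalone (the kernel classes and `Function` here are simplified dummies, not the real stub definitions):

```python
from typing import Generic, TypeVar

# Simplified stand-ins for the pyarrow kernel classes.
class ScalarKernel: ...
class VectorKernel: ...

# Constrain the type parameter to the known kernel kinds.
_K = TypeVar("_K", ScalarKernel, VectorKernel)

class Function(Generic[_K]):
    def __init__(self, kernels: list[_K]) -> None:
        self._kernels = kernels

    @property
    def kernels(self) -> list[_K]:
        return self._kernels

# A checker can then narrow f.kernels to list[VectorKernel] rather than
# a union over every kernel kind.
f: Function[VectorKernel] = Function([VectorKernel()])
```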
> look at extracting compute kernel signatures from C++ (valid input types are explicitly stated at registration time).

That would probably be more useful than the route I was going for here.
In Python there's only the repr to work with, but there is quite a lot of information encoded in it:

```python
>>> import pyarrow.compute as pc
>>> pc.get_function("array_take").kernels[:10]
[VectorKernel<(primitive, integer) -> computed>,
 VectorKernel<(binary-like, integer) -> computed>,
 VectorKernel<(large-binary-like, integer) -> computed>,
 VectorKernel<(fixed-size-binary-like, integer) -> computed>,
 VectorKernel<(null, integer) -> computed>,
 VectorKernel<(Type::DICTIONARY, integer) -> computed>,
 VectorKernel<(Type::EXTENSION, integer) -> computed>,
 VectorKernel<(Type::LIST, integer) -> computed>,
 VectorKernel<(Type::LARGE_LIST, integer) -> computed>,
 VectorKernel<(Type::LIST_VIEW, integer) -> computed>]
>>> pc.get_function("min_element_wise").kernels[:10]
[ScalarKernel<varargs[uint8*] -> uint8>,
 ScalarKernel<varargs[uint16*] -> uint16>,
 ScalarKernel<varargs[uint32*] -> uint32>,
 ScalarKernel<varargs[uint64*] -> uint64>,
 ScalarKernel<varargs[int8*] -> int8>,
 ScalarKernel<varargs[int16*] -> int16>,
 ScalarKernel<varargs[int32*] -> int32>,
 ScalarKernel<varargs[int64*] -> int64>,
 ScalarKernel<varargs[float*] -> float>,
 ScalarKernel<varargs[double*] -> double>]
>>> pc.get_function("approximate_median").kernels
[ScalarAggregateKernel<(any) -> double>]
```
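For what it's worth, those reprs do parse mechanically. A small sketch, using the strings shown above as fixed samples (hard-coded so it runs without pyarrow installed):

```python
import re

# Kernel reprs copied from the output above.
reprs = [
    "VectorKernel<(primitive, integer) -> computed>",
    "ScalarKernel<varargs[uint8*] -> uint8>",
    "ScalarAggregateKernel<(any) -> double>",
]

# Split each repr into kernel kind, input signature, and output type.
pattern = re.compile(r"^(\w+)<(.+) -> (.+)>$")

def parse_kernel(r: str) -> tuple[str, str, str]:
    m = pattern.match(r)
    if m is None:
        raise ValueError(f"unrecognized kernel repr: {r!r}")
    return m.group(1), m.group(2), m.group(3)

for r in reprs:
    print(parse_kernel(r))
```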
Oh awesome! Thank you @dangotbanned, I love unsolicited suggestions like these! I am at PyData Paris right now so I probably can't reply properly until Monday, but given your experience I'm sure these will be very useful!

Just a mental note: @pitrou suggested to look at extracting compute kernel signatures from C++ (valid input types are explicitly stated at registration time).
Force-pushed from 4aae1c8 to 80ea044.
This proposes adding type annotations to pyarrow by adopting pyarrow-stubs into pyarrow. To do so we copy pyarrow-stubs's stub files into `arrow/python/pyarrow-stubs/`, restructure them somewhat, and add more annotations. We remove docstrings from the annotations and provide a script to include docstrings into the stub files at wheel-build time. We also remove overloads from the annotations to simplify this PR. We then add annotation checks for all project files, and introduce a CI check to make sure all `mypy`, `pyright` and `ty` annotation checks pass (see `python/pyproject.toml` for any exceptions).

This PR introduces:
- `arrow/python/pyarrow-stubs/`
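The docstring-injection step could look roughly like this. This is a minimal sketch under my own assumptions, not the actual script in the PR; `with_docstring` and the dummy `scalar` function are illustrative names:

```python
import inspect

# Dummy runtime function standing in for a pyarrow API; the real script
# would import pyarrow and walk its public surface.
def scalar(value):
    """Create an Expression representing a scalar value."""

def with_docstring(func, stub_line: str) -> str:
    # Replace a stub's `...` body with the runtime docstring.
    doc = inspect.getdoc(func)
    if not doc:
        return stub_line
    head = stub_line.rstrip().removesuffix("...").rstrip()
    return head + '\n    """' + doc + '"""\n    ...'

stub = with_docstring(scalar, "def scalar(value: Any) -> Expression: ...")
print(stub)
```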