GH-32609: [Python] Add type annotations to PyArrow #47609
base: main
Conversation
Force-pushed from a0ce53c to 9c881b4, then from b564265 to 127e741.
Hey @rok, I come bearing unsolicited suggestions 😉
A lot of this comes from 2 recent PRs that have had me battling the current stubs.
`python/pyarrow-stubs/compute.pyi` (outdated):

```python
def field(*name_or_index: str | tuple[str, ...] | int) -> Expression: ...
def scalar(value: bool | float | str) -> Expression: ...
```
Based on `arrow/python/pyarrow/_compute.pyx` lines 2859 to 2869 in 13c2615:

```python
@staticmethod
def _scalar(value):
    cdef:
        Scalar scalar

    if isinstance(value, Scalar):
        scalar = value
    else:
        scalar = lib.scalar(value)

    return Expression.wrap(CMakeScalarExpression(scalar.unwrap()))
```
The Expression version (`pc.scalar`) should accept the same types as `pa.scalar`, right?
Ran into it the other day here where I needed to add a cast.
I'm not sure what you are suggesting. Do you mean:
```diff
diff --git i/python/pyarrow-stubs/compute.pyi w/python/pyarrow-stubs/compute.pyi
index df660e0c0c..f005c5f552 100644
--- i/python/pyarrow-stubs/compute.pyi
+++ w/python/pyarrow-stubs/compute.pyi
@@ -84,7 +84,7 @@ _R = TypeVar("_R")
 def field(*name_or_index: str | tuple[str, ...] | int) -> Expression: ...
-def scalar(value: bool | float | str) -> Expression: ...
+def scalar(value: Any) -> Expression: ...
```
Hmm, yeah I guess Any is what you have there so that could work.
But I think it would be more helpful to use something like this to start:
https://github.com/rok/arrow/blob/6a310149ed305d7e2606066f5d0915e9c23310f4/python/pyarrow-stubs/_stubs_typing.pyi#L50
```python
PyScalar: TypeAlias = (bool | int | float | Decimal | str | bytes |
                       dt.date | dt.datetime | dt.time | dt.timedelta)
```

Then the snippet from (#47609 (comment)) seems to imply `pa.Scalar` is valid as well.
So maybe this would document it more clearly?
```python
def scalar(value: PyScalar | lib.Scalar[Any] | None) -> Expression: ...
```

Context for the next thread:

```python
def name(self) -> str: ...
@property
def num_kernels(self) -> int: ...
```
I wonder if the overloads can be generated instead of written out and maintained manually.
Took me a while to discover this without it being in the stubs 😅
```python
@property
def kernels(self) -> list[ScalarKernel | VectorKernel | ScalarAggregateKernel | HashAggregateKernel]: ...
```
I know this isn't accurate for `Function` itself, but it's the type returned by `FunctionRegistry.get_function`.
If you wanted to be a bit fancier, maybe add some Generics into the mix?
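One way that could look, sketched standalone (the kernel classes and `Function` here are simplified dummies, not the real stub definitions):

```python
from typing import Generic, TypeVar

# Simplified stand-ins for the pyarrow kernel classes.
class ScalarKernel: ...
class VectorKernel: ...

# Constrain the type parameter to the known kernel kinds.
_K = TypeVar("_K", ScalarKernel, VectorKernel)

class Function(Generic[_K]):
    def __init__(self, kernels: list[_K]) -> None:
        self._kernels = kernels

    @property
    def kernels(self) -> list[_K]:
        return self._kernels

# A checker can then narrow f.kernels to list[VectorKernel] rather than
# a union over every kernel kind.
f: Function[VectorKernel] = Function([VectorKernel()])
```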
> look at extracting compute kernel signatures from C++ (valid input types are explicitly stated at registration time).

That would probably be more useful than the route I was going for here.
In Python there's only the repr to work with, but there is quite a lot of information encoded in it:

```python
>>> import pyarrow.compute as pc
>>> pc.get_function("array_take").kernels[:10]
[VectorKernel<(primitive, integer) -> computed>,
 VectorKernel<(binary-like, integer) -> computed>,
 VectorKernel<(large-binary-like, integer) -> computed>,
 VectorKernel<(fixed-size-binary-like, integer) -> computed>,
 VectorKernel<(null, integer) -> computed>,
 VectorKernel<(Type::DICTIONARY, integer) -> computed>,
 VectorKernel<(Type::EXTENSION, integer) -> computed>,
 VectorKernel<(Type::LIST, integer) -> computed>,
 VectorKernel<(Type::LARGE_LIST, integer) -> computed>,
 VectorKernel<(Type::LIST_VIEW, integer) -> computed>]
>>> pc.get_function("min_element_wise").kernels[:10]
[ScalarKernel<varargs[uint8*] -> uint8>,
 ScalarKernel<varargs[uint16*] -> uint16>,
 ScalarKernel<varargs[uint32*] -> uint32>,
 ScalarKernel<varargs[uint64*] -> uint64>,
 ScalarKernel<varargs[int8*] -> int8>,
 ScalarKernel<varargs[int16*] -> int16>,
 ScalarKernel<varargs[int32*] -> int32>,
 ScalarKernel<varargs[int64*] -> int64>,
 ScalarKernel<varargs[float*] -> float>,
 ScalarKernel<varargs[double*] -> double>]
>>> pc.get_function("approximate_median").kernels
[ScalarAggregateKernel<(any) -> double>]
```
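For what it's worth, those reprs do parse mechanically. A small sketch, using the strings shown above as fixed samples (hard-coded so it runs without pyarrow installed):

```python
import re

# Kernel reprs copied from the output above.
reprs = [
    "VectorKernel<(primitive, integer) -> computed>",
    "ScalarKernel<varargs[uint8*] -> uint8>",
    "ScalarAggregateKernel<(any) -> double>",
]

# Split each repr into kernel kind, input signature, and output type.
pattern = re.compile(r"^(\w+)<(.+) -> (.+)>$")

def parse_kernel(r: str) -> tuple[str, str, str]:
    m = pattern.match(r)
    if m is None:
        raise ValueError(f"unrecognized kernel repr: {r!r}")
    return m.group(1), m.group(2), m.group(3)

for r in reprs:
    print(parse_kernel(r))
```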
Oh awesome! Thank you @dangotbanned, I love unsolicited suggestions like these! I am at PyData Paris right now so I probably can't reply properly until Monday, but given your experience I'm sure these will be very useful!

Just a mental note: @pitrou suggested to look at extracting compute kernel signatures from C++ (valid input types are explicitly stated at registration time).
Force-pushed from 4aae1c8 to 80ea044.
This proposes adding type annotations to pyarrow by adopting pyarrow-stubs into pyarrow. To do so we copy pyarrow-stubs's stub files into `arrow/python/pyarrow-stubs/`, restructure them somewhat, and add more annotations. We remove docstrings from the annotations and provide a script to include docstrings into the stub files at wheel-build time. We also remove overloads from the annotations to simplify this PR. We then add annotation checks for all project files, and introduce a CI check to make sure all `mypy`, `pyright` and `ty` annotation checks pass (see `python/pyproject.toml` for any exceptions).

This PR introduces:
- `arrow/python/pyarrow-stubs/`
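The docstring-injection step could look roughly like this. This is a minimal sketch under my own assumptions, not the actual script in the PR; `with_docstring` and the dummy `scalar` function are illustrative names:

```python
import inspect

# Dummy runtime function standing in for a pyarrow API; the real script
# would import pyarrow and walk its public surface.
def scalar(value):
    """Create an Expression representing a scalar value."""

def with_docstring(func, stub_line: str) -> str:
    # Replace a stub's `...` body with the runtime docstring.
    doc = inspect.getdoc(func)
    if not doc:
        return stub_line
    head = stub_line.rstrip().removesuffix("...").rstrip()
    return head + '\n    """' + doc + '"""\n    ...'

stub = with_docstring(scalar, "def scalar(value: Any) -> Expression: ...")
print(stub)
```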