What is your issue?
I recently tried to resample a large GeoTIFF file using a majority vote for each block, following this Stack Overflow question: https://stackoverflow.com/questions/75041095/how-to-apply-a-custom-function-to-xarray-dataarray-coarsen-reduce
I wanted to use dask for larger-than-memory computation, but while some functions like np.sum execute lazily, using scipy.stats.mode seems to trigger compute() immediately. This exceeds my computer's capacity and defeats the larger-than-memory capabilities of dask.
Here is an example that reproduces the issue:
import xarray as xr
import dask.array as da
import numpy as np
from scipy import stats
# Define dimensions
nx, ny, nt = 3000, 300, 100 # size of each dimension
chunks = (300, 30, 10) # chunk sizes for Dask
# Create Dask arrays
data1 = da.random.random((nx, ny, nt), chunks=chunks)
data2 = da.random.random((nx, ny, nt), chunks=chunks)
# Create coordinates
x = np.linspace(0, 10, nx)
y = np.linspace(0, 5, ny)
time = np.arange(nt)
# Build the xarray Dataset
ds = xr.Dataset(
    {
        "temperature": (("x", "y", "time"), data1),
        "precipitation": (("x", "y", "time"), data2),
    },
    coords={
        "x": x,
        "y": y,
        "time": time,
    },
)
# custom function for computing the mode
def find_mode(arr, axis):
    m, _ = stats.mode(arr, axis=axis)
    return m
# this is lazy!
coarse_sum_ds = ds.coarsen(x=3, y=3, boundary='pad').reduce(np.sum)
# this computes on the spot!
maj_vote_coarse = ds.coarsen(x=3, y=3, boundary='pad').reduce(find_mode)
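To tell the two cases apart without watching memory usage, a small check like the one below can help. It is only a sketch: `is_lazy` is a hypothetical helper name, not part of xarray, and it assumes that "lazy" means every data variable is still backed by a dask array.

```python
import dask.array as da
import numpy as np
import xarray as xr


def is_lazy(obj):
    """Return True if every data variable is still backed by a dask array.

    Hypothetical helper for illustration; accepts a Dataset or a DataArray.
    """
    if isinstance(obj, xr.DataArray):
        return isinstance(obj.data, da.Array)
    return all(isinstance(var.data, da.Array) for var in obj.data_vars.values())


# Small dask-backed dataset, just large enough to coarsen by 3 in x and y
small = xr.Dataset({"a": (("x", "y"), da.ones((6, 6), chunks=(3, 3)))})
summed = small.coarsen(x=3, y=3, boundary="pad").reduce(np.sum)
print(is_lazy(summed))
```

Calling `is_lazy(coarse_sum_ds)` and `is_lazy(maj_vote_coarse)` on the results above should show whether each reduction kept the dask graph or produced in-memory NumPy arrays.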
Could you please advise how to apply a function like this so that it computes lazily with dask?
PS: These are the installed versions in my environment:
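One possible direction, offered as a sketch rather than a confirmed answer: the eager behaviour is probably because scipy.stats.mode converts its input to a NumPy array, whereas np.sum dispatches to dask. Under that assumption, the windows can be built lazily with `coarsen(...).construct(...)` and the mode applied per chunk through `xr.apply_ufunc` with `dask="parallelized"`. The names `block_mode`, `lazy_coarsen_mode`, `x_coarse`, `x_win`, `y_coarse` and `y_win` are made up for this example.

```python
import dask.array as da
import numpy as np
import xarray as xr
from scipy import stats


def block_mode(windows):
    # apply_ufunc moves the core dims (x_win, y_win) to the last two axes;
    # flatten them so each coarse cell becomes one row of window values.
    flat = windows.reshape(windows.shape[:-2] + (-1,))
    return stats.mode(flat, axis=-1, keepdims=False).mode


def lazy_coarsen_mode(arr, x=3, y=3, boundary="pad"):
    """Majority vote over x-by-y blocks of a dask-backed DataArray (sketch)."""
    windows = arr.coarsen(x=x, y=y, boundary=boundary).construct(
        x=("x_coarse", "x_win"), y=("y_coarse", "y_win")
    )
    # dask="parallelized" needs each core dimension in a single chunk
    windows = windows.chunk({"x_win": -1, "y_win": -1})
    return xr.apply_ufunc(
        block_mode,
        windows,
        input_core_dims=[["x_win", "y_win"]],
        dask="parallelized",
        output_dtypes=[windows.dtype],
    )


# Usage with integer-valued class labels stored as floats
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=(12, 6, 2)).astype("float64")
arr = xr.DataArray(da.from_array(labels, chunks=(6, 3, 1)), dims=("x", "y", "time"))
result = lazy_coarsen_mode(arr)  # still lazy; call result.compute() to evaluate
```

Two caveats with this sketch: the result has dimensions `x_coarse` and `y_coarse` and loses the original `x` and `y` coordinates, and with `boundary='pad'` any padded cells are NaN, so incomplete edge blocks may need `nan_policy="omit"` in `stats.mode` or `boundary='trim'` instead. For a whole Dataset, something like `ds.map(lazy_coarsen_mode)` should apply it to each variable.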
INSTALLED VERSIONS
------------------
commit: None
python: 3.12.11 (main, Jul 23 2025, 00:34:44) [Clang 20.1.4 ]
python-bits: 64
OS: Linux
OS-release: 6.12.41-1-MANJARO
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: 4.9.4-development
xarray: 2025.7.1
pandas: 2.3.1
numpy: 2.2.6
scipy: 1.16.0
netCDF4: 1.7.2
pydap: None
h5netcdf: None
h5py: None
zarr: 3.1.0
cftime: 1.6.4.post1
nc_time_axis: None
iris: 3.12.2
bottleneck: None
dask: 2025.7.0
distributed: 2025.7.0
matplotlib: 3.10.3
cartopy: 0.24.1
seaborn: 0.13.2
numbagg: None
fsspec: 2025.5.1
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: None
pip: None
conda: None
pytest: None
mypy: None
IPython: 9.4.0
sphinx: None