cspan: Suggesting an ergonomic C99 API with multidimensional matrices for BLIS

This is more a question whether there is interest for a small, fast and generic C99 implementation of a numpy-alike multidimensional array-view API (currently header file ~200 LOC)? I am not currently a BLIS user, but I gathered it may have some interest for C users of BLIS. It does not currently compile with C++ (can be done), but C is the primary target. It supports
- row- and column-major layouts, transpose
- type safety, bounds checking by default
- full numpy slicing/subview capablilities (currently not negative increment step)
- fast linear iteration of multidimensional (and sliced) arrays.

Note that C++23 std::mdspan will not support the two last bullets at all (maybe in C++26).
The API could possibly be adapted at some level as a "contribution API", for making it more convenient to use BLIS directly from C, by extending it to wrap BLIS-function calls, as shown in the example below. I am not aware of any similar C library with these specific features, ergonomics, and small package. I made a recent [reddit post here](https://www.reddit.com/r/C_Programming/comments/16e6iln/python_numpy_multidimensional_arrays_with_slicing). The example below is a rewrite of the example in the [c++ std::mdspan proposal](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2642r2.html)
```c
// C99:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>
#include "cspan.h"

using_cspan3(Mat, float); // shorthand for defining Mat, Mat2, and Mat3

typedef Mat2 OutMat;
typedef struct { Mat2 m00, m01, m10, m11; } Partition;

Partition partition(Mat2 A)
{
  int32_t M = A.shape[0];
  int32_t N = A.shape[1];
  return (Partition){
    .m00 = cspan_slice(Mat2, &A, {0, M/2}, {0, N/2}),
    .m01 = cspan_slice(Mat2, &A, {0, M/2}, {N/2, N}),
    .m10 = cspan_slice(Mat2, &A, {M/2, M}, {0, N/2}),
    .m11 = cspan_slice(Mat2, &A, {M/2, M}, {N/2, N}),
  };
}

#if 0
void base_case_matrix_product(Mat2 A, Mat2 B, OutMat C)
{
  cblas_sgemm(
    CblasColMajor, CblasNoTrans, CblasNoTrans,
    C.shape[0], C.shape[1], A.shape[1], 1.0f,
    A.data, A.stride.d[1], B.data, B.stride.d[1], 1.0f,
    C.data, C.stride.d[1]);
}
#else
// Slow generic implementation
void base_case_matrix_product(Mat2 A, Mat2 B, OutMat C)
{
  for (int j = 0; j < C.shape[1]; ++j) {
    for (int i = 0; i < C.shape[0]; ++i) {
      Mat2_value C_ij = 0;
      for (int k = 0; k < A.shape[1]; ++k) {
        C_ij += *cspan_at(&A, i,k) * *cspan_at(&B, k,j);
      }
      *cspan_at(&C, i,j) += C_ij;
    }
  }
}
#endif

void recursive_matrix_product(Mat2 A, Mat2 B, OutMat C)
{
  // Some hardware-dependent constant
  enum {recursion_threshold = 32};
  if (C.shape[0] <= recursion_threshold || C.shape[1] <= recursion_threshold) {
    base_case_matrix_product(A, B, C);
  } else {
    Partition c = partition(C),
              a = partition(A),
              b = partition(B);
    recursive_matrix_product(a.m00, b.m00, c.m00);
    recursive_matrix_product(a.m01, b.m10, c.m00);
    recursive_matrix_product(a.m10, b.m00, c.m10);
    recursive_matrix_product(a.m11, b.m10, c.m10);
    recursive_matrix_product(a.m00, b.m01, c.m01);
    recursive_matrix_product(a.m01, b.m11, c.m01);
    recursive_matrix_product(a.m10, b.m01, c.m11);
    recursive_matrix_product(a.m11, b.m11, c.m11);
  }
}


int main(void)
{
  enum {N = 10, D = 256};
  float* values = malloc(N * D * D * sizeof values[0]);
  float out[D * D];

  Mat3 data = cspan_md(values, N, D, D); // default row-major
  c_foreach (i, Mat3, data)
    *i.ref = ((double)rand()/RAND_MAX - 0.5)*4.0;

  OutMat c = cspan_md_layout(c_COLMAJOR, out, D, D);
  Mat2 a = cspan_submd3(&data, 0);
  cspan_transpose(&a);

  clock_t t = clock();
  for (int i=1; i<N; ++i) {
    Mat2 b = cspan_submd3(&data, i);
    cspan_transpose(&b);
    memset(out, 0, sizeof out);
    //base_case_matrix_product(a, b, c);
    recursive_matrix_product(a, b, c);  // 2x faster -O3
  }
  t = clock() - t;

  double sum = 0.0;
  c_foreach (i, Mat2, c) sum += *i.ref;  // linear iterate md span.
  printf("sum=%.16g, %f ms\n", sum, (double)t*1000.0/CLOCKS_PER_SEC);
  free(values);
}
```





Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

cspan: Suggesting an ergonomic C99 API with multidimensional matrices for BLIS #772

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

cspan: Suggesting an ergonomic C99 API with multidimensional matrices for BLIS #772

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions