adding F-contig layout support + other cleanups#108
Conversation
|
Ahh the fix to run older-cpu workflows is done in #104 , once that will merge we can update here. |
There was a problem hiding this comment.
Pull request overview
This PR fixes incorrect matmul results for non-row-major (notably F-contiguous and transposed-view) QuadPrecision inputs by making the QBLAS dispatch stride-aware (mirroring NumPy’s layout/BLAS-ability checks) and expanding test coverage to exercise the full layout matrix and copy-fallback paths.
Changes:
- Update the
matmulstrided loop to detect BLAS-able 2D operands in either orientation and pass the appropriatetransa/transb+lda/ldb/ldc, copying only when required. - Remove the prior
xfails for Fortran-order/transposed batched cases and add new tests covering layout combinations, output layout, empty-core behavior, and copy-fallback scenarios. - Adjust QBLAS interface edge handling for empty GEMV/GEMM calls.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
tests/test_dot.py |
Removes xfails for issue #89 and adds comprehensive layout/copy-path tests for matmul. |
src/csrc/umath/matmul.cpp |
Implements stride-derived BLAS dispatch (incl. transpose flags) plus gather/scatter copy fallback and size/empty guards. |
src/csrc/quadblas_interface.cpp |
Tweaks empty-dimension behavior for qblas_gemv and qblas_gemm. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
cc: @ngoldbaum this one too :) |
ngoldbaum
left a comment
There was a problem hiding this comment.
Overall looks good, just a few minor suggestions inline.
| } | ||
| return 0; | ||
| } | ||
| if (!alpha || !A || !x || !beta || !y) { |
There was a problem hiding this comment.
You already checked !beta above so might as well delete that here
| if kind == "F": | ||
| return np.asfortranarray(a) | ||
| if kind == "T": | ||
| return np.ascontiguousarray(a.T).T |
There was a problem hiding this comment.
Isn't this the same as F-order? Maybe makes more sense to just return a.T, which is different from a fortran-order copy.
| Sleef_quad *beta, Sleef_quad *C, size_t ldc) | ||
| { | ||
| if (m == 0 || n == 0 || k == 0) { | ||
| if (m == 0 || n == 0) { |
There was a problem hiding this comment.
Claude says that no tests hit this code change. You could add a test that does np.matmul(a, b) where the contracted dimension is zero to hit this.
| rng = np.random.default_rng(seed=201) | ||
| A_f = rng.standard_normal((6, 5)) | ||
| B_f = rng.standard_normal((5, 7)) | ||
| _assert_matmul_matches_float64(_qnd(A_f)[::-1], _qnd(B_f), A_f[::-1], B_f) |
There was a problem hiding this comment.
Maybe add a test for negative column strides too? e.g. A[:, ::-1].
| rng = np.random.default_rng(seed=202) | ||
| A_f = rng.standard_normal((20, 7)) | ||
| B_f = rng.standard_normal((7, 11)) | ||
| out = np.asfortranarray(np.zeros((10, 11), dtype=QuadPrecDType())) |
There was a problem hiding this comment.
Maybe also add a test for strided out? Something like out=np.empty(14, dtype=QuadPrecDType())[::2]; np.matmul(A, B, out=out).
closes #89
This PR improves the BLAS integration very parallel to the NumPy's own logic.
@TYPE@_matmuldispatch. NumPy hands the inner loop raw strides and expects the loop itself to dispatch, so we do the same instead of assuming row-major.is_blasable2dcheck: an operand is BLAS-able iff its fast axis is contiguous and the slow axis is a valid leading dimension; each operand is tested in both orientations (C- and F-blasable).'N'/'T') andldaderived from strides, so F-contiguous/transposed inputs need no copy (QBLAS already supportstransa/transbcorrectly).dim > INT_MAX) or empty cores fall back to the naive strided loop, matching NumPy'stoo_big_for_blasguard.I think this is a good base to integrate dot operations efficiently in follow-up PRs