DaCe backend: set unstructured_horizontal_has_unit_stride=True #1130

Open

havogt wants to merge 12 commits into main from havogt-patch-2

Conversation

havogt (Contributor) commented Mar 25, 2026

No description provided.

@havogt havogt requested a review from edopao March 25, 2026 19:14
havogt (Contributor, Author) commented Mar 25, 2026

cscs-ci run default

havogt (Contributor, Author) commented Mar 25, 2026

cscs-ci run distributed

edopao previously approved these changes Mar 25, 2026
havogt (Contributor, Author) commented Mar 26, 2026

cscs-ci run distributed

edopao (Contributor) commented Mar 31, 2026

cscs-ci run distributed

edopao (Contributor) commented Mar 31, 2026

cscs-ci run default

edopao (Contributor) commented Apr 1, 2026

cscs-ci run dace

edopao (Contributor) commented Apr 1, 2026

@msimberg Do you have any idea why the distributed CI pipeline is failing? The test itself seems to pass.

havogt (Contributor, Author) commented Apr 1, 2026

@edopao
from Mikael (I had the same question a few days ago):

hmm, on the job you linked it's useful to know the following:

  1. it passed on rank 0 but failed overall -> likely another rank failed
  2. srun: error: nid005301: task 1: Exited with exit code 1 -> rank 1 failed
  3. go to the right panel and browse or download the job artifacts
  4. open the log for rank 1
  5. ???
  6. profit

in this case the output is

FAILED model/atmosphere/dycore/tests/dycore/mpi_tests/test_parallel_solve_nonhydro.py::test_run_solve_nonhydro_single_step[experiment0-1-2021-06-20T12:00:10.000-1-2-2021-06-20T12:00:10.000-1-True] - AssertionError: assert False
 +  where False = <function dallclose at 0x4003eee55870>(array([[-4.49902001, -2.47195228, -2.43418221, ..., -1.1147065 ,\n        -0.97308378, -0.78161297],\n       [-4.74286997, -2.51603478, -2.50672366, ..., -1.60057587,\n        -1.36821611, -1.03553511],\n       [-3.69400648, -1.96190308, -3.1953062 , ..., -1.31898163,\n        -1.17395118, -0.9497928 ],\n       ...,\n       [ 9.05451828,  1.8218479 ,  3.39239301, ...,  0.90259989

...directly from that I can't tell whether it's completely off, or whether something in the numerics has changed the error slightly so that you just need to loosen the tolerances.
If you want to see the output more easily in CI you can also remove the ci-mpi-wrapper around the nox/pytest call, but all the output will be interleaved (it's a bit more obvious that something really went wrong, but it's a pain to read the logs).
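For context on why the failure above prints only a bare "AssertionError: assert False": a comparison helper that returns a boolean gives pytest nothing to report beyond the truth value. A minimal sketch of the difference (assuming `dallclose` is essentially a wrapper around `numpy.allclose`; the exact icon4py signature may differ):

```python
import numpy as np

def dallclose(a, b, rtol=1e-12, atol=0.0):
    # Boolean wrapper: a plain `assert dallclose(a, b)` only reports
    # "assert False" on failure, with no information about which
    # elements differ or by how much.
    return np.allclose(a, b, rtol=rtol, atol=atol)

a = np.array([1.0, 2.0, 3.0])
b = a * (1.0 + 1e-10)  # perturb every element by a relative 1e-10

print(dallclose(a, b, rtol=1e-9))   # True: within tolerance
print(dallclose(a, b, rtol=1e-11))  # False: exceeds tolerance
```

This is why a later comment in this thread switches to an `assert_dallclose`-style check, which raises with a detailed mismatch report instead of returning a boolean.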

edopao (Contributor) commented Apr 1, 2026

cscs-ci run distributed

edopao (Contributor) commented Apr 1, 2026

cscs-ci run dace

msimberg (Contributor)

I think once #1102 is merged we should try this again. We should get a max diff on the one rank that failed in the distributed tests. I suppose this option can change code generation and thus slightly change how errors accumulate? We might just be slightly changing the error on one test with this option.

msimberg (Contributor)

I've bumped gt4py to main (which as far as I can tell includes the patch).

msimberg (Contributor)

cscs-ci run distributed

msimberg (Contributor)

Ok, it wasn't quite that easy... something in the version bump broke other things.

havogt (Contributor, Author) commented Apr 29, 2026

cscs-ci run distributed

havogt (Contributor, Author) commented Apr 29, 2026

cscs-ci run default

msimberg (Contributor)

cscs-ci run distributed

msimberg (Contributor)

Using assert_dallclose, one of the ranks on the parallel test_run_solve_nonhydro_single_step test reports (on this job: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5125340235196978/2255149825504675/-/jobs/14142397511):

FAILED model/atmosphere/dycore/tests/dycore/mpi_tests/test_parallel_solve_nonhydro.py::test_run_solve_nonhydro_single_step[experiment0-1-2021-06-20T12:00:10.000-1-2-2021-06-20T12:00:10.000-1-True] - AssertionError: 
Not equal to tolerance rtol=1e-10, atol=0

Mismatched elements: 1 / 536705 (0.000186%)
Max absolute difference among violations: 1.18404195e-16
Max relative difference among violations: 1.18822024e-10
 ACTUAL: array([[-4.49902 , -2.471952, -2.434182, ..., -1.114707, -0.973084,
        -0.781613],
       [-4.74287 , -2.516035, -2.506724, ..., -1.600576, -1.368216,...
 DESIRED: array([[-4.49902 , -2.471952, -2.434182, ..., -1.114707, -0.973084,
        -0.781613],
       [-4.74287 , -2.516035, -2.506724, ..., -1.600576, -1.368216,...

This happens on one test with dace_cpu (dace_gpu is still not tested in CI, so unsure if it's affected).

This looks... ok to me? It's not great if we increase the difference to serialized data, but this is exactly one element going above the tolerance. @edopao @havogt @philip-paul-mueller do you have any reason to suspect that this change would be due to a bug or should we go ahead and slightly change the tolerance on that test?
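The report format above matches numpy.testing.assert_allclose, so presumably assert_dallclose delegates to it. A small sketch of the failure mode being discussed (hypothetical arrays; only the shape of the behavior is the point): a single element drifting just past rtol fails the whole comparison with a "Mismatched elements: 1 / N" report, and a slightly looser tolerance makes it pass.

```python
import numpy as np

desired = np.linspace(1.0, 2.0, 1000)
actual = desired.copy()
actual[42] *= 1.0 + 1.2e-10  # push exactly one element just past rtol=1e-10

try:
    np.testing.assert_allclose(actual, desired, rtol=1e-10, atol=0)
except AssertionError as e:
    # The report includes "Mismatched elements: 1 / 1000" plus the max
    # absolute and relative differences among the violating elements.
    print(e)

# A slightly looser tolerance makes the same comparison pass.
np.testing.assert_allclose(actual, desired, rtol=2e-10, atol=0)
```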

philip-paul-mueller (Collaborator)

I think there is no reason against increasing the tolerance. However, I would run it again to see whether the error stays the same.

If the error moves, you could also try running with the same cache and with OpenMP disabled. That should eliminate almost everything except the network.
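A hedged sketch of what such a rerun could look like, assuming DaCe's standard environment-variable config mapping (`DACE_<section>_<option>`); the actual CI invocation in this repo wraps pytest via nox and srun, so the last line is a hypothetical simplification:

```shell
# Reuse the previously generated build folder instead of regenerating code
export DACE_compiler_use_cache=1
# Run DaCe's OpenMP-parallel sections on a single thread, so results are
# reproducible up to network (halo-exchange) effects
export OMP_NUM_THREADS=1

# Hypothetical direct invocation of the failing distributed test
pytest model/atmosphere/dycore/tests/dycore/mpi_tests/test_parallel_solve_nonhydro.py
```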

edopao dismissed their stale review May 6, 2026 15:25

While working on my project I found some bugs in the lowering to SDFG. I have tested the change in this PR together with my bug fixes (see #1247; one test fails but it's unrelated), and the CI error we saw on this PR goes away.

edopao (Contributor) commented May 6, 2026

cscs-ci run distributed

msimberg (Contributor) commented May 7, 2026

Just adding this as a reminder before merging: we added GT4PY_UNSTRUCTURED_HORIZONTAL_HAS_UNIT_STRIDE=1 to the readme in #1221. I guess this would allow removing that?

edopao (Contributor) commented May 7, 2026

> Just adding this as a reminder before merging: we added GT4PY_UNSTRUCTURED_HORIZONTAL_HAS_UNIT_STRIDE=1 to the readme in #1221. I guess this would allow removing that?

Good point. Yes, we should test this as well.

edopao (Contributor) commented May 7, 2026

cscs-ci run distributed

edopao (Contributor) commented May 7, 2026

cscs-ci run dace

edopao (Contributor) commented May 7, 2026

cscs-ci run distributed

github-actions (Bot) commented May 7, 2026

Mandatory Tests

Please make sure you run these tests via comment before you merge!

  • cscs-ci run default
  • cscs-ci run distributed

Optional Tests

To run benchmarks you can use:

  • cscs-ci run benchmark-bencher

To run tests and benchmarks with the DaCe backend you can use:

  • cscs-ci run dace

To run test levels ignored by the default test suite (mostly simple datatest for static fields computations) you can use:

  • cscs-ci run extra

For more detailed information please look at CI in the EXCLAIM universe.
