Add persistent gt4py cache for CSCS CI #1218
Conversation
The GitHub Actions cache for gt4py seems to work nicely. I've yet to try the CSCS CI cache.

cscs-ci run distributed

cscs-ci run distributed

cscs-ci run distributed
# TODO(msimberg): enable
# test_model_stencils_aarch64:
#   extends: [.test_model_stencils, .test_template_aarch64]
- COMPONENT: [advection, diffusion, dycore, microphysics, muphys, common, driver, standalone_driver]
  BACKEND: [embedded, dace_gpu, gtfn_cpu, gtfn_gpu]
# - COMPONENT: [advection, diffusion, dycore, microphysics, muphys, common, driver, standalone_driver]
- COMPONENT: [dycore, advection]
  # BACKEND: [embedded, dace_gpu, gtfn_cpu, gtfn_gpu]
  BACKEND: [dace_gpu, gtfn_cpu]
Also add dace_cpu, please. And in other places, like the stencil tests, please also run the dace backends. The dace backends were skipped in several places because they were too slow; instead we have a nightly dace pipeline that runs with them. We can remove that pipeline after this PR is merged.
I'll try it out, but note that the worst-case timings for the pipelines will still be as bad as before whenever the cache is cleared (currently weekly or when the uv.lock hash changes; perhaps the uv.lock hash alone could be enough, but then we risk scratch cleanup policies messing up the cache; I'm unsure what the best setup is). I think @havogt's changes to better parallelize gt4py compilation would also be very useful here.
@havogt do you think we might be able to get GridTools/gt4py#2565 and your parallelization changes into a 1.1.10 sometime soon, or does that need more time?
Edit: this is the parallelization PR I'm talking about: GridTools/gt4py#2587.
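For context, a cache key derived from the uv.lock hash could be computed roughly like this (a sketch; the variable names and directory layout are illustrative, not necessarily what this PR uses):

```bash
# Illustrative only: key the cache directory on a short hash of uv.lock,
# so the cache is reused until the locked dependencies change.
UV_LOCK_HASH="$(sha256sum uv.lock | cut -c1-16)"
export GT4PY_BUILD_CACHE_DIR="${ICON4PY_CI_GT4PY_BUILD_CACHE_BASE_DIR}/uv-lock-${UV_LOCK_HASH}"
```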
- COMPONENT: [atmosphere/diffusion, atmosphere/dycore, common]
  # TODO(msimberg): Enable dace_gpu when compilation doesn't take as long
  # or when we can cache across CI jobs.
  BACKEND: [embedded, gtfn_cpu, dace_cpu, gtfn_gpu]
# - COMPONENT: [atmosphere/diffusion, atmosphere/dycore, common]
- COMPONENT: [common]
  # BACKEND: [embedded, gtfn_cpu, dace_cpu, gtfn_gpu]
  BACKEND: [gtfn_cpu]
Note: we might still need a long time limit for dace jobs for when the cache is being populated. Hopefully the average time will be a lot better than the maximum time limit, though.
This should benefit significantly from GridTools/gt4py#2565. The number of files on scratch would then be a much smaller concern (and it might even allow per-branch caches? That still needs measurement).
philip-paul-mueller left a comment
What I am a bit concerned about is the case where you work on GT4Py, i.e. a PR that does not touch the ICON4Py stencil code but still affects it.
I know the solution is simply to disable the cache, but there should be an explanation somewhere of how to do that.
I agree. A simple solution might be to add the gt4py version to the cache key: PRs not touching gt4py will use the same cache as main, and PRs touching gt4py will get their own directory. But we need to double-check that the gt4py version is correctly resolved (i.e. with the dev suffix if it's not a release version).
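A rough sketch of what resolving the installed gt4py version for such a key could look like (illustrative; the exact key layout is an assumption, not something this PR implements):

```bash
# Illustrative only: resolve the installed gt4py version (including any dev
# suffix) via importlib.metadata and use it as part of the cache directory key.
GT4PY_VERSION="$(python -c 'import importlib.metadata as m; print(m.version("gt4py"))')"
export GT4PY_BUILD_CACHE_DIR="${ICON4PY_CI_GT4PY_BUILD_CACHE_BASE_DIR}/gt4py-${GT4PY_VERSION}"
```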
Yeah, that's a valid concern. How well do you think gt4py itself invalidates caches when it changes? The weekly fresh cache helps protect against this a little. A per-branch cache might also help a bit more? But yes, if one really wants a fresh cache then it would (at the moment) mean either disabling the scratch cache for your PR, or we could think about allowing a specific variable to disable the cache (cscs-ci run allows passing extra env vars that are forwarded to the job). Ideally this would look as close as possible to how the GitHub Actions caching works (those caches can also be deleted explicitly, manually). File count was my main concern, but it might not even be a real one.
cscs-ci run default

cscs-ci run default
if [[ "${ICON4PY_CI_DISABLE_PERSISTENT_GT4PY_CACHE:-}" == "true" ]]; then
  echo "Using non-persistent gt4py cache because ICON4PY_CI_DISABLE_PERSISTENT_GT4PY_CACHE is set"
  export GT4PY_BUILD_CACHE_LIFETIME="session"
I had forgotten that it is just a matter of triggering the pipeline with:
cscs-ci run default GT4PY_BUILD_CACHE_LIFETIME="session"
The trigger variables should take precedence, according to the documentation, so we do not need to introduce ICON4PY_CI_DISABLE_PERSISTENT_GT4PY_CACHE.
Excellent idea, that would make it much more straightforward.
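For reference, the trigger comments used later in this thread separate extra variables with semicolons, so overriding the cache lifetime would presumably look like:

```
cscs-ci run default;GT4PY_BUILD_CACHE_LIFETIME=session
```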
parallel:
  matrix:
    - COMPONENT: [diffusion, dycore, microphysics, muphys, common, driver]
      BACKEND: [gtfn_gpu]
Suggested change:
      BACKEND: [dace_gpu, gtfn_gpu]
GT4PY_BUILD_CACHE_LIFETIME: "persistent"
PYTEST_ADDOPTS: "--durations=0"
CSCS_ADDITIONAL_MOUNTS: '["/capstor/store/cscs/userlab/cwci02/icon4py/ci/testdata:$ICON4PY_TEST_DATA_PATH"]'
CSCS_ADDITIONAL_MOUNTS: '["/capstor/store/cscs/userlab/cwci02/icon4py/ci/testdata:$ICON4PY_TEST_DATA_PATH","/iopsstor/scratch/cscs/svc_cwci02_cicd_ext:$ICON4PY_CI_GT4PY_BUILD_CACHE_BASE_DIR"]'
I don't know whether it could work:
CSCS_ADDITIONAL_MOUNTS: '["/capstor/store/cscs/userlab/cwci02/icon4py/ci/testdata:$ICON4PY_TEST_DATA_PATH","/iopsstor/scratch/cscs/svc_cwci02_cicd_ext:$ICON4PY_CI_GT4PY_BUILD_CACHE_BASE_DIR"]'
CSCS_ADDITIONAL_MOUNTS: '["/capstor/store/cscs/userlab/cwci02/icon4py/ci/testdata:$ICON4PY_TEST_DATA_PATH","$SCRATCH:$ICON4PY_CI_GT4PY_BUILD_CACHE_BASE_DIR"]'
I tried this naively and it didn't work. I think the problem is that this gets interpreted at several levels (the cicd-ext system, GitLab, etc.), so putting in the explicit directory is simpler, at least for now.
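As a hypothetical debugging aid for this kind of mount problem (not something this PR adds), the job script could verify the bind mount before relying on it:

```bash
# Hypothetical check: make sure the cache base dir was actually mounted into
# the container; otherwise fall back to a session-lifetime cache instead of
# writing into an unmounted path.
if ! mountpoint -q "${ICON4PY_CI_GT4PY_BUILD_CACHE_BASE_DIR}"; then
  echo "WARNING: ${ICON4PY_CI_GT4PY_BUILD_CACHE_BASE_DIR} is not a mount point" >&2
  export GT4PY_BUILD_CACHE_LIFETIME="session"
fi
```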
The latest runs are definitely not looking very good. There's a lot of https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5125340235196978/2255149825504669/-/jobs/14200851472#L655, which seems strange to me... The most likely cause is that I've messed something up in the config, but @edopao does that look familiar to you for any reason? The worst case is that something in the concurrent cache writing still goes wrong.
I have no idea. This file should come from the |
Indeed, that's what I'd expect as well (and it's found on most pipelines...). I'll investigate.
cscs-ci run default

cscs-ci run default;GT4PY_BUILD_CACHE_LIFETIME=session;ICON4PY_CI_WIPE_GT4PY_CACHE=true

cscs-ci run default

cscs-ci run default

cscs-ci run default;GT4PY_BUILD_CACHE_LIFETIME=session;ICON4PY_CI_WIPE_GT4PY_CACHE=true

cscs-ci run default
Force-pushed 375ebec to 247f818

cscs-ci run default

Force-pushed 247f818 to 6cd8b9f

cscs-ci run default
Mandatory Tests: Please make sure you run these tests via comment before you merge!
Optional Tests: To run benchmarks you can use:
To run tests and benchmarks with the DaCe backend you can use:
To run test levels ignored by the default test suite (mostly simple datatests for static field computations) you can use:
For more detailed information please look at CI in the EXCLAIM universe.
echo "Using GT4PY_BUILD_CACHE_DIR=${GT4PY_BUILD_CACHE_DIR}"
# TODO: This is here just for debugging, probably remove?
I don't know... On the one hand, I find it useful to be able to easily wipe the build cache (which I guess needs to be done by the CI user) without relying on the scratch cleanup policy. On the other hand, we would expose a variable that deletes things from the filesystem, which introduces the risk of CI users deleting folders by mistake.
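If the wipe variable stays, a guarded version along these lines could limit the blast radius (a sketch, not necessarily what the PR implements):

```bash
# Sketch: only wipe when explicitly requested via the trigger variable, and
# refuse to run rm -rf if the cache directory variable is unset or empty.
if [[ "${ICON4PY_CI_WIPE_GT4PY_CACHE:-}" == "true" ]]; then
  echo "Wiping gt4py build cache at ${GT4PY_BUILD_CACHE_DIR}"
  rm -rf -- "${GT4PY_BUILD_CACHE_DIR:?refusing to wipe: GT4PY_BUILD_CACHE_DIR is unset}"
fi
```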
Fixes #1133.
Sets up a persistent gt4py cache for CSCS CI pipelines. There is a weekly cache shared across all jobs: each week a new cache directory is populated, keyed on the year and week number. The idea is that we occasionally trigger compilation into a fresh cache from scratch, but not in every job. The cache is shared across branches to avoid having too many files on scratch from every branch writing its own cache, as in #1220.
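A minimal sketch of how such a week-keyed cache directory could be derived (paths and variable names here are illustrative, not necessarily the exact ones used in this PR):

```bash
# Illustrative only: derive a cache directory keyed on the ISO year and week
# number, so a fresh cache is populated at most once per week.
WEEK_KEY="$(date +%G-week%V)"   # e.g. 2026-week07
export GT4PY_BUILD_CACHE_DIR="${ICON4PY_CI_GT4PY_BUILD_CACHE_BASE_DIR}/${WEEK_KEY}"
mkdir -p "${GT4PY_BUILD_CACHE_DIR}"
```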
I don't love this setup since it's slightly unconventional, but it's currently the best idea I have if we want to both 1. limit the number of files on scratch and 2. sometimes start with a fresh cache. I'd love to hear ideas for alternatives to this setup.