Add persistent gt4py cache for CSCS CI#1218

Draft
msimberg wants to merge 27 commits into
C2SM:mainfrom
msimberg:ci-persistent-cache

Conversation

@msimberg
Contributor

@msimberg msimberg commented Apr 24, 2026

Fixes #1133.

Sets up a persistent cache for the CSCS CI pipelines. There is a single weekly cache shared across all jobs: each week a new cache directory is populated, keyed on the year and week number. The idea is that compilation for a fresh cache is triggered from scratch occasionally, but not for every job. The cache is shared to avoid accumulating too many files on scratch from every branch writing its own cache, as in #1220.

I don't love this setup since it's slightly unconventional, but it's currently the best idea I have if we want to both 1. limit the number of files on scratch and 2. sometimes start with a fresh cache. I'd love to hear ideas for alternatives to this setup.
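The weekly key described above can be sketched roughly as follows (variable names are illustrative; the actual logic in ci/scripts/gt4py-cache.sh may differ):

```shell
# Derive the weekly cache directory from the ISO week-based year and week
# number. %G/%V (rather than %Y/%W) avoid an off-by-one around New Year,
# when the calendar year and the ISO week year disagree.
week_key="$(date +%G-W%V)"

# The base directory is an assumption here; the PR mounts a scratch path
# into the job for this purpose.
cache_base="${ICON4PY_CI_GT4PY_BUILD_CACHE_BASE_DIR:-/tmp/gt4py-cache}"

export GT4PY_BUILD_CACHE_DIR="${cache_base}/${week_key}"
mkdir -p "${GT4PY_BUILD_CACHE_DIR}"
echo "Using GT4PY_BUILD_CACHE_DIR=${GT4PY_BUILD_CACHE_DIR}"
```

All jobs running in the same ISO week resolve to the same directory, so only the first job of the week pays the full compilation cost.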

@msimberg msimberg changed the title Ci persistent cache Reuse gt4py cache in CI Apr 24, 2026
@msimberg
Contributor Author

The GitHub Actions cache for gt4py seems to work nicely: a first run of test-model (3.11, gtfn_cpu, dycore) without a cache took 27 minutes, while the second run with a cache took 10 minutes.

I have yet to try the CSCS CI cache.

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg msimberg changed the title Reuse gt4py cache in CI Add persistent gt4py cache for CSCS CI Apr 27, 2026
Comment thread ci/default.yml
Comment on lines +26 to +28
# TODO(msimberg): enable
# test_model_stencils_aarch64:
# extends: [.test_model_stencils, .test_template_aarch64]
Contributor Author

To do.

Comment thread ci/default.yml Outdated
Comment on lines +63 to +67
- COMPONENT: [advection, diffusion, dycore, microphysics, muphys, common, driver, standalone_driver]
BACKEND: [embedded, dace_gpu, gtfn_cpu, gtfn_gpu]
# - COMPONENT: [advection, diffusion, dycore, microphysics, muphys, common, driver, standalone_driver]
- COMPONENT: [dycore, advection]
# BACKEND: [embedded, dace_gpu, gtfn_cpu, gtfn_gpu]
BACKEND: [dace_gpu, gtfn_cpu]
Contributor Author

To do: revert.

Contributor

Please also add dace_cpu. And run the dace backends in other places too, such as the stencil tests. The dace backends were skipped in several places because they were too slow; instead we have a nightly dace pipeline that runs them. That pipeline can be removed once this PR is merged.

Contributor Author

@msimberg msimberg May 4, 2026

I'll try it out, but note that the worst-case pipeline timings will still be as bad as before whenever the cache is cleared (currently weekly, or when the uv.lock hash changes; perhaps the uv.lock hash alone could be enough, but then we risk scratch cleanup policies messing up the cache, so I'm unsure what the best setup is). I think @havogt's changes to better parallelize gt4py compilation would also be very useful here.

@havogt do you think we might be able to get GridTools/gt4py#2565 and your parallelization changes into a 1.1.10 sometime soon, or does that need more time?

Edit: this is the parallelization PR I'm talking about: GridTools/gt4py#2587.
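A key that combines the week number with the uv.lock hash, as discussed above, could look roughly like this (a sketch only; the helper name and key layout are assumptions, not what the PR currently implements):

```shell
# Build a cache key from the ISO week plus a short hash of a lock file,
# so the cache refreshes weekly and additionally whenever the locked
# dependencies change. The function name is hypothetical.
weekly_lock_key() {
    # $1: path to the lock file (e.g. uv.lock)
    local lock_hash
    lock_hash="$(sha256sum "$1" | cut -c1-12)"
    echo "$(date +%G-W%V)-${lock_hash}"
}
```

Dropping the week component and keying on the hash alone would keep caches alive longer, at the cost of the scratch-cleanup-policy risk mentioned above.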

Comment thread ci/distributed.yml Outdated
Comment on lines +94 to +101
- COMPONENT: [atmosphere/diffusion, atmosphere/dycore, common]
# TODO(msimberg): Enable dace_gpu when compilation doesn't take as long
# or when we can cache across CI jobs.
BACKEND: [embedded, gtfn_cpu, dace_cpu, gtfn_gpu]
# - COMPONENT: [atmosphere/diffusion, atmosphere/dycore, common]
- COMPONENT: [common]
# BACKEND: [embedded, gtfn_cpu, dace_cpu, gtfn_gpu]
BACKEND: [gtfn_cpu]
Contributor Author

Note that we might still need a long time limit for dace jobs while the cache is being populated. Hopefully the average time will be much better than the maximum time limit, though.

@msimberg
Contributor Author

This should benefit significantly from GridTools/gt4py#2565. The number of files on scratch would then be a much smaller concern (it might even allow per-branch caches, but that still needs measurement).

Collaborator

@philip-paul-mueller philip-paul-mueller left a comment

What I am a bit concerned about is the case where you work on GT4Py itself, i.e. your PR does not touch the ICON4Py stencil code but still affects the generated code.
I know the solution is simply to disable the cache, but there should be an explanation somewhere of how to do it.

Comment thread ci/scripts/gt4py-cache.sh Outdated
@havogt
Contributor

havogt commented Apr 30, 2026

> What I am a bit concerned with is when you work on GT4Py, i.e. if your PR does not touch the ICON4Py stencil code but affects it. I know the solution is simply to disable it, but there should be somewhere an explanation on how to do it.

I agree. A simple solution might be to add the gt4py version to the cache key: PRs not touching gt4py would use the same cache as main, while PRs touching gt4py would get their own directory. We'd need to double-check, though, that the gt4py version is correctly resolved (i.e. with the dev suffix if it's not a release version).
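Resolving the installed gt4py version, dev suffix included, could be done via importlib.metadata, e.g. (a sketch; the helper name is hypothetical):

```shell
# Build a cache-key component from the resolved version of an installed
# Python package. For gt4py, a development install reports a PEP 440 dev
# version such as "1.1.10.dev42+g1234abc" rather than the plain release
# version, so PRs that modify gt4py automatically get their own key.
gt4py_cache_key() {
    local pkg="${1:-gt4py}"
    local version
    version="$(python3 -c "import importlib.metadata as m; print(m.version('${pkg}'))")"
    echo "${pkg}-${version}"
}
```

Whether the dev suffix actually changes per commit depends on how gt4py is installed in the CI environment (e.g. an editable install from a git checkout), so that would need verifying.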

@msimberg
Contributor Author

> What I am a bit concerned with is when you work on GT4Py, i.e. if your PR does not touch the ICON4Py stencil code but affects it. I know the solution is simply to disable it, but there should be somewhere an explanation on how to do it.

Yeah, that's a valid concern. How well do you think gt4py itself will invalidate caches when it changes? The weekly fresh cache protects against this slightly, and a per-branch cache might help a bit more. But if one really wants a fresh cache, it would (at the moment) mean disabling the scratch cache for your PR; alternatively, we could allow setting a specific variable to disable the cache (cscs-ci run allows passing extra env vars that are forwarded to the job). Ideally this would look as close as possible to how the GitHub Actions caching works (those caches can also be deleted explicitly, manually). File count was my main concern, but it might not even be a real concern.

@msimberg
Contributor Author

msimberg commented May 4, 2026

cscs-ci run default

@msimberg
Contributor Author

msimberg commented May 4, 2026

cscs-ci run default

Comment thread ci/scripts/gt4py-cache.sh Outdated

if [[ "${ICON4PY_CI_DISABLE_PERSISTENT_GT4PY_CACHE:-}" == "true" ]]; then
echo "Using non-persistent gt4py cache because ICON4PY_CI_DISABLE_PERSISTENT_GT4PY_CACHE is set"
export GT4PY_BUILD_CACHE_LIFETIME="session"
Contributor

I had forgotten that it is just a matter of triggering the pipeline with:
cscs-ci run default GT4PY_BUILD_CACHE_LIFETIME="session"

The trigger variables should have precedence, according to the documentation, so we do not need to introduce ICON4PY_CI_DISABLE_PERSISTENT_GT4PY_CACHE .

Contributor Author

Excellent idea, that would make it much more straightforward.

Comment thread ci/default.yml
parallel:
matrix:
- COMPONENT: [diffusion, dycore, microphysics, muphys, common, driver]
BACKEND: [gtfn_gpu]
Contributor

Suggested change
BACKEND: [dace_gpu, gtfn_gpu]

Comment thread ci/distributed.yml
GT4PY_BUILD_CACHE_LIFETIME: "persistent"
PYTEST_ADDOPTS: "--durations=0"
CSCS_ADDITIONAL_MOUNTS: '["/capstor/store/cscs/userlab/cwci02/icon4py/ci/testdata:$ICON4PY_TEST_DATA_PATH"]'
CSCS_ADDITIONAL_MOUNTS: '["/capstor/store/cscs/userlab/cwci02/icon4py/ci/testdata:$ICON4PY_TEST_DATA_PATH","/iopsstor/scratch/cscs/svc_cwci02_cicd_ext:$ICON4PY_CI_GT4PY_BUILD_CACHE_BASE_DIR"]'
Contributor

I don't know whether this could work:

Suggested change
CSCS_ADDITIONAL_MOUNTS: '["/capstor/store/cscs/userlab/cwci02/icon4py/ci/testdata:$ICON4PY_TEST_DATA_PATH","/iopsstor/scratch/cscs/svc_cwci02_cicd_ext:$ICON4PY_CI_GT4PY_BUILD_CACHE_BASE_DIR"]'
CSCS_ADDITIONAL_MOUNTS: '["/capstor/store/cscs/userlab/cwci02/icon4py/ci/testdata:$ICON4PY_TEST_DATA_PATH","$SCRATCH:$ICON4PY_CI_GT4PY_BUILD_CACHE_BASE_DIR"]'

Contributor Author

I tried this naively and it didn't work. I think the problem is that the value gets interpreted at several levels (the CI/CD ext system, GitLab, etc.), so putting in the explicit directory is simpler, at least for now.

Comment thread ci/base.yml
@msimberg
Contributor Author

msimberg commented May 4, 2026

The latest runs are definitely not looking very good. There are a lot of failures like https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5125340235196978/2255149825504669/-/jobs/14200851472#L655:

E           ../compute_inverse_on_edges.hpp:1:10: fatal error: gridtools/fn/backend/naive.hpp: No such file or directory
E               1 | #include <gridtools/fn/backend/naive.hpp>
E                 |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

which seems strange to me...

The most likely cause is that I've messed something up in the config, but @edopao, does this look familiar to you for any reason? The worst case is that something in the concurrent cache writing still goes wrong.

@edopao
Contributor

edopao commented May 4, 2026

> The latest runs are definitely not looking very good. There's a lot of https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5125340235196978/2255149825504669/-/jobs/14200851472#L655:
>
> E           ../compute_inverse_on_edges.hpp:1:10: fatal error: gridtools/fn/backend/naive.hpp: No such file or directory
> E               1 | #include <gridtools/fn/backend/naive.hpp>
> E                 |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> which seems strange to me...
>
> Most likely cause is that I've messed something up in the config, but @edopao does that look familiar to you for any reason? Worst case is that something in the concurrent cache writing still goes wrong.

I have no idea. This file should come from the gridtools python package.

@msimberg
Contributor Author

msimberg commented May 6, 2026

> > The latest runs are definitely not looking very good. There's a lot of https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5125340235196978/2255149825504669/-/jobs/14200851472#L655:
> >
> > E           ../compute_inverse_on_edges.hpp:1:10: fatal error: gridtools/fn/backend/naive.hpp: No such file or directory
> > E               1 | #include <gridtools/fn/backend/naive.hpp>
> > E                 |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > which seems strange to me...
> > Most likely cause is that I've messed something up in the config, but @edopao does that look familiar to you for any reason? Worst case is that something in the concurrent cache writing still goes wrong.
>
> I have no idea. This file should come from the gridtools python package.

Indeed, that's what I'd expect as well (and it's found on most pipelines...). I'll investigate.

@msimberg
Contributor Author

msimberg commented May 7, 2026

cscs-ci run default

@msimberg
Contributor Author

msimberg commented May 7, 2026

cscs-ci run default;GT4PY_BUILD_CACHE_LIFETIME=session;ICON4PY_CI_WIPE_GT4PY_CACHE=true

@msimberg
Contributor Author

msimberg commented May 7, 2026

cscs-ci run default

@msimberg
Contributor Author

msimberg commented May 7, 2026

cscs-ci run default

@msimberg
Contributor Author

msimberg commented May 7, 2026

cscs-ci run default;GT4PY_BUILD_CACHE_LIFETIME=session;ICON4PY_CI_WIPE_GT4PY_CACHE=true

@msimberg
Contributor Author

msimberg commented May 7, 2026

cscs-ci run default

@msimberg msimberg force-pushed the ci-persistent-cache branch from 375ebec to 247f818 Compare May 8, 2026 13:57
@msimberg
Contributor Author

msimberg commented May 8, 2026

cscs-ci run default

@msimberg msimberg force-pushed the ci-persistent-cache branch from 247f818 to 6cd8b9f Compare May 8, 2026 14:05
@msimberg
Contributor Author

msimberg commented May 8, 2026

cscs-ci run default

@github-actions

github-actions Bot commented May 8, 2026

Mandatory Tests

Please make sure you run these tests via comment before you merge!

  • cscs-ci run default
  • cscs-ci run distributed

Optional Tests

To run benchmarks you can use:

  • cscs-ci run benchmark-bencher

To run tests and benchmarks with the DaCe backend you can use:

  • cscs-ci run dace

To run test levels ignored by the default test suite (mostly simple datatest for static fields computations) you can use:

  • cscs-ci run extra

For more detailed information please look at CI in the EXCLAIM universe.

Comment thread ci/scripts/gt4py-cache.sh

echo "Using GT4PY_BUILD_CACHE_DIR=${GT4PY_BUILD_CACHE_DIR}"

# TODO: This is here just for debugging, probably remove?
Contributor

I don't know. On the one hand, I find it useful to be able to easily wipe the build cache (which I guess needs to be done by the CI user) without relying on the scratch cleanup policy. On the other hand, we would be exposing a variable that deletes files from the filesystem, which introduces the risk of CI users deleting folders by mistake.
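One way to reduce that risk would be to guard the wipe so it only ever deletes paths under the dedicated cache base directory. A sketch (only the ICON4PY_CI_* trigger variables and GT4PY_BUILD_CACHE_DIR come from this PR; the guard itself and the function name are assumptions):

```shell
# Wipe the gt4py cache only when explicitly requested, and refuse to
# delete anything that is not strictly inside the cache base directory,
# so a mistyped or unset variable cannot remove unrelated folders.
maybe_wipe_gt4py_cache() {
    if [[ "${ICON4PY_CI_WIPE_GT4PY_CACHE:-}" != "true" ]]; then
        return 0
    fi
    case "${GT4PY_BUILD_CACHE_DIR:-}" in
        # Quoted prefix is matched literally; the unquoted /* requires at
        # least one path component below the base directory.
        "${ICON4PY_CI_GT4PY_BUILD_CACHE_BASE_DIR:?cache base dir unset}"/*)
            echo "Wiping gt4py cache at ${GT4PY_BUILD_CACHE_DIR}"
            rm -rf -- "${GT4PY_BUILD_CACHE_DIR}"
            ;;
        *)
            echo "Refusing to wipe '${GT4PY_BUILD_CACHE_DIR:-}'" >&2
            return 1
            ;;
    esac
}
```

With a guard like this, the wipe variable could stay available for CI users while making accidental deletion outside the cache directory much harder.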



Development

Successfully merging this pull request may close these issues.

Set up a persistent cache across jobs for CI
