Add persistent gt4py cache for CSCS CI #1218
Conversation
The GitHub Actions cache for gt4py seems to work nicely. I've yet to try the CSCS CI cache.

cscs-ci run distributed

cscs-ci run distributed

cscs-ci run distributed
# TODO(msimberg): enable
# test_model_stencils_aarch64:
#   extends: [.test_model_stencils, .test_template_aarch64]
- COMPONENT: [advection, diffusion, dycore, microphysics, muphys, common, driver, standalone_driver]
  BACKEND: [embedded, dace_gpu, gtfn_cpu, gtfn_gpu]
# - COMPONENT: [advection, diffusion, dycore, microphysics, muphys, common, driver, standalone_driver]
- COMPONENT: [dycore, advection]
  # BACKEND: [embedded, dace_gpu, gtfn_cpu, gtfn_gpu]
  BACKEND: [dace_gpu, gtfn_cpu]
Also add dace_cpu, please. And in other places, like the stencil tests, please also run the dace backends. The dace backends were skipped in several places because they were too slow; instead we have a nightly dace pipeline that runs with them. We can remove that pipeline after this PR is merged.
I'll try it out, but note that the worst-case timings for the pipelines will still be as bad as before whenever the cache is cleared (currently weekly or when the uv.lock hash changes; perhaps the uv.lock hash alone could be enough, but then we risk scratch cleanup policies messing up the cache; I'm unsure what the best setup is). I think @havogt's changes to better parallelize gt4py compilation would also be very useful here.
@havogt do you think we might be able to get GridTools/gt4py#2565 and your parallelization changes into a 1.1.10 sometime soon, or does that need more time?
Edit: this is the parallelization PR I'm talking about: GridTools/gt4py#2587.
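For context, a cache key derived from the uv.lock hash could be computed roughly like this (a sketch; the variable names and directory layout are illustrative, not necessarily what this PR uses):

```bash
# Illustrative only: key the cache directory on a short hash of uv.lock,
# so the cache is reused until the locked dependencies change.
UV_LOCK_HASH="$(sha256sum uv.lock | cut -c1-16)"
export GT4PY_BUILD_CACHE_DIR="${ICON4PY_CI_GT4PY_BUILD_CACHE_BASE_DIR}/uv-lock-${UV_LOCK_HASH}"
```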
- COMPONENT: [atmosphere/diffusion, atmosphere/dycore, common]
  # TODO(msimberg): Enable dace_gpu when compilation doesn't take as long
  # or when we can cache across CI jobs.
  BACKEND: [embedded, gtfn_cpu, dace_cpu, gtfn_gpu]
# - COMPONENT: [atmosphere/diffusion, atmosphere/dycore, common]
- COMPONENT: [common]
  # BACKEND: [embedded, gtfn_cpu, dace_cpu, gtfn_gpu]
  BACKEND: [gtfn_cpu]
Note: we might still need a long time limit for dace jobs for when the cache is being populated. Hopefully the average time will be a lot better than the maximum time limit, though.
This should benefit significantly from GridTools/gt4py#2565. The number of files on scratch would then be a much smaller concern (and it might even allow per-branch caches? That still needs measurement).
philip-paul-mueller left a comment
What I am a bit concerned about is the case where you work on GT4Py, i.e. a PR that does not touch the ICON4Py stencil code but still affects it.
I know the solution is simply to disable the cache, but there should be an explanation somewhere of how to do that.
I agree. A simple solution might be to add the gt4py version to the cache key: PRs not touching gt4py will use the same cache as main, and PRs touching gt4py will get their own directory. But we need to double-check that the gt4py version is correctly resolved (i.e. with the dev suffix if it's not a release version).
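A rough sketch of what resolving the installed gt4py version for such a key could look like (illustrative; the exact key layout is an assumption, not something this PR implements):

```bash
# Illustrative only: resolve the installed gt4py version (including any dev
# suffix) via importlib.metadata and use it as part of the cache directory key.
GT4PY_VERSION="$(python -c 'import importlib.metadata as m; print(m.version("gt4py"))')"
export GT4PY_BUILD_CACHE_DIR="${ICON4PY_CI_GT4PY_BUILD_CACHE_BASE_DIR}/gt4py-${GT4PY_VERSION}"
```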
Yeah, that's a valid concern. How well do you think gt4py itself invalidates caches when it changes? The weekly fresh cache helps protect against this a little. A per-branch cache might also help a bit more? But yes, if one really wants a fresh cache then it would (at the moment) mean either disabling the scratch cache for your PR, or we could think about allowing a specific variable to disable the cache (cscs-ci run allows passing extra env vars that are forwarded to the job). Ideally this would look as close as possible to how the GitHub Actions caching works (those caches can also be deleted explicitly, manually). File count was my main concern, but it might not even be a real one.
cscs-ci run default

cscs-ci run default
if [[ "${ICON4PY_CI_DISABLE_PERSISTENT_GT4PY_CACHE:-}" == "true" ]]; then
  echo "Using non-persistent gt4py cache because ICON4PY_CI_DISABLE_PERSISTENT_GT4PY_CACHE is set"
  export GT4PY_BUILD_CACHE_LIFETIME="session"
I had forgotten that it is just a matter of triggering the pipeline with:
cscs-ci run default GT4PY_BUILD_CACHE_LIFETIME="session"
The trigger variables should take precedence, according to the documentation, so we do not need to introduce ICON4PY_CI_DISABLE_PERSISTENT_GT4PY_CACHE.
Excellent idea, that would make it much more straightforward.
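For reference, the trigger comments used later in this thread separate extra variables with semicolons, so overriding the cache lifetime would presumably look like:

```
cscs-ci run default;GT4PY_BUILD_CACHE_LIFETIME=session
```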
parallel:
  matrix:
    - COMPONENT: [diffusion, dycore, microphysics, muphys, common, driver]
      BACKEND: [gtfn_gpu]
Suggested change:
      BACKEND: [dace_gpu, gtfn_gpu]
GT4PY_BUILD_CACHE_LIFETIME: "persistent"
PYTEST_ADDOPTS: "--durations=0"
CSCS_ADDITIONAL_MOUNTS: '["/capstor/store/cscs/userlab/cwci02/icon4py/ci/testdata:$ICON4PY_TEST_DATA_PATH"]'
CSCS_ADDITIONAL_MOUNTS: '["/capstor/store/cscs/userlab/cwci02/icon4py/ci/testdata:$ICON4PY_TEST_DATA_PATH","/iopsstor/scratch/cscs/svc_cwci02_cicd_ext:$ICON4PY_CI_GT4PY_BUILD_CACHE_BASE_DIR"]'
I don't know whether it could work:
CSCS_ADDITIONAL_MOUNTS: '["/capstor/store/cscs/userlab/cwci02/icon4py/ci/testdata:$ICON4PY_TEST_DATA_PATH","/iopsstor/scratch/cscs/svc_cwci02_cicd_ext:$ICON4PY_CI_GT4PY_BUILD_CACHE_BASE_DIR"]'
CSCS_ADDITIONAL_MOUNTS: '["/capstor/store/cscs/userlab/cwci02/icon4py/ci/testdata:$ICON4PY_TEST_DATA_PATH","$SCRATCH:$ICON4PY_CI_GT4PY_BUILD_CACHE_BASE_DIR"]'
I tried this naively and it didn't work. I think the problem is that this gets interpreted at several levels (the cicd-ext system, GitLab, etc.), so putting in the explicit directory is simpler, at least for now.
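As a hypothetical debugging aid for this kind of mount problem (not something this PR adds), the job script could verify the bind mount before relying on it:

```bash
# Hypothetical check: make sure the cache base dir was actually mounted into
# the container; otherwise fall back to a session-lifetime cache instead of
# writing into an unmounted path.
if ! mountpoint -q "${ICON4PY_CI_GT4PY_BUILD_CACHE_BASE_DIR}"; then
  echo "WARNING: ${ICON4PY_CI_GT4PY_BUILD_CACHE_BASE_DIR} is not a mount point" >&2
  export GT4PY_BUILD_CACHE_LIFETIME="session"
fi
```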
The latest runs are definitely not looking very good. There's a lot of https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5125340235196978/2255149825504669/-/jobs/14200851472#L655, which seems strange to me... The most likely cause is that I've messed something up in the config, but @edopao does that look familiar to you for any reason? The worst case is that something in the concurrent cache writing still goes wrong.
I have no idea. This file should come from the |
Indeed, that's what I'd expect as well (and it's found on most pipelines...). I'll investigate.
cscs-ci run default

cscs-ci run default;GT4PY_BUILD_CACHE_LIFETIME=session;ICON4PY_CI_WIPE_GT4PY_CACHE=true

cscs-ci run default

cscs-ci run default

cscs-ci run default;GT4PY_BUILD_CACHE_LIFETIME=session;ICON4PY_CI_WIPE_GT4PY_CACHE=true

cscs-ci run default
Force-pushed 375ebec to 247f818

cscs-ci run default

Force-pushed 247f818 to 6cd8b9f

cscs-ci run default
Mandatory Tests: Please make sure you run these tests via comment before you merge!
Optional Tests: To run benchmarks you can use:
To run tests and benchmarks with the DaCe backend you can use:
To run test levels ignored by the default test suite (mostly simple datatests for static field computations) you can use:
For more detailed information please look at CI in the EXCLAIM universe.
echo "Using GT4PY_BUILD_CACHE_DIR=${GT4PY_BUILD_CACHE_DIR}"
# TODO: This is here just for debugging, probably remove?
I don't know... On the one hand, I find it useful to be able to easily wipe the build cache (which I guess needs to be done by the CI user) without relying on the scratch cleanup policy. On the other hand, we would expose a variable that deletes things from the filesystem, which introduces the risk of CI users deleting folders by mistake.
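If the wipe variable stays, a guarded version along these lines could limit the blast radius (a sketch, not necessarily what the PR implements):

```bash
# Sketch: only wipe when explicitly requested via the trigger variable, and
# refuse to run rm -rf if the cache directory variable is unset or empty.
if [[ "${ICON4PY_CI_WIPE_GT4PY_CACHE:-}" == "true" ]]; then
  echo "Wiping gt4py build cache at ${GT4PY_BUILD_CACHE_DIR}"
  rm -rf -- "${GT4PY_BUILD_CACHE_DIR:?refusing to wipe: GT4PY_BUILD_CACHE_DIR is unset}"
fi
```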
Fixes #1133.
Sets up a persistent gt4py cache for CSCS CI pipelines. There is a weekly cache shared across all jobs: each week a new cache directory is populated, keyed on the year and week number. The idea is that we occasionally trigger compilation into a fresh cache from scratch, but not in every job. The cache is shared across branches to avoid having too many files on scratch from every branch writing its own cache, as in #1220.
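A minimal sketch of how such a week-keyed cache directory could be derived (paths and variable names here are illustrative, not necessarily the exact ones used in this PR):

```bash
# Illustrative only: derive a cache directory keyed on the ISO year and week
# number, so a fresh cache is populated at most once per week.
WEEK_KEY="$(date +%G-week%V)"   # e.g. 2026-week07
export GT4PY_BUILD_CACHE_DIR="${ICON4PY_CI_GT4PY_BUILD_CACHE_BASE_DIR}/${WEEK_KEY}"
mkdir -p "${GT4PY_BUILD_CACHE_DIR}"
```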
I don't love this setup since it's slightly unconventional, but it's currently the best idea I have if we want to both 1. limit the number of files on scratch and 2. sometimes start with a fresh cache. I'd love to hear ideas for alternatives to this setup.