fix: increase supervisor max_restarts across all RabbitMQ-consuming services#982

Open
cchristous wants to merge 3 commits into
semaphoreio:mainfrom
cchristous:fix/increase-supervisor-max-restarts-all-services

@cchristous
Contributor

Summary

Context

During Amazon MQ weekly maintenance, RabbitMQ nodes enter maintenance mode and force-close all AMQP connections with CONNECTION_FORCED, so every consumer process crashes and is restarted at nearly the same time. With the OTP defaults of max_restarts: 3 and max_seconds: 5, the supervisor treats this burst of child restarts as a fatal condition and terminates the entire application, killing all children including healthy ones (gRPC servers, HTTP endpoints, etc.).
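The failure mode can be reproduced without RabbitMQ. The sketch below is illustrative only: `RestartBurstDemo` and `FlakyConsumer` are hypothetical names, not modules from this repository or from Tackle, and the "consumer" simply exits right after starting to imitate a force-closed connection. It reports whether a `:one_for_one` supervisor is still alive after a given number of back-to-back child crashes.

```elixir
defmodule RestartBurstDemo do
  # Hypothetical stand-in for an AMQP consumer. It exits shortly after
  # starting for its first `crashes` starts, then stays up.
  defmodule FlakyConsumer do
    use GenServer

    def start_link({counter, crashes}) do
      GenServer.start_link(__MODULE__, {counter, crashes})
    end

    @impl true
    def init({counter, crashes}) do
      # Count every (re)start in a shared atomic counter.
      :counters.add(counter, 1, 1)

      if :counters.get(counter, 1) <= crashes do
        # Start successfully, then exit, as a consumer would when its
        # connection is force-closed.
        {:ok, :doomed, {:continue, :connection_forced}}
      else
        {:ok, :connected}
      end
    end

    @impl true
    def handle_continue(:connection_forced, state) do
      {:stop, :connection_forced, state}
    end
  end

  # Returns true if the supervisor outlives `crashes` consecutive child
  # crashes, false if it gives up and terminates.
  def survives?(crashes, sup_opts \\ []) do
    task =
      Task.async(fn ->
        # Trap exits so a dying supervisor arrives as a message instead of
        # killing this process through the link.
        Process.flag(:trap_exit, true)
        counter = :counters.new(1, [])
        opts = [strategy: :one_for_one] ++ sup_opts
        {:ok, sup} = Supervisor.start_link([{FlakyConsumer, {counter, crashes}}], opts)
        await_outcome(sup, counter, crashes)
      end)

    Task.await(task)
  end

  defp await_outcome(sup, counter, crashes) do
    receive do
      {:EXIT, ^sup, _reason} -> false
    after
      20 ->
        # More starts than planned crashes means a stable child is running.
        if :counters.get(counter, 1) > crashes do
          Supervisor.stop(sup)
          true
        else
          await_outcome(sup, counter, crashes)
        end
    end
  end
end
```

Under these assumptions, `RestartBurstDemo.survives?(4)` should return `false` with the default limits, because the fourth restart inside the 5-second window exceeds `max_restarts: 3`, while `RestartBurstDemo.survives?(4, max_restarts: 1000)` should return `true`. Expect crash reports in the log while it runs.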

This was already fixed for SecretHub in #972 after it caused OIDC outages. However, the same vulnerability exists in all other RabbitMQ-consuming services. We observed this on 2026-04-15, when the stage cluster's pod restart alarm fired: github-notifier-consumer, rbac-workers, ui-cache-reactor, and secrethub all crashed simultaneously at 23:12 UTC during the Wednesday 22:00 UTC maintenance window.

Services fixed

| Service | File |
| --- | --- |
| audit | ee/audit/lib/audit/application.ex |
| gofer | ee/gofer/lib/gofer/application.ex |
| pre_flight_checks | ee/pre_flight_checks/lib/pre_flight_checks/application.ex |
| rbac (EE) | ee/rbac/lib/rbac/application.ex |
| front | front/lib/front/application.ex |
| github_notifier | github_notifier/lib/github_notifier/application.ex |
| guard | guard/lib/guard/application.ex |
| hooks_processor | hooks_processor/lib/hooks_processor/application.ex |
| notifications | notifications/lib/notifications/application.ex |
| periodic_scheduler | periodic_scheduler/scheduler/lib/scheduler/application.ex |
| plumber/block | plumber/block/lib/block/application.ex |
| plumber/ppl | plumber/ppl/lib/ppl/application.ex |
| projecthub | projecthub/lib/projecthub/application.ex |
| public-api/v2 | public-api/v2/lib/public_api/application.ex |
| rbac (CE) | rbac/ce/lib/rbac/application.ex |
| repository_hub | repository_hub/lib/repository_hub/application.ex |

Why this is safe

  • max_restarts: 1000 with the default max_seconds: 5 means the supervisor tolerates up to 1000 child restarts in a 5-second window before terminating. A RabbitMQ maintenance event causes a brief burst that's well under this, while a genuinely broken child would still be caught by Kubernetes liveness probes and Sentry error reporting.
  • This is the same value already proven in production on secrethub, badge, branch_hub, and zebra.
  • The one_for_one strategy means individual child crashes don't affect siblings — max_restarts only controls when the supervisor itself gives up.
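For reference, the shape of the change in each `application.ex` is roughly as follows. This is a minimal sketch rather than the actual diff: `ExampleService` and its placeholder child are hypothetical, and each real service keeps its own child list and supervisor name. Only the `max_restarts: 1000` option (with `max_seconds` left at its default of 5, spelled out here for clarity) comes from this PR.

```elixir
defmodule ExampleService.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # Placeholder child. A real service would list its Tackle consumers,
      # gRPC servers, HTTP endpoints, and so on.
      {Task.Supervisor, name: ExampleService.TaskSupervisor}
    ]

    opts = [
      strategy: :one_for_one,
      name: ExampleService.Supervisor,
      # Tolerate a burst of consumer restarts (e.g. during broker
      # maintenance) instead of giving up after the default 3.
      max_restarts: 1000,
      # The OTP default, shown explicitly.
      max_seconds: 5
    ]

    Supervisor.start_link(children, opts)
  end
end
```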

Test plan

  • Verify CI passes for all affected services
  • Deploy to stage environment
  • Confirm pods survive the next Amazon MQ maintenance window (Wednesday 22:00 UTC) without crashing

…suming services

Extends the fix from semaphoreio#972 (secrethub only) to all 16 remaining services
that use Tackle.Consumer or Tackle.Multiconsumer. During Amazon MQ
maintenance windows, RabbitMQ force-closes all AMQP connections with
CONNECTION_FORCED. With the default max_restarts of 3 (in 5 seconds),
the OTP supervisor terminates the entire application instead of allowing
workers to reconnect. This causes unnecessary pod crashes and restarts.

Services fixed: audit, gofer, pre_flight_checks, rbac (CE + EE), front,
github_notifier, guard, hooks_processor, notifications,
periodic_scheduler, plumber/ppl, plumber/block, projecthub, public-api,
repository_hub.
@github-project-automation github-project-automation Bot moved this to Backlog in Roadmap Apr 16, 2026
@cchristous cchristous marked this pull request as ready for review April 16, 2026 00:21
Contributor

@loadez loadez left a comment

LGTM, but please rebase because there are some CVEs that are probably already fixed on main.

@cchristous
Contributor Author

> LGTM but please rebase because there are some cves that are probably fixed on main already

Done
