1 - Introduction to Kubernetes operators

What are Kubernetes Operators?

Kubernetes operators are software extensions that manage both cluster and non-cluster resources on behalf of Kubernetes. The Java Operator SDK (JOSDK) makes it easy to implement Kubernetes operators in Java, with APIs designed to feel natural to Java developers and framework handling of common problems so you can focus on your business logic.

Why Use Java Operator SDK?

JOSDK provides several key advantages:

  • Java-native APIs that feel familiar to Java developers
  • Automatic handling of common operator challenges (caching, event handling, retries)
  • Production-ready features like observability, metrics, and error handling
  • Simplified development so you can focus on business logic instead of Kubernetes complexities

Learning Resources

Getting Started

Deep Dives

Tutorials

2 - Bootstrapping and samples

Creating a New Operator Project

Using the Maven Plugin

The simplest way to start a new operator project is using the provided Maven plugin, which generates a complete project skeleton:

mvn io.javaoperatorsdk:bootstrapper:[version]:create \
  -DprojectGroupId=org.acme \
  -DprojectArtifactId=getting-started

This command creates a new Maven project with:

Building Your Project

Build the generated project with Maven:

mvn clean install

The build process automatically generates the CustomResourceDefinition YAML file that you’ll need to apply to your Kubernetes cluster.

Exploring Sample Operators

The sample-operators directory contains real-world examples demonstrating different JOSDK features and patterns:

Available Samples

webpage

  • Purpose: Creates NGINX webservers from Custom Resources containing HTML code
  • Key Features: Multiple implementation approaches using both low-level APIs and higher-level abstractions
  • Good for: Understanding basic operator concepts and API usage patterns

mysql-schema

  • Purpose: Manages database schemas in MySQL instances
  • Key Features: Demonstrates managing non-Kubernetes resources (external systems)
  • Good for: Learning how to integrate with external services and manage state outside Kubernetes

tomcat

  • Purpose: Manages Tomcat instances and web applications
  • Key Features: Multiple controllers managing related custom resources
  • Good for: Understanding complex operators with multiple resource types and relationships

Running the Samples

Prerequisites

The easiest way to try samples is using a local Kubernetes cluster:

Step-by-Step Instructions

  1. Apply the CustomResourceDefinition:

    kubectl apply -f target/classes/META-INF/fabric8/[resource-name]-v1.yml
    
  2. Run the operator:

    mvn exec:java -Dexec.mainClass="your.main.ClassName"
    

    Or run your main class directly from your IDE.

  3. Create custom resources: The operator will automatically detect and reconcile custom resources when you create them:

    kubectl apply -f examples/sample-resource.yaml
    

Detailed Examples

For comprehensive setup instructions and examples, see:

Next Steps

After exploring the samples:

  1. Review the patterns and best practices guide
  2. Learn about implementing reconcilers
  3. Explore dependent resources and workflows for advanced use cases

3 - Patterns and best practices

This document describes patterns and best practices for building and running operators, and how to implement them using the Java Operator SDK (JOSDK).

See also best practices in the Operator SDK.

Implementing a Reconciler

Always Reconcile All Resources

Reconciliation can be triggered by events from multiple sources. It might be tempting to check the events and only reconcile the related resource or subset of resources that the controller manages. However, this is considered an anti-pattern for operators.

Why this is problematic:

  • Kubernetes’ distributed nature makes it difficult to ensure all events are received
  • If your operator misses some events and doesn’t reconcile the complete state, it might operate with incorrect assumptions about the cluster state
  • Always reconcile all resources, regardless of the triggering event

JOSDK makes this efficient by providing smart caches to avoid unnecessary Kubernetes API server access and ensuring your reconciler is triggered only when needed.

Since there’s industry consensus on this topic, JOSDK no longer provides event access from Reconciler implementations starting with version 2.
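
To make the risk of event-driven reconciliation concrete, here is a minimal sketch that uses only the JDK and does not touch the JOSDK API. The "cluster" is modelled as a plain map, and the class, method, and resource names are all hypothetical. It contrasts reacting only to the triggering event with reconciling everything.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model: the "cluster" is a map from resource name to its spec.
final class FullReconcileSketch {

  // Desired state derived from the primary resource (illustrative values).
  static Map<String, String> desired() {
    Map<String, String> d = new HashMap<>();
    d.put("deployment", "replicas=2");
    d.put("service", "port=80");
    d.put("configmap", "html=hello");
    return d;
  }

  // Anti-pattern: only touch the resource named by the triggering event.
  static void reconcileFromEvent(Map<String, String> cluster, String eventSource) {
    cluster.put(eventSource, desired().get(eventSource));
  }

  // Recommended: ignore what triggered the call and reconcile everything.
  static void reconcileAll(Map<String, String> cluster) {
    cluster.putAll(desired());
  }
}
```

If the deletion of the service was never observed, reconcileFromEvent(cluster, "deployment") leaves the service missing, while reconcileAll(cluster) restores it no matter which event arrived.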

Event Sources and Caching

During reconciliation, best practice is to reconcile all dependent resources managed by the controller. This means comparing the desired state with the actual cluster state.

The Challenge: Reading the actual state directly from the Kubernetes API Server every time would create significant load.

The Solution: Create a watch for dependent resources and cache their latest state using the Informer pattern. In JOSDK, informers are wrapped into EventSource to integrate with the framework’s eventing system via the InformerEventSource class.

How it works:

  • New events trigger reconciliation only when the resource is already cached
  • Reconciler implementations compare desired state with cached observed state
  • If a resource isn’t in cache, it needs to be created
  • If actual state doesn’t match desired state, the resource needs updating
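
The comparison described in the list above can be sketched as follows. This is a JDK-only illustration, not JOSDK code. The map stands in for an informer-backed cache, and the Action enum and decide method are hypothetical names introduced for the example.

```java
import java.util.Map;

final class CacheCompareSketch {
  enum Action { CREATE, UPDATE, NONE }

  // `cache` stands in for an informer-backed cache: resource name -> observed spec.
  static Action decide(Map<String, String> cache, String name, String desiredSpec) {
    String observed = cache.get(name);
    if (observed == null) {
      return Action.CREATE; // not in cache, so the resource must be created
    }
    // In cache: update only when the observed state differs from the desired state.
    return observed.equals(desiredSpec) ? Action.NONE : Action.UPDATE;
  }
}
```

Because the decision reads only from the cache, no call to the API server is needed unless a create or update is actually required.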

Idempotency

Since all resources should be reconciled when your Reconciler is triggered, and reconciliations can be triggered multiple times for any given resource (especially with retry policies), it’s crucial that Reconciler implementations be idempotent.

Idempotency means: The same observed state should always result in exactly the same outcome.

Key implications:

  • Operators should generally operate in a stateless fashion
  • Since operators usually manage declarative resources, ensuring idempotency is typically straightforward
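
As a rough illustration of idempotency, the JDK-only sketch below applies a desired state to an actual state and reports how many writes were needed. The class and method names are hypothetical and the maps are stand-ins for real resources. Running it a second time against the same state should perform no writes.

```java
import java.util.Map;

final class IdempotentReconcileSketch {

  // Applies desired entries to `actual`; returns how many writes were needed.
  static int reconcile(Map<String, String> desired, Map<String, String> actual) {
    int writes = 0;
    for (Map.Entry<String, String> entry : desired.entrySet()) {
      // Write only when the observed value differs, so repeated runs are no-ops.
      if (!entry.getValue().equals(actual.get(entry.getKey()))) {
        actual.put(entry.getKey(), entry.getValue());
        writes++;
      }
    }
    return writes;
  }
}
```

The method keeps no state between invocations. Everything it needs is derived from its inputs, which is what makes a retried or duplicated reconciliation harmless.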

Synchronous vs Asynchronous Resource Handling

Sometimes your reconciliation logic needs to wait for resources to reach their desired state (e.g., waiting for a Pod to become ready). You can approach this either synchronously or asynchronously.

Asynchronous Approach

Exit the reconciliation logic as soon as the Reconciler determines it cannot complete at this point. This frees resources to process other events.

Requirements: Set up adequate event sources to monitor state changes of all resources the operator waits for. When state changes occur, the Reconciler is triggered again and can finish processing.

Synchronous Approach

Periodically poll resources’ state until they reach the desired state. If done within the reconcile method, this blocks the current thread for potentially long periods.

Recommendation: Use the asynchronous approach for better resource utilization.
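
The difference between the two styles can be sketched with plain JDK types. Everything here is hypothetical: the AtomicBoolean stands in for the observed readiness of a Pod, and the Result enum for whatever the reconciler reports back. It is not JOSDK API.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class AsyncWaitSketch {
  enum Result { WAITING, DONE }

  // Asynchronous style: inspect the observed state once and return immediately.
  // An event source is expected to trigger another reconciliation on change.
  static Result reconcile(AtomicBoolean podReady) {
    return podReady.get() ? Result.DONE : Result.WAITING;
  }

  // Synchronous style, for contrast: blocks the calling thread while polling.
  static Result reconcileBlocking(AtomicBoolean podReady, int maxPolls, long pollMillis) {
    for (int i = 0; i < maxPolls; i++) {
      if (podReady.get()) {
        return Result.DONE;
      }
      try {
        Thread.sleep(pollMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt flag
        break;
      }
    }
    return podReady.get() ? Result.DONE : Result.WAITING;
  }
}
```

In the asynchronous variant the thread is released right away. In the blocking variant it is held for up to maxPolls * pollMillis milliseconds even though nothing useful happens in the meantime.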

Why Use Automatic Retries?

Automatic retries are enabled by default and configurable. While you can deactivate this feature, we advise against it.

Why retries are important:

  • Transient network errors: Common in Kubernetes’ distributed environment, easily resolved with retries
  • Resource conflicts: When multiple actors modify resources simultaneously, conflicts can be resolved by reconciling again
  • Transparency: Automatic retries make error handling completely transparent when successful
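
To show why retries make transient failures invisible to the caller, here is a generic retry loop with exponential backoff, written against the JDK only. It is a sketch of the general technique and not the retry implementation that JOSDK ships. The method names and parameters (maxAttempts, initialMillis, multiplier) are assumptions made for the example.

```java
import java.util.function.Supplier;

final class RetrySketch {

  // Delay before retry number `attempt` (0-based): initial * multiplier^attempt.
  static long delayMillis(long initialMillis, double multiplier, int attempt) {
    return Math.round(initialMillis * Math.pow(multiplier, attempt));
  }

  // Runs `action`, retrying on RuntimeException with growing delays in between.
  static <T> T withRetries(Supplier<T> action, int maxAttempts,
                           long initialMillis, double multiplier) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    RuntimeException last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        return action.get();
      } catch (RuntimeException e) {
        last = e;
        if (attempt < maxAttempts - 1) {
          sleep(delayMillis(initialMillis, multiplier, attempt));
        }
      }
    }
    throw last; // all attempts failed: surface the most recent error
  }

  private static void sleep(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while backing off", e);
    }
  }
}
```

An action that fails twice and then succeeds returns its value normally from withRetries, so the caller never sees the transient errors.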

Managing State

Thanks to Kubernetes resources’ declarative nature, operators dealing only with Kubernetes resources can operate statelessly. They don’t need to maintain resource state information since it should be possible to rebuild the complete resource state from its representation.

When State Management Becomes Necessary

This stateless approach typically breaks down when dealing with external resources. You might need to track external state for future reconciliations.

Anti-pattern: Putting state in the primary resource’s status sub-resource

  • Becomes difficult to manage with large amounts of state
  • Violates best practice: status should represent actual resource state, while spec represents desired state

Recommended approach: Store state in separate resources designed for this purpose:

  • Kubernetes Secret or ConfigMap
  • Dedicated Custom Resource with validated structure
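
The idea of keeping external state in a separate resource can be sketched like this. The map is a stand-in for the data section of a ConfigMap (or a dedicated custom resource), and the "external API call" is simulated with a counter. All names are hypothetical and no Kubernetes client is involved.

```java
import java.util.HashMap;
import java.util.Map;

final class ExternalStateSketch {
  // Stand-in for the data of a ConfigMap or dedicated custom resource.
  final Map<String, String> stateStore = new HashMap<>();
  // Counts simulated calls to the external system.
  int externalCreations = 0;

  // Returns the external ID for the primary, creating the external resource once.
  String reconcile(String primaryName) {
    String id = stateStore.get(primaryName);
    if (id == null) {
      id = "ext-" + (++externalCreations); // simulated external API call
      stateStore.put(primaryName, id);     // remember the ID for later reconciliations
    }
    return id;
  }
}
```

Because the ID is looked up before anything is created, a repeated reconciliation reuses the existing external resource instead of creating a duplicate, and the primary resource's status stays free of bookkeeping data.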

Handling Informer Errors and Cache Sync Timeouts

You can configure whether the operator should stop when informer errors occur on startup.

Default Behavior

By default, if there’s a startup error (e.g., the informer lacks permissions to list target resources for primary or secondary resources), the operator stops immediately.

Alternative Configuration

Set stopOnInformerErrorDuringStartup to false to start the operator even when some informers fail to start. In this case:

  • The operator continuously retries connection with exponential backoff
  • This applies both to startup failures and runtime problems
  • The operator only stops for fatal errors (currently when a resource cannot be deserialized)

Use case: When watching multiple namespaces, it’s better to start the operator so it can handle other namespaces while resolving permission issues in specific namespaces.

Cache Sync Timeout Impact

The stopOnInformerErrorDuringStartup setting affects cache sync timeout behavior:

  • If true: Operator stops on cache sync timeout
  • If false: After timeout, the controller starts reconciling resources even if some event source caches haven’t synced yet
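
The two cases above amount to a small decision table, which the following JDK-only sketch spells out. It models the documented behavior and is not how JOSDK implements it. The Outcome enum and the onCacheSyncDeadline method are hypothetical names.

```java
final class StartupPolicySketch {
  enum Outcome { STOP_OPERATOR, START_RECONCILING }

  // Decides what happens once the cache sync timeout has elapsed (or all caches synced).
  static Outcome onCacheSyncDeadline(boolean stopOnInformerErrorDuringStartup,
                                     boolean allCachesSynced) {
    if (allCachesSynced) {
      return Outcome.START_RECONCILING; // nothing went wrong, start normally
    }
    // Some caches are still unsynced: the flag decides between failing fast
    // and starting with partially populated caches.
    return stopOnInformerErrorDuringStartup
        ? Outcome.STOP_OPERATOR
        : Outcome.START_RECONCILING;
  }
}
```

With the flag set to false, reconcilers should be prepared to find a cache temporarily empty for the namespaces that have not synced yet.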

Graceful Shutdown

To give the reconciler enough time to finish in-flight reconciliations before the operator shuts down, set an appropriate duration for reconciliationTerminationTimeout using ConfigurationServiceOverrider.

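// `config` is assumed to be an existing ConfigurationService instance.
final var overridden = new ConfigurationServiceOverrider(config)
    // Allow up to 5 seconds for in-flight reconciliations to finish on shutdown.
    .withReconciliationTerminationTimeout(Duration.ofSeconds(5));

final var operator = new Operator(overridden);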