In Kubernetes, you run your workloads in a set of Pods. Pods are groups of one or more containers, with shared storage and network resources. Pods are defined by a Kubernetes specification.

A Job creates one or more Pods and continually tries to run them until a specified number of Pods terminate successfully. As Pods complete, the Job tracks the successful completions. When the specified number of successful completions is reached, the Job is complete.
The following recommendations are key when designing and managing Jobs:
**Choose a Job completion mode**

Specify the completion mode as `Indexed`. This configuration is useful when you assign a partition of the data to process based on the Pod's index. The Pods of a Job get an associated completion index. Deleting a Job cleans up the Pods it created. Suspending a Job deletes its active Pods until the Job is resumed.
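An Indexed Job of this kind can be sketched as follows; the Job name, image, and command are illustrative. Each Pod reads its partition number from the `JOB_COMPLETION_INDEX` environment variable that Kubernetes injects into Pods of Indexed Jobs:

```yaml
# Sketch of an Indexed Job: three Pods, each assigned a unique
# completion index (0-2) that it can map to a data partition.
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-job   # illustrative name
spec:
  completions: 3
  parallelism: 3
  completionMode: Indexed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox   # illustrative image
        command: ["sh", "-c", "echo processing partition $JOB_COMPLETION_INDEX"]
```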
**Set a CronJob for regularly scheduled actions**

Use a CronJob in GKE to perform regularly scheduled actions, such as backups, report generation, or scheduled training of machine learning models.
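A regularly scheduled action of this kind might look like the following sketch, assuming a hypothetical nightly backup container; the schedule uses standard cron syntax (here, 02:00 UTC every day):

```yaml
# Sketch of a CronJob that runs a backup task once a day.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup   # illustrative name
spec:
  schedule: "0 2 * * *"  # minute hour day-of-month month day-of-week
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: busybox   # illustrative image
            command: ["sh", "-c", "echo running backup"]
```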
**Manage failures in a Job**

Define a Kubernetes Pod failure policy and a Pod backoff failure limit to handle retriable and non-retriable failures in a Job. This definition avoids unnecessary Pod retries and Job failures caused by Pod disruptions, which reduces cluster resource consumption. For example, you can account for preemption, API-initiated eviction, or taint-based eviction, where Pods that have no toleration for the `NoExecute` taint effect are evicted. Learn how to handle retriable and non-retriable Pod failures with a Pod failure policy.
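As a sketch, the following Job combines a backoff limit with a Pod failure policy: a known-fatal exit code fails the Job immediately without retries, while Pod disruptions (such as preemption or taint-based eviction) don't count against the retry budget. The exit code, container name, and image are illustrative:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-with-pod-failure-policy   # illustrative name
spec:
  completions: 5
  backoffLimit: 3        # retry budget for retriable failures
  podFailurePolicy:
    rules:
    # Non-retriable: fail the whole Job if the app exits with code 42.
    - action: FailJob
      onExitCodes:
        containerName: main
        operator: In
        values: [42]
    # Retriable: ignore disruptions so they don't consume backoffLimit.
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never   # required when using a Pod failure policy
      containers:
      - name: main
        image: busybox   # illustrative image
        command: ["sh", "-c", "exit 0"]
```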
**Manage multiple Jobs as a unit**

Use the JobSet API to manage multiple Jobs as a unit and address workload patterns such as one driver (or coordinator) and multiple workers (for example, MPIJob), while setting Job defaults that follow common patterns for your use case. For example, you can create Indexed Jobs by default, create a headless Service for predictable fully qualified domain names (FQDNs) of the Pods, and set an associated Pod failure policy.
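A driver-plus-workers pattern might be sketched with JobSet as follows, assuming the JobSet CRD (`jobset.x-k8s.io/v1alpha2`) is installed in the cluster; all names and images are illustrative:

```yaml
# Sketch of a JobSet with one driver Job and one Indexed workers Job,
# managed and garbage-collected as a single unit.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: driver-workers   # illustrative name
spec:
  replicatedJobs:
  - name: driver
    replicas: 1
    template:
      spec:
        completions: 1
        parallelism: 1
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: driver
              image: busybox   # illustrative image
              command: ["sh", "-c", "echo coordinating"]
  - name: workers
    replicas: 1
    template:
      spec:
        completions: 4
        parallelism: 4
        completionMode: Indexed   # predictable per-worker identity
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: busybox   # illustrative image
              command: ["sh", "-c", "echo working"]
```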
**Extend the runtime of Pods that don't tolerate restarts**

Set the Kubernetes `cluster-autoscaler.kubernetes.io/safe-to-evict` annotation to `false` in the Pod specification. The cluster autoscaler respects the eviction rules set on Pods. This restriction prevents the autoscaler from deleting a node that contains a Pod with the `cluster-autoscaler.kubernetes.io/safe-to-evict` annotation.
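The annotation is set on the Pod template (so it lands on the Pods themselves), as in this minimal sketch; the Job name and image are illustrative:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: long-running-job   # illustrative name
spec:
  template:
    metadata:
      annotations:
        # Tell the cluster autoscaler not to evict this Pod on scale-down.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox   # illustrative image
        command: ["sh", "-c", "echo long task"]
```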
# Best practices for running batch workloads on GKE

Autopilot | Standard

***

This page introduces the best practices for building and optimizing batch processing platforms with Google Kubernetes Engine (GKE), including best practices for:

- Architecture
- Job management
- Multi-tenancy
- Security
- Queueing
- Storage
- Performance
- Cost efficiency
- Monitoring

GKE provides a powerful framework for orchestrating batch workloads such as data processing, [training machine learning models](/blog/products/ai-machine-learning/build-a-ml-platform-with-kubeflow-and-ray-on-gke), [running scientific simulations](/blog/products/containers-kubernetes/gke-gpu-sharing-helps-scientists-quest-for-neutrinos), and other [high performance computing workloads](https://www.pgs.com/company/newsroom/news/industry-insights--hpc-in-the-cloud/).

These best practices are intended for platform administrators, cloud architects, and operations professionals interested in deploying batch workloads in GKE. The [Reference Architecture: Batch Processing Platform on GKE](https://github.com/ai-on-gke/batch-reference-architecture) showcases many of the best practices discussed in this guide, and can be deployed in your own Google Cloud project.

How batch workloads work
------------------------

A batch workload is a group of tasks that run to completion without user intervention. To define tasks, you use the Kubernetes [Jobs](https://kubernetes.io/docs/concepts/workloads/controllers/job) resource. A batch platform receives the Jobs and queues them in the order they are received. The queue in the batch platform applies processing logic such as priority, quota, and allocatable resources. By queueing and customizing the batch processing parameters, Kubernetes lets you optimize the use of available resources, minimize the idle time for scheduled Jobs, and maximize cost savings. The following diagram shows the GKE components that can be part of a batch platform.

[Diagram: GKE components of a batch platform](/static/kubernetes-engine/images/batch-process.svg)

Batch platform management
-------------------------

Traditionally, batch platforms have two main user personas: developers and platform administrators.

- A developer submits a Job that specifies the program, the data to be processed, and the requirements for the Job. Then, the developer receives confirmation of the Job submission and a unique identifier. When the Job is complete, the developer gets a notification along with any output or results of the Job.
- A platform administrator manages and delivers an efficient and reliable batch processing platform to the developers.

A batch processing platform must meet the following requirements:

- The platform resources are properly provisioned to ensure that Jobs run with little to no user intervention required.
- The platform resources are configured according to the organization's security and observability best practices.
- The platform resources are used as efficiently as possible. In case of resource contention, the most important work gets done first.

### Prepare the batch platform architecture in GKE

A GKE environment consists of nodes, which are Compute Engine virtual machines (VMs), grouped together to form a cluster.

The following table lists the key recommendations when planning and designing your batch platform architecture:

### Manage the Job lifecycle

In Kubernetes, you run your workloads in a set of [*Pods*](https://kubernetes.io/docs/concepts/workloads/pods/). Pods are groups of one or more containers, with shared storage and network resources. Pods are defined by a Kubernetes specification.

A Job creates one or more Pods and continually tries to run them until a specified number of Pods terminate successfully. As Pods complete, the Job tracks the successful completions. When the specified number of successful completions is reached, the Job is complete.

The following table lists the key recommendations when designing and managing Jobs:

Manage multi-tenancy
--------------------

GKE cluster [multi-tenancy](/kubernetes-engine/docs/concepts/multitenancy-overview) is an approach to managing GKE resources that are shared by different users or workloads, known as *tenants*, within a single organization. The management of GKE resources might follow criteria such as tenant isolation, [quotas and limit ranges](/kubernetes-engine/quotas), or cost allocation.

[Diagram: enterprise multi-tenancy in GKE](/static/kubernetes-engine/images/enterprise-multitenancy.svg)

The following table lists the key recommendations when managing multi-tenancy:

### Control access to the batch platform

GKE lets you finely tune the access permissions of the workloads running on the cluster.

The following table lists the key recommendations when managing access and security:

### Queueing and fair sharing

To control resource consumption, you can assign resource quota limits for each tenant, queue incoming Jobs, and process Jobs in the order they were received.

The following table lists the key recommendations when managing queueing and fair sharing among batch workloads:

### Optimize storage, performance, and cost efficiency

The efficient use of your GKE compute and [storage](/kubernetes-engine/docs/concepts/storage-overview) resources can reduce costs. One strategy is to right-size and configure your compute instances to match your batch processing needs without sacrificing performance.

The following table lists the key recommendations when designing and managing storage and optimizing performance:

### Monitor clusters

GKE is integrated with observability and logging tools that help you monitor the reliability and efficiency of your cluster. The following table lists the key recommendations when enabling and using GKE observability tools:

What's next
-----------

- Learn how to [Deploy a batch system using Kueue](/kubernetes-engine/docs/tutorials/kueue-intro)
- See the [Best practices for running cost-optimized Kubernetes applications on GKE](/architecture/best-practices-for-running-cost-effective-kubernetes-applications-on-gke#make_sure_your_application_can_grow_vertically_and_horizontally)

Last updated (UTC): 2025-09-03.