AI Hypercomputer logo

Train, tune, and serve on an AI supercomputer

AI Hypercomputer is the integrated supercomputing system underneath every AI workload on Google Cloud. It is made up of hardware, software, and consumption models designed to simplify AI deployment, improve system-level efficiency, and optimize costs.

Overview

AI-optimized hardware

Choose from compute, storage, and networking options optimized for granular, workload-level objectives, whether that's higher throughput, lower latency, faster time-to-results, or lower TCO. Learn more about: Google Cloud TPU, Google Cloud GPU, plus the latest in storage and networking.

Leading software, open frameworks

Get more from your hardware with industry-leading software, integrated with open frameworks, libraries, and compilers to make AI development, integration, and management more efficient.

Flexible consumption models

Flexible consumption options let you choose between fixed costs with committed use discounts and dynamic on-demand models to meet your business needs. Dynamic Workload Scheduler and Spot VMs can help you get the capacity you need without overallocating. Plus, Google Cloud's cost optimization tools help automate resource utilization to reduce manual tasks for engineers.
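As a small illustration, here is a sketch of requesting Spot capacity programmatically with the google-cloud-compute Python client. The project, zone, machine type, and image below are hypothetical placeholders; a real workload would typically also attach accelerators and handle preemption.

```python
from google.cloud import compute_v1

# Hypothetical placeholders for illustration.
PROJECT = "my-project"
ZONE = "us-central1-a"

# A Spot VM uses the same machine shapes as on-demand capacity,
# but can be reclaimed by Compute Engine in exchange for a deep discount.
instance = compute_v1.Instance(
    name="spot-batch-worker",
    machine_type=f"zones/{ZONE}/machineTypes/n2-standard-8",
    scheduling=compute_v1.Scheduling(
        provisioning_model="SPOT",           # request Spot rather than standard capacity
        instance_termination_action="STOP",  # what to do when the VM is reclaimed
    ),
    disks=[
        compute_v1.AttachedDisk(
            boot=True,
            auto_delete=True,
            initialize_params=compute_v1.AttachedDiskInitializeParams(
                source_image="projects/debian-cloud/global/images/family/debian-12",
            ),
        )
    ],
    network_interfaces=[compute_v1.NetworkInterface(network="global/networks/default")],
)

operation = compute_v1.InstancesClient().insert(
    project=PROJECT, zone=ZONE, instance_resource=instance
)
print(operation.status)
```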

How It Works

Google is a leader in artificial intelligence, having invented technologies like TensorFlow. Did you know you can use this same technology for your own projects? Learn about Google's history of innovation in AI infrastructure and how you can leverage it for your workloads.

Google Cloud AI Hypercomputer architecture diagram

Common Uses

Run large-scale AI training and pre-training

Powerful, scalable, and efficient AI training

Training workloads need to run as highly synchronized jobs across thousands of nodes in tightly coupled clusters. A single degraded node can disrupt an entire job, delaying time-to-market. You need to:

  • Ensure the cluster is set up quickly and tuned for the workload in question
  • Predict failures and troubleshoot them quickly
  • Continue a workload even when failures do happen (see the checkpointing sketch after this list)

We want to make it extremely easy for customers to deploy and scale training workloads on Google Cloud.
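One part of continuing through failures is frequent checkpointing, so an interrupted job can resume from its last good step rather than restarting from scratch. Below is a minimal, framework-agnostic sketch; the checkpoint directory (for example, a Cloud Storage bucket mounted into the job) is an assumption.

```python
import os
import pickle

# Assumed path, e.g. a Cloud Storage bucket mounted into the job.
CKPT_DIR = "/mnt/checkpoints"

def save_checkpoint(step, state):
    """Write state atomically so a restart never sees a half-written file."""
    tmp = os.path.join(CKPT_DIR, f"step_{step:08d}.tmp")
    final = os.path.join(CKPT_DIR, f"step_{step:08d}.ckpt")
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp, final)

def latest_checkpoint():
    """Return the newest checkpoint dict, or None when starting fresh."""
    ckpts = sorted(f for f in os.listdir(CKPT_DIR) if f.endswith(".ckpt"))
    if not ckpts:
        return None
    with open(os.path.join(CKPT_DIR, ckpts[-1]), "rb") as f:
        return pickle.load(f)

# Resume from the last good step if a previous attempt was interrupted.
resume = latest_checkpoint()
state = resume["state"] if resume else {"weights": None}
start_step = resume["step"] + 1 if resume else 0

for step in range(start_step, 10_000):
    # ... run one training step, updating `state` ...
    if step % 500 == 0:
        save_checkpoint(step, state)
```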


To create an AI cluster, get started with one of our tutorials.
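As a rough sketch of what those tutorials walk through, the snippet below creates a small GKE cluster with an attached GPU accelerator using the google-cloud-container Python client. The project, location, machine type, and accelerator type are hypothetical placeholders, and production clusters usually add separate node pools, autoscaling, and networking settings.

```python
from google.cloud import container_v1

# Hypothetical placeholders for illustration.
PARENT = "projects/my-project/locations/us-central1-a"

cluster = container_v1.Cluster(
    name="ai-training-cluster",
    initial_node_count=2,
    node_config=container_v1.NodeConfig(
        machine_type="g2-standard-8",
        accelerators=[
            container_v1.AcceleratorConfig(
                accelerator_count=1,
                accelerator_type="nvidia-l4",
            )
        ],
    ),
)

client = container_v1.ClusterManagerClient()
operation = client.create_cluster(
    request=container_v1.CreateClusterRequest(parent=PARENT, cluster=cluster)
)
print(operation.name)  # long-running operation; poll it until the cluster is ready
```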

Character.AI leverages Google Cloud to scale up

      "We need GPUs to generate responses to users' messages. And as we get more users on our platform, we need more GPUs to serve them. So on Google Cloud, we can experiment to find what is the right platform for a particular workload. It's great to have that flexibility to choose which solutions are most valuable." Myle Ott, Founding Engineer, Character.AI

      Deploy and orchestrate AI applications

Leverage leading AI orchestration software and open frameworks to deliver AI-powered experiences

      Google Cloud provides images that contain common operating systems, frameworks, libraries, and drivers. AI Hypercomputer optimizes these pre-configured images to support your AI workloads.

• AI and ML frameworks and libraries: Use Deep Learning Software Layer (DLSL) Docker images to run ML models such as NeMo and MaxText on a Google Kubernetes Engine (GKE) cluster.
• Cluster deployment and AI orchestration: You can deploy your AI workloads on GKE clusters, Slurm clusters, or Compute Engine instances. For more information, see the VM and cluster creation overview; a minimal job-submission sketch follows this list.
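For instance, once a GKE cluster exists, a containerized training job that requests a GPU can be submitted with the standard Kubernetes Python client. The image, command, and resource names below are hypothetical placeholders.

```python
from kubernetes import client, config

# Assumes cluster credentials are already configured,
# e.g. via `gcloud container clusters get-credentials`.
config.load_kube_config()

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="gpu-training-job"),
    spec=client.V1JobSpec(
        backoff_limit=2,  # retry a couple of times if a node fails
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="trainer",
                        # Placeholder image; substitute the framework or DLSL image you use.
                        image="us-docker.pkg.dev/my-project/repo/trainer:latest",
                        command=["python", "train.py"],
                        resources=client.V1ResourceRequirements(
                            limits={"nvidia.com/gpu": "1"}  # request one GPU
                        ),
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```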


      Explore software resources

      Priceline: Helping travelers curate unique experiences

      "Working with Google Cloud to incorporate generative AI allows us to create a bespoke travel concierge within our chatbot. We want our customers to go beyond planning a trip and help them curate their unique travel experience." Martin Brodbeck, CTO, Priceline


      Cost-effectively serve models at scale

      Maximize price-performance and reliability for inference workloads

      Inference is quickly becoming more diverse and complex, evolving in three main areas:

      • First, how we interact with AI is changing. Conversations now have much longer and more diverse context.
      • Second, sophisticated reasoning and multi-step inference are making Mixture-of-Experts (MoE) models more common. This is redefining how memory and compute scale from initial input to final output.
• Finally, it's clear that the real value isn't just about raw tokens per dollar, but the usefulness of the response. Does the model have the right expertise? Did it answer a critical business question correctly? That's why we believe customers need better measurements, focusing on the total cost of system operations, not the price of their processors. (An illustrative calculation follows this list.)
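As a toy illustration of that last point, the numbers below are invented, but they show how a metric like cost per useful response can diverge from raw token price once system costs and answer quality are included.

```python
# Invented numbers for illustration only; not pricing or benchmark data.
tokens_per_response = 1_500
price_per_million_tokens = 0.60      # USD (hypothetical)
responses_per_hour = 20_000
system_cost_per_hour = 35.00         # orchestration, storage, networking, ops (hypothetical)
useful_fraction = 0.85               # share of responses judged correct/useful

token_cost_per_hour = responses_per_hour * tokens_per_response * price_per_million_tokens / 1e6
total_cost_per_hour = token_cost_per_hour + system_cost_per_hour

cost_per_response = total_cost_per_hour / responses_per_hour
cost_per_useful_response = total_cost_per_hour / (responses_per_hour * useful_fraction)

print(f"cost per response:        ${cost_per_response:.4f}")
print(f"cost per useful response: ${cost_per_useful_response:.4f}")
```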


      Explore AI Inference resources

AssemblyAI leverages Google Cloud for cost efficiency

        "Our experimental results show that Cloud TPU v5e is the most cost-efficient accelerator on which to run large-scale inference for our model. It delivers 2.7x greater performance per dollar than G2 and 4.2x greater performance per dollar than A2 instances." Domenic Donato,

        VP of Technology, AssemblyAI



        Open source models on Google Cloud

        Serve a model with GKE on a single GPU

        Train common models with GPUs

        Scale model serving to multiple GPUs

        Serve an LLM using multi-host TPUs on GKE with Saxml

Train at scale with the NVIDIA NeMo framework

        FAQ

        Is AI Hypercomputer the easiest way to get started with AI Workloads on Google Cloud?

For most customers, a managed AI platform like Vertex AI is the easiest way to get started with AI because it has all of the tools, templates, and models built in, and it is powered by AI Hypercomputer under the hood in a way that is optimized on your behalf. If you prefer to configure and optimize every component of your infrastructure, you can access AI Hypercomputer's components directly as infrastructure and assemble them in a way that meets your needs.

How is AI Hypercomputer different from using individual Google Cloud services on their own?

While individual services offer specific capabilities, AI Hypercomputer provides an integrated system where hardware, software, and consumption models are designed to work optimally together. This integration delivers system-level efficiencies in performance, cost, and time-to-market that are harder to achieve by stitching together disparate services. It simplifies complexity and provides a holistic approach to AI infrastructure.



Can I use AI Hypercomputer as part of a hybrid or multi-cloud strategy?

Yes, AI Hypercomputer is designed with flexibility in mind. Technologies like Cross-Cloud Interconnect provide high-bandwidth connectivity to on-premises data centers and other clouds, facilitating hybrid and multi-cloud AI strategies. We operate with open standards and integrate popular third-party software to enable you to build solutions that span multiple environments and change services as you please.

How does AI Hypercomputer handle security?

Security is a core aspect of AI Hypercomputer. It benefits from Google Cloud's multi-layered security model. Specific features include Titan security microcontrollers (ensuring systems boot from a trusted state), RDMA Firewall (for zero-trust networking between TPUs/GPUs during training), and integration with solutions like Model Armor for AI safety. These are complemented by robust infrastructure security policies and principles like the Secure AI Framework.

Which orchestration option should I choose?

• If you don't want to manage VMs, we recommend starting with Google Kubernetes Engine (GKE)
• If you need to use multiple schedulers, or can't use GKE, we recommend using Cluster Director
• If you want complete control over your infrastructure, the only way to achieve that is by working directly with VMs, and for that, Google Compute Engine is your best option


Is AI Hypercomputer only for large workloads?

No. AI Hypercomputer can be used for workloads of any size. Smaller workloads still realize all the benefits of an integrated system, such as efficiency and simplified deployment. AI Hypercomputer also supports customers as their businesses scale, from small proofs of concept and experiments to large-scale production deployments.

Are there example recipes or reference blueprints I can start from?

Yes, we are building a library of recipes on GitHub. You can also use the Cluster Toolkit for pre-built cluster blueprints.

What technologies make up AI Hypercomputer?

AI-optimized hardware

        Storage

        • Training: Managed Lustre is ideal for demanding AI training with high throughput and PB-scale capacity. GCS Fuse (optionally with Anywhere Cache) suits larger capacity needs with more relaxed latency. Both integrate with GKE and Cluster Director.
• Inference: GCS Fuse with Anywhere Cache offers a simple solution. For higher performance, consider Hyperdisk ML. If using Managed Lustre for training in the same zone, it can also be used for inference. (A minimal data-access sketch follows this list.)
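As a small sketch of the data path, a bucket mounted with GCS Fuse is read like a local filesystem, while the same objects can also be streamed directly with the Cloud Storage client library. The bucket name, mount point, and prefix below are hypothetical placeholders.

```python
import os
from google.cloud import storage

# Option 1: read training shards through a GCS Fuse mount (assumed mount point).
MOUNT = "/mnt/gcs/datasets/shards"
for fname in sorted(os.listdir(MOUNT)):
    with open(os.path.join(MOUNT, fname), "rb") as f:
        shard = f.read()
        # ... feed shard into the input pipeline ...

# Option 2: stream the same objects with the Cloud Storage client library.
client = storage.Client()
for blob in client.list_blobs("my-training-data", prefix="datasets/shards/"):
    with blob.open("rb") as f:
        shard = f.read()
        # ... feed shard into the input pipeline ...
```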

        Networking

        • Training: Benefit from technologies like RDMA networking in VPCs, and high-bandwidth Cloud and Cross-Cloud Interconnect for rapid data transfer.
        • Inference: Utilize solutions like the GKE Inference Gateway and enhanced Cloud Load Balancing for low-latency serving. Model Armor can be integrated for AI safety and security.

        Compute: Access Google Cloud TPUs (Trillium), NVIDIA GPUs (Blackwell), and CPUs (Axion). This allows for optimization based on specific workload needs for throughput, latency, or TCO.

        Leading software and open frameworks

        • ML Frameworks and Libraries: PyTorch, JAX, TensorFlow, Keras, vLLM, JetStream, MaxText, LangChain, Hugging Face, NVIDIA (CUDA, NeMo, Triton), and many more open source and third party options.
• Compilers, Runtimes and Tools: XLA (for performance and interoperability), Pathways on Cloud, Multislice Training, Cluster Toolkit (for pre-built cluster blueprints), and many more open source and third party options. (A minimal XLA-compiled JAX example follows this list.)
        • Orchestration: Google Kubernetes Engine (GKE), Cluster Director (for Slurm, non-managed Kubernetes, BYO schedulers), and Google Compute Engine (GCE).
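As a tiny example of the framework and compiler layers working together, the same JAX code is compiled by XLA and runs unchanged on whichever accelerator backs the runtime (TPU, GPU, or CPU).

```python
import jax
import jax.numpy as jnp

# jax.jit hands this function to XLA, which compiles it for the available accelerator.
@jax.jit
def predict(params, x):
    w, b = params
    return jnp.tanh(x @ w + b)

key = jax.random.PRNGKey(0)
params = (jax.random.normal(key, (128, 64)), jnp.zeros(64))
x = jax.random.normal(key, (32, 128))

print(jax.devices())             # the TPU, GPU, or CPU devices JAX can see
print(predict(params, x).shape)  # (32, 64), computed by the XLA-compiled function
```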

        Consumption models:

        • On Demand: Pay-as-you-go.
        • Committed Use Discounts (CUDs): Save significantly (up to 70%) for long-term commitments.
        • Spot VMs: Ideal for fault-tolerant batch jobs, offering deep discounts (up to 91%).
• Dynamic Workload Scheduler (DWS): Save up to 50% for batch/fault-tolerant jobs. (An illustrative cost comparison follows this list.)
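To make the trade-off concrete, here is an illustrative monthly comparison using the maximum discounts listed above. The hourly rate is a made-up placeholder, and real discounts vary by resource, region, and term.

```python
# Illustrative only; the hourly rate is hypothetical and actual discounts vary.
on_demand_hourly = 10.00   # USD/hour for some accelerator VM (placeholder)
hours_per_month = 730

options = {
    "On Demand": 0.00,
    "Committed use discount (up to 70%)": 0.70,
    "Dynamic Workload Scheduler (up to 50%)": 0.50,
    "Spot VMs (up to 91%)": 0.91,
}

for name, discount in options.items():
    monthly = on_demand_hourly * (1 - discount) * hours_per_month
    print(f"{name:40s} ~${monthly:,.0f}/month")
```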