[[["容易理解","easyToUnderstand","thumb-up"],["確實解決了我的問題","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["難以理解","hardToUnderstand","thumb-down"],["資訊或程式碼範例有誤","incorrectInformationOrSampleCode","thumb-down"],["缺少我需要的資訊/範例","missingTheInformationSamplesINeed","thumb-down"],["翻譯問題","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["上次更新時間:2025-09-04 (世界標準時間)。"],[],[],null,["# Introduction to Cloud TPU\n=========================\n\n\u003cbr /\u003e\n\nTensor Processing Units (TPUs) are Google's custom-developed,\napplication-specific integrated circuits (ASICs) used to accelerate machine\nlearning workloads. For more information about TPU hardware, see [TPU architecture](/tpu/docs/system-architecture-tpu-vm).\nCloud TPU is a web service that makes TPUs available as scalable computing\nresources on Google Cloud.\n\nTPUs train your models more efficiently using hardware designed for performing\nlarge matrix operations often found in machine learning algorithms. TPUs have\non-chip high-bandwidth memory (HBM) letting you use larger models and batch\nsizes. TPUs can be connected in groups called slices that scale up your\nworkloads with little to no code changes.\n\nCode that runs on TPUs must be compiled by the accelerator linear algebra (XLA)\ncompiler. [XLA](https://www.tensorflow.org/performance/xla/) is a just-in-time\ncompiler that takes the graph emitted by an ML framework application and\ncompiles the linear algebra, loss, and gradient components of the graph into\nTPU machine code. The rest of the program runs on the TPU host machine. The XLA\ncompiler is part of the TPU VM image that runs on a TPU host machine.\n\nFor more information about Tensor Processing Units, see\n[How to think about TPUs](https://jax-ml.github.io/scaling-book/tpus/).\n\nWhen to use TPUs\n----------------\n\nCloud TPUs are optimized for specific workloads. In some situations, you might\nwant to use GPUs or CPUs on Compute Engine instances to run your\nmachine learning workloads. 
For more information about Tensor Processing Units, see
[How to think about TPUs](https://jax-ml.github.io/scaling-book/tpus/).

When to use TPUs
----------------

Cloud TPUs are optimized for specific workloads. In some situations, you might
want to use GPUs or CPUs on Compute Engine instances to run your machine
learning workloads. In general, you can decide what hardware is best for your
workload based on the guidelines that follow.

### CPUs

- Quick prototyping that requires maximum flexibility
- Simple models that don't take long to train
- Small models with small effective batch sizes
- Models that contain many [custom TensorFlow operations written in C++](https://www.tensorflow.org/guide/create_op)
- Models that are limited by available I/O or the networking bandwidth of the host system

### GPUs

- Models with a significant number of custom PyTorch/JAX operations that must run at least partially on CPUs
- Models with TensorFlow ops that are not available on Cloud TPU (see the list of [available TensorFlow ops](/tpu/docs/tensorflow-ops))
- Medium-to-large models with larger effective batch sizes

### TPUs

- Models dominated by matrix computations
- Models with no custom PyTorch/JAX operations inside the main training loop
- Models that train for weeks or months
- Large models with large effective batch sizes
- Models with ultra-large embeddings, common in advanced ranking and recommendation workloads

Cloud TPUs are *not* suited to the following workloads:

- Linear algebra programs that require frequent branching or contain many element-wise operations
- Workloads that require high-precision arithmetic
- Neural network workloads that contain custom operations in the main training loop

TPUs in Google Cloud
--------------------

You can use TPUs through Cloud TPU VMs, Google Kubernetes Engine, and
Vertex AI.

Best practices for model development
------------------------------------

A program whose computation is dominated by non-matrix operations, such as add,
reshape, or concatenate, is unlikely to achieve high MXU utilization. The
following guidelines help you choose and build models that are suitable for
Cloud TPU.

### Layout

The XLA compiler performs code transformations, including tiling a matrix
multiply into smaller blocks, to efficiently execute computations on the matrix
unit (MXU). The XLA compiler tiles for efficiency using the structure of the
MXU hardware, a 128x128
[systolic array](https://en.wikipedia.org/wiki/Systolic_array), and the design
of the TPU's memory subsystem, which prefers dimensions that are multiples
of 8.

Consequently, certain layouts are more conducive to tiling, while others
require *reshapes* to be performed before they can be tiled. Reshape operations
are often memory bound on the Cloud TPU.

### Shapes

The XLA compiler compiles an ML graph just in time for the first batch. If any
subsequent batch has a different shape, the graph must be recompiled, and
recompiling the graph every time the shape changes is too slow. Therefore, any
model that has tensors with dynamic shapes isn't well suited to TPUs. One
common workaround is to pad variable-sized batches to a fixed size, as shown in
the sketch after the Dimensions section.

### Padding

A high-performing Cloud TPU program is one whose dense compute can be tiled
into 128x128 chunks. When a matrix computation cannot occupy an entire MXU, the
compiler pads tensors with zeros. There are two drawbacks to padding:

- Tensors padded with zeros under-utilize the TPU core.
- Padding increases the amount of on-chip memory storage required for a tensor and can lead to an out-of-memory error in the extreme case.

While padding is performed automatically by the XLA compiler when necessary,
you can determine the amount of padding performed with the
[op_profile](/tpu/docs/cloud-tpu-tools#interpreting_the_results_1) tool. You
can avoid padding by picking tensor dimensions that are well suited to TPUs.

### Dimensions

Choosing suitable tensor dimensions goes a long way toward extracting maximum
performance from the TPU hardware, particularly the MXU. The XLA compiler
attempts to use either the batch size or a feature dimension to maximally use
the MXU, so one of these must be a multiple of 128; otherwise, the compiler
pads one of them to 128. Ideally, batch size and feature dimensions should be
multiples of 8, which enables extracting high performance from the memory
subsystem.
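The following sketch, in JAX with illustrative names (the `pad_batch` helper is
hypothetical, not a Cloud TPU API), ties the Shapes and Dimensions guidelines
together: pad every incoming batch to one fixed, MXU-friendly size so that the
graph is compiled once and the matrix multiply tiles onto the 128x128 MXU
without compiler-inserted zero padding.

```python
# Sketch: fixed, MXU-friendly shapes avoid recompilation and padding.
# Assumes a Cloud TPU VM with JAX installed; names are illustrative.
import jax
import jax.numpy as jnp
import numpy as np

BATCH_BUCKET = 128   # Multiple of 128: tiles cleanly onto the MXU.
FEATURES = 512       # Also a multiple of 128, so no zero padding.

@jax.jit
def dense_layer(weights, inputs):
    return jnp.dot(inputs, weights)

def pad_batch(batch):
    """Pads a variable-length batch with zero rows up to BATCH_BUCKET."""
    return np.pad(batch, ((0, BATCH_BUCKET - batch.shape[0]), (0, 0)))

weights = jnp.ones((FEATURES, 256))

# Raw batches of 97 and 113 examples would each trigger a slow
# recompilation; after padding, both reuse the one (128, 512) graph.
for n in (97, 113):
    out = dense_layer(weights, pad_batch(np.ones((n, FEATURES))))
    print(out.shape)  # (128, 256); ignore rows beyond n downstream.
```

The zero rows waste some compute, which is the same trade-off the Padding
section describes; choosing the bucket size as a multiple of 128 keeps that
waste bounded and predictable.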
Getting started with Cloud TPU
------------------------------

1. [Set up a Google Cloud account](/tpu/docs/setup-gcp-account)
2. [Activate the Cloud TPU API](/tpu/docs/activate-apis)
3. [Grant Cloud TPU access to your Cloud Storage buckets](/tpu/docs/storage-buckets)
4. [Run a basic calculation on a TPU](/tpu/docs/quick-starts)
5. [Train a reference model on a TPU](/tpu/docs/tutorials)
6. [Analyze your model](/tpu/docs/cloud-tpu-tools)

Requesting help
---------------

To get help, contact [Cloud TPU support](/tpu/docs/getting-support).
If you have an active Google Cloud project, be prepared to provide the
following information:

- Your Google Cloud project ID
- Your TPU name, if one exists
- Other information you want to provide

What's next?
------------

Looking to learn more about Cloud TPU? The following resources may help:

- [Cloud TPU architecture](/tpu/docs/system-architecture-tpu-vm)
- [Cloud TPU pricing](/tpu/docs/pricing)
- [Contact sales](/contact)