[[["易于理解","easyToUnderstand","thumb-up"],["解决了我的问题","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["很难理解","hardToUnderstand","thumb-down"],["信息或示例代码不正确","incorrectInformationOrSampleCode","thumb-down"],["没有我需要的信息/示例","missingTheInformationSamplesINeed","thumb-down"],["翻译问题","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["最后更新时间 (UTC):2025-09-04。"],[],[],null,["# Troubleshoot monitoring dashboards\n\n[Autopilot](/kubernetes-engine/docs/concepts/autopilot-overview) [Standard](/kubernetes-engine/docs/concepts/choose-cluster-mode)\n\n*** ** * ** ***\n\nThis page helps you resolve issues with Google Kubernetes Engine (GKE) monitoring\ndashboards, such as dashboards not appearing or data being unavailable. For more\ninformation about how to use these dashboards to troubleshoot your clusters and\nworkloads, see\n[Introduction to GKE troubleshooting](/kubernetes-engine/docs/troubleshooting/introduction).\n\nGKE dashboards are not listed in Cloud Monitoring\n-------------------------------------------------\n\nBy default Monitoring is enabled when you create a cluster.\nIf you don't see GKE dashboards when you are\n[viewing provided Google Cloud dashboards](/monitoring/charts/predefined-dashboards) in\nMonitoring, Monitoring is not enabled for\nclusters in the selected Google Cloud project.\n[Enable monitoring](/stackdriver/docs/solutions/gke/installing) to view these dashboards.\n| **Note:** For GKE Autopilot clusters, you cannot disable the Cloud Monitoring and Cloud Logging integration.\n\nNo Kubernetes resources are in my dashboard\n-------------------------------------------\n\nIf you don't see any Kubernetes resources in your GKE dashboard,\nthen check the following:\n\n### Selected Google Cloud project\n\nVerify that you have selected the correct Google Cloud project from the\ndrop-down list in the Google Cloud console menu bar to select a project. You must\nselect the project whose data you want to see.\n\n### Clusters activity\n\nIf you just created your cluster, wait a few minutes for it to populate with\ndata. See\n[Configuring logging and monitoring for GKE](/stackdriver/docs/solutions/gke/installing)\nfor details.\n\n### Time range\n\nThe selected time range might be too narrow. You can use the **Time** menu in\nthe dashboard toolbar to select other time ranges or define a **Custom** range.\n\n### Permissions to view the dashboard\n\nIf you see either of the following permission-denied error messages when viewing\na service's deployment details or a Google Cloud project's metrics, you\nneed to update your Identity and Access Management role to include **roles/monitoring.viewer**\nor **roles/viewer**:\n\n- `You do not have sufficient permissions to view this page`\n- `You don't have permissions to perform the action on the selected resources`\n\nFor more details, go to\n[Predefined roles](/monitoring/access-control#predefined_roles).\n\n### Cluster and node service account permissions to write data to Monitoring and Logging\n\nIf you see high error rates in the **Enabled APIs and services** page in\nthe Google Cloud console, then your service account might be missing the\nfollowing roles:\n\n- `roles/logging.logWriter`: In the Google Cloud console, this role is named\n **Logs Writer** . For more information on Logging roles, see\n the [Logging access control guide](/logging/docs/access-control).\n\n- `roles/monitoring.metricWriter`: In the Google Cloud console, this role is named\n **Monitoring Metric Writer** . 
No Kubernetes resources are in my dashboard
-------------------------------------------

If you don't see any Kubernetes resources in your GKE dashboard,
then check the following:

### Selected Google Cloud project

Verify that you have selected the correct Google Cloud project from the
drop-down list in the Google Cloud console menu bar. You must select the
project whose data you want to see.

### Cluster activity

If you just created your cluster, wait a few minutes for it to populate with
data. See
[Configuring logging and monitoring for GKE](/stackdriver/docs/solutions/gke/installing)
for details.

### Time range

The selected time range might be too narrow. You can use the **Time** menu in
the dashboard toolbar to select other time ranges or define a **Custom** range.

### Permissions to view the dashboard

If you see either of the following permission-denied error messages when viewing
a service's deployment details or a Google Cloud project's metrics, you
need to update your Identity and Access Management role to include **roles/monitoring.viewer**
or **roles/viewer**:

- `You do not have sufficient permissions to view this page`
- `You don't have permissions to perform the action on the selected resources`

For more details, go to
[Predefined roles](/monitoring/access-control#predefined_roles).

### Cluster and node service account permissions to write data to Monitoring and Logging

If you see high error rates on the **Enabled APIs and services** page in
the Google Cloud console, then your service account might be missing the
following roles:

- `roles/logging.logWriter`: In the Google Cloud console, this role is named
  **Logs Writer**. For more information on Logging roles, see
  the [Logging access control guide](/logging/docs/access-control).

- `roles/monitoring.metricWriter`: In the Google Cloud console, this role is named
  **Monitoring Metric Writer**. For more information on Monitoring roles, see the
  [Monitoring access control guide](/monitoring/access-control).

- `roles/stackdriver.resourceMetadata.writer`: In the Google Cloud console, this
  role is named **Stackdriver Resource Metadata Writer**. This role permits
  write-only access to resource metadata, and it provides exactly the
  permissions needed by agents to send metadata. For more information on
  Monitoring roles, see the
  [Monitoring access control guide](/monitoring/access-control).

To list your service accounts, in the Google Cloud console, go to
**IAM & Admin**, and then select **Service Accounts**.
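If the node service account is missing any of these roles, you can also grant
them with the Google Cloud CLI. The following is a minimal sketch, assuming the
placeholders `PROJECT_ID` for your project ID and `SA_EMAIL` for the node
service account's email address; repeat the binding command for each missing
role:

    # List the service accounts in the project to find the node service account.
    gcloud iam service-accounts list --project=PROJECT_ID

    # Grant one of the missing roles to the node service account.
    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member="serviceAccount:SA_EMAIL" \
        --role="roles/monitoring.metricWriter"

Changes to IAM bindings can take a minute or two to propagate before the error
rates start to drop.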
Can't view logs
---------------

If you don't see your logs in dashboards, check the following:

### Agent is running and healthy

GKE versions 1.17 and later use [Fluent Bit](https://fluentbit.io/)
to capture logs. Fluent Bit is the Logging agent that runs on Kubernetes nodes.
To check if the agent is running correctly, perform the following steps:

1. Check whether the agent is restarting by running the following command:

       kubectl get pods -l k8s-app=fluentbit-gke -n kube-system

   If there are no restarts, the output is similar to the following:

       NAME                  READY   STATUS    RESTARTS   AGE
       fluentbit-gke-6zr6g   2/2     Running   0          44d
       fluentbit-gke-dzh9l   2/2     Running   0          44d

2. Check Pod status conditions by running the following command:

       JSONPATH='{range .items[*]};{@.metadata.name}:{range @.status.conditions[*]}{@.type}={@.status},{end}{end};' \
       && kubectl get pods -l k8s-app=fluentbit-gke -n kube-system -o jsonpath="$JSONPATH" | tr ";" "\n"

   If the deployment is healthy, the output is similar to the following:

       fluentbit-gke-nj4qs:Initialized=True,Ready=True,ContainersReady=True,PodScheduled=True,
       fluentbit-gke-xtcvt:Initialized=True,Ready=True,ContainersReady=True,PodScheduled=True,

3. Check the DaemonSet status, which can help you determine whether the
   deployment is healthy, by running the following command:

       kubectl get daemonset -l k8s-app=fluentbit-gke -n kube-system

   If the deployment is healthy, the output is similar to the following:

       NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
       fluentbit-gke   2         2         2       2            2           kubernetes.io/os=linux   5d19h

   In this example output, the desired state matches the current state.

If the agent is running and healthy but you still don't see all of your logs,
the agent might be overloaded and dropping logs.

### Agent overloaded and dropping logs

One possible reason you're not seeing all of your logs is that the node's log
volume is overloading the agent. The default Logging agent configuration in
GKE is tuned for a rate of 100 kiB per second for each node, and the agent
might start dropping logs if the volume exceeds that limit.

To detect whether you might be hitting this limit, look for any of the following
indicators:

- View the `kubernetes.io/container/cpu/core_usage_time` metric with the filter
  `container_name=fluentbit-gke` to see if the CPU usage of the Logging
  agent is near or at 100%.

- View the `logging.googleapis.com/byte_count` metric grouped by
  `metadata.system_labels.node_name` to see if any node reaches 100 kiB per
  second.

If you see any of these conditions, you can reduce the log volume of your nodes
by adding more nodes to the cluster. If all of the log volume comes from a
single Pod, then you need to reduce the log volume from that Pod.

For more information about investigating and resolving GKE logging-related
issues, see
[Troubleshooting logging in GKE](/kubernetes-engine/docs/troubleshooting/troubleshooting-gke-logging).

Incident isn't matched to a GKE resource
----------------------------------------

If you have an alerting policy condition that aggregates metrics across distinct
GKE resources, you might need to edit the policy's condition to include more
GKE hierarchy labels to associate incidents with specific entities.

For example, you might have two GKE clusters, one for production and one for
staging, each with its own copy of the service `lilbuddy-2`. When the alerting
policy condition aggregates a metric across containers in both clusters, the
GKE Monitoring dashboard can't associate an incident uniquely with either the
production service or the staging service.

To resolve this situation, target the alerting policy to a specific service by
adding `namespace`, `cluster`, and `location` to the policy's **Group By**
field. On the event card for the alert, click the **Update alert policy** link
to open the **Edit alerting policy** page for the relevant alerting policy. From
there, you can update the alerting policy with the additional information so that
the dashboard can find the associated resource.

After you update the alerting policy, the GKE Monitoring dashboard can associate
all future incidents with a unique service in a particular cluster, giving you
additional information to diagnose the problem.

Depending on your use case, you might want to filter on some of these labels in
addition to adding them to the **Group By** field. For example, if you only want
alerts for your production cluster, you can filter on `cluster_name`.

What's next
-----------

- If you can't find a solution to your problem in the documentation, see
  [Get support](/kubernetes-engine/docs/getting-support) for further help,
  including advice on the following topics:

  - Opening a support case by contacting [Cloud Customer Care](/support-hub).
  - Getting support from the community by
    [asking questions on Stack Overflow](http://stackoverflow.com/questions/tagged/google-kubernetes-engine)
    and using the `google-kubernetes-engine` tag to search for similar issues.
    You can also join the
    [`#kubernetes-engine` Slack channel](https://googlecloud-community.slack.com/messages/C0B9GKTKJ/)
    for more community support.
  - Opening bugs or feature requests by using the
    [public issue tracker](/support/docs/issue-trackers).