A primary provider (e.g., AWS, Azure, or GCP) is required to connect Kubernetes costs.
Agent Functionality
The Vantage Kubernetes agent relies on native Kubernetes APIs, such as kube-apiserver for metadata and kubelet for container data. Access to these APIs is controlled via Kubernetes RBAC using a Service Account and ClusterRole included in the Vantage Kubernetes agent Helm chart.
Data is periodically collected and stored for aggregation, then sent directly to the Vantage service through an API, with your Vantage API token for authentication. This process avoids extra storage costs incurred by the OpenCost integration. The agent’s architecture eliminates the need for deploying OpenCost-specific Prometheus pods, which makes scaling easier.

Service Compatibility
The Vantage Kubernetes agent is compatible with the following services:
- Amazon Elastic Kubernetes Service (EKS)
- Azure Kubernetes Service (AKS)
- Google Kubernetes Engine (GKE)
At this time, the agent does not support custom rates for on-premises servers.
Google Kubernetes Engine (GKE) Autopilot
For GKE Autopilot users, you don’t need to install the agent. These costs will already be present under Cost By Resource for the Kubernetes Engine service in a Cost Report.
Install Vantage Kubernetes Agent
Prerequisites
The following prerequisites are required before you install the Vantage Kubernetes agent:
- The Helm package manager for Kubernetes
- kubectl
- A running Kubernetes cluster
- An already connected primary provider (e.g., AWS, Azure, or GCP)
- A Vantage API token with READ and WRITE scopes enabled (it’s recommended to use a service token rather than a personal access token)
- If you do not already have an integration enabled, navigate to the Kubernetes Integration page in the Vantage console, and click the Enable Kubernetes Agent button (you won’t need to do this for subsequent integrations)
- Review the section on Data Persistence before you begin
- Review the section on Naming Your Clusters
Create a Connection
The following steps are also provided in the Vantage Kubernetes agent Helm chart repository. See the Helm chart repository for all value configurations. If you would like to use a manifest-based option instead, see the Manifest-Based Deployment Option section below.
1
Add the repository for the Vantage Kubernetes agent Helm chart.
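For example (the repository URL shown here is an assumption based on the vantage-sh GitHub organization; confirm it in the Helm chart repository):
# URL is assumed; verify against the Vantage Kubernetes agent Helm chart repository
helm repo add vantage https://vantage-sh.github.io/helm-charts
helm repo update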
2
Install the vantage-kubernetes-agent Helm chart. Ensure you update the values for VANTAGE_API_TOKEN (obtained in the Prerequisites above) and CLUSTER_ID (the unique value for your cluster).
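A minimal install sketch, assuming the vantage namespace and vka release name used elsewhere in this guide; the agent.token and agent.clusterID value keys are assumptions, so confirm the exact names in the Helm chart repository:
# agent.token and agent.clusterID are assumed value names
helm install vka vantage/vantage-kubernetes-agent \
  --namespace vantage --create-namespace \
  --set agent.token=<VANTAGE_API_TOKEN> \
  --set agent.clusterID=<CLUSTER_ID>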
Azure Kubernetes Service (AKS) Connections
If you are creating an AKS connection, you will need to configure the following parameters to avoid AKS-specific errors (see the example after this list):
- Set the VANTAGE_KUBE_SKIP_TLS_VERIFY environment variable to true. This setting is controlled by agent.disableKubeTLSverify within the Helm chart. For details, see the TLS verify error section.
- Configure the VANTAGE_NODE_ADDRESS_TYPES environment variable, which is controlled by agent.nodeAddressTypes in the Helm chart. In this case, the type to use for your cluster will most likely be InternalIP. For configuration details, see the DNS lookup error section.
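For example, both settings can be applied in a single upgrade (release name and namespace assumed to match the install above):
helm upgrade -n vantage vka vantage/vantage-kubernetes-agent --reuse-values \
  --set agent.disableKubeTLSverify=true \
  --set agent.nodeAddressTypes=InternalIP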
Naming Your Clusters
When you name your clusters, ensure the cluster ID adheres to Kubernetes object naming conventions. While the agent does not enforce specific formats, valid characters include:
- Lowercase and uppercase letters (a-z, A-Z)
- Numbers (0-9)
- Periods (.), underscores (_), and hyphens (-)
Enable Collection of Annotations and Namespace Labels
You can optionally enable the collection of annotations and namespace labels (see the example after this list).
- Annotations: The agent accepts a comma-separated list of annotation keys, called VANTAGE_ALLOWED_ANNOTATIONS, as an environment variable at startup. To enable the collection of annotations, configure the agent.allowedAnnotations parameter of the Helm chart with a list of annotations to be sent to Vantage. Note there is a maximum of 10 annotations, and values are truncated after 100 characters.
- Namespace labels: The agent accepts VANTAGE_COLLECT_NAMESPACE_LABELS as an environment variable at startup. To enable the collection of namespace labels, configure the agent.collectNamespaceLabels parameter of the Helm chart.
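A sketch of enabling both options in a single upgrade; the annotation keys are placeholders, and whether agent.allowedAnnotations expects a list or a comma-separated string should be confirmed against the chart values:
# example.com/team and example.com/owner are placeholder annotation keys
helm upgrade -n vantage vka vantage/vantage-kubernetes-agent --reuse-values \
  --set "agent.allowedAnnotations={example.com/team,example.com/owner}" \
  --set agent.collectNamespaceLabels=true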
(Optional) Enable Collection of PVC Labels
This feature is available with Vantage Kubernetes agent v1.0.30 and later. The agent accepts VANTAGE_COLLECT_PVC_LABELS as an environment variable at startup. To enable the collection of PVC labels, set agent.collectPVCLabels to true in the agent's Helm chart configuration. PVC labels are collected from persistent volume claims associated with pods.
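For example, assuming the same release name and namespace as above:
helm upgrade -n vantage vka vantage/vantage-kubernetes-agent --reuse-values \
  --set agent.collectPVCLabels=true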
Manifest-Based Deployment Option
You can use helm template to generate a static manifest from the existing Helm chart repository. This option generates YAML files that you can then deploy however you want.
1
Add the repository for the Vantage Kubernetes agent Helm chart.
2
Generate the static manifest.
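A sketch of the command, reusing the assumed agent.token and agent.clusterID value keys from the install step (confirm the exact names in the chart repository):
helm template vka vantage/vantage-kubernetes-agent --namespace vantage \
  --set agent.token=<VANTAGE_API_TOKEN> --set agent.clusterID=<CLUSTER_ID> > vantage-agent.yaml
# Review the generated YAML, then deploy it with your preferred tooling, for example:
kubectl create namespace vantage
kubectl apply -n vantage -f vantage-agent.yaml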
Resource Usage
The limits provided within the Helm chart are set low to support small clusters (approximately 10 nodes) and should be considered the minimum values for deploying an agent. Estimates for larger clusters are roughly:
- ~1 CPU per 1,000 nodes
- ~5 MB of memory per node
You can override the chart's default resource requests and limits with the --set flag. You can also include the values using one of the many options Helm supports:
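For instance, requests and limits can be raised at install or upgrade time; the resources value path shown here is an assumption, so check the chart's values file for the exact key:
# resources.* is an assumed value path
helm upgrade -n vantage vka vantage/vantage-kubernetes-agent --reuse-values \
  --set resources.requests.cpu=500m \
  --set resources.limits.memory=512Mi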
Configure Polling Interval
To enable a configurable polling interval for the Vantage Kubernetes Agent, specify an image.tag when you upgrade. Upgrade and deploy your Helm chart using the following command:
helm repo update && helm upgrade -n vantage vka vantage/vantage-kubernetes-agent --set agent.pollingInterval={interval},image.tag={special-tag} --reuse-values
Set the agent.pollingInterval parameter of the Helm chart to the desired polling period in seconds, such as --set agent.pollingInterval=30 for a 30-second polling interval. If you enter a polling interval that is not in the list of allowed intervals, the agent will fail to start, and an error message is returned within the response.
To see the current polling period for a cluster, use the kubectl describe pod/<pod_name> -n vantage command. In the Vantage Helm chart, the polling interval is found in the VANTAGE_POLLING_INTERVAL environment variable. You can also confirm how often nodes are being scraped by checking the vantage_last_node_scrape_timestamp_seconds metric provided by the agent.
It is recommended that you monitor system performance and adjust the interval as needed to balance granularity with resource usage.
Validate Installation
Follow the steps below to validate the agent's installation.
1
Once installed, the agent's pod should become READY:
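For example, assuming the default vantage namespace and release name:
kubectl get pods -n vantage
# The vka-vantage-kubernetes-agent-0 pod should show READY 1/1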
2
Logs should be free of ERROR messages:
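For example, assuming the default pod name:
kubectl logs -n vantage vka-vantage-kubernetes-agent-0 | grep ERROR
# No output means no ERROR-level log lines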
3
Agent reporting should occur once per hour at the start of the hour and should not generate an ERROR log line. It should also attempt a report soon after the initial start.
You can view and manage your Kubernetes integration on the Kubernetes Integration page in the console. Hover over the integration in the list, and click Manage.
Monitoring
The agent exposes a Prometheus metrics endpoint at /metrics, served by default on port 9010. This port can be changed via the Helm chart's service.port value.
The metrics endpoint includes standard Golang process stats as well as agent-related metrics for node scrape results, node scrape duration, internal data persistence, and reporting.
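One way to inspect these metrics locally is to port-forward to the agent pod (default pod name and port assumed) and query the endpoint:
kubectl -n vantage port-forward pod/vka-vantage-kubernetes-agent-0 9010:9010
# In another terminal:
curl -s localhost:9010/metrics | grep '^vantage_'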
For users who want to monitor the agent:
1
vantage_last_node_scrape_count{result="fail"} should be low (between 0 and 1% of total nodes). Some failures may occur as nodes come and go within the cluster, but consistent failures are not expected and should be investigated.
2
rate(vantage_report_count{result="fail"}[5m]) should be 0. Reporting occurs within the first 5 minutes of every hour and will retry roughly once per minute. Each failure increments this counter. If the agent is unable to report within the first 10 minutes of an hour, some data may be lost from the previous window, as only the previous ~70 data points are retained.
Upgrade Agent
To see which version of the Kubernetes agent you are running:
1
From the top navigation, click Settings.
2
On the side navigation, click Integrations.
3
A list of all your provider integrations is displayed. Select the Kubernetes integration.
4
On the Manage tab, click the settings button (looks like a cog wheel) next to a specific integration.
5
Scroll down to the Clusters section. Each cluster that is integrated with the agent is listed along with the current agent version and indicates if the agent is out of date.

AKS users should remember to follow the AKS-specific instructions again when updating.
Data Persistence
The agent requires a persistent store for periodic backups of time-series data as well as checkpointing for periodic reporting. By default, the Helm chart configures the agent to use a Persistent Volume (PV), which works well for clusters ranging from tens to thousands of nodes. The Helm chart sets a default persist.mountPath value of /var/lib/vantage-agent, which enables PV-based persistence by default. To disable PV persistence, set persist: null in your values.yaml.
If Persistent Volumes are not supported with your cluster, or if you prefer to centralize persistence, S3 is available as an alternative for agents deployed in AWS. See the section below for details. If you require persistence to a different object store, contact support@vantage.sh.
If both a Persistent Volume and an S3 bucket are configured, the agent will prioritize S3.
Persistent Metrics Recovery
Persistent Metrics Recovery is enabled by default in Kubernetes Agent version v1.0.29. (See the Upgrade Agent section for details on how to upgrade your agent to the latest version.) When a report cannot be delivered, the agent:
- Compresses the reports into a .tar archive and moves them to a backup location, either:
  - A mounted volume in the container, or
  - An S3 bucket you configure
- Retries the upload until successful or until the report is 96 hours old.
- Deletes old reports to manage disk space and avoid unbounded storage use.
This feature does not require additional configuration flags, but it does respect the PERSIST_DIR or PERSIST_S3_BUCKET environment variables, if provided. If neither is set, the agent will not store hourly backups. For persistence configuration options, see the Data Persistence section above.
The agent emits logs and metrics to indicate when reports are stored, retried, and successfully uploaded. These logs help monitor recovery status and confirm no data loss occurred. To view these logs, run the kubectl logs <pod-name> command.
Configure Agent for S3 Persistence
The agent uses IAM roles for service accounts to access the configured bucket. The default vantage namespace and vka-vantage-kubernetes-agent service account names may vary based on your configuration.
Below are the expected associated permissions for the IAM role:
- Environment variable: Set VANTAGE_PERSIST_S3_BUCKET in the agent deployment.
- Helm chart values:
Objects are stored under the $CLUSTER_ID/ prefix within the bucket. Multiple agents can use the same bucket as long as they do not have overlapping CLUSTER_ID values. An optional prefix can be prepended with VANTAGE_PERSIST_S3_PREFIX, resulting in $VANTAGE_PERSIST_S3_PREFIX/$CLUSTER_ID/ being the prefix used by the agent for all objects uploaded.
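For example, with VANTAGE_PERSIST_S3_PREFIX set to kubernetes-agents and a CLUSTER_ID of production-eks (illustrative values), the agent would write all objects under kubernetes-agents/production-eks/ in the bucket.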
Common Errors
Failed to Fetch Presigned URL
A failed to fetch presigned urls error can occur for a few reasons, as described below.
API Token Error
The below error occurs when the agent attempts to fetch presigned URLs but fails due to an invalid Authorization header field value. The error log typically looks like this:
To resolve this error, verify that the VANTAGE_API_TOKEN (obtained in the Prerequisites above) is valid and properly formatted. If necessary, generate a new token and update the configuration.
404 Not Found Error
The below error occurs when the agent attempts to fetch presigned URLs but fails due to the cluster ID potentially including invalid characters. The error log typically looks like this:
Failed to Set Up Controller Store—MissingRegion
This error occurs when the agent cannot initialize the controller store due to missing or misconfigured AWS region settings. The error log will typically look like:
1
Verify the Service Account configuration:
- Check if the eks.amazonaws.com/role-arn annotation is correctly added to the Service Account; a command to inspect the configuration is shown after this list.
- Ensure the Service Account matches the Helm chart settings in the agent's serviceAccount configuration block of the Helm chart values file. This is a name that you can also set within the file.
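A quick way to inspect the Service Account, assuming the default names:
kubectl -n vantage get serviceaccount vka-vantage-kubernetes-agent -o yaml
# Look for the eks.amazonaws.com/role-arn annotation under metadata.annotations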
2
Ensure the IAM role is correctly set up:
- Review the AWS IAM Roles for Service Accounts documentation to confirm that the IAM role is configured with the necessary permissions and associated with the Service Account.
3
Configure S3 persistence:
- See the Configure Agent for S3 Persistence section above for details.
The S3 bucket used for persistence must be in the same region as the Kubernetes cluster to minimize latency.
4
If necessary, recreate the pod:
- If the Service Account appears correct and there’s still an issue, delete the agent pod to force a fresh start with the correct configuration:
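For example, with the default StatefulSet pod name:
kubectl -n vantage delete pod vka-vantage-kubernetes-agent-0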
DNS Lookup Error
You may receive a DNS lookup error that indicates "level":"ERROR","msg":"failed to scrape node" and no such host.
The agent uses the node status addresses to determine what hostname to look up for the node's stats, which are available via the /metrics/resource endpoint. This can be configured with the VANTAGE_NODE_ADDRESS_TYPES environment variable, which is controlled by agent.nodeAddressTypes in the Helm chart. By default, the priority order is Hostname,InternalDNS,InternalIP,ExternalDNS,ExternalIP.
To understand which type to use for your cluster, you can look at the available addresses for one of your nodes. The type field corresponds to one of the configurable nodeAddressTypes.
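For example, to list the address types reported for one node (replace <node-name> with one of your nodes):
kubectl get node <node-name> -o jsonpath='{.status.addresses}'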
EOF Error When Starting
The agent uses local files for recovering from crashes or restarts. If this backup file becomes corrupted, most commonly due to an OOMKill, the most straightforward approach to get the agent running again is to perform a fresh install or remove the PersistentVolumeClaim, PersistentVolume, and Pod.
An example error log line might look like:
To do this with helm, run:
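One possible sequence, assuming the default vka release and vantage namespace; the PVC and PV names vary by cluster, so look them up first with kubectl get pvc,pv -n vantage:
helm -n vantage uninstall vka
kubectl -n vantage delete pvc <agent-pvc-name>
kubectl delete pv <agent-pv-name>
# Then reinstall the chart as described in Create a Connection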
TLS Verify Error When Scraping Nodes
The agent connects to each node to collect usage metrics from the /metrics/resource endpoint. This access is managed via Kubernetes RBAC, but in some cases, the node's TLS certificate may not be valid, which results in TLS errors when attempting this connection. This most often affects clusters in AKS. To skip TLS verification within the Kubernetes client, you can set the VANTAGE_KUBE_SKIP_TLS_VERIFY environment variable to true. This setting is controlled by agent.disableKubeTLSverify within the Helm chart. This does not affect requests outside of the cluster itself, such as to the Vantage API or S3.
An example error log line might look like:
After applying this setting, upgrade the Helm release and confirm that no further ERROR level messages appear:
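A sketch of that upgrade, using the chart value named above and the release name assumed elsewhere in this guide:
helm upgrade -n vantage vka vantage/vantage-kubernetes-agent --reuse-values \
  --set agent.disableKubeTLSverify=true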
Pod Scheduling Errors
The most common cause for pod scheduling errors is the persistent volume not being provisioned. By default, the agent is deployed as a StatefulSet with a persistent volume for persisting internal state. The state allows the agent to recover from a restart without losing the historical data for the current reporting window. An example error for this case would be present in the events on the vka-vantage-kubernetes-agent-0 pod and include an error that contains unbound immediate PersistentVolumeClaims.
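To view these events, assuming the default pod name:
kubectl -n vantage describe pod vka-vantage-kubernetes-agent-0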
The resolution to this error is based on the cluster’s configuration and the specific cloud provider. More information might be present on the persistent volume claim or persistent volume. For Kubernetes clusters on AWS, S3 can be used for data persistence.
Additional provider references are also listed here:
- GCP: Using the Compute Engine persistent disk CSI Driver
- Azure: Container Storage Interface (CSI) drivers on Azure Kubernetes Service (AKS)
- AWS: Amazon EBS CSI driver
Volume Support Error
If you see an error related to binding volumes: context deadline exceeded, this means you may not have volume support on your cluster. This error typically occurs when your cluster is unable to provision or attach persistent storage volumes required by your applications. Check your cluster's configuration and ensure the storage provider is properly set up.
Active Resources and Rightsizing Recommendations
Rightsizing recommendations require version 1.0.24 or later of the Vantage Kubernetes agent. See the Upgrade Agent section for information on how to upgrade the agent. Once the upgrade is complete, the agent will begin uploading the data needed to generate rightsizing recommendations. After the agent is upgraded or installed, recommendations become available within 48 hours; this delay ensures there is enough data to make a valid recommendation. Historical data is not available from before the agent upgrade, so it is recommended that you watch for cyclical resource usage patterns, such as a weekly spike, when you first review recommendations.
For a full guide on understanding rightsizing and how to rightsize Kubernetes workloads, see the following article in the Cloud Cost Handbook.
Migrate Costs from OpenCost to Vantage Kubernetes Agent
If you are moving from an OpenCost integration to the agent-based integration, you can contact support@vantage.sh to have your previous integration data maintained. Any overlapping data will be removed from the agent data by the Vantage team.
Maintaining OpenCost Filters
If you previously used the OpenCost integration and are transitioning to the new agent-based integration, your existing filters will be retained. It's important to note that in situations where labels contained characters excluded from Prometheus labels, such as -, the OpenCost integration received the normalized versions of those labels from Prometheus. The Vantage Kubernetes agent, on the other hand, retrieves labels directly from the kube-apiserver, resulting in more precise data. However, this change may necessitate updates to filters that previously relied on the normalized values. You can contact support@vantage.sh to have these filters converted for you.