The Vantage Kubernetes agent is the default, recommended way to ingest cost and usage data from Kubernetes clusters into Vantage. The agent is a Docker container that runs in your Kubernetes cluster, collecting metrics and uploading them to Vantage. It reads from two Kubernetes APIs: the `kube-apiserver` for metadata and the `kubelet` for container data. Access to these APIs is controlled via Kubernetes RBAC using a Service Account and ClusterRole included in the Vantage Kubernetes agent Helm chart.
Data is periodically collected and stored for aggregation, then sent directly to the Vantage service through an API, with your Vantage API token for authentication. This process avoids extra storage costs incurred by the OpenCost integration. The agent’s architecture eliminates the need for deploying OpenCost-specific Prometheus pods, which makes scaling easier.
Install the agent with `kubectl` or via the `vantage-kubernetes-agent` Helm chart. Ensure you update the values for `VANTAGE_API_TOKEN` (obtained in the Prerequisites above) and `CLUSTER_ID` (the unique value for your cluster).

If the agent cannot verify a node's TLS certificate, set the `VANTAGE_KUBE_SKIP_TLS_VERIFY` environment variable to `true`. This setting is controlled by `agent.disableKubeTLSverify` within the Helm chart. For details, see the TLS verify error section.

If node hostnames cannot be resolved, configure the `VANTAGE_NODE_ADDRESS_TYPES` environment variable, which is controlled by `agent.nodeAddressTypes` in the Helm chart. In this case, the type to use for your cluster will most likely be `InternalIP`. For configuration details, see the DNS lookup error section.

The `CLUSTER_ID` value can contain:

- letters (`a-z`, `A-Z`)
- numbers (`0-9`)
- periods (`.`), underscores (`_`), and hyphens (`-`)

Annotations are passed to the agent through `VANTAGE_ALLOWED_ANNOTATIONS`, an environment variable set at startup. To enable the collection of annotations, configure the `agent.allowedAnnotations` parameter of the Helm chart with a list of annotations to be sent to Vantage. Note that there is a maximum of 10 annotations, and values are truncated after 100 characters.

Namespace labels are passed to the agent through `VANTAGE_COLLECT_NAMESPACE_LABELS`, an environment variable set at startup. To enable the collection of namespace labels, configure the `agent.collectNamespaceLabels` parameter of the Helm chart.

PVC label collection is available in agent versions `v1.0.30` and later. PVC labels are passed to the agent through `VANTAGE_COLLECT_PVC_LABELS`, an environment variable set at startup. To enable the collection of PVC labels, set `agent.collectPVCLabels` to `true` in the agent's Helm chart configuration. PVC labels are collected from persistent volume claims associated with pods.
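Using the Helm parameters named above, a `values.yaml` that enables all three collection options might look like the following sketch (the annotation keys are placeholders, not values from this page):

```yaml
# values.yaml -- label and annotation collection (sketch)
agent:
  # Up to 10 annotations; values are truncated after 100 characters
  allowedAnnotations:
    - team.example.com/owner       # placeholder annotation key
    - deploy.example.com/revision  # placeholder annotation key
  collectNamespaceLabels: true
  collectPVCLabels: true           # requires agent v1.0.30 or later
```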
You can also use `helm template` to generate a static manifest via the existing repo. This option generates YAML files that you can then deploy however you want.
Pass configuration values with the `--set` flag, or include them using one of the many other options Helm supports, such as a values file passed with `-f`.
Set `image.tag` when you upgrade, then upgrade and deploy your Helm chart using the following command:

```shell
helm repo update && helm upgrade -n vantage vka vantage/vantage-kubernetes-agent \
  --set agent.pollingInterval={interval},image.tag={special-tag} \
  --reuse-values
```
To change the polling interval, configure the `agent.pollingInterval` parameter of the Helm chart with the desired polling period in seconds, for example `--set agent.pollingInterval=30` for a 30-second polling interval. If you enter a polling interval that is not in the list of allowed intervals, the agent fails to start and an error message is returned in the response.
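Equivalently to the `--set` flag, the interval can be pinned in a values file, using the `agent.pollingInterval` key described above:

```yaml
# values.yaml -- polling interval (sketch)
agent:
  pollingInterval: 30   # seconds; must be one of the allowed intervals
```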
You can inspect the error with the `kubectl describe pod/<pod_name> -n vantage` command. In the Vantage Helm chart, the polling interval is found in the `VANTAGE_POLLING_INTERVAL` environment variable. You can verify when nodes were last scraped using the `vantage_last_node_scrape_timestamp_seconds` metric provided by the agent.
It is recommended that you monitor system performance and adjust the interval as needed to balance granularity with resource usage.
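One way to monitor this is a Prometheus alerting rule on the agent's scrape-timestamp metric; this is a sketch, and the 15-minute threshold is an arbitrary assumption, not a value from this page:

```yaml
# Prometheus alerting rule (sketch)
groups:
  - name: vantage-agent
    rules:
      - alert: VantageAgentScrapeStale
        # Fires when the most recent node scrape is older than 15 minutes
        expr: time() - vantage_last_node_scrape_timestamp_seconds > 900
        for: 5m
        labels:
          severity: warning
```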
Confirm that the pod is `READY`:

Then check the logs for `ERROR` messages:

The agent should start without any `ERROR` log line. It should also attempt a report soon after the initial start:

The agent exposes a `/metrics` endpoint, by default on port `9010`. This port can be changed via the Helm chart's `service.port` value.
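For example, to move the metrics endpoint off the default port, a values sketch using the chart's `service.port` value (the port number here is an arbitrary placeholder):

```yaml
# values.yaml -- metrics endpoint port (sketch)
service:
  port: 9100   # default is 9010
```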
The metrics endpoint includes standard Golang process stats as well as agent-related metrics for node scrape results, node scrape duration, internal data persistence, and reporting.
For users who want to monitor the agent:
- `vantage_last_node_scrape_count{result="fail"}` should be low (between 0 and 1% of total nodes). Some failures may occur as nodes come and go within the cluster, but consistent failures are not expected and should be investigated.
- `rate(vantage_report_count{result="fail"}[5m])` should be 0. Reporting occurs within the first 5 minutes of every hour and retries roughly once per minute. Each failure increments this counter. If the agent is unable to report within the first 10 minutes of an hour, some data may be lost from the previous window, as only the previous ~70 data points are retained.

The Helm chart sets a default `persist.mountPath` value of `/var/lib/vantage-agent`, which enables PV-based persistence by default. To disable PV persistence, set `persist: null` in your `values.yaml`.
If Persistent Volumes are not supported with your cluster, or if you prefer to centralize persistence, S3 is available as an alternative for agents deployed in AWS. See the section below for details. If you require persistence to a different object store, contact support@vantage.sh.
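A values sketch for the `persist: null` setting described above:

```yaml
# values.yaml -- disable the default PV-backed persistence
# (mounted at /var/lib/vantage-agent when enabled)
persist: null
```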
Each hour, the agent bundles its data files into a `.tar` archive and moves them to a backup location, determined by either the `PERSIST_DIR` or `PERSIST_S3_BUCKET` environment variable, if provided. If neither is set, the agent does not store hourly backups. For persistence configuration options, see the section above.

You can check the agent logs with the `kubectl logs <pod-name>` command. Note that the `vantage` namespace and `vka-vantage-kubernetes-agent` service account names may vary based on your configuration.
Below are the expected associated permissions for the IAM role:
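The original policy listing is not reproduced here; a minimal sketch of the S3 permissions such a role would need might look like the following, where the bucket name is a placeholder and the exact set of actions required may differ:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": "arn:aws:s3:::my-vantage-agent-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-vantage-agent-bucket"
    }
  ]
}
```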
To enable S3-backed persistence, set `VANTAGE_PERSIST_S3_BUCKET` in the agent deployment. The agent writes all objects under a `$CLUSTER_ID/` prefix within the bucket, so multiple agents can use the same bucket as long as they do not have overlapping `CLUSTER_ID` values. An optional prefix can be prepended with `VANTAGE_PERSIST_S3_PREFIX`, resulting in `$VANTAGE_PERSIST_S3_PREFIX/$CLUSTER_ID/` as the prefix used by the agent for all uploaded objects.
The `failed to fetch presigned urls` error can occur for a few reasons, described below.
One cause is an invalid `Authorization` header field value. The error log typically looks like this:

Verify that your `VANTAGE_API_TOKEN` (obtained in the Prerequisites above) is valid and properly formatted. If necessary, generate a new token and update the configuration.
A `MissingRegion` error indicates the agent could not determine an AWS region. Verify that the `eks.amazonaws.com/role-arn` annotation is correctly added to the Service Account. Run the following command to inspect the configuration:
The service account name is set in the `serviceAccount` configuration block of the Helm chart values file; you can also set a custom name within that file.

DNS lookup failures appear in the agent logs as lines containing `"level":"ERROR","msg":"failed to scrape node"` and `no such host`.
The agent uses the node status addresses to determine what hostname to look up for the node's stats, which are available via the `/metrics/resource` endpoint. This can be configured with the `VANTAGE_NODE_ADDRESS_TYPES` environment variable, which is controlled by `agent.nodeAddressTypes` in the Helm chart. By default, the priority order is `Hostname,InternalDNS,InternalIP,ExternalDNS,ExternalIP`.

To understand which type to use for your cluster, look at the available addresses for one of your nodes. The `type` corresponds to one of the configurable `nodeAddressTypes`.
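For example, to prefer `InternalIP` over the default priority order, a values sketch using the `agent.nodeAddressTypes` key described above:

```yaml
# values.yaml -- node address resolution (sketch)
agent:
  nodeAddressTypes: "InternalIP"   # sets VANTAGE_NODE_ADDRESS_TYPES
```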
Permissions errors can occur when the agent's ClusterRole lacks access to resources it needs, such as `PersistentVolumeClaim`, `PersistentVolume`, and `Pod`.
An example error log line might look like:
To apply the update with `helm`, run:
The agent connects to each node to collect container data from the `/metrics/resource` endpoint. This access is managed via Kubernetes RBAC, but in some cases the node's TLS certificate may not be valid, which results in TLS errors when attempting this connection. This most often affects clusters in AKS. To skip TLS verification within the Kubernetes client, set the `VANTAGE_KUBE_SKIP_TLS_VERIFY` environment variable to `true`. This setting is controlled by `agent.disableKubeTLSverify` within the Helm chart. It does not affect requests outside of the cluster itself, such as to the Vantage API or S3.
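A values sketch for the `agent.disableKubeTLSverify` setting described above:

```yaml
# values.yaml -- skip TLS verification for in-cluster kubelet requests only
agent:
  disableKubeTLSverify: true   # sets VANTAGE_KUBE_SKIP_TLS_VERIFY=true
```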
An example error log line might look like:
When this issue occurs, `ERROR`-level messages appear:

The events are associated with the `vka-vantage-kubernetes-agent-0` pod and include an error that contains `unbound immediate PersistentVolumeClaims`.
The resolution to this error is based on the cluster’s configuration and the specific cloud provider. More information might be present on the persistent volume claim or persistent volume. For Kubernetes clusters on AWS, S3 can be used for data persistence.
Additional provider references are also listed here:
If you see `binding volumes: context deadline exceeded`, your cluster may not have volume support. This error typically occurs when the cluster is unable to provision or attach the persistent storage volumes required by your applications. Check your cluster's configuration and ensure the storage provider is properly set up.
For labels containing characters such as hyphens (`-`), the OpenCost integration received the normalized versions of those labels from Prometheus. The Vantage Kubernetes agent, by contrast, retrieves labels directly from the `kube-apiserver`, resulting in more precise data. However, this change may require updates to filters that previously relied on the normalized values. You can contact support@vantage.sh to have these filters converted for you.