Kubernetes Monitoring Operator Installation and Configuration

12/03/2024 Contributors

Data Infrastructure Insights offers the Kubernetes Monitoring Operator for Kubernetes collection. Navigate to Kubernetes > Collectors > +Kubernetes Collector to deploy a new operator.

Before installing the Kubernetes Monitoring Operator

See the Pre-requisites documentation before installing or upgrading the Kubernetes Monitoring Operator.

Installing the Kubernetes Monitoring Operator

Monitoring Operator Instructions

Steps to install Kubernetes Monitoring Operator agent on Kubernetes:

Enter a unique cluster name and namespace. If you are upgrading from a previous Kubernetes Operator, use the same cluster name and namespace.
Once these are entered, you can copy the Download Command snippet to the clipboard.
Paste the snippet into a bash window and execute it. The Operator installation files will be downloaded. Note that the snippet has a unique key and is valid for 24 hours.
If you have a custom or private repository, copy the optional Image Pull snippet, paste it into a bash shell and execute it. Once the images have been pulled, copy them to your private repository. Be sure to maintain the same tags and folder structure. Update the paths in operator-deployment.yaml as well as the docker repository settings in operator-config.yaml.
If desired, review available configuration options such as proxy or private repository settings. You can read more about configuration options.
When you are ready, deploy the Operator by copying the kubectl Apply snippet, downloading it, and executing it.
The installation proceeds automatically. When it is complete, click the Next button.
When installation is complete, click the Next button. Be sure to also delete or securely store the operator-secrets.yaml file.

If you are using a proxy, read about configuring proxy.

If you are have a custom repository, read about using a custom/private docker repository.

Kubernetes Monitoring Components

Data Infrastructure Insights Kubernetes Monitoring is comprised of four monitoring components:

Cluster Metrics
Network Performance and Map (optional)
Event Logs (optional)
Change Analysis (optional)

The optional components above are enabled by default for each Kubernetes collector; if you decide you don't need a component for a particular collector, you can disable it by navigating to Kubernetes > Collectors and selecting Modify Deployment from the collector's "three dots" menu on the right of the screen.

Modify deployment Menu on Kubernetes Collector list page

The screen shows the current state of each component and allows you to disable or enable components for that collector as needed.

Modify deployment options

Upgrading to the latest Kubernetes Monitoring Operator

Determine whether an AgentConfiguration exists with the existing Operator (if your namespace is not the default netapp-monitoring, substitute the appropriate namespace):

kubectl -n netapp-monitoring get agentconfiguration netapp-monitoring-configuration

If an AgentConfiguration exists:

Install the latest Operator over the existing Operator.
- Ensure you are pulling the latest container images if you are using a custom repository.

If the AgentConfiguration does not exist:

Make note of your cluster name as recognized by Data Infrastructure Insights (if your namespace is not the default netapp-monitoring, substitute the appropriate namespace):
```
kubectl -n netapp-monitoring get agent -o jsonpath='{.items[0].spec.cluster-name}'
```
Create a backup of the existing Operator (if your namespace is not the default netapp-monitoring, substitute the appropriate namespace):
```
kubectl -n netapp-monitoring get agent -o yaml > agent_backup.yaml
```
Uninstall the existing Operator.
Install the latest Operator.
- Use the same cluster name.
- After downloading the latest Operator YAML files, port any customizations found in agent_backup.yaml to the downloaded operator-config.yaml before deploying.
- Ensure you are pulling the latest container images if you are using a custom repository.

Stopping and Starting the Kubernetes Monitoring Operator

To stop the Kubernetes Monitoring Operator:

kubectl -n netapp-monitoring scale deploy monitoring-operator --replicas=0

To start the Kubernetes Monitoring Operator:

kubectl -n netapp-monitoring scale deploy monitoring-operator --replicas=1

Uninstalling

To remove the Kubernetes Monitoring Operator

Note that the default namespace for the Kubernetes Monitoring Operator is "netapp-monitoring". If you have set your own namespace, substitute that namespace in these and all subsequent commands and files.

Newer versions of the monitoring operator can be uninstalled with the following commands:

kubectl -n <NAMESPACE> delete agent -l installed-by=nkmo-<NAMESPACE>
kubectl -n <NAMESPACE> delete clusterrole,clusterrolebinding,crd,svc,deploy,role,rolebinding,secret,sa -l installed-by=nkmo-<NAMESPACE>

If the monitoring operator was deployed in its own dedicated namespace, delete the namespace:

kubectl delete ns <NAMESPACE>

If the first command returns “No resources found”, use the following instructions to uninstall older versions of the monitoring operator.

Execute each of the following commands in order. Depending on your current installation, some of these commands may return ‘object not found’ messages. These messages may be safely ignored.

kubectl -n <NAMESPACE> delete agent agent-monitoring-netapp
kubectl delete crd agents.monitoring.netapp.com
kubectl -n <NAMESPACE> delete role agent-leader-election-role
kubectl delete clusterrole agent-manager-role agent-proxy-role agent-metrics-reader <NAMESPACE>-agent-manager-role <NAMESPACE>-agent-proxy-role <NAMESPACE>-cluster-role-privileged
kubectl delete clusterrolebinding agent-manager-rolebinding agent-proxy-rolebinding agent-cluster-admin-rolebinding <NAMESPACE>-agent-manager-rolebinding <NAMESPACE>-agent-proxy-rolebinding <NAMESPACE>-cluster-role-binding-privileged
kubectl delete <NAMESPACE>-psp-nkmo
kubectl delete ns <NAMESPACE>

If a Security Context Constraint was previously-created:

kubectl delete scc telegraf-hostaccess

About Kube-state-metrics

The NetApp Kubernetes Monitoring Operator installs its own kube-state-metrics to avoid conflict with any other instances.

For information about Kube-State-Metrics, see this page.

Configuring/Customizing the Operator

These sections contain information on customizing your operator configuration, working with proxy, using a custom or private docker repository, or working with OpenShift.

Configuration Options

Most commonly modified settings can be configured in the AgentConfiguration custom resource. You can edit this resource before deploying the operator by editing the operator-config.yaml file. This file includes commented-out examples of settings. See the list of available settings for the most recent version of the operator.

You can also edit this resource after the operator has been deployed by using the following command:

kubectl -n netapp-monitoring edit AgentConfiguration

To determine if your deployed version of the operator supports AgentConfiguration, run the following command:

kubectl get crd agentconfigurations.monitoring.netapp.com

If you see an “Error from server (NotFound)” message, your operator must be upgraded before you can use the AgentConfiguration.

Configuring Proxy Support

There are two places where you may use a proxy on your tenant in order to install the Kubernetes Monitoring Operator. These may be the same or separate proxy systems:

Proxy needed during execution of the installation code snippet (using "curl") to connect the system where the snippet is executed to your Data Infrastructure Insights environment
Proxy needed by the target Kubernetes cluster to communicate with your Data Infrastructure Insights environment

If you use a proxy for either or both of these, in order to install the Kubernetes Operating Monitor you must first ensure that your proxy is configured to allow good communication to your Data Infrastructure Insights environment. If you have a proxy and can access Data Infrastructure Insights from the server/VM from which you wish to install the Operator, then your proxy is likely configured properly.

For the proxy used to install the Kubernetes Operating Monitor, before installing the Operator, set the http_proxy/https_proxy environment variables. For some proxy environments, you may also need to set the no_proxy environment variable.

To set the variable(s), perform the following steps on your system before installing the Kubernetes Monitoring Operator:

Set the https_proxy and/or http_proxy environment variable(s) for the current user:
1. If the proxy being setup does not have Authentication (username/password), run the following command:
  export https_proxy=<proxy_server>:<proxy_port>
2. If the proxy being setup does have Authentication (username/password), run this command:
  export http_proxy=<proxy_username>:<proxy_password>@<proxy_server>:<proxy_port>

For the proxy used for your Kubernetes cluster to communicate with your Data Infrastructure Insights environment, install the Kubernetes Monitoring Operator after reading all of these instructions.

Configure the proxy section of AgentConfiguration in operator-config.yaml before deploying the Kubernetes Monitoring Operator.

agent:
  ...
  proxy:
    server: <server for proxy>
    port: <port for proxy>
    username: <username for proxy>
    password: <password for proxy>

    # In the noproxy section, enter a comma-separated list of
    # IP addresses and/or resolvable hostnames that should bypass
    # the proxy
    noproxy: <comma separated list>

    isTelegrafProxyEnabled: true
    isFluentbitProxyEnabled: <true or false> # true if Events Log enabled
    isCollectorsProxyEnabled: <true or false> # true if Network Performance and Map enabled
    isAuProxyEnabled: <true or false> # true if AU enabled
  ...
...

Using a custom or private docker repository

By default, the Kubernetes Monitoring Operator will pull container images from the Data Infrastructure Insights repository. If you have a Kubernetes cluster used as the target for monitoring, and that cluster is configured to only pull container images from a custom or private Docker repository or container registry, you must configure access to the containers needed by the Kubernetes Monitoring Operator.

Run the “Image Pull Snippet” from the NetApp Monitoring Operator install tile. This command will log into the Data Infrastructure Insights repository, pull all image dependencies for the operator, and log out of the Data Infrastructure Insights repository. When prompted, enter the provided repository temporary password. This command downloads all images used by the operator, including for optional features. See below for which features these images are used for.

Core Operator Functionality and Kubernetes Monitoring

netapp-monitoring
ci-kube-rbac-proxy
ci-ksm
ci-telegraf
distroless-root-user

Events Log

ci-fluent-bit
ci-kubernetes-event-exporter

Network Performance and Map

ci-net-observer

Push the operator docker image to your private/local/enterprise docker repository according to your corporate policies. Ensure that the image tags and directory paths to these images in your repository are consistent with those in the Data Infrastructure Insights repository.

Edit the monitoring-operator deployment in operator-deployment.yaml, and modify all image references to use your private Docker repository.

image: <docker repo of the enterprise/corp docker repo>/ci-kube-rbac-proxy:<ci-kube-rbac-proxy version>
image: <docker repo of the enterprise/corp docker repo>/netapp-monitoring:<version>

Edit the AgentConfiguration in operator-config.yaml to reflect the new docker repo location. Create a new imagePullSecret for your private repository, for more details see https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/

agent:
  ...
  # An optional docker registry where you want docker images to be pulled from as compared to CI's docker registry
  # Please see documentation link here: link:task_config_telegraf_agent_k8s.html#using-a-custom-or-private-docker-repository
  dockerRepo: your.docker.repo/long/path/to/test
  # Optional: A docker image pull secret that maybe needed for your private docker registry
  dockerImagePullSecret: docker-secret-name

OpenShift Instructions

If you are running on OpenShift 4.6 or higher, you must edit the AgentConfiguration in operator-config.yaml to enable the runPrivileged setting:

# Set runPrivileged to true SELinux is enabled on your kubernetes nodes
runPrivileged: true

Openshift may implement an added level of security that may block access to some Kubernetes components.

Tolerations and Taints

The netapp-ci-telegraf-ds, netapp-ci-fluent-bit-ds, and netapp-ci-net-observer-l4-ds DaemonSets must schedule a pod on every node in your cluster in order to correctly collect data on all nodes. The operator has been configured to tolerate some well known taints. If you have configured any custom taints on your nodes, thus preventing pods from running on every node, you can create a toleration for those taints in the AgentConfiguration. If you have applied custom taints to all nodes in your cluster, you must also add the necessary tolerations to the operator deployment to allow the operator pod to be scheduled and executed.

Learn More about Kubernetes Taints and Tolerations.

Return to the NetApp Kubernetes Monitoring Operator Installation page

A Note About Secrets

To remove permission for the Kubernetes Monitoring Operator to view secrets cluster-wide, delete the following resources from the operator-setup.yaml file before installing:

 ClusterRole/netapp-ci-<namespace>-agent-secret-clusterrole
 ClusterRoleBinding/netapp-ci-<namespace>-agent-secret-clusterrolebinding

If this is an upgrade, also delete the resources from your cluster:

 kubectl delete ClusterRole/netapp-ci-<namespace>-agent-secret-clusterrole
 kubectl delete ClusterRoleBinding/netapp-ci-<namespace>-agent-secret-clusterrolebinding

If Change Analysis is enabled, modify the AgentConfiguration or operator-config.yaml to uncomment the change-management section and include kindsToIgnoreFromWatch: '"secrets"' under the change-management section. Note the presence and position of single and double quotes in this line.

# change-management:
  ...
  # # A comma separated list of kinds to ignore from watching from the default set of kinds watched by the collector
  # # Each kind will have to be prefixed by its apigroup
  # # Example: '"networking.k8s.io.networkpolicies,batch.jobs", "authorization.k8s.io.subjectaccessreviews"'
  kindsToIgnoreFromWatch: '"secrets"'
  ...

Verifying Kubernetes Monitoring Operator Image Signatures

The image for the operator and all related images it deploys are signed by NetApp. You can manually verify the images before installation using the cosign tool, or configure a Kubernetes admission controller. For more details please see the Kubernetes documentation.

The public key used to verify the image signatures is available in the Monitoring Operator install tile under Optional: Upload the operator images to your private repository > Image Signature Public Key

To manually verify an image signature, perform the following steps:

Copy and run the Image Pull Snippet
Copy and enter the Repository Password when prompted
Store the Image Signature Public Key (dii-image-signing.pub in the example)
Verify the images using cosign. Refer to the following example of cosign usage

$ cosign verify --key dii-image-signing.pub --insecure-ignore-sct --insecure-ignore-tlog <repository>/<image>:<tag>
Verification for <repository>/<image>:<tag> --
The following checks were performed on each of these signatures:
  - The cosign claims were validated
  - The signatures were verified against the specified public key
[{"critical":{"identity":{"docker-reference":"<repository>/<image>"},"image":{"docker-manifest-digest":"sha256:<hash>"},"type":"cosign container image signature"},"optional":null}]

Troubleshooting

Some things to try if you encounter problems setting up the Kubernetes Monitoring Operator:

Problem: Try this:

Problem:	Try this:
I do not see a hyperlink/connection between my Kubernetes Persistent Volume and the corresponding back-end storage device. My Kubernetes Persistent Volume is configured using the hostname of the storage server.	Follow the steps to uninstall the existing Telegraf agent, then re-install the latest Telegraf agent. You must be using Telegraf version 2.0 or later, and your Kubernetes cluster storage must be actively monitored by Data Infrastructure Insights.
I'm seeing messages in the logs resembling the following: E0901 15:21:39.962145 1 reflector.go:178] k8s.io/kube-state-metrics/internal/store/builder.go:352: Failed to list v1.MutatingWebhookConfiguration: the server could not find the requested resource E0901 15:21:43.168161 1 reflector.go:178] k8s.io/kube-state-metrics/internal/store/builder.go:352: Failed to list v1.Lease: the server could not find the requested resource (get leases.coordination.k8s.io) etc.	These messages may occur if you are running kube-state-metrics version 2.0.0 or above with Kubernetes versions below 1.20. To get the Kubernetes version: kubectl version To get the kube-state-metrics version: kubectl get deploy/kube-state-metrics -o jsonpath='{..image}' To prevent these messages from happening, users can modify their kube-state-metrics deployment to disable the following Leases: mutatingwebhookconfigurations validatingwebhookconfigurations volumeattachments resources More specifically, they can use the following CLI argument: resources=certificatesigningrequests,configmaps,cronjobs,daemonsets, deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,limitranges, namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes, poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas, secrets,services,statefulsets,storageclasses The default resource list is: "certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments, endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges, mutatingwebhookconfigurations,namespaces,networkpolicies,nodes, persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets, replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses, validatingwebhookconfigurations,volumeattachments"
I see error messages from Telegraf resembling the following, but Telegraf does start up and run: Oct 11 14:23:41 ip-172-31-39-47 systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB. Oct 11 14:23:41 ip-172-31-39-47 telegraf[1827]: time="2021-10-11T14:23:41Z" level=error msg="failed to create cache directory. /etc/telegraf/.cache/snowflake, err: mkdir /etc/telegraf/.ca che: permission denied. ignored\n" func="gosnowflake.(defaultLogger).Errorf" file="log.go:120" Oct 11 14:23:41 ip-172-31-39-47 telegraf[1827]: time="2021-10-11T14:23:41Z" level=error msg="failed to open. Ignored. open /etc/telegraf/.cache/snowflake/ocsp_response_cache.json: no such file or directory\n" func="gosnowflake.(defaultLogger).Errorf" file="log.go:120" Oct 11 14:23:41 ip-172-31-39-47 telegraf[1827]: 2021-10-11T14:23:41Z I! Starting Telegraf 1.19.3	This is a known issue. Refer to This GitHub article for more details. As long as Telegraf is up and running, users can ignore these error messages.
On Kubernetes, my Telegraf pod(s) are reporting the following error: "Error in processing mountstats info: failed to open mountstats file: /hostfs/proc/1/mountstats, error: open /hostfs/proc/1/mountstats: permission denied"	If SELinux is enabled and enforcing, it is likely preventing the Telegraf pod(s) from accessing the /proc/1/mountstats file on the Kubernetes node. To overcome this restriction, edit the agentconfiguration, and enable the runPrivileged setting. For more details, refer to the OpenShift Instructions.
On Kubernetes, my Telegraf ReplicaSet pod is reporting the following error: [inputs.prometheus] Error in plugin: could not load keypair /etc/kubernetes/pki/etcd/server.crt:/etc/kubernetes/pki/etcd/server.key: open /etc/kubernetes/pki/etcd/server.crt: no such file or directory	The Telegraf ReplicaSet pod is intended to run on a node designated as a master or for etcd. If the ReplicaSet pod is not running on one of these nodes, you will get these errors. Check to see if your master/etcd nodes have taints on them. If they do, add the necessary tolerations to the Telegraf ReplicaSet, telegraf-rs. For example, edit the ReplicaSet… kubectl edit rs telegraf-rs …and add the appropriate tolerations to the spec. Then, restart the ReplicaSet pod.
I have a PSP/PSA environment. Does this affect my monitoring operator?	If your Kubernetes cluster is running with Pod Security Policy (PSP) or Pod Security Admission (PSA) in place, you must upgrade to the latest Kubernetes Monitoring Operator. Follow these steps to upgrade to the current Operator with support for PSP/PSA: 1. Uninstall the previous monitoring operator: kubectl delete agent agent-monitoring-netapp -n netapp-monitoring kubectl delete ns netapp-monitoring kubectl delete crd agents.monitoring.netapp.com kubectl delete clusterrole agent-manager-role agent-proxy-role agent-metrics-reader kubectl delete clusterrolebinding agent-manager-rolebinding agent-proxy-rolebinding agent-cluster-admin-rolebinding 2. Install the latest version of the monitoring operator.
I ran into issues trying to deploy the Operator, and I have PSP/PSA in use.	1. Edit the agent using the following command: kubectl -n <name-space> edit agent 2. Mark 'security-policy-enabled' as 'false'. This will disable Pod Security Policies and Pod Security Admission and allow the Operator to deploy. Confirm by using the following commands: kubectl get psp (should show Pod Security Policy removed) kubectl get all -n <namespace> \| grep -i psp (should show that nothing is found)
"ImagePullBackoff" errors seen	These errors may be seen if you have a custom or private docker repository and have not yet configured the Kubernetes Monitoring Operator to properly recognize it. Read more about configuring for custom/private repo.
I am having an issue with my monitoring-operator deployment, and the current documentation does not help me resolve it.	Capture or otherwise note the output from the following commands, and contact the Technical Support team. kubectl -n netapp-monitoring get all kubectl -n netapp-monitoring describe all kubectl -n netapp-monitoring logs <monitoring-operator-pod> --all-containers=true kubectl -n netapp-monitoring logs <telegraf-pod> --all-containers=true
net-observer (Workload Map) pods in Operator namespace are in CrashLoopBackOff	These pods correspond to Workload Map data collector for Network Observability. Try these: • Check the logs of one of the pods to confirm minimum kernel version. For example: ---- {"ci-tenant-id":"your-tenant-id","collector-cluster":"your-k8s-cluster-name","environment":"prod","level":"error","msg":"failed in validation. Reason: kernel version 3.10.0 is less than minimum kernel version of 4.18.0","time":"2022-11-09T08:23:08Z"} ---- • Net-observer pods requires the Linux kernel version to be at least 4.18.0. Check the kernel version using the command “uname -r” and ensure they are >= 4.18.0
Pods are running in Operator namespace (default: netapp-monitoring), but no data is shown in UI for workload map or Kubernetes metrics in Queries	Check the time setting on the nodes of the K8S cluster. For accurate audit and data reporting, it is strongly recommended to synchronize the time on the Agent machine using Network Time Protocol (NTP) or Simple Network Time Protocol (SNTP).
Some of the net-observer pods in Operator namespace are in Pending state	Net-observer is a DaemonSet and runs a pod in each Node of the k8s cluster. • Note the pod which is in Pending state, and check if it is experiencing a resource issue for CPU or memory. Ensure the required memory and CPU is available in the node.
I’m seeing the following in my logs immediately after installing the Kubernetes Monitoring Operator: [inputs.prometheus] Error in plugin: error making HTTP request to http://kube-state-metrics.<namespace>.svc.cluster.local:8080/metrics: Get http://kube-state-metrics.<namespace>.svc.cluster.local:8080/metrics: dial tcp: lookup kube-state-metrics.<namespace>.svc.cluster.local: no such host	This message is typically only seen when a new operator is installed and the telegraf-rs pod is up before the ksm pod is up. These messages should stop once all pods are running.
I do see not any metrics being collected for the Kubernetes CronJobs that exist in my cluster.	Verify your Kubernetes version (i.e. `kubectl version`). If it is v1.20.x or below, this is an expected limitation. The kube-state-metrics release deployed with the Kubernetes Monitoring Operator only supports v1.CronJob. With Kubernetes 1.20.x and below, the CronJob resource is at v1beta.CronJob. As a result, kube-state-metrics cannot find the CronJob resource.
After installing the operator, the telegraf-ds pods enter CrashLoopBackOff and the pod logs indicate "su: Authentication failure".	Edit the telegraf section in AgentConfiguration, and set dockerMetricCollectionEnabled to false. For more details, refer to the operator's configuration options. … spec: … telegraf: … - name: docker run-mode: - DaemonSet substitutions: - key: DOCKER_UNIX_SOCK_PLACEHOLDER value: unix:///run/docker.sock … …
I see repeating error messages resembling the following in my Telegraf logs: E! [agent] Error writing to outputs.http: Post "https://<tenant_url>/rest/v1/lake/ingest/influxdb": context deadline exceeded (Client.Timeout exceeded while awaiting headers)	Edit the telegraf section in AgentConfiguration, and increase outputTimeout to 10s. For more details, refer to the operator's configuration options.
I'm missing involvedobject data for some Event Logs.	Be sure you have followed the steps in the Permissions section above.
Why am I seeing two monitoring operator pods running, one named netapp-ci-monitoring-operator-<pod> and the other named monitoring-operator-<pod>?	As of October 12, 2023, Data Infrastructure Insights has refactored the operator to better serve our users; for those changes to be fully adopted, you must remove the old operator and install the new one.
My kubernetes events unexpectedly stopped reporting to Data Infrastructure Insights.	Retrieve the name of the event-exporter pod: `kubectl -n netapp-monitoring get pods \|grep event-exporter \|awk '{print $1}' \|sed 's/event-exporter./event-exporter/'` It should be either "netapp-ci-event-exporter" or "event-exporter". Next, edit the monitoring agent `kubectl -n netapp-monitoring edit agent`, and set the value for LOG_FILE to reflect the appropriate event-exporter pod name found in the previous step. More specifically, LOG_FILE should be set to either "/var/log/containers/netapp-ci-event-exporter.log" or "/var/log/containers/event-exporter.log" fluent-bit: ... - name: event-exporter-ci substitutions: - key: LOG_FILE values: - /var/log/containers/netapp-ci-event-exporter.log ... Alternatively, one can also uninstall and reinstall the agent.
I'm seeing pod(s) deployed by the Kubernetes Monitoring Operator crash because of insufficient resources.	Refer to the Kubernetes Monitoring Operator configuration options to increase the CPU and/or memory limits as needed.
A missing image or invalid configuration caused the netapp-ci-kube-state-metrics pods to fail to startup or become ready. Now the StatefulSet is stuck and configuration changes are not being applied to the netapp-ci-kube-state-metrics pods.	The StatefulSet is in a broken state. After fixing any configuration problems bounce the netapp-ci-kube-state-metrics pods.
netapp-ci-kube-state-metrics pods fail to start after running a Kubernetes Operator upgrade, throwing ErrImagePull (failing to pull the image).	Try resetting the pods manually.
"Event discarded as being older then maxEventAgeSeconds" messages are being observed for my Kubernetes cluster under Log Analysis.	Modify the Operator agentconfiguration and increase the event-exporter-maxEventAgeSeconds (i.e. to 60s), event-exporter-kubeQPS (i.e. to 100), and event-exporter-kubeBurst (i.e. to 500). For more details on these configuration options, see the configuration options page.
Telegraf warns of, or crashes because of, insufficient lockable memory.	Try increasing the limit of lockable memory for Telegraf in the underlying operating system/node. If increasing the limit is not an option, modify the NKMO agentconfiguration and set unprotected to true. This will instruct Telegraf to no attempt to reserve locked memory pages. While this can pose a security risk as decrypted secrets might be swapped out to disk, it allows for execution in environments where reserving locked memory is not possible. For more details on the unprotected configuration options, refer to the configuration options page.
I see warning messages from Telegraf resembling the following: W! [inputs.diskio] Unable to gather disk name for "vdc": error reading /dev/vdc: no such file or directory	For the Kubernetes Monitoring Operator, these warning message are benign and can be safely ignored. Alternatively, edit the telegraf section in AgentConfiguration, and set runDsPrivileged to true. For more details, refer to the operator's configuration options.
My fluent-bit pod is failing with the following errors: [2024/10/16 14:16:23] [error] [/src/fluent-bit/plugins/in_tail/tail_fs_inotify.c:360 errno=24] Too many open files [2024/10/16 14:16:23] [error] failed initialize input tail.0 [2024/10/16 14:16:23] [error] [engine] input initialization failed	Try to change your fsnotify settings in your cluster: sudo sysctl fs.inotify.max_user_instances (take note of setting) sudo sysctl fs.inotify.max_user_instances=<something larger than current setting> sudo sysctl fs.inotify.max_user_watches (take note of setting) sudo sysctl fs.inotify.max_user_watches=<something larger than current setting> Restart Fluent-bit. Note: to make these settings persistent across node restarts, you need to put the following lines in /etc/sysctl.conf fs.inotify.max_user_instances=<something larger than current setting> fs.inotify.max_user_watches=<something larger than current setting>

I do not see a hyperlink/connection between my Kubernetes Persistent Volume and the corresponding back-end storage device. My Kubernetes Persistent Volume is configured using the hostname of the storage server.

Follow the steps to uninstall the existing Telegraf agent, then re-install the latest Telegraf agent. You must be using Telegraf version 2.0 or later, and your Kubernetes cluster storage must be actively monitored by Data Infrastructure Insights.

I'm seeing messages in the logs resembling the following:

E0901 15:21:39.962145 1 reflector.go:178] k8s.io/kube-state-metrics/internal/store/builder.go:352: Failed to list *v1.MutatingWebhookConfiguration: the server could not find the requested resource
E0901 15:21:43.168161 1 reflector.go:178] k8s.io/kube-state-metrics/internal/store/builder.go:352: Failed to list *v1.Lease: the server could not find the requested resource (get leases.coordination.k8s.io)
etc.

These messages may occur if you are running kube-state-metrics version 2.0.0 or above with Kubernetes versions below 1.20.

To get the Kubernetes version:

kubectl version

To get the kube-state-metrics version:

kubectl get deploy/kube-state-metrics -o jsonpath='{..image}'

To prevent these messages from happening, users can modify their kube-state-metrics deployment to disable the following Leases:

mutatingwebhookconfigurations
validatingwebhookconfigurations
volumeattachments resources

More specifically, they can use the following CLI argument:

resources=certificatesigningrequests,configmaps,cronjobs,daemonsets, deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,limitranges, namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes, poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas, secrets,services,statefulsets,storageclasses

The default resource list is:

"certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments, endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges, mutatingwebhookconfigurations,namespaces,networkpolicies,nodes, persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets, replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses, validatingwebhookconfigurations,volumeattachments"

I see error messages from Telegraf resembling the following, but Telegraf does start up and run:

Oct 11 14:23:41 ip-172-31-39-47 systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
Oct 11 14:23:41 ip-172-31-39-47 telegraf[1827]: time="2021-10-11T14:23:41Z" level=error msg="failed to create cache directory. /etc/telegraf/.cache/snowflake, err: mkdir /etc/telegraf/.ca
che: permission denied. ignored\n" func="gosnowflake.(*defaultLogger).Errorf" file="log.go:120"
Oct 11 14:23:41 ip-172-31-39-47 telegraf[1827]: time="2021-10-11T14:23:41Z" level=error msg="failed to open. Ignored. open /etc/telegraf/.cache/snowflake/ocsp_response_cache.json: no such
file or directory\n" func="gosnowflake.(*defaultLogger).Errorf" file="log.go:120"
Oct 11 14:23:41 ip-172-31-39-47 telegraf[1827]: 2021-10-11T14:23:41Z I! Starting Telegraf 1.19.3

This is a known issue. Refer to This GitHub article for more details. As long as Telegraf is up and running, users can ignore these error messages.

On Kubernetes, my Telegraf pod(s) are reporting the following error:
"Error in processing mountstats info: failed to open mountstats file: /hostfs/proc/1/mountstats, error: open /hostfs/proc/1/mountstats: permission denied"

If SELinux is enabled and enforcing, it is likely preventing the Telegraf pod(s) from accessing the /proc/1/mountstats file on the Kubernetes node. To overcome this restriction, edit the agentconfiguration, and enable the runPrivileged setting. For more details, refer to the OpenShift Instructions.

On Kubernetes, my Telegraf ReplicaSet pod is reporting the following error:

[inputs.prometheus] Error in plugin: could not load keypair /etc/kubernetes/pki/etcd/server.crt:/etc/kubernetes/pki/etcd/server.key: open /etc/kubernetes/pki/etcd/server.crt: no such file or directory

The Telegraf ReplicaSet pod is intended to run on a node designated as a master or for etcd. If the ReplicaSet pod is not running on one of these nodes, you will get these errors. Check to see if your master/etcd nodes have taints on them. If they do, add the necessary tolerations to the Telegraf ReplicaSet, telegraf-rs.

For example, edit the ReplicaSet…

kubectl edit rs telegraf-rs

…and add the appropriate tolerations to the spec. Then, restart the ReplicaSet pod.

I have a PSP/PSA environment. Does this affect my monitoring operator?

If your Kubernetes cluster is running with Pod Security Policy (PSP) or Pod Security Admission (PSA) in place, you must upgrade to the latest Kubernetes Monitoring Operator. Follow these steps to upgrade to the current Operator with support for PSP/PSA:

1. Uninstall the previous monitoring operator:

kubectl delete agent agent-monitoring-netapp -n netapp-monitoring
kubectl delete ns netapp-monitoring
kubectl delete crd agents.monitoring.netapp.com
kubectl delete clusterrole agent-manager-role agent-proxy-role agent-metrics-reader
kubectl delete clusterrolebinding agent-manager-rolebinding agent-proxy-rolebinding agent-cluster-admin-rolebinding

2. Install the latest version of the monitoring operator.

I ran into issues trying to deploy the Operator, and I have PSP/PSA in use.

1. Edit the agent using the following command:

kubectl -n <name-space> edit agent

2. Mark 'security-policy-enabled' as 'false'. This will disable Pod Security Policies and Pod Security Admission and allow the Operator to deploy. Confirm by using the following commands:

kubectl get psp (should show Pod Security Policy removed)
kubectl get all -n <namespace> | grep -i psp (should show that nothing is found)

"ImagePullBackoff" errors seen

These errors may be seen if you have a custom or private docker repository and have not yet configured the Kubernetes Monitoring Operator to properly recognize it. Read more about configuring for custom/private repo.

I am having an issue with my monitoring-operator deployment, and the current documentation does not help me resolve it.

Capture or otherwise note the output from the following commands, and contact the Technical Support team.

 kubectl -n netapp-monitoring get all
 kubectl -n netapp-monitoring describe all
 kubectl -n netapp-monitoring logs <monitoring-operator-pod> --all-containers=true
 kubectl -n netapp-monitoring logs <telegraf-pod> --all-containers=true

net-observer (Workload Map) pods in Operator namespace are in CrashLoopBackOff

These pods correspond to Workload Map data collector for Network Observability. Try these:
• Check the logs of one of the pods to confirm minimum kernel version. For example:

----
{"ci-tenant-id":"your-tenant-id","collector-cluster":"your-k8s-cluster-name","environment":"prod","level":"error","msg":"failed in validation. Reason: kernel version 3.10.0 is less than minimum kernel version of 4.18.0","time":"2022-11-09T08:23:08Z"}
----

• Net-observer pods requires the Linux kernel version to be at least 4.18.0. Check the kernel version using the command “uname -r” and ensure they are >= 4.18.0

Pods are running in Operator namespace (default: netapp-monitoring), but no data is shown in UI for workload map or Kubernetes metrics in Queries

Check the time setting on the nodes of the K8S cluster. For accurate audit and data reporting, it is strongly recommended to synchronize the time on the Agent machine using Network Time Protocol (NTP) or Simple Network Time Protocol (SNTP).

Some of the net-observer pods in Operator namespace are in Pending state

Net-observer is a DaemonSet and runs a pod in each Node of the k8s cluster.
• Note the pod which is in Pending state, and check if it is experiencing a resource issue for CPU or memory. Ensure the required memory and CPU is available in the node.

I’m seeing the following in my logs immediately after installing the Kubernetes Monitoring Operator:

[inputs.prometheus] Error in plugin: error making HTTP request to http://kube-state-metrics.<namespace>.svc.cluster.local:8080/metrics: Get http://kube-state-metrics.<namespace>.svc.cluster.local:8080/metrics: dial tcp: lookup kube-state-metrics.<namespace>.svc.cluster.local: no such host

This message is typically only seen when a new operator is installed and the telegraf-rs pod is up before the ksm pod is up. These messages should stop once all pods are running.

I do see not any metrics being collected for the Kubernetes CronJobs that exist in my cluster.

Verify your Kubernetes version (i.e. kubectl version). If it is v1.20.x or below, this is an expected limitation. The kube-state-metrics release deployed with the Kubernetes Monitoring Operator only supports v1.CronJob. With Kubernetes 1.20.x and below, the CronJob resource is at v1beta.CronJob. As a result, kube-state-metrics cannot find the CronJob resource.

After installing the operator, the telegraf-ds pods enter CrashLoopBackOff and the pod logs indicate "su: Authentication failure".

Edit the telegraf section in AgentConfiguration, and set dockerMetricCollectionEnabled to false. For more details, refer to the operator's configuration options.

…
spec:
…
telegraf:
…
- name: docker
run-mode:
- DaemonSet
substitutions:
- key: DOCKER_UNIX_SOCK_PLACEHOLDER
value: unix:///run/docker.sock
…
…

I see repeating error messages resembling the following in my Telegraf logs:

E! [agent] Error writing to outputs.http: Post "https://<tenant_url>/rest/v1/lake/ingest/influxdb": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Edit the telegraf section in AgentConfiguration, and increase outputTimeout to 10s. For more details, refer to the operator's configuration options.

I'm missing involvedobject data for some Event Logs.

Be sure you have followed the steps in the Permissions section above.

Why am I seeing two monitoring operator pods running, one named netapp-ci-monitoring-operator-<pod> and the other named monitoring-operator-<pod>?

As of October 12, 2023, Data Infrastructure Insights has refactored the operator to better serve our users; for those changes to be fully adopted, you must remove the old operator and install the new one.

My kubernetes events unexpectedly stopped reporting to Data Infrastructure Insights.

Retrieve the name of the event-exporter pod:

`kubectl -n netapp-monitoring get pods |grep event-exporter |awk '{print $1}' |sed 's/event-exporter./event-exporter/'`

It should be either "netapp-ci-event-exporter" or "event-exporter". Next, edit the monitoring agent kubectl -n netapp-monitoring edit agent, and set the value for LOG_FILE to reflect the appropriate event-exporter pod name found in the previous step. More specifically, LOG_FILE should be set to either "/var/log/containers/netapp-ci-event-exporter.log" or "/var/log/containers/event-exporter*.log"

fluent-bit:
...
- name: event-exporter-ci
  substitutions:
  - key: LOG_FILE
    values:
    - /var/log/containers/netapp-ci-event-exporter*.log
...

Alternatively, one can also uninstall and reinstall the agent.

I'm seeing pod(s) deployed by the Kubernetes Monitoring Operator crash because of insufficient resources.

Refer to the Kubernetes Monitoring Operator configuration options to increase the CPU and/or memory limits as needed.

A missing image or invalid configuration caused the netapp-ci-kube-state-metrics pods to fail to startup or become ready. Now the StatefulSet is stuck and configuration changes are not being applied to the netapp-ci-kube-state-metrics pods.

The StatefulSet is in a broken state. After fixing any configuration problems bounce the netapp-ci-kube-state-metrics pods.

netapp-ci-kube-state-metrics pods fail to start after running a Kubernetes Operator upgrade, throwing ErrImagePull (failing to pull the image).

Try resetting the pods manually.

"Event discarded as being older then maxEventAgeSeconds" messages are being observed for my Kubernetes cluster under Log Analysis.

Modify the Operator agentconfiguration and increase the event-exporter-maxEventAgeSeconds (i.e. to 60s), event-exporter-kubeQPS (i.e. to 100), and event-exporter-kubeBurst (i.e. to 500). For more details on these configuration options, see the configuration options page.

Telegraf warns of, or crashes because of, insufficient lockable memory.

Try increasing the limit of lockable memory for Telegraf in the underlying operating system/node. If increasing the limit is not an option, modify the NKMO agentconfiguration and set unprotected to true. This will instruct Telegraf to no attempt to reserve locked memory pages. While this can pose a security risk as decrypted secrets might be swapped out to disk, it allows for execution in environments where reserving locked memory is not possible. For more details on the unprotected configuration options, refer to the configuration options page.

I see warning messages from Telegraf resembling the following:

W! [inputs.diskio] Unable to gather disk name for "vdc": error reading /dev/vdc: no such file or directory

For the Kubernetes Monitoring Operator, these warning message are benign and can be safely ignored. Alternatively, edit the telegraf section in AgentConfiguration, and set runDsPrivileged to true. For more details, refer to the operator's configuration options.

My fluent-bit pod is failing with the following errors:

[2024/10/16 14:16:23] [error] [/src/fluent-bit/plugins/in_tail/tail_fs_inotify.c:360 errno=24] Too many open files
[2024/10/16 14:16:23] [error] failed initialize input tail.0
[2024/10/16 14:16:23] [error] [engine] input initialization failed

Try to change your fsnotify settings in your cluster:

 sudo sysctl fs.inotify.max_user_instances (take note of setting)

 sudo sysctl fs.inotify.max_user_instances=<something larger than current setting>

 sudo sysctl fs.inotify.max_user_watches (take note of setting)

 sudo sysctl fs.inotify.max_user_watches=<something larger than current setting>

Restart Fluent-bit.

Note: to make these settings persistent across node restarts, you need to put the following lines in /etc/sysctl.conf

 fs.inotify.max_user_instances=<something larger than current setting>
 fs.inotify.max_user_watches=<something larger than current setting>

Additional information may be found from the Support page or in the Data Collector Support Matrix.