Why NetApp NFS for Kafka workloads?

09/15/2025 Contributors

Now that there is a solution for the silly rename issue in NFS storage with Kafka, you can create robust deployments that leverage NetApp ONTAP storage for your Kafka workload. Not only does this significantly reduce operational overhead, it also brings the following benefits to your Kafka clusters:

Reduced CPU utilization on Kafka brokers. Using disaggregated NetApp ONTAP storage separates disk I/O operations from the broker and thus reduces its CPU footprint.
Faster broker recovery-time. Since disaggregated NetApp ONTAP storage is shared across Kafka broker nodes, a new compute instance can replace a bad broker at any point in a fraction of the time compared to conventional Kafka deployments without rebuilding the data.
Storage efficiency. As the storage layer of the application is now provisioned through NetApp ONTAP, customers can avail all the benefits of storage efficiency that comes with ONTAP, such as in-line data compression, deduplication, and compaction.

These benefits were tested and validated in test cases that we discuss in detail in this section.

Reduced CPU utilization on Kafka broker

We discovered that overall CPU utilization is lower than its DAS counterpart when we ran similar workloads on two sperate Kafka clusters that were identical in their technical specifications but differed in their storage technologies. Not only is the overall CPU utilization lower when Kafka cluster is using ONTAP storage, but the increase in the CPU utilization demonstrated a gentler gradient than in a DAS-based Kafka cluster.

Architectural setup

The following table shows the environmental configuration used to demonstrate reduced CPU utilization.

Platform component	Environment configuration
Kafka 3.2.3 Benchmarking tool: OpenMessaging	3 x zookeepers – t2.small 3 x broker servers – i3en.2xlarge 1 x Grafana – c5n.2xlarge 4 x Producer/Consumer — c5n.2xlarge
Operating system on all nodes	RHEL 8.7 or later
NetApp Cloud Volumes ONTAP instance	Single Node Instance – M5.2xLarge

Platform component

Environment configuration

Kafka 3.2.3
Benchmarking tool: OpenMessaging

3 x zookeepers – t2.small
3 x broker servers – i3en.2xlarge
1 x Grafana – c5n.2xlarge
4 x Producer/Consumer — c5n.2xlarge

Operating system on all nodes

RHEL 8.7 or later

NetApp Cloud Volumes ONTAP instance

Single Node Instance – M5.2xLarge

Benchmarking tool

The benchmarking tool used in this test case is the OpenMessaging framework. OpenMessaging is vendor-neutral and language-independent; it provides industry guidelines for finance, e-commerce, IoT, and big-data; and it helps develop messaging and streaming applications across heterogeneous systems and platforms. The following figure depicts the interaction of OpenMessaging clients with a Kafka cluster.

This image depicts the interaction of OpenMessaging clients with a Kafka cluster.

Compute. We used a three-node Kafka cluster with a three-node zookeeper ensemble running on dedicated servers. Each broker had two NFSv4.1 mount points to a single volume on the NetApp CVO instance through a dedicated LIF.
Monitoring. We used two nodes for a Prometheus-Grafana combination. For generating workloads, we have a separate three-node cluster that can produce to and consume from this Kafka cluster.
Storage. We used a single-node NetApp Cloud volumes ONTAP instance with six 250GB GP2 AWS-EBS volumes mounted on the instance. These volumes were then exposed to the Kafka cluster as six NFSv4.1 volumes through dedicated LIFs.
Configuration. The two configurable elements in this test case were Kafka brokers and OpenMessaging workloads.
- Broker config. The following specifications were selected for the Kafka brokers. We used replication factor of 3 for all measurements, as is highlighted below.

This image depicts the specifications selected for the Kafka brokers.

OpenMessaging benchmark (OMB) workload config. The following specifications were provided. We specified a target producer rate, highlighted below.

This image depicts the specifications selected for the OpenMessaging benchmark workload configuration.

Methodology of testing

Two similar clusters were created, each having its own set of benchmarking cluster swarms.
- Cluster 1. NFS-based Kafka cluster.
- Cluster 2. DAS-based Kafka cluster.

Using an OpenMessaging command, similar workloads were triggered on each cluster.

sudo bin/benchmark --drivers driver-kafka/kafka-group-all.yaml workloads/1-topic-100-partitions-1kb.yaml

The produce rate configuration was increased in four iterations, and CPU utilization was recorded with Grafana. The produce rate was set to the following levels:
- 10,000
- 40,000
- 80,000
- 100,000

Observation

There are two primary benefits of using NetApp NFS storage with Kafka:

You can reduce CPU usage by almost one-third. The overall CPU usage under similar workloads was lower for NFS compared to DAS SSDs; savings range from 5% for lower produce rates to 32% for higher produce rates.
A three-fold reduction in CPU utilization drift at higher produce rates. As expected, there was an upward drift for the increase in CPU utilization as the produce rates were increased. However, CPU utilization on Kafka brokers using DAS went up from 31% for the lower produce rate to 70% for the higher produce rate, a 39% increase. However, with an NFS storage backend, the CPU utilization went up from 26% to 38%, a 12% increase.

This graph depicts the behavior of a DAS-based cluster.

This graph depicts the behavior of an NFS-based cluster.

Also, at 100,000 messages, DAS shows more CPU utilization than an NFS cluster.

This graph depicts the behavior of a DAS-based cluster at 100,000 messages.

This graph depicts the behavior of an NFS-based cluster at 100,000 messages.

Faster broker recovery

We discovered that Kafka brokers recover faster when they are using shared NetApp NFS storage. When a broker crashes in a Kafka cluster, this broker can be replaced by a healthy broker with a same broker ID. Upon performing this test case, we found that, in the case of a DAS-based Kafka cluster, the cluster rebuilds the data on a newly added healthy broker, which is time consuming. In the case of a NetApp NFS-based Kafka cluster, the replacing broker continues to read data from the previous log directory and recovers much faster.