Confluent Kafka certification

We certified Confluent Platform for Kafka tiered storage with NetApp StorageGRID. Tiered storage separates data processing from storage management, which reduces operational burden and cost and lets you scale brokers based on their computational requirements. Tiered storage moves warm data to cost-effective object storage. The NetApp and Confluent teams worked on this certification together and ran the test cases required for certification.

Confluent Kafka setup

We used the following setup for the certification.

In this certification, we used three ZooKeeper servers, five brokers, and five tools servers with 256GB of RAM and 16 CPUs. For NetApp storage, we used StorageGRID with an SG1000 load balancer and four SG6024s. The storage and brokers were connected through 100GbE connections.

The following figure shows the network topology of the configuration used for the Confluent Kafka certification.

(Figure: network topology of the Confluent Kafka certification configuration)

The tools servers act as application clients that send Kafka requests to the Kafka brokers.

Confluent tiered storage configuration

The tiered storage configuration requires the following parameters in Kafka:

confluent.tier.archiver.num.threads=16
confluent.tier.fetcher.num.threads=32
confluent.tier.enable=true
confluent.tier.feature=true
confluent.tier.backend=S3
confluent.tier.s3.bucket=kafkasgdbucket1-2
confluent.tier.s3.region=us-west-2
confluent.tier.s3.cred.file.path=/data/kafka/.ssh/credentials
confluent.tier.s3.aws.endpoint.override=http://kafkasgd.rtpppe.netapp.com:10444/
confluent.tier.s3.force.path.style.access=true

For the certification, we used StorageGRID with the HTTP protocol, but HTTPS also works. The access key and secret key are stored in the file specified by the confluent.tier.s3.cred.file.path parameter.
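
Before enabling tiering, it can help to confirm that the brokers' S3 settings are usable. The following is a minimal sketch in Python with boto3, assuming the endpoint, region, and bucket shown above; the access and secret keys are placeholders for the values kept in the credentials file.

import boto3
from botocore.client import Config

# Pre-flight check: the endpoint, region, bucket, and path-style
# addressing mirror the broker properties above. The placeholder keys
# stand in for the values in the credentials file.
s3 = boto3.client(
    "s3",
    endpoint_url="http://kafkasgd.rtpppe.netapp.com:10444/",
    region_name="us-west-2",
    aws_access_key_id="<access-key>",
    aws_secret_access_key="<secret-key>",
    config=Config(s3={"addressing_style": "path"}),
)

# Raises a ClientError if the bucket is unreachable or the keys are wrong.
s3.head_bucket(Bucket="kafkasgdbucket1-2")
print("Tiering bucket is reachable")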

NetApp storage controller – StorageGrid

We used a single-site StorageGRID configuration for the certification.

(Figure: StorageGRID single-site configuration)

Certification tests

We completed the following five test cases for the certification. The first two were functionality tests, and the remaining three were performance tests.

Object store correctness test

This test performs basic operations such as get, put, and delete on the object store used for tiered storage through API calls.
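
A minimal sketch of that round trip with boto3 might look like the following; the object key and credentials are placeholders.

import boto3
from botocore.client import Config

s3 = boto3.client(
    "s3",
    endpoint_url="http://kafkasgd.rtpppe.netapp.com:10444/",
    region_name="us-west-2",
    aws_access_key_id="<access-key>",
    aws_secret_access_key="<secret-key>",
    config=Config(s3={"addressing_style": "path"}),
)

# Put, get, and delete a small object, verifying the payload round-trips.
key = "correctness-test-object"
s3.put_object(Bucket="kafkasgdbucket1-2", Key=key, Body=b"payload")
body = s3.get_object(Bucket="kafkasgdbucket1-2", Key=key)["Body"].read()
assert body == b"payload"
s3.delete_object(Bucket="kafkasgdbucket1-2", Key=key)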

Tiering functionality correctness test

This test checks the end-to-end functionality of the object storage. It creates a topic, produces an event stream to the newly created topic, waits for the brokers to archive the segments to the object storage, consumes the event stream, and validates that the consumed stream matches the produced stream. We performed this test with and without object-store fault injection: we simulated node failure by stopping the service manager service on one of the StorageGRID nodes and validated that the end-to-end functionality still worked with the object storage.
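
The sketch below shows the shape of such a test using the confluent-kafka Python client. The broker address, topic, and message count are hypothetical, and it assumes a freshly created single-partition topic so that ordering is preserved; the actual test also waits for the brokers to archive segments to StorageGRID before consuming.

from confluent_kafka import Producer, Consumer

BOOTSTRAP = "broker1:9092"          # hypothetical broker address
TOPIC = "tiering-correctness-test"  # hypothetical single-partition topic
COUNT = 10000

# Produce a deterministic event stream to the topic.
producer = Producer({"bootstrap.servers": BOOTSTRAP})
for i in range(COUNT):
    producer.produce(TOPIC, value=f"event-{i}".encode())
producer.flush()

# (The real test waits here until the brokers archive segments to StorageGRID.)

# Consume from the beginning and validate against what was produced.
consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": "tiering-validator",
    "auto.offset.reset": "earliest",
})
consumer.subscribe([TOPIC])
received = []
while len(received) < COUNT:
    msg = consumer.poll(timeout=10.0)
    if msg is None or msg.error():
        continue
    received.append(msg.value().decode())
consumer.close()

assert received == [f"event-{i}" for i in range(COUNT)]
print("Consumed stream matches produced stream")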

Tier fetch benchmark

This test validated the read performance of the tiered object storage and checked range-fetch read requests under heavy load from segments generated by the benchmark. In this benchmark, Confluent developed custom clients to serve the tier-fetch requests.
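
The certification used Confluent's custom clients, but the kind of request the benchmark exercises is an ordinary S3 ranged read of an archived segment. A sketch with boto3 follows, with a placeholder segment key and byte range.

import boto3
from botocore.client import Config

s3 = boto3.client(
    "s3",
    endpoint_url="http://kafkasgd.rtpppe.netapp.com:10444/",
    region_name="us-west-2",
    aws_access_key_id="<access-key>",
    aws_secret_access_key="<secret-key>",
    config=Config(s3={"addressing_style": "path"}),
)

# Fetch only the first 1MiB of a tiered segment (HTTP Range request)
# instead of downloading the whole object.
resp = s3.get_object(
    Bucket="kafkasgdbucket1-2",
    Key="archived-segment-000001",  # placeholder key
    Range="bytes=0-1048575",
)
chunk = resp["Body"].read()
print(f"Fetched {len(chunk)} bytes")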

Produce-consume workload generator

This test indirectly generated write workload on the object store through the archival of segments. The read workload (segments read) was generated from the object storage when consumer groups fetched the segments. This workload was generated by a TOCC script. This test checked the performance of reads and writes on the object storage in parallel threads. We tested with and without object-store fault injection, as we did for the tiering functionality correctness test.
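
We do not reproduce the TOCC script here, but a rough sketch of a mixed read/write workload of this shape against the tiering bucket, with illustrative object sizes and thread counts, could look like this:

import concurrent.futures
import boto3
from botocore.client import Config

s3 = boto3.client(
    "s3",
    endpoint_url="http://kafkasgd.rtpppe.netapp.com:10444/",
    region_name="us-west-2",
    aws_access_key_id="<access-key>",
    aws_secret_access_key="<secret-key>",
    config=Config(s3={"addressing_style": "path"}),
)
BUCKET = "kafkasgdbucket1-2"
PAYLOAD = b"x" * (1024 * 1024)  # 1MiB stand-in for a log segment

def write(i):
    # Archival-style write, as a broker tiering a segment would issue.
    s3.put_object(Bucket=BUCKET, Key=f"segment-{i}", Body=PAYLOAD)

def read(i):
    # Consumer-style read of a previously archived segment.
    return s3.get_object(Bucket=BUCKET, Key=f"segment-{i}")["Body"].read()

with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    # Seed some objects so reads have something to fetch.
    list(pool.map(write, range(32)))
    # Then drive writes and reads concurrently across the thread pool.
    futures = [pool.submit(write, 32 + i) for i in range(32)]
    futures += [pool.submit(read, i) for i in range(32)]
    for f in concurrent.futures.as_completed(futures):
        f.result()  # re-raise any failure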

Retention workload generator

This test checked the deletion performance of the object storage under a heavy topic-retention workload. The retention workload was generated using a TOCC script that produced a large number of messages in parallel to a test topic. The test topic was configured with aggressive size-based and time-based retention settings that caused the event stream to be continuously purged from the object store as the segments were archived. This led to a large number of deletions in the object storage by the broker, and we collected the performance of the object-store delete operations.
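
For illustration, retention settings of the kind described here could be applied when creating the test topic with the confluent-kafka admin client; the topic name and the specific values below are hypothetical.

from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "broker1:9092"})  # hypothetical broker

# Aggressive size- and time-based retention keeps purging the event
# stream, which drives delete traffic against the object store.
topic = NewTopic(
    "retention-workload-test",
    num_partitions=8,
    replication_factor=3,
    config={
        "confluent.tier.enable": "true",  # tier this topic's segments
        "retention.ms": "300000",         # delete data older than 5 minutes
        "retention.bytes": "1073741824",  # cap each partition at 1GiB
        "segment.bytes": "104857600",     # roll segments at 100MiB so they archive quickly
    },
)
futures = admin.create_topics([topic])
futures["retention-workload-test"].result()  # raises on failure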