Confluent performance validation
We performed verification of Confluent Platform tiered storage on NetApp ONTAP. The NetApp and Confluent teams worked on this verification together and ran the required test cases.
Confluent setup
For the setup, we used three ZooKeeper servers, five brokers, and five test servers with 256GB of RAM and 16 CPUs. For NetApp storage, we used ONTAP with an AFF A900 HA pair. The storage and the brokers were connected through 100GbE connections.
The following figure shows the network topology of the configuration used for tiered storage verification.
The test servers act as application clients that send events to or receive events from the Confluent nodes.
Confluent tiered storage configuration
We used the following testing parameters:
confluent.tier.fetcher.num.threads=80
confluent.tier.archiver.num.threads=80
confluent.tier.enable=true
confluent.tier.feature=true
confluent.tier.backend=S3
confluent.tier.s3.bucket=kafkabucket1-1
confluent.tier.s3.region=us-east-1
confluent.tier.s3.cred.file.path=/data/kafka/.ssh/credentials
confluent.tier.s3.aws.endpoint.override=http://wle-mendocino-07-08/
confluent.tier.s3.force.path.style.access=true
bootstrap.server=192.168.150.172:9092,192.168.150.120:9092,192.168.150.164:9092,192.168.150.198:9092,192.168.150.109:9092,192.168.150.165:9092,192.168.150.119:9092,192.168.150.133:9092
debug=true
jmx.port=7203
num.partitions=80
num.records=200000000
#object PUT size - 512MB and fetch 100MB – netapp
segment.bytes=536870912
max.partition.fetch.bytes=1048576000
#GET size is max.partition.fetch.bytes/num.partitions
length.key.value=2048
trogdor.agent.nodes=node0,node1,node2,node3,node4
trogdor.coordinator.hostname.port=192.168.150.155:8889
num.producers=20
num.head.consumers=20
num.tail.consumers=1
test.binary.task.max.heap.size=32G
test.binary.task.timeout.sec=3600
producer.timeout.sec=3600
consumer.timeout.sec=3600
For verification, we used ONTAP with the HTTP protocol, but HTTPS also worked. The access key and secret key are stored in the file named by the confluent.tier.s3.cred.file.path parameter.
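As a quick connectivity check, the following is a minimal sketch (in Python with boto3, not part of the verification tooling) of how the endpoint override, bucket, region, path-style access, and credentials file from the configuration above can be exercised with a generic S3 client. The key=value layout of the credentials file and the accessKey/secretKey property names are assumptions made for illustration; refer to the Confluent tiered storage documentation for the exact file format.

import boto3
from botocore.config import Config

def load_credentials(path="/data/kafka/.ssh/credentials"):
    # Assumed key=value layout; the exact format is defined by Confluent.
    creds = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, value = line.split("=", 1)
                creds[key.strip()] = value.strip()
    return creds

creds = load_credentials()
s3 = boto3.client(
    "s3",
    endpoint_url="http://wle-mendocino-07-08/",        # confluent.tier.s3.aws.endpoint.override
    aws_access_key_id=creds.get("accessKey"),          # assumed property name
    aws_secret_access_key=creds.get("secretKey"),      # assumed property name
    region_name="us-east-1",                           # confluent.tier.s3.region
    config=Config(s3={"addressing_style": "path"}),    # confluent.tier.s3.force.path.style.access
)
print(s3.list_objects_v2(Bucket="kafkabucket1-1").get("KeyCount", 0))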
NetApp storage controller – ONTAP
We configured ONTAP as a single HA pair for the verification.
Verification results
We completed the following five test cases for the verification. The first two were functionality tests and the remaining three were performance tests.
Object store correctness test
This test performs basic operations, such as get, put, and delete, on the object store used for tiered storage through API calls.
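The sketch below illustrates, with boto3, the kind of get, put, and delete calls this test exercises against the ONTAP S3 bucket; the actual test is driven by Confluent's verification suite, and the object key and credentials shown here are placeholders.

import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    endpoint_url="http://wle-mendocino-07-08/",
    aws_access_key_id="<access-key>",                  # placeholder
    aws_secret_access_key="<secret-key>",              # placeholder
    config=Config(s3={"addressing_style": "path"}),
)

bucket, key, payload = "kafkabucket1-1", "correctness-check/object-1", b"x" * 1024

s3.put_object(Bucket=bucket, Key=key, Body=payload)                       # put
assert s3.get_object(Bucket=bucket, Key=key)["Body"].read() == payload    # get
s3.delete_object(Bucket=bucket, Key=key)                                  # delete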
Tiering functionality correctness test
This test checks the end-to-end functionality of the object storage. It creates a topic, produces an event stream to the newly created topic, waits for the brokers to archive the segments to the object storage, consumes the event stream, and validates that the consumed stream matches the produced stream. We performed this test both with and without object-store fault injection. We simulated node failure by stopping the service manager service on one of the ONTAP nodes and validated that the end-to-end functionality still works with the object storage.
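The following is a simplified sketch of those end-to-end steps (create a topic, produce a stream, consume it back, and validate the match), written with the confluent-kafka Python client. The topic name, message count, and bootstrap address are placeholders; the verification itself drives these steps through Confluent's tooling, not this script.

from confluent_kafka import Producer, Consumer
from confluent_kafka.admin import AdminClient, NewTopic

BOOTSTRAP = "192.168.150.172:9092"            # one of the brokers listed above
TOPIC, NUM_MESSAGES = "tier-functional-check", 10000

# 1. Create the test topic (80 partitions, matching num.partitions above).
admin = AdminClient({"bootstrap.servers": BOOTSTRAP})
for f in admin.create_topics([NewTopic(TOPIC, num_partitions=80, replication_factor=3)]).values():
    f.result()

# 2. Produce an event stream to the newly created topic.
producer = Producer({"bootstrap.servers": BOOTSTRAP})
produced = [f"event-{i}".encode() for i in range(NUM_MESSAGES)]
for value in produced:
    producer.produce(TOPIC, value=value)
    producer.poll(0)
producer.flush()

# 3. After the brokers have archived segments to the object store, consume the
#    stream back and validate that it matches what was produced.
consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": "tier-functional-check",
    "auto.offset.reset": "earliest",
})
consumer.subscribe([TOPIC])
consumed = []
while len(consumed) < NUM_MESSAGES:
    msg = consumer.poll(timeout=10.0)
    if msg is not None and not msg.error():
        consumed.append(msg.value())
consumer.close()

assert sorted(consumed) == sorted(produced), "consumed stream does not match produced stream"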
Tier fetch benchmark
This test validated the read performance of the tiered object storage and checked the range-fetch read requests under heavy load against segments generated by the benchmark. For this benchmark, Confluent developed custom clients to serve the tier-fetch requests.
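The custom fetch clients used in this benchmark are Confluent's own; as an illustration only, the sketch below shows the underlying operation they exercise, which is issuing concurrent ranged GET requests against archived segment objects. The object key, credentials, and fetch sizes are placeholders.

import boto3
from botocore.config import Config
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client(
    "s3",
    endpoint_url="http://wle-mendocino-07-08/",
    aws_access_key_id="<access-key>",                  # placeholder
    aws_secret_access_key="<secret-key>",              # placeholder
    config=Config(s3={"addressing_style": "path"}),
)

FETCH_SIZE = 100 * 1024 * 1024                         # ~100MB fetches, per the test parameters

def range_fetch(offset):
    # Ranged GET against an archived segment object (placeholder key).
    resp = s3.get_object(
        Bucket="kafkabucket1-1",
        Key="<archived-segment-object-key>",
        Range=f"bytes={offset}-{offset + FETCH_SIZE - 1}",
    )
    return len(resp["Body"].read())

# Issue many concurrent range reads to put the object store under load.
with ThreadPoolExecutor(max_workers=80) as pool:
    fetched = sum(pool.map(range_fetch, range(0, 20 * FETCH_SIZE, FETCH_SIZE)))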
Produce-consume workload generator
This test indirectly generates a write workload on the object store through the archiving of segments. The read workload (segment reads) was generated from the object storage when consumer groups fetched the segments. This workload was generated by a TOCC script. The test checked the performance of reads and writes on the object storage in parallel threads. We tested with and without object-store fault injection, as we did for the tiering functionality correctness test.
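As a rough illustration of what parallel read and write load on the object store looks like from the Kafka side, the sketch below runs concurrent producers and consumer groups with the confluent-kafka Python client; thread counts and record sizes loosely follow the num.producers, num.head.consumers, and length.key.value parameters, and the topic name is a placeholder. The actual workload in the verification is generated by the TOCC script, not this code.

import threading
from confluent_kafka import Producer, Consumer

BOOTSTRAP = "192.168.150.172:9092"
TOPIC = "tier-load-check"                              # placeholder topic

def produce_load(records=100000, value_size=2048):     # length.key.value=2048 above
    p = Producer({"bootstrap.servers": BOOTSTRAP})
    payload = b"x" * value_size
    for _ in range(records):
        p.produce(TOPIC, value=payload)
        p.poll(0)                                      # serve delivery callbacks
    p.flush()

def consume_load(group, records=100000):
    c = Consumer({"bootstrap.servers": BOOTSTRAP, "group.id": group,
                  "auto.offset.reset": "earliest"})
    c.subscribe([TOPIC])
    for _ in range(records):
        c.poll(timeout=1.0)
    c.close()

threads = [threading.Thread(target=produce_load) for _ in range(20)]                        # num.producers=20
threads += [threading.Thread(target=consume_load, args=(f"head-{i}",)) for i in range(20)]  # num.head.consumers=20
for t in threads:
    t.start()
for t in threads:
    t.join()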
Retention workload generator
This test checked the deletion performance of the object storage under a heavy topic-retention workload. The retention workload was generated using a TOCC script that produces many messages in parallel to a test topic. The test topic was configured with aggressive size-based and time-based retention settings that caused the event stream to be continuously purged from the object store as segments were archived. This led to many deletions in the object storage by the broker and to the collection of performance data for the object-store delete operations.
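An aggressive retention setup of the kind described here can be expressed as topic-level configuration; the sketch below shows one possible shape using the confluent-kafka admin client, with placeholder retention values rather than the settings used in the verification.

from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "192.168.150.172:9092"})
topic = NewTopic(
    "tier-retention-check",                 # placeholder topic name
    num_partitions=80,
    replication_factor=3,
    config={
        "confluent.tier.enable": "true",    # tier this topic's segments to the object store
        "segment.bytes": "536870912",       # 512MB segments, per the test parameters
        "retention.bytes": "1073741824",    # aggressive size-based retention (placeholder)
        "retention.ms": "300000",           # aggressive time-based retention (placeholder)
    },
)
for f in admin.create_topics([topic]).values():
    f.result()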
For verification details, see the Confluent website.