
NetApp StorageGRID and big data analytics


NetApp StorageGRID use cases

The NetApp StorageGRID object storage solution offers scalability, data availability, security, and high performance. Organizations of all sizes and across various industries use StorageGRID S3 for a wide range of use cases. Let's explore some typical scenarios:

Big data analytics: StorageGRID S3 is frequently used as a data lake, where businesses store large amounts of structured and unstructured data for analysis with tools such as Apache Spark, Splunk SmartStore, and Dremio.

Data tiering: NetApp customers use ONTAP's FabricPool feature to automatically move data from a high-performance local tier to StorageGRID. Tiering frees up expensive flash storage for hot data while keeping cold data readily available on low-cost object storage. This maximizes both performance and savings.

Data backup and disaster recovery: Businesses can use StorageGRID S3 as a reliable and cost-effective solution for backing up critical data and recovering it in case of a disaster.

Data storage for applications: StorageGRID S3 can be used as a storage backend for applications, enabling developers to easily store and retrieve files, images, videos, and other types of data (see the short example after this list).

Content delivery: StorageGRID S3 can be used to store and deliver static website content, media files, and software downloads to users around the world, leveraging StorageGRID's geo-distribution and global namespace for fast and reliable content delivery.


Data archive: StorageGRID offers different storage types and supports tiering to public long-term, low-cost storage options, making it an ideal solution for archiving and long-term retention of data that must be kept for compliance or historical purposes.
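
As an illustration of the application-storage use case above, here is a minimal sketch, using the standard AWS SDK for Python (boto3), of how an application might store and retrieve objects through a StorageGRID S3 endpoint. The endpoint URL, credentials, and bucket and object names are placeholders, not values from any specific deployment.

```python
# Minimal sketch: store and retrieve application data on StorageGRID via the S3 API.
# Endpoint, credentials, and bucket/key names are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://sg-lb.example.com:10443",  # StorageGRID load balancer endpoint (placeholder)
    aws_access_key_id="<tenant-access-key>",
    aws_secret_access_key="<tenant-secret-key>",
)

# Store an application asset in a bucket...
s3.upload_file("report.pdf", "app-assets", "reports/report.pdf")

# ...and retrieve it later with the same S3 calls used against any S3-compatible store.
s3.download_file("app-assets", "reports/report.pdf", "/tmp/report.pdf")
```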

[Diagram: StorageGRID object storage use cases]

Among the above, big data analytics is one of the top use cases, and its usage is trending upward.

Why StorageGRID for data lakes?

  • Increased collaboration - Massive shared multi-site, multi-tenant storage with industry-standard API access

  • Decreased operational costs - Operational simplicity of a single, self-healing, automated scale-out architecture

  • Scalability - Unlike traditional Hadoop and data warehouse solutions, StorageGRID S3 object storage decouples storage from compute and data, allowing businesses to scale their storage as their needs grow.

  • Durability and reliability - StorageGRID provides 99.999999999% durability, meaning that data stored is highly resistant to data loss. It also offers high availability, ensuring that data is always accessible.

  • Security - StorageGRID offers various security features, including encryption, access control policies, data lifecycle management, object lock, and versioning, to protect data stored in S3 buckets (see the sketch after this list).
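
The security features listed above are exposed through the standard S3 API. The following is a minimal, hedged sketch using boto3 of enabling bucket versioning and an S3 Object Lock default retention rule against a StorageGRID endpoint; the endpoint, credentials, and bucket names are placeholders.

```python
# Sketch: enable versioning and Object Lock (WORM) retention via the S3 API.
# Endpoint, credentials, and bucket names are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://sg-lb.example.com:10443",  # StorageGRID load balancer endpoint (placeholder)
    aws_access_key_id="<tenant-access-key>",
    aws_secret_access_key="<tenant-secret-key>",
)

# Versioning keeps prior object versions when data is overwritten or deleted.
s3.put_bucket_versioning(
    Bucket="analytics-data",
    VersioningConfiguration={"Status": "Enabled"},
)

# Object Lock must be enabled when the bucket is created; a default retention
# rule can then be applied to new objects written to that bucket.
s3.create_bucket(Bucket="compliance-data", ObjectLockEnabledForBucket=True)
s3.put_object_lock_configuration(
    Bucket="compliance-data",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 365}},
    },
)
```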

StorageGRID S3 Data Lakes

[Diagram: StorageGRID S3 data lake example]

Which data warehouse or data lake works best with S3 object storage?

NetApp benchmarked StorageGRID with three data warehouse/lakehouse ecosystems: Hive, Delta Lake, and Dremio. Apache Iceberg: The Definitive Guide includes a brief introduction to data warehouses and data lakehouses, along with the pros and cons of the two architectures.

  • Benchmark Tool - TPC-DS - https://www.tpc.org/tpcds/

  • Big data ecosystems

    • Cluster of 5 VMs, each with 128 GB RAM and 24 vCPUs, with SSD storage for the system disk

    • Hadoop 3.3.5 with Hive 3.1.3 (1 name node + 4 data nodes)

    • Delta Lake with Spark 3.2.0 (1 master + 4 workers) and Hadoop 3.3.5 (see the configuration sketch after this list)

    • Dremio v23 (1 master + 4 executors)

  • Object storage

    • NetApp® StorageGRID® 11.6 with 3 x SG6060 + 1 x SG1000 load balancer

    • Object protection - 2 copies

  • Database size: 1,000 GB

  • Cache disabled on all three ecosystems to get consistent results for each query test
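
For reference, here is a hedged sketch (not the exact benchmark configuration) of how a Delta Lake on Spark session can be pointed at a StorageGRID S3 endpoint through the Hadoop S3A connector. The endpoint, credentials, and bucket/prefix names are placeholders, and the Delta Lake jars are assumed to already be on the Spark classpath.

```python
# Sketch: Delta Lake on Spark reading/writing data on StorageGRID through S3A.
# Endpoint, credentials, and bucket/prefix names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("deltalake-on-storagegrid")
    # Delta Lake integration (assumes delta-spark jars are available)
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Point the Hadoop S3A connector at the StorageGRID load balancer endpoint
    .config("spark.hadoop.fs.s3a.endpoint", "https://sg-lb.example.com:10443")
    .config("spark.hadoop.fs.s3a.access.key", "<tenant-access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<tenant-secret-key>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read a Parquet table from a StorageGRID bucket and rewrite it as a Delta table
# (bucket and prefix names are illustrative only).
df = spark.read.parquet("s3a://tpcds-1000g/store_sales/")
df.write.format("delta").mode("overwrite").save("s3a://tpcds-delta/store_sales/")
```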

TPC-DS comes with 99 complex SQL queries for query benchmarking. We measured the total minutes required to complete all 99 queries, and then dove deeper by breaking down the type and number of S3 requests to analyze the results. The first table below shows the total duration of all 99 queries, and the second table summarizes the number and types of S3 requests each ecosystem sent to StorageGRID.

TPC-DS query result

Ecosystem                          Hive                  Delta Lake            Dremio
Storage layer                      NetApp® StorageGRID®  NetApp® StorageGRID®  NetApp® StorageGRID®
Drive type                         HDD                   HDD                   HDD
Table format                       Parquet               Parquet               Parquet (1)
Database size                      1,000 GB              1,000 GB              1,000 GB
TPC-DS 99 queries, total minutes   1084 (2)              55                    47

(1) Tested with both the Parquet and Iceberg table formats; the results are similar.

(2) Hive was unable to complete query number 72.

TPC-DS queries - S3 requests breakdown

S3 requests                    Hive         Delta Lake   Dremio
GET                            1,117,184    2,074,610    4,414,227
List objects                   312,053      24,158       240
HEAD (non-existent object)     156,027      12,103       192
HEAD (existent object)         982,126      922,732      1,845
Total requests                 2,567,390    3,033,603    4,416,504

Observation - all GETs are range GETs:

  • Hive: 80% of range GETs read 2 KB to 2 MB from 32 MB objects, at 50 - 100 requests/sec

  • Delta Lake: 73% of range GETs read less than 100 KB from 32 MB objects, at 1,000 - 1,400 requests/sec

  • Dremio: 90% of range GETs read 1 MB from 256 MB objects, at 2,000 - 2,300 requests/sec

From the first table, we can see that Delta Lake and Dremio are much faster than Hive. From the second table, we notice that Hive sent a large number of S3 list-objects requests, which are typically slow on all object storage platforms, especially against buckets containing many objects; this significantly increases overall query duration. Another observation is that Dremio was able to send a high number of GET requests in parallel - 2,000 to 2,300 requests per second, versus 50 - 100 requests per second for Hive. Hive and the Hadoop S3A connector mimic a standard filesystem on top of S3, which contributes to Hive's slowness on S3 object storage.
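
To illustrate the access pattern behind these numbers, the sketch below (not taken from the benchmark harness) issues byte-range GET requests in parallel with boto3, similar to the many concurrent small reads a parallel engine keeps in flight against large Parquet objects. The endpoint, bucket, and key names are placeholders.

```python
# Sketch: parallel byte-range GETs against a large object, the dominant request
# pattern observed in the table above. Endpoint, bucket, and key are placeholders.
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://sg-lb.example.com:10443",  # placeholder StorageGRID endpoint
    aws_access_key_id="<tenant-access-key>",
    aws_secret_access_key="<tenant-secret-key>",
)

def fetch_range(key: str, start: int, length: int) -> bytes:
    """Read one byte range of an object, e.g. a single Parquet column chunk."""
    resp = s3.get_object(
        Bucket="tpcds-1000g",
        Key=key,
        Range=f"bytes={start}-{start + length - 1}",
    )
    return resp["Body"].read()

# Issue 1 MiB range GETs concurrently instead of serializing them.
ranges = [(i * 1_048_576, 1_048_576) for i in range(64)]
with ThreadPoolExecutor(max_workers=16) as pool:
    chunks = list(pool.map(lambda r: fetch_range("store_sales/part-00000.parquet", *r), ranges))
```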

Using Hadoop (on either HDFS or S3 object storage) with Hive or Spark requires extensive knowledge of Hadoop, of Hive/Spark, and of how the settings from each service interact - together they have more than 1,000 settings. Very often the settings are inter-related and cannot be changed in isolation. It takes a tremendous amount of time and effort to find the optimal combination of settings and values to use.
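
As a hedged example of this tuning surface, the snippet below sets just a few of the many inter-related Hadoop S3A options through a Spark session. The values shown are illustrative starting points only, not recommendations from this benchmark.

```python
# Sketch: a handful of the many Hadoop S3A settings that interact with each other
# when Hive or Spark talks to S3 object storage. Values are illustrative only.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("s3a-tuning-example")
    # HTTP connection pool size and worker threads interact and are usually raised together.
    .config("spark.hadoop.fs.s3a.connection.maximum", "200")
    .config("spark.hadoop.fs.s3a.threads.max", "64")
    # Read pattern hint: 'random' favors the small byte-range reads that
    # columnar formats such as Parquet generate.
    .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
    .config("spark.hadoop.fs.s3a.readahead.range", "1M")
    # Multipart upload size interacts with memory use and object layout.
    .config("spark.hadoop.fs.s3a.multipart.size", "64M")
    .getOrCreate()
)
```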

Dremio is a data lake engine that uses end-to-end Apache Arrow to dramatically increase query performance. Apache Arrow provides a standardized columnar memory format for efficient data sharing and fast analytics. Arrow employs a language-agnostic approach, designed to eliminate the need for data serialization and deserialization, improving the performance and interoperability between complex data processes and systems.
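
As a small illustration of Arrow in practice, the sketch below uses PyArrow to read a Parquet object from an S3-compatible endpoint directly into Arrow's columnar in-memory format. The endpoint, credentials, bucket, and key are placeholders.

```python
# Sketch: load a Parquet object from an S3-compatible endpoint into an Arrow table.
# Endpoint, credentials, bucket, and key are placeholders.
import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem(
    endpoint_override="sg-lb.example.com:10443",  # placeholder StorageGRID endpoint
    access_key="<tenant-access-key>",
    secret_key="<tenant-secret-key>",
    scheme="https",
)

# The result is an Arrow Table: a language-agnostic columnar structure that
# engines can exchange without serialization/deserialization overhead.
table = pq.read_table("tpcds-1000g/store_sales/part-00000.parquet", filesystem=s3)
print(table.schema)
```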

Dremio's performance is mostly driven by the computing power of the Dremio cluster. Although Dremio uses Hadoop's S3A connector to connect to S3 object storage, Hadoop itself is not required, and most of Hadoop's fs.s3a settings are not used by Dremio. This makes tuning Dremio performance easy, without spending time learning and testing various Hadoop fs.s3a settings.

From this benchmark result, we can conclude that how well a big data analytics system is optimized for S3-based workloads is a major performance factor. Dremio optimizes query execution, uses metadata efficiently, and provides seamless access to S3 data, resulting in better performance than Hive when working with S3 storage. Refer to this page to configure a Dremio S3 data source with StorageGRID.
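
For orientation only, the following is a hedged sketch of the kind of values typically entered when adding an S3-compatible source in Dremio (compatibility mode plus a few S3A-style connection properties). Treat the property names and structure as assumptions and follow the configuration page referenced above for the authoritative steps.

```python
# Hypothetical summary of the settings one would enter in Dremio's source UI
# for an S3-compatible endpoint such as StorageGRID; names and values are
# placeholders, not a verified Dremio configuration.
dremio_s3_source = {
    "name": "storagegrid",
    "enable_compatibility_mode": True,  # needed for non-AWS, S3-compatible endpoints
    "connection_properties": {
        "fs.s3a.endpoint": "sg-lb.example.com:10443",  # placeholder StorageGRID endpoint
        "fs.s3a.path.style.access": "true",            # address buckets by path, not virtual host
        "fs.s3a.connection.ssl.enabled": "true",
    },
}
```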

Visit the links below to learn more about how StorageGRID and Dremio work together to provide a modern and efficient data lake infrastructure, and how NetApp migrated from Hive + HDFS to Dremio + StorageGRID to dramatically improve big data analytics efficiency.