
Use Cloudera Hadoop S3A connector with StorageGRID


By Angela Cheng

Hadoop has been a favorite of data scientists for some time now. Hadoop allows for the distributed processing of large data sets across clusters of computers using simple programming frameworks. Hadoop was designed to scale up from single servers to thousands of machines, with each machine possessing local compute and storage.

Why use S3A for Hadoop workflows?

As the volume of data has grown over time, the approach of adding new machines with their own compute and storage has become inefficient. Scaling linearly creates challenges for using resources efficiently and managing the infrastructure.

To address these challenges, the Hadoop S3A client offers high-performance I/O against S3 object storage. Implementing a Hadoop workflow with S3A lets you use object storage as the data repository and separates compute from storage, so you can scale each independently. Decoupling compute and storage also lets you dedicate the right amount of resources to your compute jobs and provision capacity based on the size of your data set, reducing your overall TCO for Hadoop workflows.

Configure S3A connector to use StorageGRID

Prerequisites

  • A StorageGRID S3 endpoint URL, a tenant S3 access key, and a secret key for Hadoop S3A connection testing.

  • A Cloudera cluster and root or sudo permission to each host in the cluster to install the Java package.

As of April 2022, Java 11.0.14 with Cloudera 7.1.7 was tested against StorageGRID 11.5 and 11.6. However, the Java version number might be different at the time of a new install.

Install Java package

  1. Check the Cloudera support matrix for the supported JDK version.

  2. Download the Java 11.x package that matches the Cloudera cluster operating system. Copy this package to each host in the cluster. In this example, the rpm package is used for CentOS.

  3. Log into each host as root or using an account with sudo permission. Perform the following steps on each host:

    1. Install the package:

      $ sudo rpm -Uvh jdk-11.0.14_linux-x64_bin.rpm
    2. Check where Java is installed. If multiple versions are installed, set the newly installed version as default:

      alternatives --config java
      
      There are 2 programs which provide 'java'.
      
        Selection    Command
      -----------------------------------------------
       +1           /usr/java/jre1.8.0_291-amd64/bin/java
        2           /usr/java/jdk-11.0.14/bin/java
      
      Enter to keep the current selection[+], or type selection number: 2
    3. Add this line to the end of /etc/profile. The path should match the path of the selection above:

      export JAVA_HOME=/usr/java/jdk-11.0.14
    4. Run the following command for the profile to take effect:

      source /etc/profile
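
To confirm the change on a host, you can check which Java the shell now resolves and that JAVA_HOME points at the new JDK. This is a minimal sketch; the exact version string depends on the JDK build you installed.

java -version
# Should report the newly installed JDK, for example: java version "11.0.14"
echo $JAVA_HOME
# Should print /usr/java/jdk-11.0.14 (the path set in /etc/profile above)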

Cloudera HDFS S3A configuration

Steps

  1. From the Cloudera Manager GUI, select Clusters > HDFS, and select Configuration.

  2. Under CATEGORY, select Advanced, and scroll down to locate Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml.

  3. Click the (+) sign and add the following name-value pairs.

    Name                            Value
    fs.s3a.access.key               <tenant S3 access key from StorageGRID>
    fs.s3a.secret.key               <tenant S3 secret key from StorageGRID>
    fs.s3a.connection.ssl.enabled   [true or false] (default is HTTPS if this entry is missing)
    fs.s3a.endpoint                 <StorageGRID S3 endpoint:port>
    fs.s3a.impl                     org.apache.hadoop.fs.s3a.S3AFileSystem
    fs.s3a.path.style.access        [true or false] (default is virtual-hosted style if this entry is missing)

    (Sample screenshot: S3A configuration in Cloudera Manager)

  4. Click the Save Changes button. Select the Stale Configuration icon from the HDFS menu bar, select Restart Stale Services on the next page, and select Restart Now.

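If you want to sanity-check the S3A settings from a single host before (or after) the cluster-wide restart, the same properties can be passed as command-line overrides to the Hadoop shell. This is a sketch only; the endpoint, credentials, and bucket name are placeholders to replace with your own values.

hadoop fs \
  -Dfs.s3a.endpoint=<StorageGRID S3 endpoint:port> \
  -Dfs.s3a.access.key=<tenant S3 access key> \
  -Dfs.s3a.secret.key=<tenant S3 secret key> \
  -Dfs.s3a.connection.ssl.enabled=true \
  -Dfs.s3a.path.style.access=true \
  -ls s3a://<bucket-name>/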

Test S3A connection to StorageGRID

Perform basic connection test

Log into one of the hosts in the Cloudera cluster, and enter hadoop fs -ls s3a://<bucket-name>/.

The following example uses path-style access with a pre-existing hdfs-test bucket and a test object.

[root@ce-n1 ~]# hadoop fs -ls s3a://hdfs-test/
22/02/15 18:24:37 WARN impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
22/02/15 18:24:37 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
22/02/15 18:24:37 INFO impl.MetricsSystemImpl: s3a-file-system metrics system started
22/02/15 18:24:37 INFO Configuration.deprecation: No unit for fs.s3a.connection.request.timeout(0) assuming SECONDS
Found 1 items
-rw-rw-rw-   1 root root       1679 2022-02-14 16:03 s3a://hdfs-test/test
22/02/15 18:24:38 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system...
22/02/15 18:24:38 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped.
22/02/15 18:24:38 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete.
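
Once listing works, a simple round-trip copy is a quick way to confirm write and read access as well. This is a sketch using the same hdfs-test example bucket; adjust the bucket and file names for your environment.

# Create a small local file, upload it, read it back, and clean up
echo "s3a smoke test" > /tmp/s3a-smoke.txt
hadoop fs -put /tmp/s3a-smoke.txt s3a://hdfs-test/s3a-smoke.txt
hadoop fs -cat s3a://hdfs-test/s3a-smoke.txt
hadoop fs -rm s3a://hdfs-test/s3a-smoke.txt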

Troubleshooting

Scenario 1

When using an HTTPS connection to StorageGRID, you get a handshake_failure error after a 15-minute timeout.

Reason: An old JRE/JDK version is using an outdated or unsupported TLS cipher suite for the connection to StorageGRID.

Sample error message

[root@ce-n1 ~]# hadoop fs -ls s3a://hdfs-test/
22/02/15 18:52:34 WARN impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
22/02/15 18:52:34 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
22/02/15 18:52:34 INFO impl.MetricsSystemImpl: s3a-file-system metrics system started
22/02/15 18:52:35 INFO Configuration.deprecation: No unit for fs.s3a.connection.request.timeout(0) assuming SECONDS
22/02/15 19:04:51 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system...
22/02/15 19:04:51 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped.
22/02/15 19:04:51 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete.
22/02/15 19:04:51 WARN fs.FileSystem: Failed to initialize fileystem s3a://hdfs-test/: org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExistV2 on hdfs: com.amazonaws.SdkClientException: Unable to execute HTTP request: Received fatal alert: handshake_failure: Unable to execute HTTP request: Received fatal alert: handshake_failure
ls: doesBucketExistV2 on hdfs: com.amazonaws.SdkClientException: Unable to execute HTTP request: Received fatal alert: handshake_failure: Unable to execute HTTP request: Received fatal alert: handshake_failure

Resolution: Make sure that JDK 11.x or later is installed and set as the default Java runtime. Refer to the Install Java package section for more information.
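
If you are unsure whether the endpoint and the client can agree on a cipher, checking what the StorageGRID endpoint negotiates with a modern TLS client can help narrow down the problem. This is a sketch that assumes openssl is available on the host; replace the host and port with your S3 endpoint.

# Show the TLS protocol and cipher negotiated by the StorageGRID S3 endpoint
openssl s_client -connect <StorageGRID-S3-endpoint>:<port> < /dev/null 2>/dev/null | grep -E 'Protocol|Cipher'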

Scenario 2

Failed to connect to StorageGRID with the error message Unable to find valid certification path to requested target.

Reason: The StorageGRID S3 endpoint server certificate is not trusted by the Java program.

Sample error message

[root@hdp6 ~]# hadoop fs -ls s3a://hdfs-test/
22/03/11 20:58:12 WARN impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
22/03/11 20:58:13 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
22/03/11 20:58:13 INFO impl.MetricsSystemImpl: s3a-file-system metrics system started
22/03/11 20:58:13 INFO Configuration.deprecation: No unit for fs.s3a.connection.request.timeout(0) assuming SECONDS
22/03/11 21:12:25 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system...
22/03/11 21:12:25 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped.
22/03/11 21:12:25 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete.
22/03/11 21:12:25 WARN fs.FileSystem: Failed to initialize fileystem s3a://hdfs-test/: org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExistV2 on hdfs: com.amazonaws.SdkClientException: Unable to execute HTTP request: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target: Unable to execute HTTP request: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

Resolution: NetApp recommends using a server certificate issued by a known public certificate signing authority to make sure that the authentication is secure. Alternatively, add a custom CA or server certificate to the Java trust store.

Complete the following steps to add a StorageGRID custom CA or server certificate to the Java trust store.

  1. Back up the existing default Java cacerts file.

    cp -ap $JAVA_HOME/lib/security/cacerts $JAVA_HOME/lib/security/cacerts.orig
  2. Import the StorageGRID S3 endpoint certificate into the Java trust store.

    keytool -import -trustcacerts -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit -noprompt -alias sg-lb -file <StorageGRID CA or server cert in pem format>
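
You can then verify that the certificate is present in the trust store. This is a sketch assuming the default changeit store password and the sg-lb alias used above.

keytool -list -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit -alias sg-lb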

Troubleshooting tips

  1. Increase the Hadoop log level to DEBUG.

    export HADOOP_ROOT_LOGGER=hadoop.root.logger=DEBUG,console

  2. Execute the command, and direct the log messages to error.log.

    hadoop fs -ls s3a://<bucket-name>/ &>error.log
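
You can then search the captured log for the usual failure signatures seen in the scenarios above. A sketch:

# Look for TLS handshake or certificate trust failures in the debug log
grep -iE 'handshake|certification path|AWSClientIOException' error.log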
