Hortonworks DataFlow (HDF) 2.1 was recently released and includes a new feature that can be used to connect to Azure Data Lake Store (ADLS). This is a fantastic use case for HDF as the data movement engine supporting a connected data plane architecture spanning on-premises and cloud deployments.

This how-to assumes that you have created an Azure Data Lake Store account and that you have remote access to an HDInsight head node in order to retrieve some dependent JARs.

We will make use of the new Additional Classpath Resources feature for the GetHDFS and PutHDFS processors in NiFi 1.1, included within HDF 2.1. The following additional dependencies are required for ADLS connectivity:

  • adls2-oauth2-token-provider-1.0.jar
  • azure-data-lake-store-sdk-2.0.4-SNAPSHOT.jar
  • hadoop-azure-datalake-2.0.0-SNAPSHOT.jar
  • jackson-core-2.2.3.jar
  • okhttp-2.4.0.jar
  • okio-1.4.0.jar

The first three Azure-specific JARs can be found in /usr/lib/hdinsight-datalake/ on the HDInsight head node. The Jackson JAR can be found in /usr/hdp/current/hadoop-client/lib, and the last two can be found in /usr/hdp/current/hadoop-hdfs-client/lib.

Once you've gathered these JARs, copy them to every NiFi node and place them in a newly created directory, /usr/lib/hdinsight-datalake.
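A missing dependency tends to surface later as a class-loading error in the processor, so it can help to confirm on each NiFi node that all six JARs are in place. The following is a minimal sketch, assuming the exact file names listed above; the helper name missing_jars is hypothetical and not part of NiFi or HDF.

```python
import os

# JAR names exactly as listed in the article; adjust them if your
# HDInsight cluster ships different versions.
REQUIRED_JARS = [
    "adls2-oauth2-token-provider-1.0.jar",
    "azure-data-lake-store-sdk-2.0.4-SNAPSHOT.jar",
    "hadoop-azure-datalake-2.0.0-SNAPSHOT.jar",
    "jackson-core-2.2.3.jar",
    "okhttp-2.4.0.jar",
    "okio-1.4.0.jar",
]


def missing_jars(directory, required=REQUIRED_JARS):
    """Return the required JAR names that are not present in `directory`.

    A directory that does not exist is treated as containing nothing.
    """
    present = set(os.listdir(directory)) if os.path.isdir(directory) else set()
    return [name for name in required if name not in present]


if __name__ == "__main__":
    target = "/usr/lib/hdinsight-datalake"
    missing = missing_jars(target)
    if missing:
        print("Missing from %s: %s" % (target, ", ".join(missing)))
    else:
        print("All required JARs are present in %s" % target)
```

Running the script on every node before configuring the processor gives a quick yes/no answer per node.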

In order to authenticate to ADLS, we'll use OAuth2. This requires the TenantID associated with your Azure account. The simplest way to obtain this is via the Azure CLI, using the azure account show command.

You will also need to create an Azure AD service principal as well as an associated key. Navigate to Azure AD > App Registrations > Add.

10347-screen-shot-2016-12-15-at-22739-pm.png

Take note of the Application ID (also known as the Client ID) and then generate a key via the Keys blade. Please note that the Client Secret value will be hidden after you leave this blade, so be sure to copy it and store it securely.

10348-screen-shot-2016-12-15-at-22906-pm.png

The service principal associated with this application will need service-level authorization to access the Azure Data Lake Store instance that you created as a prerequisite. This can be granted via the IAM blade for your ADLS instance (please note you will not see the Add button in the top toolbar unless you have administrative access to your Azure subscription).

In addition, the service principal will need appropriate directory-level authorizations for the ADLS directories it should be able to read or write. These can be assigned via Data Explorer > Access within your ADLS instance.

At this point, you should have your TenantID, ClientID, and Client Secret available, and we can now configure core-site.xml in order to access Azure Data Lake via the PutHDFS processor.
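Before editing any configuration, it may be worth confirming that the three values work together. The sketch below builds an OAuth2 client-credentials request against the Azure AD token endpoint for your tenant. It is an illustration rather than part of the HDF setup: the function name build_token_request is hypothetical, and the resource value https://datalake.azure.net/ is an assumption about the identifier used for Azure Data Lake Store.

```python
import urllib.parse
import urllib.request

# Assumption: resource identifier for Azure Data Lake Store.
ADLS_RESOURCE = "https://datalake.azure.net/"


def build_token_request(tenant_id, client_id, client_secret):
    """Build (but do not send) a client-credentials token request."""
    url = "https://login.microsoftonline.com/%s/oauth2/token" % tenant_id
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "resource": ADLS_RESOURCE,
    }).encode("ascii")
    # Supplying a body makes urllib issue a POST.
    return urllib.request.Request(url, data=body)
```

Passing the returned request to urllib.request.urlopen sends it; with valid credentials the response should be a JSON document containing an access_token field, while an HTTP error suggests that one of the three values is wrong.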

The important core-site values are as follows (note the variables identified with the '$' sigil below, including part of the refresh URL path).

    <property>
      <name>dfs.adls.oauth2.access.token.provider.type</name>
      <value>ClientCredential</value>
    </property>
    <property>
      <name>dfs.adls.oauth2.refresh.url</name>
      <value>https://login.microsoftonline.com/$YOUR_TENANT_ID/oauth2/token</value>
    </property>
    <property>
      <name>dfs.adls.oauth2.client.id</name>
      <value>$YOUR_CLIENT_ID</value>
    </property>
    <property>
      <name>dfs.adls.oauth2.credential</name>
      <value>$YOUR_CLIENT_SECRET</value>
    </property>
    <property>
      <name>fs.AbstractFileSystem.adl.impl</name>
      <value>org.apache.hadoop.fs.adl.Adl</value>
    </property>
    <property>
      <name>fs.adl.impl</name>
      <value>org.apache.hadoop.fs.adl.AdlFileSystem</value>
    </property>
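If you would rather generate these properties than substitute the variables by hand, the following sketch renders them from the three values. It assumes the properties sit inside the usual Hadoop <configuration> root element; the function name render_core_site is hypothetical.

```python
import xml.etree.ElementTree as ET


def render_core_site(tenant_id, client_id, client_secret):
    """Return the ADLS-related core-site.xml properties as an XML string."""
    properties = [
        ("dfs.adls.oauth2.access.token.provider.type", "ClientCredential"),
        ("dfs.adls.oauth2.refresh.url",
         "https://login.microsoftonline.com/%s/oauth2/token" % tenant_id),
        ("dfs.adls.oauth2.client.id", client_id),
        ("dfs.adls.oauth2.credential", client_secret),
        ("fs.AbstractFileSystem.adl.impl", "org.apache.hadoop.fs.adl.Adl"),
        ("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem"),
    ]
    root = ET.Element("configuration")
    for name, value in properties:
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # ET.indent is only available on Python 3.9+; skip pretty-printing otherwise.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")
```

Building the document with ElementTree rather than string formatting means a secret containing characters such as & or < is escaped correctly. Keep in mind that the resulting file holds the Client Secret in plain text, so restrict its file permissions accordingly.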

We're now ready to configure the PutHDFS processor in NiFi.

10349-screen-shot-2016-12-15-at-24105-pm.png

For Hadoop Configuration Resources, point to your modified core-site.xml, including the properties above, and to an hdfs-site.xml (no ADLS-specific changes are required).

Additional Classpath Resources should point to the /usr/lib/hdinsight-datalake directory to which we copied the dependencies on all NiFi nodes.

The input to this PutHDFS processor can be any FlowFile. It may be simplest to use the GenerateFlowFile processor to create the input with some Custom Text such as:

The time is ${now()} 

When you run the data flow, you should see the FlowFiles appear in the ADLS directory specified in the processor, which you can verify using the Data Explorer in the Azure Portal, or via some other means.

10350-screen-shot-2016-12-15-at-24522-pm.png

10361-screen-shot-2016-12-15-at-24533-pm.png
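As one example of verifying by other means, the files can be listed over the WebHDFS-compatible REST endpoint that Azure Data Lake Store exposes. The sketch below only builds the request. The function name build_liststatus_request is hypothetical, the account name and directory are whatever you configured, and the access token is assumed to be a bearer token obtained separately for the service principal.

```python
import urllib.parse
import urllib.request


def build_liststatus_request(account_name, directory, access_token):
    """Build (but do not send) a LISTSTATUS request for an ADLS directory."""
    # Percent-encode the path but keep "/" separators intact.
    path = urllib.parse.quote(directory.strip("/"))
    url = "https://%s.azuredatalakestore.net/webhdfs/v1/%s?op=LISTSTATUS" % (
        account_name, path)
    return urllib.request.Request(
        url, headers={"Authorization": "Bearer " + access_token})
```

Sending the request with urllib.request.urlopen should return a JSON listing in which the FlowFiles written by PutHDFS appear as entries, provided the service principal has read access to the directory.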

Comments

@slachterman

I just followed the instructions in the article to move generated FlowFiles to ADLS using the PutHDFS processor, but I am getting the errors below. Please help.

I have specified all the configurations for PutHDFS as per the article. It keeps saying "unable to find valid certification path to requested target", but I am not sure where I need to upload the ADLS credential certificate in the PutHDFS processor.


2019-07-30 17:24:20,657 ERROR [Timer-Driven Process Thread-9] o.apache.nifi.processors.hadoop.PutHDFS PutHDFS[id=44e51785-016c-1000-901a-6aa4d9167c2c] Failed to access HDFS due to com.microsoft.azure.datalake.store.ADLException:
Last encountered exception thrown after 5 tries. [javax.net.ssl.SSLHandshakeException,javax.net.ssl.SSLHandshakeException,javax.net.ssl.SSLHandshakeException,javax.net.ssl.SSLHandshakeException,javax.net.ssl.SSLHandshakeException]
[ServerRequestId:null]: com.microsoft.azure.datalake.store.ADLException: Error getting info for file /
Operation GETFILESTATUS failed with exception javax.net.ssl.SSLHandshakeException : sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
Last encountered exception thrown after 5 tries. [javax.net.ssl.SSLHandshakeException,javax.net.ssl.SSLHandshakeException,javax.net.ssl.SSLHandshakeException,javax.net.ssl.SSLHandshakeException,javax.net.ssl.SSLHandshakeException]
[ServerRequestId:null]
Last encountered exception thrown after 5 tries. [javax.net.ssl.SSLHandshakeException,javax.net.ssl.SSLHandshakeException,javax.net.ssl.SSLHandshakeException,javax.net.ssl.SSLHandshakeException,javax.net.ssl.SSLHandshakeException]
[ServerRequestId:null]
    at com.microsoft.azure.datalake.store.ADLStoreClient.getExceptionFromResponse(ADLStoreClient.java:1194)
    at com.microsoft.azure.datalake.store.ADLStoreClient.getDirectoryEntry(ADLStoreClient.java:741)
    at org.apache.hadoop.fs.adl.AdlFileSystem.getFileStatus(AdlFileSystem.java:487)
    at org.apache.nifi.processors.hadoop.PutHDFS$1.run(PutHDFS.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:360)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1942)
    at org.apache.nifi.processors.hadoop.PutHDFS.onTrigger(PutHDFS.java:236)
    at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
    at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1162)
    at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:209)
    at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:117)
    at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
    at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1946)
    at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:316)
    at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:310)
    at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1639)
    at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:223)
    at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1037)
    at sun.security.ssl.Handshaker.process_record(Handshaker.java:965)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1064)
    at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379)
    at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)
    at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1564)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:347)
    at com.microsoft.azure.datalake.store.HttpTransport.makeSingleCall(HttpTransport.java:307)
    at com.microsoft.azure.datalake.store.HttpTransport.makeCall(HttpTransport.java:90)
    at com.microsoft.azure.datalake.store.Core.getFileStatus(Core.java:691)
    at com.microsoft.azure.datalake.store.ADLStoreClient.getDirectoryEntry(ADLStoreClient.java:739)
    ... 18 common frames omitted


adlsjars.png

puthdfs.png