To access the (remote) HDFS of a CDP Data Hub cluster with Spark from within a CML session, you need to provide the hdfs-site.xml and core-site.xml of that specific cluster by copying them into the /etc/hadoop/conf/ folder of your CML session.
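If you want to verify that the configuration files are in place before starting the Spark session, a quick check like the following can help (a minimal sketch: it only assumes the /etc/hadoop/conf/ location used in this article and the standard fs.defaultFS property in core-site.xml):
import os
import xml.etree.ElementTree as ET
# Check that the copied core-site.xml exists and print the filesystem it points to
conf_file = "/etc/hadoop/conf/core-site.xml"
if os.path.exists(conf_file):
    for prop in ET.parse(conf_file).getroot().findall("property"):
        if prop.findtext("name") == "fs.defaultFS":
            print("fs.defaultFS =", prop.findtext("value"))
else:
    print("core-site.xml not found - copy the cluster config files first")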
Below is some sample code that shows how to write data from your CML session to the remote HDFS and read it back from there. It automatically pulls the required config files from the defined Data Hub cluster (REMOTE_HDFS_MASTER) and places them in the correct location of your CML session.
# This example shows how to write a file to the (remote) HDFS of a CDP Data Hub cluster and read it from there.
from __future__ import print_function
import sys, re, subprocess
from pyspark.sql import SparkSession
# Set the user name which will be used to log in to the HDFS master (derived from the current Kerberos principal).
USER_NAME = subprocess.getoutput("klist | sed -n 's/.*principal://p' | cut -d@ -f1")
# FQDN of the remote HDFS master
REMOTE_HDFS_MASTER = '<REPLACE ME with FQDN of remote HDFS Master, e.g. steffen-data-engineering-cluster-master3.steffen.a123-9p4k.cloudera.site>'
# Copy the Hadoop config files into the CML session
!scp -o "StrictHostKeyChecking no" -T $USER_NAME@$REMOTE_HDFS_MASTER:"/etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml" /etc/hadoop/conf
spark = SparkSession\
    .builder\
    .appName("RemoteHDFSAccess")\
    .config("spark.authenticate", "true")\
    .getOrCreate()
# Create a sample DataFrame
data = [('Anna', 1), ('John', 2), ('Martin', 3), ('Carol', 4), ('Hannah', 5)]
df_write = spark.createDataFrame(data)
# Write the data to the remote HDFS
df_write.write.csv(path="hdfs://" + REMOTE_HDFS_MASTER + "/tmp/example", mode="overwrite")
# Read the data back from the remote HDFS
df_load = spark.read.csv("hdfs://" + REMOTE_HDFS_MASTER + "/tmp/example")
df_load.show()
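# Optional: when reading the CSV back, Spark assigns generic column names (_c0, _c1)
# because the data was written without a header. You can attach readable names on read;
# the names below ("name", "id") are only illustrative.
df_named = spark.read.csv("hdfs://" + REMOTE_HDFS_MASTER + "/tmp/example").toDF("name", "id")
df_named.show()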
spark.stop()