To access the (remote) HDFS of a CDP Data Hub cluster with Spark from within a CML session, you need to provide the hdfs-site.xml and core-site.xml of that specific cluster by copying them into the /etc/hadoop/conf/ folder of your CML session.
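If you want to verify that the configuration files are in place before starting the Spark session, a quick check like the following can help (a minimal sketch: it only assumes the /etc/hadoop/conf/ location used in this article and the standard fs.defaultFS property in core-site.xml):
import os
import xml.etree.ElementTree as ET
# Check that the copied core-site.xml exists and print the filesystem it points to
conf_file = "/etc/hadoop/conf/core-site.xml"
if os.path.exists(conf_file):
    for prop in ET.parse(conf_file).getroot().findall("property"):
        if prop.findtext("name") == "fs.defaultFS":
            print("fs.defaultFS =", prop.findtext("value"))
else:
    print("core-site.xml not found - copy the cluster config files first")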
Below is some sample code that shows how to write data from your CML session to the remote HDFS and read it back from there. It automatically pulls the required config files from the defined Data Hub cluster (REMOTE_HDFS_MASTER) and places them in the correct location of your CML session.
# This example shows how to write a file to the (remote) HDFS of a CDP Data Hub cluster and read it from there.
from __future__ import print_function
import sys, re, subprocess
from pyspark.sql import SparkSession
# Set the user name which will be used to log in to the HDFS master (derived from the current Kerberos principal).
USER_NAME = subprocess.getoutput("klist | sed -n 's/.*principal://p' | cut -d@ -f1")
# FQDN of the remote HDFS master
REMOTE_HDFS_MASTER = '<REPLACE ME with FQDN of remote HDFS Master, e.g. steffen-data-engineering-cluster-master3.steffen.a123-9p4k.cloudera.site>'
# Copy the Hadoop config files into the CML session
!scp -o "StrictHostKeyChecking no" -T $USER_NAME@$REMOTE_HDFS_MASTER:"/etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml" /etc/hadoop/conf
spark = SparkSession\
    .builder\
    .appName("RemoteHDFSAccess")\
    .config("spark.authenticate", "true")\
    .getOrCreate()
# Create a sample DataFrame
data = [('Anna', 1), ('John', 2), ('Martin', 3), ('Carol', 4), ('Hannah', 5)]
df_write = spark.createDataFrame(data)
# Write the data to the remote HDFS
df_write.write.csv(path="hdfs://" + REMOTE_HDFS_MASTER + "/tmp/example", mode="overwrite")
# Read the data back from the remote HDFS
df_load = spark.read.csv("hdfs://" + REMOTE_HDFS_MASTER + "/tmp/example")
df_load.show()
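# Optional: when reading the CSV back, Spark assigns generic column names (_c0, _c1)
# because the data was written without a header. You can attach readable names on read;
# the names below ("name", "id") are only illustrative.
df_named = spark.read.csv("hdfs://" + REMOTE_HDFS_MASTER + "/tmp/example").toDF("name", "id")
df_named.show()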
spark.stop()