
How to copy data from one hadoop cluster to other using spark?

Contributor

I am new to Spark, and we have a requirement to copy files from one Hadoop cluster to another using Spark. I tried to connect to HDFS from Spark as below:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('yarn') \
    .appName('Cross_Cluster_App') \
    .config('spark.yarn.keytab', '<keytab_path_in_clusterB>') \
    .config('spark.yarn.principal', '<principal_name>') \
    .config('hadoop.security.authentication', 'kerberos') \
    .config('spark.yarn.access.hadoopFileSystems', 'hdfs://<namenode_of_clusterB>:<port>') \
    .getOrCreate()
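In case the config names matter: as far as I know, on Spark 3.x these spark.yarn.* security keys were renamed to spark.kerberos.*, so the same builder would look roughly like this there (same placeholders as above):

from pyspark.sql import SparkSession

# Spark 3.x equivalents: spark.yarn.keytab, spark.yarn.principal, and
# spark.yarn.access.hadoopFileSystems were renamed to spark.kerberos.* in Spark 3.0.
spark = SparkSession.builder \
    .master('yarn') \
    .appName('Cross_Cluster_App') \
    .config('spark.kerberos.keytab', '<keytab_path_in_clusterB>') \
    .config('spark.kerberos.principal', '<principal_name>') \
    .config('spark.kerberos.access.hadoopFileSystems', 'hdfs://<namenode_of_clusterB>:<port>') \
    .getOrCreate()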

I am running my code on an edge node of my own cluster, which is different from clusterB where the data reside. Then I am using the SparkSession object to read my file, which is present on clusterB:

csv_data = spark.read.csv('hdfs://<namenode_of_clusterB>:<port>/<filename>')

But I see the exception "Client cannot authenticate via:[TOKEN, KERBEROS]" as below.
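To be clear about the end goal, the whole copy I am after is roughly the following: read the file from clusterB and write it into my local cluster's HDFS. All host names, ports, and paths below are placeholders, not my real values:

from pyspark.sql import SparkSession

# getOrCreate() returns the session configured by the builder shown above.
spark = SparkSession.builder.getOrCreate()

# Read from the remote cluster (clusterB)...
csv_data = spark.read.csv('hdfs://<namenode_of_clusterB>:<port>/<filename>')

# ...and write into the local cluster's HDFS (placeholder output path).
csv_data.write.csv('hdfs://<namenode_of_local_cluster>:<port>/<output_dir>')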


Caused by: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
    at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:758)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:721)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:814)
    at org.apache.hadoop.ipc.Client$Connection.access$3600(Client.java:411)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1559)
    at org.apache.hadoop.ipc.Client.call(Client.java:1390)
    ... 52 more
Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
    at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:173)
    at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:390)
    at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:615)
    at org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:411)
    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:801)
    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:797)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:797)
    ... 55 more

The second time, I restarted the PySpark shell and submitted only this line:

csv_data = spark.read.csv('hdfs://<namenode_of_clusterB>:<port>/<filename>')

To my surprise, this line gives the same exception, so the configuration I set on the SparkSession object does not seem to be applied. I also tried passing the hdfs-site.xml and core-site.xml files to check whether it would make any difference, but in vain. Could anyone tell me how to authenticate with the keytab using Spark and then access the files on HDFS?
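For reference, this is roughly how I am launching the job. The principal, keytab path, and script name are placeholders, and I am not sure whether a kinit beforehand, the --principal/--keytab flags, or both are the right approach:

# Obtain a Kerberos ticket from the keytab before launching (placeholder values).
kinit -kt <keytab_path_in_clusterB> <principal_name>

# The same settings passed at submit time; my_copy_job.py is a placeholder
# name for the script shown above.
spark-submit \
  --master yarn \
  --principal <principal_name> \
  --keytab <keytab_path_in_clusterB> \
  --conf spark.yarn.access.hadoopFileSystems=hdfs://<namenode_of_clusterB>:<port> \
  my_copy_job.py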