
Spark 2.3 and HiveQL: can't "create table as select" on remote Hive installation - "Wrong FS" error

Hi all, I have the following situation that I haven't been able to solve even after extensive searching and trials.

I have two clusters: "Cluster 1" is Hortonworks HDP 2.6.5.0 and "Cluster 2" is Cloudera CDH 5.13.1 Enterprise.

The end goal is to run a PySpark script on "Cluster 1" and remotely create a Hive table on "Cluster 2". The script (simplified, of course, just to illustrate the issue) is the following:

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

sconf = SparkConf().setAppName("rob_example")
sc = SparkContext(conf=sconf)
sqlContext = HiveContext(sc)

# Point the session at the remote cluster's HDFS
sqlContext.setConf("fs.defaultFS", "hdfs://<REMOTE_HDFS_HOST>:8020")

# Create the output table on the remote Hive installation
sqlContext.sql("create table rob.test_output_table as select * from rob.test_input_table")

Now, what happens is that if I use Spark 1.6.3 on "Cluster 1", the above script runs just fine, and if I log into Hive on "Cluster 2" I can see that the destination table and its data are created correctly.

But if I use Spark 2.3 on "Cluster 1" (and I need to use 2.3), I get the following exception instead:

pyspark.sql.utils.AnalysisException: u'java.lang.IllegalArgumentException: Wrong FS: hdfs://<REMOTE_HDFS_HOST>/user/hive/warehouse/rob.db/test_output_table/.hive-staging_hive_2020-05-22_17-59-00_296_5401117384982980379-1/-ext-10000/part-00000-733a2646-fc76-47f0-80d6-a14b28677f7e-c000, expected: hdfs://cluster1_host1.domain1.local:8020;'

First of all, I notice a suspicious ";" trailing the "expected" filesystem where Spark is trying to read/write (maybe pointing to a wrong Hive Metastore?).

In fact, it's also strange that it expects me to read/write on "Cluster 1" itself, differently from what I specified in my Spark session's HiveContext configuration (the fs.defaultFS parameter inside the script). A quick way to check which filesystem the running context actually resolves is sketched below.
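For reference, a minimal sketch (assuming the same PySpark context as above; note that sc._jsc is a PySpark-internal handle, not a public API) of how one could check which fs.defaultFS the running context has actually picked up:

# Print the Hadoop configuration value the running context actually uses.
# Note: sc._jsc is PySpark-internal, not a public API.
print(sc._jsc.hadoopConfiguration().get("fs.defaultFS"))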

More than this, even if I try setting the following parameters I can't make it work. And it's strange, because as I said, with Spark 1.6 everything runs smoothly (even without the following additional configurations):

sqlContext.setConf("default.fs.name","hdfs://<REMOTE_HDFS_NODE>:8020")
sqlContext.setConf("hive.metastore.uris","thrift://<REMOTE_THRIFT_NODE>:9083")

Also, please note that if I used a pure Spark DataFrames approach as follows, things would work, but I need to use Spark SQL, otherwise the output table I get won't be in a Hive-compatible format:

# Same read, but writing through the DataFrame API instead of CREATE TABLE AS SELECT
test_DF = sqlContext.sql("select * from rob.test_input_table")
test_DF.write.mode("overwrite").saveAsTable("rob.test_output_table")
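(As an aside, I also wondered whether explicitly asking the writer for the Hive serde would help with the format problem. This is only a sketch based on the Spark 2.x DataFrameWriter API; I haven't verified that it produces a Hive-compatible table on CDH:)

# Sketch: request the Hive serde explicitly when saving (supported on Spark 2.2+)
test_DF.write.format("hive").mode("overwrite").saveAsTable("rob.test_output_table")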

So, that last approach is not feasible for me (the output table would not be in a Hive-compatible format).

I found a lot of people having similar issues, but none of the cases I found applied exactly to mine, and I'm stuck at this point. Any help or hints would be greatly appreciated!

Thank you for any insights on this.