
Spark 2.3 and HiveQL: can't "create table as select" on remote Hive installation - "Wrong FS" error



Hi all, I have the following situation that I can't solve even after extensive search and trials.


I have 2 clusters: "Cluster 1" is Hortonworks HDP- and "Cluster 2" is Cloudera CDH 5.13.1 Enterprise.


The final goal is to run a PySpark script on "Cluster 1" and remotely create a Hive table on "Cluster 2". The script (of course, this is just an example to illustrate the issue) is the following:



from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

sconf = SparkConf().setAppName("rob_example")
sc = SparkContext(conf=sconf)
sqlContext = HiveContext(sc)

# CTAS: should create the output table on the remote cluster
sqlContext.sql("create table rob.test_output_table as select * from rob.test_input_table")




Now, what happens is that if I use Spark 1.6.3 on "Cluster 1", the above script runs just fine, and if I log into Hive on "Cluster 2" I can see the destination table and its data are created correctly.


But if I use Spark 2.3 on "Cluster 1" (and I need to use 2.3) I get the following exception instead:



pyspark.sql.utils.AnalysisException: u'java.lang.IllegalArgumentException: Wrong FS: hdfs://<REMOTE_HDFS_HOST>/user/hive/warehouse/rob.db/test_output_table/.hive-staging_hive_2020-05-22_17-59-00_296_5401117384982980379-1/-ext-10000/part-00000-733a2646-fc76-47f0-80d6-a14b28677f7e-c000, expected: hdfs://cluster1_host1.domain1.local:8020;'




First of all, I notice a suspicious ";" trailing the "expected" filesystem where Spark is trying to read/write (maybe it's pointing to the wrong Hive metastore?).


In fact, it's also strange that it expects reads/writes to go to "Cluster 1" itself, which differs from what I specified in my Spark session's Hive context configuration (the fs.defaultFS parameter inside the script).
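For context, Hadoop's FileSystem.checkPath() raises "Wrong FS" when the scheme/authority of a path doesn't match the filesystem object it is handed. The following plain-Python sketch (no Hadoop required; check_path is a toy stand-in, not the real Hadoop method, and the host names mirror the stack trace above) mimics that comparison:

```python
# Toy illustration of Hadoop's "Wrong FS" check: compare the scheme and host
# of a path's URI against the filesystem the caller expects to be working on.
from urllib.parse import urlparse

def check_path(path, expected_fs):
    p, e = urlparse(path), urlparse(expected_fs)
    if (p.scheme, p.hostname) != (e.scheme, e.hostname):
        raise ValueError("Wrong FS: %s, expected: %s" % (path, expected_fs))

# The Hive staging dir lives on the remote cluster, but the session's default
# filesystem is Cluster 1, so the check fails:
try:
    check_path(
        "hdfs://REMOTE_HDFS_HOST/user/hive/warehouse/rob.db/test_output_table",
        "hdfs://cluster1_host1.domain1.local:8020",
    )
except ValueError as err:
    print(err)
```

This suggests the Spark 2.3 session is resolving the staging path against Cluster 1's default filesystem rather than the remote one.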


On top of this, even if I try setting the following parameters, I can't make it work. And it's strange, because, as I said, with Spark 1.6 everything runs smoothly (even without these additional configurations):
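For reference, cross-cluster settings of this kind are usually passed as spark-submit --conf options; a purely hypothetical sketch (host names and values are placeholders, not the actual parameters tried):

```
spark-submit \
  --conf spark.hadoop.fs.defaultFS=hdfs://REMOTE_HDFS_HOST:8020 \
  --conf spark.hadoop.hive.metastore.uris=thrift://REMOTE_METASTORE_HOST:9083 \
  --conf spark.sql.warehouse.dir=hdfs://REMOTE_HDFS_HOST/user/hive/warehouse \
  my_script.py
```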





Also, please note that if I used a pure Spark DataFrames approach as follows, things would work, but I need to use Spark SQL, otherwise the output table I'll get won't be in a Hive-compatible format:


test_DF = sqlContext.sql("select * from rob.test_input_table")
test_DF.write.saveAsTable("rob.test_output_table")  # e.g., written via the DataFrame API



So this last approach is not feasible (the output table would not be in a Hive-compatible format).
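For completeness, I'm aware that in Spark 2.x a DataFrame can also be written explicitly as a Hive-serde table instead of the default parquet datasource table, along these lines, but I haven't verified whether this avoids the "Wrong FS" error in my cross-cluster setup:

```
# Sketch (Spark 2.x): request an explicit Hive-serde table on write;
# untested against the remote-metastore scenario described here.
test_DF = sqlContext.sql("select * from rob.test_input_table")
test_DF.write.format("hive").saveAsTable("rob.test_output_table")
```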


I found a lot of people with similar issues, but none of the cases I found applied exactly to mine, and I'm stuck at this point. Any help/hints would be greatly appreciated!


Thank you for any insights on this.