Created 05-16-2018 06:59 PM
Hi,
After installing HDP 2.6.3, I ran PySpark in the terminal, started a Spark session, and tried to create a new database (see the last line of the code below):
$ pyspark
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.master("local").appName("test").enableHiveSupport().getOrCreate()
>>> spark.sql("show databases").show()
>>> spark.sql("create database if not exists NEW_DB")
However, PySpark threw an error where it was trying to create a database locally:
AnalysisException: 'org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Unable to create database path file:/home/jdoe/spark-warehouse/new_db.db, failed to create database new_db);'
I wasn't trying to create a database locally. I was trying to create a database within Hive. Is there a configuration problem with HDP 2.6.3?
Please advise. Thanks.
Created 05-16-2018 07:22 PM
@John Doe Could you try running on yarn client mode instead of local? I think this will help resolving the problem you have now.
$ pyspark --master yarn
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()
>>> spark.sql("show databases").show()
>>> spark.sql("create database if not exists NEW_DB")
Note: If you comment on this post, make sure you tag my name. And if you found this answer addressed your question, please take a moment to log in and click the "accept" link on the answer.
HTH
Created 05-16-2018 08:58 PM
Hi @Felix Albani,
Thanks for your reply. Unfortunately, the suggestion didn't work. First, it took forever to launch PySpark with the
$ pyspark --master yarn
option (and I still don't understand why that option was needed). Also, when it did launch, it ultimately threw a bunch of Java errors.
Created 05-16-2018 09:12 PM
@John Doe Did it throw errors before or after running the code? I think it is expected to take longer, since it's launching an application on the cluster. Another option that may help you get past this issue is adding a LOCATION clause pointing to the directory where you'd like the database to be created. Something like this:
CREATE DATABASE IF NOT EXISTS abc LOCATION '/user/zeppelin/abc.db'
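For reference, in PySpark that could look something like the sketch below. The database name abc and the path /user/zeppelin/abc.db are just the examples from the statement above; the path assumes the running user has write access to that HDFS directory, so adjust both for your cluster:

>>> from pyspark.sql import SparkSession
>>> # enableHiveSupport() so the database is registered in the Hive metastore
>>> spark = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()
>>> # Explicit LOCATION resolves in HDFS instead of file:/home/<user>/spark-warehouse
>>> spark.sql("CREATE DATABASE IF NOT EXISTS abc LOCATION '/user/zeppelin/abc.db'")
>>> spark.sql("show databases").show()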
HTH
Created 05-17-2018 10:56 AM
Do you have sufficient permissions on the directory /home/jdoe/spark-warehouse?
Created 05-17-2018 04:05 PM
@Felix Albani
Thank you for your reply. That suggestion actually worked! However, I don't understand why it is necessary to specify the database location in HDFS. Why does that have to be done in HDP? In other Hadoop/Spark distributions, I haven't had to specify the database filepath and database name when creating Hive databases with Spark.
I still believe there is a configuration problem with Hive and Spark with HDP.
Created 05-17-2018 04:16 PM
According to this Hortonworks community URL, LOCATION is NOT mandatory. But it was the only way I was able to create a database.
Created 05-17-2018 04:32 PM
Hi @Felix Albani,
According to @Aditya Sirna's reply to a similar thread, Spark 2 (which is what I am using - NOT Spark 1) has a different warehouse location, which, I suppose, explains why LOCATION must be used.
@Aditya Sirna, if I want to create a Hive database with Spark, do I have to use the location statement? If so, what location statement should I use if I want to keep my databases and tables managed by the Hive metastore?
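In case it's useful: a hedged workaround I've seen is pointing Spark's warehouse directory at the Hive warehouse via spark.sql.warehouse.dir, so databases created without a LOCATION clause land under the metastore-managed path. The path /apps/hive/warehouse below is the usual HDP default, but it may differ on your cluster, so please verify it first:

>>> from pyspark.sql import SparkSession
>>> # Point Spark 2's warehouse at the (assumed) HDP Hive warehouse path
>>> # before the session is created; this setting is fixed at session startup.
>>> spark = (SparkSession.builder
...          .appName("test")
...          .config("spark.sql.warehouse.dir", "/apps/hive/warehouse")
...          .enableHiveSupport()
...          .getOrCreate())
>>> # No LOCATION needed now; new_db should be created under the Hive warehouse
>>> spark.sql("create database if not exists NEW_DB")

The same setting can also go in spark-defaults.conf so every session picks it up.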
Created 05-17-2018 04:40 PM
@John Doe Good to hear LOCATION helped. Please remember to mark the answer if you think it has helped you with the issue.