HDP 2.6 Spark can't create database - configuration issue?

Explorer

Hi,

After installing HDP 2.6.3, I ran pyspark in the terminal, started a Spark session, and tried to create a new database (see the last line of code below):

$ pyspark
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.master("local").appName("test").enableHiveSupport().getOrCreate()
>>> spark.sql("show databases").show()
>>> spark.sql("create database if not exists NEW_DB")

However, PySpark threw an error showing that it was trying to create the database locally:

AnalysisException: 'org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Unable to create database path file:/home/jdoe/spark-warehouse/new_db.db, failed to create database new_db);'

I wasn't trying to create a database locally. I was trying to create a database within Hive. Is there a configuration problem with HDP 2.6.3?

Please advise. Thanks.

1 ACCEPTED SOLUTION


@John Doe Could you try running in yarn client mode instead of local? I think this will help resolve the problem you have now.

$ pyspark --master yarn
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()
spark.sql("show databases").show()
spark.sql("create database if not exists NEW_DB")

Note: If you comment on this post, make sure you tag my name. And if you found this answer addressed your question, please take a moment to log in and click the "accept" link on the answer.

HTH


12 REPLIES


Explorer

Hi @Felix Albani,

Thanks for your reply. Unfortunately, the suggestion didn't work. First, it took forever to launch pyspark with the yarn option:

$ pyspark --master yarn

(and I still don't understand why that option is needed). Also, when it did launch, it ultimately threw a bunch of Java errors.


@John Doe Did it throw errors before or after running the code? It is expected to take longer, since it's launching an application on the cluster. Another option that may help you get past this issue is adding a LOCATION clause pointing to the directory where you'd like the database to be created. Something like this:

CREATE DATABASE IF NOT EXISTS abc LOCATION '/user/zeppelin/abc.db'
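
The same statement from within PySpark, if you'd rather stay in the shell (just the SQL above wrapped in spark.sql):

spark.sql("CREATE DATABASE IF NOT EXISTS abc LOCATION '/user/zeppelin/abc.db'")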

HTH


@John Doe

Do you have sufficient permissions on the directory /home/jdoe/spark-warehouse?
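
A quick way to check from Python (a local sketch; the path is taken from your error message):

import os

warehouse = "/home/jdoe/spark-warehouse"  # path from the error message
# if the directory exists, check it directly; otherwise check whether it could be created in its parent
target = warehouse if os.path.exists(warehouse) else os.path.dirname(warehouse)
print(os.access(target, os.W_OK | os.X_OK))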

Explorer

@Felix Albani

Thank you for your reply. That suggestion actually worked! However, I don't understand why it is necessary to specify the database location in HDFS. Why does that have to be done in HDP? In other Hadoop/Spark distributions, I haven't had to specify a database file path when creating Hive databases with Spark.

I still believe there is a configuration problem with Hive and Spark with HDP.
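
Would pointing spark.sql.warehouse.dir at the Hive warehouse avoid the per-database LOCATION? A sketch of what I mean, assuming the standard HDP warehouse path /apps/hive/warehouse (hive.metastore.warehouse.dir in hive-site.xml; mine may differ):

from pyspark.sql import SparkSession

# assumed HDP default warehouse path; check hive.metastore.warehouse.dir in hive-site.xml
spark = (SparkSession.builder
         .appName("test")
         .config("spark.sql.warehouse.dir", "/apps/hive/warehouse")
         .enableHiveSupport()
         .getOrCreate())
spark.sql("create database if not exists NEW_DB")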

Explorer

According to this Hortonworks community thread, LOCATION is NOT mandatory. But it was the only way I was able to create a database.

Explorer

Hi @Felix Albani,

According to @Aditya Sirna's reply to a similar thread, Spark 2 (which is what I am using, NOT Spark 1) uses a different default warehouse location, which, I suppose, explains why LOCATION must be used.

@Aditya Sirna, if I want to create a Hive database with Spark, do I have to use the location statement? If so, what location statement should I use if I want to keep my databases and tables managed by the Hive metastore?
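
For example, would something like this keep new_db under the Hive warehouse (assuming the usual HDP default of /apps/hive/warehouse)?

spark.sql("create database if not exists new_db location '/apps/hive/warehouse/new_db.db'")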


@John Doe Good to hear LOCATION helped. Please remember to mark the answer if you think it has helped you with the issue.