Created 07-07-2021 09:07 AM
I have a Spark cluster with one master and two workers on different servers. I have copied hive-site.xml into the Spark conf directory on all 3 servers and started the Thrift server on the master server, pointing to the Spark master. I am using this to connect Beeline to the Thrift server and run Spark SQL queries.
I created a scala application to load data from csv to dataframe and then to a spark sql table backed by S3 parquet. Metadata is in MySQL. I am using spark submit command to run the scala app on the cluster.
I do not see the table created by the scala app in beeline. However, when I use spark shell on master connected to spark master and do the same loading of a csv file and creating a table, I can see it in beeline. Am I missing something with the scala app? Also do we need HDFS to make this work?
hive.metastore.uris in hive-site.xml is not set currently. Not sure what I should set that to, since I don't have anything running on port 9083.
Also, I started the Thrift server (which runs on port 10000) from the Spark sbin directory like this: /opt/spark/sbin/start-thriftserver.sh --master spark://<master-ip>:7077 --total-executor-cores 1
This is my scala code:
def main(args: Array[String]) {
  println("Hello World!")
  val warehouseLocation = "/home/ubuntu/test/hive_warehouse"
  val spark = SparkSession
    .builder()
    .appName("Spark SQL basic example")
    .config("spark.sql.warehouse.dir", warehouseLocation)
    .enableHiveSupport()
    .getOrCreate()
  val df = spark.read.format("csv").load(args(0))
  df.createOrReplaceTempView("my_temp_table")
  spark.sql("create table test1_0522 location 's3a://<test-bucket>/data/test1_0522' stored as PARQUET as select * from my_temp_table")
  spark.sql("SHOW TABLES").show()
}
Created 07-07-2021 11:12 AM
@SparkNewbie
Can you add a REFRESH TABLE line between your 2 commands?
spark.sql("create table test1_0522 location 's3a://<test-bucket>/data/test1_0522' stored as PARQUET as select * from my_temp_table")
spark.sql("REFRESH TABLE test1_0522")
spark.sql("SHOW TABLES").show()
That should resolve the problem.
Happy hadooping
Created 07-07-2021 04:34 PM
Thank you for your response. I tried as you suggested but the table didn't show up in beeline.
I checked in mysql metastore db as well. The new table info does not show up there.
21/07/07 23:07:56 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
I still see the above in the stderr file even though my hive-site.xml is set to use MySQL. Not sure if I am missing something else.
<property>
  <name>hive.metastore.db.type</name>
  <value>mysql</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://<mysql end point>:3306/metastore_db</value>
</property>
Created 07-08-2021 02:33 PM
Bingo, you are using the Derby DB, which is only recommended for testing.
There are three modes for Hive Metastore deployment: embedded, local, and remote.
By default, the metastore service runs in the same JVM as the Hive service and uses an embedded Derby database stored on the local file system. This is embedded mode: both the metastore service and the Hive service run in the same JVM against the embedded Derby database.
This mode has a limitation: because only one embedded Derby instance can access the database files on disk at any one time, only one Hive session can be open at a time.
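In remote mode, which is what your unset hive.metastore.uris hints at, a standalone metastore service runs separately and clients connect to it over Thrift. If you were to go that route, the property would look something like this in hive-site.xml (the hostname is a placeholder):

```xml
<!-- Remote metastore mode: clients connect to a standalone metastore
     service over Thrift. <metastore-host> is a placeholder. -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://<metastore-host>:9083</value>
</property>
```

The service itself is started with `hive --service metastore`, which listens on port 9083 by default. But remote mode is optional here; for your setup, pointing the metastore directly at MySQL (local mode) is enough.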
21/07/07 23:07:56 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
Derby is an embedded relational database written in Java, used for online transaction processing, with a disk-space footprint of roughly 3.5 MB.
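A quick way to confirm the embedded Derby metastore is in play: Derby drops its files into the working directory of the process that created them, so check the directory where you launched spark-submit or the Thrift server (these are Derby's default names):

```shell
# If these exist, the process fell back to the embedded Derby metastore
ls -d metastore_db derby.log
```

If you find a metastore_db directory there, that process never read your MySQL settings from hive-site.xml.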
Depending on your distribution (HDP or CDH), ensure the Hive metastore is backed by an external MySQL database.
Check which backend metastore database your Hive service is currently using. After installing MySQL, update the Hive configuration to point to the external MySQL database.
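For reference, a MySQL-backed metastore usually needs the full set of JDO connection properties in hive-site.xml, not just the URL. A sketch, with placeholder credentials (the driver class shown is the classic MySQL Connector/J name; newer connectors use com.mysql.cj.jdbc.Driver):

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://<mysql end point>:3306/metastore_db</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value> <!-- placeholder user -->
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>password</value> <!-- placeholder password -->
</property>
```

The MySQL JDBC driver jar must also be on Spark's classpath (e.g. in Spark's jars directory), or the metastore client will fail to connect and can fall back to Derby.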
Once done, your commands and the refresh should succeed.
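Also worth checking: when the app is launched via spark-submit, the driver only picks up hive-site.xml if it is in the conf directory of the machine running the driver ($SPARK_CONF_DIR) or shipped with the job; otherwise Spark silently creates an embedded Derby metastore. A hedged sketch of the submit command (the class name, jar name and paths are placeholders, not taken from your app):

```shell
# Ship hive-site.xml with the job so the driver and executors use the
# MySQL metastore instead of falling back to embedded Derby.
/opt/spark/bin/spark-submit \
  --master spark://<master-ip>:7077 \
  --class com.example.CsvToTable \
  --files /opt/spark/conf/hive-site.xml \
  my-app.jar s3a://<test-bucket>/input.csv
```

If the table then shows up in both Beeline and the MySQL metastore_db tables (e.g. TBLS), the app and the Thrift server are finally sharing one metastore.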
Please let me know if you need help
Happy hadooping