Created on 04-03-2016 02:02 AM - edited 08-18-2019 05:37 AM
Hi, I'm trying to execute queries with Spark SQL over Hive tables stored on a single-node HDFS setup, but I'm having some problems getting Spark to start correctly. I already have Hadoop and Hive installed, and I have already created the tables in Hive with the data stored in HDFS.
I will describe my Hadoop and Hive configuration, and I hope someone who has already run queries with Spark over Hive tables can help and tell me what the steps are to install Spark correctly for this purpose.
I installed hadoop-2.7.1, extracted the files, added the environment variables, and configured core-site.xml and hdfs-site.xml.
core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Then I format the namenode with:
hadoop namenode -format
Then I start hadoop with:
./start-yarn.sh
./start-dfs.sh
And it seems that everything works:
[hadoopdadmin@hadoop sbin]$ jps
9601 NameNode
9699 DataNode
10003 Jps
9091 ResourceManager
9894 SecondaryNameNode
9191 NodeManager
Then, after installing Hadoop, I downloaded Hive 1.2.1 and just extracted the files and added the environment variables.
The .bashrc file is like this now:
export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64
export HADOOP_HOME=/usr/local/hadoop-2.7.1
export HIVE_HOME=/usr/local/apache-hive-1.2.1-bin
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HIVE_HOME/bin
To start Hive I just run hive, and it seems to work:
[hadoopadmin@hadoopSingleNode ~]$ hive
Logging initialized using configuration in jar:file:/usr/local/apache-hive-1.2.1-bin/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive>
I have some tables in Hive that I created with this command:
create table customer (
  C_CUSTKEY INT,
  C_NAME STRING,
  C_ADDRESS STRING,
  C_NATIONKEY INT,
  C_PHONE STRING,
  C_ACCTBAL DOUBLE,
  C_MKTSEGMENT STRING,
  C_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/tables/customer';
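Just to confirm that Hive really reads the data under /tables/customer, I run a quick sanity check from the hive prompt (just counting rows and looking at a few of them):

-- sanity check that the table reads data from /tables/customer
select count(*) from customer;
select * from customer limit 5;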
Now it's time to install Spark to query these Hive tables. What I'm doing is just downloading this version "http://www.apache.org/dyn/closer.lua/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz", extracting the files, and configuring the environment variables (see the snippet below). After this, when I run spark-shell I get a lot of errors.
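For reference, the Spark-related lines I added to .bashrc look roughly like this (the install path is just where I extracted the archive, so adjust it for your layout; HADOOP_CONF_DIR is something I added on top so Spark can find the Hadoop configs):

# path where the Spark archive was extracted
export SPARK_HOME=/usr/local/spark-1.6.1-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin
# point Spark at the Hadoop configuration (core-site.xml, hdfs-site.xml)
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop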
I have already tried a lot of things, but nothing fixes the issue. Can someone see what is wrong in my configuration steps or what is missing here?
These are the errors that appear after executing the spark-shell command:
Created 04-03-2016 06:45 PM
The Spark shell attempts to start a SQL context by default. The first thing I would check is whether you are pointing Spark at your existing Hive metastore. In your {SPARK_HOME}/conf folder you should have a hive-site.xml file. Make sure it has the following configuration:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://{IP of metastore host}:{port the metastore is listening on}</value>
</property>
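Concretely, a minimal {SPARK_HOME}/conf/hive-site.xml could look like the sketch below. I'm assuming here that the metastore service runs on the same host as Spark and listens on the default port 9083; if it is not running yet, you can start it with hive --service metastore.

<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <!-- assumes a local metastore service on the default port 9083 -->
    <value>thrift://localhost:9083</value>
  </property>
</configuration>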
This should tell the Spark shell to connect to your existing metastore instead of trying to create a default one, which is what it looks like it is trying to do. The SQL context should now be able to start up, and you should be able to access Hive through the default SQLContext.
val result = sqlContext.sql("SELECT * FROM {hive table name}")
result.show
If the Hive Context was not created by default then do this and retry the query:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
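As a quick end-to-end check against the customer table from your question (the table name is taken from your CREATE TABLE statement), something like this should then return rows:

// query the existing Hive table through the HiveContext and print a few rows
val customers = hiveContext.sql("SELECT * FROM customer LIMIT 10")
customers.show()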
Created 04-04-2016 01:35 PM
Thank you very much. Now it is working! It is just showing some warnings about "version information not found in metastore..." and "failed to get database default returning NoSuchObjectException". But since they are only warnings, it should be working fine, right?