Created 10-23-2017 03:00 PM
Hi,
I am using Spark 1.6 in my current HDP setup, and I have a task that requires working with Hive tables from Spark in Java.
I have noticed that I am able to connect to my database "TCGA" in spark-shell:
scala> sqlContext.sql("show tables in TCGA")
res0: org.apache.spark.sql.DataFrame = [tableName: string, isTemporary: boolean]

scala> sqlContext.sql("show tables in TCGA").show
17/10/22 21:02:11 INFO SparkContext: Starting job: show at <console>:26
17/10/22 21:02:17 INFO DAGScheduler: Got job 0 (show at <console>:26) with 1 output partitions
17/10/22 21:02:17 INFO DAGScheduler: Final stage: ResultStage 0 (show at <console>:26)
17/10/22 21:02:17 INFO DAGScheduler: Parents of final stage: List()
17/10/22 21:02:18 INFO DAGScheduler: Missing parents: List()
17/10/22 21:02:18 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at show at <console>:26), which has no missing parents
17/10/22 21:02:23 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1888.0 B, free 511.1 MB)
17/10/22 21:02:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1197.0 B, free 511.1 MB)
17/10/22 21:02:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:45108 (size: 1197.0 B, free: 511.1 MB)
17/10/22 21:02:25 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1008
17/10/22 21:02:28 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at show at <console>:26)
17/10/22 21:02:28 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/10/22 21:02:34 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 3156 bytes)
17/10/22 21:02:35 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/10/22 21:02:40 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2013 bytes result sent to driver
17/10/22 21:02:40 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 7863 ms on localhost (1/1)
17/10/22 21:02:41 INFO DAGScheduler: ResultStage 0 (show at <console>:26) finished in 12.361 s
17/10/22 21:02:41 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/10/22 21:02:42 INFO DAGScheduler: Job 0 finished: show at <console>:26, took 30.668589 s
+--------------------+-----------+
|           tableName|isTemporary|
+--------------------+-----------+
|      cbioportal_new|      false|
| cbioportal_new_feed|      false|
|cbioportal_new_in...|      false|
|cbioportal_new_pr...|      false|
|cbioportal_new_valid|      false|
|firebrowse_simple...|      false|
|firebrowse_simple...|      false|
|firebrowse_simple...|      false|
|firebrowse_simple...|      false|
|firebrowse_simple...|      false|
|                test|      false|
|           test_feed|      false|
|        test_invalid|      false|
|        test_profile|      false|
|          test_valid|      false|
+--------------------+-----------+
However, when I try the same setup in Java, I get an empty list of tables in my database TCGA:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/10/23 00:28:28 INFO SparkContext: Running Spark version 1.6.3
17/10/23 00:28:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/10/23 00:28:30 INFO SecurityManager: Changing view acls to: root
17/10/23 00:28:30 INFO SecurityManager: Changing modify acls to: root
17/10/23 00:28:30 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
17/10/23 00:28:32 INFO Utils: Successfully started service 'sparkDriver' on port 43887.
17/10/23 00:28:32 INFO Slf4jLogger: Slf4jLogger started
17/10/23 00:28:33 INFO Remoting: Starting remoting
17/10/23 00:28:33 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@10.0.2.15:33809]
17/10/23 00:28:33 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 33809.
17/10/23 00:28:33 INFO SparkEnv: Registering MapOutputTracker
17/10/23 00:28:34 INFO SparkEnv: Registering BlockManagerMaster
17/10/23 00:28:34 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-a9fe7e0c-5862-4293-b14c-c218d0a85121
17/10/23 00:28:34 INFO MemoryStore: MemoryStore started with capacity 1579.1 MB
17/10/23 00:28:34 INFO SparkEnv: Registering OutputCommitCoordinator
17/10/23 00:28:34 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/10/23 00:28:34 INFO SparkUI: Started SparkUI at http://10.0.2.15:4040
17/10/23 00:28:34 INFO Executor: Starting executor ID driver on host localhost
17/10/23 00:28:34 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40068.
17/10/23 00:28:34 INFO NettyBlockTransferService: Server created on 40068
17/10/23 00:28:34 INFO BlockManagerMaster: Trying to register BlockManager
17/10/23 00:28:34 INFO BlockManagerMasterEndpoint: Registering block manager localhost:40068 with 1579.1 MB RAM, BlockManagerId(driver, localhost, 40068)
17/10/23 00:28:34 INFO BlockManagerMaster: Registered BlockManager
17/10/23 00:28:37 INFO HiveContext: Initializing execution hive, version 1.2.1
17/10/23 00:28:37 INFO ClientWrapper: Inspected Hadoop version: 2.5.1
17/10/23 00:28:37 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.5.1
17/10/23 00:28:38 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
17/10/23 00:28:38 INFO ObjectStore: ObjectStore, initialize called
17/10/23 00:28:39 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
17/10/23 00:28:39 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
17/10/23 00:28:45 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
17/10/23 00:28:49 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/10/23 00:28:49 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/10/23 00:29:00 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/10/23 00:29:00 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/10/23 00:29:02 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
17/10/23 00:29:02 INFO ObjectStore: Initialized ObjectStore
17/10/23 00:29:04 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/10/23 00:29:05 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
17/10/23 00:29:07 INFO HiveMetaStore: Added admin role in metastore
17/10/23 00:29:07 INFO HiveMetaStore: Added public role in metastore
17/10/23 00:29:09 INFO HiveMetaStore: No user is added in admin role, since config is empty
17/10/23 00:29:13 INFO HiveMetaStore: 0: get_all_databases
17/10/23 00:29:13 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_all_databases
17/10/23 00:29:14 INFO HiveMetaStore: 0: get_functions: db=default pat=*
17/10/23 00:29:14 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_functions: db=default pat=*
17/10/23 00:29:14 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
17/10/23 00:29:19 INFO SessionState: Created local directory: /tmp/1361a727-483a-424a-8936-ee5979fb5a02_resources
17/10/23 00:29:19 INFO SessionState: Created HDFS directory: /tmp/hive/root/1361a727-483a-424a-8936-ee5979fb5a02
17/10/23 00:29:19 INFO SessionState: Created local directory: /tmp/root/1361a727-483a-424a-8936-ee5979fb5a02
17/10/23 00:29:19 INFO SessionState: Created HDFS directory: /tmp/hive/root/1361a727-483a-424a-8936-ee5979fb5a02/_tmp_space.db
17/10/23 00:29:20 INFO HiveContext: default warehouse location is /user/hive/warehouse
17/10/23 00:29:20 INFO HiveContext: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
17/10/23 00:29:20 INFO ClientWrapper: Inspected Hadoop version: 2.5.1
17/10/23 00:29:33 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.5.1
17/10/23 00:29:38 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
17/10/23 00:29:39 INFO ObjectStore: ObjectStore, initialize called
17/10/23 00:29:39 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
17/10/23 00:29:39 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
17/10/23 00:29:44 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
17/10/23 00:29:50 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/10/23 00:29:50 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/10/23 00:29:51 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/10/23 00:29:51 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/10/23 00:29:51 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
17/10/23 00:29:51 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
17/10/23 00:29:51 INFO ObjectStore: Initialized ObjectStore
17/10/23 00:30:00 INFO HiveMetaStore: Added admin role in metastore
17/10/23 00:30:00 INFO HiveMetaStore: Added public role in metastore
17/10/23 00:30:03 INFO HiveMetaStore: No user is added in admin role, since config is empty
17/10/23 00:30:12 INFO HiveMetaStore: 0: get_all_databases
17/10/23 00:30:12 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_all_databases
17/10/23 00:30:13 INFO HiveMetaStore: 0: get_functions: db=default pat=*
17/10/23 00:30:13 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_functions: db=default pat=*
17/10/23 00:30:13 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
17/10/23 00:30:22 INFO SessionState: Created local directory: /tmp/bc93d037-7940-4e83-bc24-02d93dff54bf_resources
17/10/23 00:30:22 INFO SessionState: Created HDFS directory: /tmp/hive/root/bc93d037-7940-4e83-bc24-02d93dff54bf
17/10/23 00:30:22 INFO SessionState: Created local directory: /tmp/root/bc93d037-7940-4e83-bc24-02d93dff54bf
17/10/23 00:30:22 INFO SessionState: Created HDFS directory: /tmp/hive/root/bc93d037-7940-4e83-bc24-02d93dff54bf/_tmp_space.db
17/10/23 00:30:41 INFO HiveMetaStore: 0: get_tables: db=TCGA pat=.*
17/10/23 00:30:41 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_tables: db=TCGA pat=.*
17/10/23 00:31:08 INFO SparkContext: Starting job: show at MavenMainHbase.java:46
17/10/23 00:31:14 INFO DAGScheduler: Got job 0 (show at MavenMainHbase.java:46) with 1 output partitions
17/10/23 00:31:14 INFO DAGScheduler: Final stage: ResultStage 0 (show at MavenMainHbase.java:46)
17/10/23 00:31:14 INFO DAGScheduler: Parents of final stage: List()
17/10/23 00:31:14 INFO DAGScheduler: Missing parents: List()
17/10/23 00:31:17 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at show at MavenMainHbase.java:46), which has no missing parents
17/10/23 00:31:31 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1824.0 B, free 1579.1 MB)
17/10/23 00:31:31 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1175.0 B, free 1579.1 MB)
17/10/23 00:31:31 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:40068 (size: 1175.0 B, free: 1579.1 MB)
17/10/23 00:31:31 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
17/10/23 00:31:33 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at show at MavenMainHbase.java:46)
17/10/23 00:31:33 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/10/23 00:31:37 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2105 bytes)
17/10/23 00:31:38 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/10/23 00:31:38 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 940 bytes result sent to driver
17/10/23 00:31:39 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1815 ms on localhost (1/1)
17/10/23 00:31:39 INFO DAGScheduler: ResultStage 0 (show at MavenMainHbase.java:46) finished in 5.187 s
17/10/23 00:31:39 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/10/23 00:31:40 INFO DAGScheduler: Job 0 finished: show at MavenMainHbase.java:46, took 32.074415 s
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
+---------+-----------+
Here is the sample Java code I used to get the above result:
SparkConf conf = new SparkConf()
        .setAppName("SparkHive")
        .setMaster("local")
        .setSparkHome("/usr/hdp/2.5.6.0-40/spark/")
        .set("HADOOP_CONF_DIR", "/usr/hdp/2.5.6.0-40/hive/conf/")
        .set("spark.driver.extraClassPath", "/usr/hdp/2.5.6.0-40/hive/conf");
conf.set("spark.sql.hive.thriftServer.singleSession", "true");

SparkContext sc = new SparkContext(conf);
HiveContext hiveContext = new org.apache.spark.sql.hive.HiveContext(sc);
hiveContext.setConf("hive.metastore.uris", "thrift://sandbox.kylo.io:9083");
hiveContext.setConf("spark.sql.warehouse.dir", "/user/hive/warehouse");

DataFrame df = hiveContext.sql("show tables in TCGA");
df.show();
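One variant I am considering, in case the HiveContext initializes before it sees the metastore URI: pass hive.metastore.uris through the SparkConf using the spark.hadoop. prefix, so it lands in the Hadoop configuration before any Hive machinery starts. A minimal sketch, and whether this prefix actually propagates to the embedded Hive client in this setup is an assumption on my part:

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class ShowTablesSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("SparkHive")
                .setMaster("local")
                // Set the metastore URI up front, so the HiveContext should
                // not fall back to creating a local Derby metastore.
                .set("spark.hadoop.hive.metastore.uris", "thrift://sandbox.kylo.io:9083");

        SparkContext sc = new SparkContext(conf);
        HiveContext hiveContext = new HiveContext(sc);

        DataFrame df = hiveContext.sql("show tables in TCGA");
        df.show();
    }
}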
And here is my pom.xml:
<dependencies>
  <!-- https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-core -->
  <dependency>
    <groupId>org.apache.phoenix</groupId>
    <artifactId>phoenix-core</artifactId>
    <version>4.4.0-HBase-1.1</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-hbase-handler -->
  <dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-hbase-handler</artifactId>
    <version>1.2.1</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
  <dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.2.1</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc -->
  <dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>1.2.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-metastore</artifactId>
    <version>1.2.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.thrift</groupId>
    <artifactId>libthrift</artifactId>
    <version>0.9.0</version>
    <type>pom</type>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.apache.thrift/libfb303 -->
  <dependency>
    <groupId>org.apache.thrift</groupId>
    <artifactId>libfb303</artifactId>
    <version>0.9.0</version>
    <type>pom</type>
  </dependency>
  <!-- https://mvnrepository.com/artifact/commons-httpclient/commons-httpclient -->
  <dependency>
    <groupId>commons-httpclient</groupId>
    <artifactId>commons-httpclient</artifactId>
    <version>3.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient-osgi</artifactId>
    <version>4.3-beta2</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-contrib -->
  <dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-contrib</artifactId>
    <version>1.2.1</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10 -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.3</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10 -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>1.6.3</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive_2.10 -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.10</artifactId>
    <version>1.6.3</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-databind -->
  <dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.4.4</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-core -->
  <dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-core</artifactId>
    <version>2.4.4</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-annotations -->
  <dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-annotations</artifactId>
    <version>2.4.4</version>
  </dependency>
</dependencies>
I think this has to do with the code being unable to find hive-site.xml, so I added every classpath I could think of to the SparkConf to make it work, but no luck yet. What other configurations do I have to set?
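To test the hive-site.xml theory directly, one check I can run from the same JVM is to ask the classloader where (or whether) it finds the file. A minimal sketch, nothing HDP-specific:

public class HiveSiteCheck {
    public static void main(String[] args) {
        // If this prints null, hive-site.xml is not on the driver classpath,
        // and the HiveContext will fall back to a local Derby metastore.
        java.net.URL hiveSite = Thread.currentThread()
                .getContextClassLoader()
                .getResource("hive-site.xml");
        System.out.println("hive-site.xml resolved to: " + hiveSite);
    }
}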
Created 10-23-2017 08:09 PM
@Jasim, can you post the metastore logs here for debugging?
Created 10-23-2017 10:23 PM
hivemetastorelog.txt
@nkumar, attached is the log from the metastore.
Created 10-24-2017 05:25 AM
@Jasim,
From the metastore log, the line that seems to be of interest is this one: "[org.apache.hadoop.hive.common.JvmPauseMonitor$Monitor@7688c6c2]: common.JvmPauseMonitor (JvmPauseMonitor.java:run(193)) - Detected pause in JVM or host machine (eg GC): pause of approximately 8978ms"
This means there was a memory issue in the Java heap of the HiveServer process, which triggered the GC pause, so the query itself probably hung. Also, after this point the log you shared only shows the query in progress, never completing.
So it would be a good idea to check the Java heap size and the other relevant parameters:
-XX:NewRatio= ?
-XX:MaxHeapFreeRatio= ?
-XX:MinHeapFreeRatio= ?
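If it helps, you can dump the flags the JVM was actually launched with (and its effective max heap) from inside the affected process. A small sketch using the standard JMX runtime bean, nothing Hive-specific:

import java.lang.management.ManagementFactory;

public class JvmFlagsDump {
    public static void main(String[] args) {
        // Print the raw JVM arguments (-Xmx, -XX:NewRatio=..., etc.) the
        // process was launched with.
        for (String arg : ManagementFactory.getRuntimeMXBean().getInputArguments()) {
            System.out.println(arg);
        }
        // Effective max heap in MB, however it was configured.
        System.out.println("maxMemory (MB): "
                + Runtime.getRuntime().maxMemory() / (1024 * 1024));
    }
}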
Created 10-24-2017 10:15 AM
@nkumar,
I have doubts about hive-site.xml, because what I observe is that Spark creates its own local Derby metastore instance rather than referencing the pre-existing metastore DB. Is there something I should do regarding the classpath?
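To confirm which metastore configuration is actually visible to my application, one option is to instantiate a bare HiveConf, which loads hive-site.xml from the classpath the same way the Hive client inside HiveContext does, and print the relevant properties. A minimal sketch (the property names are the standard Hive ones; the "<unset>" defaults are just mine):

import org.apache.hadoop.hive.conf.HiveConf;

public class MetastoreProbe {
    public static void main(String[] args) {
        // HiveConf picks up hive-site.xml from the classpath. If the URI below
        // is unset and the ConnectionURL points at a local Derby database,
        // hive-site.xml is not being found.
        HiveConf hiveConf = new HiveConf();
        System.out.println("hive.metastore.uris = "
                + hiveConf.get("hive.metastore.uris", "<unset>"));
        System.out.println("javax.jdo.option.ConnectionURL = "
                + hiveConf.get("javax.jdo.option.ConnectionURL", "<unset>"));
    }
}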
Created 12-04-2019 12:12 PM
If this issue was fixed, can you share the sample code? I am looking to use 1.6.3 and need some Java code samples for extracting data from Hive.