
HiveContext unable to connect to existing Metastore using Java

Contributor

Hi,

I am using Spark 1.6 in my current HDP setup, and my task is to work with Hive tables from Spark in Java.
I have noticed that I am able to connect to my database "TCGA" from the spark-shell:

scala> sqlContext.sql("show tables in TCGA")
res0: org.apache.spark.sql.DataFrame = [tableName: string, isTemporary: boolean]

scala> sqlContext.sql("show tables in TCGA").show
17/10/22 21:02:11 INFO SparkContext: Starting job: show at <console>:26
17/10/22 21:02:17 INFO DAGScheduler: Got job 0 (show at <console>:26) with 1 output partitions
17/10/22 21:02:17 INFO DAGScheduler: Final stage: ResultStage 0 (show at <console>:26)
17/10/22 21:02:17 INFO DAGScheduler: Parents of final stage: List()
17/10/22 21:02:18 INFO DAGScheduler: Missing parents: List()
17/10/22 21:02:18 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at show at <console>:26), which has no missing parents
17/10/22 21:02:23 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1888.0 B, free 511.1 MB)
17/10/22 21:02:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1197.0 B, free 511.1 MB)
17/10/22 21:02:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:45108 (size: 1197.0 B, free: 511.1 MB)
17/10/22 21:02:25 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1008
17/10/22 21:02:28 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at show at <console>:26)
17/10/22 21:02:28 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/10/22 21:02:34 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 3156 bytes)
17/10/22 21:02:35 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/10/22 21:02:40 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2013 bytes result sent to driver
17/10/22 21:02:40 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 7863 ms on localhost (1/1)
17/10/22 21:02:41 INFO DAGScheduler: ResultStage 0 (show at <console>:26) finished in 12.361 s
17/10/22 21:02:41 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
17/10/22 21:02:42 INFO DAGScheduler: Job 0 finished: show at <console>:26, took 30.668589 s
+--------------------+-----------+
|           tableName|isTemporary|
+--------------------+-----------+
|      cbioportal_new|      false|
| cbioportal_new_feed|      false|
|cbioportal_new_in...|      false|
|cbioportal_new_pr...|      false|
|cbioportal_new_valid|      false|
|firebrowse_simple...|      false|
|firebrowse_simple...|      false|
|firebrowse_simple...|      false|
|firebrowse_simple...|      false|
|firebrowse_simple...|      false|
|                test|      false|
|           test_feed|      false|
|        test_invalid|      false|
|        test_profile|      false|
|          test_valid|      false|
+--------------------+-----------+

However, when I try the same setup in Java, I get an empty list of tables in my database TCGA:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/10/23 00:28:28 INFO SparkContext: Running Spark version 1.6.3
17/10/23 00:28:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/10/23 00:28:30 INFO SecurityManager: Changing view acls to: root
17/10/23 00:28:30 INFO SecurityManager: Changing modify acls to: root
17/10/23 00:28:30 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
17/10/23 00:28:32 INFO Utils: Successfully started service 'sparkDriver' on port 43887.
17/10/23 00:28:32 INFO Slf4jLogger: Slf4jLogger started
17/10/23 00:28:33 INFO Remoting: Starting remoting
17/10/23 00:28:33 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@10.0.2.15:33809]
17/10/23 00:28:33 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 33809.
17/10/23 00:28:33 INFO SparkEnv: Registering MapOutputTracker
17/10/23 00:28:34 INFO SparkEnv: Registering BlockManagerMaster
17/10/23 00:28:34 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-a9fe7e0c-5862-4293-b14c-c218d0a85121
17/10/23 00:28:34 INFO MemoryStore: MemoryStore started with capacity 1579.1 MB
17/10/23 00:28:34 INFO SparkEnv: Registering OutputCommitCoordinator
17/10/23 00:28:34 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/10/23 00:28:34 INFO SparkUI: Started SparkUI at http://10.0.2.15:4040
17/10/23 00:28:34 INFO Executor: Starting executor ID driver on host localhost
17/10/23 00:28:34 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40068.
17/10/23 00:28:34 INFO NettyBlockTransferService: Server created on 40068
17/10/23 00:28:34 INFO BlockManagerMaster: Trying to register BlockManager
17/10/23 00:28:34 INFO BlockManagerMasterEndpoint: Registering block manager localhost:40068 with 1579.1 MB RAM, BlockManagerId(driver, localhost, 40068)
17/10/23 00:28:34 INFO BlockManagerMaster: Registered BlockManager
17/10/23 00:28:37 INFO HiveContext: Initializing execution hive, version 1.2.1
17/10/23 00:28:37 INFO ClientWrapper: Inspected Hadoop version: 2.5.1
17/10/23 00:28:37 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.5.1
17/10/23 00:28:38 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
17/10/23 00:28:38 INFO ObjectStore: ObjectStore, initialize called
17/10/23 00:28:39 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
17/10/23 00:28:39 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
17/10/23 00:28:45 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
17/10/23 00:28:49 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/10/23 00:28:49 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/10/23 00:29:00 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/10/23 00:29:00 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/10/23 00:29:02 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
17/10/23 00:29:02 INFO ObjectStore: Initialized ObjectStore
17/10/23 00:29:04 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/10/23 00:29:05 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
17/10/23 00:29:07 INFO HiveMetaStore: Added admin role in metastore
17/10/23 00:29:07 INFO HiveMetaStore: Added public role in metastore
17/10/23 00:29:09 INFO HiveMetaStore: No user is added in admin role, since config is empty
17/10/23 00:29:13 INFO HiveMetaStore: 0: get_all_databases
17/10/23 00:29:13 INFO audit: ugi=root    ip=unknown-ip-addr    cmd=get_all_databases    
17/10/23 00:29:14 INFO HiveMetaStore: 0: get_functions: db=default pat=*
17/10/23 00:29:14 INFO audit: ugi=root    ip=unknown-ip-addr    cmd=get_functions: db=default pat=*    
17/10/23 00:29:14 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
17/10/23 00:29:19 INFO SessionState: Created local directory: /tmp/1361a727-483a-424a-8936-ee5979fb5a02_resources
17/10/23 00:29:19 INFO SessionState: Created HDFS directory: /tmp/hive/root/1361a727-483a-424a-8936-ee5979fb5a02
17/10/23 00:29:19 INFO SessionState: Created local directory: /tmp/root/1361a727-483a-424a-8936-ee5979fb5a02
17/10/23 00:29:19 INFO SessionState: Created HDFS directory: /tmp/hive/root/1361a727-483a-424a-8936-ee5979fb5a02/_tmp_space.db
17/10/23 00:29:20 INFO HiveContext: default warehouse location is /user/hive/warehouse
17/10/23 00:29:20 INFO HiveContext: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
17/10/23 00:29:20 INFO ClientWrapper: Inspected Hadoop version: 2.5.1
17/10/23 00:29:33 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.5.1
17/10/23 00:29:38 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
17/10/23 00:29:39 INFO ObjectStore: ObjectStore, initialize called
17/10/23 00:29:39 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
17/10/23 00:29:39 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
17/10/23 00:29:44 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
17/10/23 00:29:50 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/10/23 00:29:50 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/10/23 00:29:51 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/10/23 00:29:51 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/10/23 00:29:51 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
17/10/23 00:29:51 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
17/10/23 00:29:51 INFO ObjectStore: Initialized ObjectStore
17/10/23 00:30:00 INFO HiveMetaStore: Added admin role in metastore
17/10/23 00:30:00 INFO HiveMetaStore: Added public role in metastore
17/10/23 00:30:03 INFO HiveMetaStore: No user is added in admin role, since config is empty
17/10/23 00:30:12 INFO HiveMetaStore: 0: get_all_databases
17/10/23 00:30:12 INFO audit: ugi=root    ip=unknown-ip-addr    cmd=get_all_databases    
17/10/23 00:30:13 INFO HiveMetaStore: 0: get_functions: db=default pat=*
17/10/23 00:30:13 INFO audit: ugi=root    ip=unknown-ip-addr    cmd=get_functions: db=default pat=*    
17/10/23 00:30:13 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
17/10/23 00:30:22 INFO SessionState: Created local directory: /tmp/bc93d037-7940-4e83-bc24-02d93dff54bf_resources
17/10/23 00:30:22 INFO SessionState: Created HDFS directory: /tmp/hive/root/bc93d037-7940-4e83-bc24-02d93dff54bf
17/10/23 00:30:22 INFO SessionState: Created local directory: /tmp/root/bc93d037-7940-4e83-bc24-02d93dff54bf
17/10/23 00:30:22 INFO SessionState: Created HDFS directory: /tmp/hive/root/bc93d037-7940-4e83-bc24-02d93dff54bf/_tmp_space.db
17/10/23 00:30:41 INFO HiveMetaStore: 0: get_tables: db=TCGA pat=.*
17/10/23 00:30:41 INFO audit: ugi=root    ip=unknown-ip-addr    cmd=get_tables: db=TCGA pat=.*    
17/10/23 00:31:08 INFO SparkContext: Starting job: show at MavenMainHbase.java:46
17/10/23 00:31:14 INFO DAGScheduler: Got job 0 (show at MavenMainHbase.java:46) with 1 output partitions
17/10/23 00:31:14 INFO DAGScheduler: Final stage: ResultStage 0 (show at MavenMainHbase.java:46)
17/10/23 00:31:14 INFO DAGScheduler: Parents of final stage: List()
17/10/23 00:31:14 INFO DAGScheduler: Missing parents: List()
17/10/23 00:31:17 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at show at MavenMainHbase.java:46), which has no missing parents
17/10/23 00:31:31 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1824.0 B, free 1579.1 MB)
17/10/23 00:31:31 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1175.0 B, free 1579.1 MB)
17/10/23 00:31:31 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:40068 (size: 1175.0 B, free: 1579.1 MB)
17/10/23 00:31:31 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
17/10/23 00:31:33 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at show at MavenMainHbase.java:46)
17/10/23 00:31:33 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/10/23 00:31:37 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2105 bytes)
17/10/23 00:31:38 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/10/23 00:31:38 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 940 bytes result sent to driver
17/10/23 00:31:39 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1815 ms on localhost (1/1)
17/10/23 00:31:39 INFO DAGScheduler: ResultStage 0 (show at MavenMainHbase.java:46) finished in 5.187 s
17/10/23 00:31:39 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
17/10/23 00:31:40 INFO DAGScheduler: Job 0 finished: show at MavenMainHbase.java:46, took 32.074415 s
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
+---------+-----------+


Here is the sample Java code that I used to get the above result:

SparkConf conf = new SparkConf()
        .setAppName("SparkHive")
        .setMaster("local")
        .setSparkHome("/usr/hdp/2.5.6.0-40/spark/")
        .set("HADOOP_CONF_DIR", "/usr/hdp/2.5.6.0-40/hive/conf/")
        .set("spark.driver.extraClassPath", "/usr/hdp/2.5.6.0-40/hive/conf");
conf.set("spark.sql.hive.thriftServer.singleSession", "true");
SparkContext sc = new SparkContext(conf);
HiveContext hiveContext = new org.apache.spark.sql.hive.HiveContext(sc);
hiveContext.setConf("hive.metastore.uris", "thrift://sandbox.kylo.io:9083");
hiveContext.setConf("spark.sql.warehouse.dir", "/user/hive/warehouse");
DataFrame df = hiveContext.sql("show tables in TCGA");
df.show();
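
One thing I plan to try next, in case the setConf call comes too late (the log above shows the metastore connection falling back to an embedded Derby database before my setConf could take effect): set the metastore URI on the SparkConf itself, before the HiveContext is created. A rough sketch, my assumption being that the "spark.hadoop." prefix propagates the property into the Hadoop configuration that the Hive client reads:

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.hive.HiveContext;

SparkConf conf = new SparkConf()
        .setAppName("SparkHive")
        .setMaster("local")
        // Assumption: "spark.hadoop."-prefixed properties are copied into the
        // Hadoop Configuration, so the metastore client sees this URI during
        // HiveContext initialization instead of defaulting to embedded Derby.
        .set("spark.hadoop.hive.metastore.uris", "thrift://sandbox.kylo.io:9083");
SparkContext sc = new SparkContext(conf);
HiveContext hiveContext = new HiveContext(sc);
hiveContext.sql("show tables in TCGA").show();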


And here is my pom.xml:

<dependencies>

    <!-- https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-core -->
    <dependency>
        <groupId>org.apache.phoenix</groupId>
        <artifactId>phoenix-core</artifactId>
        <version>4.4.0-HBase-1.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-hbase-handler -->
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-hbase-handler</artifactId>
        <version>1.2.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>1.2.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc -->
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-jdbc</artifactId>
        <version>1.2.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-metastore</artifactId>
        <version>1.2.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.thrift</groupId>
        <artifactId>libthrift</artifactId>
        <version>0.9.0</version>
        <type>pom</type>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.thrift/libfb303 -->
    <dependency>
        <groupId>org.apache.thrift</groupId>
        <artifactId>libfb303</artifactId>
        <version>0.9.0</version>
        <type>pom</type>
    </dependency>
    <!-- https://mvnrepository.com/artifact/commons-httpclient/commons-httpclient -->
    <dependency>
        <groupId>commons-httpclient</groupId>
        <artifactId>commons-httpclient</artifactId>
        <version>3.1</version>
    </dependency>


    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient-osgi</artifactId>
        <version>4.3-beta2</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-contrib -->
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-contrib</artifactId>
        <version>1.2.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10 -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.6.3</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10 -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.10</artifactId>
        <version>1.6.3</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive_2.10 -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.10</artifactId>
        <version>1.6.3</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-databind -->
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-databind</artifactId>
        <version>2.4.4</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-core -->
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-core</artifactId>
        <version>2.4.4</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-annotations -->
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-annotations</artifactId>
        <version>2.4.4</version>
    </dependency>
</dependencies>

I think it has to do with the code being unable to find hive-site.xml, so I gave all possible classpaths in the SparkConf to make it work, but no luck yet. What other configurations do I have to set?
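
In the meantime, here is a small diagnostic sketch I put together (the checks are my own guesses at what to verify, reusing the hiveContext from the snippet above) to confirm whether hive-site.xml is visible on the application classpath and which metastore URI the context actually resolves:

// If this prints null, hive-site.xml is not on the application classpath and
// the HiveContext silently falls back to a local embedded Derby metastore.
System.out.println(Thread.currentThread().getContextClassLoader()
        .getResource("hive-site.xml"));
// "<unset>" here would mean the setConf value never reached the session either.
System.out.println(hiveContext.getConf("hive.metastore.uris", "<unset>"));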

5 REPLIES

Contributor

@Jasim, can you post the metastore logs here for debugging?

Contributor

hivemetastorelog.txt

@nkumar, attached is the log from the metastore.

Contributor

@Jasim,

From the metastore log, the line that seems to be of interest is: "[org.apache.hadoop.hive.common.JvmPauseMonitor$Monitor@7688c6c2]: common.JvmPauseMonitor (JvmPauseMonitor.java:run(193)) - Detected pause in JVM or host machine (eg GC): pause of approximately 8978ms"

This indicates memory pressure in the Java heap of the HiveServer process, which caused the JVM to pause (likely for garbage collection). The query itself probably hung as a result. Also, after this point the log you shared only shows the query in progress, never completing.

So, it would be a good idea to check the Java heap memory and other relevant parameters, for example:

-XX:NewRatio= ? 
-XX:MaxHeapFreeRatio= ? 
-XX:MinHeapFreeRatio= ? 
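
As a quick aside, this generic snippet (my illustration, nothing HDP-specific) prints the effective heap limit and the startup flags of the JVM it runs in, which can help confirm what a given set of options actually translates to:

import java.lang.management.ManagementFactory;

public class JvmFlagCheck {
    public static void main(String[] args) {
        // Maximum heap this JVM will grow to, in MB.
        System.out.println("Max heap: "
                + Runtime.getRuntime().maxMemory() / (1024 * 1024) + " MB");
        // The -Xmx / -XX style flags the JVM was actually started with.
        System.out.println("JVM args: "
                + ManagementFactory.getRuntimeMXBean().getInputArguments());
    }
}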

Contributor

@nkumar,

I have doubts about hive-site.xml, because what I observe is that the application creates its own Derby DB instance rather than referencing the pre-existing metastore DB. Is there something I should do regarding the classpath?
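
One sign of this, if my reading is right: an embedded Derby metastore creates a metastore_db directory in the application's working directory and only knows the default database. A quick check along those lines:

// Sketch of a quick check (my assumption): a freshly created embedded Derby
// metastore only contains the "default" database, so if TCGA is missing here
// the context never reached the real metastore.
hiveContext.sql("show databases").show();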


If this issue has been fixed, can you share the sample code? I am looking to use Spark 1.6.3 and need some Java code samples for extracting data from Hive.