Member since: 10-09-2015
Posts: 76
Kudos Received: 33
Solutions: 11

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 5030 | 03-09-2017 09:08 PM
 | 5365 | 02-23-2017 08:01 AM
 | 1754 | 02-21-2017 03:04 AM
 | 2127 | 02-16-2017 08:00 AM
 | 1122 | 01-26-2017 06:32 PM
02-15-2017
03:01 AM
A full stack trace would help identify which interaction is causing this. If IDE-based code is being used, you could try not using the spark-assembly jar that is present on HDFS and instead use the local spark-assembly jar from the Spark build you are compiling against. This can be done by overriding the spark.yarn.jar config. It could be that the compile-time Spark dependency in your IDE differs from the runtime dependency on HDFS. Another possibility is a Scala version mismatch.
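For example, a submission that points spark.yarn.jar at the locally built assembly might look like this sketch (the assembly path, class, and jar names below are placeholders for your own build):

```bash
# Sketch only: use the spark-assembly jar from the build you compile against
# instead of the one on HDFS. Path, class, and jar names are placeholders.
spark-submit \
  --master yarn \
  --conf spark.yarn.jar=file:///opt/spark/lib/spark-assembly-1.6.2-hadoop2.7.1.jar \
  --class com.example.MyApp \
  myapp.jar
```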
02-07-2017
06:43 PM
Yes. For Spark 1.6, it is GA in HDP 2.5.3. The documentation is available on the GitHub site for the given SHC release tag; that is the source of truth.
02-01-2017
10:01 PM
1 Kudo
--files will add the files to the working directory of the YARN app master and containers, which means those files (and not jars) will be on the classpath of the app master and containers. But in client-mode jobs the main driver code runs on the client machine, so these --files are not available to the driver. SPARK_CLASSPATH adds entries to the driver classpath; it is an environment variable, so one could set the following. Note that Spark will warn that it is deprecated and cannot be used concurrently with the --driver-class-path option. More information can be found at https://github.com/hortonworks-spark/shc

export SPARK_CLASSPATH=/a/b/c:/d/e/f

where /a/b/c and /d/e/f are the directories containing hbase-site.xml and hive-site.xml (classpath entries must be directories or jars, separated by ':').
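The non-deprecated alternative for the driver side is the --driver-class-path option; a sketch for client mode (the configuration directories and application names are placeholders):

```bash
# Sketch for client mode: ship the config files to the executors with --files
# and put their directories on the driver classpath. Paths are placeholders.
spark-submit \
  --master yarn --deploy-mode client \
  --files /etc/hbase/conf/hbase-site.xml,/etc/hive/conf/hive-site.xml \
  --driver-class-path /etc/hbase/conf:/etc/hive/conf \
  --class com.example.MyApp myapp.jar
```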
01-30-2017
03:49 AM
2 Kudos
Unfortunately, that kind of functionality does not exist for Spark Streaming. A Spark Streaming application runs as a standard YARN job, and YARN commands can be used to start, stop (kill), and re-submit it. A properly written Spark Streaming job should be able to support at-least-once or exactly-once semantics through this lifecycle, but other than that there is no UI or other automation support for it. Zeppelin is designed for interactive analysis, and running Spark Streaming via Zeppelin is not recommended (other than for demos and presentations).
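For example, managing the lifecycle with the YARN CLI could look like this (the application id shown is a placeholder):

```bash
# List running applications, then kill the streaming job by its application id.
yarn application -list -appStates RUNNING
yarn application -kill application_1484000000000_0042
# Re-submitting is just another spark-submit of the same streaming application.
```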
01-26-2017
06:32 PM
This is not available in any distribution since it is a package and can be used independently. The latest 1.6 release is https://github.com/hortonworks-spark/shc/tree/v1.0.1-1.6. You can build it with the HBase version that matches your environment.
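A rough build sketch, assuming Maven and that the pom of the checked-out tag exposes an hbase.version property (verify the property name in that tag's pom.xml; the HBase version shown is a placeholder):

```bash
# Sketch: check out the 1.6 release tag and build against your HBase version.
git clone https://github.com/hortonworks-spark/shc.git
cd shc
git checkout v1.0.1-1.6
mvn clean package -DskipTests -Dhbase.version=1.1.2
```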
01-23-2017
07:54 PM
The flow seems right; that's a good use case for Livy, assuming it goes YourApp -> Livy -> Spark and back. You will need to look at the Livy client logs or the Livy server logs for session id 339. It seems like the client is asking for a session (Livy Spark job) that no longer exists. It may never have started, or it was abandoned or lost.
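As a quick check, you can ask the Livy REST API whether the session still exists (the host below is a placeholder; 8998 is Livy's default port):

```bash
# If this returns 404, the session is gone and the client must create a new one.
curl -s http://livy-host:8998/sessions/339
```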
01-23-2017
07:45 PM
1 Kudo
SHC does not have a notion of listing tables in HBase; it works on the table catalog provided to the data source in the program. Hive will also not list HBase tables because they are not present in the metastore. There is a rudimentary way to add HBase external tables in Hive, but I don't think that is really used (I could be wrong). To list HBase tables, currently the only reliable way is to use the HBase APIs inside the Spark program.
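A minimal sketch of listing tables with the HBase client API from inside a Spark program, assuming hbase-site.xml is on the classpath so the configuration points at your cluster:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.ConnectionFactory

// Picks up hbase-site.xml from the classpath.
val hbaseConf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(hbaseConf)
try {
  val admin = connection.getAdmin
  // Print every table name known to the HBase cluster.
  admin.listTableNames().foreach(name => println(name.getNameAsString))
} finally {
  connection.close()
}
```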
01-23-2017
01:33 AM
Hive and HiveContext in Spark can only show the tables that are registered in the Hive metastore, and HBase tables are usually not there because the schemas of most HBase tables are not easily defined in the metastore. To read HBase tables from Spark using the DataFrame API, please consider the Spark HBase Connector (SHC).
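A minimal SHC read sketch (the table name and column mappings in the catalog are placeholders; it assumes a SQLContext is in scope, e.g. in spark-shell):

```scala
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Placeholder catalog: map HBase column family/qualifier pairs to DataFrame columns.
val catalog = """{
  "table":{"namespace":"default", "name":"table1"},
  "rowkey":"key",
  "columns":{
    "col0":{"cf":"rowkey", "col":"key", "type":"string"},
    "col1":{"cf":"cf1", "col":"col1", "type":"string"}
  }
}"""

val df = sqlContext.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()
df.show()
```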
01-23-2017
01:30 AM
1 Kudo
In HDP 2.5, the Zeppelin JDBC interpreter can run one query per paragraph. There is a limit of 10 queries that can be run simultaneously overall.
01-17-2017
08:33 PM
First of all, which Spark version are you using? Apache Spark 2.0 has support for automatically acquiring HBase security tokens correctly for the job and all its executors. Apache Spark 1.6 does not have that feature, but in HDP Spark 1.6 we have backported it, so it can acquire HBase tokens for jobs.

The tokens are acquired automatically if 1) security is enabled, 2) hbase-site.xml is present on the client classpath, and 3) that hbase-site.xml has Kerberos security configured. Then HBase tokens for the HBase master specified in that hbase-site.xml are acquired and used in the job. In order to obtain the tokens, the Spark client needs to use HBase code, so specific HBase jars need to be present on the client classpath. This is documented on the SHC GitHub page; search for "secure" on that page.

To access HBase inside the Spark jobs, the job obviously needs the HBase jars to be present for the driver and/or executors. That would be part of your existing job submission for non-secure clusters, which I assume already works. If this job is going to be long-running and run beyond the token expiry time (typically 7 days), then you need to submit the Spark job with the --keytab and --principal options so that Spark can use the keytab to re-acquire tokens before the current ones expire.
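A submission sketch for a long-running job on a Kerberized cluster (the principal, keytab, config paths, and jar locations are placeholders; use the HBase jars that match your HDP version, per the SHC documentation):

```bash
# Sketch only: principal, keytab, paths, and the jar list are placeholders.
spark-submit \
  --master yarn --deploy-mode cluster \
  --principal user@EXAMPLE.COM \
  --keytab /etc/security/keytabs/user.keytab \
  --files /etc/hbase/conf/hbase-site.xml \
  --jars /usr/hdp/current/hbase-client/lib/hbase-client.jar,/usr/hdp/current/hbase-client/lib/hbase-common.jar,/usr/hdp/current/hbase-client/lib/hbase-server.jar,/usr/hdp/current/hbase-client/lib/hbase-protocol.jar \
  --class com.example.MyHBaseApp myapp.jar
```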