Member since: 10-09-2015
Posts: 76
Kudos Received: 33
Solutions: 11
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3537 | 03-09-2017 09:08 PM |
| | 3717 | 02-23-2017 08:01 AM |
| | 1017 | 02-21-2017 03:04 AM |
| | 1026 | 02-16-2017 08:00 AM |
| | 647 | 01-26-2017 06:32 PM |
01-25-2018
07:13 PM
If you have HDP 2.6.3, you should be able to find the Spark 2.2 version of spark-llap under /usr/hdp/current/. Perhaps you are pulling in an older version of SHC via --packages, and that is not compatible with Spark 2.2.
... View more
04-24-2017
08:47 PM
/etc/hive/conf/hive-site.xml is the config for the Hive service itself and is managed via Ambari through the Hive service config page. /usr/hdp/current/spark-client/conf/hive-site.xml actually points to /etc/spark/conf/hive-site.xml. This is the minimal Hive config that Spark needs to access Hive, and it is managed via Ambari through the Spark service config page. Ambari correctly configures this hive-site.xml for Kerberos. Depending on your version of HDP, you may not have the correct support in Ambari for configuring Livy.
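As a quick sanity check (a minimal sketch, Spark 1.6 style; the query is only illustrative), you can verify in spark-shell that Spark picks up its hive-site.xml and can reach the metastore:

```scala
// Run in spark-shell, where `sc` is provided automatically.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.sql("SHOW DATABASES").show()  // lists metastore databases if the config is picked up
```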
... View more
03-17-2017
06:59 PM
When connecting via beeline did that use HiveServer2 or SparkThriftServer?
... View more
03-12-2017
01:51 AM
Is this needed even after the HDP 2.5 native Oozie Spark action?
... View more
03-09-2017
09:08 PM
1 Kudo
Apache Spark has traditionally worked sub-optimally with ORC because ORC used to live inside Apache Hive, and Apache Spark depends on a very old Hive release, 1.2.1 from mid-2015. We are working on figuring out how best to update Apache Spark's version of ORC, either by upgrading Apache Spark's dependency to the latest Apache Hive or by taking the ORC dependency from the new Apache ORC project.
... View more
03-08-2017
06:54 PM
1 Kudo
For JDBC, there is a built-in jar that provides the support. No need for Simba.
... View more
03-08-2017
06:51 PM
2 Kudos
You are probably missing hbase-site.xml or the Phoenix conf in the classpath, so it cannot find the ZooKeeper info for HBase/Phoenix.
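A quick diagnostic sketch (a hypothetical check, not from the original thread): if hbase-site.xml is actually on the classpath, the ZooKeeper quorum it defines will be visible; otherwise HBase falls back to the default of localhost.

```scala
// Prints the ZooKeeper quorum HBase will use; "localhost" usually means
// hbase-site.xml was not found on the classpath.
import org.apache.hadoop.hbase.HBaseConfiguration

val hbaseConf = HBaseConfiguration.create()
println(hbaseConf.get("hbase.zookeeper.quorum"))
```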
... View more
02-23-2017
08:01 AM
What exception/error happens in code 2? Just curious. foreachRDD is the prescribed method for writing to external systems, so you should be using foreachRDD. The outer loop executes on the driver and the inner loop on the executors; executors run on remote machines in the cluster. However, in the code above it is not clear how dynamoConnection is available to the executors, since such network connections are usually not serializable. Or is the following line inadvertently missing from snippet 1? val dynamoConnection = setupDynamoClientConnection() If yes, then the slowness could stem from repeatedly creating a dynamoClientConnection for each record. The recommended pattern is to use foreachPartition() to create the connection once per partition and then iterate over that partition's records using the same connection (see the sketch below). For more info, please search for foreachPartition in http://spark.apache.org/docs/latest/streaming-programming-guide.html
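Here is a minimal sketch of that pattern, assuming setupDynamoClientConnection() is the helper from your snippet and `dstream` is your input DStream (both names are taken from or implied by the question, not a complete implementation):

```scala
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // One connection per partition instead of one per record.
    val dynamoConnection = setupDynamoClientConnection()
    partition.foreach { record =>
      // write `record` to DynamoDB using dynamoConnection
    }
    // release/close dynamoConnection here if your client requires it
  }
}
```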
... View more
02-21-2017
03:04 AM
1 Kudo
That log4j file only affects the service daemons like the Spark history server and anything you run on the client machines. For executors/drivers that run on YARN machines, the log4j file has to be passed to them using the "--files" option during job submit and then referenced via the JVM argument "-Dlog4j.configuration". See here for examples.
... View more
02-16-2017
08:00 AM
1 Kudo
Ambari should give you the option of which HDP 2.5.x version to install. Choosing a higher version gives a higher Apache Spark version. E.g. HDP 2.5.3 will give Apache Spark 2.0.1, and the next HDP 2.5.4+ release will give 2.0.2. HDP 2.6 (not released yet) will have Apache Spark 2.1; you can try a tech preview of that on HDC.
... View more
02-15-2017
03:01 AM
A full stack trace would help in understanding which interaction is resulting in this. If IDE-based code is being used, then you could try not using the spark-assembly jar that is present on HDFS and instead use the local spark-assembly jar from the Spark build being compiled against. This can be done by overriding the spark.yarn.jar config. It could be that the compile dependency of Spark in your IDE is different from the runtime dependency on HDFS. Another possibility is a Scala version mismatch.
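For example (a minimal sketch; the path, app name, and Spark/Hadoop versions are illustrative), the override can be set on the SparkConf used to create the context:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Point spark.yarn.jar at the assembly from the Spark build you compile against,
// instead of the assembly staged on HDFS.
val conf = new SparkConf()
  .setMaster("yarn-client")
  .setAppName("MyApp")
  .set("spark.yarn.jar", "file:///path/to/spark-assembly-1.6.2-hadoop2.7.3.jar")
val sc = new SparkContext(conf)
```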
... View more
02-07-2017
06:43 PM
Yes. For Spark 1.6 it is GA in HDP 2.5.3. The documentation is available from the Github site for a given SHC release tag. That is the source of truth.
... View more
02-01-2017
10:01 PM
1 Kudo
--files will add it to the working directory of the YARN app master and containers, which means that those files (and not jars) will be in the classpath of the app master and containers. But in client-mode jobs the main driver code runs on the client machine, so these --files are not available to the driver. SPARK_CLASSPATH adds these files to the driver classpath. It's an env var, so one could set the following: export SPARK_CLASSPATH=/a/b/c/hbase-site.xml:/d/e/f/hive-site.xml Note that it will warn that SPARK_CLASSPATH is deprecated and cannot be used concurrently with the --driver-class-path option. More information can be found here: https://github.com/hortonworks-spark/shc
... View more
01-30-2017
03:49 AM
2 Kudos
Unfortunately, that kind of functionality does not exist for Spark Streaming. Spark Streaming runs as a standard YARN job, and YARN commands can be used to start, stop (kill), and re-submit a job. A properly written Spark Streaming job should be able to support at-least-once or exactly-once semantics through this lifecycle. But other than that there is no UI or other automation support for it. Zeppelin is designed for interactive analysis, and running Spark Streaming via Zeppelin is not recommended (other than demos for presentations).
... View more
01-26-2017
06:32 PM
This is not available in any distribution, since it's a package and can be used independently. The latest 1.6 release is https://github.com/hortonworks-spark/shc/tree/v1.0.1-1.6 You can build that with the HBase version that matches your environment.
... View more
01-23-2017
07:54 PM
The flow seems right. That's a good use case for Livy, assuming it goes YourApp -> Livy -> Spark and back. You will need to look at the Livy client logs or the Livy logs for session id 339. It seems like the client is asking for a session (Livy Spark job) that does not exist anymore; it could have failed to start, or been abandoned or lost.
... View more
01-23-2017
07:45 PM
1 Kudo
SHC does not have a notion of listing tables in HBase; it works on the table catalog provided to the data source in the program. Hive will also not list HBase tables because they are not present in the metastore. There is some rudimentary way to add HBase external tables in Hive, but I don't think that is really used (I could be wrong). To list HBase tables, currently the only reliable way would be to use the HBase APIs inside the Spark program, as sketched below.
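A minimal sketch of that approach, assuming hbase-site.xml is on the classpath (this is the plain HBase 1.x client API, not SHC itself):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.ConnectionFactory

// List all HBase tables using the client Admin API.
val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val admin = connection.getAdmin
admin.listTableNames().foreach(t => println(t.getNameAsString))
admin.close()
connection.close()
```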
... View more
01-23-2017
01:33 AM
Hive and HiveContext in Spark can only show the tables that are registered in the Hive metastore, and HBase tables are usually not there because the schemas of most HBase tables are not easily defined in the metastore. To read HBase tables from Spark using the DataFrame API, please consider the Spark HBase Connector (SHC).
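A minimal sketch of reading an HBase table with SHC in spark-shell (the table name, column family, and columns in the catalog are illustrative; see the SHC documentation for the catalog format):

```scala
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// JSON catalog mapping an HBase table to DataFrame columns (illustrative values).
val catalog = s"""{
  "table":{"namespace":"default", "name":"my_table"},
  "rowkey":"key",
  "columns":{
    "col0":{"cf":"rowkey", "col":"key", "type":"string"},
    "col1":{"cf":"cf1", "col":"col1", "type":"string"}
  }
}"""

// `sqlContext` is provided by spark-shell in Spark 1.6.
val df = sqlContext.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()
df.show()
```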
... View more
01-23-2017
01:30 AM
1 Kudo
In HDP 2.5, the Zeppelin JDBC interpreter can run one query per paragraph. There is a limit of 10 queries that can run simultaneously overall.
... View more
01-17-2017
08:33 PM
First of all, which Spark version are you using? Apache Spark 2.0 has support for automatically acquiring HBase security tokens for the job and all its executors. Apache Spark 1.6 does not have that feature, but in HDP Spark 1.6 we have backported it, so it can also acquire HBase tokens for jobs. The tokens are acquired automatically if 1) security is enabled, 2) hbase-site.xml is present on the client classpath, and 3) that hbase-site.xml has Kerberos security configured. Then HBase tokens for the HBase master specified in that hbase-site.xml are acquired and used in the job. In order to obtain the tokens, the Spark client needs to use HBase code, so specific HBase jars need to be present in the client classpath. This is documented on the SHC GitHub page; search for "secure" on that page. To access HBase inside the Spark jobs, the job obviously needs the HBase jars to be present for the driver and/or executors; that would be part of your existing job submission for non-secure clusters, which I assume already works. If this job is going to be long-running and run beyond the token expiry time (typically 7 days), then you need to submit the Spark job with the --keytab and --principal options so that Spark can use that keytab to re-acquire tokens before the current ones expire.
... View more
01-17-2017
08:23 PM
If this does not work for you, please open a feature request by creating an issue on the GitHub project for SHC. /cc @wyang
... View more
01-17-2017
08:20 PM
Ideally, just before that OWN failure log, there should be an exception or error message about some task for the vertex with id 1484566407737_0004_1. That could give more info. Even if more info is not there, you will be able to find the task attempt that actually failed. That task attempt can show you which machine and YARN container it ran on. Sometimes the logs don't have the error because it was logged to stderr. In that case, the stderr from the container's YARN logs may show the error.
... View more
01-11-2017
09:09 PM
+1. That's what I mentioned in my last comment below. Copying here so everyone gets the context quickly. Ranger KMS could be the issue because it causes problems for getting the HDFS delegation token. If the Z or L user needs to get an HDFS delegation token, then they also need to be superusers for Ranger. You are better off trying with a non-Ranger cluster or adding them to the Ranger superusers, which is different from the core-site superusers.
... View more
01-11-2017
08:29 PM
The AM percent property in YARN is relevant if the cluster has idle resources but an AM is still not being started for the application. On the YARN UI you will see available capacity but the AM not being started. E.g. the cluster has 100GB capacity and is using only 50GB. If you want to run X apps concurrently and each AM needs M GB of resources (per config), then you need X*M GB of capacity for AMs, and this can be used to determine the AM percent as a fraction of the total cluster capacity. On the other hand, if the cluster does not have any capacity at that time (as seen in the YARN UI), then changing the AM percent may not help; the cluster does not have capacity to obtain a container slot for the AM. E.g. the cluster has 100GB capacity and is already using 100GB. In this case you will have to wait for capacity to free up.
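A back-of-the-envelope sketch of that calculation (the numbers are purely illustrative):

```scala
// 100 GB total cluster memory, 10 concurrent apps, 2 GB per AM
// => AMs need 20 GB, i.e. an AM percent of at least 0.2.
val totalClusterGb = 100.0
val concurrentApps = 10
val amGbEach = 2.0
val amPercent = (concurrentApps * amGbEach) / totalClusterGb
println(s"AM percent should be at least $amPercent")  // 0.2
```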
... View more
01-04-2017
03:01 AM
3 Kudos
Ranger KMS could be the issue because it causes problems for getting the HDFS delegation token. If the Z or L user needs to get an HDFS delegation token, then they also need to be superusers for Ranger. You are better off trying with a non-Ranger cluster or adding them to the Ranger superusers, which is different from the core-site superusers.
... View more
01-03-2017
10:01 PM
Why are we changing it in the Zeppelin env? Can this be changed in the Spark interpreter configs? /cc @prabhjyot singh
... View more
01-03-2017
09:56 PM
3 Kudos
If you are trying to authenticate user FOO via LDAP on Zeppelin and then use Zeppelin to launch a %livy.spark notebook as user FOO, then you are using Livy impersonation (this is different from Zeppelin's own impersonation, which is only recommended for the shell interpreter, not the Livy interpreter). User FOO should also exist in the Hadoop cluster because the jobs will eventually run as that user. HDP 2.5.3 should already have all the configs set up for you. It's a bug that livy.spark.master in Zeppelin is not yarn-cluster. Next, Livy should be using the Livy keytab and Zeppelin should be using the Zeppelin keytab. The Zeppelin user needs to be configured as a livy.superuser in the Livy config. The Livy user should be configured as a proxy user in core-site.xml so that YARN/HDFS allow it to impersonate other users (in this case hadoopadmin) when submitting Spark jobs. If that Zeppelin->Livy connection fails, then you will see an exception in Zeppelin and logs in Livy. If that succeeds, then Livy will try to submit the job; if that fails, you will see the exception in the Livy logs. From the exception in your last comment, it appears that the Livy user is not configured as a proxy user properly in core-site.xml. You can check that in the Hadoop configs and may have to restart the affected services if you change it. In HDP 2.5.3 this should already be done for you during Livy installation via Ambari.
... View more
01-03-2017
09:41 PM
Alicia, please see my answer above on Oct 24. If you are running Spark on YARN, you will have to go through the YARN RM UI to get to the Spark UI for a running job; the link to the YARN UI is available from the Ambari YARN service. For a completed job, you will need to go through the Spark History Server; the link to the Spark History Server is available from the Ambari Spark service.
... View more