Member since: 07-05-2017
Posts: 72
Kudos Received: 3
Solutions: 5
My Accepted Solutions
Title | Views | Posted
---|---|---
| 3057 | 08-13-2019 03:54 PM
| 3057 | 05-10-2019 07:11 PM
| 8267 | 01-09-2019 12:17 AM
| 4203 | 01-08-2019 01:54 PM
| 14741 | 05-24-2018 02:43 PM
07-11-2022
01:33 PM
Hi @data_diver,
From "Now my intention is to copy/write a table from CML python into that space that I can see from datahubcluster. The table or dataframe formats that I use are pandas or spark.", the only missing piece of information is which sink and format you want to write the table to.
The use case as I understand it:
1) CML Spark/PySpark/Python/Pandas ->
2) Create a "table" DataFrame ->
3) Write the DataFrame down to your Datahub space ->
4) The questions at this point would be:
4.1) Write the DataFrame as a managed table into Hive? Then use the HiveWarehouseConnector (HWC from now on).
4.2) Write the DataFrame as an external table into Hive? Then use the HWC.
4.3) Write the DataFrame directly to a Datahub HDFS path? The HDFS service in a Datahub is not meant for this purpose and is therefore unsupported.
4.4) Write the DataFrame directly to a filesystem? Then this would not be related to a Datahub; you'll just have to df.write.option("path", "/some/path").saveAsTable("t"), where "path" is your ADLS storage container.
4.5) Write to Hive without HWC? You'll need to configure CML so that Spark loads the Hive ThriftServer endpoints from your Datahub. I haven't tried this, nor am I sure about its support status, but it should be doable: as long as both experiences (CML and DH) are on the same environment, they should be reachable from each other network-wise. This means collecting the hive-site.xml file from your Datahub and making the Hive Metastore URIs from those hive-site.xml configs available to your CML Spark session somehow. This method should let you write directly to an EXTERNAL table in DH Hive, i.e. by adding to your SparkSession: .config("spark.datasource.hive.warehouse.metastoreUri", "[Hive_Metastore_Uris]")
Please feel free to clarify the details of the use case, and hopefully the need for a "Spark connector", i.e.:
- Is there any issue with using the HWC?
- Are you getting any specific errors while connecting from CML to DH using Spark?
- Which client will you be using in DH to query the table?
- Any other details you can think of...
The more details you can provide, the better we'll be able to help you.
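For options 4.1/4.2, a minimal PySpark sketch of what an HWC write from CML could look like — all endpoints, database and table names below are placeholders, and it assumes the HWC jar and the pyspark_llap zip are already on the session classpath:

```python
# Hedged sketch only: replace every placeholder value with the ones from your
# own Datahub (hive-site.xml / Datahub endpoints); this is not a drop-in recipe.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession

spark = (
    SparkSession.builder
    .appName("cml-to-datahub-hive")
    .config("spark.datasource.hive.warehouse.metastoreUri", "thrift://dh-master0.example.com:9083")
    .config("spark.sql.hive.hiveserver2.jdbc.url",
            "jdbc:hive2://dh-master0.example.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2")
    .getOrCreate()
)

hive = HiveWarehouseSession.session(spark).build()

# A pandas DataFrame can be converted to a Spark DataFrame first, then written through HWC.
pdf = pd.DataFrame({"id": [1, 2], "val": ["a", "b"]})
df = spark.createDataFrame(pdf)

(
    df.write
    .format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector")  # the HWC data source
    .option("table", "my_db.my_table")  # hypothetical database.table
    .save()
)
```

For option 4.4 (a plain filesystem/external path), the df.write.option("path", "/some/path").saveAsTable("t") call shown above is enough and does not involve HWC.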
06-02-2020
07:22 AM
@renzhongpei The log4j-driver.properties/log4j-executor.properties files can live anywhere in your filesystem; just make sure to reference them from the right location in the --files argument: --files key.conf,test.keytab,/path/to/log4j-driver.properties,/path/to/log4j-executor.properties If you have a workspace in your home directory, they can safely sit in your current path: when using the spark-submit client, --files will look for both log4j-driver.properties and log4j-executor.properties in the current working directory unless a path is otherwise specified.
04-13-2020
11:00 AM
1 Kudo
Description: Used for tooling and running examples to test the HWC
Repo Info:
Repo URL: https://github.com/dbompart/hive_warehouse_connector
Account Name: dbompart
Repo Name: hive_warehouse_connector
03-15-2020
03:28 PM
The purpose of this repo is to provide quick examples and utilities for working with the Spark and Hive integration on HDP 3.1.4
Prerequisites:
HiveServer2Interactive (LLAP) must be installed, up and running
Bash and Python interpreter must be available
Ideally, for connections using the HTTP transport protocol, hive.server2.thrift.http.cookie.auth.enabled should be set to true in Ambari -> Hive -> Configs
Set hadoop.proxyuser.hive.hosts=* in Ambari -> HDFS -> Configs -> Custom core-site (core-site.xml), or list at least the LLAP hosts, HS2 hosts and Spark client hosts separated by commas.
Once the LLAP service is up and running, the next step for this setup requires the following properties to be configured in the Spark client:
spark.datasource.hive.warehouse.load.staging.dir=
spark.datasource.hive.warehouse.metastoreUri=
spark.hadoop.hive.llap.daemon.service.hosts=
spark.jars=
spark.security.credentials.hiveserver2.enabled=
spark.sql.hive.hiveserver2.jdbc.url=
spark.sql.hive.hiveserver2.jdbc.url.principal=
spark.submit.pyFiles=
spark.hadoop.hive.zookeeper.quorum=
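For reference, the same properties can also be passed programmatically when building the Spark session. A hedged PySpark sketch follows, where every value is a placeholder to be replaced with the values collected from your cluster; note that spark.jars and spark.submit.pyFiles are usually better set in spark-defaults.conf or on the spark-submit command line, since they must be known before the JVM starts.

```python
# Sketch only: all values below are placeholders (e.g. collected with hwc_info_collect.sh).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hwc-client-config")
    .config("spark.datasource.hive.warehouse.load.staging.dir", "/tmp")
    .config("spark.datasource.hive.warehouse.metastoreUri", "thrift://metastore-host.example.com:9083")
    .config("spark.hadoop.hive.llap.daemon.service.hosts", "@llap0")
    .config("spark.jars", "/usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-<version>.jar")
    .config("spark.security.credentials.hiveserver2.enabled", "false")  # depends on deploy mode and security setup
    .config("spark.sql.hive.hiveserver2.jdbc.url",
            "jdbc:hive2://zk1:2181,zk2:2181,zk3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive")
    .config("spark.sql.hive.hiveserver2.jdbc.url.principal", "hive/_HOST@EXAMPLE.COM")
    .config("spark.submit.pyFiles", "/usr/hdp/current/hive_warehouse_connector/pyspark_hwc-<version>.zip")
    .config("spark.hadoop.hive.zookeeper.quorum", "zk1:2181,zk2:2181,zk3:2181")
    .getOrCreate()
)
```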
While the above information can be manually collected as explained in the official Cloudera documentation, the following steps help collect the standard information and avoid mistakes while copying and pasting parameter values between HS2I and Spark. Steps:
--Notice the connection is to the LLAP Host--
ssh root@my-llap-host
cd /tmp
wget https://raw.githubusercontent.com/dbompart/hive_warehouse_connector/master/hwc_info_collect.sh
chmod +x hwc_info_collect.sh
./hwc_info_collect.sh
The above script immediately provides the following:
The information needed to enable the Spark and Hive integration (HWConnector)
A working spark-shell command to test initial connectivity
A short how-to for listing all databases in Hive, in Scala.
Done !!!
LDAP/AD Authentication
In an LDAP-enabled authentication setup, the username and password are passed in plaintext. The recommendation is to Kerberize the cluster; otherwise, expect to see the username and password exposed in clear text throughout the logs.
To provide username and password, we will have to specify them as part of the JDBC URL string in the following format:
jdbc:hive2://zk1:2181,zk2:2181,zk3:2181/;user=myusername;password=mypassword;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive
Note: We will need to URL encode the password if it has a special character. For example, user=hr1;password=BadPass#1 will translate to user=hr1;password=BadPass%231
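If you need to produce the encoded form, a quick sketch with Python's standard library (using the documentation's example credentials) is:

```python
# URL-encode a password containing special characters before embedding it in the JDBC URL.
from urllib.parse import quote

encoded = quote("BadPass#1", safe="")
print(encoded)  # BadPass%231
jdbc_url = (
    "jdbc:hive2://zk1:2181,zk2:2181,zk3:2181/;"
    f"user=hr1;password={encoded};"
    "serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive"
)
```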
This method is not fully supported for Spark HWC integration, nor is it recommended.
Kerberos Authentication
For Kerberos authentication, the following pre-conditions have to be met:
Initial "kinit" has to always be executed and validated.
$ kinit dbompart@EXAMPLE.COM
Password for dbompart@EXAMPLE.COM:
$ klist -v
Credentials cache: API:501:9
Principal: dbompart@EXAMPLE.COM
Cache version: 0
Server: krbtgt/EXAMPLE.COM@EXAMPLE.COM
Client: dbompart@EXAMPLE.COM
Ticket etype: aes128-cts-hmac-sha1-96
Ticket length: 256
Auth time: Feb 11 16:11:36 2013
End time: Feb 12 02:11:22 2013
Renew till: Feb 18 16:11:36 2013
Ticket flags: pre-authent, initial, renewable, forwardable
Addresses: addressless
For YARN, HDFS, Hive and HBase long-running jobs, DelegationTokens have to be fetched. Hence, provide "--keytab" and "--principal" extra arguments, i.e.:
spark-submit $arg1 $arg2 $arg3 $arg-etc --keytab my_file.keytab
--principal dbompart@EXAMPLE.COM --class a.b.c.d app.jar
For Kafka, a JAAS file has to be provided:
With a keytab, recommended for long running jobs:
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="./my_file.keytab"
storeKey=true
useTicketCache=false
serviceName="kafka"
principal="user@EXAMPLE.COM";
};
Without a keytab, usually used for batch jobs:
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useTicketCache=true
renewTicket=true
serviceName="kafka";
};
And, it also has to be mentioned at the JVM level:
spark-submit $arg1 $arg2 $arg3 $arg-etc --files jaas.conf
--conf spark.driver.extraJavaOptions="-Djava.security.auth.login.config=./jaas.conf"
--conf spark.executor.extraJavaOptions="-Djava.security.auth.login.config=./jaas.conf"
Livy2 - Example
curl -X POST --data '{"kind": "pyspark", "queue": "default", "conf": { "spark.jars": "/usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.32-1.jar", "spark.submit.pyFiles":"/usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.32-1.zip", "spark.hadoop.hive.llap.daemon.service.hosts": "@llap0", "spark.sql.hive.hiveserver2.jdbc.url": "jdbc:hive2://node2.cloudera.com:2181,node3.cloudera.com:2181,node4.cloudera.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive", "spark.yarn.security.credentials.hiveserver2.enabled": "false", "spark.sql.hive.hiveserver2.jdbc.url.principal": "hive/_HOST@EXAMPLE.COM", "spark.datasource.hive.warehouse.load.staging.dir": "/tmp", "spark.datasource.hive.warehouse.metastoreUri": "thrift://node3.cloudera.com:9083", "spark.hadoop.hive.zookeeper.quorum": "node2.cloudera.com:2181,node3.cloudera.com:2181,node4.cloudera.com:2181"}}' -H "X-Requested-By: admin" -H "Content-Type: application/json" --negotiate -u : http://node3.cloudera.com:8999/sessions/ | python -mjson.tool
Submitting a brief example to show databases (hive.showDatabases()):
curl --negotiate -u : http://node2.cloudera.com:8999/sessions/2/statements -X POST -H 'Content-Type: application/json' -H "X-Requested-By: admin" -d '{"code":"from pyspark_llap import HiveWarehouseSession"}'
curl --negotiate -u : http://node2.cloudera.com:8999/sessions/2/statements -X POST -H 'Content-Type: application/json' -H "X-Requested-By: admin" -d '{"code":"hive = HiveWarehouseSession.session(spark).build()"}'
curl --negotiate -u : http://node2.cloudera.com:8999/sessions/2/statements -X POST -H 'Content-Type: application/json' -H "X-Requested-By: admin" -d '{"code":"hive.showDatabases().show()"}'
Quick reference for basic API commands to check on the application status:
# Check sessions. Based on the ID field, update the following curl commands to replace "2" with $ID.
curl --negotiate -u : http://node2.cloudera.com:8999/sessions/ | python -mjson.tool
# Check session status
curl --negotiate -u : http://node2.cloudera.com:8999/sessions/2/status | python -mjson.tool
# Check session logs
curl --negotiate -u : http://node2.cloudera.com:8999/sessions/2/log | python -mjson.tool
# Check session statements.
curl --negotiate -u : http://node2.cloudera.com:8999/sessions/2/statements | python -mjson.tool
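If you prefer scripting these calls over raw curl, a hedged Python sketch using the requests and requests_kerberos libraries (both assumed to be installed; the hosts and jar/zip paths are the same placeholders as in the curl example above) would be:

```python
# Sketch only: creates a Livy session with (an abridged set of) the HWC confs and prints the response.
import json
import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL

livy = "http://node3.cloudera.com:8999"
payload = {
    "kind": "pyspark",
    "queue": "default",
    "conf": {
        "spark.jars": "/usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.32-1.jar",
        "spark.submit.pyFiles": "/usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.32-1.zip",
        "spark.hadoop.hive.llap.daemon.service.hosts": "@llap0",
        "spark.sql.hive.hiveserver2.jdbc.url": "jdbc:hive2://node2.cloudera.com:2181,node3.cloudera.com:2181,node4.cloudera.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive",
        "spark.datasource.hive.warehouse.metastoreUri": "thrift://node3.cloudera.com:9083",
        "spark.hadoop.hive.zookeeper.quorum": "node2.cloudera.com:2181,node3.cloudera.com:2181,node4.cloudera.com:2181",
        # (abridged: add the remaining HWC confs from the curl example above as needed)
    },
}
headers = {"X-Requested-By": "admin", "Content-Type": "application/json"}

resp = requests.post(f"{livy}/sessions/", data=json.dumps(payload),
                     headers=headers, auth=HTTPKerberosAuth(mutual_authentication=OPTIONAL))
print(resp.json())  # note the returned "id" for the follow-up /statements calls
```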
Zeppelin - Example
Livy2 Interpreter
Assumptions:
The cluster is kerberized.
LLAP has already been enabled.
We got the initial setup information using the script hwc_info_collect.sh
In Ambari->Spark2->Configs->Advanced livy2-conf, the property livy.spark.deployMode should be set to either "yarn-cluster" or just plain "cluster". Note: Client mode is not supported.
Extra steps:
Add the following property=value in Ambari->Spark2->Configs->Custom livy2-conf section: - livy.file.local-dir-whitelist=/usr/hdp/current/hive_warehouse_connector/
We can test our configurations before setting them statically in the Interpreter:
Notebook > First paragraph:
%livy2.conf
livy.spark.datasource.hive.warehouse.load.staging.dir=$value
livy.spark.datasource.hive.warehouse.metastoreUri=$value
livy.spark.hadoop.hive.llap.daemon.service.hosts=$value
livy.spark.jars=file:///$value
livy.spark.security.credentials.hiveserver2.enabled=true
livy.spark.sql.hive.hiveserver2.jdbc.url=$value
livy.spark.sql.hive.hiveserver2.jdbc.url.principal=$value
livy.spark.submit.pyFiles=file:///$value
livy.spark.hadoop.hive.zookeeper.quorum=$value
Please note that compared to a regular spark-shell or spark-submit, this time we'll have to specify the filesystem scheme file:///, otherwise it'll try to reference a path on HDFS by default.
Notebook > Second paragraph:
%livy2
import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()
hive.showDatabases().show()
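An equivalent PySpark paragraph — assuming your interpreter settings expose the %livy2.pyspark binding and that the pyspark_hwc zip was included via livy.spark.submit.pyFiles above — would be:

```
%livy2.pyspark
from pyspark_llap import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()
hive.showDatabases().show()
```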
Creating a Table with Dummy data in Hive
For this specific task, we can expedite the table creation and dummy-data ingest by referring to Cloudera's VideoKB. In the above link, the Python script (HiveRandom.zip) should help you create and load a table based on an input table schema.
Another short bash script, show_create_cleaner.sh, is also available and can be used as follows:
wget https://raw.githubusercontent.com/dbompart/hive_warehouse_connector/master/show_create_cleaner.sh
chmod +x show_create_cleaner.sh
./show_create_cleaner.sh show_create_table_output_file.txt
This bash script is a quick cleaner: it makes the SHOW CREATE TABLE statement output reusable in Hive or Spark. Passing --clean will also remove the table's Location and Table Properties sections, i.e.:
./show_create_cleaner.sh show_create_table_output_file.txt --clean
Common errors
No service instances found in registry
Check the configuration settings again, especially the llap.daemon.service.hosts value, and also the corresponding znode, which should be available and readable from ZooKeeper.
error: object hortonworks is not a member of package com
This usually means that either the HWC jar or zip files were not successfully uploaded to the Spark classpath. We can confirm this by looking at the logs and searching for:
Uploading resource file:/usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.32-1.jar
Cannot run get splits outside HS2
Add hive.fetch.task.conversion="more" to the Custom hiveserver2-interactive section, and check the LLAP logs if needed.
Query returns no more than 1000 rows
Follow the HiveWarehouseSession API guide. This usually means that the execute() method was used where executeQuery() should have been.
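A short PySpark illustration of the difference — assuming an existing SparkSession named spark with the HWC jar/zip on its classpath, and a hypothetical web_sales table:

```python
# execute() goes through the HS2 JDBC path and is capped (1000 rows by default);
# executeQuery() goes through the LLAP daemons and returns the full result set.
from pyspark_llap import HiveWarehouseSession

hive = HiveWarehouseSession.session(spark).build()

hive.execute("describe formatted web_sales").show()   # fine for DDL/metadata commands
hive.executeQuery("select * from web_sales").show()   # use this for actual query results
```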
"Blacklisted configuration values in session config: spark.master"
Edit the /etc/livy2/conf/spark-blacklist.conf file on the Livy2 server host to allow or disallow the configurations that can be modified per session.
Unable to read HiveServer2 configs from ZooKeeper. Tried all existing HiveServer2 URIS from ZooKeeper
LLAP may not be up and running, or there is a problem reading its znode.
Suggested documentation
https://community.cloudera.com/t5/Community-Articles/Integrating-Apache-Hive-with-Apache-Spark-Hive-Warehouse/ta-p/249035
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html
03-15-2020
02:42 PM
Can you try with this jar file instead: https://repo.hortonworks.com/content/repositories/releases/com/hortonworks/hive/hive-warehouse-connector_2.11/1.0.0.3.1.0.147-1/hive-warehouse-connector_2.11-1.0.0.3.1.0.147-1.jar It may not work since you're on a lower version, but it's looking like a bug that could have been fixed in a recent HDP release.
03-15-2020
12:41 PM
Have you tried hive.executeQuery("select * from test_zak") instead?
08-13-2019
07:34 PM
Zeppelin and spark-shell are not the same client, and properties work differently; since you moved on to Zeppelin, can we assume it did work for spark-shell? Regarding the Zeppelin issue, the problem should be in the path to the Hive Warehouse Connector files, either in spark.jars or in spark.submit.pyFiles. I believe the path must be whitelisted in Zeppelin, but it's clear that the HiveWarehouseConnector files are not being successfully uploaded to the application classpath; therefore, the pyspark_llap module cannot be imported. Hope it helps. BR, David
08-13-2019
03:54 PM
1 Kudo
Hey Shashank, You're still skipping the link to: API Operations - https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/integrating-hive/content/hive_hivewarehousesession_api_operations.html Listing the databases in Hive from Spark using the SparkSQL API will not work as long as metastore.catalog.default is set to "spark", which is the default value and it's recommended to leave it as it is. To summarize: by default, the SparkSQL API (spark.sql("$query")) will access the Spark catalog; instead, you should be using the HiveWarehouseSession API as explained in the link above, something like: import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()
hive.showDatabases().show()
hive.setDatabase("foodmart")
hive.showTables().show()
hive.execute("describe formatted foodmartTable").show()
hive.executeQuery("select * from foodmartTable limit 5").show()
05-14-2019
03:17 PM
@ashok.kumar, same thing: Path does not exist: hdfs://cch1wpsteris01:8020/user/root/rdd. Create it ("/user/root/rdd") and grant proper permissions to it.
05-10-2019
07:32 PM
Hi @Vasanth Reddy, If I understood you correctly, I think you should be checking this: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html#Setting_up_queues
These are per-container properties:
yarn.scheduler.minimum-allocation-mb = 1024
yarn.scheduler.maximum-allocation-mb = 4096
yarn.scheduler.minimum-allocation-vcores = 3
yarn.scheduler.maximum-allocation-vcores = 3
While you're looking at cluster-wide metrics:
running containers: 72
allocated CPU vcores: 72
allocated memory MB: 120034
With the above settings, at some point you may see the same 72 containers with allocated memory MB at 294912 (72 containers * 4096 MB maximum). Let me know if I misunderstood your question. BR, David Bompart
05-10-2019
07:11 PM
1 Kudo
Hi @Shashank Naresh, It's not clear what your current version is; I'll assume HDP 3. If that is the case, you may want to read the following links, along with their internal links:
Spark to Hive access on HDP3 - https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html
Configuration - https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/integrating-hive/content/hive_configure_a_spark_hive_connection.html
API Operations - https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/integrating-hive/content/hive_hivewarehousesession_api_operations.html
In short, Spark has its own catalog, meaning that you will not natively have access to the Hive catalog as you did on HDP 2. BR, David Bompart
05-10-2019
07:02 PM
Hi @ashok.kumar, The log is pointing to `java.io.FileNotFoundException: File does not exist: hdfs:/spark2-history`, meaning that in your spark-defaults.conf file you have specified this directory as your Spark event logging dir. Spark will try to write its event logs to this HDFS path - not to be confused with YARN application logs, or your application logs - and it is failing to find it. You might want to check your spark-defaults.conf file and point `spark.eventLog.dir` to either a valid HDFS path, or a local path where your Spark application has write access. For example, assuming that your client is a Linux/MacOSX machine, you can simply create a /tmp/spark-events directory, grant appropriate write access to it, and then configure spark-defaults.conf like: spark.eventLog.dir=file:///tmp/spark-events This property can also be overridden, which is easier for quick tests, i.e.: /spark-submit --class org.com.st.com.st.Iot_kafka_Consumer --master local[*] --conf spark.eventLog.dir="file:///tmp/spark-events" /usr/local/src/softwares/com.st-0.0.1-SNAPSHOT-jar-with-dependencies.jar BR, David Bompart
01-09-2019
01:14 AM
@Berry Österlund, SparkR is indeed no longer supported for accessing Hive; R is not covered by the HWC connector.
01-09-2019
01:03 AM
@Debjyoti Das have you tried replacing: spark = org.apache.spark.sql.SparkSession.builder.appName("MyApp")//.enableHiveSupport().getOrCreate; With: spark = org.apache.spark.sql.SparkSession.builder.appName("MyApp").getOrCreate();
01-09-2019
12:30 AM
Hi @Sangram Gaikwad, They have been resolved and do not exist in HDP 3.0.1. They're probably just missing from the Known Issues notes.
01-09-2019
12:17 AM
@Pavel Stejskal Using the HiveWarehouseConnector + HiveServer2Interactive (LLAP, for managed tables) is mandatory, and the reasons are explained in the HDP 3 documentation; if you're not using it, then the properties are certainly not OK. If the namespace part of the JDBC URL is not configured to point to the HiveServer2Interactive znode (I think that's what you meant), then that is not correct. To read a table into a DataFrame, you have to use the HiveWarehouseSession API, i.e.: val df = hive.executeQuery("select * from web_sales") I'd suggest reading through this entire article. BR.
01-08-2019
07:48 PM
The spark-submit looks fine. This issue will take more than a forum to resolve; it would require code and log analysis, I'd say. Meanwhile, I can only suggest passing "-Dsun.security.krb5.debug=true" to the extraJavaOptions, and it would also help to set "log4j.logger.org.apache.spark.deploy.yarn.Client=DEBUG" in the log4j.properties file, then restart the application, hoping it will print more pointers. Also, if your KDC is an MIT KDC, double check that your principal does not have a 'Maximum Renewal Time' of 00:00:00, as explained here. Another property to try out, depending on your application's use case, is: --conf mapreduce.job.complete.cancel.delegation.tokens=false
01-08-2019
01:54 PM
1 Kudo
Hi @Nikhil Raina, In simple words, speculative execution means that Hadoop overall doesn't try to fix slow tasks, since it is hard to detect the reason (misconfiguration, hardware issues, etc.); instead, it just launches a parallel/backup copy, on faster nodes, of each task that is performing slower than expected. These backup tasks are called speculative tasks. The feature can be enabled or disabled, since its benefits depend on the use case and it is up to the Hadoop admin to decide whether it is beneficial: speculative execution has an impact on cluster throughput and resource usage. You can find it in MapReduce or Spark, for example. Hope it helps, David
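In Spark, for instance, this behaviour is driven by a few configuration properties; a hedged sketch of enabling it (values shown are close to the defaults, tune per use case):

```python
# Sketch: enable speculative execution for a Spark application.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("speculation-demo")
    .config("spark.speculation", "true")           # launch backup copies of slow tasks
    .config("spark.speculation.quantile", "0.75")  # fraction of tasks that must finish before speculating
    .config("spark.speculation.multiplier", "1.5") # how many times slower than the median counts as "slow"
    .getOrCreate()
)
```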
01-07-2019
09:00 PM
@Daniel Müller You might want to increase the --executor-memory (and probably yarn.scheduler.maximum-allocation-mb as well) to a value that can hold your data size in memory. In some cases repartitioning is a better option.
01-07-2019
08:26 PM
Hi @Michael Bronson, Is it deleting everything else but the .inprogress files? The following is already present and fixed on HDP 2.6.4: https://issues.apache.org/jira/browse/SPARK-8617 One of the proposed changes was to use the loading time as lastUpdated for .inprogress files and keep using the modTime for completed files. The first change prevents deletion of in-progress job files; the second ensures that the lastUpdated time won't change for completed jobs in the event of a HistoryServer reboot.
- Can you double-check the .inprogress files' timestamps?
- Check that they do not correspond to actually running applications (streaming apps, for example).
- Check the permissions on these files, and perhaps try to manually delete one of these lingering .inprogress files while logged in as the spark user, to see if it lets you remove one of them.
- Restart the SHS and check the log to see if it prints any errors while trying to remove these .inprogress files. Look for error messages like:
case t: Exception => logError("Exception in cleaning logs", t)
logError(s"IOException in cleaning ${attempt.logPath}", t)
logInfo(s"No permission to delete ${attempt.logPath}, ignoring.")
Regards, David
01-03-2019
06:57 PM
Can you share the (masked) spark-submit command and the full "delegation token has expired" stacktrace? Also, what is the use case of your app?
12-27-2018
06:58 PM
Hi Mani, you might also want to increase the number of executors then, and you may be able to lower the memory size. Try with: spark-submit --master yarn --deploy-mode client --driver-memory 5g --num-executors 6 --executor-memory 8g myclass myjar.jar param1 param2 param3 param4 param5 Tuning this requires lots of other information, like input data size, application use case, datasource information, available cluster resources, etc. Keep tuning --num-executors, --executor-memory and --executor-cores (5 is usually a good number).
12-26-2018
06:58 AM
Hi Mani, use --executor-memory 10g instead of 6g, and remove the memoryOverhead config property.
12-24-2018
06:44 PM
Sure, can you share your spark-submit command with the arguments as well? Mask any sensitive information please.
12-23-2018
07:30 PM
Hi @Bharat Bhushan, It seems like something is misconfigured in your spark-shell or spark-submit command, which prevents the serviceRegistry from being created, so it remains null: https://github.com/apache/hive/blob/master/llap-ext-client/src/java/org/apache/hadoop/hive/llap/LlapBaseInputFormat.java#L361 Please refer to: https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.0/integrating-hive/content/hive_configure_a_spark_hive_connection.html Verify that both spark.sql.hive.hiveserver2.jdbc.url and spark.hadoop.hive.llap.daemon.service.hosts are correct.
12-23-2018
07:02 PM
Hi @Aakriti Batra, The problem seems to be in the JAAS file passed to the executor. It would help to see its content, but I'd rather suggest reading this whole article instead: https://community.hortonworks.com/articles/56704/secure-kafka-java-producer-with-kerberos.html
12-23-2018
06:56 PM
Hi @Ali, You might want to add "--keytab /path/to/the/headless-keytab", "--principal principalNameAsPerTheKeytab" and "--conf spark.hadoop.fs.hdfs.impl.disable.cache=true" to the spark-submit command.
12-23-2018
06:34 PM
Hi Mani, Consider boosting spark.yarn.executor.memoryOverhead from 6.6 GB to something higher than 8.2 GB by adding "--conf spark.yarn.executor.memoryOverhead=10GB" to the spark-submit command. You could also work around this by increasing the number of partitions (repartitioning) and the number of executors.
12-03-2018
03:04 PM
Hi Kumar, It could be that you have a stale SHS process, so stop any SHS process such as PID 39259 and/or remove ("rm") the SHS process ID file, usually under /var/run/spark2, and then restart the service.
12-03-2018
03:01 PM
With the same user used to run this application, what is the output of running "which java" on each and every NodeManager node? ProcessBuilder doesn't use the locations in environment variables; it looks for "java" in "/usr/bin/java". Is that the java binary you're granting permissions on?