Member since: 07-05-2017
Posts: 74
Kudos Received: 3
Solutions: 5
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 7132 | 08-13-2019 03:54 PM |
| | 7132 | 05-10-2019 07:11 PM |
| | 12093 | 01-09-2019 12:17 AM |
| | 9571 | 01-08-2019 01:54 PM |
| | 23205 | 05-24-2018 02:43 PM |
01-09-2019
12:17 AM
@Pavel Stejskal Using the HiveWarehouseConnector together with HiveServer2 Interactive (LLAP, for managed tables) is mandatory; the reasons are explained in the HDP 3 documentation. If you're not using it, then the properties are certainly not OK; and if the namespace part is not configured to point to the HiveServer2 Interactive znode (I think that's what you meant), then that is not correct either. To read a table into a DataFrame, you have to use the HiveWarehouseSession API, i.e.: val df = hive.executeQuery("select * from web_sales") I'd suggest reading through this entire article. BR.
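For reference, a minimal sketch of that read path in Scala (it assumes the HWC jar is on the classpath and spark.sql.hive.hiveserver2.jdbc.url points at the HiveServer2 Interactive endpoint; the table name is just an example):

```scala
import com.hortonworks.hwc.HiveWarehouseSession

// `spark` is the active SparkSession (e.g., in spark-shell);
// build a session that talks to HiveServer2 Interactive (LLAP)
val hive = HiveWarehouseSession.session(spark).build()

// runs through the connector and returns a Spark DataFrame
val df = hive.executeQuery("select * from web_sales")
df.show(10)
```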
01-08-2019
07:48 PM
The spark-submit looks fine. This issue will take more than a forum thread to resolve; it would require code and log analysis, I'd say. Meanwhile, I can only suggest passing "-Dsun.security.krb5.debug=true" to the extraJavaOptions, and it would also help to set the following in the log4j.properties file: "log4j.logger.org.apache.spark.deploy.yarn.Client=DEBUG". Then restart the application, hoping it will print more pointers. Also, if your KDC is an MIT KDC, double-check that your principal does not have a 'Maximum Renewal Time' of 00:00:00, as explained here. Another property to try out, depending on your application's use case, is: --conf mapreduce.job.complete.cancel.delegation.tokens=false A sketch of how these fit together follows below.
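A sketch of wiring those options into spark-submit (cluster mode and the application jar name are placeholders, not from your setup):

```bash
# enable JVM-level Kerberos tracing on both driver and executors
spark-submit \
  --master yarn --deploy-mode cluster \
  --conf "spark.driver.extraJavaOptions=-Dsun.security.krb5.debug=true" \
  --conf "spark.executor.extraJavaOptions=-Dsun.security.krb5.debug=true" \
  --conf mapreduce.job.complete.cancel.delegation.tokens=false \
  myapp.jar

# and in log4j.properties (edit the copy in the Spark conf dir, or ship one with --files):
# log4j.logger.org.apache.spark.deploy.yarn.Client=DEBUG
```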
01-08-2019
01:54 PM
1 Kudo
Hi @Nikhil Raina, In simple words, speculative execution means that Hadoop as a whole doesn't try to fix slow tasks, because the cause (misconfiguration, hardware issues, etc.) is hard to detect. Instead, for every task that is running slower than expected, it launches a parallel backup task on a faster node. These backup tasks are called speculative tasks. The feature can be enabled or disabled, since its benefit depends on the use case; it's up to the Hadoop admin to decide whether it is worthwhile, because speculative execution has an impact on cluster throughput and resource usage. You can find it in MapReduce or Spark, for example; the Spark properties involved are sketched below. Hope it helps, David
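In Spark, for instance, speculation is controlled by a handful of properties. The values below are illustrative, not recommendations:

```bash
# spark.speculation              - launch backup copies of slow-running tasks
# spark.speculation.interval     - how often to check for slow tasks
# spark.speculation.multiplier   - how many times slower than the median counts as slow
# spark.speculation.quantile     - fraction of tasks that must finish before speculating
spark-submit \
  --conf spark.speculation=true \
  --conf spark.speculation.interval=100ms \
  --conf spark.speculation.multiplier=1.5 \
  --conf spark.speculation.quantile=0.75 \
  myapp.jar
```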
01-07-2019
08:26 PM
Hi @Michael Bronson, Is it deleting everything else except the .inprogress files? The following issue is already present and fixed in HDP 2.6.4: https://issues.apache.org/jira/browse/SPARK-8617 One of the proposed changes there was to use the load time as lastUpdated for .inprogress files, while keeping the modification time for completed files. The first change prevents deletion of in-progress job files; the second ensures that the lastUpdated time of completed jobs won't change if the History Server reboots. A few things to check (a command-line sketch follows below):
- Double-check the timestamps of the .inprogress files.
- Check that they do not correspond to applications that are actually running (streaming apps, for example).
- Check the permissions on these files, and try to manually delete one of the lingering .inprogress files while logged in as the spark user, to see if it lets you remove it.
- Restart the SHS and check its log for errors while it tries to remove the .inprogress files, along the lines of:
case t: Exception => logError("Exception in cleaning logs", t)
logError(s"IOException in cleaning ${attempt.logPath}", t)
logInfo(s"No permission to delete ${attempt.logPath}, ignoring.")
Regards, David
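The sketch mentioned above (the event-log directory and the file name are placeholders; use whatever spark.history.fs.logDirectory points to in your cluster):

```bash
# list lingering in-progress event logs with their timestamps and permissions
hdfs dfs -ls /spark2-history | grep '\.inprogress'

# try deleting one of them as the spark user
sudo -u spark hdfs dfs -rm /spark2-history/application_1234567890123_0001.inprogress

# then restart the Spark History Server and watch its log for cleaner errors
```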
01-03-2019
06:57 PM
Can you share the (masked) spark-submit command and the full "delegation token has expired" stack trace? Also, what is your application's use case?
12-27-2018
06:58 PM
Hi Mani, you might also want to increase the number of executors; then you may be able to lower the memory size. Try with: spark-submit --master yarn --deploy-mode client --driver-memory 5g --num-executors 6 --executor-memory 8g --class myclass myjar.jar param1 param2 param3 param4 param5 Tuning this properly requires a lot of other information: input data size, application use case, data source details, available cluster resources, etc. Keep tuning --num-executors, --executor-memory, and --executor-cores (5 is usually a good number); see the sketch below.
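For example, a variant of the same command that also pins the executor cores (the numbers are just a starting point under the assumption of a similarly sized cluster, not a recommendation):

```bash
spark-submit --master yarn --deploy-mode client \
  --driver-memory 5g \
  --num-executors 6 \
  --executor-cores 5 \
  --executor-memory 8g \
  --class myclass \
  myjar.jar param1 param2 param3 param4 param5
```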
12-26-2018
06:58 AM
Hi Mani, use --executor-memory 10g instead of 6g, and remove the overhead config property.
12-24-2018
06:44 PM
Sure, can you share your spark-submit command with the arguments as well? Please mask any sensitive information.
12-23-2018
07:02 PM
Hi @Aakriti Batra, The problem seems to be in the JAAS file passed to the executor. It would help to see its content, but I'd rather suggest reading this whole article instead: https://community.hortonworks.com/articles/56704/secure-kafka-java-producer-with-kerberos.html
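For orientation, a minimal sketch of what such a JAAS file typically looks like for a Kerberized Kafka client (the keytab path and principal are placeholders):

```
KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  keyTab="/etc/security/keytabs/kafka_client.keytab"
  principal="user@EXAMPLE.COM"
  serviceName="kafka";
};
```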
12-23-2018
06:56 PM
Hi @Ali, You might want to add "--keytab /path/to/the/headless-keytab", "--principal principalNameAsPerTheKeytab", and "--conf spark.hadoop.fs.hdfs.impl.disable.cache=true" to the spark-submit command.
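Putting those together (cluster mode and the jar name are assumptions; keytab path and principal are as per your environment):

```bash
spark-submit \
  --master yarn --deploy-mode cluster \
  --keytab /path/to/the/headless-keytab \
  --principal principalNameAsPerTheKeytab \
  --conf spark.hadoop.fs.hdfs.impl.disable.cache=true \
  myapp.jar
```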