Member since
07-30-2018
60
Posts
14
Kudos Received
5
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1353 | 06-20-2019 10:14 AM |
| | 13512 | 06-11-2019 07:04 AM |
| | 1384 | 03-05-2019 07:25 AM |
| | 3172 | 01-03-2019 10:42 AM |
| | 8004 | 12-04-2018 11:59 PM |
02-26-2020
05:10 AM
Hi, I understand that you have a Spark Java application that takes 2 hours to process 4 MB of data, and you would like to improve its performance. I recommend checking the documents below, which cover performance tuning at both the code and the configuration level. https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-1/ https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-2/ Thanks, Jerry
06-20-2019
10:14 AM
Hi, Yes, we can force a job to run in a particular queue, based on the submitting user, by using a placement policy. We can define a secondary group for each user; whenever that user submits a job, it lands on the queue matching their secondary group. Reference link: https://blog.cloudera.com/blog/2016/06/untangling-apache-hadoop-yarn-part-4-fair-scheduler-queue-basics/ Thanks, Jerry
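For illustration, a placement policy based on the user's secondary group is expressed in the Fair Scheduler allocation file (fair-scheduler.xml); the sketch below assumes queues matching the secondary-group names already exist:

```xml
<allocations>
  <queuePlacementPolicy>
    <!-- honour an explicitly requested queue first -->
    <rule name="specified" create="false"/>
    <!-- otherwise place the job in the queue named after the
         user's secondary group, if such a queue exists -->
    <rule name="secondaryGroupExistingQueue" create="false"/>
    <!-- fall back to the default queue -->
    <rule name="default" queue="root.default"/>
  </queuePlacementPolicy>
</allocations>
```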
06-11-2019
07:19 AM
2 Kudos
Hi Jerry, It started working after I changed the property "offsets.topic.replication.factor" from 3 to 1. Thanks for your support. Aamir.
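For reference, the change above is a single line in the Kafka broker configuration (server.properties); lowering it to 1 is only appropriate on a cluster with fewer than 3 brokers, such as a single-broker development setup:

```
# server.properties
# The internal __consumer_offsets topic cannot be replicated 3 ways
# on a single-broker cluster, so lower the factor to 1.
offsets.topic.replication.factor=1
```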
06-03-2019
01:46 PM
From the spark or pyspark shell, use the commands below to access Hive database objects (note that spark.sql takes a single statement with no trailing semicolon):

```python
spark.sql("show databases").show()
spark.sql("select * from databasename.tablename").show()
# or
spark.read.table("databasename.tablename")
```

You can run any query inside spark.sql, and it will return the results as a DataFrame.
05-28-2019
03:50 AM
Hi, Can you try to execute a sample Spark application and let us know the result?

```shell
spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn --deploy-mode client \
  --keytab <location>/<filename>.keytab --principal <principal name> \
  /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 1 1
```

Thanks, Jerry
05-14-2019
10:47 AM
Hi, The 'major.minor version 52.0' error usually means the jar was compiled with JDK 1.8, but you are trying to run it in a JDK 1.7 (or older) runtime. The reported number is the version the class file requires, not the version you are using. To solve this, make sure the JDK used to compile and the JRE used to run point to the same (or a newer) version.
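As a quick way to check which Java version a class file was built for, you can read the version fields from its header; the class-file layout (magic number, then minor and major version) comes from the JVM specification, while the function name below is just an illustrative sketch:

```python
import struct

# Class-file major versions map to JDK releases:
# 50 = Java 6, 51 = Java 7, 52 = Java 8, 55 = Java 11, 61 = Java 17.
JDK_BY_MAJOR = {50: "6", 51: "7", 52: "8", 55: "11", 61: "17"}

def class_file_version(path):
    """Return (major, minor) class-file version from a .class file header."""
    with open(path, "rb") as f:
        # 4-byte magic, 2-byte minor version, 2-byte major version, big-endian
        magic, minor, major = struct.unpack(">IHH", f.read(8))
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file: %s" % path)
    return major, minor
```

A 'major.minor version 52.0' error therefore means the class requires at least Java 8 to run; classes inside a jar can be inspected the same way after extracting them.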
04-29-2019
11:58 AM
Thanks, agreed. I also found the bug details. The URL you shared, https://spark.apache.org/docs/1.6.0/#downloading, says it is compatible with 2.6+ and 3.1+, which is totally misleading, since 3.6 is technically "3.1+" yet is not supported. I have started working on upgrading my app to Spark 2. Any suggestions on a Spark 1.6 to Spark 2 migration guide for a Cloudera cluster?
04-17-2019
06:00 PM
1 Kudo
In its default configuration, metadata is cached until an "INVALIDATE METADATA" command evicts the table from the cache, or until the catalog service is restarted. In 5.16 and 6.1+ there are some non-default options that will evict metadata after a particular timeout; at some point these will become the defaults. Table stats are collected and stored in the Hive metastore when you run a "COMPUTE STATS" command. They are then just part of the table metadata.
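Concretely, the two commands mentioned above look like this in impala-shell (the database and table names are placeholders):

```sql
-- Evict cached metadata for one table; running INVALIDATE METADATA
-- with no table name refreshes the whole catalog.
INVALIDATE METADATA databasename.tablename;

-- Gather table and column statistics and store them in the Hive metastore.
COMPUTE STATS databasename.tablename;
```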
03-31-2019
12:42 AM
1 Kudo
Hi, I assume that you are working on a managed table rather than an external table? This could be caused by a lack of permissions for the user who ran the DROP command to remove the underlying HDFS path. Check the HMS server log to see if you can find any error messages.
03-12-2019
09:26 AM
Hi Naveen, If you have a limited number of ports available, you can assign a fixed port to each application:

```shell
--conf "spark.driver.port=4050" --conf "spark.executor.port=51001" --conf "spark.ui.port=4005"
```

Hope it helps. Thanks, Jerry