Member since: 07-30-2018
Posts: 56
Kudos Received: 14
Solutions: 5
My Accepted Solutions
Views | Posted
---|---
789 | 06-20-2019 10:14 AM
7888 | 06-11-2019 07:04 AM
783 | 03-05-2019 07:25 AM
2164 | 01-03-2019 10:42 AM
6300 | 12-04-2018 11:59 PM
02-26-2020
05:10 AM
Hi, I understand that you have a Spark Java application that takes 2 hours to process 4 MB of data, and you would like to improve its performance. I recommend checking the documents below, which cover performance tuning at both the code and configuration level.
https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-1/
https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-2/
Thanks Jerry
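As a rough illustration of the configuration-level knobs those posts cover, a minimal spark-submit sketch is below; the class name, jar, and all values are illustrative placeholders, not tuned recommendations:

```bash
# Sketch only: com.example.MyJob and myjob.jar are placeholders; adjust
# executors, cores, memory, and shuffle partitions to the actual workload.
spark-submit \
  --class com.example.MyJob \
  --master yarn --deploy-mode client \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  --conf spark.sql.shuffle.partitions=50 \
  myjob.jar
```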
12-12-2019
03:29 AM
Hi Dhiraj, You can use a Fair Scheduler placement policy to assign the application to a queue named after the primary group (usually the first group) to which the user belongs. Reference link: https://blog.cloudera.com/untangling-apache-hadoop-yarn-part-4-fair-scheduler-queue-basics/#crayon-5bb5e2c83a686341712232 Thanks Jerry
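In case it helps, a minimal sketch of such a placement policy is below; the primaryGroup rule name comes from the Fair Scheduler documentation, and the local file path is only illustrative (on a CM-managed cluster this XML normally goes into the Fair Scheduler configuration rather than a hand-edited file):

```bash
# Write an example allocations file with a primaryGroup placement rule.
cat > /tmp/fair-scheduler-example.xml <<'EOF'
<allocations>
  <queuePlacementPolicy>
    <rule name="specified" create="false"/>
    <rule name="primaryGroup" create="true"/>
    <rule name="default"/>
  </queuePlacementPolicy>
</allocations>
EOF
```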
06-20-2019
10:14 AM
Hi, Yes, we can force a job to run in a particular queue based on the user by using a placement policy. We can define a secondary group for each user; whenever that user submits a job, it will land in the queue named after the secondary group. Reference link: https://blog.cloudera.com/blog/2016/06/untangling-apache-hadoop-yarn-part-4-fair-scheduler-queue-basics/ Thanks Jerry
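A sketch of the secondary-group variant is below; note that, as I understand it, the secondaryGroupExistingQueue rule only places jobs into a queue that already exists with the secondary group's name, so the queue has to be defined up front (the queue name here is illustrative):

```bash
# Example allocations file: jobs land in an existing queue named after the
# submitting user's secondary group, otherwise fall through to default.
cat > /tmp/fair-scheduler-example.xml <<'EOF'
<allocations>
  <queue name="analysts"/>
  <queuePlacementPolicy>
    <rule name="secondaryGroupExistingQueue"/>
    <rule name="default"/>
  </queuePlacementPolicy>
</allocations>
EOF
```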
06-13-2019
10:37 AM
Hi, Are you trying to run the Sqoop command on a worker node? Can you try redeploying the Sqoop client configuration and restarting the Cloudera Manager agent on that node? That can help reconfigure the symlinks. Thanks Jerry
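For reference, a quick sketch of those two steps from the shell; the service name follows standard CM packaging and the grep is just a convenience check:

```bash
# Restart the CM agent so it re-evaluates the client configuration symlinks,
# then confirm the sqoop-related alternatives resolve to a CDH path.
sudo systemctl restart cloudera-scm-agent
ls -l /etc/alternatives | grep -i sqoop
```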
06-11-2019
07:04 AM
Hi, Please use the steps below if your cluster is unsecured.
Create a topic:
$ kafka-topics --create --zookeeper <zk host>:2181 --replication-factor 3 --partitions 1 --topic Testmessage
Run a Kafka console producer:
$ kafka-console-producer --broker-list <broker hostname>:9092 --topic Testmessage
Run a Kafka console consumer:
$ kafka-console-consumer --new-consumer --topic Testmessage --from-beginning --bootstrap-server <broker hostname>:9092
Thanks Jerry
06-10-2019
05:53 AM
Hi, Can you try "INVALIDATE METADATA [[db_name.]table_name]"? We usually run this command to reload the metadata when a table was not created through Impala. Link: https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_invalidate_metadata.html Thanks Jerry
06-10-2019
04:33 AM
Hi, From the logs, it seems the client is unable to reach the broker: "Connection to node -1 (localhost/127.0.0.1:9092) could not be established". Make sure you have a broker running on that node and listening on port 9092. Also try using the fully qualified hostname or IP instead of localhost. Thanks Jerry
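A couple of quick checks on the machine you expect to host the broker (just a sketch):

```bash
# Is anything listening on the broker port?
netstat -plnt | grep 9092
# Use this fully qualified hostname instead of localhost in broker-list / bootstrap.servers.
hostname -f
```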
05-28-2019
03:56 AM
2 Kudos
Hi Harish, You can create a HiveContext and access the Hive table with it. Example program:
from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
sample = hive_context.table("default.<tablename>")
sample.show()
Reference link: https://stackoverflow.com/questions/36051091/query-hive-table-in-pyspark
05-28-2019
03:50 AM
Hi, Can you try executing a sample Spark application and let us know the result?
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client --keytab <location>/<filename>.keytab --principal <principal name> /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 1 1
Thanks Jerry
05-28-2019
03:43 AM
Hi Dean, The Gateway role exists only to carry client configuration; since it is not a running service, its state is shown as N/A, and that is expected behavior. Are you still facing this issue even after adding the Gateway role and deploying the client configuration? Thanks Jerry
04-29-2019
11:51 AM
1 Kudo
Hi, Based on the error message you have shared:
... TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
This error corresponds to the JIRA SPARK-19019 [1], a compatibility issue between Spark and Python 3.6. Spark 1.6 requires Python 2.6+ as per the documentation [2]; see also the Cloudera documentation on Spark and Python versions [3].
[1] https://issues.apache.org/jira/browse/SPARK-19019
[2] https://spark.apache.org/docs/1.6.0/#downloading
[3] https://www.cloudera.com/documentation/enterprise/5-14-x/topics/spark_python.html#spark_python__section_ark_lkn_25
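If switching the interpreter is an option, one possible workaround sketch is to point PySpark at a Python 2.7 installation (the path is illustrative; adjust it to where Python 2.7 lives on your nodes):

```bash
# Make PySpark use Python 2.7 for both the driver and the executors.
export PYSPARK_PYTHON=/usr/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/usr/bin/python2.7
pyspark
```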
04-22-2019
01:45 AM
Hi, It is showing that the application is running on an unsupported Java version. Could you let us know the following versions: Kafka, CDH, and the system Java version? 'Unsupported major.minor version 52.0' occurs when an application compiled for Java 8 is run on a Java version lower than 1.8.
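A quick check sketch; class file major version 52 corresponds to Java 8:

```bash
# Confirm which JVM the application actually runs with.
java -version
# Confirm JAVA_HOME points at a JDK 1.8 or newer if the application targets Java 8.
echo $JAVA_HOME
```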
04-16-2019
08:32 AM
2 Kudos
Hi, An Impala query is usually faster on the second run than on the first attempt of the same query. This is because of the OS cache, which keeps the file data in memory and reuses it; it is an OS-level feature and not specific to Impala. For further performance improvement, Impala can also make use of "HDFS caching", which helps speed up query results even more. Reference link: https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_perf_hdfs_caching.html Thanks Jerry
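For reference, a minimal sketch of enabling HDFS caching for a table; the cache pool and table names are illustrative:

```bash
# Create an HDFS cache pool, then mark the table as cached in that pool.
hdfs cacheadmin -addPool impala_pool
impala-shell -q "ALTER TABLE default.sample_table SET CACHED IN 'impala_pool'"
```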
03-26-2019
07:53 AM
Hi, If it is an external table, we need to delete the partition directory manually after dropping the partition:
ALTER TABLE table_name DROP [IF EXISTS] PARTITION <partition spec>;
hadoop fs -rm -r <partition file path>
Thanks Jerry
03-12-2019
09:26 AM
Hi Naveen, If you have a limited number of ports available, you can assign the ports for each application explicitly:
--conf "spark.driver.port=4050"
--conf "spark.executor.port=51001"
--conf "spark.ui.port=4005"
Hope it helps. Thanks Jerry
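For example, a sketch of where those options go on the command line; the class, jar, and port values are illustrative:

```bash
spark-submit --master yarn --deploy-mode client \
  --conf "spark.driver.port=4050" \
  --conf "spark.ui.port=4005" \
  --class com.example.MyApp myapp.jar
```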
03-05-2019
09:53 AM
Hi, Try to check which process is using port 9083:
netstat -plant | grep 9083
If required, kill that process and wait to see whether it is being restarted automatically by some script. Once that is confirmed, restart HMS and HS2 and check again. Thanks Jerry
03-05-2019
07:52 AM
Hi Renuka, Sqoop 2 is not supported in CDH 6.x. To upgrade to CDH 6.x, you must delete the Sqoop 2 service before the upgrade. Check the link below for backing up the Sqoop 2 metastore: https://www.cloudera.com/documentation/enterprise/upgrade/topics/ug_cdh_upgrade_backup.html#concept_yfz_ltv_hs Hope it helps. Thanks Jerry
03-05-2019
07:36 AM
Hi, From the shared exception it seems you are hitting the bug SQOOP-2999. A change to Sqoop (SQOOP-2737) introduced a new class, OracleUtils, which has a dependency on org.apache.commons.lang3.StringUtils. Unfortunately, the jar that fulfills this dependency is not on the classpath that Sqoop passes to the mappers. As a workaround, add the jar manually:
sqoop import -libjars /opt/cloudera/parcels/CDH/jars/commons-lang3-3.1.jar ...
Let us know if you have any questions. Thanks Jerry
03-05-2019
07:25 AM
Hi Sandeep, It seems the data has been imported, since the counts are the same. But sometimes the data types differ between the MS SQL and Hive tables, which results in NULL values. Let's check the data types on both tables; please also share the Sqoop command for a further check. Thanks Jerry
02-12-2019
06:34 AM
Hi, If there are any changes in the Hive metadata, please try running msck repair table <tablename> to get it back in sync. Reference link: https://www.cloudera.com/documentation/enterprise/5-13-x/topics/cdh_ig_hive_troubleshooting.html Thanks Jerry
01-29-2019
06:47 AM
Hi Tulasi, Could you check the value of the Container Executor Group property in the container-executor.cfg file and cross-check it with the CM configuration? Thanks Jerry
01-28-2019
10:33 AM
Hi Tulasi, Could you please verify that the container executor group is the same in both Cloudera Manager (YARN -> Configuration -> Container Executor Group) and /etc/hadoop/conf.cloudera.yarn/container-executor.cfg (on the NodeManager host)? Let us know if you have questions. Thanks Jerry
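A quick way to read that value on the NodeManager host; the property key below is the standard YARN name for this setting, so treat it as an assumption for your exact release:

```bash
# Show the configured container executor group from the client config on this host.
grep 'linux-container-executor.group' /etc/hadoop/conf.cloudera.yarn/container-executor.cfg
```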
01-08-2019
10:27 AM
Hi Vinod, 32K partitions is a huge number to handle. We can define buckets instead of partitions to avoid too many small files. Can you share the type of query you are running against these partitions? Thanks Jerry
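For reference, a minimal bucketed-table sketch in Hive; the connection URL, table, and column names are illustrative only:

```bash
# Create a table bucketed on user_id instead of partitioning it into many small partitions.
beeline -u jdbc:hive2://localhost:10000 -e "
CREATE TABLE db.events_bucketed (
  event_id BIGINT,
  user_id  STRING,
  payload  STRING
)
CLUSTERED BY (user_id) INTO 64 BUCKETS
STORED AS ORC;"
```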
01-03-2019
10:42 AM
1 Kudo
Hi, When importing an empty table from Teradata to HDFS via Sqoop using the --table option, we get the exception below:
com.teradata.connector.common.exception.ConnectorException: Input source table is empty
It is a bug on the Teradata connector side and the fix is yet to be released. Until the fix is available, we recommend using sqoop import --query as a workaround instead of --table. Hope it helps. Let us know if you have any questions. Thanks Jerry
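A sketch of the --query form of the workaround; the connection string, credentials, paths, and table names are illustrative placeholders:

```bash
# Import via a free-form query instead of --table; $CONDITIONS is required by Sqoop.
sqoop import \
  --connect jdbc:teradata://<td-host>/DATABASE=<db> \
  --username <user> -P \
  --query 'SELECT * FROM <db>.<table> WHERE $CONDITIONS' \
  --target-dir /user/<user>/<table> \
  --num-mappers 1
```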
01-02-2019
09:24 AM
Hi Fenton, Thanks for the information. The join condition could be creating a cross product, which results in the multiplication of records (as shown in the screenshot). Could you please share the query plan and the join condition so we can check further?
explain <query>
Thanks Jerry
12-20-2018
01:10 AM
Hi, This can be caused by the lack of /var/lib/alternatives/hadoop-conf on a specific host. Did you try restarting the Cloudera agent service? That could rebuild the alternatives. Run the script below to check that the alternatives are linked properly:
ls -lart /etc/alternatives | grep "CDH" | while read a b c d e f g h i j k; do
  alternatives --display $i   # $i is the link-name column of the ls -l output
done
Let us know if you have any questions. Thanks Jerry
12-18-2018
10:02 AM
Hi, We can specify the maximum size of each Parquet data file produced by Impala INSERT statements by running "set PARQUET_FILE_SIZE=<size>" before the INSERT. Reference link: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_parquet_file_size.html Impala partitioning and Hive bucketing will also help in managing the data. Link: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_partitioning.html Hope it helps. Let us know if you have any questions. Thanks Jerry
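A small sketch of using the option before an INSERT; the database and table names are illustrative, and the "m" size suffix is supported on recent Impala releases (verify for yours):

```bash
# Run the SET and the INSERT in the same impala-shell session so the option applies.
cat > /tmp/parquet_file_size_example.sql <<'EOF'
set PARQUET_FILE_SIZE=256m;
INSERT OVERWRITE TABLE db.target_parquet SELECT * FROM db.source_table;
EOF
impala-shell -f /tmp/parquet_file_size_example.sql
```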
12-17-2018
08:08 AM
Hi, We understand your query is failing with RPCTIMEOUT. There are many reasons that can cause a timeout.
1. Could you check at which point the query starts to fail (from the Impala daemon log, using the query ID from the profile)?
2. Please share the query profile.
Impala will create an optimised query plan only when it is completely aware of the table stats, so please make sure you have run "compute stats db.tablename" on all the tables used in that query. Hope it helps. Let me know if you have any questions. Thanks Jerry
12-14-2018
09:50 AM
Hi, A YARN application can go to the pending state if the resources are unavailable. Could you please share the below information so we can check further?
1. Resource Manager scheduler page screenshot
2. yarn application console output
Let us know if you have any questions. Thanks Jerry
12-14-2018
08:27 AM
2 Kudos
Hi, yarn.scheduler.maximum-allocation-mb, specified as 20 GB here, is the largest amount of physical memory that can be requested for a container, and yarn.scheduler.minimum-allocation-mb is the smallest amount of physical memory that can be requested for a container.
When we submit an MR job, the requested container memory is taken from "mapreduce.map.memory.mb", which is 1 GB by default. If it is not specified, we get a 1 GB container (the same applies to the reducer). This can be verified in the YARN logs:
mapreduce.map.memory.mb - requested container memory, 1 GB:
INFO [Thread-52] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: mapResourceRequest:<memory:1024, vCores:1>
mapreduce.map.java.opts - which is 80% of the container memory by default:
org.apache.hadoop.mapred.JobConf: Task java-opts do not specify heap size. Setting task attempt jvm max heap size to -Xmx820m
1 GB is the default and it is quite low. I recommend reading the link below. It provides a good understanding of the YARN and MR memory settings, how they relate, and how to set some baseline settings based on the cluster node size (disk, memory, and cores).
https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_yarn_tuning.html
Hope it helps. Let us know if you have any questions. Thanks Jerry
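As a rough per-job sketch of raising those values (the jar, class, paths, and sizes are illustrative; the -D overrides only take effect if the job uses ToolRunner/GenericOptionsParser):

```bash
# Request 2 GB map containers and 4 GB reduce containers, with heaps at roughly 80% of each.
hadoop jar my-mr-job.jar com.example.MyJob \
  -Dmapreduce.map.memory.mb=2048 \
  -Dmapreduce.map.java.opts=-Xmx1638m \
  -Dmapreduce.reduce.memory.mb=4096 \
  -Dmapreduce.reduce.java.opts=-Xmx3276m \
  /input/path /output/path
```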