Member since: 07-30-2018
Posts: 56
Kudos Received: 14
Solutions: 5
My Accepted Solutions
Views | Posted
---|---
789 | 06-20-2019 10:14 AM
7888 | 06-11-2019 07:04 AM
783 | 03-05-2019 07:25 AM
2164 | 01-03-2019 10:42 AM
6300 | 12-04-2018 11:59 PM
02-26-2020
05:10 AM
Hi, I understand that you have a Spark Java application that takes 2 hours to process 4 MB of data, and you would like to improve its performance. I recommend checking the documents below, which cover performance tuning at both the code and configuration level.
https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-1/
https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-2/
Thanks Jerry
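As a rough illustration of the configuration-level knobs those posts cover, a minimal spark-submit sketch is below; the class name, jar, and all values are illustrative placeholders, not tuned recommendations:

```bash
# Sketch only: com.example.MyJob and myjob.jar are placeholders; adjust
# executors, cores, memory, and shuffle partitions to the actual workload.
spark-submit \
  --class com.example.MyJob \
  --master yarn --deploy-mode client \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  --conf spark.sql.shuffle.partitions=50 \
  myjob.jar
```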
12-12-2019
03:29 AM
Hi Dhiraj, You can use a Fair Scheduler placement policy to assign the application to a queue named after the primary group (usually the first group) to which the user belongs. Reference link: https://blog.cloudera.com/untangling-apache-hadoop-yarn-part-4-fair-scheduler-queue-basics/#crayon-5bb5e2c83a686341712232 Thanks Jerry
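In case it helps, a minimal sketch of such a placement policy is below; the primaryGroup rule name comes from the Fair Scheduler documentation, and the local file path is only illustrative (on a CM-managed cluster this XML normally goes into the Fair Scheduler configuration rather than a hand-edited file):

```bash
# Write an example allocations file with a primaryGroup placement rule.
cat > /tmp/fair-scheduler-example.xml <<'EOF'
<allocations>
  <queuePlacementPolicy>
    <rule name="specified" create="false"/>
    <rule name="primaryGroup" create="true"/>
    <rule name="default"/>
  </queuePlacementPolicy>
</allocations>
EOF
```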
06-20-2019
10:14 AM
Hi, Yes, we can force a job to run in a particular queue based on the user by using a placement policy. We can define a secondary group for each user; whenever that user submits a job, it will land in the queue named after the secondary group. Reference link: https://blog.cloudera.com/blog/2016/06/untangling-apache-hadoop-yarn-part-4-fair-scheduler-queue-basics/ Thanks Jerry
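A sketch of the secondary-group variant is below; note that, as I understand it, the secondaryGroupExistingQueue rule only places jobs into a queue that already exists with the secondary group's name, so the queue has to be defined up front (the queue name here is illustrative):

```bash
# Example allocations file: jobs land in an existing queue named after the
# submitting user's secondary group, otherwise fall through to default.
cat > /tmp/fair-scheduler-example.xml <<'EOF'
<allocations>
  <queue name="analysts"/>
  <queuePlacementPolicy>
    <rule name="secondaryGroupExistingQueue"/>
    <rule name="default"/>
  </queuePlacementPolicy>
</allocations>
EOF
```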
06-13-2019
10:37 AM
Hi, Are you trying to run the Sqoop command on a worker node? Can you try redeploying the Sqoop client configuration and restarting the Cloudera Manager agent on that node? That can help reconfigure the symlinks. Thanks Jerry
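For reference, a quick sketch of those two steps from the shell; the service name follows standard CM packaging and the grep is just a convenience check:

```bash
# Restart the CM agent so it re-evaluates the client configuration symlinks,
# then confirm the sqoop-related alternatives resolve to a CDH path.
sudo systemctl restart cloudera-scm-agent
ls -l /etc/alternatives | grep -i sqoop
```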
06-11-2019
07:04 AM
Hi, Please use the steps below if your cluster is unsecured.
Create a topic:
$ kafka-topics --create --zookeeper <zk host>:2181 --replication-factor 3 --partitions 1 --topic Testmessage
Run a Kafka console producer:
$ kafka-console-producer --broker-list <broker hostname>:9092 --topic Testmessage
Run a Kafka console consumer:
$ kafka-console-consumer --new-consumer --topic Testmessage --from-beginning --bootstrap-server <broker hostname>:9092
Thanks Jerry
06-10-2019
05:53 AM
Hi, Can you try "INVALIDATE METADATA [[db_name.]table_name]"? We usually run this command to reload the metadata when a table was not created through Impala. Link: https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_invalidate_metadata.html Thanks Jerry
06-10-2019
04:33 AM
Hi, From the logs, it seems the client is unable to reach the broker: "Connection to node -1 (localhost/127.0.0.1:9092) could not be established". Make sure you have a broker running on that node and listening on port 9092. Also try using the fully qualified hostname or IP instead of localhost. Thanks Jerry
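A couple of quick checks on the machine you expect to host the broker (just a sketch):

```bash
# Is anything listening on the broker port?
netstat -plnt | grep 9092
# Use this fully qualified hostname instead of localhost in broker-list / bootstrap.servers.
hostname -f
```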
05-28-2019
03:56 AM
2 Kudos
Hi Harish, You can create a HiveContext and access the Hive table with it. Example program:
from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
sample = hive_context.table("default.<tablename>")
sample.show()
Reference link: https://stackoverflow.com/questions/36051091/query-hive-table-in-pyspark
05-28-2019
03:50 AM
Hi, Can you try executing a sample Spark application and let us know the result?
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client --keytab <location>/<filename>.keytab --principal <principal name> /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 1 1
Thanks Jerry
05-28-2019
03:43 AM
Hi Dean, The Gateway role exists only to carry client configuration; since it is not a running service, its state is shown as N/A, and that is expected behavior. Are you still facing this issue even after adding the Gateway role and deploying the client configuration? Thanks Jerry
04-29-2019
11:51 AM
1 Kudo
Hi, Based on the error message you have shared:
... TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
This error corresponds to the JIRA SPARK-19019 [1], a compatibility issue between Spark and Python 3.6. Spark 1.6 requires Python 2.6+ as per the documentation [2]; see also the Cloudera documentation on Spark and Python versions [3].
[1] https://issues.apache.org/jira/browse/SPARK-19019
[2] https://spark.apache.org/docs/1.6.0/#downloading
[3] https://www.cloudera.com/documentation/enterprise/5-14-x/topics/spark_python.html#spark_python__section_ark_lkn_25
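If switching the interpreter is an option, one possible workaround sketch is to point PySpark at a Python 2.7 installation (the path is illustrative; adjust it to where Python 2.7 lives on your nodes):

```bash
# Make PySpark use Python 2.7 for both the driver and the executors.
export PYSPARK_PYTHON=/usr/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/usr/bin/python2.7
pyspark
```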
04-22-2019
01:45 AM
Hi, It is showing that the application is running on an unsupported Java version. Could you let us know the following versions: Kafka, CDH, and the system Java version? 'Unsupported major.minor version 52.0' occurs when an application compiled for Java 8 is run on a Java version lower than 1.8.
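A quick check sketch; class file major version 52 corresponds to Java 8:

```bash
# Confirm which JVM the application actually runs with.
java -version
# Confirm JAVA_HOME points at a JDK 1.8 or newer if the application targets Java 8.
echo $JAVA_HOME
```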
04-16-2019
08:32 AM
2 Kudos
Hi, An Impala query is usually faster on the second run than on the first attempt of the same query. This is because of the OS cache, which keeps the file data in memory and reuses it; it is an OS-level feature and not specific to Impala. For further performance improvement, Impala can also make use of "HDFS caching", which helps speed up query results even more. Reference link: https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_perf_hdfs_caching.html Thanks Jerry
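For reference, a minimal sketch of enabling HDFS caching for a table; the cache pool and table names are illustrative:

```bash
# Create an HDFS cache pool, then mark the table as cached in that pool.
hdfs cacheadmin -addPool impala_pool
impala-shell -q "ALTER TABLE default.sample_table SET CACHED IN 'impala_pool'"
```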
03-26-2019
07:53 AM
Hi, If it is an external table, we need to delete the partition directory manually after dropping the partition:
ALTER TABLE table_name DROP [IF EXISTS] PARTITION <partition spec>;
hadoop fs -rm -r <partition file path>
Thanks Jerry
03-12-2019
09:26 AM
Hi Naveen, If you have a limited number of ports available, you can assign the ports for each application explicitly:
--conf "spark.driver.port=4050"
--conf "spark.executor.port=51001"
--conf "spark.ui.port=4005"
Hope it helps. Thanks Jerry
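For example, a sketch of where those options go on the command line; the class, jar, and port values are illustrative:

```bash
spark-submit --master yarn --deploy-mode client \
  --conf "spark.driver.port=4050" \
  --conf "spark.ui.port=4005" \
  --class com.example.MyApp myapp.jar
```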
03-05-2019
09:53 AM
Hi, Try to check which process is using port 9083:
netstat -plant | grep 9083
If required, kill that process and wait to see whether it is being restarted automatically by some script. Once that is confirmed, restart HMS and HS2 and check again. Thanks Jerry
03-05-2019
07:52 AM
Hi Renuka, Sqoop 2 is not supported in CDH 6.x. To upgrade to CDH 6.x, you must delete the Sqoop 2 service before the upgrade. Check the link below for backing up the Sqoop 2 metastore: https://www.cloudera.com/documentation/enterprise/upgrade/topics/ug_cdh_upgrade_backup.html#concept_yfz_ltv_hs Hope it helps. Thanks Jerry
03-05-2019
07:36 AM
Hi, From the shared exception it seems you are hitting the bug SQOOP-2999. A change to Sqoop (SQOOP-2737) introduced a new class, OracleUtils, which has a dependency on org.apache.commons.lang3.StringUtils. Unfortunately, the jar that fulfills this dependency is not on the classpath that Sqoop passes to the mappers. As a workaround, add the jar manually:
sqoop import -libjars /opt/cloudera/parcels/CDH/jars/commons-lang3-3.1.jar ...
Let us know if you have any questions. Thanks Jerry
03-05-2019
07:25 AM
Hi Sandeep, It seems the data has been imported, since the counts are the same. But sometimes the data types differ between the MS SQL and Hive tables, which results in NULL values. Let's check the data types on both tables; please also share the Sqoop command for a further check. Thanks Jerry
02-12-2019
06:34 AM
Hi, If there are any changes in the Hive metadata, please try running msck repair table <tablename> to get it back in sync. Reference link: https://www.cloudera.com/documentation/enterprise/5-13-x/topics/cdh_ig_hive_troubleshooting.html Thanks Jerry
01-29-2019
06:47 AM
Hi Tulasi, Could you check the value of the Container Executor Group property in the container-executor.cfg file and cross-check it with the CM configuration? Thanks Jerry
01-28-2019
10:33 AM
Hi Tulasi, Could you please verify that the container executor group is the same in both Cloudera Manager (YARN -> Configuration -> Container Executor Group) and /etc/hadoop/conf.cloudera.yarn/container-executor.cfg (on the NodeManager host)? Let us know if you have questions. Thanks Jerry
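A quick way to read that value on the NodeManager host; the property key below is the standard YARN name for this setting, so treat it as an assumption for your exact release:

```bash
# Show the configured container executor group from the client config on this host.
grep 'linux-container-executor.group' /etc/hadoop/conf.cloudera.yarn/container-executor.cfg
```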
01-08-2019
10:27 AM
Hi Vinod, 32K partitions is a huge number to handle. We can define buckets instead of partitions to avoid too many small files. Can you share the type of query you are running against these partitions? Thanks Jerry
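For reference, a minimal bucketed-table sketch in Hive; the connection URL, table, and column names are illustrative only:

```bash
# Create a table bucketed on user_id instead of partitioning it into many small partitions.
beeline -u jdbc:hive2://localhost:10000 -e "
CREATE TABLE db.events_bucketed (
  event_id BIGINT,
  user_id  STRING,
  payload  STRING
)
CLUSTERED BY (user_id) INTO 64 BUCKETS
STORED AS ORC;"
```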
01-03-2019
10:42 AM
1 Kudo
Hi, When importing an empty table from Teradata to HDFS via Sqoop using the --table option, we get the exception below:
com.teradata.connector.common.exception.ConnectorException: Input source table is empty
It is a bug on the Teradata connector side and the fix is yet to be released. Until the fix is available, we recommend using sqoop import --query as a workaround instead of --table. Hope it helps. Let us know if you have any questions. Thanks Jerry
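A sketch of the --query form of the workaround; the connection string, credentials, paths, and table names are illustrative placeholders:

```bash
# Import via a free-form query instead of --table; $CONDITIONS is required by Sqoop.
sqoop import \
  --connect jdbc:teradata://<td-host>/DATABASE=<db> \
  --username <user> -P \
  --query 'SELECT * FROM <db>.<table> WHERE $CONDITIONS' \
  --target-dir /user/<user>/<table> \
  --num-mappers 1
```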
01-02-2019
09:24 AM
Hi Fenton, Thanks for the information. The join condition could be creating a cross product, which results in the multiplication of records (as shown in the screenshot). Could you please share the query plan and the join condition so we can check further?
explain <query>
Thanks Jerry
12-20-2018
01:10 AM
Hi, This can be caused by the lack of /var/lib/alternatives/hadoop-conf on a specific host. Did you try restarting the Cloudera agent service? That could rebuild the alternatives. Run the script below to check that the alternatives are linked properly:
ls -lart /etc/alternatives | grep "CDH" | while read a b c d e f g h i j k; do
  alternatives --display $i   # $i is the link-name column of the ls -l output
done
Let us know if you have any questions. Thanks Jerry
12-18-2018
10:02 AM
Hi, We can specify the maximum size of each Parquet data file produced by Impala INSERT statements by running "set PARQUET_FILE_SIZE=<size>" before the INSERT. Reference link: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_parquet_file_size.html Impala partitioning and Hive bucketing will also help in managing the data. Link: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_partitioning.html Hope it helps. Let us know if you have any questions. Thanks Jerry
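A small sketch of using the option before an INSERT; the database and table names are illustrative, and the "m" size suffix is supported on recent Impala releases (verify for yours):

```bash
# Run the SET and the INSERT in the same impala-shell session so the option applies.
cat > /tmp/parquet_file_size_example.sql <<'EOF'
set PARQUET_FILE_SIZE=256m;
INSERT OVERWRITE TABLE db.target_parquet SELECT * FROM db.source_table;
EOF
impala-shell -f /tmp/parquet_file_size_example.sql
```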
12-17-2018
08:08 AM
Hi, We understand your query is failing with RPCTIMEOUT. There are many reasons that can cause a timeout.
1. Could you check at which point the query starts to fail (from the Impala daemon log, using the query ID from the profile)?
2. Please share the query profile.
Impala will create an optimised query plan only when it is completely aware of the table stats, so please make sure you have run "compute stats db.tablename" on all the tables used in that query. Hope it helps. Let me know if you have any questions. Thanks Jerry
12-14-2018
09:50 AM
Hi, A YARN application can go to the pending state if the resources are unavailable. Could you please share the below information so we can check further?
1. Resource Manager scheduler page screenshot
2. yarn application console output
Let us know if you have any questions. Thanks Jerry
12-14-2018
08:27 AM
2 Kudos
Hi, yarn.scheduler.maximum-allocation-mb, specified as 20 GB here, is the largest amount of physical memory that can be requested for a container, and yarn.scheduler.minimum-allocation-mb is the smallest amount of physical memory that can be requested for a container.
When we submit an MR job, the requested container memory is taken from "mapreduce.map.memory.mb", which is 1 GB by default. If it is not specified, we get a 1 GB container (the same applies to the reducer). This can be verified in the YARN logs:
mapreduce.map.memory.mb - requested container memory, 1 GB:
INFO [Thread-52] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: mapResourceRequest:<memory:1024, vCores:1>
mapreduce.map.java.opts - which is 80% of the container memory by default:
org.apache.hadoop.mapred.JobConf: Task java-opts do not specify heap size. Setting task attempt jvm max heap size to -Xmx820m
1 GB is the default and it is quite low. I recommend reading the link below. It provides a good understanding of the YARN and MR memory settings, how they relate, and how to set some baseline settings based on the cluster node size (disk, memory, and cores).
https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_yarn_tuning.html
Hope it helps. Let us know if you have any questions. Thanks Jerry
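As a rough per-job sketch of raising those values (the jar, class, paths, and sizes are illustrative; the -D overrides only take effect if the job uses ToolRunner/GenericOptionsParser):

```bash
# Request 2 GB map containers and 4 GB reduce containers, with heaps at roughly 80% of each.
hadoop jar my-mr-job.jar com.example.MyJob \
  -Dmapreduce.map.memory.mb=2048 \
  -Dmapreduce.map.java.opts=-Xmx1638m \
  -Dmapreduce.reduce.memory.mb=4096 \
  -Dmapreduce.reduce.java.opts=-Xmx3276m \
  /input/path /output/path
```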