Member since: 02-08-2016
Posts: 39
Kudos Received: 29
Solutions: 5
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 744 | 06-22-2017 05:05 PM
 | 1047 | 03-26-2017 11:55 PM
 | 1565 | 07-18-2016 03:15 PM
 | 14096 | 06-29-2016 07:43 PM
 | 672 | 06-20-2016 06:11 PM
10-02-2017
04:57 PM
2 Kudos
Exclusivity is possible in YARN using node labels. Use the Spark property spark.yarn.am.nodeLabelExpression to restrict the application master to a set of nodes while running Spark on YARN. Add node labels to whichever nodes you want to use for application masters (which, I believe, will launch the driver program for Spark). See the "Enabling YARN Node Labels" documentation. In your case, if the temporary nodes are only a handful compared to the static nodes, it is worth exploring non-exclusive node labels, which will prevent the AM from being created on temporary nodes.
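A minimal sketch of what this could look like, assuming a non-exclusive label named 'static' for the long-lived nodes (label name, host names, class, and jar below are placeholders):
# Define a non-exclusive label and attach it to the static nodes
yarn rmadmin -addToClusterNodeLabels "static(exclusive=false)"
yarn rmadmin -replaceLabelsOnNode "node1.example.com=static node2.example.com=static"
# Pin the Spark application master (and hence the driver in cluster mode) to labeled nodes
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.am.nodeLabelExpression=static \
  --class com.example.MyApp myapp.jar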
10-02-2017
04:39 PM
Are you running Spark on YARN? If so, as explained in SPARK-4253, you cannot set spark.driver.host; Spark will ignore this config item in yarn-cluster mode.
10-02-2017
04:32 PM
1 Kudo
What are you trying to compare exactly? Hive insert query performance in LLAP vs. HS2 mode? It would be helpful to know: 1) are the queue, user limits, and memory settings all the same? 2) is it a batch insert or a single-row insert? 3) how many LLAP containers are running in the cluster vs. the total configured YARN capacity?
10-02-2017
04:19 PM
Useful for Maven-based builds. Thanks!
06-22-2017
07:43 PM
1 Kudo
We want to avoid non-local reads of data as much as possible for best performance. Details here: http://ercoppa.github.io/HadoopInternals/AnatomyMapReduceJob.html#maptask-launch
06-22-2017
07:22 PM
1 Kudo
Have you confirmed whether there are containers being run on this node (and non-local reads) that are causing the job to be slow? If that's the case, I would recommend installing only the 'datanode' process first and, once the cluster is balanced (maybe after a day), adding the 'nodemanager' process to run containers on the node.
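To check how balanced the cluster is, and to speed rebalancing along, something like the following can be used (the threshold value is just an example):
# Per-DataNode usage report, to see how far the new node is from the others
hdfs dfsadmin -report
# Run the balancer until every DataNode is within 10% of average utilization
hdfs balancer -threshold 10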
06-22-2017
07:16 PM
1) If you are an HDP customer, please open a support ticket to get an exact answer for your current Ambari version and metastore type. Features are added and deprecated in every release, so it is always recommended to get an official response from support for your specific installation. 2) https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.1/bk_system-admin-guide/content/sysadminguides_ha_chap3.html
06-22-2017
07:04 PM
From what location are you invoking the hplsql shell? If you have an hplsql-site.xml in the current directory, it should take precedence over the default hplsql-site.xml under the lib folder. As an alternative, you can try using the SET command once you are inside the shell to set parameters.
06-22-2017
06:58 PM
You are probably using Hive on Tez. There is a user-level explain for Hive on Tez users. Apply the setting below and then run an 'explain' query to see a much more readable tree of operations. This is also available for Hive on Spark, where the setting is called 'hive.spark.explain.user'.
set hive.explain.user=true
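For example (the table and column names here are just placeholders):
set hive.explain.user=true;
explain select customer_id, count(*) from orders group by customer_id;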
06-22-2017
05:05 PM
If you are doing this on a single node in the cluster, then yes, delete the original copied data files and the NameNode will take care of recreating the missing data files.
06-22-2017
04:56 PM
AFAIK there are options to discover and 'repair' corrupted files that are stored in HDFS. The most common causes of file corruption are missing or corrupted HDFS blocks. HDFS may automatically act to fix such corrupt files periodically, depending on the cause (missing block, checksum mismatch, etc.). But in your case the file itself is still open and not considered 'complete' or 'closed' by HDFS, so unless you have a way to recreate the entire file from the source system by 'reprocessing' it, such files can't be 'fixed'.
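For finding such files, a couple of fsck invocations along these lines can help (the paths are placeholders):
# List files that have missing or corrupt blocks
hdfs fsck / -list-corruptfileblocks
# List files that are still open for write and therefore not yet 'closed'
hdfs fsck /data -openforwrite -files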
03-27-2017
02:25 AM
One option is to drop the existing external table and create a new table that includes the new column. Since this is a Hive metadata operation, your data files won't be touched. The downside is that you will have to execute ALTER TABLE commands to redefine the partitions on the new table.
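A rough sketch of that flow, with table, column, and location names invented for illustration:
-- the table is EXTERNAL, so dropping it leaves the data files in place
drop table sales_ext;
create external table sales_ext (
  id bigint,
  amount double,
  new_col string            -- the newly added column
)
partitioned by (ds string)
location '/data/sales';
-- re-register existing partitions, either one by one...
alter table sales_ext add partition (ds='2017-03-01');
-- ...or in bulk
msck repair table sales_ext;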
03-27-2017
01:58 AM
Can you post the XML version of your Falcon job?
03-26-2017
11:55 PM
2 Kudos
In a production-type cluster (with tens to 100+ nodes) with NameNode HA enabled, the best practice is to have 2 NameNodes (1 active and 1 standby) and 3 JournalNodes.
08-17-2016
04:31 AM
Spark Standalone mode is Spark's own built-in clustered environment. The standalone Master is the resource manager for the Spark Standalone cluster, and the standalone Workers are the worker processes in that cluster.
To install Spark Standalone mode, you simply place a compiled version of Spark on each node of the cluster. You can launch the standalone cluster either manually, by starting a master and workers by hand, or by using the launch scripts. In most enterprises you already have a Hadoop cluster running YARN and want to leverage it for resource management instead of additionally running Spark Standalone mode. When running on YARN, a Spark application runs its driver (the application master in cluster mode) and its executors inside YARN containers. Irrespective of your deployment mode, a Spark application will consume the same resources it needs to process the data. In the YARN case you have to be aware of what other workloads (MR, Tez, etc.) will be running on the cluster at the same time the Spark application is executing, and size your machines accordingly.
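For illustration, the same application submitted against both resource managers (host name, class, and jar are placeholders):
# Spark Standalone: point at the standalone Master
spark-submit --master spark://master-host:7077 --class com.example.MyApp myapp.jar
# YARN: the driver (application master) and executors run inside YARN containers
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp myapp.jar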
08-15-2016
03:17 AM
Can you share the full details of the error message? You are probably missing some libraries. If you are reading data from HDFS and running in yarn-cluster mode, your parallelism will by default be equal to the number of HDFS blocks. As a best practice, you should avoid the collect operation unless it is a small test dataset, and instead use the saveAsTextFile method to write the result dataset to HDFS or a local file.
08-10-2016
07:49 PM
What versions of Spark and HDP are you on? Can you list all jars under the SPARK_HOME directory on a worker machine in the cluster?
08-08-2016
04:33 AM
In what mode are you running the Spark application? spark-shell, yarn-client, etc.?
07-18-2016
03:15 PM
2 Kudos
We explicitly listed the FQDNs of all hosts in both clusters under the [domain_realm] section of the krb5.conf file. We have to update this file every time we add a node to our clusters. Our clusters are currently under 100 nodes, so this solution is manageable, but for large clusters this may be a challenge.
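Roughly like this, with hostnames invented for illustration and realm names taken from the related question below:
[domain_realm]
  nn1.prod.company.com = CORP.COM
  dn1.prod.company.com = CORP.COM
  nn1.dr.company.com = DR.CORP.COM
  dn1.dr.company.com = DR.CORP.COM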
07-07-2016
10:24 PM
I'm assuming you are referring to the /tmp/ directory in HDFS. You can use the command below to clean it up, and cron it to run every week (the glob is quoted so the local shell does not expand it against the local /tmp).
hadoop fs -rm -r '/tmp/*'
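A crontab sketch for the weekly run (the schedule, -skipTrash, and the log path are just an example):
# Every Sunday at 02:00, remove everything under the HDFS /tmp directory
0 2 * * 0 hadoop fs -rm -r -skipTrash '/tmp/*' >> /var/log/hdfs-tmp-clean.log 2>&1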
07-07-2016
10:20 PM
1 Kudo
We have one Active Directory KDC, realm AD.COM, which is shared by all environments (Active Directory users obtain tickets from AD.COM).
Then we have separate CORP.COM Kerberos KDCs (same realm name) in each layer: Dev, TST, AT, PROD. The existing krb5.conf config (part of it) in all environments is:
AD.COM = {
kdc = ad-kdc.com
admin_server = ad-kdc.com
}
[domain_realm]
.company.com = CORP.COM
company.com = CORP.COM
.ad.com = AD.COM
ad.com = AD.COM
[capaths]
AD.COM = {
CORP.COM = .
}
Now we want to 1) add a DR.CORP.COM Kerberos KDC for the DR cluster and 2) also set up cross-realm trust with the PROD cluster to be able to use distcp. What should the krb5.conf file look like to set up cross-realm trust for the DR cluster? Nodes in both the DR and PROD clusters have the same '.company.com' domain, so we are not sure how to set up krb5.conf for cross-realm trust. We wanted to try the configs below, but we are not sure whether clients in the DR cluster will be able to access the PROD NameNode, since the domain name is the same for all nodes in all clusters.
---- FOR DR CLUSTER
[domain_realm]
.company.com = DR.CORP.COM
company.com = DR.CORP.COM
.AD.com = AD.COM
AD.com = AD.COM
[capaths]
AD.COM = {
DR.CORP.COM = .
CORP.COM = DR.CORP.COM
}
---- FOR PROD CLUSTER
[domain_realm]
.company.com = .CORP.COM
company.com = .CORP.COM
.AD.com = AD.COM
AD.com = AD.COM
[capaths]
AD.COM = {
.CORP.COM = .
DR.CORP.COM = .CORP.COM
}
Looking for some best practices or help with the config above.
Labels:
- Apache Hadoop
07-01-2016
12:08 AM
As what user are you accessing Ambari? Does the user have admin access to Ambari?
06-29-2016
07:43 PM
3 Kudos
Use a hadoop-streaming job (with a single reducer) to merge the data from all part files into a single HDFS file on the cluster itself, and then use hdfs dfs -get to fetch that single file to the local system.
$ hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.2.0-2950.jar \
-Dmapred.reduce.tasks=1 \
-input "/hdfs/input/dir" \
-output "/hdfs/output/dir" \
-mapper cat \
-reducer cat
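Then pull the merged file down to the local filesystem; with a single reducer the output is typically a single part-00000 file (the local path is a placeholder):
$ hdfs dfs -get /hdfs/output/dir/part-00000 /local/path/merged.txt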
06-24-2016
07:25 PM
2 Kudos
You can kill a Storm topology as shown below. Use set_wait_secs to set some buffer time so that messages already in the topology are completely processed before the topology is killed. It is equivalent to the -w option of the 'storm kill' CLI command.
Map conf = Utils.readStormConfig();  // read the cluster's Storm configuration (defaults + storm.yaml)
Client client = NimbusClient.getConfiguredClient(conf).getClient();  // Thrift client for Nimbus
KillOptions killOpts = new KillOptions();
killOpts.set_wait_secs(waitSeconds); // time to wait before killing
client.killTopologyWithOpts(topology_name, killOpts); // provide topology name
I'm not sure if there is any direct way to achieve what you want without 1) changing this value for every run, or 2) setting it to a very high value (like 10 minutes) so that it is guaranteed that all messages are processed before the topology is killed. Please keep in mind that the main use case of Storm is continuous computation on data, with your topologies running forever.
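The CLI equivalent, for reference (topology name and wait time are placeholders):
storm kill my_topology -w 30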
06-22-2016
12:07 AM
What are your config/server.properties and config/producer.properties? Anything in the ZooKeeper logs?
06-20-2016
06:11 PM
1 Kudo
For the new producer API, try increasing metadata.fetch.timeout.ms in the producer config, and also socket.timeout.ms.
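For example, in the producer properties (the values here are arbitrary and just for illustration, mirroring the two settings mentioned above):
# allow more time to fetch topic metadata before failing the send
metadata.fetch.timeout.ms=120000
socket.timeout.ms=120000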