Member since: 02-08-2016
Posts: 39
Kudos Received: 29
Solutions: 5

My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
| | 1486 | 06-22-2017 05:05 PM |
| | 2047 | 03-26-2017 11:55 PM |
| | 2472 | 07-18-2016 03:15 PM |
| | 17890 | 06-29-2016 07:43 PM |
| | 1450 | 06-20-2016 06:11 PM |
10-02-2017
04:19 PM
Useful for Maven-based builds... thanks.
06-22-2017
06:58 PM
You are probably using Hive on Tez. There is a user-level explain for Hive on Tez users. Apply the setting below and then run an 'explain' query to see a much more readable tree of operations. This is also available for Hive on Spark, where the setting is called 'hive.spark.explain.user'.

set hive.explain.user=true
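For example, in a Hive session (a minimal sketch; the table and query are made up for illustration):

```sql
-- Enable the user-friendly, tree-formatted explain output
SET hive.explain.user=true;

-- The plan now prints as a readable operator tree
EXPLAIN SELECT dept, COUNT(*) FROM employees GROUP BY dept;
```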
06-22-2017
05:05 PM
If you are doing this on a single node in the cluster then yes, delete the original copied data files and the Namenode will take care of re-replicating the missing blocks from the replicas on other nodes.
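To verify that HDFS is healthy again afterwards, something like this should work (the path is just an example):

```
# Report files, blocks, and any missing/under-replicated blocks under a path
hdfs fsck /user/hive/warehouse -files -blocks

# Cluster-wide summary, including under-replicated block counts
hdfs dfsadmin -report
```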
03-27-2017
02:25 AM
One option is to drop the existing external table and create a new table that includes the new column. Since this is a Hive metadata-only operation, your data files won't be touched. The downside is that you will have to execute ALTER TABLE commands to redefine the partitions on the new table.
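A rough sketch of the sequence (table, column, and partition names are hypothetical):

```sql
-- Dropping an EXTERNAL table removes only metadata; the files under LOCATION stay put
DROP TABLE events;

CREATE EXTERNAL TABLE events (
  id BIGINT,
  payload STRING,
  new_col STRING              -- the newly added column
)
PARTITIONED BY (dt STRING)
LOCATION '/data/events';

-- Re-register each partition against its existing directory
ALTER TABLE events ADD PARTITION (dt='2017-03-26') LOCATION '/data/events/dt=2017-03-26';
-- (or run: MSCK REPAIR TABLE events; to discover all partition directories at once)
```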
03-26-2017
11:55 PM
2 Kudos
In a production-grade cluster (tens to 100+ nodes) with Namenode HA enabled, the best practice is to have 2 Namenodes (1 active and 1 standby) and 3 Journal nodes.
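As an illustration, with the Quorum Journal Manager both Namenodes point at the three JournalNodes in hdfs-site.xml (hostnames and the nameservice ID below are placeholders):

```xml
<!-- Shared edits directory backed by a quorum of 3 JournalNodes -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>
```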
08-17-2016
04:31 AM
Spark Standalone mode is Spark's own built-in clustered environment. The Standalone Master is the resource manager for the Spark Standalone cluster, and the Standalone Workers are the workers in that cluster.
To install Spark Standalone mode, you simply place a compiled version of Spark on each node of the cluster. You can launch the standalone cluster either manually, by starting a master and workers by hand, or by using the launch scripts. In most enterprises, you already have a Hadoop cluster running YARN and want to leverage it for resource management instead of additionally running Spark Standalone mode. If using YARN, a Spark application will run its Spark master and workers within YARN containers. Irrespective of your deployment mode, a Spark application will consume the same resources it requires to process the data. In the case of YARN, you have to be aware of what other workloads (MR, Tez, etc.) will be running on the cluster at the same time your Spark application is executing, and size your machines accordingly.
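Both launch paths look roughly like this (hostnames and the application class are placeholders; script names are as shipped in Spark 1.x/2.x):

```
# Standalone: start a master, then point a worker at it
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-slave.sh spark://master-host:7077

# Submit an application to the standalone cluster
$SPARK_HOME/bin/spark-submit --master spark://master-host:7077 --class com.example.App app.jar

# Or submit the same application to YARN instead
$SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --class com.example.App app.jar
```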
08-15-2016
03:17 AM
Can you share the full details of the error message? You are probably missing some libraries. If you are reading data from HDFS and running in yarn-cluster mode, your parallelism by default will be equal to the number of HDFS blocks. As a best practice, avoid the collect operation unless it is a small test dataset; instead use the saveAsTextFile method to write the result dataset to HDFS or a local file.
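A minimal sketch of that pattern from spark-shell (paths are placeholders; RDD API as in Spark 1.x):

```scala
// Default parallelism follows the number of HDFS blocks in the input
val lines = sc.textFile("hdfs:///data/input")

val results = lines.filter(_.nonEmpty).map(_.toUpperCase)

// Avoid results.collect() on large data -- it pulls everything to the driver.
// Write back to HDFS instead:
results.saveAsTextFile("hdfs:///data/output")
```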
08-10-2016
07:49 PM
What version of Spark and HDP are you on? Can you list all the jars under the SPARK_HOME directory on a worker machine in the cluster?
08-08-2016
04:33 AM
In what mode are you running the Spark application? spark-shell, yarn-client, etc.?
07-18-2016
03:15 PM
2 Kudos
We explicitly listed the FQDNs of all hosts in both clusters under the [domain_realm] section of the krb5.conf file. We have to update this file every time we add a node to our clusters. Our clusters are currently under 100 nodes, so this solution is manageable, but for large clusters it may be a challenge.
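The mapping looks roughly like this (hostnames and realms are placeholders):

```
[domain_realm]
    node1.clustera.example.com = CLUSTERA.EXAMPLE.COM
    node2.clustera.example.com = CLUSTERA.EXAMPLE.COM
    node1.clusterb.example.com = CLUSTERB.EXAMPLE.COM
    node2.clusterb.example.com = CLUSTERB.EXAMPLE.COM
```

For larger clusters, a domain-suffix rule such as `.clustera.example.com = CLUSTERA.EXAMPLE.COM` maps every host in that DNS domain without per-host entries.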