Member since: 02-08-2016
Posts: 39
Kudos Received: 29
Solutions: 5
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 744 | 06-22-2017 05:05 PM
 | 1047 | 03-26-2017 11:55 PM
 | 1565 | 07-18-2016 03:15 PM
 | 14096 | 06-29-2016 07:43 PM
 | 672 | 06-20-2016 06:11 PM
10-02-2017
04:57 PM
2 Kudos
Exclusivity is possible in YARN using node labels. Use the Spark property spark.yarn.am.nodeLabelExpression to restrict the application master to a set of nodes while running Spark on YARN. Add node labels to whichever nodes you want to use for application masters (which, I believe, will launch the driver program for Spark). See the "Enabling YARN Node Labels" documentation. In your case, if the temporary nodes are only a handful compared to the static nodes, it is worth exploring non-exclusive node labels, which will prevent the AM from being created on temporary nodes.
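A minimal sketch of what this could look like, assuming a non-exclusive label named 'static' for the long-lived nodes (label name, host names, class, and jar below are placeholders):
# Define a non-exclusive label and attach it to the static nodes
yarn rmadmin -addToClusterNodeLabels "static(exclusive=false)"
yarn rmadmin -replaceLabelsOnNode "node1.example.com=static node2.example.com=static"
# Pin the Spark application master (and hence the driver in cluster mode) to labeled nodes
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.am.nodeLabelExpression=static \
  --class com.example.MyApp myapp.jar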
10-02-2017
04:39 PM
Are you running Spark on YARN? If so, as explained in SPARK-4253, you cannot set spark.driver.host; Spark will ignore this config item in yarn-cluster mode.
10-02-2017
04:32 PM
1 Kudo
What are you trying to compare exactly? Hive insert query performance in LLAP vs. HS2 mode? It would be helpful to know: 1) are the queue, user limits, and memory settings all the same? 2) is it a batch insert or a single-row insert? 3) how many LLAP containers are running in the cluster vs. the total configured YARN capacity?
10-02-2017
04:19 PM
Useful for Maven-based builds. Thanks!
06-22-2017
07:43 PM
1 Kudo
We want to avoid non-local reads of data as much as possible for best performance. Details here: http://ercoppa.github.io/HadoopInternals/AnatomyMapReduceJob.html#maptask-launch
06-22-2017
07:22 PM
1 Kudo
Have you confirmed whether there are containers being run on this node (and non-local reads) that are causing the job to be slow? If that's the case, I would recommend installing only the 'datanode' process first and, once the cluster is balanced (maybe after a day), adding the 'nodemanager' process to run containers on the node.
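To check how balanced the cluster is, and to speed rebalancing along, something like the following can be used (the threshold value is just an example):
# Per-DataNode usage report, to see how far the new node is from the others
hdfs dfsadmin -report
# Run the balancer until every DataNode is within 10% of average utilization
hdfs balancer -threshold 10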
06-22-2017
07:16 PM
1) If you are an HDP customer, please open a support ticket to get an exact answer for your current Ambari version and metastore type. Features are added and deprecated in every release, so it is always recommended to get an official response from support for your specific installation. 2) https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.1/bk_system-admin-guide/content/sysadminguides_ha_chap3.html
06-22-2017
07:04 PM
From what location are you invoking the hplsql shell? If you have an hplsql-site.xml in the current directory, it should take precedence over the default hplsql-site.xml under the lib folder. As an alternative, you can try using the SET command once you are inside the shell to set parameters.
06-22-2017
06:58 PM
You are probably using Hive on Tez. There is a user-level explain for Hive on Tez users. Apply the setting below and then run an 'explain' query to see a much more readable tree of operations. This is also available for Hive on Spark, where the setting is called 'hive.spark.explain.user'.
set hive.explain.user=true
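For example (the table and column names here are just placeholders):
set hive.explain.user=true;
explain select customer_id, count(*) from orders group by customer_id;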
06-22-2017
05:05 PM
If you are doing this on a single node in the cluster, then yes, delete the original copied data files and the NameNode will take care of recreating the missing data files.
06-22-2017
04:56 PM
AFAIK there are options to discover and 'repair' corrupted files that are stored in HDFS. The most common causes of file corruption are missing or corrupted HDFS blocks. HDFS may automatically act to fix such corrupt files periodically, depending on the cause (missing block, checksum mismatch, etc.). But in your case the file itself is still open and not considered 'complete' or 'closed' by HDFS, so unless you have a way to recreate the entire file from the source system by 'reprocessing' it, such files can't be 'fixed'.
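For finding such files, a couple of fsck invocations along these lines can help (the paths are placeholders):
# List files that have missing or corrupt blocks
hdfs fsck / -list-corruptfileblocks
# List files that are still open for write and therefore not yet 'closed'
hdfs fsck /data -openforwrite -files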
03-27-2017
02:25 AM
One option is to drop the existing external table and create a new table that includes the new column. Since this is a Hive metadata operation, your data files won't be touched. The downside is that you will have to execute ALTER TABLE commands to redefine the partitions on the new table.
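A rough sketch of that flow, with table, column, and location names invented for illustration:
-- the table is EXTERNAL, so dropping it leaves the data files in place
drop table sales_ext;
create external table sales_ext (
  id bigint,
  amount double,
  new_col string            -- the newly added column
)
partitioned by (ds string)
location '/data/sales';
-- re-register existing partitions, either one by one...
alter table sales_ext add partition (ds='2017-03-01');
-- ...or in bulk
msck repair table sales_ext;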
03-27-2017
01:58 AM
Can you post the XML version of your Falcon job?
03-26-2017
11:55 PM
2 Kudos
In a production-type cluster (with tens to 100+ nodes) with NameNode HA enabled, the best practice is to have 2 NameNodes (1 active and 1 standby) and 3 JournalNodes.
08-17-2016
04:31 AM
Spark Standalone mode is Spark's own built-in clustered environment. The standalone Master is the resource manager for the Spark Standalone cluster, and the standalone Workers are the worker processes in that cluster.
To install Spark Standalone mode, you simply place a compiled version of Spark on each node of the cluster. You can launch the standalone cluster either manually, by starting a master and workers by hand, or by using the launch scripts. In most enterprises you already have a Hadoop cluster running YARN and want to leverage it for resource management instead of additionally running Spark Standalone mode. When running on YARN, a Spark application runs its driver (the application master in cluster mode) and its executors inside YARN containers. Irrespective of your deployment mode, a Spark application will consume the same resources it needs to process the data. In the YARN case you have to be aware of what other workloads (MR, Tez, etc.) will be running on the cluster at the same time the Spark application is executing, and size your machines accordingly.
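For illustration, the same application submitted against both resource managers (host name, class, and jar are placeholders):
# Spark Standalone: point at the standalone Master
spark-submit --master spark://master-host:7077 --class com.example.MyApp myapp.jar
# YARN: the driver (application master) and executors run inside YARN containers
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp myapp.jar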
08-15-2016
03:17 AM
Can you share the full details of the error message? You are probably missing some libraries. If you are reading data from HDFS and running in yarn-cluster mode, your parallelism will by default be equal to the number of HDFS blocks. As a best practice, you should avoid the collect operation unless it is a small test dataset, and instead use the saveAsTextFile method to write the result dataset to HDFS or a local file.
08-10-2016
07:49 PM
What versions of Spark and HDP are you on? Can you list all jars under the SPARK_HOME directory on a worker machine in the cluster?
08-08-2016
04:33 AM
In what mode are you running the Spark application? spark-shell, yarn-client, etc.?
07-18-2016
03:15 PM
2 Kudos
We explicitly listed the FQDNs of all hosts in both clusters under the [domain_realm] section of the krb5.conf file. We have to update this file every time we add a node to our clusters. Our clusters are currently under 100 nodes, so this solution is manageable, but for large clusters this may be a challenge.
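Roughly like this, with hostnames invented for illustration and realm names taken from the related question below:
[domain_realm]
  nn1.prod.company.com = CORP.COM
  dn1.prod.company.com = CORP.COM
  nn1.dr.company.com = DR.CORP.COM
  dn1.dr.company.com = DR.CORP.COM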
07-07-2016
10:24 PM
I'm assuming you are referring to the /tmp/ directory in HDFS. You can use the command below to clean it up, and cron it to run every week (the glob is quoted so the local shell does not expand it against the local /tmp).
hadoop fs -rm -r '/tmp/*'
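A crontab sketch for the weekly run (the schedule, -skipTrash, and the log path are just an example):
# Every Sunday at 02:00, remove everything under the HDFS /tmp directory
0 2 * * 0 hadoop fs -rm -r -skipTrash '/tmp/*' >> /var/log/hdfs-tmp-clean.log 2>&1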
07-07-2016
10:20 PM
1 Kudo
We have one Active Directory KDC, realm AD.COM, which is shared by all environments (Active Directory users obtain tickets from AD.COM).
Then we have separate CORP.COM Kerberos KDCs (same realm name) in each layer: Dev, TST, AT, PROD. The existing krb5.conf config (part of it) in all environments is:
AD.COM = {
kdc = ad-kdc.com
admin_server = ad-kdc.com
}
[domain_realm]
.company.com = CORP.COM
company.com = CORP.COM
.ad.com = AD.COM
ad.com = AD.COM
[capaths]
AD.COM = {
CORP.COM = .
}
Now we want to 1) add a DR.CORP.COM Kerberos KDC for the DR cluster and 2) also set up cross-realm trust with the PROD cluster to be able to use distcp. What should the krb5.conf file look like to set up cross-realm trust for the DR cluster? Nodes in both the DR and PROD clusters have the same '.company.com' domain, so we are not sure how to set up krb5.conf for cross-realm trust. We wanted to try the configs below, but we are not sure whether clients in the DR cluster will be able to access the PROD NameNode, since the domain name is the same for all nodes in all clusters.
---- FOR DR CLUSTER
[domain_realm]
.company.com = DR.CORP.COM
company.com = DR.CORP.COM
.AD.com = AD.COM
AD.com = AD.COM
[capaths]
AD.COM = {
DR.CORP.COM = .
CORP.COM = DR.CORP.COM
}
---- FOR PROD CLUSTER
[domain_realm]
.company.com = .CORP.COM
company.com = .CORP.COM
.AD.com = AD.COM
AD.com = AD.COM
[capaths]
AD.COM = {
.CORP.COM = .
DR.CORP.COM = .CORP.COM
}
Looking for some best practices or help with the config above.
Labels:
- Apache Hadoop
07-01-2016
12:08 AM
As what user are you accessing Ambari? Does the user have admin access to Ambari?
06-29-2016
07:43 PM
3 Kudos
Use a hadoop-streaming job (with a single reducer) to merge the data from all part files into a single HDFS file on the cluster itself, and then use hdfs dfs -get to fetch that single file to the local system.
$ hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.2.0-2950.jar \
-Dmapred.reduce.tasks=1 \
-input "/hdfs/input/dir" \
-output "/hdfs/output/dir" \
-mapper cat \
-reducer cat
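Then pull the merged file down to the local filesystem; with a single reducer the output is typically a single part-00000 file (the local path is a placeholder):
$ hdfs dfs -get /hdfs/output/dir/part-00000 /local/path/merged.txt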
06-24-2016
07:25 PM
2 Kudos
You can kill a Storm topology as shown below. Use set_wait_secs to set some buffer time so that messages already in the topology are completely processed before the topology is killed. It is equivalent to the -w option of the 'storm kill' CLI command.
Map conf = Utils.readStormConfig();  // read the cluster's Storm configuration (defaults + storm.yaml)
Client client = NimbusClient.getConfiguredClient(conf).getClient();  // Thrift client for Nimbus
KillOptions killOpts = new KillOptions();
killOpts.set_wait_secs(waitSeconds); // time to wait before killing
client.killTopologyWithOpts(topology_name, killOpts); // provide topology name
I'm not sure if there is any direct way to achieve what you want without 1) changing this value for every run, or 2) setting it to a very high value (like 10 minutes) so that it is guaranteed that all messages are processed before the topology is killed. Please keep in mind that the main use case of Storm is continuous computation on data, with your topologies running forever.
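The CLI equivalent, for reference (topology name and wait time are placeholders):
storm kill my_topology -w 30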
06-22-2016
12:07 AM
What are your config/server.properties and config/producer.properties? Anything in the ZooKeeper logs?
06-20-2016
06:11 PM
1 Kudo
For the new producer API, try increasing metadata.fetch.timeout.ms in the producer config, and also socket.timeout.ms.
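For example, in the producer properties (the values here are arbitrary and just for illustration, mirroring the two settings mentioned above):
# allow more time to fetch topic metadata before failing the send
metadata.fetch.timeout.ms=120000
socket.timeout.ms=120000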