Member since: 09-29-2015
Posts: 67
Kudos Received: 45
Solutions: 10
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1967 | 05-25-2016 10:24 AM |
 | 11958 | 05-19-2016 11:24 AM |
 | 8420 | 05-13-2016 10:09 AM |
 | 3103 | 05-13-2016 06:41 AM |
 | 9027 | 03-25-2016 09:15 AM |
11-10-2015
11:44 AM
1 Kudo
In general, I configure disk space allocation for Tez job spills the same way as for YARN intermediate data. Please find here some discussions on how to configure it: http://community.hortonworks.com/questions/2230/recommended-size-for-yarnnodemanagerresourcelocal.html#answer-2282
http://community.hortonworks.com/questions/1405/can-you-please-advise-about-how-best-to-use-this-s.html?redirectedFrom=1711
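As an illustration only: assuming the goal is simply to spread Tez/YARN intermediate data across dedicated partitions, the relevant NodeManager setting would look roughly like the sketch below (the paths are hypothetical placeholders, not a recommendation):

```xml
<!-- yarn-site.xml: local directories where NodeManagers write container
     intermediate/spill data (Tez containers spill under these dirs too).
     The paths below are illustrative placeholders. -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/grid/0/hadoop/yarn/local,/grid/1/hadoop/yarn/local</value>
</property>
```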
11-02-2015
08:16 AM
1 Kudo
When working with a table of 1000 partitions with Hive concurrency enabled, I once ran into some problems. I don't know whether it is still an issue (the problem appeared last year with Hive 0.13), but I think it is worth mentioning here: http://mail-archives.apache.org/mod_mbox/hive-user/201408.mbox/%3CCAENxBwxmjN7VTJuzq1G4FimoFYkwZsWJJW4xfiq-6uR+Hb7=cA@mail.gmail.com%3E
10-28-2015
11:48 AM
It does not work in Ambari (same error as with the configuration I described before). My problem is the integration with Ambari, not the configuration in hdfs-site.xml (as mentioned before, it works fine when I edit hdfs-site.xml directly).
10-28-2015
10:59 AM
Setting the following value in the Ambari field for the property dfs.datanode.data.dir does not seem to work: /hadoop/hdfs/data,[SSD]/mnt/ssdDisk/hdfs/data
I get the warning "Must be a slash or drive at the start" and I cannot save the new configuration. Is there a way to define those storage types in Ambari? In the past I set them directly in hdfs-site.xml and it worked fine. My Ambari version is 2.1.0 and I use HDP 2.3.0 (Sandbox).
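For reference, the variant that worked for me when editing hdfs-site.xml directly was simply the same comma-separated value expressed as a property, roughly:

```xml
<!-- hdfs-site.xml: the [SSD] tag marks the storage type of that volume;
     untagged volumes default to DISK. Same value as in the Ambari field above. -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/hadoop/hdfs/data,[SSD]/mnt/ssdDisk/hdfs/data</value>
</property>
```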
Labels:
- Apache Hadoop
10-28-2015
08:04 AM
1 Kudo
If you use the same partitions for YARN intermediate data as for the HDFS blocks, then you might also consider setting the dfs.datanode.du.reserved property, which reserves some space on those partitions for non-HDFS use (such as YARN intermediate data). One baseline recommendation I saw in my first Hadoop training a long time ago was to dedicate 25% of the "data disks" to that kind of intermediate data. I guess the optimal answer should consider the maximum amount of intermediate data you can get at any one time (when launching a job, do you use all the data in HDFS as input?) and size the space for yarn.nodemanager.local-dirs accordingly. I would also recommend turning on the property mapreduce.map.output.compress in order to reduce the size of the intermediate data.
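As a rough sketch only (the reserved size below is an arbitrary example, not a recommendation), the two properties mentioned above would be set like this:

```xml
<!-- hdfs-site.xml: keep some space on each DataNode volume for non-HDFS data
     (e.g. YARN/Tez intermediate files). The value is in bytes per volume;
     100 GB here is just an illustrative figure. -->
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>107374182400</value>
</property>
```

```xml
<!-- mapred-site.xml: compress map output to shrink intermediate data on disk. -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
```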
10-22-2015
05:40 PM
Apart from the fact that the partition is getting full, the main reason I see to move the checkpoint directory is that you cannot trust that data under /tmp will survive a reboot of your server. In general, avoid putting any kind of Hadoop information (data or metadata) under /tmp, unless you are sure it is truly temporary or non-critical information.
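Assuming the checkpoint directory in question is the one controlled by dfs.namenode.checkpoint.dir (that is my reading of the question; the target path below is just an example), moving it off /tmp would look roughly like:

```xml
<!-- hdfs-site.xml: point the checkpoint directory to a persistent location
     instead of the default under ${hadoop.tmp.dir}. Path is illustrative. -->
<property>
  <name>dfs.namenode.checkpoint.dir</name>
  <value>/hadoop/hdfs/namesecondary</value>
</property>
```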
10-22-2015
03:22 PM
Doesn't YARN offer a protection mechanism against too much overcommitting? I am thinking of the parameters:
- yarn.nodemanager.pmem-check-enabled
- yarn.nodemanager.vmem-check-enabled
- yarn.nodemanager.vmem-pmem-ratio
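For reference, a yarn-site.xml sketch of those three parameters with the stock Hadoop 2.x defaults (your distribution may ship different values):

```xml
<!-- yarn-site.xml: kill containers that exceed their allocated physical memory -->
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>true</value>
</property>
<!-- also enforce a virtual-memory limit on containers... -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>true</value>
</property>
<!-- ...defined as this ratio of virtual to allocated physical memory -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>
```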
10-22-2015
01:09 PM
Instead of using an external monitoring tool such as Upstart or Supervisor, I would recommend using a cluster software solution. In the past I have used the Pacemaker software (http://clusterlabs.org) with good success (not for Hadoop, though). It not only detects failures, but also automatically brings the standby daemon up, can handle dependencies (first recover the database, then the application), define fencing, and apply placement policies (for instance, avoiding having the database and the application on the same node).
10-21-2015
11:46 AM
1 Kudo
One thing I did in the past to avoid that script being loaded multiple times was to edit it and add a kind of global guard variable that detects whether the script has already been executed. Something like this, placed at the beginning of the script, just after the first line #!/bin/bash:

    [ "x$SCRIPT_HADOOP_ENV_LOADED" = "x1" ] && return
    export SCRIPT_HADOOP_ENV_LOADED=1
10-15-2015
10:43 AM
If in doubt, you could decrease the replication factor for that folder to 2, or even 1 (although 1 is somewhat risky).
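For example (the path below is a placeholder), the replication factor of an existing folder can be lowered with the standard setrep command; applied to a directory, it changes the replication of all files underneath:

```bash
# Reduce the replication factor to 2 for everything under the given folder.
# -w waits until the re-replication actually finishes (can take a while).
hdfs dfs -setrep -w 2 /path/to/folder
```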