Member since: 09-29-2015
Posts: 67
Kudos Received: 45
Solutions: 10
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1967 | 05-25-2016 10:24 AM |
 | 11958 | 05-19-2016 11:24 AM |
 | 8420 | 05-13-2016 10:09 AM |
 | 3103 | 05-13-2016 06:41 AM |
 | 9027 | 03-25-2016 09:15 AM |
11-10-2015
11:44 AM
1 Kudo
In general, I configure disk space allocation for Tez job spills the same way as for YARN intermediate data. Please find here some discussions on how to configure it: http://community.hortonworks.com/questions/2230/recommended-size-for-yarnnodemanagerresourcelocal.html#answer-2282
http://community.hortonworks.com/questions/1405/can-you-please-advise-about-how-best-to-use-this-s.html?redirectedFrom=1711
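As an illustration only: assuming the goal is simply to spread Tez/YARN intermediate data across dedicated partitions, the relevant NodeManager setting would look roughly like the sketch below (the paths are hypothetical placeholders, not a recommendation):

```xml
<!-- yarn-site.xml: local directories where NodeManagers write container
     intermediate/spill data (Tez containers spill under these dirs too).
     The paths below are illustrative placeholders. -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/grid/0/hadoop/yarn/local,/grid/1/hadoop/yarn/local</value>
</property>
```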
11-02-2015
08:16 AM
1 Kudo
When working with a table of 1000 partitions with Hive concurrency enabled, I once ran into some problems. I don't know whether it is still an issue (the problem appeared last year with Hive 0.13), but I think it is worth mentioning here: http://mail-archives.apache.org/mod_mbox/hive-user/201408.mbox/%3CCAENxBwxmjN7VTJuzq1G4FimoFYkwZsWJJW4xfiq-6uR+Hb7=cA@mail.gmail.com%3E
10-28-2015
11:48 AM
It does not work in Ambari (same error as with the configuration I described before). My problem is the integration with Ambari, not the configuration in hdfs-site.xml (as mentioned before, it works fine when I edit hdfs-site.xml directly).
10-28-2015
10:59 AM
Setting the following value in the Ambari field for the property dfs.datanode.data.dir does not seem to work: /hadoop/hdfs/data,[SSD]/mnt/ssdDisk/hdfs/data
I get the warning "Must be a slash or drive at the start" and I cannot save the new configuration. Is there a way to define those storage types in Ambari? In the past I set them directly in hdfs-site.xml and it worked fine. My Ambari version is 2.1.0 and I use HDP 2.3.0 (Sandbox).
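For reference, the variant that worked for me when editing hdfs-site.xml directly was simply the same comma-separated value expressed as a property, roughly:

```xml
<!-- hdfs-site.xml: the [SSD] tag marks the storage type of that volume;
     untagged volumes default to DISK. Same value as in the Ambari field above. -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/hadoop/hdfs/data,[SSD]/mnt/ssdDisk/hdfs/data</value>
</property>
```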
Labels:
- Apache Hadoop
10-28-2015
08:04 AM
1 Kudo
If you use the same partitions for YARN intermediate data as for the HDFS blocks, then you might also consider setting the dfs.datanode.du.reserved property, which reserves some space on those partitions for non-HDFS use (such as YARN intermediate data). One baseline recommendation I saw in my first Hadoop training a long time ago was to dedicate 25% of the "data disks" to that kind of intermediate data. I guess the optimal answer should consider the maximum amount of intermediate data you can get at any one time (when launching a job, do you use all the data in HDFS as input?) and size the space for yarn.nodemanager.local-dirs accordingly. I would also recommend turning on the property mapreduce.map.output.compress in order to reduce the size of the intermediate data.
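As a rough sketch only (the reserved size below is an arbitrary example, not a recommendation), the two properties mentioned above would be set like this:

```xml
<!-- hdfs-site.xml: keep some space on each DataNode volume for non-HDFS data
     (e.g. YARN/Tez intermediate files). The value is in bytes per volume;
     100 GB here is just an illustrative figure. -->
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>107374182400</value>
</property>
```

```xml
<!-- mapred-site.xml: compress map output to shrink intermediate data on disk. -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
```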
10-22-2015
05:40 PM
Apart from the fact that the partition is getting full, the main reason I see to move the checkpoint directory is that you cannot trust that data under /tmp will survive a reboot of your server. In general, avoid putting any kind of Hadoop information (data or metadata) under /tmp, unless you are sure it is truly temporary or non-critical information.
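Assuming the checkpoint directory in question is the one controlled by dfs.namenode.checkpoint.dir (that is my reading of the question; the target path below is just an example), moving it off /tmp would look roughly like:

```xml
<!-- hdfs-site.xml: point the checkpoint directory to a persistent location
     instead of the default under ${hadoop.tmp.dir}. Path is illustrative. -->
<property>
  <name>dfs.namenode.checkpoint.dir</name>
  <value>/hadoop/hdfs/namesecondary</value>
</property>
```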
10-22-2015
03:22 PM
Doesn't YARN offer a protection mechanism against too much overcommitting? I am thinking of the parameters:
- yarn.nodemanager.pmem-check-enabled
- yarn.nodemanager.vmem-check-enabled
- yarn.nodemanager.vmem-pmem-ratio
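For reference, a yarn-site.xml sketch of those three parameters with the stock Hadoop 2.x defaults (your distribution may ship different values):

```xml
<!-- yarn-site.xml: kill containers that exceed their allocated physical memory -->
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>true</value>
</property>
<!-- also enforce a virtual-memory limit on containers... -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>true</value>
</property>
<!-- ...defined as this ratio of virtual to allocated physical memory -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>
```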
10-22-2015
01:09 PM
Instead of using an external monitoring tool such as Upstart or Supervisor, I would recommend using a cluster software solution. In the past I have used the Pacemaker software (http://clusterlabs.org) with good success (not for Hadoop, though). It not only detects failures, but also automatically brings the standby daemon up, can handle dependencies (first recover the database, then the application), define fencing, and apply placement policies (for instance, avoiding having the database and the application on the same node).
10-21-2015
11:46 AM
1 Kudo
One thing I did in the past to avoid that script being loaded multiple times was to edit it and add a kind of global guard variable that detects whether the script has already been executed. Something like this, placed at the beginning of the script, just after the first line #!/bin/bash:

    [ "x$SCRIPT_HADOOP_ENV_LOADED" = "x1" ] && return
    export SCRIPT_HADOOP_ENV_LOADED=1
10-15-2015
10:43 AM
If in doubt, you could decrease the replication factor for that folder to 2, or even 1 (although 1 is somewhat risky).
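For example (the path below is a placeholder), the replication factor of an existing folder can be lowered with the standard setrep command; applied to a directory, it changes the replication of all files underneath:

```bash
# Reduce the replication factor to 2 for everything under the given folder.
# -w waits until the re-replication actually finishes (can take a while).
hdfs dfs -setrep -w 2 /path/to/folder
```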