Member since: 07-30-2019
Posts: 111
Kudos Received: 186
Solutions: 35
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3244 | 02-07-2018 07:12 PM |
| | 2450 | 10-27-2017 06:16 PM |
| | 2721 | 10-13-2017 10:30 PM |
| | 4993 | 10-12-2017 10:09 PM |
| | 1262 | 06-29-2017 10:19 PM |
07-26-2016
12:16 AM
1 Kudo
You may run into slow Hadoop service starts on your OS X development laptop. You can check for this by opening your service logs and looking for large (5-10 second) gaps between successive log entries at startup.

Diagnosis

It often manifests as failures of MiniDFSCluster-based tests that use short timeouts (under 10 seconds). Here is an example from a NameNode log file with a 5 second stall at startup:

```
2016-07-25 14:57:37,982 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2016-07-25 14:57:43,060 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
```

Another 5 second stall during NameNode startup:

```
2016-07-25 14:57:48,790 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Append Enabled: true
2016-07-25 14:57:53,914 INFO org.apache.hadoop.util.GSet: Computing capacity for map INodeMap
```

Resolution

If you see this behavior you are likely running into an OS X bug. The fix is to put all your entries for localhost on one line, as described in this StackOverflow answer. That is, make sure your /etc/hosts file has something like this:

```
# Replace myhostname with the hostname of your laptop.
#
127.0.0.1 localhost myhostname myhostname.local myhostname.Home
```

Instead of this:

```
127.0.0.1 localhost myhostname.local
127.0.0.1 myhostname myhostname.Home
```

Root Cause

The root cause of this problem appears to be a long delay when looking up the local host name with InetAddress.getLocalHost. The following code is a minimal repro of the problem on affected systems:

```java
import java.net.*;

class Lookup {
    public static void main(String[] args) throws Exception {
        System.out.println(InetAddress.getLocalHost().getCanonicalHostName());
    }
}
```

This program can take over 5 seconds to execute on an affected machine. Verified on OS X 10.10.5 with Oracle JDK 1.8.0_91 and 1.7.0_79.
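If you want to confirm whether your machine is affected, a small variation of the repro can time the lookup directly. This is just a minimal sketch; the LookupTimer class name is made up for illustration:

```java
import java.net.InetAddress;

// Times the local host name lookup so the stall is visible directly.
class LookupTimer {
    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        String name = InetAddress.getLocalHost().getCanonicalHostName();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(name + " resolved in " + elapsedMs + " ms");
    }
}
```

On an affected machine the printed time will be in the multi-second range; after fixing /etc/hosts it should drop to a few milliseconds.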
07-20-2016
12:59 AM
1 Kudo
It's common for mature software products to have parallel maintenance lines. Enterprises often delay upgrades due to regulatory/certification requirements or other reasons.
07-19-2016
05:11 PM
2 Kudos
Hi @bigdata.neophyte, the Apache Hadoop community is actively maintaining two stable release lines:
- 2.7.x - The latest release in this line is 2.7.2. There should be an RC for 2.7.3 out later this month; the git branch for the 2.7.3 release is branch-2.7.3. The next maintenance release on this line will be 2.7.4, which is currently tracked on branch-2.7.
- 2.6.x - The latest release in this line is 2.6.4. There may be a 2.6.5 (off branch-2.6), but no release manager is actively driving it right now (any Hadoop committer can be a release manager).

The 2.8.0 release has been significantly delayed because community effort was diverted to stabilizing 2.6.x and 2.7.x. It is planned, but there is no timetable for it yet.

Also, common-dev at hadoop.apache.org would be a good place for this question, as you are likely to get responses from release managers.
07-18-2016
10:27 PM
1 Kudo
Note that 'hdfs fsck / -files -blocks -locations' is a workaround. It can be slow on large clusters. There is no efficient way to query the block locations of a file.
07-18-2016
02:36 AM
1 Kudo
Unrelated to your question, please consider upgrading to a more recent HDP release. HDP 2.4.2 has numerous performance and stability improvements over 2.1.x, not to mention tons of new features.
07-17-2016
06:24 PM
1 Kudo
Added the Ambari tag for better visibility by Ambari experts.
07-17-2016
06:21 PM
1 Kudo
Hi @Peter Kim,

The NameNode selects a set of DataNodes for placing the replicas of a newly allocated block. Each DataNode then independently selects the target disk for its replica using a round-robin policy. So replica placement looks like your case 1, i.e.:

```
Case 1. BlockPool - blk_..., blk_...meta -> DataNode1 - disk1 | DataNode8 - disk2 | DataNode3 - disk6 ...
```

There is no good way to redistribute blocks across disks on a DataNode, as @Hari Rongali mentioned. However, a Disk Balancer feature is under development to address this use case.

Also, if I understand correctly, you have two DataNode storage directories on one physical volume. We do not recommend doing that, as it will affect your performance. You should have a one-to-one relationship between storage directories and physical volumes (assuming you are using disks in the recommended JBOD configuration).
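For intuition, here is a minimal sketch of the round-robin idea. The class and method names are made up for illustration and this is not the actual DataNode code; it simply shows why a disk added later stays emptier, since the picker cycles through disks without considering how full each one already is:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: each new replica written to this DataNode goes to the
// next disk in sequence, regardless of existing disk usage.
class RoundRobinVolumePicker {
    private final List<String> volumes;               // e.g. /data1, /data2, ...
    private final AtomicInteger next = new AtomicInteger(0);

    RoundRobinVolumePicker(List<String> volumes) {
        this.volumes = volumes;
    }

    String pickVolumeForNewReplica() {
        // floorMod keeps the index valid even if the counter wraps around.
        int i = Math.floorMod(next.getAndIncrement(), volumes.size());
        return volumes.get(i);
    }

    public static void main(String[] args) {
        RoundRobinVolumePicker picker =
                new RoundRobinVolumePicker(Arrays.asList("/data1", "/data2", "/data3"));
        for (int replica = 0; replica < 6; replica++) {
            System.out.println("replica " + replica + " -> " + picker.pickVolumeForNewReplica());
        }
    }
}
```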
07-17-2016
06:03 PM
1 Kudo
Hi @Kartik Vashishta,

I can answer for the HDFS services. The NameNode heap size depends on the total number of file system objects (files and blocks) that you have. The exact heap tuning recommendations are documented in the HDP manual install section (the same link that @Sandeep Nemuri provided in another answer). I recommend checking that the Ambari-configured values are in line with these recommendations, since misconfigured heap settings significantly affect NameNode performance. Also, the heap size requirement changes over time as cluster usage grows.

The DataNode heap size requirement depends on the total number of blocks on each DataNode. The default 1 GB heap is insufficient for larger-capacity DataNodes; we now recommend a 4 GB heap for DataNodes, as Benjamin suggested.

Ensuring you have GC logging enabled for your services is a good idea. There is an HCC article on NameNode heap tuning that goes into a lot more detail on related topics.
07-11-2016
01:16 AM
4 Kudos
Hi @Leon L, the easiest way to do so from the command line, if you are an administrator, is to run the 'fsck' command with the -files -blocks -locations options, e.g.:

```
$ hdfs fsck /myfile.txt -files -blocks -locations
Connecting to namenode via http://localhost:50070
FSCK started by someuser (auth:SIMPLE) from /127.0.0.1 for path /myfile.txt at Sun Jul 10 17:55:32 PDT 2016
/myfile.txt 875664 bytes, 1 block(s): OK
0. BP-810817926-127.0.0.1-1468198364624:blk_1073741825_1001 len=875664 repl=1 [127.0.0.1:50010]
```

This returns the list of blocks along with the DataNodes that hold the replicas of each block. It is a one-off solution if you need to get the block locations for a small number of files. There is no publicly available API to query the block locations for a file that I know of. Could you please explain your use case?
07-08-2016
06:51 PM
1 Kudo
Hi @Andrew Watson,

HDFS and S3 are distinct file systems, and today there is no way to use S3 as a storage tier within HDFS. You can use the S3A file system, which is bundled with the Apache Hadoop distributions, to store data in S3. However, your application (or administrator) would have to make a conscious decision to use either HDFS or S3A.

You may find HDFS-9806 interesting. It is a proposal from Microsoft to use alternate file systems like Amazon S3 or Microsoft Azure as storage types within HDFS, and it sounds like it addresses your use case exactly.
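To make that "conscious decision" concrete, here is a minimal sketch showing that an application picks the backing store purely by the URI scheme it opens. The hostname, port, bucket name, and paths are placeholders, and it assumes the hadoop-aws module and S3A credentials are configured on the client:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// The same FileSystem API is used for both stores; only the URI scheme differs.
class SchemeChoice {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // hdfs:// goes to the HDFS NameNode; s3a:// goes directly to S3.
        FileSystem hdfs = FileSystem.get(new URI("hdfs://mynamenode:8020/"), conf);
        FileSystem s3a  = FileSystem.get(new URI("s3a://my-bucket/"), conf);

        System.out.println("In HDFS: " + hdfs.exists(new Path("/data/sample.txt")));
        System.out.println("In S3:   " + s3a.exists(new Path("/data/sample.txt")));
    }
}
```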