Member since: 05-03-2016
24 Posts
33 Kudos Received
3 Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 2501 | 05-20-2016 09:11 AM |
 | 9890 | 05-18-2016 04:00 PM |
 | 4756 | 05-03-2016 04:43 PM |
09-21-2016
09:44 AM
I would answer this question by asking what you are trying to achieve. Sharding (as I understand it) is used in traditional databases to do some of the distributed work that Hadoop does, but in a different way. The column-splitting side of partitioning is closer to column-oriented storage (see ORC files in Hive), while the part of sharding that distributes a table's rows across servers to spread the load is what HDFS does natively. If you are trying to do in Hadoop exactly what you do in a relational database, I would advise taking a deeper look at the way Hadoop works. It is also possible that I have misunderstood your question and what you are trying to achieve.
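As a rough illustration of the column-oriented point, here is a minimal sketch of creating an ORC-backed table through beeline; the JDBC URL, table name and columns are assumptions for the example, not taken from your question:

    # Hypothetical example: create a columnar (ORC) table in Hive via beeline
    beeline -u jdbc:hive2://localhost:10000 -e "
      CREATE TABLE orders_orc (
        order_id  INT,
        customer  STRING,
        total     DECIMAL(10,2)
      )
      STORED AS ORC;"

HDFS then splits and replicates the table's underlying files across the cluster's DataNodes automatically, which is the part of sharding you do not have to build yourself.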
05-20-2016
09:11 AM
2 Kudos
Have you considered using the downloadable VM? VirtualBox is free, and there is a free VMware Player for Windows. The only caveat is that your host machine needs at least 10 GB of RAM (ideally 16 GB) and about the same amount of free disk space.
05-19-2016
10:11 AM
5 Kudos
The documentation (https://wiki.apache.org/hadoop/GettingStartedWithHadoop) implies that the data is gone, which is what most people would expect by comparison with a filesystem such as ext4 or NTFS. However, this is not the case with HDFS. In HDFS, data is stored by each DataNode as blocks on the underlying (for example, ext4) filesystem. The DataNode only knows about blocks; it knows nothing about HDFS files or HDFS directories. This page https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html explains the architecture, especially the section on "Data Replication". If you need to empty the entire filesystem, you should first delete the directories with an HDFS command such as "hdfs dfs -rm -r -skipTrash" before running "hdfs namenode -format". Alternatively, use Christian's suggestion above and overwrite the files in each DataNode's data directories, but that may be a lot of work if you have a large cluster.
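A minimal sketch of that order of operations; the paths are hypothetical examples, so list your own directories explicitly:

    # Example only: delete explicit HDFS paths, bypassing the trash (destructive!)
    hdfs dfs -rm -r -skipTrash /user /tmp /apps
    # Then, with the NameNode stopped, reformat the NameNode metadata
    hdfs namenode -format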
05-18-2016
04:00 PM
2 Kudos
If you can use a downloaded virtual machine, try the "Hortonworks Sandbox" at http://hortonworks.com/downloads/#sandbox (as Lester mentioned above). This is a pre-installed single-node Hadoop cluster inside a virtual machine. You can get the Sandbox for VMware (commercial, but there is a free player) or for VirtualBox (which is free). You should be able to run it on any Windows / Mac / Linux machine as long as you have enough disk space and RAM. Similar downloads exist for the other major Hadoop distributions. There are also links on the same page for accessing an online sandbox via Microsoft Azure, which might be your "online for free" option. It is supposed to be free for a month (I haven't tried it); I assume a subscription fee applies beyond that time. The tutorials are here: http://hortonworks.com/tutorials/ and I suggest you start with those under the heading "Hello World".
05-06-2016
02:51 PM
There are a number of things that can cause HDFS imbalance, and this post explains some of those causes in more detail. The balancer should be run regularly on a production system; you can start it from the command line, so you can schedule it with cron, for example. The balancer can take a while to complete if there are many blocks to move. Note that when HDFS moves a block, the old replica is marked for deletion rather than deleted immediately; HDFS cleans up these unused blocks over time.
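A minimal sketch of running and scheduling it; the threshold, binary path, log file and schedule below are assumptions to tune for your own cluster:

    # Rebalance until each DataNode is within 10% of the average cluster utilisation
    hdfs balancer -threshold 10

    # Example crontab entry: run the balancer every Sunday at 02:00
    # (binary and log paths are examples, adjust for your installation)
    0 2 * * 0  /usr/bin/hdfs balancer -threshold 10 >> /var/log/hdfs-balancer.log 2>&1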
05-06-2016
02:38 PM
Emil and Benjamin have covered the question thoroughly. I would add one general point. When you import data directly into a Hive table, you must define a schema before loading; that is unlikely to be a problem if the data originated in a DBMS. Importing into HDFS first, however, lets you load the data without defining a schema at all: you just load it, and you can apply a schema later if you wish. In that sense, loading into HDFS first gives you greater flexibility. In your case the schema is stored alongside the data in Avro anyway, so the point may be academic.
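A minimal sketch of the HDFS-first approach; the directory and filename are hypothetical:

    # Land the raw Avro files in HDFS without declaring any schema up front
    hdfs dfs -mkdir -p /data/landing/orders
    hdfs dfs -put orders.avro /data/landing/orders/
    # A Hive external table (schema-on-read) can be layered over this
    # directory later, when and if a schema is actually needed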
05-04-2016
01:25 PM
Thanks @Pardeep. This looks like it will help.
05-04-2016
12:37 PM
1 Kudo
The Ambari Enhanced Configs feature (called "Guided Configuration" in the Hortonworks Admin-1 class) is really useful: cluster admin staff don't need to keep referring to the docs to work out the maximum, minimum and recommended values for critical parameters in the Hadoop cluster, for example the HDFS NameNode Java heap size or the YARN minimum and maximum container memory sizes. I would really like to find out how the Enhanced Config values, especially the defaults and recommendations, are calculated. Does Ambari calculate these values dynamically, or does it rely on some behind-the-scenes script to prepopulate a file somewhere?
This page https://cwiki.apache.org/confluence/display/AMBARI/Enhanced+Configs (see step 2) gives some clues, but it appears to be mainly about how to create your own Enhanced Configs. Is there documentation somewhere that explains the Enhanced Configs as implemented in HDP? Where can I drill into the documentation or source code to see the maths behind each value in Enhanced Configs?
Labels:
- Apache Ambari
05-04-2016
11:56 AM
I agree with @Jitendra Yadav. Michael Noll's blog posts are excellent reading, especially on Kafka.
05-03-2016
05:07 PM
5 Kudos
To identify "corrupt" or "missing" blocks, you can use the command-line command 'hdfs fsck /path/to/file'; other tools also exist. HDFS will attempt to recover the situation automatically. By default there are three replicas of every block in the cluster, so if HDFS detects that one replica of a block has become corrupt or damaged, it will create a new replica from a known-good replica and mark the damaged one for deletion. The known-good state is determined by checksums that each DataNode records alongside the block. The chance of two replicas of the same block becoming damaged is very small indeed, and HDFS can, and does, recover from that situation because it has a third replica, with its checksum, from which further replicas can be created. The chance of all three replicas of the same block becoming damaged is so remote that it would suggest a significant failure somewhere else in the cluster. If that does happen, 'hdfs fsck' will report the block as "corrupt", meaning HDFS cannot self-heal it from any of its replicas. Rebuilding the data behind a corrupt block is a lengthy process, like any data recovery, and if the situation arises you should also undertake a deep investigation of the health of the cluster as a whole.
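A minimal sketch of the fsck commands involved; the file path is hypothetical:

    # Full report for one file: its blocks, their locations and replication state
    hdfs fsck /path/to/file -files -blocks -locations

    # Cluster-wide list of files whose blocks HDFS can no longer repair
    hdfs fsck / -list-corruptFileBlocks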