Member since: 05-03-2016
Posts: 24
Kudos Received: 32
Solutions: 3
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
|  | 664 | 05-20-2016 09:11 AM |
|  | 3690 | 05-18-2016 04:00 PM |
|  | 1142 | 05-03-2016 04:43 PM |
09-21-2016
09:44 AM
I would answer this question by asking what you are trying to achieve. Sharding (as I understand it) is used in traditional databases to do some of the distributed work that Hadoop does, but in a different way. The "horizontal partitioning" in sharding - splitting a table's rows across servers to spread the load - is essentially what HDFS does natively when it distributes blocks across DataNodes. The closest analogue to vertical partitioning is column-oriented storage; see ORC files in Hive. If you are trying to do in Hadoop exactly what you do in a relational database, then I would advise that you take a deeper look at the way that Hadoop works. It is also possible that I've misunderstood your question, and what you are trying to achieve.
09-16-2016
09:56 AM
The short and blunt answer is that it's not possible. The more helpful, longer answer is that there are a couple of workarounds, briefly summarised as NFS and copyToLocal:
NFS: it is possible to mount HDFS using the NFS Gateway. NFS is not designed for the massive file sizes that HDFS can store, so you need to be careful, but it might be acceptable for this particular problem. The Hortonworks NFS documentation is here: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_hdfs_nfs_gateway/content/hdfs-nfs-gateway-user-guide.html
copyToLocal: use 'hdfs dfs -get' or 'hdfs dfs -copyToLocal' to make a copy of the file from HDFS on your local file system, then execute it locally. See http://stackoverflow.com/questions/20333135/how-to-execute-hadoop-jar-from-hdfs-filesystem for some tricks.
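As an illustration of the copyToLocal approach, here is a minimal sketch; the paths and jar name are made up for this example:

```bash
# Copy the jar out of HDFS onto the local filesystem (paths are hypothetical)
hdfs dfs -get /apps/myjob/myjob.jar /tmp/myjob.jar

# Run it locally, e.g. as a normal MapReduce job submission
hadoop jar /tmp/myjob.jar com.example.MyJob /input /output
```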
09-16-2016
09:45 AM
I can't answer your question exactly, but I have heard rumours (only rumours, no details) that other people have struggled to get FusionIO working too. The "workaround" was to go back to SSDs or HDDs. Another possible workaround, if you need the sort of speed that FusionIO promises, might be to use RAM disks instead. I realise that RAM disks are volatile and might not give you what you need. Please see https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/MemoryStorage.html
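Following that Memory Storage page, a rough sketch of the RAM-disk approach looks like this; the mount point, size and HDFS path are assumptions for illustration only:

```bash
# On each DataNode, mount a tmpfs RAM disk (size is an arbitrary example)
sudo mount -t tmpfs -o size=32g tmpfs /mnt/dn-tmpfs

# In hdfs-site.xml, tag that mount as RAM_DISK in dfs.datanode.data.dir, e.g.:
#   [RAM_DISK]/mnt/dn-tmpfs,[DISK]/hadoop/hdfs/data

# Then mark a path so writes to it are lazily persisted from memory to disk
hdfs storagepolicies -setStoragePolicy -path /tmp/scratch -policy LAZY_PERSIST
hdfs storagepolicies -getStoragePolicy -path /tmp/scratch
```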
06-02-2016
10:08 AM
1 Kudo
Without the rest of the output of the job, it is difficult to tell what the problem is. However, going through the output carefully line by line, it is usually possible to see what the actual problem is. I have seen this error before, and sometimes it is a secondary issue - the true cause might be something else earlier in the output. If you have the entire output, please share it.
05-20-2016
09:11 AM
2 Kudos
Have you considered using the downloadable VM? VirtualBox is free, and there is a free VMware Player for Windows. The only caveat is that your host machine needs at least 10 GB of RAM - ideally 16 GB - and about the same in free disk space.
05-20-2016
09:02 AM
1 Kudo
The latest documentation on DistCp is here https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html
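As a rough illustration (the NameNode addresses and paths below are placeholders, not from the original question), a basic inter-cluster copy looks like this:

```bash
# Copy a directory from one cluster to another (hosts and paths are made up)
hadoop distcp hdfs://nn1.example.com:8020/apps/data hdfs://nn2.example.com:8020/apps/data

# -update only copies files that are missing or changed on the target
hadoop distcp -update hdfs://nn1.example.com:8020/apps/data hdfs://nn2.example.com:8020/apps/data
```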
05-19-2016
02:29 PM
1 Kudo
Well spotted @Pradeep. This is easy to miss. Look carefully at the bottom of the screenshot and also near the bottom of the logfile. The important line is: Execution of 'hadoop --config /usr/hdp/current/hadoop-client/conf dfsadmin -fs hdfs://bigdata -safemode get | grep OFF' returned 1. There is some further text after that about the 'hadoop' command being deprecated, which may have caused you to visually skip over that all-important "returned 1".
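For reference, a quick sketch of checking and clearing safe mode by hand, using the non-deprecated 'hdfs' form of the command:

```bash
# Check whether the NameNode is currently in safe mode
hdfs dfsadmin -safemode get

# If it is stuck in safe mode and you are sure the cluster is healthy, leave it manually
hdfs dfsadmin -safemode leave
```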
05-19-2016
10:11 AM
5 Kudos
The documentation (https://wiki.apache.org/hadoop/GettingStartedWithHadoop) implies that the data is gone, which is what most people would expect, comparing with file systems such as ext4 or NTFS. However, this is not the case with HDFS. In HDFS, data is stored by each DataNode as blocks on the real (ext4) filesystem. The DataNode only knows about blocks, and knows nothing about HDFS files or HDFS directories. This page https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html explains the architecture, especially the section on "Data Replication". If you need to delete the entire filesystem, you should first delete all the directories using an HDFS command such as 'hdfs dfs -rm -r -skipTrash' before doing the 'hdfs namenode -format'. Or use Christian's suggestion above and repeatedly overwrite the files in each DataNode's data directories - but that may be a lot of work if you have a large cluster.
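A minimal sketch of that clean-up sequence; the wildcard path is illustrative, and both commands are destructive, so only run them if you genuinely intend to wipe the filesystem:

```bash
# Remove everything in HDFS, bypassing the trash (destructive!)
hdfs dfs -rm -r -skipTrash '/*'

# Stop HDFS, then reformat the NameNode metadata (also destructive)
hdfs namenode -format
```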
05-18-2016
04:00 PM
2 Kudos
If you can use a downloaded virtual machine, try out the "Hortonworks Sandbox" at http://hortonworks.com/downloads/#sandbox (as Lester mentioned above). This is a pre-installed single-node Hadoop cluster inside a virtual machine. You can get the sandbox for VMware (commercial, but there is a free player) or for VirtualBox (which is free). You should be able to run this on any Windows / Mac / Linux machine so long as you have enough disk space and RAM. Similar downloads also exist for the other major distributors of Hadoop. There are also links on the same page for accessing an online sandbox via Microsoft Azure. This might be your "online for free" option. It is supposed to be free for a month (I haven't tried it); I assume a subscription fee must be paid beyond that time. The tutorials are here: http://hortonworks.com/tutorials/ and I suggest you start with the tutorials under the heading "Hello World".
05-16-2016
04:15 PM
2 Kudos
@Mahipal Jupali - Is there any possibility of getting this information via the Ambari API? Sorry I can't give a full answer, but maybe this would be a suitable starting point: https://cwiki.apache.org/confluence/display/AMBARI/Modify+configurations
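As a rough, hedged sketch of what reading configuration through the Ambari REST API looks like; the hostname, credentials and cluster name are placeholders:

```bash
# Find the current tag for each config type (e.g. core-site)
curl -u admin:admin \
  "http://ambari.example.com:8080/api/v1/clusters/MyCluster?fields=Clusters/desired_configs"

# Fetch the properties for a given type and tag (TAG comes from the previous call)
curl -u admin:admin \
  "http://ambari.example.com:8080/api/v1/clusters/MyCluster/configurations?type=core-site&tag=TAG"
```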
05-16-2016
03:52 PM
In addition to what Pradeep said, the way to add new "master nodes" is to add them as new hosts (just as you would worker nodes), and then assign or move master services onto those nodes. Have a look at this https://community.hortonworks.com/answers/31935/view.html too.
05-16-2016
03:46 PM
1 Kudo
Have a look in Ambari >> Services >> HDFS >> Configs >> Advanced. In the "Filter" box at the top right, type the text "proxy". In custom core-site, you will probably find the following settings:
hadoop.proxyuser.root.groups=*
hadoop.proxyuser.root.hosts=*
These essentially allow the Linux "root" user, which the Ambari service runs as, to impersonate any user from any host. See this for more information. In a production system, you would probably want to use hadoop.proxyuser.ambariusr.groups=* instead, where ambariusr is not root and has limited privileges. More information on Hadoop Proxy Users can be found here (Apache Hadoop site). Hope this helps.
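If you want to confirm what the cluster is actually using, here is a quick hedged check from any client node; swap in your own proxy user for root:

```bash
# Print the effective proxyuser settings as seen by the Hadoop client configuration
hdfs getconf -confKey hadoop.proxyuser.root.groups
hdfs getconf -confKey hadoop.proxyuser.root.hosts
```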
05-06-2016
03:28 PM
1 Kudo
@srinathji kyadari: In a Hadoop cluster, Spark processing runs on top of YARN as a set of distributed processes. So at the very simplest level, Spark will run wherever NodeManager (the slave component of YARN) is installed. Hive queries get compiled into MapReduce (or Tez, or possibly Spark) jobs by HiveServer2; these jobs also run on top of YARN, so again at a very simple level, Hive queries will run wherever NodeManager is installed. Hive also uses master services, which you can move from one master node to another if you need to. These master services are HiveServer2 (which looks after client connections and query compilation), the Hive Metastore (which stores the metadata) and the WebHCat Server (the REST interface to HCatalog). This is a very high-level view. Does any of this help?
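To make "runs wherever NodeManager is installed" concrete, here is a hedged sketch of submitting work to YARN; the jar, class name, hostnames and table are placeholders:

```bash
# Submit a Spark application to YARN; the driver and executors run on NodeManager hosts
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.MyApp /tmp/myapp.jar

# Connect to HiveServer2, which compiles the query into a job that also runs on YARN
beeline -u "jdbc:hive2://hiveserver.example.com:10000" -e "SELECT COUNT(*) FROM some_table;"
```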
05-06-2016
02:51 PM
There are a number of things that can cause HDFS imbalance. This post explains some of those causes in more detail. The balancer should be run regularly on a production system (you can kick it off from the command line, so you can schedule it using cron, for example). The balancer can take a while to complete if there are a lot of blocks to move. Note that when HDFS moves a block, the old block gets "marked for deletion" but does not get deleted immediately; HDFS cleans up these unused blocks over time.
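A hedged sketch of running and scheduling the balancer; the threshold and cron schedule are arbitrary examples:

```bash
# Rebalance until no DataNode deviates more than 10% from the average cluster utilisation
hdfs balancer -threshold 10

# Example cron entry (run as the hdfs user) to kick the balancer off nightly at 01:00
# 0 1 * * * /usr/bin/hdfs balancer -threshold 10 >> /var/log/hadoop/balancer-cron.log 2>&1
```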
05-06-2016
02:38 PM
Emil and Benjamin have covered the question thoroughly. I would add the following general point. When you import data directly into a Hive table, you must define a schema before loading. This is not likely to be a problem if the data originated in a DBMS. However, importing the data into HDFS first allows you to land it without defining any schema, and you can apply a schema later on if you wish. In this way, loading into HDFS first gives you greater flexibility. In your case the schema is stored alongside the data in Avro anyway, so my point might be academic.
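A hedged sketch of that "HDFS first, schema later" pattern; the paths, columns and table name are made up for illustration:

```bash
# Land the raw Avro files in HDFS first - no schema needs to be declared at this point
hdfs dfs -mkdir -p /landing/orders
hdfs dfs -put orders.avro /landing/orders/

# Later, apply a schema on read by pointing an external Hive table at that directory
hive -e "CREATE EXTERNAL TABLE orders (id BIGINT, amount DOUBLE)
         STORED AS AVRO
         LOCATION '/landing/orders';"
```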
05-06-2016
02:16 PM
4 Kudos
The Ambari interface provides a wizard that makes it easy to add nodes. The "Add hosts" wizard in Ambari assumes that these nodes will be workers (slaves) - so it gives the options to install DataNode, NodeManager and client services. Start the wizard from the Hosts tab, then click "actions" and "Add New Hosts" - as shown. After adding nodes, you can then move master services from one node to another. For example, go to Services >> HDFS, and from the service actions button top right (see screenshot) you will see an option to move the NameNode service. You can move services around as you wish - just be aware that doing so will force a small amount of downtime on your cluster. Ambari will tell you which services need restarting afterwards. Hope that helps.
05-04-2016
01:25 PM
Thanks @Pardeep. This looks like it will help.
05-04-2016
12:55 PM
Translation: after following this link http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_installing_manually_book/content/ch_installing_hue_chapter.html to install Hue, I browse to my-ip-address:8000 and the attached picture appears.
05-04-2016
12:37 PM
1 Kudo
The Ambari Enhanced Configs (called "Guided Configuration" in the Hortonworks Admin-1 Class) feature is really useful: cluster admin staff don't need to keep referring to the docs to work out what the maxima, minima and recommended values are for critical parameters in the Hadoop cluster, for example HDFS NameNode Java heap size, YARN minimum and maximum container memory size, etc. I would really like to find out how the Enhanced Config values - especially the default and recommended values - are calculated. Does Ambari dynamically calculate these values? Or is Ambari relying on some behind-the-scenes script to prepopulate some file somewhere?
This page https://cwiki.apache.org/confluence/display/AMBARI/Enhanced+Configs (see step 2) gives some clues - but appears to be mainly about how to create your own Enhanced Config. Is there some documentation somewhere that explains the Enhanced Configs as implemented in HDP? Where can I drill into the documentation or source-code to determine the maths behind each value in Enhanced Config?
05-04-2016
11:56 AM
I agree with @Jitendra Yadav. Michael Noll's blog posts are excellent reading, especially on Kafka.
05-04-2016
08:48 AM
Thanks @nmaillard. I'm not a Java expert, and I'm having a bit of trouble interpreting the source code. The "distance" seems to be calculated using the 'clusterMap' object, which is of type NetworkTopology. From your link, see line 874: int shortestDistance = clusterMap.getDistance(writer, shortestStorage.getDatanodeDescriptor()); but I can't find the source for NetworkTopology. Do you know where I could find the source code for NetworkTopology, so I can see how getDistance is calculated? I'm interested to know whether this "distance" is based only on racks, or whether a more involved calculation is performed - for example, some "distance" from one node to another within the same rack. Many thanks for your help.
05-03-2016
05:07 PM
4 Kudos
To identify "corrupt" or "missing" blocks, the command 'hdfs fsck /path/to/file' can be used from the command line. Other tools also exist.
HDFS will attempt to recover the situation automatically. By default there are three replicas of any block in the cluster, so if HDFS detects that one replica of a block has become corrupt or damaged, HDFS will create a new replica of that block from a known-good replica and will mark the damaged one for deletion. The known-good state is determined by checksums, which are recorded alongside each block by the DataNode.
The chances of two replicas of the same block becoming damaged are very small indeed. HDFS can - and does - recover from this situation because it has a third replica, with its checksum, from which further replicas can be created.
The chances of all three replicas of the same block becoming damaged are so remote that it would suggest a significant failure somewhere else in the cluster. If this situation does occur, and all three replicas are damaged, then 'hdfs fsck' will report that block as "corrupt" - i.e. HDFS cannot self-heal the block from any of its replicas. Rebuilding the data behind a corrupt block is a lengthy process (like any data recovery process), and if it should arise, a deep investigation of the health of the cluster as a whole should also be undertaken.
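For reference, a hedged sketch of the fsck commands involved; the paths are placeholders:

```bash
# Summarise the health of a path, including missing, under-replicated and corrupt blocks
hdfs fsck /path/to/file

# List files that currently have corrupt blocks anywhere in the namespace
hdfs fsck / -list-corruptfileblocks

# Show which DataNodes hold each block of a particular file
hdfs fsck /path/to/file -files -blocks -locations
```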
05-03-2016
04:49 PM
2 Kudos
When HDFS is writing a file, the block is placed "closest" to the client. How is this determined? Is it merely based on whether the client is co-located on a DataNode, or does the NameNode do some clever stuff with the network to determine latencies?
- Tags:
- Hadoop Core
- HDFS
05-03-2016
04:43 PM
5 Kudos
The block placement in HDFS depends on a few things:
- If the Client application is on a DataNode machine (e.g. a Pig script running on a node in the cluster), then HDFS will attempt to write all the first-replica blocks to that DataNode, because it is the "closest". Some blocks may get written to other DataNodes, for example if the first DataNode is full. Second-replica and third-replica (etc.) blocks get written randomly to multiple DataNodes according to the rack-aware block-placement policy.
- If the Client is NOT on a DataNode machine, then all the first-replica blocks get written randomly to a DataNode in the same rack. Second-replica etc. blocks get written to random DataNodes as above.
- If the Client is WebHDFS, then all the first-replica blocks get written to one DataNode (this is a limitation of the way WebHDFS works: the NameNode will only give the WebHDFS client one DataNode to write to). This can be a problem when writing files larger than a single disk. Second-replica etc. blocks get written to random DataNodes as above.
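If you want to see the rack layout the NameNode is using when it makes these decisions, here is a quick hedged check; the file path is a placeholder and the output format varies between versions:

```bash
# Print the racks and the DataNodes registered in each one
hdfs dfsadmin -printTopology

# Show which racks the replicas of a particular file ended up on
hdfs fsck /path/to/file -files -blocks -racks
```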