Created 05-23-2016 10:20 AM
Hi,
Is it good practice to stream data directly from the source systems into HDFS using the Knox-exposed WebHDFS APIs, or is it better to use the Knox edge node as a staging area before ingesting into HDFS?
Thanks
Created 05-23-2016 10:30 AM
@Greenhorn Techie If your source system has access to the Knox-exposed WebHDFS, then that would be the better way, as you avoid the extra data hop on the edge node. This method should take less time to put data on HDFS than routing it via the edge node.
Also, accessing the Knox-exposed WebHDFS directly lets you avoid SSH access to the Knox edge node. So the first option looks both more secure and faster.
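For illustration, the direct path with curl looks roughly like this (a minimal sketch; the gateway host, the `default` topology name, the credentials, and the HDFS path are placeholders for your own setup):

```
# Step 1: request a write handle. WebHDFS answers with a 307 redirect;
# behind Knox the Location header is rewritten to point back through
# the gateway, so datanode addresses stay hidden from the client.
curl -i -k -u myuser -X PUT \
  "https://knox.example.com:8443/gateway/default/webhdfs/v1/data/incoming/events.csv?op=CREATE"

# Step 2: upload the file body to the Location URL returned above
curl -i -k -u myuser -X PUT -T events.csv "<Location-from-step-1>"
```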
Hope this helps you.
Created 05-23-2016 10:39 AM
@Pradeep Bhadani I have seen solutions where the staging option was taken, so I'm just wondering what advantages the staging approach brings compared to streaming directly through the Knox WebHDFS API?
Created 05-23-2016 10:44 AM
The best practice is to use a proper edge node and the Hadoop API. It will be significantly faster than WebHDFS (roughly 2x the performance), and Knox adds some additional performance impact on top of that.
However, if you have to use Knox because of your firewall/security settings, then you have to do it. In that case I don't understand the second option. What do you mean by using Knox as an edge node?
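As a point of comparison, the edge-node route is a single command once the Hadoop client is configured (paths below are placeholders):

```
# Runs on the edge node; data goes over the native HDFS RPC protocol,
# writing in parallel to the datanodes instead of being proxied as HTTP
hdfs dfs -put /staging/events.csv /data/incoming/events.csv
```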
Created 05-23-2016 10:57 AM
@Benjamin Leonhardi Let me rephrase my question. Assume I have an HDP cluster and an edge node outside the cluster. On the edge node, I have installed the Knox service. My question is which is the better way of ingesting data into HDFS:
1. Should I use the edge node as a staging area, ingesting the data first onto the edge node (which means storage is needed there) and then ingesting into HDFS? This would help keep the data nodes from being exposed to the outside world.
2. Alternatively, I can configure the Knox service on the edge node so that WebHDFS API calls go through Knox, and hence the NameNode URL / IP address is not exposed beyond Knox. In this case, the source streams directly into HDFS, with Knox doing the address translation, and no additional storage is needed on the edge node for staging the data temporarily.
Created 05-23-2016 11:50 AM
I think you have different options. You need to run the client that drives the web API somewhere (Python with curl, some little Java program, whatever). You could run it on the source servers, but then all of them need to run the logic; or you can run it on the edge node, but then you potentially have a bottleneck (although I have normally not seen that unless your data volumes are huge, assuming the data is properly compressed).
In the end there is no Hadoop "best practice"; it depends on your setup. I like to do all my data gathering and the push into Hadoop on a single node. It makes it easier to manage that piece of code, and often there is some kind of logic that needs to aggregate data from different source servers. In that case it would be an edge node with storage. However, if you push directly from the source systems, you save one hop. So it's a trade-off.
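To make the staging pattern concrete, a minimal sketch (hostnames, users, and paths are made up):

```
# 1. Gather files from each source system onto the edge node
scp app01:/var/log/app/events-*.log /staging/incoming/
scp app02:/var/log/app/events-*.log /staging/incoming/

# 2. Aggregate and compress before the push (optional but typical)
cat /staging/incoming/events-*.log | gzip > /staging/events.log.gz

# 3. Single hop into HDFS via the Hadoop API
hdfs dfs -put /staging/events.log.gz /data/incoming/
```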
Another alternative is something like sshfs. This allows you to have an edge node without the additional storage or the extra hop (i.e., you mount all relevant source systems as NFS or sshfs filesystems).
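A minimal sshfs sketch, assuming the edge node can SSH to the source systems (hostnames and paths are placeholders):

```
# Mount a source system's log directory read-only on the edge node
sshfs sourceuser@app01:/var/log/app /mnt/app01 -o ro

# Push straight from the mount into HDFS: no local copy, no extra hop
hdfs dfs -put /mnt/app01/events-*.log /data/incoming/

# Unmount when finished
fusermount -u /mnt/app01
```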
Created 05-23-2016 04:00 PM
@Benjamin Leonhardi Thanks for your response.
Created 05-23-2016 10:50 AM
@Greenhorn Techie As mentioned in the comment above, an edge node is beneficial when you access HDFS via the Hadoop API. When you use Knox, putting data on the edge node first would be an extra hop, meaning it will increase the overall time taken to ingest data into HDFS.