Does data get copied in edge node from external sources?

Hadoop clients such as Sqoop and Hive are installed on the edge nodes. A Sqoop job is triggered that captures historical data from an RDBMS residing on a different server that is not mounted. My questions are:

1) The edge node acts as a gateway to external sources for this job. Does the data pass through the edge node on its way to the data nodes? Does this mean an intermediate staging layer is created on the edge node to capture the data from the RDBMS? Would it make any difference if a file were transferred from an external source rather than from an RDBMS (assuming the file is not split at the source)?

2) If the data is stored on the edge node as a staging layer, what happens when the source data is too large to fit on the edge node? I know this is an open-ended question, but a few pointers would be helpful.

1 ACCEPTED SOLUTION

Guru

All good questions, and fortunately the answer is very simple: all data passes through the edge node with no staging or landing there. Even better, the data passes directly to Hadoop, where a MapReduce job (all mappers, no reducers) imports the rows in parallel.
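To make the "all mappers, no reducers" point concrete, here is a simplified Python sketch of how a parallel import can divide a table among mappers by splitting the range of a numeric key column. This is an illustration only, not Sqoop's actual code; the function name `split_ranges`, the table name `orders`, and the column `id` are hypothetical.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive key range [min_id, max_id] into contiguous
    chunks, at most one per mapper (simplified illustration)."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = min_id
    for i in range(num_mappers):
        # The first `extra` mappers take one additional key each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more mappers than keys: skip empty slices
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


if __name__ == "__main__":
    # Each mapper would run its own bounded query against the database.
    # `orders` and `id` are hypothetical names.
    for lo, hi in split_ranges(1, 1000, 4):
        print(f"SELECT * FROM orders WHERE id >= {lo} AND id <= {hi}")
```

Because each slice is independent, every mapper can write its rows to HDFS on its own, which is why no reduce phase is needed.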

Useful refs:

4 REPLIES


@Greg Keys Thanks for the links. I understand that Hadoop clients should be installed on edge nodes, but is that the only use of edge nodes? It was suggested that clients like Sqoop should be on edge nodes because the data transfer rate will be very high, and that running the clients on name nodes or data nodes might affect those nodes' performance. What I'm not understanding is this: if no intermediate staging takes place on the edge node, are any other tasks performed on the edge node when data is transferred from external sources?

Guru

Streaming the data to Hadoop does consume a lot of CPU even though the data is only passing through. Putting the client on the edge node isolates this load and thus prevents CPU contention with jobs running on the cluster. The edge node is typically used to host clients a) to keep users from logging into master or worker nodes, and b) to isolate resource usage, such as the CPU consumed by Sqoop.
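As a generic illustration of data "only passing through" without landing, here is a minimal Python sketch of a pass-through copy that holds one fixed-size chunk at a time. It is not how Sqoop or the HDFS client is implemented; `stream_copy` and the 64 KiB chunk size are assumptions chosen for the example.

```python
import io


def stream_copy(source, sink, chunk_size=64 * 1024):
    """Copy bytes from source to sink one chunk at a time.

    Only `chunk_size` bytes are held at once, so the total size of the
    source never has to fit in local memory or on local disk.
    """
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break  # end of stream
        sink.write(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    # In-memory streams stand in for the external source and the cluster.
    source = io.BytesIO(b"x" * 1_000_000)
    sink = io.BytesIO()
    print(stream_copy(source, sink), "bytes copied")
```

The loop still costs CPU for every chunk it reads and writes, which is the kind of load that is worth isolating from the master and worker nodes.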


@Greg Keys Thanks Greg. I got the information which I was looking for.