Member since: 10-06-2015
Posts: 45
Kudos Received: 54
Solutions: 0
02-18-2016
09:28 PM
2 Kudos
Does a mapper copy a physical block, or does it copy an entire logical file? DistCp map tasks are responsible for copying a list of logical files. This differs from typical MapReduce processing, where each map task consumes an input split, which (usually) maps 1:1 to an individual block of an HDFS file. The reason is that DistCp needs to preserve not only the block data at the destination, but also the metadata that links an inode with a named path to all of those blocks. Therefore, DistCp needs to use APIs that operate at the file level, not the block level.

The overall architecture of DistCp is to generate what it calls a "copy listing", which is a list of files from the source that need to be copied to the destination, and then partition the work of copying the files in that listing among multiple mappers. The Apache documentation for DistCp contains more details on the policies involved in this partitioning: http://hadoop.apache.org/docs/r2.7.2/hadoop-distcp/DistCp.html#InputFormats_and_MapReduce_Components

It is possible that tuning the number of mappers, as described in the earlier answer, could improve throughput. Particularly with a large cluster at the source, I'd expect increasing the number of mappers to increase overall parallelism and leverage the NICs available on multiple nodes for the data transfer. It's difficult to give general advice on this, though; it might take experimentation to tune it for the particular workload involved.
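For example (a minimal sketch; the mapper count and paths here are placeholders, not taken from this thread), the -m option sets the maximum number of simultaneous copy map tasks:
bash$ hadoop distcp -m 40 hdfs://nn1:8020/source/path hdfs://nn2:8020/dest/path
More mappers only help up to the point where the network or the number of files in the copy listing becomes the bottleneck.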
02-12-2016
03:43 PM
1 Kudo
All, thanks for the very helpful answers. The real issue here is that values get changed after the original, correct installation. Then you get caught by surprise later, because an arbitrarily long time can go by before processes are restarted (that's what happens repeatedly here). It would be wonderful if Ambari had an option to periodically re-run the same checks it executes at install time, to catch this kind of thing.
04-20-2016
03:01 AM
1 Kudo
Hi @Peter Coates, assuming you have a moderate number of files, have you tried the option below?
bash$ hadoop distcp2 -f hdfs://nn1:8020/srclist hdfs://nn2:8020/bar/foo
where srclist contains (you can populate this file with a recursive listing):
hdfs://nn1:8020/foo/dir1/a
hdfs://nn1:8020/foo/dir2/b
More info here: https://hadoop.apache.org/docs/r1.2.1/distcp2.html
Please let me know if this works. Thanks.
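As a hedged sketch of the recursive-listing step (the /foo path and nn1 address are just the placeholders from the command above), you could build and upload srclist like this:
bash$ hdfs dfs -ls -R /foo | grep '^-' | awk '{print "hdfs://nn1:8020" $NF}' > srclist
bash$ hdfs dfs -put srclist /srclist
The grep keeps only regular files (lines whose permission string starts with "-"), and the awk prefixes each path with the source NameNode URI; this assumes the paths contain no spaces.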
01-12-2016
06:56 PM
1 Kudo
One last detail: if the time runs out and the blocks go on the replication queue, what happens when the node comes back online and reports in? Are they removed from the queue? What if they've already been replicated?
06-28-2017
12:29 PM
I configured the hdfs-site.xml file on both HA nodes of the cluster. When I run start-all.sh, it starts the second cluster node's services, but after that the NameNode goes down.
05-17-2017
07:39 AM
Is there any way to get some sort of rough estimate of how long a rebalance will take?
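One rough approach (a hedged sketch, not from this thread; the threshold value is just an example) is to watch the balancer's per-iteration output, which reports "Bytes Already Moved" and "Bytes Left To Move":
bash$ hdfs balancer -threshold 10
Dividing the bytes left to move by the observed movement rate over a few iterations gives a very approximate time estimate.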
01-19-2016
03:43 PM
Closing this as a redirect to https://community.hortonworks.com/questions/3034/running-all-services-as-same-user.html?redirectedFrom=5474
12-09-2015
04:47 PM
The documentation seems to suggest that the normal mode of use would be to have one reconstituted replica sitting around, and that reconstituting an encoded block would be done only if that isn't the case. Keeping a full block by default would eliminate most of the space savings, because the data would expand from 1.6 to 2.6 times the raw file size. Why not have a policy that leaves a single full-size copy around for a limited time after a block is used? A "working set", as it were, so that if you've used a block in the last X hours, the decoded block won't be deleted.
11-24-2015
04:06 PM
Notes to add to Ancil's comments:
Teradata Query Grid with Hortonworks - capability is tied to releases, i.e. One-Way (14.10), Bi-Directional (15.00), and Push Down (15.00).
Teradata Unified Data Mover - an intelligent data mover across the Unified Data Architecture (Teradata, Aster, and Hadoop).
Teradata Studio for desktop - offers Data Source Explorer, Object Viewer, and transfer.
11-24-2015
03:58 PM
Also note that Teradata has released its Teradata Connector for Hadoop (TDCH) with HDP 2.3 through the joint efforts of both engineering teams.