Member since: 10-06-2015
Posts: 45
Kudos Received: 54
Solutions: 0
02-18-2016
09:28 PM
2 Kudos
Does a mapper copy a physical block, or does it copy an entire logical file? DistCp map tasks are responsible for copying a list of logical files. This differs from typical MapReduce processing, where each map task consumes an input split, which (usually) maps 1:1 to an individual block of an HDFS file. The reason is that DistCp needs to preserve not only the block data at the destination, but also the metadata that links an inode with a named path to all of those blocks. Therefore, DistCp needs to use APIs that operate at the file level, not the block level.

The overall architecture of DistCp is to generate what it calls a "copy listing", which is a list of files from the source that need to be copied to the destination, and then partition the work of copying the files in that listing among multiple mappers. The Apache documentation for DistCp contains more details on the policies involved in this partitioning: http://hadoop.apache.org/docs/r2.7.2/hadoop-distcp/DistCp.html#InputFormats_and_MapReduce_Components

It is possible that tuning the number of mappers, as described in the earlier answer, could improve throughput. Particularly with a large cluster at the source, I'd expect increasing the number of mappers to increase overall parallelism and leverage the NICs available on multiple nodes for the data transfer. It's difficult to give general advice on this, though; it might take experimentation to tune it for the particular workload involved.
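For example (a minimal sketch; the mapper count and paths here are placeholders, not taken from this thread), the -m option sets the maximum number of simultaneous copy map tasks:
bash$ hadoop distcp -m 40 hdfs://nn1:8020/source/path hdfs://nn2:8020/dest/path
More mappers only help up to the point where the network or the number of files in the copy listing becomes the bottleneck.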
02-12-2016
03:43 PM
1 Kudo
All, thanks for the very helpful answers. The real issue here is that values get changed after the original, correct installation. Then you get caught by surprise later, because an arbitrarily long time can go by before processes are restarted (that's what happens repeatedly here). It would be wonderful if Ambari had an option to periodically re-run the same checks it executes at install time, to catch this kind of thing.
04-20-2016
03:01 AM
1 Kudo
Hi @Peter Coates, assuming you have a moderate number of files, have you tried the option below?
bash$ hadoop distcp2 -f hdfs://nn1:8020/srclist hdfs://nn2:8020/bar/foo
where srclist contains (you can populate this file with a recursive listing):
hdfs://nn1:8020/foo/dir1/a
hdfs://nn1:8020/foo/dir2/b
More info here: https://hadoop.apache.org/docs/r1.2.1/distcp2.html
Please let me know if this works. Thanks.
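As a hedged sketch of the recursive-listing step (the /foo path and nn1 address are just the placeholders from the command above), you could build and upload srclist like this:
bash$ hdfs dfs -ls -R /foo | grep '^-' | awk '{print "hdfs://nn1:8020" $NF}' > srclist
bash$ hdfs dfs -put srclist /srclist
The grep keeps only regular files (lines whose permission string starts with "-"), and the awk prefixes each path with the source NameNode URI; this assumes the paths contain no spaces.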
01-12-2016
06:56 PM
1 Kudo
One last detail: if the time runs out and the blocks go on the replication queue, what happens when the node comes back online and reports in? Are they removed from the queue? What if they've already been replicated?
06-28-2017
12:29 PM
I configured the hdfs-site.xml file on both HA nodes of the cluster. When I run start-all.sh, it starts the second cluster node's services, but after that the NameNode goes down.
05-17-2017
07:39 AM
Is there any way to get some sort of rough estimate of how long a rebalance will take?
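One rough approach (a hedged sketch, not from this thread; the threshold value is just an example) is to watch the balancer's per-iteration output, which reports "Bytes Already Moved" and "Bytes Left To Move":
bash$ hdfs balancer -threshold 10
Dividing the bytes left to move by the observed movement rate over a few iterations gives a very approximate time estimate.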
01-19-2016
03:43 PM
Closing this as a redirect to https://community.hortonworks.com/questions/3034/running-all-services-as-same-user.html?redirectedFrom=5474
12-09-2015
04:47 PM
The documentation seems to suggest that the normal mode of use would be to have one reconstituted replica sitting around, and that reconstituting an encoded block would be done only if that isn't the case. Keeping a full block by default would eliminate most of the space savings, because the data would expand from 1.6 to 2.6 times the raw file size. Why not have a policy that leaves a single full-size copy around for a limited time after a block is used? A "working set", as it were, so that if you've used a block in the last X hours, the decoded block won't be deleted.
11-24-2015
04:06 PM
Notes to add to Ancil's comments:
Teradata Query Grid with Hortonworks - capability is tied to releases, i.e. One-Way (14.10), Bi-Directional (15.00), and Push Down (15.00).
Teradata Unified Data Mover - an intelligent data mover across the Unified Data Architecture (Teradata, Aster, and Hadoop).
Teradata Studio for desktop - offers Data Source Explorer, Object Viewer, and transfer.
11-24-2015
03:58 PM
Also note that Teradata has released its Teradata Connector for Hadoop (TDCH) with HDP 2.3 through the joint efforts of both engineering teams.