Member since: 10-06-2015
Posts: 45
Kudos Received: 54
Solutions: 0
02-18-2016
06:52 PM
1 Kudo
Thanks for getting back to me. Yes--I'm aware of the -m option, but it appears from the documentation that the mappers get a list of HDFS-level files and work on those. I'm trying to confirm whether my understanding is accurate: unlike a typical MapReduce job that deals in individual blocks or splits, each distcp map gets the URI of an entire file (or files) to copy. So you might have hundreds of blocks, but if they all belong to one file, the same mapper handles all of them. Is that the case?
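To make the question concrete, here's the kind of invocation I mean (cluster names and paths are invented):

  hadoop distcp -m 50 hdfs://nn1:8020/data/big hdfs://nn2:8020/data/big
  # -m caps the number of map tasks, but my understanding is that DistCp splits work
  # by file, not by block, so one very large file is still copied end to end by a
  # single mapper. The dynamic strategy rebalances the file list across mappers at
  # run time, which helps with skewed file sizes, but the file stays the unit of work:
  hadoop distcp -strategy dynamic -m 50 hdfs://nn1:8020/data/big hdfs://nn2:8020/data/big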
02-18-2016
05:35 PM
2 Kudos
We see very few mappers created for distcp copies. Are these mappers allocated at the block level or at the file level? I.e., does a mapper copy a physical block, or does it copy an entire logical file?
Labels: Apache Hadoop
02-12-2016
03:43 PM
1 Kudo
All---thanks for the very helpful answers. The real issue here is that values get changed after the original, correct installation. Then you get nailed by surprise later, because an arbitrarily long time can go by before processes are restarted (that's what happens repeatedly here). It would be wonderful if Ambari had an option to periodically re-run the same checks it performs at install time to catch this kind of thing.
02-05-2016
09:34 PM
2 Kudos
Aha. The problem turns out to be with the multiple directories named in the source-list file. You can have many sources, but only one target. The behavior I was looking for is for distcp to create a separate tree for each input directory under the target. That doesn't seem to be how distcp works, but it's easy to script around it, along the lines sketched below.
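Something along these lines does the trick as the wrapper (file and host names are invented):

  # sources.txt lists one HDFS directory per line, e.g. /apps/data/tbl1
  while read SRC; do
    hadoop distcp -update "hdfs://src-nn:8020${SRC}" "hdfs://dst-nn:8020/backup${SRC}"
  done < sources.txt
  # Each source directory gets its own subtree under /backup instead of everything
  # landing directly under a single target directory.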
02-03-2016
07:41 PM
I must have been unclear. We definitely want to use distcp and cannot use Falcon for administrative reasons. The problem is that I can't get fully recursive behavior with distcp. There's probably a way to do it, but I'm having trouble getting it to build the full depth of the directories on the target when the tree goes more than one level deep.
02-03-2016
04:50 PM
1 Kudo
Falcon is not available in my environment, unfortunately. Is there no way to do this without it? This must come up fairly often with partitioned HDFS files and ORC.
02-02-2016
08:58 PM
2 Kudos
I have a cluster with THP (transparent huge pages) inadvertently left enabled. If I disable it, will processes that are already running stop using it, or do they need to be restarted? Restarting is very inconvenient in this environment.
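For reference, the runtime change I mean is the usual one (paths on RHEL/CentOS; some RHEL 6 builds use /sys/kernel/mm/redhat_transparent_hugepage instead):

  cat /sys/kernel/mm/transparent_hugepage/enabled    # current setting, e.g. [always] madvise never
  echo never > /sys/kernel/mm/transparent_hugepage/enabled
  echo never > /sys/kernel/mm/transparent_hugepage/defrag
  # This also needs to go into rc.local (or a tuned profile) to survive reboots.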
02-02-2016
08:47 PM
4 Kudos
I need to take a list of HDFS directories and copy the contents of those directories to another HDFS cluster using distcp. The problem is recursively creating the directories automatically. These are large partitioned files, and the available means seem to preserve structure only one level deep. Can anyone provide an example?
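To make it concrete, this is the sort of thing I'm attempting per directory (cluster and path names are invented):

  hadoop distcp -update -p \
    hdfs://prod-nn:8020/warehouse/sales/year=2015 \
    hdfs://dr-nn:8020/warehouse/sales/year=2015
  # DistCp does recurse through the source tree itself; the part I can't get right is
  # recreating the full parent structure (warehouse/sales/...) on the target for a long
  # list of source directories, rather than having everything land one level under a
  # single target.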
Labels: Apache Hadoop
01-12-2016
06:56 PM
1 Kudo
One last detail---if the time runs out and the blocks go on the queue for replication, what happens when the node comes back online and reports in? Are they removed from the queue? What if they've already been replicated?
01-11-2016
08:10 PM
1 Kudo
The three staleness properties control how long it takes before nodes that have not been heard from are regarded as stale, and whether to read from or write to such nodes. I don't think that's what we're looking for. What I'm asking is whether it's possible to avoid re-replicating blocks from nodes that are temporarily offline. I found the property dfs.namenode.replication.interval, which is described as "controlling the periodicity with which the NN computes replication work for data nodes." It sounds like bumping it up temporarily might work. Opinions?
01-11-2016
07:21 PM
3 Kudos
I have a requirement to periodically restart all cluster nodes at the machine level. Assume I've run an fsck beforehand to confirm that all blocks are fully replicated. The question is: as I restart each node in turn, will the NameNode notice that blocks on that node are under-replicated and put them on the replication queue? If that does happen, will it automatically remove those blocks from the queue when the DataNode comes back online and reports its blocks to the NameNode? Note that this is a hardware restart, so the Ambari rolling restart doesn't do the job.
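For context, the window I'm worried about is the standard dead-node timeout, which as I understand it is derived from two settings:

  hdfs getconf -confKey dfs.namenode.heartbeat.recheck-interval   # default 300000 (ms)
  hdfs getconf -confKey dfs.heartbeat.interval                    # default 3 (seconds)
  # dead-node timeout = 2 * recheck-interval + 10 * heartbeat interval
  #                   = 2 * 300 s + 10 * 3 s = 630 s, i.e. 10.5 minutes with defaults;
  # re-replication of a node's blocks should only be scheduled after that window expires.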
Tags: Hadoop Core, HDFS
Labels: Apache Hadoop
01-06-2016
07:37 PM
3 Kudos
Within a cluster we have no trouble executing commands against an HA NameNode using the NameServiceID. But it doesn't work when doing distcp from one cluster to another, because the clusters are unaware of each other's mapping of NameNodes to NameServiceIDs. How does one do this?
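To make this concrete (names are invented), what I'd like to be able to run is

  hadoop distcp hdfs://nameservice1/apps/data hdfs://nameservice2/apps/data
  # fails from cluster 1, which knows nothing about nameservice2

and the fallback I can think of is to resolve the remote active NameNode by hand:

  hdfs haadmin -getServiceState nn1    # run on the remote cluster; prints active/standby
  hadoop distcp hdfs://nameservice1/apps/data hdfs://remote-nn1.example.com:8020/apps/data
  # Presumably the cleaner fix is to add the remote cluster's dfs.nameservices,
  # dfs.ha.namenodes.* and dfs.namenode.rpc-address.* entries to the local hdfs-site.xml
  # so both NameServiceIDs resolve, but I'd like confirmation.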
Labels: Apache Hadoop
01-04-2016
05:24 PM
4 Kudos
We have two use cases--one is the normal slight imbalance that creeps up gradually, and the other is when we add new nodes. Ten new nodes can mean 100 TB+ to move around--that can take a very long time at the normal dfs.datanode.balance.bandwidthPerSec setting. What's a good strategy? Is it reasonable to use cron to raise the value during off hours? What's the best practice? Also, does rebalancing defer to normal processing dynamically?
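For reference, the knobs I'm weighing (numbers are only illustrative):

  # raise the per-DataNode balancing bandwidth at runtime -- no restart required
  hdfs dfsadmin -setBalancerBandwidth 104857600      # 100 MB/s, in bytes per second

  # crontab sketch: open the throttle and kick off the balancer at night,
  # throttle back in the morning (the balancer itself still has to be stopped separately)
  0 22 * * *  hdfs dfsadmin -setBalancerBandwidth 104857600 && hdfs balancer -threshold 5
  0 6  * * *  hdfs dfsadmin -setBalancerBandwidth 10485760
  # -threshold is the allowed per-node deviation from average utilization, in percent.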
Labels: Apache Hadoop
12-09-2015
04:47 PM
The documentation seems to suggest that the normal mode of use would be to have one reconstituted replica sitting around, and that reconstructing an encoded block would be done only when that isn't the case. Keeping a full replica by default would eliminate most of the space savings, because the data would expand from 1.6 to 2.6 times the raw file size. Why not have a policy that leaves a single full-size copy around for a limited time after a block is used? A "working set," as it were, so that if you've used a block in the last X hours the decoded copy won't be deleted.
12-08-2015
10:29 PM
1 Kudo
The admins want to know why every service has its own account ID, and whether there is any harm in using the same account for all of them. The cluster will be tightly secured. What is the best practice?
12-06-2015
12:03 AM
1 Kudo
Hadoop has long stressed moving the code to the data, both because it's faster to move the code than the data and, more importantly, because the network is a limited shared resource that can easily be swamped. Erasure coding would seem to require that a large proportion of the data move across the network, because the contents of a single block reside on multiple nodes. This would presumably apply not just to the ToR switch but to the shared network as well, if the ability to tolerate the loss of a rack is preserved. Is this true, and how are these principles reconciled?
Tags: Hadoop Core, HDFS
Labels: Apache Hadoop
11-23-2015
06:10 PM
1 Kudo
Your inode article is a great addition to David's answer. I'm puzzled though that any machine would run out of inodes before running out of disk space---it would require a strange configuration of the file system, wouldn't it? Was someone trying to save on inode allocation by assuming the average file would be larger? I can't think of any other reason to stray from the defaults. Any idea why?
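For anyone checking their own nodes, the quick comparison is (mount point invented):

  df -h /data    # space used on the mount
  df -i /data    # inodes used on the same mount
  # a filesystem formatted with a large bytes-per-inode ratio can show plenty of free
  # space in the first command while the second sits at 100%.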
11-20-2015
08:26 PM
Thanks Ancil. I'm still curious about what can be done from inside Hadoop. The federation of queries is particularly interesting because you don't always want to import the data into HDFS.
11-20-2015
03:44 PM
1 Kudo
Thanks for the reply. Yes, I read that page--the problem is confirming whether the version of the connector in the tarball that page leads to, which seems to be for HDP 2.3, works with 2.2.4. I can't seem to locate one specifically for 2.2.4.
11-20-2015
02:47 PM
2 Kudos
The Hortonworks Connector for Teradata enables inbound ingestion via Sqoop as well as outbound transfers via Sqoop. Can someone please outline any other modes of interaction with Teradata that may be available? For instance, is it possible to execute Hive queries against Teradata without actually importing the data? Can queries be federated?
Labels: Apache Sqoop
11-20-2015
02:39 PM
Need to confirm correct version of the Hortonworks Connector for Teradata for an HDP 2.2.4 installation. Is the one in hdp-connector-for-teradata-1.4.1.2.3.2.0-2950-distro.tar.gz acceptable?
Labels: Hortonworks Data Platform (HDP)
11-20-2015
01:07 AM
So a truly in-place conversion is impossible---but it sounds like, if the data were partitioned, one could run the distcp on one partition at a time, deleting each original partition after it is copied. Thanks man.
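For the record, I mean something along these lines (paths and partition names are invented):

  for PART in 2015-01 2015-02 2015-03; do   # one entry per partition directory
    # copy (or rewrite/merge) one partition into the new tree ...
    hadoop distcp -update "/warehouse/events/${PART}" "/warehouse/events_merged/${PART}"
    # ... then reclaim the space before moving on, so usage never doubles overall
    hdfs dfs -rm -r -skipTrash "/warehouse/events/${PART}"
  done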
11-19-2015
09:40 PM
1 Kudo
I have hundreds of thousands of small files (well under the 64 MB block size) that I'd like to turn into a more manageable number of larger files, say 128 MB or 256 MB each. This is CSV data. How can I do this with a distributed job, and can it be done "in place", i.e., without temporarily doubling the space requirement?
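One distributed approach I'm considering, assuming a Hive external table already sits over the CSV directory (table names here are invented), is to rewrite the data through Hive with its small-file merge settings turned on:

  hive -e "
    SET hive.merge.mapfiles=true;
    SET hive.merge.mapredfiles=true;
    SET hive.merge.smallfiles.avgsize=134217728;
    SET hive.merge.size.per.task=268435456;      -- aim for roughly 256 MB output files
    INSERT OVERWRITE TABLE events_csv_compacted SELECT * FROM events_csv_small;
  "
  # The rewrite does need a second location while it runs, which is exactly the
  # space-doubling concern above.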
Labels: Apache Hadoop
11-19-2015
01:16 PM
What patterns or practices exist for dealing with time-series data specifically in batch mode, i.e., Tez or MR as opposed to Spark? Sorting orders the data within a block or ORC split, but how are boundaries between blocks usually handled? For instance, finding derivatives, inflection points, etc. breaks down at file boundaries---are there standard patterns or libraries to deal with this?
Labels: Apache Hadoop, Apache Tez
11-18-2015
02:32 PM
1 Kudo
Monte Carlo is one of many simulation types that execute a huge number of repetitive tasks using relatively little data. The "data" is usually little more than sets of parameters to a function that must be executed a zillion times, often followed by some kind of summarizing pass. Clearly a custom MR job can be written for this, but are there any standard frameworks that HDP recommends, or a published set of best practices?
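To clarify what I mean, the hand-rolled pattern I know of is a map-only streaming job where each input line is one parameter set and NLineInputFormat hands a fixed number of lines to each mapper (the jar path and property name are what I believe they are on Hadoop 2.x / HDP, so please correct me; simulate.py is a placeholder):

  hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -D mapreduce.job.reduces=0 \
    -D mapreduce.input.lineinputformat.linespermap=1000 \
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
    -input /sim/params.txt -output /sim/results \
    -mapper simulate.py -file simulate.py
  # Each mapper runs the simulation for its 1000 parameter lines; a small follow-up job
  # (or a single reducer) does the summarizing. I'm asking whether anything more standard
  # than rolling this by hand is recommended.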
Labels: Apache Hadoop
11-18-2015
01:46 PM
That would be great, thanks. Please shoot me the link if you do file an enhancement. My intuition is that granting a percentage of total cluster capacity, 0 to 100, would make sense--but perhaps not for the entire replication job, only for clearing the most urgent queue. Some customers really will want nothing else done until safety is restored. Banks especially have all kinds of mandates, requirements, consent decrees, etc., that produce what seem from the outside to be unreasonable demands.
11-17-2015
03:14 PM
That's what I was looking for. Thanks. Recovery time is a concern for a bank I'm working with because there's a window of exposure to data loss while the data is under-replicated. This would be more useful if expressed as a % of capacity. I wonder if it would be worth asking for an enhancement?
11-17-2015
03:03 PM
Fantastic answer. Exactly what I was looking for.
11-16-2015
04:24 PM
4 Kudos
Earlier Hadoop versions had problems with many small files because of the demands they placed on the NameNode. Modern machines and newer NameNode versions seem to have mitigated this somewhat, but by how much? Is there a rule of thumb for how many files is too many? Small files also carry proportionately more overhead per MB of data. Is there a rule of thumb for what is too small?
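For what it's worth, the rule of thumb I've seen cited is roughly 150 bytes of NameNode heap per file, directory, or block object, which makes the back-of-the-envelope check simply:

  echo $(( 100000000 * 150 / 1024 / 1024 / 1024 ))   # prints 13, i.e. ~13-14 GB of heap for ~100M objects

Part of what I'm asking is how far that figure can be trusted with current hardware and NameNode versions.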
Labels: Apache Hadoop