Member since: 10-06-2015
Posts: 45
Kudos Received: 54
Solutions: 0
02-18-2016
06:52 PM
1 Kudo
Thanks for getting back to me. Yes--I'm aware of the -m option, but it appears from the documentation that the mappers get a list of HDFS-level files and work on those. I'm trying to confirm whether my understanding is accurate: unlike a typical MapReduce job that deals in individual blocks or splits, each distcp map gets the URI of an entire file (or files) to copy. So you might have hundreds of blocks, but if they all belong to one file, the same mapper handles all of them. Is that the case?
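To make the question concrete, here's the kind of invocation I mean (cluster names and paths are invented):

  hadoop distcp -m 50 hdfs://nn1:8020/data/big hdfs://nn2:8020/data/big
  # -m caps the number of map tasks, but my understanding is that DistCp splits work
  # by file, not by block, so one very large file is still copied end to end by a
  # single mapper. The dynamic strategy rebalances the file list across mappers at
  # run time, which helps with skewed file sizes, but the file stays the unit of work:
  hadoop distcp -strategy dynamic -m 50 hdfs://nn1:8020/data/big hdfs://nn2:8020/data/big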
02-18-2016
05:35 PM
2 Kudos
We see very few mappers created for distcp copies. Are these mappers allocated at the block level or at the file level? I.e., does a mapper copy a physical block, or does it copy an entire logical file?
Labels: Apache Hadoop
02-12-2016
03:43 PM
1 Kudo
All---thanks for the very helpful answers. The real issue here is that values get changed after the original, correct installation. Then you get nailed by surprise later, because an arbitrarily long time can go by before processes are restarted (that's what happens repeatedly here). It would be wonderful if Ambari had an option to periodically re-run the same checks it performs at install time to catch this kind of thing.
02-05-2016
09:34 PM
2 Kudos
Aha. The problem turns out to be with the multiple directories named in the source-list file. You can have many sources, but only one target. The behavior I was looking for is for distcp to create a separate tree for each input directory under the target. That doesn't seem to be how distcp works, but it's easy to script around it, along the lines sketched below.
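Something along these lines does the trick as the wrapper (file and host names are invented):

  # sources.txt lists one HDFS directory per line, e.g. /apps/data/tbl1
  while read SRC; do
    hadoop distcp -update "hdfs://src-nn:8020${SRC}" "hdfs://dst-nn:8020/backup${SRC}"
  done < sources.txt
  # Each source directory gets its own subtree under /backup instead of everything
  # landing directly under a single target directory.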
02-03-2016
07:41 PM
I must have been unclear. We definitely want to use distcp and cannot use Falcon for administrative reasons. The problem is that I can't get fully recursive behavior with distcp. There's probably a way to do it, but I'm having trouble getting it to build the full depth of the directories on the target when the tree goes more than one level deep.
02-03-2016
04:50 PM
1 Kudo
Falcon is not available in my environment, unfortunately. Is there no way to do this without it? This must come up fairly often with partitioned HDFS files and ORC.
02-02-2016
08:58 PM
2 Kudos
I have a cluster with THP (transparent huge pages) inadvertently left enabled. If I disable it, will processes that are already running stop using it, or do they need to be restarted? Restarting is very inconvenient in this environment.
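For reference, the runtime change I mean is the usual one (paths on RHEL/CentOS; some RHEL 6 builds use /sys/kernel/mm/redhat_transparent_hugepage instead):

  cat /sys/kernel/mm/transparent_hugepage/enabled    # current setting, e.g. [always] madvise never
  echo never > /sys/kernel/mm/transparent_hugepage/enabled
  echo never > /sys/kernel/mm/transparent_hugepage/defrag
  # This also needs to go into rc.local (or a tuned profile) to survive reboots.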
02-02-2016
08:47 PM
4 Kudos
I need to take a list of HDFS directories and copy the contents of those directories to another HDFS cluster using distcp. The problem is recursively creating the directories automatically. These are large partitioned files, and the available means seem to preserve structure only one level deep. Can anyone provide an example?
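To make it concrete, this is the sort of thing I'm attempting per directory (cluster and path names are invented):

  hadoop distcp -update -p \
    hdfs://prod-nn:8020/warehouse/sales/year=2015 \
    hdfs://dr-nn:8020/warehouse/sales/year=2015
  # DistCp does recurse through the source tree itself; the part I can't get right is
  # recreating the full parent structure (warehouse/sales/...) on the target for a long
  # list of source directories, rather than having everything land one level under a
  # single target.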
Labels: Apache Hadoop
01-12-2016
06:56 PM
1 Kudo
One last detail---if the time runs out and the blocks go on the queue for replication, what happens when the node comes back online and reports in? Are they removed from the queue? What if they've already been replicated?
01-11-2016
08:10 PM
1 Kudo
The three staleness properties control how long it takes before nodes that have not been heard from are regarded as stale, and whether to read from or write to such nodes. I don't think that's what we're looking for. What I'm asking is whether it's possible to avoid re-replicating blocks from nodes that are temporarily offline. I found the property dfs.namenode.replication.interval, which is described as "controlling the periodicity with which the NN computes replication work for data nodes." It sounds like bumping it up temporarily might work. Opinions?
01-11-2016
07:21 PM
3 Kudos
I have a requirement to periodically restart all cluster nodes at the machine level. Assume I've run an fsck beforehand to confirm that all blocks are fully replicated. The question is: as I restart each node in turn, will the NameNode notice that blocks on that node are under-replicated and put them on the replication queue? If that does happen, will it automatically remove those blocks from the queue when the DataNode comes back online and reports its blocks to the NameNode? Note that this is a hardware restart, so the Ambari rolling restart doesn't do the job.
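For context, the window I'm worried about is the standard dead-node timeout, which as I understand it is derived from two settings:

  hdfs getconf -confKey dfs.namenode.heartbeat.recheck-interval   # default 300000 (ms)
  hdfs getconf -confKey dfs.heartbeat.interval                    # default 3 (seconds)
  # dead-node timeout = 2 * recheck-interval + 10 * heartbeat interval
  #                   = 2 * 300 s + 10 * 3 s = 630 s, i.e. 10.5 minutes with defaults;
  # re-replication of a node's blocks should only be scheduled after that window expires.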
Tags: Hadoop Core, HDFS
Labels: Apache Hadoop
01-06-2016
07:37 PM
3 Kudos
Within a cluster we have no trouble executing commands against an HA NameNode using the NameServiceID. But it doesn't work when doing distcp from one cluster to another, because the clusters are unaware of each other's mapping of NameNodes to NameServiceIDs. How does one do this?
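To make this concrete (names are invented), what I'd like to be able to run is

  hadoop distcp hdfs://nameservice1/apps/data hdfs://nameservice2/apps/data
  # fails from cluster 1, which knows nothing about nameservice2

and the fallback I can think of is to resolve the remote active NameNode by hand:

  hdfs haadmin -getServiceState nn1    # run on the remote cluster; prints active/standby
  hadoop distcp hdfs://nameservice1/apps/data hdfs://remote-nn1.example.com:8020/apps/data
  # Presumably the cleaner fix is to add the remote cluster's dfs.nameservices,
  # dfs.ha.namenodes.* and dfs.namenode.rpc-address.* entries to the local hdfs-site.xml
  # so both NameServiceIDs resolve, but I'd like confirmation.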
Labels: Apache Hadoop
01-04-2016
05:24 PM
4 Kudos
We have two use cases--one is the normal slight imbalance that creeps up gradually, and the other is when we add new nodes. Ten new nodes can mean 100 TB+ to move around--that can take a very long time at the normal dfs.datanode.balance.bandwidthPerSec setting. What's a good strategy? Is it reasonable to use cron to raise the value during off hours? What's the best practice? Also, does rebalancing defer to normal processing dynamically?
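For reference, the knobs I'm weighing (numbers are only illustrative):

  # raise the per-DataNode balancing bandwidth at runtime -- no restart required
  hdfs dfsadmin -setBalancerBandwidth 104857600      # 100 MB/s, in bytes per second

  # crontab sketch: open the throttle and kick off the balancer at night,
  # throttle back in the morning (the balancer itself still has to be stopped separately)
  0 22 * * *  hdfs dfsadmin -setBalancerBandwidth 104857600 && hdfs balancer -threshold 5
  0 6  * * *  hdfs dfsadmin -setBalancerBandwidth 10485760
  # -threshold is the allowed per-node deviation from average utilization, in percent.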
Labels: Apache Hadoop
12-09-2015
04:47 PM
The documentation seems to suggest that the normal mode of use would be to have one reconstituted replica sitting around, and that reconstructing an encoded block would be done only when that isn't the case. Keeping a full replica by default would eliminate most of the space savings, because the data would expand from 1.6 to 2.6 times the raw file size. Why not have a policy that leaves a single full-size copy around for a limited time after a block is used? A "working set," as it were, so that if you've used a block in the last X hours the decoded copy won't be deleted.
12-08-2015
10:29 PM
1 Kudo
The admins want to know why every service has its own account ID, and whether there is any harm in using the same account for all of them. The cluster will be tightly secured. What is the best practice?
12-06-2015
12:03 AM
1 Kudo
Hadoop has long stressed moving the code to the data, both because it's faster to move the code than the data and, more importantly, because the network is a limited shared resource that can easily be swamped. Erasure coding would seem to require that a large proportion of the data move across the network, because the contents of a single block reside on multiple nodes. This would presumably apply not just to the ToR switch but to the shared network as well, if the ability to tolerate the loss of a rack is preserved. Is this true, and how are these principles reconciled?
Tags: Hadoop Core, HDFS
Labels: Apache Hadoop
11-23-2015
06:10 PM
1 Kudo
Your inode article is a great addition to David's answer. I'm puzzled though that any machine would run out of inodes before running out of disk space---it would require a strange configuration of the file system, wouldn't it? Was someone trying to save on inode allocation by assuming the average file would be larger? I can't think of any other reason to stray from the defaults. Any idea why?
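For anyone checking their own nodes, the quick comparison is (mount point invented):

  df -h /data    # space used on the mount
  df -i /data    # inodes used on the same mount
  # a filesystem formatted with a large bytes-per-inode ratio can show plenty of free
  # space in the first command while the second sits at 100%.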
11-20-2015
08:26 PM
Thanks Ancil. I'm still curious about what can be done from inside Hadoop. The federation of queries is particularly interesting because you don't always want to import the data into HDFS.
11-20-2015
03:44 PM
1 Kudo
Thanks for the reply. Yes, I read that page--the problem is confirming whether the version of the connector in the tarball that page leads to, which seems to be for HDP 2.3, works with 2.2.4. I can't seem to locate one specifically for 2.2.4.
11-20-2015
02:47 PM
2 Kudos
The Hortonworks Connector for Teradata enables inbound ingestion via Sqoop as well as outbound transfers via Sqoop. Can someone please outline any other modes of interaction with Teradata that may be available? For instance, is it possible to execute Hive queries against Teradata without actually importing the data? Can queries be federated?
Labels: Apache Sqoop
11-20-2015
02:39 PM
Need to confirm correct version of the Hortonworks Connector for Teradata for an HDP 2.2.4 installation. Is the one in hdp-connector-for-teradata-1.4.1.2.3.2.0-2950-distro.tar.gz acceptable?
Labels: Hortonworks Data Platform (HDP)
11-20-2015
01:07 AM
So a truly in-place conversion is impossible---but it sounds like, if the data were partitioned, one could run the distcp on one partition at a time, deleting each original partition after it is copied. Thanks man.
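For the record, I mean something along these lines (paths and partition names are invented):

  for PART in 2015-01 2015-02 2015-03; do   # one entry per partition directory
    # copy (or rewrite/merge) one partition into the new tree ...
    hadoop distcp -update "/warehouse/events/${PART}" "/warehouse/events_merged/${PART}"
    # ... then reclaim the space before moving on, so usage never doubles overall
    hdfs dfs -rm -r -skipTrash "/warehouse/events/${PART}"
  done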
11-19-2015
09:40 PM
1 Kudo
I have hundreds of thousands of small files (well under the 64 MB block size) that I'd like to turn into a more manageable number of larger files, say 128 MB or 256 MB each. This is CSV data. How can I do this with a distributed job, and can it be done "in place", i.e., without temporarily doubling the space requirement?
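One distributed approach I'm considering, assuming a Hive external table already sits over the CSV directory (table names here are invented), is to rewrite the data through Hive with its small-file merge settings turned on:

  hive -e "
    SET hive.merge.mapfiles=true;
    SET hive.merge.mapredfiles=true;
    SET hive.merge.smallfiles.avgsize=134217728;
    SET hive.merge.size.per.task=268435456;      -- aim for roughly 256 MB output files
    INSERT OVERWRITE TABLE events_csv_compacted SELECT * FROM events_csv_small;
  "
  # The rewrite does need a second location while it runs, which is exactly the
  # space-doubling concern above.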
Labels: Apache Hadoop
11-19-2015
01:16 PM
What patterns or practices exist for dealing with time-series data specifically in batch mode, i.e., Tez or MR as opposed to Spark? Sorting orders the data within a block or ORC split, but how are boundaries between blocks usually handled? For instance, finding derivatives, inflection points, etc. breaks down at file boundaries---are there standard patterns or libraries to deal with this?
Labels: Apache Hadoop, Apache Tez
11-18-2015
02:32 PM
1 Kudo
Monte Carlo is one of many simulation types that execute a huge number of repetitive tasks using relatively little data. The "data" is usually little more than sets of parameters to a function that must be executed a zillion times, often followed by some kind of summarizing pass. Clearly a custom MR job can be written for this, but are there any standard frameworks that HDP recommends, or a published set of best practices?
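To clarify what I mean, the hand-rolled pattern I know of is a map-only streaming job where each input line is one parameter set and NLineInputFormat hands a fixed number of lines to each mapper (the jar path and property name are what I believe they are on Hadoop 2.x / HDP, so please correct me; simulate.py is a placeholder):

  hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -D mapreduce.job.reduces=0 \
    -D mapreduce.input.lineinputformat.linespermap=1000 \
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
    -input /sim/params.txt -output /sim/results \
    -mapper simulate.py -file simulate.py
  # Each mapper runs the simulation for its 1000 parameter lines; a small follow-up job
  # (or a single reducer) does the summarizing. I'm asking whether anything more standard
  # than rolling this by hand is recommended.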
Labels: Apache Hadoop
11-18-2015
01:46 PM
That would be great, thanks. Please shoot me the link if you do file an enhancement. My intuition is that granting a percentage of total cluster capacity, 0 to 100, would make sense--but perhaps not for the entire replication job, only for clearing the most urgent queue. Some customers really will want nothing else done until safety is restored. Banks especially have all kinds of mandates, requirements, consent decrees, etc., that produce what seem from the outside to be unreasonable demands.
11-17-2015
03:14 PM
That's what I was looking for. Thanks. Recovery time is a concern for a bank I'm working with because there's a window of exposure to data loss while the data is under-replicated. This would be more useful if expressed as a % of capacity. I wonder if it would be worth asking for an enhancement?
11-17-2015
03:03 PM
Fantastic answer. Exactly what I was looking for.
11-16-2015
04:24 PM
4 Kudos
Earlier Hadoop versions had problems with many small files because of the demands they placed on the NameNode. Modern machines and newer NameNode versions seem to have mitigated this somewhat, but by how much? Is there a rule of thumb for how many files is too many? Small files also carry proportionately more overhead per MB of data. Is there a rule of thumb for what is too small?
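For what it's worth, the rule of thumb I've seen cited is roughly 150 bytes of NameNode heap per file, directory, or block object, which makes the back-of-the-envelope check simply:

  echo $(( 100000000 * 150 / 1024 / 1024 / 1024 ))   # prints 13, i.e. ~13-14 GB of heap for ~100M objects

Part of what I'm asking is how far that figure can be trusted with current hardware and NameNode versions.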
Labels: Apache Hadoop