Member since: 10-06-2015
Posts: 45
Kudos Received: 54
Solutions: 0
11-20-2015
02:47 PM
2 Kudos
The Hortonworks Connector for Teradata enables both data ingestion into HDP and export back to Teradata via Sqoop. Can someone please outline any other modes of interaction with Teradata that may be available? For instance, is it possible to execute Hive queries against Teradata without actually importing the data? Can queries be federated?
Labels: Apache Sqoop
11-20-2015
02:39 PM
I need to confirm the correct version of the Hortonworks Connector for Teradata for an HDP 2.2.4 installation. Is the one in hdp-connector-for-teradata-1.4.1.2.3.2.0-2950-distro.tar.gz acceptable?
Labels: Hortonworks Data Platform (HDP)
11-20-2015
01:07 AM
So a truly in-place conversion is impossible, but it sounds like if the data were partitioned, one could run distcp on one partition at a time, deleting each original partition after it is copied. Thanks man.
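For what it's worth, here is a minimal sketch of that partition-at-a-time idea driven from Java through the DistCp API (Hadoop 2.x style DistCpOptions constructor). The paths and directory layout are made up, and I'm assuming a flat directory of partition subdirectories; the point is just that each original partition is only deleted after its copy succeeds, so the extra space needed at any moment is bounded by one partition.

```java
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

// Hypothetical driver: copy one partition directory at a time with DistCp and
// remove the source partition only after its copy completes successfully.
public class PartitionAtATimeCopy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path source = new Path(args[0]);  // e.g. /data/raw        (dir of partition dirs, assumed)
    Path target = new Path(args[1]);  // e.g. /data/compacted  (assumed)

    for (FileStatus part : fs.listStatus(source)) {
      if (!part.isDirectory()) {
        continue;  // only move whole partition directories
      }
      DistCpOptions opts = new DistCpOptions(
          Collections.singletonList(part.getPath()),
          new Path(target, part.getPath().getName()));
      Job job = new DistCp(conf, opts).execute();  // runs the distributed copy and waits
      if (job.isSuccessful()) {
        fs.delete(part.getPath(), true);           // reclaim the original partition
      }
    }
  }
}
```

The same loop could obviously be done from a shell script around the distcp command; the Java form is just to make the "copy, verify, then delete" ordering explicit.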
11-19-2015
09:40 PM
1 Kudo
I have hundreds of thousands of small CSV files (each well under the 64 MB block size) that I'd like to consolidate into a more manageable number of larger files, say 128 MB or 256 MB each. How can I do this with a distributed job, and can it be done "in place", i.e., without temporarily doubling the space requirement?
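To frame the kind of distributed job I have in mind, here is a minimal sketch: a single MapReduce pass that shuffles every CSV line to a small, fixed number of reducers, so the output comes back as that many large files. All class names and paths are made up, and it assumes row order does not matter for this data.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CsvCompaction {

  // Spread lines across reducers; the key only decides which output file a line lands in.
  public static class SpreadMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private final IntWritable bucket = new IntWritable();
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws java.io.IOException, InterruptedException {
      bucket.set((int) (Math.abs(offset.get() + line.hashCode()) % ctx.getNumReduceTasks()));
      ctx.write(bucket, line);
    }
  }

  // Each reducer writes one large part file containing every line routed to it.
  public static class WriteReducer extends Reducer<IntWritable, Text, Text, NullWritable> {
    @Override
    protected void reduce(IntWritable bucket, Iterable<Text> lines, Context ctx)
        throws java.io.IOException, InterruptedException {
      for (Text line : lines) {
        ctx.write(line, NullWritable.get());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "csv-compaction");
    job.setJarByClass(CsvCompaction.class);
    job.setMapperClass(SpreadMapper.class);
    job.setReducerClass(WriteReducer.class);
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    job.setNumReduceTasks(Integer.parseInt(args[2]));  // e.g. total data size / 256 MB
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note this still writes to a separate output directory first, which is exactly why I'm asking about the temporary space requirement.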
Labels: Apache Hadoop
11-19-2015
01:16 PM
What patterns or practices exist for dealing with time-series data specifically in batch mode, i.e., Tez or MR as opposed to Spark? Sorting orders the data within a block or ORC split, but how are boundaries between blocks usually handled? For instance, finding derivatives, inflection points, etc. breaks down at file boundaries. Are there standard patterns or libraries to deal with this?
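To make the boundary problem concrete, here is one pattern I have seen described (not an HDP-specific library): re-key records into fixed time windows in the mapper and also emit records near a window edge into the neighbouring window, so every reducer has the overlap it needs to compute derivatives across the boundary. The window size, overlap, and CSV layout below are made-up assumptions.

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the "overlapping window" (halo) pattern: each record is keyed by its
// time window, and records within OVERLAP_MS of a window edge are also sent to
// the adjacent window so boundary-crossing calculations see enough context.
public class WindowWithHaloMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
  private static final long WINDOW_MS  = 60L * 60L * 1000L;  // 1-hour windows (assumed)
  private static final long OVERLAP_MS = 5L * 60L * 1000L;   // 5-minute halo (assumed)
  private final LongWritable windowKey = new LongWritable();

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws java.io.IOException, InterruptedException {
    // Assumes the first CSV column is an epoch-millisecond timestamp.
    long ts = Long.parseLong(line.toString().split(",", 2)[0]);
    long window = ts / WINDOW_MS;

    windowKey.set(window);
    ctx.write(windowKey, line);

    long posInWindow = ts % WINDOW_MS;
    if (posInWindow < OVERLAP_MS) {                       // near the left edge
      windowKey.set(window - 1);
      ctx.write(windowKey, line);
    } else if (WINDOW_MS - posInWindow <= OVERLAP_MS) {   // near the right edge
      windowKey.set(window + 1);
      ctx.write(windowKey, line);
    }
  }
}
```

The reducer would then sort each window by timestamp (or rely on secondary sort) and discard results computed purely inside the halo so nothing is double counted. I'd still like to know whether there is a standard library that packages this up.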
Labels: Apache Hadoop, Apache Tez
11-18-2015
02:32 PM
1 Kudo
Monte Carlo is one of many simulation types that execute a huge number of repetitive tasks using relatively little data. The "data" is usually little more than sets of parameters to a function that must be executed a zillion times. Often this is followed by some kind of summarizing process. Clearly a custom MR job can be written for this, but are there any standard frameworks that HDP recommends, or a published set of best practices?
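To illustrate the shape of the workload, here is the plain-MapReduce version I have in mind: each input line carries one parameter set, the mapper evaluates it many times, and a reducer does the summarizing. Everything here (field layout, trial count, the simulate() stub) is made up for illustration.

```java
import java.util.Random;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of a Monte Carlo run as plain MapReduce: tiny input, huge compute.
public class MonteCarloSketch {

  public static class TrialMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private static final int TRIALS_PER_PARAMETER_SET = 1_000_000;  // assumed
    private final Random rng = new Random();
    private final DoubleWritable result = new DoubleWritable();

    @Override
    protected void map(LongWritable offset, Text paramLine, Context ctx)
        throws java.io.IOException, InterruptedException {
      String[] params = paramLine.toString().split(",");
      for (int i = 0; i < TRIALS_PER_PARAMETER_SET; i++) {
        result.set(simulate(params, rng));   // the expensive function
        ctx.write(paramLine, result);
      }
    }

    // Placeholder for the real model; here just noise around the first parameter.
    private double simulate(String[] params, Random rng) {
      return Double.parseDouble(params[0]) + rng.nextGaussian();
    }
  }

  // Summarize each parameter set, e.g. by averaging its trial results.
  public static class MeanReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text params, Iterable<DoubleWritable> results, Context ctx)
        throws java.io.IOException, InterruptedException {
      double sum = 0;
      long n = 0;
      for (DoubleWritable r : results) {
        sum += r.get();
        n++;
      }
      ctx.write(params, new DoubleWritable(sum / n));
    }
  }
}
```

In practice one would pre-aggregate in the mapper or add a combiner so the shuffle does not carry every individual trial, and something like NLineInputFormat would give each parameter set its own map task since the compute dominates the I/O. My question is whether there is a recommended framework that already handles this pattern.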
Labels: Apache Hadoop
11-18-2015
01:46 PM
That would be great, thanks. Please shoot me the link if you do file an enhancement. My intuition is that granting a percentage of total cluster capacity, 0 to 100, would make sense, though perhaps not for the entire replication job, only for clearing the most urgent queue. Some customers really will want nothing else done until safety is restored. Banks especially have all kinds of mandates, requirements, consent decrees, etc., that produce what seem from the outside to be unreasonable demands.
11-17-2015
03:14 PM
That's what I was looking for. Thanks. Recovery time is a concern for a bank I'm working with because there's a window of exposure to data loss while the data is under-replicated. The setting would be more useful if it were expressed as a percentage of capacity. I wonder if it would be worth asking for an enhancement.
11-17-2015
03:03 PM
Fantastic answer. Exactly what I was looking for.
11-16-2015
04:24 PM
4 Kudos
Earlier Hadoop versions had problems with large numbers of small files because of the demands they placed on the NameNode. Modern machines and newer versions of the NameNode seem to have mitigated this somewhat, but by how much? Is there a rule of thumb for how many files is too many? Small files also carry proportionately more overhead per MB of data. Is there a rule of thumb for what is too small?
Labels: Apache Hadoop