Member since: 10-06-2015
Posts: 45
Kudos Received: 54
Solutions: 0
11-20-2015
02:47 PM
2 Kudos
The Hortonworks Connector for Teradata enables both data ingestion into HDP and export back to Teradata via Sqoop. Can someone please outline any other modes of interaction with Teradata that may be available? For instance, is it possible to execute Hive queries against Teradata without actually importing the data? Can queries be federated?
Labels: Apache Sqoop
11-20-2015
02:39 PM
I need to confirm the correct version of the Hortonworks Connector for Teradata for an HDP 2.2.4 installation. Is the one in hdp-connector-for-teradata-1.4.1.2.3.2.0-2950-distro.tar.gz acceptable?
Labels: Hortonworks Data Platform (HDP)
11-20-2015
01:07 AM
So a truly in-place conversion is impossible, but it sounds like if the data were partitioned, one could run distcp on one partition at a time, deleting each original partition after it is copied. Thanks man.
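For what it's worth, here is a minimal sketch of that partition-at-a-time idea driven from Java through the DistCp API (Hadoop 2.x style DistCpOptions constructor). The paths and directory layout are made up, and I'm assuming a flat directory of partition subdirectories; the point is just that each original partition is only deleted after its copy succeeds, so the extra space needed at any moment is bounded by one partition.

```java
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

// Hypothetical driver: copy one partition directory at a time with DistCp and
// remove the source partition only after its copy completes successfully.
public class PartitionAtATimeCopy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path source = new Path(args[0]);  // e.g. /data/raw        (dir of partition dirs, assumed)
    Path target = new Path(args[1]);  // e.g. /data/compacted  (assumed)

    for (FileStatus part : fs.listStatus(source)) {
      if (!part.isDirectory()) {
        continue;  // only move whole partition directories
      }
      DistCpOptions opts = new DistCpOptions(
          Collections.singletonList(part.getPath()),
          new Path(target, part.getPath().getName()));
      Job job = new DistCp(conf, opts).execute();  // runs the distributed copy and waits
      if (job.isSuccessful()) {
        fs.delete(part.getPath(), true);           // reclaim the original partition
      }
    }
  }
}
```

The same loop could obviously be done from a shell script around the distcp command; the Java form is just to make the "copy, verify, then delete" ordering explicit.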
11-19-2015
09:40 PM
1 Kudo
I have hundreds of thousands of small CSV files (each well under the 64 MB block size) that I'd like to consolidate into a more manageable number of larger files, say 128 MB or 256 MB each. How can I do this with a distributed job, and can it be done "in place", i.e., without temporarily doubling the space requirement?
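To frame the kind of distributed job I have in mind, here is a minimal sketch: a single MapReduce pass that shuffles every CSV line to a small, fixed number of reducers, so the output comes back as that many large files. All class names and paths are made up, and it assumes row order does not matter for this data.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CsvCompaction {

  // Spread lines across reducers; the key only decides which output file a line lands in.
  public static class SpreadMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private final IntWritable bucket = new IntWritable();
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws java.io.IOException, InterruptedException {
      bucket.set((int) (Math.abs(offset.get() + line.hashCode()) % ctx.getNumReduceTasks()));
      ctx.write(bucket, line);
    }
  }

  // Each reducer writes one large part file containing every line routed to it.
  public static class WriteReducer extends Reducer<IntWritable, Text, Text, NullWritable> {
    @Override
    protected void reduce(IntWritable bucket, Iterable<Text> lines, Context ctx)
        throws java.io.IOException, InterruptedException {
      for (Text line : lines) {
        ctx.write(line, NullWritable.get());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "csv-compaction");
    job.setJarByClass(CsvCompaction.class);
    job.setMapperClass(SpreadMapper.class);
    job.setReducerClass(WriteReducer.class);
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    job.setNumReduceTasks(Integer.parseInt(args[2]));  // e.g. total data size / 256 MB
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note this still writes to a separate output directory first, which is exactly why I'm asking about the temporary space requirement.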
Labels: Apache Hadoop
11-19-2015
01:16 PM
What patterns or practices exist for dealing with time-series data specifically in batch mode, i.e., Tez or MR as opposed to Spark? Sorting orders the data within a block or ORC split, but how are boundaries between blocks usually handled? For instance, finding derivatives, inflection points, etc. breaks down at file boundaries. Are there standard patterns or libraries to deal with this?
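To make the boundary problem concrete, here is one pattern I have seen described (not an HDP-specific library): re-key records into fixed time windows in the mapper and also emit records near a window edge into the neighbouring window, so every reducer has the overlap it needs to compute derivatives across the boundary. The window size, overlap, and CSV layout below are made-up assumptions.

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the "overlapping window" (halo) pattern: each record is keyed by its
// time window, and records within OVERLAP_MS of a window edge are also sent to
// the adjacent window so boundary-crossing calculations see enough context.
public class WindowWithHaloMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
  private static final long WINDOW_MS  = 60L * 60L * 1000L;  // 1-hour windows (assumed)
  private static final long OVERLAP_MS = 5L * 60L * 1000L;   // 5-minute halo (assumed)
  private final LongWritable windowKey = new LongWritable();

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws java.io.IOException, InterruptedException {
    // Assumes the first CSV column is an epoch-millisecond timestamp.
    long ts = Long.parseLong(line.toString().split(",", 2)[0]);
    long window = ts / WINDOW_MS;

    windowKey.set(window);
    ctx.write(windowKey, line);

    long posInWindow = ts % WINDOW_MS;
    if (posInWindow < OVERLAP_MS) {                       // near the left edge
      windowKey.set(window - 1);
      ctx.write(windowKey, line);
    } else if (WINDOW_MS - posInWindow <= OVERLAP_MS) {   // near the right edge
      windowKey.set(window + 1);
      ctx.write(windowKey, line);
    }
  }
}
```

The reducer would then sort each window by timestamp (or rely on secondary sort) and discard results computed purely inside the halo so nothing is double counted. I'd still like to know whether there is a standard library that packages this up.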
Labels: Apache Hadoop, Apache Tez
11-18-2015
02:32 PM
1 Kudo
Monte Carlo is one of many simulation types that execute a huge number of repetitive tasks using relatively little data. The "data" is usually little more than sets of parameters to a function that must be executed a zillion times. Often this is followed by some kind of summarizing process. Clearly a custom MR job can be written for this, but are there any standard frameworks that HDP recommends, or a published set of best practices?
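To illustrate the shape of the workload, here is the plain-MapReduce version I have in mind: each input line carries one parameter set, the mapper evaluates it many times, and a reducer does the summarizing. Everything here (field layout, trial count, the simulate() stub) is made up for illustration.

```java
import java.util.Random;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of a Monte Carlo run as plain MapReduce: tiny input, huge compute.
public class MonteCarloSketch {

  public static class TrialMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private static final int TRIALS_PER_PARAMETER_SET = 1_000_000;  // assumed
    private final Random rng = new Random();
    private final DoubleWritable result = new DoubleWritable();

    @Override
    protected void map(LongWritable offset, Text paramLine, Context ctx)
        throws java.io.IOException, InterruptedException {
      String[] params = paramLine.toString().split(",");
      for (int i = 0; i < TRIALS_PER_PARAMETER_SET; i++) {
        result.set(simulate(params, rng));   // the expensive function
        ctx.write(paramLine, result);
      }
    }

    // Placeholder for the real model; here just noise around the first parameter.
    private double simulate(String[] params, Random rng) {
      return Double.parseDouble(params[0]) + rng.nextGaussian();
    }
  }

  // Summarize each parameter set, e.g. by averaging its trial results.
  public static class MeanReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text params, Iterable<DoubleWritable> results, Context ctx)
        throws java.io.IOException, InterruptedException {
      double sum = 0;
      long n = 0;
      for (DoubleWritable r : results) {
        sum += r.get();
        n++;
      }
      ctx.write(params, new DoubleWritable(sum / n));
    }
  }
}
```

In practice one would pre-aggregate in the mapper or add a combiner so the shuffle does not carry every individual trial, and something like NLineInputFormat would give each parameter set its own map task since the compute dominates the I/O. My question is whether there is a recommended framework that already handles this pattern.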
Labels: Apache Hadoop
11-18-2015
01:46 PM
That would be great, thanks. Please shoot me the link if you do file an enhancement. My intuition is that granting a percentage of total cluster capacity, 0 to 100, would make sense, though perhaps not for the entire replication job, only for clearing the most urgent queue. Some customers really will want nothing else done until safety is restored. Banks especially have all kinds of mandates, requirements, consent decrees, etc., that produce what seem from the outside to be unreasonable demands.
11-17-2015
03:14 PM
That's what I was looking for. Thanks. Recovery time is a concern for a bank I'm working with because there's a window of exposure to data loss while the data is under-replicated. The setting would be more useful if it were expressed as a percentage of capacity. I wonder if it would be worth asking for an enhancement.
11-17-2015
03:03 PM
Fantastic answer. Exactly what I was looking for.
11-16-2015
04:24 PM
4 Kudos
Earlier Hadoop versions had problems with large numbers of small files because of the demands they placed on the NameNode. Modern machines and newer versions of the NameNode seem to have mitigated this somewhat, but by how much? Is there a rule of thumb for how many files is too many? Small files also carry proportionately more overhead per MB of data. Is there a rule of thumb for what is too small?
Labels: Apache Hadoop