Member since
06-29-2016
81
Posts
43
Kudos Received
1
Solution
My Accepted Solutions
Title | Views | Posted |
---|---|---|
  | 701 | 03-16-2016 08:26 PM |
12-21-2016
11:26 AM
2 Kudos
I see 3 different options to deploy HDP on Azure:
1. HDInsight (built on top of HDP)
2. HDP as a Service
3. Deploying HDP on Azure's bare metal
In my understanding, 1 and 2 are managed services where control is limited when it comes to the choice of OS etc. HDInsight has multiple cluster types (I'm not sure what the rationale behind this is, though). Questions:
1. What is the rationale behind having multiple cluster types for HDInsight?
2. Why are two services (1 and 2 above) offered? When should each be used (apart from this)?
3. Are there any performance benchmarks for HDInsight or HDP on Azure in a production situation?
4. What are the different storage types possible on the above services? At least on HDInsight I see that Blob storage and Data Lake Store are options, but both are external to the compute nodes. That may hurt performance, hence my curiosity about question 3, apart from the fact that the clusters run on virtual machines.
5. What are the options to provision HDP on Azure bare-metal nodes (option 3)? Does Cloudbreak help there?
Labels:
- Hortonworks Data Platform (HDP)
08-25-2016
08:50 AM
@Tom McCuch Thanks a lot for the views and inputs. It definitely helps.
08-22-2016
06:14 AM
@Tom McCuch Thanks again. Do you recommend that the data be sorted for the ORC optimization to work, or does it not really matter? Also, has any volume benchmarking or performance testing been done for ad-hoc queries with the optimizations mentioned above?
08-18-2016
09:45 AM
@Tom McCuch Thanks for the detailed response. In terms of querying capabilities (from a BI tool, a CLI, or Hue), one way to achieve the fast query response required for operational reporting is to structure the data (by means of partitions etc.) for pre-defined queries. But for ad-hoc operational reporting queries, what is your take on an ODS in Hadoop achieving the desired performance? One option is to restrict the volume of data in the ODS layer (in addition to ORC format, Tez, etc.), since it serves operational needs anyway (so history may not be required). Please share your thoughts.
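As a concrete illustration of the "structure the data" option mentioned above, a minimal Hive sketch (table and column names are hypothetical, not from any actual schema):

```sql
-- Hypothetical ODS claims table, partitioned by event date so that
-- pre-defined operational queries can prune partitions.
CREATE TABLE IF NOT EXISTS ods_claims (
  claim_id   STRING,
  policy_id  STRING,
  amount     DECIMAL(10,2)
)
PARTITIONED BY (event_date STRING)
STORED AS ORC
tblproperties ("orc.compress"="SNAPPY");

-- A query that filters on the partition column only reads the
-- matching partition directories:
--   SELECT policy_id, SUM(amount) FROM ods_claims
--   WHERE event_date = '2016-08-01' GROUP BY policy_id;
```

This helps the pre-defined queries; truly ad-hoc queries that do not filter on the partition column get no pruning benefit, which is where restricting ODS volume matters.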
08-16-2016
09:48 AM
2 Kudos
I have been asked to build an ODS (operational data store) in Hadoop for an insurance client. In this regard, a few questions:
First of all, is it recommended to build the ODS in Hadoop? What are the pros and cons of building an ODS in Hadoop? Are there any best practices around this topic? The ODS should facilitate operational reporting needs, including support for ad-hoc queries.
Labels:
- Apache Hadoop
06-29-2016
01:00 PM
@Benjamin Leonhardi Thanks, that makes sense.
06-28-2016
10:13 PM
Data comes from multiple sources and is exposed in Hive tables for the users. A specific column is sensitive and needs to be given restricted access. If a user wants to join two such tables on the column that he does not have access to, what is the best approach to make it work? One option is to link the sensitive column to a generated key so that the user can join on the generated key instead. Is this a good idea, or is there a better one?
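A minimal sketch of the generated-key idea described above (all table and column names are hypothetical; it assumes a mapping table populated by a privileged process, with access to the raw column and the mapping table restricted):

```sql
-- Privileged side: map each sensitive value (e.g. an SSN) to a
-- surrogate key. End users never see this table or the raw column.
CREATE TABLE IF NOT EXISTS sensitive_mapping (
  ssn            STRING,   -- restricted
  surrogate_key  STRING    -- e.g. a hash or sequence value
);

-- Views exposed to users carry only the surrogate key:
CREATE VIEW customers_v AS
  SELECT m.surrogate_key, c.name, c.region
  FROM customers c JOIN sensitive_mapping m ON c.ssn = m.ssn;

CREATE VIEW claims_v AS
  SELECT m.surrogate_key, cl.claim_id, cl.amount
  FROM claims cl JOIN sensitive_mapping m ON cl.ssn = m.ssn;

-- The user can now join the two datasets without ever reading the
-- sensitive column itself:
--   SELECT * FROM customers_v c JOIN claims_v cl
--     ON c.surrogate_key = cl.surrogate_key;
```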
Labels:
- Apache Hadoop
- Apache Hive
05-13-2016
01:53 PM
What does it mean for a Hive table in ORC or Avro format to have a field delimiter specified? Does Hive ignore it even if it is specified? For example:

```sql
CREATE TABLE IF NOT EXISTS T (
  C1 STRING,
  C2 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
```
Labels:
- Apache Hive
04-11-2016
02:50 PM
Can a non-numeric column be specified as the --split-by key parameter in Sqoop? What are the potential issues in doing so?
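For context on why this question matters: Sqoop divides an import among mappers by first running a boundary query over the split column and then partitioning that range. A rough sketch of what it issues (table and column names are hypothetical):

```sql
-- Boundary query Sqoop runs for the split column:
SELECT MIN(order_id), MAX(order_id) FROM orders;

-- With numeric bounds 1..1000 and 4 mappers, each mapper then runs
-- a range predicate, roughly:
--   SELECT ... FROM orders WHERE order_id >= 1   AND order_id < 251;
--   SELECT ... FROM orders WHERE order_id >= 251 AND order_id < 501;
--   ... and so on.
```

Splitting a numeric range into even sub-ranges is straightforward; dividing a range of strings evenly is much less well defined, which is the core risk with a non-numeric split key.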
Labels:
- Apache Sqoop
03-30-2016
01:24 PM
@Benjamin Leonhardi I think I got it. It's still the same number of files, just with more reducers. In my mind it was always just the buckets, not the partitions, so I thought it was 30 files (30 buckets and 40 partitions). But in fact it's still 1200 files (40 partitions × 30 buckets) in both cases; the optimized version simply uses more reducers.