Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 7335 | 08-12-2016 01:02 PM |
| | 2705 | 08-08-2016 10:00 AM |
| | 3646 | 08-03-2016 04:44 PM |
| | 7199 | 08-03-2016 02:53 PM |
| | 1859 | 08-01-2016 02:38 PM |
05-15-2016
12:34 PM
2 Kudos
"My understanding so far is that partitioning a table optimises the performance of queries such that rather than performing the query on the entire table it performs the query only on the partition of interest e.g. find employee details where state = NYC. It will just query the NYC partition and return the employee details, correct? These partitions are stored in separate directories/files in HDFS." Correct "What is a bucket and why would one use them rather than partitions? I take it a bucket and cluster are the same beast just that you use "clusteredby" to create the buckets?" You are correct and buckets are essentially files in these partition folders. Every bucket = one file. You can find the reasoning and the uses for them here: https://community.hortonworks.com/questions/23103/hive-deciding-the-number-of-buckets.html
05-13-2016
10:54 AM
1. What is the max number of joins we can use in Hive for best performance? What are the limitations of using joins? What happens if we use multiple joins (will it affect performance, or will the job fail)?
There is no maximum. By now Hive has a good cost-based optimizer (CBO) with statistics, so as long as you properly compute statistics on the tables you can run complex queries as well. However, denormalized tables are cheaper in Hadoop (storage is cheap), so they make more sense than in traditional databases. But as Sourygna said, this is a very general question.

2. While querying, what kind of fields should be used as join keys?
As in any database, integer keys are best. Strings work but may require more memory. If you use floats you get what you deserve :-).

3. How should we make use of partitioning and bucketing?
http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data

4. How critical is type casting, i.e. converting data types on the fly in queries?
Better if you don't do it. ORC files are optimized for each datatype, so storing strings and casting them on demand will slow performance. For delimited files it is much less important.

5. Will using multiple casts affect Hive job performance?
See 4. Yes, as long as you use ORC.

6. How can we avoid using multiple inner joins; is there an alternative?
Denormalization?

7. What is the best way of doing splitting?
Not sure I understand the question. If you use ORC you get 256 MB blocks by default, which contain 64 MB stripes; that is a good default. But if you want more map tasks you can reduce the block size.

8. When should we use a left outer join or a right outer join to avoid a full table scan?
Very generic question.

9. What is the best way to use a select query instead of scanning the full table?
Very generic question. Look at the presentation I linked for details on predicate pushdown, and sort your data properly during insert.

10. Map join optimization: when should we use map joins?
When the small table fits easily into the memory of a map task (see the sketch after this list).

11. Skew join optimization: when should we use skew joins?
https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization has details on when it is useful.

12. SMB join optimization: when should we go for SMB joins?
Seriously, you should read the Hive confluence page. In general I would trust the CBO.

13. During huge data processing, what should we do to prevent job failures? What are the best practices in that scenario?
The problems I have seen were WAY too many partitions and small files in each partition; too many splits result in problems. So make sure to properly load data into Hive (see my presentation) and that the file sizes in your Hive tables are reasonable. Also keep an eye on reducer and mapper numbers to make sure they are in a healthy range. If they aren't, there is no fixed rule on why.

14. What is the advantage of denormalization, and where should I use it in Hive?
Fewer joins but more data space.

As Sourygna said, these are some veeery generic questions. You might have to drill down a bit into what you actually, concretely want.
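To make points 1 and 10 a bit more concrete, here is a minimal sketch of feeding the cost-based optimizer with statistics and letting Hive convert small-table joins into map joins. The employees/departments tables continue the hypothetical example above, and the threshold value is only illustrative:

```sql
-- Compute table- and column-level statistics so the CBO can pick good join orders
-- (hypothetical table names).
ANALYZE TABLE employees PARTITION (state) COMPUTE STATISTICS;
ANALYZE TABLE employees PARTITION (state) COMPUTE STATISTICS FOR COLUMNS;
ANALYZE TABLE departments COMPUTE STATISTICS;
ANALYZE TABLE departments COMPUTE STATISTICS FOR COLUMNS;

-- Let Hive automatically convert a join into a map join when the small side
-- fits under the size threshold (the value here is just an example).
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask.size=268435456;

SELECT e.name, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id
WHERE e.state = 'NYC';
```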
05-12-2016
12:08 PM
I had it in what I sent you and it seemed to work. I would open a support case.
05-12-2016
11:46 AM
Falcon has the following parameter that can be set, the retry policy: <retry policy="exp-backoff" delay="hours(1)" attempts="1"/> See https://falcon.apache.org/EntitySpecification.html and search for "Retry".
05-12-2016
10:11 AM
The thing is that Pig has an abstraction layer between the operators and the actual implementation. Tez per se does not need to be a mapper or a reducer; it is by definition more flexible. However, since Hive and Pig were written with the map/reduce model in mind, that model was kept for the compilation into Tez. After all, the underlying needs didn't change too much: you still need mappers for data transforms and reducers for group bys, joins, etc. In general I would look into the Tez view to find the details of the tasks. http://hortonworks.com/wp-content/uploads/2015/05/am_tabl_3.png
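As a small illustration on the Hive side (hypothetical table; the same idea applies to a Pig script), the Tez plan of a simple aggregation still consists of map-style and reduce-style vertices:

```sql
-- On Hive-on-Tez the plan for this query is compiled into a DAG whose vertices
-- still look like map tasks (scan/filter) feeding reduce tasks (the GROUP BY
-- aggregation), typically shown as "Map 1" -> "Reducer 2" in the explain output.
EXPLAIN
SELECT state, count(*) AS cnt
FROM employees
GROUP BY state;
```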
05-12-2016
08:00 AM
This can have different reasons, unfortunately. I have seen the same error when using wrong hostnames (not fully qualified, localhost instead of the full hostname, ...). I have also seen it in the context of security, i.e. when connecting to a kerberized Kafka and providing a wrong JAAS configuration. The first thing I would do is check the broker-host variable: fully qualify it and see if that fixes anything.
05-11-2016
02:52 PM
You can find information on the included products on the homepage http://hortonworks.com/ and then click on products. In general you might find the following tools interesting:
- HBase: a NoSQL store that can handle huge data volumes and is good for user transactions (a bit like a simpler OLTP database on speed). The API is a simple put/get/scan API.
- Phoenix: a SQL layer on top of HBase; this is a proper big data transaction store (see the sketch below).
- Kafka: a message-queue-like cache, often used as the realtime store and buffer in a big data system.
- Flume: a framework to load data into Hadoop (either HDFS or Kafka etc.). You can, for example, have one Flume agent on each webserver to aggregate web logs.
- NiFi/Hortonworks Data Flow: similar to Flume but much more powerful and simply better; the tool to gather data in your enterprise, filter and transform it, and push it into Hadoop. A bit like a realtime ETL engine.
- Storm: realtime analytics; typically consumes from Kafka.
- Spark/Spark Streaming: Spark is an analytical platform and provides a streaming version very similar in use cases to Storm but very different in execution (mini batches, powerful analytics built in, ...).
- Hive: the OLAP-like database in Hadoop, for large analytical queries. It also provides transactions for streaming inserts, but that is still a bit new.

I think you would need to explain a bit what your product is actually supposed to do so we could answer more intelligently.
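For Phoenix, a minimal sketch of what the SQL layer over HBase looks like; the table, columns and values are hypothetical:

```sql
-- Phoenix exposes HBase through SQL; UPSERT is its combined insert/update statement.
CREATE TABLE IF NOT EXISTS web_orders (
  order_id  BIGINT NOT NULL PRIMARY KEY,
  customer  VARCHAR,
  amount    DECIMAL(10, 2)
);

UPSERT INTO web_orders (order_id, customer, amount) VALUES (1, 'acme', 99.95);

SELECT customer, SUM(amount) AS total
FROM web_orders
GROUP BY customer;
```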
05-11-2016
10:29 AM
1 Kudo
There are two applications:
- the Oozie launcher (one AM, one map task; look in both). Here you would find any errors kicking off the Sqoop job, most likely in the map task that runs the sqoop command.
- the actual Sqoop application; this one will have one AM and x mappers. You could find errors here as well.
And there are three different outputs; you will know the right one when you see it. The log you sent seems to be the AppMaster log and is meaningless.
05-11-2016
10:23 AM
1 Kudo
You need to look at the output of the actual Sqoop job running in YARN. Hue, the Oozie UI, the Resource Manager UI, or yarn logs -applicationId <> allow you to read them.
05-10-2016
03:53 PM
Would be interesting to see. There seem to be a couple of data quality tools out there in the open source community (Mural/Mosaic), but the last update in the repository seems to have been 4 years ago, so I'm not sure how useful that is. https://java.net/projects/mosaic