Member since: 06-29-2016
Posts: 81
Kudos Received: 43
Solutions: 1
My Accepted Solutions
Title | Views | Posted
---|---|---
  | 1169 | 03-16-2016 08:26 PM
03-28-2016
01:21 PM
1 Kudo
@Benjamin Leonhardi On 3, by normal load I was referring to your slides 14 and 15. As per those, if you have 30 buckets and 40 partitions, you would have 30 reducers in total (one reducer per bucket across all partitions). So it is only 30 files versus 1200 files in the optimized case. That's why I still wonder how it fixes the small-file problem (as per slide 16). At the same time I understand the point about the performance and memory issues; it really is optimized in those two respects.
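For concreteness, here is a minimal sketch of the optimized load being discussed; the table and column names (target, staging, id, value, dt) are made up, and the target is assumed to be partitioned by dt and bucketed by id into 30 buckets, stored as ORC:
-- enable dynamic partitioning and the sorted dynamic-partition optimization
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.optimize.sort.dynamic.partition=true;
-- with the optimization on, rows are distributed and sorted by the partition (and bucket) keys,
-- so each reducer keeps only one ORC writer open at a time
INSERT OVERWRITE TABLE target PARTITION (dt)
SELECT id, value, dt
FROM staging;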
03-25-2016
09:47 PM
1 Kudo
@Benjamin Leonhardi, Thanks. Just one last set of questions:
1. Sort by only sorts within a reducer. So if you have 10 reducers, the data ends up in 10 different ORC files. If you apply sort by on column C1, the same C1 value may still appear in all 10 files unless you also distribute by C1. But within each of those files, sorting may help to skip blocks. Am I right?
2. Does ORC maintain the index at the block level or the stripe level? (As per slide 6 it looks like block level, but as per slide 4 it is at the stripe level.) If it is at the stripe level, it can skip a stripe, but if a stripe has to be read, does it have to read the entire stripe?
3. On "Optimized", I understand it in terms of performance, but it still has more reducers than the normal load, so how does it fix the small-file problem?
4. Maybe PPD is only for the ORC format, but do the other concepts of partitioning, bucketing and the optimized load apply to other formats as well?
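To make question 1 concrete, a minimal sketch of distributing and sorting by the predicate column; events_orc, events_staging, c1 and c2 are made-up names:
-- all rows with the same c1 value go to the same reducer (one ORC file per reducer),
-- and rows are sorted within each reducer's output so ORC indexes can skip blocks
INSERT OVERWRITE TABLE events_orc
SELECT c1, c2
FROM events_staging
DISTRIBUTE BY c1
SORT BY c1;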
03-25-2016
08:01 PM
@Joseph Niemiec You mentioned "Left outerjoin and test for null in the WHERE is probably better for scaling then UNION DISTINCT if you are worried about a reducer problem. Same join syntax as the example below..." How does the left outer join avoid the reducer (unless it is a map join)? Do you recommend the left outer join over union distinct? And on the point "We have found a fun case where if you try to use this to dedupe or clean.....", my understanding is that if a partition has 5 records which are duplicates (the initial master load already had them), there is no way to remove them unless a 6th record which is a duplicate of those 5 comes in via the staging load. Am I right? If so, what is your recommendation for removing duplicates in the initial load itself?
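For reference, this is the left-outer-join-with-null-test pattern as I understand it, sketched with made-up tables (master, staging, new_records) keyed on a single column id:
-- keep only the staging rows whose key has no match in master
INSERT OVERWRITE TABLE new_records
SELECT s.*
FROM staging s
LEFT OUTER JOIN master m
  ON s.id = m.id
WHERE m.id IS NULL;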
03-25-2016
07:38 PM
1 Kudo
@Benjamin Leonhardi I went through your slides and have a few questions around them. Since they are related to bucketing and partitioning, I think it makes sense to continue in this same thread.
- In dynamic partition (DP) loading, you used the term standard load. Does that mean setting the number of reducers to 0, or do you mean something else?
- You mentioned that a larger number of writers and a large number of partitions lead to small files. The number of mappers is based on the number of blocks, and each mapper writes separate files to every partition. So irrespective of the number of partitions, a large number of mappers by itself can lead to small files. Am I right?
- What is the default key used for distribution if you don't use the distribute by clause? What is the distribution key when there is more than one partition? (See the sketch after this list.)
- Slide 13 – to enable this kind of load, do you recommend just setting the number of reducers to the same as the number of partitions? And any reducer can get data for any partition, which may lead to small files; is this what you mean by a hash conflict?
- Slide 14 – one reducer for each bucket across all partitions leads to ORC writer memory issues. Why is this the case?
- Optimized dynamic sorted partitioning – one reducer for each partition and bucket. From the point above, with 5 partitions and 4 buckets there would be only 4 reducers, but in the optimized case there are 20 reducers. The more reducers, the smaller the files are going to be, so how can this solve the small-file problem?
- Sort by for PPD – the ORC index would in any case help to skip reading some blocks. But when it comes to reading the block that contains the predicate value, sorting helps performance only if the predicate value is reached quickly while reading the file. If the value happens to be at the end of the file, you still end up reading the whole file. So the performance improvement of PPD with sort by really depends, am I right?
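For concreteness, the sketch referred to above: explicitly distributing by the dynamic partition column, with made-up table and column names (sales, sales_staging, amount, customer_id, dt):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- all rows with the same dt value go to the same reducer, so each partition gets
-- one file per reducer that received it rather than one file per mapper
INSERT OVERWRITE TABLE sales PARTITION (dt)
SELECT amount, customer_id, dt
FROM sales_staging
DISTRIBUTE BY dt;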
03-22-2016
03:00 PM
@Joseph Niemiec It looks like this approach imposes a restriction that the columns to be compared for duplicates have to be the partition columns, and not all columns may qualify as partition columns. Even if I compute a hash of those columns, partitioning on that hash column may not qualify due to its high cardinality. Is there any other option apart from dynamic partition pruning?
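One possible alternative, sketched with made-up names: instead of partitioning on those columns (or on a hash of them), bucket on them, since Hive hashes the clustered-by columns into a fixed number of buckets regardless of their cardinality:
-- candidate duplicates hash into the same bucket, so comparisons stay bucket-local
CREATE TABLE history_bucketed (
  col_a STRING,
  col_b STRING,
  col_c STRING,
  payload STRING
)
CLUSTERED BY (col_a, col_b, col_c) INTO 64 BUCKETS
STORED AS ORC;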
03-21-2016
02:13 AM
5 Kudos
Problem Statement: I have a huge historical data set in HDFS from which I want to remove duplicates to begin with. The daily ingested data also has to be compared with the history to remove duplicates, and the daily data may contain duplicates within itself as well. Duplicates could mean either of the following:
- If the keys in two records are the same, the records are duplicates.
- It depends on a few columns: if those columns match, the records are duplicates.
Question: What is an optimized solution to remove duplicates in both of these situations? Can we avoid the reducer at all, and if so, what are the options? How would hashing help here? I have seen vague solutions around, but they are not well documented and are hard to understand. I have already looked at link, but it is not clear. Code samples would help.
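To illustrate the kind of code sample I am after, here is a minimal HiveQL sketch of a window-function approach; the table and column names (history, daily_staging, history_dedup, history_merged, k1, k2, payload, load_ts) are made up, and this may well not be the optimized, reducer-free solution I am asking about:
-- (a) remove exact duplicates within the initial history load
INSERT OVERWRITE TABLE history_dedup
SELECT k1, k2, payload, load_ts
FROM history
GROUP BY k1, k2, payload, load_ts;

-- (b) merge the daily load with the deduplicated history,
--     keeping one row per key (the newest by load_ts)
INSERT OVERWRITE TABLE history_merged
SELECT k1, k2, payload, load_ts
FROM (
  SELECT k1, k2, payload, load_ts,
         ROW_NUMBER() OVER (PARTITION BY k1, k2 ORDER BY load_ts DESC) AS rn
  FROM (
    SELECT k1, k2, payload, load_ts FROM history_dedup
    UNION ALL
    SELECT k1, k2, payload, load_ts FROM daily_staging
  ) u
) t
WHERE rn = 1;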
Labels:
- Apache Hadoop
- Apache Hive
03-17-2016
02:48 PM
2 Kudos
@Artem Ervits The HCatOutputFormat class is in fact in the jar /usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar. Actually it is not about this specific class or jar to begin with; it is the command that I used. Changing it to the following works; note that the program arguments come at the end. The link that you suggested made me try this. Also, the other jars mentioned in the above link need to be added, as it complained about other classes one by one. Those jars may differ based on the distribution and version. In HDP 2.3.2 I did the following:
export HCAT_HOME=/usr/hdp/current/hive-webhcat
export HIVE_HOME=/usr/hdp/current/hive-client
export LIB_JARS=$HCAT_HOME/share/hcatalog/hive-hcatalog-core.jar,$HIVE_HOME/lib/hive-metastore.jar,$HIVE_HOME/lib/libthrift-0.9.2.jar,$HIVE_HOME/lib/hive-exec.jar,$HIVE_HOME/lib/libfb303-0.9.2.jar,$HIVE_HOME/lib/jdo-api-3.0.1.jar,$HIVE_HOME/lib/datanucleus-api-jdo-3.2.6.jar
hadoop jar mr-hcat.jar <mainclass> -libjars ${LIB_JARS} mr_input_text mr_output_text
03-16-2016
08:28 PM
2 Kudos
I ran into issues while trying to read from a Hive table and write to a Hive table using MapReduce.
Input hive table: mr_input_text
Output hive table: mr_output_text
Command that runs the MapReduce job:
hadoop jar mr-hcat.jar mr_input_text mr_output_text
Environment: HDP 2.3.2
This fails with the following exception (obtained with the "yarn logs -applicationId " command):
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hcatalog.mapreduce.HCatOutputFormat not found
I tried setting -libjars with the hcatalog core jar as follows, but it still fails:
hadoop jar mr-hcat.jar mr_input_text mr_output_text -libjars /usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar
Note:
I tried running as both the hdfs and hive users. I have already looked at this stackoverflow link.
Labels:
- Apache Hadoop
- Apache HCatalog
- Apache Hive
03-16-2016
08:26 PM
Issue resolved by setting /etc/hive/conf in the classpath instead of /etc/hive/conf/*.
03-15-2016
09:44 PM
1 Kudo
@Benjamin Leonhardi, On select performance, which version of Hive are you referring to? I believe you are talking about data pruning (I posted a question related to that). On the number of buckets, I am not sure I understood it well. Going by your customer id example for bucketing, how would the number of buckets be decided?
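For reference, a minimal sketch of the kind of bucketed table in question; the names and the bucket count are made up, and how to choose that count is exactly the question:
-- Hive hashes customer_id into a fixed number of buckets chosen at table creation time
CREATE TABLE customers (
  customer_id BIGINT,
  name        STRING
)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC;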