Member since: 06-29-2016
Posts: 81
Kudos Received: 43
Solutions: 1
My Accepted Solutions
Title | Views | Posted
---|---|---
  | 1169 | 03-16-2016 08:26 PM
03-28-2016
01:21 PM
1 Kudo
@Benjamin Leonhardi On 3, by normal load I was referring to your slides 14 and 15. As per those, if you have 30 buckets and 40 partitions, you would have 30 reducers in total (one reducer per bucket across all partitions). So it is only 30 files versus 1200 files in the optimized case. That's why I still wonder how it fixes the small-file problem (as per slide 16). At the same time I understand the point about the performance and memory issues; it really is optimized in those two respects.
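For concreteness, here is a minimal sketch of the optimized load being discussed; the table and column names (target, staging, id, value, dt) are made up, and the target is assumed to be partitioned by dt and bucketed by id into 30 buckets, stored as ORC:
-- enable dynamic partitioning and the sorted dynamic-partition optimization
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.optimize.sort.dynamic.partition=true;
-- with the optimization on, rows are distributed and sorted by the partition (and bucket) keys,
-- so each reducer keeps only one ORC writer open at a time
INSERT OVERWRITE TABLE target PARTITION (dt)
SELECT id, value, dt
FROM staging;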
03-25-2016
09:47 PM
1 Kudo
@Benjamin Leonhardi, Thanks. Just one last set of questions:
1. Sort by only sorts within a reducer. So if you have 10 reducers, the data ends up in 10 different ORC files. If you apply sort by on column C1, the same C1 value may still appear in all 10 files unless you also distribute by C1. But within each of those files, sorting may help to skip blocks. Am I right?
2. Does ORC maintain the index at the block level or the stripe level? (As per slide 6 it looks like block level, but as per slide 4 it is at the stripe level.) If it is at the stripe level, it can skip a stripe, but if a stripe has to be read, does it have to read the entire stripe?
3. On "Optimized", I understand it in terms of performance, but it still has more reducers than the normal load, so how does it fix the small-file problem?
4. Maybe PPD is only for the ORC format, but do the other concepts of partitioning, bucketing and the optimized load apply to other formats as well?
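To make question 1 concrete, a minimal sketch of distributing and sorting by the predicate column; events_orc, events_staging, c1 and c2 are made-up names:
-- all rows with the same c1 value go to the same reducer (one ORC file per reducer),
-- and rows are sorted within each reducer's output so ORC indexes can skip blocks
INSERT OVERWRITE TABLE events_orc
SELECT c1, c2
FROM events_staging
DISTRIBUTE BY c1
SORT BY c1;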
03-25-2016
08:01 PM
@Joseph Niemiec You mentioned "Left outerjoin and test for null in the WHERE is probably better for scaling then UNION DISTINCT if you are worried about a reducer problem. Same join syntax as the example below..." How does the left outer join avoid the reducer (unless it is a map join)? Do you recommend the left outer join over union distinct? And on the point "We have found a fun case where if you try to use this to dedupe or clean.....", my understanding is that if a partition has 5 records which are duplicates (the initial master load already had them), there is no way to remove them unless a 6th record which is a duplicate of those 5 comes in via the staging load. Am I right? If so, what is your recommendation for removing duplicates in the initial load itself?
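For reference, this is the left-outer-join-with-null-test pattern as I understand it, sketched with made-up tables (master, staging, new_records) keyed on a single column id:
-- keep only the staging rows whose key has no match in master
INSERT OVERWRITE TABLE new_records
SELECT s.*
FROM staging s
LEFT OUTER JOIN master m
  ON s.id = m.id
WHERE m.id IS NULL;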
03-25-2016
07:38 PM
1 Kudo
@Benjamin Leonhardi I went through your slides and have a few questions around them. Since they are related to bucketing and partitioning, I think it makes sense to continue in this same thread.
- In dynamic partition (DP) loading, you used the term standard load. Does that mean setting the number of reducers to 0, or do you mean something else?
- You mentioned that a larger number of writers and a large number of partitions lead to small files. The number of mappers is based on the number of blocks, and each mapper writes separate files to every partition. So irrespective of the number of partitions, a large number of mappers by itself can lead to small files. Am I right?
- What is the default key used for distribution if you don't use the distribute by clause? What is the distribution key when there is more than one partition? (See the sketch after this list.)
- Slide 13 – to enable this kind of load, do you recommend just setting the number of reducers to the same as the number of partitions? And any reducer can get data for any partition, which may lead to small files; is this what you mean by a hash conflict?
- Slide 14 – one reducer for each bucket across all partitions leads to ORC writer memory issues. Why is this the case?
- Optimized dynamic sorted partitioning – one reducer for each partition and bucket. From the point above, with 5 partitions and 4 buckets there would be only 4 reducers, but in the optimized case there are 20 reducers. The more reducers, the smaller the files are going to be, so how can this solve the small-file problem?
- Sort by for PPD – the ORC index would in any case help to skip reading some blocks. But when it comes to reading the block that contains the predicate value, sorting helps performance only if the predicate value is reached quickly while reading the file. If the value happens to be at the end of the file, you still end up reading the whole file. So the performance improvement of PPD with sort by really depends, am I right?
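For concreteness, the sketch referred to above: explicitly distributing by the dynamic partition column, with made-up table and column names (sales, sales_staging, amount, customer_id, dt):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- all rows with the same dt value go to the same reducer, so each partition gets
-- one file per reducer that received it rather than one file per mapper
INSERT OVERWRITE TABLE sales PARTITION (dt)
SELECT amount, customer_id, dt
FROM sales_staging
DISTRIBUTE BY dt;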
03-22-2016
03:00 PM
@Joseph Niemiec It looks like this approach imposes a restriction that the columns to be compared for duplicates have to be the partition columns, and not all columns may qualify as partition columns. Even if I compute a hash of those columns, partitioning on that hash column may not qualify due to its high cardinality. Is there any other option apart from dynamic partition pruning?
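One possible alternative, sketched with made-up names: instead of partitioning on those columns (or on a hash of them), bucket on them, since Hive hashes the clustered-by columns into a fixed number of buckets regardless of their cardinality:
-- candidate duplicates hash into the same bucket, so comparisons stay bucket-local
CREATE TABLE history_bucketed (
  col_a STRING,
  col_b STRING,
  col_c STRING,
  payload STRING
)
CLUSTERED BY (col_a, col_b, col_c) INTO 64 BUCKETS
STORED AS ORC;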
03-21-2016
02:13 AM
5 Kudos
Problem Statement: I have a huge historical data set in HDFS from which I want to remove duplicates to begin with. The daily ingested data also has to be compared with the history to remove duplicates, and the daily data may contain duplicates within itself as well. Duplicates could mean either of the following:
- If the keys in two records are the same, the records are duplicates.
- It depends on a few columns: if those columns match, the records are duplicates.
Question: What is an optimized solution to remove duplicates in both of these situations? Can we avoid the reducer at all, and if so, what are the options? How would hashing help here? I have seen vague solutions around, but they are not well documented and are hard to understand. I have already looked at link, but it is not clear. Code samples would help.
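To illustrate the kind of code sample I am after, here is a minimal HiveQL sketch of a window-function approach; the table and column names (history, daily_staging, history_dedup, history_merged, k1, k2, payload, load_ts) are made up, and this may well not be the optimized, reducer-free solution I am asking about:
-- (a) remove exact duplicates within the initial history load
INSERT OVERWRITE TABLE history_dedup
SELECT k1, k2, payload, load_ts
FROM history
GROUP BY k1, k2, payload, load_ts;

-- (b) merge the daily load with the deduplicated history,
--     keeping one row per key (the newest by load_ts)
INSERT OVERWRITE TABLE history_merged
SELECT k1, k2, payload, load_ts
FROM (
  SELECT k1, k2, payload, load_ts,
         ROW_NUMBER() OVER (PARTITION BY k1, k2 ORDER BY load_ts DESC) AS rn
  FROM (
    SELECT k1, k2, payload, load_ts FROM history_dedup
    UNION ALL
    SELECT k1, k2, payload, load_ts FROM daily_staging
  ) u
) t
WHERE rn = 1;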
Labels:
- Apache Hadoop
- Apache Hive
03-17-2016
02:48 PM
2 Kudos
@Artem Ervits The HCatOutputFormat class is in fact in the jar /usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar. Actually it is not about this specific class or jar to begin with; it is the command that I used. Changing it to the following works; note that the program arguments come at the end. The link that you suggested made me try this. Also, the other jars mentioned in the above link need to be added, as it complained about other classes one by one. Those jars may differ based on the distribution and version. In HDP 2.3.2 I did the following:
export HCAT_HOME=/usr/hdp/current/hive-webhcat
export HIVE_HOME=/usr/hdp/current/hive-client
export LIB_JARS=$HCAT_HOME/share/hcatalog/hive-hcatalog-core.jar,$HIVE_HOME/lib/hive-metastore.jar,$HIVE_HOME/lib/libthrift-0.9.2.jar,$HIVE_HOME/lib/hive-exec.jar,$HIVE_HOME/lib/libfb303-0.9.2.jar,$HIVE_HOME/lib/jdo-api-3.0.1.jar,$HIVE_HOME/lib/datanucleus-api-jdo-3.2.6.jar
hadoop jar mr-hcat.jar <mainclass> -libjars ${LIB_JARS} mr_input_text mr_output_text
03-16-2016
08:28 PM
2 Kudos
I ran into issues while trying to read from a Hive table and write to a Hive table using MapReduce.
Input hive table: mr_input_text
Output hive table: mr_output_text
Command that runs the MapReduce job:
hadoop jar mr-hcat.jar mr_input_text mr_output_text
Environment: HDP 2.3.2
This fails with the following exception (obtained with the "yarn logs -applicationId " command):
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hcatalog.mapreduce.HCatOutputFormat not found
I tried setting -libjars with the hcatalog core jar as follows, but it still fails:
hadoop jar mr-hcat.jar mr_input_text mr_output_text -libjars /usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar
Note:
I tried running as both the hdfs and hive users. I have already looked at this stackoverflow link.
Labels:
- Apache Hadoop
- Apache HCatalog
- Apache Hive
03-16-2016
08:26 PM
Issue resolved by setting /etc/hive/conf in the classpath instead of /etc/hive/conf/*.
03-15-2016
09:44 PM
1 Kudo
@Benjamin Leonhardi, On select performance, which version of Hive are you referring to? I believe you are talking about data pruning (I posted a question related to that). On the number of buckets, I am not sure I understood it well. Going by your customer id example for bucketing, how would the number of buckets be decided?
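For reference, a minimal sketch of the kind of bucketed table in question; the names and the bucket count are made up, and how to choose that count is exactly the question:
-- Hive hashes customer_id into a fixed number of buckets chosen at table creation time
CREATE TABLE customers (
  customer_id BIGINT,
  name        STRING
)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC;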