Not getting any output in SortMergeJoin example of Tez
Labels: Apache Hadoop, Apache Tez
Created ‎04-13-2017 12:24 AM
I am running a sort-merge join with the Tez examples jar on Tez 0.7.1. Samples of the two files:
ISBN;"Book-Title";"Book-Author";"Year-Of-Publication";"Publisher";"Image-URL-S";"Image-URL-M";"Image-URL-L"
0195153448;"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press";"http://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg";"http://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg";"http://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg"
User-ID;"ISBN";"Book-Rating"
276725;"034545104X";"0"
The first file has about 300 thousand records and the second about 1 million; the common attribute is the ISBN of a book.
The DAG completes successfully, but there is no output. Even the logs look fine.
My understanding of SortMergeJoin is that it sorts both datasets on the join attribute and then finds qualifying records by merging the two sorted datasets. Sorting places all tuples with the same value in the join column next to each other, which makes it easy to identify groups of tuples that share a join value. I am referring to this link from the Tez examples. I just wanted to confirm how it decides the join attribute, which in this case should be ISBN. Please help.
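The general technique described in the question can be sketched in plain Python as below. This is only an illustration of a textbook sort-merge join on an explicit key, not the code of the Tez example. The function name `sort_merge_join`, the tuple layout, the placeholder book title, and the user IDs `999999` and `999998` are all made up for the sketch; only the two ISBNs and user ID `276725` come from the samples above.

```python
def sort_merge_join(left, right, left_key, right_key):
    """Join two lists of records on a key using sort + merge.

    left_key / right_key are functions that extract the join key
    from a record of the respective input.
    """
    left_sorted = sorted(left, key=left_key)
    right_sorted = sorted(right, key=right_key)

    i = j = 0
    joined = []
    while i < len(left_sorted) and j < len(right_sorted):
        lk = left_key(left_sorted[i])
        rk = right_key(right_sorted[j])
        if lk < rk:
            i += 1          # left key too small, advance left
        elif lk > rk:
            j += 1          # right key too small, advance right
        else:
            # Find the end of the run of equal keys on each side.
            i_end = i
            while i_end < len(left_sorted) and left_key(left_sorted[i_end]) == lk:
                i_end += 1
            j_end = j
            while j_end < len(right_sorted) and right_key(right_sorted[j_end]) == lk:
                j_end += 1
            # Emit every pairing of the two runs.
            for l_rec in left_sorted[i:i_end]:
                for r_rec in right_sorted[j:j_end]:
                    joined.append((l_rec, r_rec))
            i, j = i_end, j_end
    return joined


# (ISBN, title); the second title is a placeholder, not real data.
books = [
    ("0195153448", "Classical Mythology"),
    ("034545104X", "Placeholder Title"),
]
# (user_id, ISBN, rating); users 999999 and 999998 are made-up rows.
ratings = [
    ("276725", "034545104X", "0"),
    ("999999", "0195153448", "5"),
    ("999998", "0195153448", "3"),
]

joined = sort_merge_join(books, ratings, lambda b: b[0], lambda r: r[1])
for book, rating in joined:
    print(book[0], book[1], rating[0], rating[2])
```

Note that the join key has to be told to the join explicitly (here through the two key functions). A join that is never given a key has nothing but the whole record to compare.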
Created ‎04-19-2017 12:15 PM
The Tez "sortmergejoin" example works very differently from what you expect: it uses a sort-merge join to identify all lines in file1 that also occur in file2, treating each whole line as the key. Your output is empty because there are no identical lines in your two input files.
If you want to use SQL joins, either consider using Hive on Tez or write a MapReduce job that does the join operation.
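To make the behaviour described above concrete, here is a rough Python sketch of whole-line matching. It is an assumption-laden simplification, not the Tez implementation: `common_lines` and `field` are hypothetical helpers, the book line is shortened (image URL columns omitted), and the second rating line is made up so that one ISBN matches.

```python
import csv


def common_lines(lines_a, lines_b):
    """Sort both inputs, then merge; emit lines present in both.

    The whole line is the key, so two lines match only if they are
    identical character for character.
    """
    a, b = sorted(lines_a), sorted(lines_b)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            out.append(a[i])  # each matched pair of lines is emitted once
            i += 1
            j += 1
    return out


def field(line, index):
    """Return one column of a ';'-separated, '"'-quoted line."""
    return next(csv.reader([line], delimiter=";", quotechar='"'))[index]


# Sample rows from the question (book row shortened for brevity).
book_line = '0195153448;"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press"'
rating_line = '276725;"034545104X";"0"'
# Made-up rating row, added only so that one ISBN matches.
extra_rating_line = '276726;"0195153448";"5"'

# Whole lines never match across the two files, so the result is empty.
print(common_lines([book_line], [rating_line, extra_rating_line]))

# If each input held only the ISBN, line matching would find the overlap.
book_keys = [field(book_line, 0)]
rating_keys = [field(l, 1) for l in (rating_line, extra_rating_line)]
print(common_lines(book_keys, rating_keys))
```

The second call only reports which ISBNs appear in both files; it does not carry the other columns along, which is why a real join in Hive on Tez or a custom job is still the better fit here.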
Created ‎04-20-2017 06:11 AM
@gnovak Thanks a lot, I guess I missed that point. That has to be the reason why there is nothing in the output.
