Support Questions

mrizvi · ‎04-13-2017

I am running the sort-merge join example from the Tez examples jar on Tez 0.7.1. Here is a sample of each of the two input files:

ISBN;"Book-Title";"Book-Author";"Year-Of-Publication";"Publisher";"Image-URL-S";"Image-URL-M";"Image-URL-L"
0195153448;"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press";"http://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg";"http://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg";"http://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg"

User-ID;"ISBN";"Book-Rating"
276725;"034545104X";"0"

The first file has about 300 thousand records and the second has around 1 million. The common attribute is the ISBN of a book.

The DAG completes successfully, but there is no output. Even the logs look fine.

My understanding of SortMergeJoin is that it sorts both datasets on the join attribute and then finds qualifying records by merging the two sorted datasets. The sorting step groups together all tuples with the same value in the join column, which makes it easy to identify groups of tuples that share a join value. I am referring to this link from the Tez examples. I just wanted to confirm how it decides the join attribute, which in this case should be ISBN. Please help.
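For reference, the general sort-merge join described above can be sketched in a few lines of Python. This is only an illustration of the textbook algorithm with an explicit join key, not the code in the Tez examples jar. The function name `sort_merge_join` and the second rating row are made up for the illustration; the other rows come from the samples above.

```python
def sort_merge_join(left, right, left_key, right_key):
    """Textbook sort-merge join: sort both inputs on the key, then merge."""
    left = sorted(left, key=left_key)
    right = sorted(right, key=right_key)
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left_key(left[i]), right_key(right[j])
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-side rows sharing this key.
            j_end = j
            while j_end < len(right) and right_key(right[j_end]) == lk:
                j_end += 1
            # Pair every left row with this key with every right row in the run.
            while i < len(left) and left_key(left[i]) == lk:
                for r in right[j:j_end]:
                    out.append((left[i], r))
                i += 1
            j = j_end
    return out


# (ISBN, title) taken from the books sample.
books = [("0195153448", "Classical Mythology")]
# (User-ID, ISBN, rating): the first row is from the ratings sample,
# the second row is hypothetical, added only so that one key matches.
ratings = [("276725", "034545104X", "0"), ("999999", "0195153448", "5")]

joined = sort_merge_join(books, ratings,
                         left_key=lambda b: b[0],
                         right_key=lambda r: r[1])
print(joined)
```

Here the join key is passed in explicitly (`left_key`, `right_key`), which is exactly the piece the question asks about.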

gnovak · ‎04-19-2017

@Mushtaq Rizvi

The Tez "sortmergejoin" example works very differently from what you expect: it uses a sort-merge join to identify all occurrences of lines in file1 that also occur in file2. In other words, the whole line acts as the join key. Your output is empty because there are no identical lines in your input files.

If you want to use SQL joins, either consider using Hive on Tez or write a MapReduce job that does the join operation.
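To make the difference concrete, here is a small Python model of the behaviour described above, where the entire line is the key. It is a simplified sketch, not the Tez implementation; `common_lines` is a hypothetical name, and the book line is shortened from the sample in the question (the image URL columns are dropped).

```python
def common_lines(file1_lines, file2_lines):
    """Emit every occurrence of a line in file1 that also occurs in file2."""
    a = sorted(file1_lines)
    b = sorted(file2_lines)
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            # Whole lines are equal: keep the file1 line and advance only file1,
            # so repeated file1 lines are all emitted.
            out.append(a[i])
            i += 1
    return out


# Book line shortened from the sample in the question.
books = ['0195153448;"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press"']
ratings = ['276725;"034545104X";"0"']

# No line of the books file is identical to a line of the ratings file,
# so the result is empty, even if some ISBN values were shared.
print(common_lines(books, ratings))
```

A join on ISBN needs the key to be extracted from each record first, which is what a Hive query or a custom job would do.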

mrizvi · ‎04-20-2017

@gnovak Thanks a lot, I guess I missed that point. That has to be the reason why there is nothing in the output.

Not getting any output in SortMergeJoin example of Tez