Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Hive- Cross join-Optimization

avatar
New Contributor

Hi,

I have 2 tables in Hive. Table-a has millions of rows and many columns. Table-b has just few rows with two columns (key as tablename, value as 1).

Now I do the following,

1) Cross join table a and b.
2) Based on table A's on of the date column, we need to pass the table b's column "value" as "1" or "2" and bring the table A's date column less than one or two days.

e.g.

select * from table-a, table-b where 1=1 and table-a.date_col <= from_unixtime(unix_timestamp()-(table-b.value*24*60*60)))  - while the table-b.value is 1 then it subtracts one day or any number of hours. 

Basically, we need to pass the table-b value as parameter according to our requirements. This cross join works for smaller tables. When the data of table-a is huge, it creates performance or memory issue. 

Please advise on any solution to fine tune this approach or achieve the result using any alternate approach and overcome this issue in HIVE. 

Thanks.

3 REPLIES 3

avatar
Super Guru
Does it have to be Cross JOIN? You are generating tens or hundreds of million rows, it is very intensive. Is it true that for every row in table-b, you have to join across in table-a?

Cheers
Eric

avatar
New Contributor

The table-b has only one row with a value 1. My intention is to pass this value in TABLE-b.value, and subtract the number of days in the query. If I pass 1 then it should subtract one day from the TABLE-a.date_col. If I pass 2 then it should give me records that are 2 days lesser than TABLE-a.date_col. 

 

SELECT *
FROM TABLE-a,
TABLE-b
WHERE 1=1
AND TABLE-a.date_col <= from_unixtime(unix_timestamp()-(TABLE-b.value*24*60*60)))

avatar
Super Guru
Is table-b's value fixed? You mentioned 1 or 2. Are they pre-defined?

If you know the values up-front, then you don't need to do cross join, just add conditions based on the fixed number of values, like below:

SELECT * FROM table-a WHERE
TABLE-a.date_col <= from_unixtime(unix_timestamp()-(1*24*60*60)))
OR
TABLE-a.date_col <= from_unixtime(unix_timestamp()-(2*24*60*60)))
OR
TABLE-a.date_col <= from_unixtime(unix_timestamp()-(3*24*60*60)))
...

Not sure if it still meet your requirement.

Cheers