Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Hive- Cross join-Optimization

Highlighted

Hive- Cross join-Optimization

New Contributor

Hi,

I have 2 tables in Hive. Table-a has millions of rows and many columns. Table-b has just few rows with two columns (key as tablename, value as 1).

Now I do the following,

1) Cross join table a and b.
2) Based on table A's on of the date column, we need to pass the table b's column "value" as "1" or "2" and bring the table A's date column less than one or two days.

e.g.

select * from table-a, table-b where 1=1 and table-a.date_col <= from_unixtime(unix_timestamp()-(table-b.value*24*60*60)))  - while the table-b.value is 1 then it subtracts one day or any number of hours. 

Basically, we need to pass the table-b value as parameter according to our requirements. This cross join works for smaller tables. When the data of table-a is huge, it creates performance or memory issue. 

Please advise on any solution to fine tune this approach or achieve the result using any alternate approach and overcome this issue in HIVE. 

Thanks.

3 REPLIES 3

Re: Hive- Cross join-Optimization

Guru
Does it have to be Cross JOIN? You are generating tens or hundreds of million rows, it is very intensive. Is it true that for every row in table-b, you have to join across in table-a?

Cheers
Eric
Highlighted

Re: Hive- Cross join-Optimization

New Contributor

The table-b has only one row with a value 1. My intention is to pass this value in TABLE-b.value, and subtract the number of days in the query. If I pass 1 then it should subtract one day from the TABLE-a.date_col. If I pass 2 then it should give me records that are 2 days lesser than TABLE-a.date_col. 

 

SELECT *
FROM TABLE-a,
TABLE-b
WHERE 1=1
AND TABLE-a.date_col <= from_unixtime(unix_timestamp()-(TABLE-b.value*24*60*60)))

Highlighted

Re: Hive- Cross join-Optimization

Guru
Is table-b's value fixed? You mentioned 1 or 2. Are they pre-defined?

If you know the values up-front, then you don't need to do cross join, just add conditions based on the fixed number of values, like below:

SELECT * FROM table-a WHERE
TABLE-a.date_col <= from_unixtime(unix_timestamp()-(1*24*60*60)))
OR
TABLE-a.date_col <= from_unixtime(unix_timestamp()-(2*24*60*60)))
OR
TABLE-a.date_col <= from_unixtime(unix_timestamp()-(3*24*60*60)))
...

Not sure if it still meet your requirement.

Cheers
Don't have an account?
Coming from Hortonworks? Activate your account here