Created 06-01-2016 05:23 PM
Hello everyone,
I have posted query below in which i have used 3 partition table and all are the external table pointing to some partition ditectory on hdfs.In every table partition is done on "ds" column(i.e. ds=2016-06-01-08).
ds=year-month-day-hour
On hourly basis data is landing on hdfs,creating directory as above timestamp and tables are pointing to all those partition directories.
A question is:
As already i explained all tables are partitioned,but
Where to write where clause in below query? so that data will be processed for that particular partition only.
Please let me know.
Query:
INSERT OVERWRITE TABLE falconexample.Patient_proce PARTITION (${falcon_output_partitions_hive}) select p.id,p.gender, p.Age, p.birthdate, o.component[1].valuequantity.value, o.component[1].valuequantity.unit from (select *, floor(datediff(to_date(from_unixtime(unix_timestamp())), to_date(birthdate)) / 365.25) as Age FROM falconexample. patient1) p inner join falconexample.DiagnosticReport1 d on p.id = substr(d.subject.reference,9) inner join falconexample.Observation1 o on p.id = substr(o.subject.reference,9) where p.Age>17 and p.Age<86 and o.component[1].valuequantity.value <140;
Created 06-01-2016 05:26 PM
In general, a SELECT query scans the entire table (other than for sampling). If a table created using the PARTITIONED BY clause, a query can do partition pruning and scan only a fraction of the table relevant to the partitions specified by the query. Hive currently does partition pruning if the partition predicates are specified in the WHERE clause or the ON clause in a JOIN. For example, if table page_views is partitioned on column date, the following query retrieves rows for just days between 2008-03-01 and 2008-03-31.
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31'
|
If a table page_views is joined with another table dim_users, you can specify a range of partitions in the ON clause as follows:
SELECT page_views.*
FROM page_views JOIN dim_users
ON (page_views.user_id = dim_users.id AND page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' )
|
Created 06-01-2016 05:46 PM
Thank you Sunile for giving such a quick response.
According to your answer,
I can write something like
............................................
.....falconexample.DiagnosticReport1 d on (p.id = substr(d.subject.reference,9) and p.ds='2016-06-01-10') inner join falconexample.Observation1 o on (p.id = substr(o.subject.reference,9)and o.ds='2016-06-01-10' and d.ds='2016-06-01-10')
Right?
Created 06-01-2016 06:06 PM
@Manoj Dhake as long as the those dates are within the partition range then yes it will only query the partition. the CBO will help make that determination.
Created 06-02-2016 07:31 PM
@Manoj Dhake please advise if I have answered your question.
Created 09-29-2017 06:30 AM
i don't think that hive does pruning on the table at left side during join, even if you provide the predicate in ON clause. That works in sql but not in hive. i am using hive 1.2. you can check this fact by using "explain <query>" . it is better to have the table pruned using sub query , and then join it with another table. let me know your thoughts.
Created 06-02-2016 08:22 AM
floor(datediff(to_date(from_unixtime(unix_timestamp())), to_date(birthdate)) / 365.25
That unix_timestamp() could turn off a few optimizations in the planner, which might not be related to this issue.
Start using CURRENT_DATE instead of the unix_timestamp() call, for faster queries.