Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

How to ask hive query to fetch data for specific partition?

avatar
Super Collaborator

Hello everyone,

I have posted query below in which i have used 3 partition table and all are the external table pointing to some partition ditectory on hdfs.In every table partition is done on "ds" column(i.e. ds=2016-06-01-08).

ds=year-month-day-hour

On hourly basis data is landing on hdfs,creating directory as above timestamp and tables are pointing to all those partition directories.

A question is:

As already i explained all tables are partitioned,but

Where to write where clause in below query? so that data will be processed for that particular partition only.

Please let me know.

Query:

INSERT OVERWRITE TABLE falconexample.Patient_proce PARTITION (${falcon_output_partitions_hive}) select p.id,p.gender, p.Age, p.birthdate, o.component[1].valuequantity.value, o.component[1].valuequantity.unit from (select *, floor(datediff(to_date(from_unixtime(unix_timestamp())), to_date(birthdate)) / 365.25) as Age FROM falconexample. patient1) p inner join falconexample.DiagnosticReport1 d on p.id = substr(d.subject.reference,9) inner join falconexample.Observation1 o on p.id = substr(o.subject.reference,9) where p.Age>17 and p.Age<86 and o.component[1].valuequantity.value <140;

6 REPLIES 6

avatar
Master Guru

@Manoj Dhake

Partition Based Queries

In general, a SELECT query scans the entire table (other than for sampling). If a table created using the PARTITIONED BY clause, a query can do partition pruning and scan only a fraction of the table relevant to the partitions specified by the query. Hive currently does partition pruning if the partition predicates are specified in the WHERE clause or the ON clause in a JOIN. For example, if table page_views is partitioned on column date, the following query retrieves rows for just days between 2008-03-01 and 2008-03-31.

SELECT page_views.* FROM page_views WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31'

If a table page_views is joined with another table dim_users, you can specify a range of partitions in the ON clause as follows:

SELECT page_views.* FROM page_views JOIN dim_users ON (page_views.user_id = dim_users.id AND page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31')

avatar
Super Collaborator

Thank you Sunile for giving such a quick response.

According to your answer,

I can write something like

............................................

.....falconexample.DiagnosticReport1 d on (p.id = substr(d.subject.reference,9) and p.ds='2016-06-01-10') inner join falconexample.Observation1 o on (p.id = substr(o.subject.reference,9)and o.ds='2016-06-01-10' and d.ds='2016-06-01-10')

Right?

avatar
Master Guru

@Manoj Dhake as long as the those dates are within the partition range then yes it will only query the partition. the CBO will help make that determination.

avatar
Master Guru

@Manoj Dhake please advise if I have answered your question.

avatar
New Contributor

i don't think that hive does pruning on the table at left side during join, even if you provide the predicate in ON clause. That works in sql but not in hive. i am using hive 1.2. you can check this fact by using "explain <query>" . it is better to have the table pruned using sub query , and then join it with another table. let me know your thoughts.

avatar
Expert Contributor
floor(datediff(to_date(from_unixtime(unix_timestamp())), to_date(birthdate)) / 365.25

That unix_timestamp() could turn off a few optimizations in the planner, which might not be related to this issue.

Start using CURRENT_DATE instead of the unix_timestamp() call, for faster queries.