- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
How to ask hive query to fetch data for specific partition?
- Labels:
-
Apache Hive
Created ‎06-01-2016 05:23 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello everyone,
I have posted query below in which i have used 3 partition table and all are the external table pointing to some partition ditectory on hdfs.In every table partition is done on "ds" column(i.e. ds=2016-06-01-08).
ds=year-month-day-hour
On hourly basis data is landing on hdfs,creating directory as above timestamp and tables are pointing to all those partition directories.
A question is:
As already i explained all tables are partitioned,but
Where to write where clause in below query? so that data will be processed for that particular partition only.
Please let me know.
Query:
INSERT OVERWRITE TABLE falconexample.Patient_proce PARTITION (${falcon_output_partitions_hive}) select p.id,p.gender, p.Age, p.birthdate, o.component[1].valuequantity.value, o.component[1].valuequantity.unit from (select *, floor(datediff(to_date(from_unixtime(unix_timestamp())), to_date(birthdate)) / 365.25) as Age FROM falconexample. patient1) p inner join falconexample.DiagnosticReport1 d on p.id = substr(d.subject.reference,9) inner join falconexample.Observation1 o on p.id = substr(o.subject.reference,9) where p.Age>17 and p.Age<86 and o.component[1].valuequantity.value <140;
Created ‎06-01-2016 05:26 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Partition Based Queries
In general, a SELECT query scans the entire table (other than for sampling). If a table created using the PARTITIONED BY clause, a query can do partition pruning and scan only a fraction of the table relevant to the partitions specified by the query. Hive currently does partition pruning if the partition predicates are specified in the WHERE clause or the ON clause in a JOIN. For example, if table page_views is partitioned on column date, the following query retrieves rows for just days between 2008-03-01 and 2008-03-31.
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31'
|
If a table page_views is joined with another table dim_users, you can specify a range of partitions in the ON clause as follows:
SELECT page_views.*
FROM page_views JOIN dim_users
ON (page_views.user_id = dim_users.id AND page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' )
|
- See also Group By.
- See also Sort By / Cluster By / Distribute By / Order By.
Created ‎06-01-2016 05:46 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thank you Sunile for giving such a quick response.
According to your answer,
I can write something like
............................................
.....falconexample.DiagnosticReport1 d on (p.id = substr(d.subject.reference,9) and p.ds='2016-06-01-10') inner join falconexample.Observation1 o on (p.id = substr(o.subject.reference,9)and o.ds='2016-06-01-10' and d.ds='2016-06-01-10')
Right?
Created ‎06-01-2016 06:06 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@Manoj Dhake as long as the those dates are within the partition range then yes it will only query the partition. the CBO will help make that determination.
Created ‎06-02-2016 07:31 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@Manoj Dhake please advise if I have answered your question.
Created ‎09-29-2017 06:30 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
i don't think that hive does pruning on the table at left side during join, even if you provide the predicate in ON clause. That works in sql but not in hive. i am using hive 1.2. you can check this fact by using "explain <query>" . it is better to have the table pruned using sub query , and then join it with another table. let me know your thoughts.
Created ‎06-02-2016 08:22 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
floor(datediff(to_date(from_unixtime(unix_timestamp())), to_date(birthdate)) / 365.25
That unix_timestamp() could turn off a few optimizations in the planner, which might not be related to this issue.
Start using CURRENT_DATE instead of the unix_timestamp() call, for faster queries.
