Created on 05-02-2018 12:32 PM - edited 09-16-2022 06:10 AM
Hi,
I have 6 historical nodes, each with 18 cores and 64 GB RAM.
When I load around 7 million records (8 dimensions, 2 metrics), Druid takes around 30 minutes to do the indexing. I am using the following command via beeline:
CREATE TABLE test_druid
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
  "druid.datasource" = "test_druid",
  "druid.segment.granularity" = "MONTH",
  "druid.query.granularity" = "DAY")
AS
SELECT cast(trans_date as timestamp) as `__time`, col1, col2, col3
FROM testdb.test_hive_Table
WHERE to_date(trans_Date) = '2018-01-01';
Is it expected to take this long? What are the recommended configurations for better performance?
Please suggest. Thanks in advance.
Created 05-02-2018 02:29 PM
The speed at which the indexes are generated depends on the Hive LLAP workers, not Druid itself.
You might try changing "druid.segment.granularity" = "MONTH" to "DAY"; that gives you more parallelism and thus might run faster, provided LLAP has enough resources.
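For example, your statement would become something like this (simply your original CTAS with the segment granularity switched to DAY):

-- same CTAS as above, only "druid.segment.granularity" changed from "MONTH" to "DAY"
CREATE TABLE test_druid
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
  "druid.datasource" = "test_druid",
  "druid.segment.granularity" = "DAY",
  "druid.query.granularity" = "DAY")
AS
SELECT cast(trans_date as timestamp) as `__time`, col1, col2, col3
FROM testdb.test_hive_Table
WHERE to_date(trans_Date) = '2018-01-01';

With DAY segments, each day in the source data can be built as its own segment, so the index build can be spread across more LLAP tasks.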
Created 05-03-2018 01:02 PM
Hi Slim,
Thanks for your response. For my own benefit, I am posting my understanding below; can you please confirm it? At the end, I am posting a follow-up question.
Case 1: In the Hive table (for each customer, only the timestamp differs between rows):
trans_timestamp,customer,product,store,qty,amount
2017-01-01 14:01:01,c1,p1,s1,1,10
2017-01-01 14:10:01,c1,p1,s1,2,20
2017-01-01 14:20:01,c1,p1,s1,3,30
2017-01-01 14:01:01,c2,p1,s1,4,40
2017-01-01 14:10:01,c2,p1,s1,5,50
2017-01-01 14:20:01,c2,p1,s1,6,60
Config: segment.granularity=DAY and query.granularity=HOUR
In Druid segments: one segment per day, so only 1 segment since we have just 1 day of data. The 6 rows are rolled up into 2 rows:
trans_timestamp,customer,product,store,qty,amount
2017-01-01 14:00:00,c1,p1,s1,6,60
2017-01-01 14:00:00,c2,p1,s1,15,150
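For completeness, the CTAS I have in mind for this case would be roughly the following (a sketch; druid_case1 and testdb.hive_sales are placeholder names for the example):

-- placeholder table/source names, just to illustrate the DAY/HOUR configuration
CREATE TABLE druid_case1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
  "druid.datasource" = "druid_case1",
  "druid.segment.granularity" = "DAY",
  "druid.query.granularity" = "HOUR")
AS
SELECT cast(trans_timestamp as timestamp) as `__time`,
       customer, product, store, qty, amount
FROM testdb.hive_sales;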
Case 2: In the Hive table:
trans_timestamp,customer,product,store,qty,amount
2017-01-01 00:00:00,c1,p1,s1,1,10
2017-02-01 00:00:00,c1,p2,s1,2,20
2017-03-01 00:00:00,c1,p3,s1,3,30
2017-01-01 00:00:00,c2,p1,s1,4,40
2017-02-01 00:00:00,c2,p2,s1,5,50
2017-03-01 00:00:00,c2,p3,s1,6,60
Config: segment.granularity=MONTH and query.granularity=DAY
In Druid segments: one segment per month, so 3 segments in total since we have 3 months of data.
trans_timestamp,customer,product,store,qty,amount
segment1:
2017-01-01 00:00:00,c1,p1,s1,1,10
2017-01-01 00:00:00,c2,p1,s1,4,40
segment2:
2017-02-01 00:00:00,c1,p2,s1,2,20
2017-02-01 00:00:00,c2,p2,s1,5,50
segment3:
2017-03-01 00:00:00,c1,p3,s1,3,30
2017-03-01 00:00:00,c2,p3,s1,6,60
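And the corresponding CTAS for this case would be roughly (again a sketch with the same placeholder names):

-- same placeholder names as above; only the granularity properties change
CREATE TABLE druid_case2
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
  "druid.datasource" = "druid_case2",
  "druid.segment.granularity" = "MONTH",
  "druid.query.granularity" = "DAY")
AS
SELECT cast(trans_timestamp as timestamp) as `__time`,
       customer, product, store, qty, amount
FROM testdb.hive_sales;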
Question:
If I have a daily data volume of, say, 10 million records, my monthly volume is ~300 million. If I use a segment granularity of DAY, I will have 1000+ segments (assuming 3 years of data). If I then want to query by month (group by month, as in the sketch below), a lot of segments need to be processed.
On the other hand, if I set the segment granularity to MONTH, my segments would be huge. Does segment size have an impact on performance?
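To make the access pattern concrete, the month-level query I have in mind is roughly the following (a sketch against the placeholder druid_case2 table from my example above):

-- month-level aggregation; with DAY segments this touches roughly 30 segments per month queried,
-- with MONTH segments it touches 1 segment per month
SELECT year(`__time`) as yr,
       month(`__time`) as mth,
       sum(qty) as total_qty,
       sum(amount) as total_amount
FROM druid_case2
GROUP BY year(`__time`), month(`__time`);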
Based on your experience, what is recommended? What are people usually using?
In my case, the granularity of the data is day (thankfully, no intra-day timestamps are involved).
Please suggest. Thanks in advance.