
Memory configuration for better performance on LLAP, Druid

Explorer

Hi,

I have 6 historical nodes with 18 cores and 64 GB of RAM.

When I load around 7 million records (8 dimensions, 2 metrics), it takes Druid about 30 minutes to do the indexing. I am using the following command via Beeline:

CREATE TABLE test_druid
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
  "druid.datasource" = "test_druid",
  "druid.segment.granularity" = "MONTH",
  "druid.query.granularity" = "DAY"
)
AS
SELECT cast(trans_date AS timestamp) AS `__time`, col1, col2, col3
FROM testdb.test_hive_Table
WHERE to_date(trans_Date) = '2018-01-01';

Is it expected to take this much time? What are the recommended configurations for better performance?

Please suggest. Thanks in advance.


Expert Contributor

The speed at which the indexes are generated depends on the Hive LLAP workers, not Druid itself.

You might try changing "druid.segment.granularity" = "MONTH" to "DAY". That gives you more parallelism, so the indexing might run faster if LLAP has enough resources.
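For example, your original statement with only the segment granularity changed would look something like this (just a sketch; table and column names are taken from your statement above):

CREATE TABLE test_druid
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
  "druid.datasource" = "test_druid",
  -- DAY instead of MONTH: one segment (and more indexing parallelism) per day of data
  "druid.segment.granularity" = "DAY",
  "druid.query.granularity" = "DAY"
)
AS
SELECT cast(trans_date AS timestamp) AS `__time`, col1, col2, col3
FROM testdb.test_hive_Table
WHERE to_date(trans_Date) = '2018-01-01';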

Explorer

Hi Slim,

Thanks for your response. For my benefit, I am posting my understanding below; can you please confirm it? At the end, I am posting a follow-up question.

Case 1: In the Hive table (only the timestamp differs for each customer):

trans_timestamp,customer,product,store,qty,amount
2017-01-01 14:01:01,c1,p1,s1,1,10
2017-01-01 14:10:01,c1,p1,s1,2,20
2017-01-01 14:20:01,c1,p1,s1,3,30
2017-01-01 14:01:01,c2,p1,s1,4,40
2017-01-01 14:10:01,c2,p1,s1,5,50
2017-01-01 14:20:01,c2,p1,s1,6,60

Config: segment.granularity=DAY and query.granularity=HOUR

In Druid segments: one segment per day; only 1 segment, since we have only 1 day of data. The 6 rows are rolled up into 2 rows:

trans_timestamp,customer,product,store,qty,amount
2017-01-01 14:00:00,c1,p1,s1,6,60
2017-01-01 14:00:00,c2,p1,s1,15,150
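As I understand it, that HOUR query-granularity roll-up is equivalent to a GROUP BY along these lines (just a HiveQL sketch of what Druid does at ingestion time; the table name test_hive_table is hypothetical):

SELECT date_format(trans_timestamp, 'yyyy-MM-dd HH:00:00') AS trans_timestamp, -- truncate to the hour
       customer, product, store,
       sum(qty)    AS qty,
       sum(amount) AS amount
FROM test_hive_table
GROUP BY date_format(trans_timestamp, 'yyyy-MM-dd HH:00:00'), customer, product, store;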

Case 2: In the Hive table:

trans_timestamp,customer,product,store,qty,amount
2017-01-01 00:00:00,c1,p1,s1,1,10
2017-02-01 00:00:00,c1,p2,s1,2,20
2017-03-01 00:00:00,c1,p3,s1,3,30
2017-01-01 00:00:00,c2,p1,s1,4,40
2017-02-01 00:00:00,c2,p2,s1,5,50
2017-03-01 00:00:00,c2,p3,s1,6,60

Config: segment.granularity=MONTH and query.granularity=DAY

In Druid segments: one segment per month; 3 segments in total, since we have 3 months of data.

trans_timestamp,customer,product,store,qty,amount

segment1:
2017-01-01 00:00:00,c1,p1,s1,1,10
2017-01-01 00:00:00,c2,p1,s1,4,40

segment2:
2017-02-01 00:00:00,c1,p2,s1,2,20
2017-02-01 00:00:00,c2,p2,s1,5,50

segment3:
2017-03-01 00:00:00,c1,p3,s1,3,30
2017-03-01 00:00:00,c2,p3,s1,6,60
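The statement I would use to get this Case 2 layout would follow the same pattern as the one in my first post (a sketch only; test_druid_case2 and testdb.test_hive_table are hypothetical names for this example):

CREATE TABLE test_druid_case2
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
  "druid.datasource" = "test_druid_case2",
  "druid.segment.granularity" = "MONTH", -- one segment per month => 3 segments for 3 months of data
  "druid.query.granularity" = "DAY"      -- rows rolled up to day boundaries within each segment
)
AS
SELECT cast(trans_timestamp AS timestamp) AS `__time`,
       customer, product, store, qty, amount
FROM testdb.test_hive_table;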

Question:

Say my daily data volume is around 10 million records, so my monthly volume is ~300 million. If I use a segment granularity of DAY, then I will have 1000+ segments (assuming 3 years of data). If I want to query by month (group by month), then a lot of segments need to be processed.

On the other hand, if I set the segment granularity to MONTH, then my segments would be huge. Does segment size have an impact on performance?

Based on your experience, what is recommended? What are people usually using?

In my case, the granularity of the data is a day (thankfully, no finer timestamp is involved).
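To make it concrete, the month-level query I have in mind is something like this (a sketch against the Druid-backed Hive table, assuming the qty and amount metrics from the example above; whether the grouping is pushed down to Druid is part of my question):

SELECT year(`__time`)  AS yr,
       month(`__time`) AS mth,
       sum(qty)        AS total_qty,
       sum(amount)     AS total_amount
FROM test_druid
GROUP BY year(`__time`), month(`__time`);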

Please suggest. Thanks in advance.
