i have using HDP 3.1 and create materialize view with druid handler, but when i try to rebuild this mv, it takes long time to complete. all fact and dimension tables are ACID tables.
example DDL as below:
CREATE MATERIALIZED VIEW dwh.druid_fct_txn_sun_mv STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler' TBLPROPERTIES ( 'druid.segment.granularity'='DAY', 'druid.query.granularity'='MINUTE') AS select .....
1. how to verify if this MV is incremental rebuild?
2. how to check writeID on this MV?
As all fact and dimension tables are ACID tables. You have to understand what compaction is and why they are needed in Hive ACID.
ACID Tables, allow every transaction (INSERT, UPDATE, DELETE) on a bucket create a new files stored inside delta directories so each table partition contains one base directory and several delta directories see below:
[email@example.com] hdfs dfs -ls -R /warehouse/tablespace/managed/hive/clients.db/t
drwxr-xr-x - hive hive 0 2019-11-28 17:03 /warehouse/tablespace/managed/hive/clients.db/t/base_0000022
-rw-r--r-- 1 hive hive 602 2019-11-28 17:03 /warehouse/tablespace/managed/hive/clients.db/t/base_0000022/bucket_00000
drwxr-xr-x - hive hive 0 2019-11-28 17:06 /warehouse/tablespace/managed/hive/clients.db/t/delta_0000023_0000023_0000
-rw-r--r-- 1 hive hive 611 2019-11-28 17:06 /warehouse/tablespace/managed/hive/clients.db/t/delta_0000023_0000023_0000/bucket_00000
drwxr-xr-x - hive hive 0 2019-11-28 17:07 /warehouse/tablespace/managed/hive/clients.db/t/delta_0000024_0000024_0000
-rw-r--r-- 1 hive hive 610 2019-11-28 17:07 /warehouse/tablespace/managed/hive/clients.db/t/delta_0000024_0000024_0000/bucket_00000
As you can expect, there are as many delta directories in a partition as transactions made (actually transactions are grouped and committed in batches). This can rapidly become a problem in case of streaming ingestion as Hive will have to read more and more files on every request.
Minor compactions to merge the delta files of one bucket into one delta file;
Major compactions to merge all the delta files of one bucket to the existing base file of this bucket
Compaction might not be triggering that's why the rebuild is so slow?
A good practice is to run compaction
-- Major compaction query example
ALTER TABLE Clients_db.tbl1 PARTITION (col1 = 'value1') COMPACT 'major';
This is how you can ensure the performance of Hive if you encounter the same troubles with compactions
To enable Materialized view rewriting you should set this global property to TRUE
SET hive.materializedview.rewriting.time.window=10min; [ Rebuild operation run periodically, e.g., every 10 minutes]
Then you can select which view is enabled for rewriting
ALTER MATERIALIZED VIEW mv ENABLE|DISABLE REWRITE.
Once the MV has been created, it is populated with the data present in the tables. But after data has been modified in the source table, it is not automatically reflected in the MV. Hence, The MV needs to be rebuilt using the command:
ALTER MATERIALIZED VIEW mv REBUILD;
This ensures the optimizer will always attempt an incremental rebuild the MV an incremental rebuild can only be made when new data has been inserted. But In the case of DELETE or UPDATE in the table, a full rebuild will be applied.
To get useful information you could run the below against your MV
DESCRIBE FORMATTED mv1;
In the Table parameters, the transient_lastDdlTime should give you an indication of the last incremental rebuild.
Thank you very much for a detail solution on compaction Fact Table partition. i have tested and the rebuild mv after compactions and it has change the ways it works before, but i still encountered problem with taking long time on rebuild mv.
If you are rebuilding the materialized View [MV] it takes into account all the changes made on the base tables.
Do you execute a Hive query to set the location of the Druid broker using an address and port of the broker text listening port?
Materialized views (MV) are more than traditional views, they bring:
Incremental rebuilds are the least Memory intensive but a good approach is to rebuild if the materialized view uses non-transactional tables you cannot verify whether its contents are outdated, however you can use the automatic rewriting. You can combine a rebuild operation run periodically, e.g. every 5 or 10 minutes, and define the required freshness of the materialized view data using the hive.materializedview.rewriting.time.window configuration parameter, for instance:
What you could do to test is set the above parameter and then update the underlying table then wait for 10 to 11 minutes then query the MV to see if it has been updated, that's more efficient than rebuilding the MV
Hope that helps, please revert.
Thank you for your prompted respond
Do you execute a Hive query to set the location of the Druid broker using an address and port of the broker text listening port? No because it is already set via ambari.
Do this setting SET hive.materializedview.rewriting.time.window=10min; on Advanced hive-interactive-site or Advance hive-site ?
In order to enable query rewriting using Materialized views, this global property is needed. You should set that global parameter in Custom hive-site
Thereafter restart all the stale components Can you also share the values of
hive.materializedview.rewriting.incremental in Advanced hive-interactive-site and Advanced hive-site
Next time at MV creation enable the rewrite at creation time
ALTER MATERIALIZED VIEW [db_name.]materialized_view_name ENABLE|DISABLE REWRITE;
To test update the source tables and wait after the 10 minutes to see if the values get reflected in the MV
Hello there just trying to follow up on this thread, did the response help you resolve the issue or do you still need further help.
It would be a good idea to update so you get the help needed.