I used org.openx.data.jsonserde.JsonSerDe to load log data (including complex fields such as Map) into a Hive external table. I can query the data correctly, but query response time is high, so I created an ORC table derived from the staging table using CREATE EXTERNAL TABLE <orc_log> STORED AS ORC TBLPROPERTIES("orc.compress"="SNAPPY") AS SELECT * FROM orig_log;
There are two questions:
1. How do I keep the ORC table in sync with orig_log, given that data is loaded into orig_log incrementally?
2. The ANALYZE TABLE statement fails for both orig_log and orc_log because complex data types such as Map are not supported.
It would be great if you could suggest a way to overcome/resolve this issue. Thanks in advance!
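For reference, here is a minimal sketch of the setup you describe (table schema, column names, and the HDFS location are placeholders — adjust to your actual log structure). Note that many Hive versions before 4.0 reject `CREATE EXTERNAL TABLE ... AS SELECT`; a plain CTAS creates a managed table:

```sql
-- Staging table over raw JSON logs, using the openx JsonSerDe
CREATE EXTERNAL TABLE orig_log (
  ts    STRING,
  level STRING,
  attrs MAP<STRING,STRING>   -- complex type parsed from the JSON payload
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/data/logs/raw';

-- ORC copy for faster queries (managed table; CTAS cannot create
-- an external table in most Hive versions)
CREATE TABLE orc_log
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY")
AS SELECT * FROM orig_log;
```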
Since your data consists of logs, I assume they arrive in a timely manner (today you are importing today's or yesterday's logs, and so on). In that case, a suitable option would be to partition both tables by day (or a different time granularity, depending on how much data you have) and then sync only the changed partitions, overwriting the ORC table partitions with the content of the updated JSON table partitions.
As a side note, I think you don't need to create the ORC table as external.
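Assuming a day value can be derived from each log record (here from a `ts` column via `to_date`; all names are illustrative), the partitioned layout and the periodic per-partition sync might look like this:

```sql
-- ORC table partitioned by day
CREATE TABLE orc_log (
  ts    STRING,
  level STRING,
  attrs MAP<STRING,STRING>
)
PARTITIONED BY (dt STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");

-- Periodic sync job: rewrite only the partitions that received new JSON data,
-- using dynamic partitioning so each day lands in its own partition.
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE orc_log PARTITION (dt)
SELECT ts, level, attrs, to_date(ts) AS dt
FROM orig_log
WHERE to_date(ts) >= date_sub(current_date, 1);  -- e.g. yesterday and today
```

Rerunning the `INSERT OVERWRITE` is idempotent for the selected partitions: each run replaces their contents wholesale, so late-arriving rows in a recent day are picked up without touching older partitions.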
Thanks for your reply. So, after partitioning both tables by a day column, I need to use the INSERT OVERWRITE command to periodically load the data from orig_log into the orc_log table, right?