Created 06-08-2016 10:34 AM
Hi all,
I am currently pulling the max value of a timestamp column from my tables in Hive and using it to pull data after that date with Sqoop; I am using Oozie to orchestrate these steps. This is currently done by running a query against the Hive table to write this value into HDFS, where it is picked up by another Oozie action and passed on to the Sqoop action. This all runs perfectly fine, however retrieving the max timestamp value and writing it into HDFS is currently very slow, and I can only see it getting slower as more data is inserted into the table. The Hive SQL I am using to pull this value is as below:
INSERT OVERWRITE DIRECTORY '${lastModifiedDateSaveLocation}' select max(${timestamp_column}) from ${hive_table};
Can anyone suggest a more optimized solution to retrieve this max timestamp?
Thanks for your help,
Dan
Created 06-08-2016 11:40 AM
In general, partitioning makes sense when you use date-organized tables. If you are sure, for example, that new data has only been inserted in the last few days, you can use this to restrict the amount of data that needs to be read, e.g. by partitioning by day and adding a WHERE condition at the end that only reads the last two days or so (see the sketch below).
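A minimal sketch of that approach, reusing the variables from your query; the day partition column ds and the '${two_days_ago}' parameter are placeholders I made up, not something from your workflow:

-- Hypothetical day-partitioned table, stored as ORC.
CREATE TABLE ${hive_table} (
  id BIGINT,
  payload STRING,
  ${timestamp_column} TIMESTAMP
)
PARTITIONED BY (ds STRING)   -- one partition per day, e.g. '2016-06-08'
STORED AS ORC;

-- Only the last two daily partitions get scanned instead of the whole table.
-- '${two_days_ago}' would be computed and passed in by the Oozie workflow.
INSERT OVERWRITE DIRECTORY '${lastModifiedDateSaveLocation}'
SELECT MAX(${timestamp_column})
FROM ${hive_table}
WHERE ds >= '${two_days_ago}';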
I assume you already use ORC? That data format has automatic indexing which should skip the majority of data blocks in your data set, assuming you can give a range the timestamp has to be in (like WHERE timestamp > one week ago). Since Hive 1.2.0 you also have the ability to read footers first, which makes even better use of this feature. (By default tasks are started but will close immediately; with the setting below the files are opened for a quick peek into the footer before tasks are even assigned.)
hive.exec.orc.split.strategy=ETL
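For example, a sketch only; '${one_week_ago}' is a hypothetical lower bound the workflow would have to supply:

-- Read ORC footers up front so stripes whose min/max statistics fall
-- outside the predicate are never scanned.
SET hive.exec.orc.split.strategy=ETL;

SELECT MAX(${timestamp_column})
FROM ${hive_table}
WHERE ${timestamp_column} > '${one_week_ago}';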
Finally, Hive can use statistics to answer a query like this directly, but you need to be sure the statistics are up to date, so stats autogather needs to be true. It can also lead to wrong results if you side-load data etc.
hive.compute.query.using.stats=true
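A sketch of how that could look; the explicit ANALYZE is only there in case autogather was off or data was side-loaded, and a partitioned table would need a PARTITION spec:

-- Answer MAX() from metastore column statistics instead of scanning data.
SET hive.compute.query.using.stats=true;
SET hive.stats.autogather=true;

-- Refresh column statistics if they might be stale.
ANALYZE TABLE ${hive_table} COMPUTE STATISTICS FOR COLUMNS;

SELECT MAX(${timestamp_column}) FROM ${hive_table};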
So in summary: use partitioning and ORC, and have a look at the parameters above and try them out.
Created 06-13-2016 02:15 PM
Thanks for the response Ben. The changes you have suggested have worked to a degree, and thanks to the use of partitions I know the query speed should not degrade any further, but the query can still take up to a minute to complete. I shall continue to look into other solutions to this issue and post them if I find any.
Created 06-13-2016 06:18 PM
@Daniel Perry I don't suppose the stats approach works? That should be instantaneous. The only other option I can think of is updating the value when you write your data, i.e. in your ingestion job select the biggest timestamp and write it into a file/Hive table. You then have it available immediately when you need it (it's like manual stats). A rough sketch of that idea is below.
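A minimal sketch of the "manual stats" idea, assuming a small made-up watermark table (ingest_watermark) and a '${current_load_day}' parameter identifying the partition that was just loaded:

-- One-row table that always holds the newest timestamp seen so far.
CREATE TABLE IF NOT EXISTS ingest_watermark (
  last_modified TIMESTAMP
);

-- At the end of each ingestion run, overwrite it with the maximum of the
-- batch just loaded (assumes incoming data is always newer than what is stored).
INSERT OVERWRITE TABLE ingest_watermark
SELECT MAX(${timestamp_column})
FROM ${hive_table}
WHERE ds = '${current_load_day}';

The Oozie workflow can then read this single row (or INSERT OVERWRITE DIRECTORY it into HDFS) instead of scanning the big table.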
Created 06-14-2016 09:34 AM
@Daniel Perry Oh, one last idea since you use Sqoop: if you cannot use the Sqoop metastore, you might be able to run Sqoop in a shell/ssh action, grep the output of the Sqoop incremental import (which displays the last value), save it in HDFS and use it next time. Of course a bit ugly, but ... a rough sketch is below.
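A rough shell sketch of that workaround; the paths, variables and especially the grep/sed pattern are assumptions, since the exact wording of the '--last-value' line in the import summary can differ between Sqoop versions:

#!/bin/bash
# Run the incremental import, feeding in the previously saved watermark,
# and keep the console output for parsing.
sqoop import \
  --connect "$JDBC_URL" --table "$SOURCE_TABLE" \
  --incremental lastmodified --check-column last_modified \
  --last-value "$(hdfs dfs -cat /data/watermarks/last_value 2>/dev/null || echo '1970-01-01 00:00:00')" \
  --target-dir "/data/staging/$SOURCE_TABLE" 2>&1 | tee sqoop.log

# Pull the new upper bound out of Sqoop's summary and store it for the next run.
NEW_VALUE=$(grep -- '--last-value' sqoop.log | tail -1 | sed "s/.*--last-value *//")
echo "$NEW_VALUE" | hdfs dfs -put -f - /data/watermarks/last_value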