I have a Hive table with 400 million records, using HDP 2.3. It is a simple table with 7 string fields, stored as ORC with ZLIB compression.
A select query for a single entry from this table takes about 25 minutes, which is far too long since query time grows with the number of entries. I expect to have 4 billion records in 5 years, so will one query then take 4 hours?!
Do you have any idea how to decrease the query time?
Hello @Marco Gaido, my query would be something like:
select originalxml from tablename where messageid='b49014bb-0267-4578-a53-5a829f973xxd' I have a small 4-node cluster (4x 32 GB RAM, 6 cores); HDFS is on EMC Isilon.
Execution engine = Tez
Since you are using ORC, try using bloom filters on your data and, if possible, insert the data in sorted order. This combination can help reduce the number of blocks that need to be queried.
This is a map-only query, so all the time goes to running and finishing all the mappers. Given your cluster size, I think your query is not getting all mapper slots in one go. So bloom filters should definitely help you.
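A minimal sketch of what that could look like, assuming the table and column names from the question (tablename, messageid, originalxml) and illustrative FPP/stride values:

```sql
-- Hypothetical re-creation of the table with a bloom filter on the lookup column.
CREATE TABLE tablename_bf (
  messageid   STRING,
  originalxml STRING
  -- ... remaining string columns ...
)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'             = 'ZLIB',
  'orc.bloom.filter.columns' = 'messageid',
  'orc.bloom.filter.fpp'     = '0.05'
);

-- Load the data sorted on the filtered column so each ORC stripe
-- covers a narrow messageid range and most stripes can be skipped.
INSERT OVERWRITE TABLE tablename_bf
SELECT messageid, originalxml
FROM tablename
SORT BY messageid;
```

With predicate push-down enabled, a lookup like `WHERE messageid = '...'` can then skip stripes whose bloom filter rules out the value.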
@Ravi Mutyala I tried with bloom filters as you suggested, but it is the same, no improvement in timing. I created a test table according to this link:
Does your table have any column like a date/time stamp of when the data was ingested, or something else to serve as a partition key? I think adding partitions or bucketing could significantly reduce table scans. You can get an explain plan to understand how the data is being accessed, and maybe look at some recommendations on how to make the queries run faster.
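As a sketch, assuming a hypothetical ingest_date column were available as the partition key, partitioning and bucketing might look like this:

```sql
-- Hypothetical partitioned + bucketed layout; ingest_date is an assumed column.
CREATE TABLE tablename_part (
  messageid   STRING,
  originalxml STRING
)
PARTITIONED BY (ingest_date STRING)
CLUSTERED BY (messageid) INTO 32 BUCKETS
STORED AS ORC;

-- Check how a lookup is executed (which partitions/files are scanned):
EXPLAIN
SELECT originalxml
FROM tablename_part
WHERE ingest_date = '2016-01-15'
  AND messageid  = 'b49014bb-0267-4578-a53-5a829f973xxd';
```

A lookup that also filters on the partition column touches only that partition's files instead of scanning the whole table.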
Please see this article on how to enable vectorization and some other settings to get some immediate improvements.
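For reference, these are the standard session settings for vectorized execution and ORC predicate push-down (values shown are the usual recommendations, not taken from the thread):

```sql
-- Process rows in batches instead of one at a time (ORC tables only).
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;

-- Push WHERE predicates down into the ORC reader so min/max indexes
-- and bloom filters can skip stripes.
SET hive.optimize.index.filter = true;

-- Make sure Tez is the execution engine (already the case here).
SET hive.execution.engine = tez;
```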
Hello @Predrag Minovic, thank you for the suggestion. The HBase service is not active at the moment, since I did not expect it to be necessary during installation. Would it be possible to contact you by email to explain the situation?