If ACID tables are not used, how can the small-files problem be handled in Hive? Is there an archival process to follow, such as creating HAR files?
Alter Table/Partition Concatenate
In Hive release 0.8.0, RCFile added support for fast block-level merging of small RCFiles using the CONCATENATE command.
In Hive release 0.14.0, ORC added support for fast stripe-level merging of small ORC files using the CONCATENATE command.
ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] CONCATENATE;
If the table or partition contains many small RCFiles or ORC files, the above command will merge them into larger files. In the case of RCFile the merge happens at block level, whereas for ORC files it happens at stripe level, thereby avoiding the overhead of decompressing and decoding the data.
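For example, assuming a partitioned ORC table named web_logs (the table, partition key, and values here are illustrative), the merge can be run per partition or on a whole unpartitioned table:

```sql
-- Merge the small ORC files within one partition into larger files.
-- Table and partition names are illustrative.
ALTER TABLE web_logs PARTITION (dt = '2016-01-01') CONCATENATE;

-- For an unpartitioned table, concatenate the whole table:
ALTER TABLE web_logs_staging CONCATENATE;
```

Each run rewrites the affected partition's files, so it is typically scheduled after batch loads that produce many small files.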
Question on Mutations
So if we need to apply a thousand mutations, this would be a thousand operations, rather than one bulk operation.
Can LLAP be used to read more data than can fit into memory?
Yes. LLAP has an eviction policy and stores cached data in a compressed format, so it can read more data than fits into memory.
Question on data transfer:
As a specific example, if a "SELECT *" is performed on a very large table, can the application receive that data as a stream, or does some component (LLAP, HiveServer2, etc.) need to hold the entire dataset in memory?
All results are streamed to HDFS, and the client then reads the results from there, so there is no memory constraint.
Question on query result:
Related: does LLAP send results back as they become available (like HBase scan results), or only once the query completes?
The results are returned only once the SQL query completes.
Question on compaction:
We may benefit from Hive's ACID feature to handle "deltas". The advantages seem to be:
•It would allow updated data to be available in queries before a compaction has taken place: "You can update the data; compaction should be transparent."
•A compaction implementation already exists, so no bespoke implementation is needed: Hive has built-in compaction (major and minor).
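As a sketch, an ACID table and manually triggered compactions might look like the following (the table name, bucketing, and column layout are illustrative; ACID tables must be stored as ORC and marked transactional, and compaction can also run automatically in the background):

```sql
-- Transactional (ACID) table; UPDATE/DELETE operations produce delta files.
CREATE TABLE events (
  id INT,
  payload STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Deltas remain queryable before compaction; compaction merges them:
ALTER TABLE events COMPACT 'minor';  -- merge delta files together
ALTER TABLE events COMPACT 'major';  -- rewrite base + deltas into a new base
```

SHOW COMPACTIONS can be used to monitor the compaction queue and its progress.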
Question on spark and hive llap integration:
•Can LLAP be leveraged to serve data to Spark jobs efficiently? That is, can LLAP inform Spark about the partitioning of the data it will provide, or is it a very coarse, plain JDBC interface?
The LLAP Spark context is in Tech Preview (TP).
Question on cache eviction algorithm:
It seems the Hive Metastore does not cache much data, which means each query for metadata (including statistics) goes through the DataNucleus ORM layer. Is this correct?
LLAP has a metadata cache.
The daemon caches metadata for input files, as well as the data. The metadata and index information can be cached even for data that is not currently cached. Metadata is stored in-process in Java objects; cached data is stored in the format described in the I/O section and kept off-heap (see Resource management).
Eviction policy. The eviction policy is tuned for analytical workloads with frequent (partial) table-scans. Initially, a simple policy like LRFU is used. The policy is pluggable.
Caching granularity. Column-chunks are the unit of data in the cache. This achieves a compromise between low-overhead processing and storage efficiency. The granularity of the chunks depends on the particular file format and execution engine (Vectorized Row Batch size, ORC stripe, etc.).
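The cache behavior described above is controlled by LLAP I/O settings in the daemon configuration (hive-site.xml, or the Hive interactive config panel in Ambari). A hedged sketch follows; the property names come from Hive's configuration, but the values are illustrative and defaults vary by Hive version:

```properties
# Illustrative LLAP daemon cache settings (verify against your Hive version)
hive.llap.io.enabled=true       # enable the LLAP I/O layer and cache
hive.llap.io.memory.size=4Gb    # off-heap cache size per daemon
hive.llap.io.use.lrfu=true      # use the LRFU eviction policy
```

Since these apply to the LLAP daemons themselves, they take effect when the daemons are (re)started, not via a per-session SET.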
A bloom filter is automatically created to provide Dynamic Runtime Filtering.
Question on running Hive LLAP on specific nodes:
In Ambari, how can the LLAP daemons be made to run on specific nodes?
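LLAP daemons run inside a YARN application, so one common approach (an assumption here, not stated in the source) is to constrain placement with YARN node labels and point the LLAP queue at that label. A hedged sketch, with the label and host names purely illustrative:

```shell
# Illustrative: pin LLAP daemons to chosen hosts via YARN node labels.
# Run as the YARN administrator; names are placeholders.
yarn rmadmin -addToClusterNodeLabels "llap(exclusive=true)"
yarn rmadmin -replaceLabelsOnNode "node1.example.com=llap"
```

In Ambari, the YARN queue used for Hive interactive query (LLAP) can then be associated with that node label so the daemons are scheduled only on the labeled hosts.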