We are receiving hourly JSON data into HDFS; the volume is about 7GB per hour.
What is the best way to do upserts (updates and inserts) on a large dataset in Hadoop: Hive, HBase, or NiFi? What would the flow look like? Can anyone help us with the flow?
@Shu When using this for a larger dataset, the MERGE takes a long time to complete. The final table grows by about 150GB every day, so scanning the final table to apply the updates takes more than an hour. Is there any alternative approach?
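For context, the hourly merge we run is shaped roughly like this (a sketch only — table and column names such as `final_table`, `staging_table`, `id`, `val`, and `updated_at` are placeholders, not our actual schema):

```sql
-- Sketch of the hourly upsert on a Hive ACID table.
-- final_table is the large target table; staging_table holds the new hour of data.
MERGE INTO final_table AS t
USING staging_table AS s
  ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET val = s.val, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT VALUES (s.id, s.val, s.updated_at);
```

Because MERGE has to scan the target side of the join, the cost grows with the size of `final_table`, which is why the hourly run now exceeds an hour.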
@Eugene Koifman We are facing an issue that seems to be a limitation of Hive 1.2 ACID tables. We use MERGE to load mutable data into Hive ACID tables, but reading these ACID tables from Pig or Spark appears to be a problem.
Can a Hive ACID table in Hive 1.2 be read into Apache Pig using HCatLoader (or by other means), or into Spark using SQLContext (or by other means)?
For Spark, it seems ACID tables can only be read once the table is fully compacted, i.e. no delta folders exist in any partition. Details are in the following JIRA.
However, I wanted to know whether reading Hive ACID tables is supported in Apache Pig at all.
When I tried reading both an un-partitioned and a partitioned ACID table in Pig 0.16, I got 0 records:
Successfully read 0 records from: "dwh.acid_table"
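For reference, the load was along these lines (a sketch from memory — the relation names are arbitrary, and it assumes Pig was started with `pig -useHCatalog` so the HCatalog jars are on the classpath):

```pig
-- Attempt to read the Hive ACID table through HCatalog
a = LOAD 'dwh.acid_table' USING org.apache.hive.hcatalog.pig.HCatLoader();
b = LIMIT a 10;
DUMP b;
```

The job completes without errors but reports 0 records read, as shown above.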
HDP version 2.6.5
Spark version 2.3
Pig version 0.16
Hive version 1.2