Support Questions


Parquet data duplication

Rising Star

Hello All,

It's true that Parquet files make OLAP queries faster because of their columnar format, but on the other hand the data lake is duplicated (raw data + Parquet data). Even if Parquet can be compressed, don't you think that duplicating all the data can cost a lot?

tazimehdi.com

Rising Star

I agree, but actually the ORC part will be duplicated, no?


Master Mentor

@Mehdi TAZI Better compression means less storage cost. My suggestion is not to confuse HBase or NoSQL with HDFS. There are customers who use HDFS and Hive without HBase. HBase is designed for special use cases where you have to access data in real time (you have mentioned this already) 🙂
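To see why a columnar format compresses so well, here is a stdlib-only sketch of two of the per-column encodings Parquet uses (dictionary encoding plus run-length encoding). The column contents and sizes are made up for illustration; real Parquet adds page layout, statistics, and general-purpose compression on top.

```python
import json

# Hypothetical column of 100,000 country codes, clustered (e.g. data sorted
# by country) -- the kind of low-cardinality column Parquet encodes cheaply.
column = ["FR"] * 40_000 + ["US"] * 35_000 + ["DE"] * 25_000

# Row-style storage: every value spelled out, as in a CSV column.
raw_bytes = len(",".join(column).encode())

# Columnar-style storage: dictionary encoding + run-length encoding.
dictionary = {"FR": 0, "US": 1, "DE": 2}
runs = []  # (dictionary_id, run_length) pairs
for value in column:
    vid = dictionary[value]
    if runs and runs[-1][0] == vid:
        runs[-1] = (vid, runs[-1][1] + 1)
    else:
        runs.append((vid, 1))
encoded_bytes = len(json.dumps({"dict": dictionary, "runs": runs}).encode())

print(raw_bytes, encoded_bytes)  # the encoded column is a tiny fraction of raw
```

So the "duplicate" Parquet copy can be far smaller than the raw data it shadows, which is why the storage overhead is usually acceptable.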

Rising Star

Yes, thanks ^^. In my case I'm using HBase because I'm handling a large number of small files.


Master Mentor

@Mehdi TAZI That sounds correct. I did connect with you on Twitter; feel free to connect back and we can discuss in detail. I do believe that you are on the right track.

Master Mentor
@Mehdi TAZI

In one of your deleted responses you mentioned that you duplicate data for Hive queries and use HBase for the small-files issue. You can actually map Hive to HBase and run analytics queries on top of HBase. That may not be the most efficient way, but you can also map HBase snapshots to Hive, which is a lot better as far as HBase is concerned.
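For reference, mapping an existing HBase table into Hive with the HBase storage handler usually looks like the sketch below. The table name `events`, the column family `cf`, and the column names are hypothetical, chosen only to illustrate the mapping:

```sql
-- Hypothetical external Hive table backed by an existing HBase table "events".
CREATE EXTERNAL TABLE hbase_events (
  rowkey  STRING,
  country STRING,
  value   BIGINT
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf:country,cf:value"
)
TBLPROPERTIES ("hbase.table.name" = "events");

-- Analytics queries then run in Hive, scanning HBase underneath:
SELECT country, SUM(value) FROM hbase_events GROUP BY country;
```

Scans through the storage handler go against the live HBase table, which is why querying a snapshot instead tends to put less pressure on the region servers.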

Rising Star

First of all, thanks for your answer. The duplication I meant is between the data stored in Parquet and in HBase. That said, using Hive over HBase is not really as good as having a columnar format... Have a nice day 🙂
