Support Questions


Parquet data duplication

Rising Star

Hello All,

It's true that Parquet files make OLAP queries faster because of their columnar format, but on the other hand the data lake is duplicated (raw data + Parquet data). Even if Parquet can be compressed, don't you think that duplicating all the data can cost a lot?

tazimehdi.com

Rising Star

I agree, but actually the ORC part will be duplicated, no?


Master Mentor

@Mehdi TAZI Better compression means less storage cost. My suggestion is not to confuse HBase or NoSQL with HDFS. There are customers who use HDFS and Hive without HBase. HBase is designed for special use cases where you have to access data in real time (you have mentioned this already) 🙂
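To see why a columnar format compresses so well, here is a stdlib-only sketch of two of the per-column encodings Parquet uses (dictionary encoding plus run-length encoding). The column contents and sizes are made up for illustration; real Parquet adds page layout, statistics, and general-purpose compression on top.

```python
import json

# Hypothetical column of 100,000 country codes, clustered (e.g. data sorted
# by country) -- the kind of low-cardinality column Parquet encodes cheaply.
column = ["FR"] * 40_000 + ["US"] * 35_000 + ["DE"] * 25_000

# Row-style storage: every value spelled out, as in a CSV column.
raw_bytes = len(",".join(column).encode())

# Columnar-style storage: dictionary encoding + run-length encoding.
dictionary = {"FR": 0, "US": 1, "DE": 2}
runs = []  # (dictionary_id, run_length) pairs
for value in column:
    vid = dictionary[value]
    if runs and runs[-1][0] == vid:
        runs[-1] = (vid, runs[-1][1] + 1)
    else:
        runs.append((vid, 1))
encoded_bytes = len(json.dumps({"dict": dictionary, "runs": runs}).encode())

print(raw_bytes, encoded_bytes)  # the encoded column is a tiny fraction of raw
```

So the "duplicate" Parquet copy can be far smaller than the raw data it shadows, which is why the storage overhead is usually acceptable.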

Rising Star

Yes, thanks ^^. In my case I'm using HBase because I'm handling a large number of small files.


Master Mentor

@Mehdi TAZI That sounds correct. I did connect with you on Twitter; feel free to connect back and we can discuss in detail. I do believe that you are on the right track.

Master Mentor
@Mehdi TAZI

In one of your deleted responses you mentioned that you duplicate data for Hive queries and use HBase for the small-files issue. You can actually map Hive to HBase and run analytics queries on top of HBase. That may not be the most efficient way, but you can also map HBase snapshots to Hive, which is a lot better as far as HBase is concerned.
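For reference, mapping an existing HBase table into Hive with the HBase storage handler usually looks like the sketch below. The table name `events`, the column family `cf`, and the column names are hypothetical, chosen only to illustrate the mapping:

```sql
-- Hypothetical external Hive table backed by an existing HBase table "events".
CREATE EXTERNAL TABLE hbase_events (
  rowkey  STRING,
  country STRING,
  value   BIGINT
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf:country,cf:value"
)
TBLPROPERTIES ("hbase.table.name" = "events");

-- Analytics queries then run in Hive, scanning HBase underneath:
SELECT country, SUM(value) FROM hbase_events GROUP BY country;
```

Scans through the storage handler go against the live HBase table, which is why querying a snapshot instead tends to put less pressure on the region servers.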

Rising Star

First of all, thanks for your answer. The duplication I meant is between the data stored in Parquet and in HBase. That said, using Hive over HBase is not really as good as having a columnar format... Have a nice day 🙂
