<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Parquet data duplication in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103242#M66159</link>
    <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1227/gtmehdi.html" nodeid="1227"&gt;@Mehdi TAZI&lt;/A&gt;&lt;/P&gt;&lt;P&gt;1 - You are using HBase for very fast lookups / near-real-time data access - yes, that is fine.&lt;/P&gt;&lt;P&gt;2 - You want to store data in HDFS - yes, that is fine too, and it can serve many use cases down the road. You can keep this data for a long time, and create Hive tables on top of it for analytics or reporting.&lt;/P&gt;</description>
    <pubDate>Tue, 19 Jan 2016 21:22:37 GMT</pubDate>
    <dc:creator>nsabharwal</dc:creator>
    <dc:date>2016-01-19T21:22:37Z</dc:date>
    <item>
      <title>Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103235#M66152</link>
      <description>&lt;P&gt;Hello All,&lt;/P&gt;&lt;P&gt;It's clear that Parquet files make OLAP queries faster because of their columnar format, but on the other hand the data lake is duplicated (raw data + Parquet data). Even if Parquet can be compressed, don't you think that duplicating all the data can cost a lot?&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 18:31:24 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103235#M66152</guid>
      <dc:creator>TAZIMehdi</dc:creator>
      <dc:date>2016-01-19T18:31:24Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103236#M66153</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1227/gtmehdi.html" nodeid="1227" target="_blank"&gt;@Mehdi TAZI&lt;/A&gt;
&lt;/P&gt;&lt;P&gt;I am a big fan of ORC:&lt;/P&gt;&lt;P&gt;&lt;A href="http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/" target="_blank" rel="nofollow noopener noreferrer"&gt;http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="1456-orcfile.png" style="width: 1323px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/23651i2252C6724BA4F56A/image-size/medium?v=v2&amp;amp;px=400" role="button" title="1456-orcfile.png" alt="1456-orcfile.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 19 Aug 2019 12:08:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103236#M66153</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2019-08-19T12:08:22Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103237#M66154</link>
      <description>&lt;P&gt;Thanks for your answer, but my question wasn't about comparing compression rates; we actually need both the original and the columnar files. So is it normal to duplicate the whole data lake to get better performance?&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 19:06:05 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103237#M66154</guid>
      <dc:creator>TAZIMehdi</dc:creator>
      <dc:date>2016-01-19T19:06:05Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103238#M66155</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1227/gtmehdi.html" nodeid="1227"&gt;@Mehdi TAZI&lt;/A&gt; No, and I have never heard of duplicating data with Parquet. I hope you are not referring to the HDFS replication factor; if you are, please see &lt;A target="_blank" href="https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Data+Replication"&gt;this&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 19:14:20 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103238#M66155</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2016-01-19T19:14:20Z</dc:date>
    </item>
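The thread above distinguishes two different things that both get called "duplication": the HDFS replication factor (which multiplies every stored copy) and keeping a second, columnar copy of the data. A rough back-of-envelope sketch of how the two interact, with all sizes and ratios purely hypothetical:

```python
# Back-of-envelope storage estimate (all numbers hypothetical).
# HDFS replication multiplies EVERY stored copy; keeping a raw copy
# plus a columnar copy is "format duplication", but the columnar copy
# is typically much smaller than the raw one thanks to compression.

RAW_TB = 100          # raw data landed in HDFS (hypothetical size)
PARQUET_RATIO = 0.25  # assumed compressed-columnar size vs raw
REPLICATION = 3       # default HDFS replication factor

raw_stored = RAW_TB * REPLICATION
parquet_stored = RAW_TB * PARQUET_RATIO * REPLICATION
total = raw_stored + parquet_stored

print(f"raw copies on disk:     {raw_stored:.0f} TB")
print(f"parquet copies on disk: {parquet_stored:.0f} TB")
print(f"total footprint:        {total:.0f} TB")
```

Under these assumed numbers the columnar copy adds 25% to the footprint, not 100% - which is the point made later in the thread about compression reducing the cost of keeping both copies.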
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103239#M66156</link>
      <description>&lt;P&gt;I think I didn't explain my point well. Let's assume a system that receives data from outside sources; normally we store the raw data in HDFS/HBase in order to keep it in its original format.&lt;/P&gt;&lt;P&gt;Now let's assume that we want to make ad-hoc queries faster, so we convert all the data to Parquet format and of course keep the raw copy (this is the duplication I'm talking about).&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 19:26:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103239#M66156</guid>
      <dc:creator>TAZIMehdi</dc:creator>
      <dc:date>2016-01-19T19:26:16Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103240#M66157</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1227/gtmehdi.html" nodeid="1227"&gt;@Mehdi TAZI&lt;/A&gt; Very good point. It goes back to ELT: the source of truth ("raw data") lands in HDFS, we run transformations on that data, and we load it into Hive or HBase based on the use case. There is a significant cost difference between storing the source of truth in Hadoop and keeping it on an expensive SAN or in an EDW.&lt;/P&gt;&lt;P&gt;You don't have to store it in HDFS first; you can load data directly into Hive or HBase tables. Take the most basic use case, i.e. data archival: you can "move" data from an EDW into Hive using Sqoop, and the data goes directly into Hive tables.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 19:33:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103240#M66157</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2016-01-19T19:33:41Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103241#M66158</link>
      <description>&lt;P&gt;Excuse me, I didn't understand your answer.&lt;/P&gt;&lt;P&gt;Here is a typical case: I have a job that reads raw data from a source (e.g. Kafka) and stores it in the data lake (HBase over HDFS) for archival, and at the same time this same job creates Parquet files stored on HDFS for analytics. Here we are saving the same data in different formats for two different purposes, so the same data is duplicated.&lt;/P&gt;&lt;P&gt;1 - Is it right to do this?&lt;/P&gt;&lt;P&gt;2 - If yes, is it normal that the data is duplicated?&lt;/P&gt;&lt;P&gt;Thanks a lot!&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 21:09:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103241#M66158</guid>
      <dc:creator>TAZIMehdi</dc:creator>
      <dc:date>2016-01-19T21:09:51Z</dc:date>
    </item>
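The dual-write job described in the post above can be sketched in miniature: the same batch is written once row-wise (cheap point lookups by key, the role HBase plays here) and once pivoted column-wise (contiguous values per field, the layout that makes Parquet scans fast). The sinks below are plain in-memory dicts standing in for the real stores, and all record names are illustrative:

```python
# Toy sketch of a dual-sink ingest job: one batch is kept row-wise
# (archival / point lookups, as with HBase) and also pivoted into
# column form (analytic scans, as with Parquet). Stub sinks only;
# record fields and ids are made up for illustration.

def ingest(batch, row_store, column_store):
    for record in batch:
        # Row sink: keyed by record id, each record kept whole.
        row_store[record["id"]] = record
    # Columnar sink: one list per field, values stored contiguously.
    for field in batch[0]:
        column_store.setdefault(field, []).extend(r[field] for r in batch)

batch = [
    {"id": "a1", "event": "click", "ms": 120},
    {"id": "b2", "event": "view",  "ms": 45},
]
rows, columns = {}, {}
ingest(batch, rows, columns)

print(rows["a1"]["event"])  # point lookup by key, row stays together
print(columns["ms"])        # whole column ready for an analytic scan
```

The duplication the poster asks about is visible directly: every value exists in both `rows` and `columns`, because each layout serves a different access pattern.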
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103242#M66159</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1227/gtmehdi.html" nodeid="1227"&gt;@Mehdi TAZI&lt;/A&gt;&lt;/P&gt;&lt;P&gt;1 - You are using HBase for very fast lookups / near-real-time data access - yes, that is fine.&lt;/P&gt;&lt;P&gt;2 - You want to store data in HDFS - yes, that is fine too, and it can serve many use cases down the road. You can keep this data for a long time, and create Hive tables on top of it for analytics or reporting.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 21:22:37 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103242#M66159</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2016-01-19T21:22:37Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103243#M66160</link>
      <description>&lt;P&gt;This is exactly what I was doing; the current architecture includes generated Parquet files to improve performance, and it works! The only side effect is data duplication, so I was wondering whether there is another technology that would let me improve performance without this side effect.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 21:45:35 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103243#M66160</guid>
      <dc:creator>TAZIMehdi</dc:creator>
      <dc:date>2016-01-19T21:45:35Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103244#M66161</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1227/gtmehdi.html" nodeid="1227"&gt;@Mehdi TAZI&lt;/A&gt; HBase and HDFS are a really good combination. You don't have to store everything in HBase; you can store only the fields required by your application. Enabling compression for HBase and using ORC for your Hive tables will help you reduce your storage footprint.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 22:11:25 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103244#M66161</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2016-01-19T22:11:25Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103245#M66162</link>
      <description>&lt;P&gt;I agree, but the ORC part will still be duplicated, no?&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 22:24:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103245#M66162</guid>
      <dc:creator>TAZIMehdi</dc:creator>
      <dc:date>2016-01-19T22:24:51Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103246#M66163</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1227/gtmehdi.html" nodeid="1227"&gt;@Mehdi TAZI&lt;/A&gt; Better compression means less storage cost. My suggestion is not to confuse HBase or NoSQL with HDFS. There are customers who use HDFS and Hive without HBase; HBase is designed for special use cases where you have to access data in real time (you have mentioned this already) &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 22:36:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103246#M66163</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2016-01-19T22:36:12Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103247#M66164</link>
      <description>&lt;P&gt;Yes, thanks ^^. In my case I'm using HBase because I'm handling a large number of small files.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 23:18:37 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103247#M66164</guid>
      <dc:creator>TAZIMehdi</dc:creator>
      <dc:date>2016-01-19T23:18:37Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103248#M66165</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1227/gtmehdi.html" nodeid="1227"&gt;@Mehdi TAZI&lt;/A&gt; That sounds correct. I did connect with you on Twitter; feel free to connect back and we can discuss in detail. I do believe you are on the right track.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 23:27:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103248#M66165</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2016-01-19T23:27:33Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103249#M66166</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/1227/gtmehdi.html" nodeid="1227"&gt;@Mehdi TAZI&lt;/A&gt;&lt;P&gt;In one of your deleted responses you mentioned that you duplicate data for Hive queries and use HBase for the small-files issue. You can actually map Hive to HBase and run analytics queries on top of HBase. That may not be the most efficient way, but you can also map HBase snapshots to Hive, which is a lot better as far as HBase is concerned.&lt;/P&gt;</description>
      <pubDate>Wed, 20 Jan 2016 10:40:05 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103249#M66166</guid>
      <dc:creator>aervits</dc:creator>
      <dc:date>2016-01-20T10:40:05Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103250#M66167</link>
      <description>&lt;P&gt;First of all, thanks for your answer. The duplication I meant is about the same data being stored in both Parquet and HBase; besides, using Hive over HBase is not really as good as having a true columnar format... Have a nice day &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 20 Jan 2016 17:44:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103250#M66167</guid>
      <dc:creator>TAZIMehdi</dc:creator>
      <dc:date>2016-01-20T17:44:22Z</dc:date>
    </item>
  </channel>
</rss>

