Parquet data duplication

Rising Star

Hello All,

It's clear that Parquet files make OLAP queries faster because of their columnar format, but on the other hand the data lake is duplicated (raw data + Parquet data). Even if Parquet can be compressed, don't you think that duplicating all the data can cost a lot?

tazimehdi.com

Master Mentor

Rising Star

Thanks for your answer, but my question wasn't about comparing compression rates; we actually need both the original and the columnar files. So is it normal to duplicate the whole data lake to get better performance?

tazimehdi.com

Master Mentor

@Mehdi TAZI No, and I have never heard of duplicating the data with Parquet. I hope you are not referring to the HDFS replication factor. If you are, then please see this

Rising Star

I think I didn't explain my point well. Let's assume a system that receives data from outside sources; normally we store the raw data in HDFS/HBase in order to keep it in its original format.

Now let's assume we want to make ad-hoc queries faster, so we convert all the data to Parquet format and of course keep the raw copy! (This is the duplication I'm talking about.)
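
To make it concrete, the conversion step looks roughly like the sketch below (Spark; the paths and the partition column are just examples):

import org.apache.spark.sql.SparkSession

object RawToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("raw-to-parquet")
      .getOrCreate()

    // The raw JSON files stay untouched in the landing zone (source of truth).
    val raw = spark.read.json("hdfs:///datalake/raw/events/")

    // A second, columnar copy of the same data is written for ad-hoc queries.
    raw.write
      .mode("overwrite")
      .partitionBy("event_date")
      .parquet("hdfs:///datalake/parquet/events/")

    spark.stop()
  }
}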

tazimehdi.com

Master Mentor

@Mehdi TAZI Very good point. It goes back to ELT: the source of truth ("raw data") lands in HDFS, we run transformations on that data, and we load it into Hive or HBase based on the use case. There is a significant cost difference between storing the source of truth in Hadoop and storing it on an expensive SAN or in an EDW.

You don't have to store it as raw files in HDFS; you can load data directly into Hive or HBase tables. The most basic use case is data archival: you can "move" data from the EDW into Hive using Sqoop, and the data goes directly into Hive tables.
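
For example, with Spark you can write incoming data straight into a Hive table (a minimal sketch; the staging path and table name are only placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("archive-load")
  .enableHiveSupport()
  .getOrCreate()

// Read the incoming data once and land it directly in a managed Hive table,
// so there is no separate raw copy to maintain afterwards.
val events = spark.read.json("hdfs:///staging/events/")
events.write
  .mode("append")
  .saveAsTable("events_archive")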

Rising Star

Excuse me, I didn't understand your answer.

Here is a typical case: I have a job that reads raw data from a source (e.g. Kafka) and stores it in the data lake (HBase over HDFS) for archiving, and at the same time the same job creates Parquet files stored on HDFS for analytics. Here we are saving the same data in different formats for two different purposes, so the same data is duplicated.
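
Roughly, the job is shaped like the sketch below (simplified: the HBase table, column family, field names and paths are made up, and the Kafka consumption is replaced here by reading a staging directory):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

object ArchiveAndAnalyticsJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("archive-and-analytics").getOrCreate()

    val events = spark.read.json("hdfs:///staging/events/")

    // Copy 1: columnar Parquet files on HDFS for ad-hoc analytics.
    events.write.mode("append").parquet("hdfs:///datalake/parquet/events/")

    // Copy 2: row-keyed archive in HBase for fast lookups.
    events.select("event_id", "payload").rdd.foreachPartition { rows =>
      val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
      val table = connection.getTable(TableName.valueOf("events_archive"))
      rows.foreach { row =>
        val put = new Put(Bytes.toBytes(row.getString(0)))
        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes(row.getString(1)))
        table.put(put)
      }
      table.close()
      connection.close()
    }

    spark.stop()
  }
}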

1 - Is it right to do that?

2 - If yes, is it normal that the data is duplicated?

Thanks a lot!

tazimehdi.com

Master Mentor

@Mehdi TAZI

1 - You are using HBase for very fast lookups / near-real-time data access - yes, that's fine.

2 - You want to store data in HDFS - yes, that's fine, and it can serve many use cases down the road. You can keep this data for a long time and create Hive tables on top of it for analytics or reporting.
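
As a sketch, exposing the HDFS copy to analysts can be as simple as an external Hive table over the files (table name, columns and path below are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("create-analytics-table")
  .enableHiveSupport()
  .getOrCreate()

// External table: only the schema and location are registered in the metastore,
// the files in HDFS stay exactly where they are.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS events_parquet (
    event_id STRING,
    payload  STRING,
    event_ts TIMESTAMP
  )
  STORED AS PARQUET
  LOCATION 'hdfs:///datalake/parquet/events/'
""")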

Rising Star

This is exactly what I was doing, but the current architecture also generates Parquet files to improve performance, and it works! The only side effect is data duplication, so I was wondering whether there is another technology that would let me improve performance without this side effect.

tazimehdi.com

Master Mentor

@Mehdi TAZI HBase and HDFS are a really good combination. You don't have to store everything in HBase; you can store only the fields required by your application. Enabling compression for HBase and using the ORC format for Hive tables will help you reduce your storage footprint.
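
For the Hive side, here is a sketch of an ORC table with explicit compression (the table, columns and codec are only examples):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("create-orc-table")
  .enableHiveSupport()
  .getOrCreate()

// ORC with a compression codec set via table properties keeps the analytics copy small.
spark.sql("""
  CREATE TABLE IF NOT EXISTS events_orc (
    event_id STRING,
    payload  STRING
  )
  STORED AS ORC
  TBLPROPERTIES ('orc.compress' = 'SNAPPY')
""")

On the HBase side, compression is set per column family (for example COMPRESSION => 'SNAPPY' on the column family in the HBase shell).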