Member since: 06-18-2018
Posts: 34
Kudos Received: 13
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 86875 | 02-02-2016 03:08 PM
 | 1960 | 01-13-2016 09:52 AM
01-19-2016
01:45 PM
This is exactly what I was doing, but the actual architecture includes generated Parquet files to improve performance, and it works! The only side effect is data duplication, so I was wondering whether there is another technology that would let me improve performance without that side effect.
01-19-2016
01:41 PM
Thanks a lot. I had already seen this post, and there is still no answer as to why Cassandra can't manage the same amount of data as Hadoop does. NB: the accepted answer is not completely true; Cassandra doesn't run over HDFS.
01-19-2016
01:39 PM
Hello! Thanks a lot for your answer. I did read the CAP theorem, but I still can't see why Cassandra can't handle the same amount of data as Hadoop does.
01-19-2016
01:09 PM
Excuse me, I didn't understand your answer. Here is a typical case: I have a job that reads raw data from a source (e.g. Kafka) and stores it in the data lake (HBase over HDFS) for archiving purposes, and at the same time this same job creates Parquet files stored on HDFS for analytics purposes. Here we are saving the same data in two different formats for two different purposes, so the same data is duplicated. 1 - Is this the right way to do it? 2 - If yes, is it normal that the data is duplicated? Thanks a lot!
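A minimal sketch of the kind of dual-write job described above, assuming PySpark Structured Streaming with the spark-sql-kafka package available; the HBase archive is stood in for by a plain JSON sink on HDFS (HBase connectors vary by distribution), and the broker, topic, and path names are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dual-sink-ingest").getOrCreate()

# Read the raw event stream from Kafka (broker and topic are placeholders).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "events")
       .load()
       .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp"))

# Sink 1: archive the untouched payload (stand-in for the HBase archive).
archive = (raw.writeStream
           .format("json")
           .option("path", "hdfs:///datalake/raw/events")
           .option("checkpointLocation", "hdfs:///checkpoints/raw_events")
           .start())

# Sink 2: the same records written as Parquet for analytics.
analytics = (raw.writeStream
             .format("parquet")
             .option("path", "hdfs:///datalake/parquet/events")
             .option("checkpointLocation", "hdfs:///checkpoints/parquet_events")
             .start())

spark.streams.awaitAnyTermination()
```

Both sinks receive the same records, which is exactly the duplication in question: one copy kept in its landing format for archiving, one copy rewritten as Parquet for analytics.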
01-19-2016
01:02 PM
2 Kudos
Several sources on the internet explain that HDFS is built to handle a larger amount of data than NoSQL solutions (Cassandra, for example); in general, once we go beyond 1 TB, we must start thinking Hadoop (HDFS) rather than NoSQL. Besides the architecture, the fact that HDFS works in batch while most NoSQL stores (e.g. Cassandra) do random I/O, and the schema design differences, why can't NoSQL solutions such as Cassandra handle the same amount of data as HDFS? Why can't we use those solutions as a data lake, and why do we only use them as hot storage in a big data architecture? Thanks a lot.
Labels:
- Apache Hadoop
01-19-2016
11:26 AM
1 Kudo
I think I didn't explain my point well. Let's assume a system that receives data from outside sources; normally we store the raw data in HDFS/HBase in order to keep it in its original format. Now let's assume that we want to make ad-hoc queries faster, so we convert all the data to Parquet format and of course keep the raw copy! (This is the duplication I'm talking about.)
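A minimal sketch of that raw-to-Parquet conversion, assuming a batch PySpark job over JSON raw files; the paths are placeholders, and the raw directory is deliberately left untouched:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-to-parquet").getOrCreate()

# Read the raw files exactly as they were landed (path and format are placeholders).
raw = spark.read.json("hdfs:///datalake/raw/events")

# Write a columnar copy for ad-hoc queries; the raw directory stays in place,
# so the same records now exist twice on HDFS.
raw.write.mode("overwrite").parquet("hdfs:///datalake/parquet/events")

spark.stop()
```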
01-19-2016
11:06 AM
Thanks for your answer, but my question wasn't about comparing compression rates; we actually need both the original and the columnar files. So is it normal to duplicate the whole data lake in order to get better performance?
01-19-2016
10:31 AM
2 Kudos
Hello all, Parquet files certainly make OLAP queries faster because of their columnar format, but on the other hand the data lake is duplicated (raw data + Parquet data). Even if Parquet can be compressed, don't you think that duplicating all the data can cost a lot?
Labels:
- Apache Hadoop
01-13-2016
09:52 AM
I finally found an answer. In the HBase Lily indexer there is no single point of failure; simply running multiple instances is sufficient, and it will also share the indexing work across the nodes because it is based on HBase replication.