Member since: 06-18-2018
Posts: 34
Kudos Received: 13
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 86875 | 02-02-2016 03:08 PM
 | 1960 | 01-13-2016 09:52 AM
01-19-2016
01:45 PM
This is exactly what I was doing, but the actual architecture includes generated Parquet files to improve performance, and it works! The only side effect is data duplication, so I was wondering whether there is another technology that would let me improve performance without that side effect.
01-19-2016
01:41 PM
Thanks a lot. I had already seen this post, and there is still no answer as to why Cassandra can't manage the same amount of data as Hadoop does. NB: the accepted answer is not completely true; Cassandra doesn't run over HDFS.
01-19-2016
01:39 PM
Hello! Thanks a lot for your answer. I did read the CAP theorem, but I still can't see why Cassandra can't handle the same amount of data as Hadoop does.
01-19-2016
01:09 PM
Excuse me, I didn't understand your answer. Here is a typical case: I have a job that reads raw data from a source (e.g. Kafka) and stores it in the data lake (HBase over HDFS) for archiving purposes, and at the same time this same job creates Parquet files stored on HDFS for analytics purposes. Here we are saving the same data in two different formats for two different purposes, so the same data is duplicated. 1 - Is this the right way to do it? 2 - If yes, is it normal that the data is duplicated? Thanks a lot!
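A minimal sketch of the kind of dual-write job described above, assuming PySpark Structured Streaming with the spark-sql-kafka package available; the HBase archive is stood in for by a plain JSON sink on HDFS (HBase connectors vary by distribution), and the broker, topic, and path names are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dual-sink-ingest").getOrCreate()

# Read the raw event stream from Kafka (broker and topic are placeholders).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "events")
       .load()
       .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp"))

# Sink 1: archive the untouched payload (stand-in for the HBase archive).
archive = (raw.writeStream
           .format("json")
           .option("path", "hdfs:///datalake/raw/events")
           .option("checkpointLocation", "hdfs:///checkpoints/raw_events")
           .start())

# Sink 2: the same records written as Parquet for analytics.
analytics = (raw.writeStream
             .format("parquet")
             .option("path", "hdfs:///datalake/parquet/events")
             .option("checkpointLocation", "hdfs:///checkpoints/parquet_events")
             .start())

spark.streams.awaitAnyTermination()
```

Both sinks receive the same records, which is exactly the duplication in question: one copy kept in its landing format for archiving, one copy rewritten as Parquet for analytics.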
01-19-2016
01:02 PM
2 Kudos
Several sources on the internet explain that HDFS is built to handle a larger amount of data than NoSQL solutions (Cassandra, for example); in general, once we go beyond 1 TB, we must start thinking Hadoop (HDFS) rather than NoSQL. Besides the architecture, the fact that HDFS works in batch while most NoSQL stores (e.g. Cassandra) do random I/O, and the schema design differences, why can't NoSQL solutions such as Cassandra handle the same amount of data as HDFS? Why can't we use those solutions as a data lake, and why do we only use them as hot storage in a big data architecture? Thanks a lot.
Labels:
- Apache Hadoop
01-19-2016
11:26 AM
1 Kudo
I think I didn't explain my point well. Let's assume a system that receives data from outside sources; normally we store the raw data in HDFS/HBase in order to keep it in its original format. Now let's assume that we want to make ad-hoc queries faster, so we convert all the data to Parquet format and of course keep the raw copy! (This is the duplication I'm talking about.)
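A minimal sketch of that raw-to-Parquet conversion, assuming a batch PySpark job over JSON raw files; the paths are placeholders, and the raw directory is deliberately left untouched:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-to-parquet").getOrCreate()

# Read the raw files exactly as they were landed (path and format are placeholders).
raw = spark.read.json("hdfs:///datalake/raw/events")

# Write a columnar copy for ad-hoc queries; the raw directory stays in place,
# so the same records now exist twice on HDFS.
raw.write.mode("overwrite").parquet("hdfs:///datalake/parquet/events")

spark.stop()
```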
01-19-2016
11:06 AM
Thanks for your answer, but my question wasn't about comparing compression rates; we actually need both the original and the columnar files. So is it normal to duplicate the whole data lake in order to get better performance?
01-19-2016
10:31 AM
2 Kudos
Hello all, Parquet files certainly make OLAP queries faster because of their columnar format, but on the other hand the data lake is duplicated (raw data + Parquet data). Even if Parquet can be compressed, don't you think that duplicating all the data can cost a lot?
Labels:
- Apache Hadoop
01-13-2016
09:52 AM
I finally found an answer. In the HBase Lily indexer there is no single point of failure; simply running multiple instances is sufficient, and it will also share the indexing work across the nodes because it is based on HBase replication.