Member since: 06-18-2018
Posts: 34
Kudos Received: 13
Solutions: 2

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 87431 | 02-02-2016 03:08 PM |
| | 1997 | 01-13-2016 09:52 AM |
02-19-2020
10:49 PM
With newer versions of Spark, the sqlContext is not loaded by default; you have to create it explicitly:

    scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    warning: there was one deprecation warning; re-run with -deprecation for details
    sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@6179af64

    scala> import sqlContext.implicits._
    import sqlContext.implicits._

    scala> sqlContext.sql("describe mytable")
    res2: org.apache.spark.sql.DataFrame = [col_name: string, data_type: string ... 1 more field]

I'm working with Spark 2.3.2.
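As the deprecation warning hints, SQLContext is superseded by SparkSession in Spark 2.x. A minimal sketch of the equivalent calls, assuming the same mytable Hive table (in spark-shell a session named spark is already created for you; the application name below is hypothetical):

    // Standalone Scala sketch; in spark-shell the `spark` session already exists.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("describe-example")   // hypothetical application name
      .enableHiveSupport()           // needed so Hive tables such as mytable are visible
      .getOrCreate()

    import spark.implicits._         // replaces sqlContext.implicits._

    // Same query as above, issued through the SparkSession instead of SQLContext
    spark.sql("describe mytable").show(truncate = false)

The old sqlContext still works in 2.3.2, as the transcript above shows, but new code is usually written against the SparkSession API.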
01-29-2016
10:34 AM
3 Kudos
By and large, large ORC files are better. HDFS has a sweet spot for files that are 1-10 times the block size, but 20 GB should also be fine. There will be one map task for each block of the ORC file anyway, so the difference should not be big as long as your files are as big as, or bigger than, a block. Files significantly smaller than a block would be bad, though.

If you create a very big file, just keep an eye on the stripe sizes in the ORC file if you see any performance problems. I have sometimes seen very small stripes due to memory restrictions in the writer.

So if you want to aggregate a large amount of data as fast as possible, having a single big file is good. However, having one 20 GB ORC file also means you have loaded it with one task, so the load will normally be too slow. You may want a couple of reducers to increase load speed (a sketch follows below). Alternatively, you can use ALTER TABLE ... CONCATENATE to merge small ORC files together. More details on how to influence the load can be found here: http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
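A minimal Spark-side sketch of the same trade-off, with hypothetical table names (staging_table, mytable_orc): repartitioning before the write controls how many parallel tasks, and therefore how many ORC files, perform the load, which is the analog of choosing the number of reducers in a Hive insert.

    // Sketch only: staging_table and mytable_orc are hypothetical names.
    import org.apache.spark.sql.SparkSession

    object OrcLoadSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("orc-load-sketch")
          .enableHiveSupport()
          .getOrCreate()

        val df = spark.table("staging_table")

        // Eight partitions -> eight parallel write tasks -> eight ORC files,
        // instead of one slow task writing a single 20 GB file.
        df.repartition(8)
          .write
          .format("orc")
          .mode("overwrite")
          .saveAsTable("mytable_orc")

        // The Hive-side alternative mentioned above, run from Hive itself:
        //   ALTER TABLE mytable_orc CONCATENATE;
        // merges small ORC files of an existing table in place.

        spark.stop()
      }
    }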
01-28-2016
02:11 AM
1 Kudo
@Mehdi TAZI AFAIK, Ozone is a key-object store like AWS S3. Keys/objects are organized into buckets, each with a unique set of keys. Bucket data and Ozone metadata are stored in Storage Containers (SCs), which coexist with HDFS blocks on DataNodes in a separate block pool. Ozone metadata is distributed across SCs; there is no central NameNode. Buckets can be huge and are divided into partitions, which are also stored in SCs. Reads and writes are supported; append and update are not. The SC implementation is to use LevelDB or RocksDB. The Ozone architecture doc and all the details are here. So it's not on top of HDFS; it's going to coexist with HDFS and share DataNodes with it.
01-20-2016
10:45 PM
Thanks a lot for your answer once again 🙂 1 - What do you mean by source to destination? Is it some kind of ETL on raw data to put into a DW? 2.1 - Is there an MPP database recommended by Hortonworks? 2.2 - If there is no option, what other alternatives exist? Thanks 😉
01-21-2016
12:20 AM
4 Kudos
@Mehdi TAZI As @Arpit Agarwal mentioned, this is not related to the CAP theorem. HDFS and Cassandra expose different kinds of interfaces, so an apples-to-apples comparison is not possible. From the papers and benchmarking results that I have seen, Cassandra is often restricted to sub-1000-node clusters.

References:
- Planet Cassandra: http://www.planetcassandra.org/nosql-performance-benchmarks/
- Netflix Engineering Blog: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html and http://techblog.netflix.com/2014/07/revisiting-1-million-writes-per-second.html

It is typical to see HDFS clusters with well over 1,000 nodes, so the scale at which HDFS operates is very different from Cassandra's. Please keep in mind that 300-odd nodes dedicated to NoSQL storage can store large amounts of data. However, where HDFS shines is the diverse set of applications that you can run on it. Cassandra addresses a very focused scenario, whereas HDFS is very general purpose. You can run a set of applications, including HBase, which provides functionality comparable to what Cassandra provides. So if you are an enterprise, it is often the case that you have needs that can only be addressed by different tools, and HDFS will provide access to a set of tools that operate on your data. At this point in time, we have no data that says Cassandra can or cannot handle the same amount of data as HDFS; I think the only data point is that Cassandra benchmarks are typically run with a much smaller number of nodes.
01-20-2016
09:44 AM
First of all, thanks for your answer. The duplication wasn't about the date but more about the data in Parquet and HBase; otherwise, using Hive over HBase is not really as good as having a columnar format... Have a nice day 🙂
02-05-2016
07:48 PM
@Mehdi TAZI please accept the best answer or provide your own solution.
01-13-2016
11:05 AM
1 Kudo
You can earn points and build a great reputation on this site if you write a short article and post it. I am sure there are a lot of customers that would be interested in the same, @Mehdi TAZI.