Member since: 06-18-2018
Posts: 34
Kudos Received: 13
Solutions: 2

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 87431 | 02-02-2016 03:08 PM |
| | 1997 | 01-13-2016 09:52 AM |
02-19-2020
10:49 PM
With newer versions of Spark, the sqlContext is not loaded by default; you have to create it explicitly:

    scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    warning: there was one deprecation warning; re-run with -deprecation for details
    sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@6179af64

    scala> import sqlContext.implicits._
    import sqlContext.implicits._

    scala> sqlContext.sql("describe mytable")
    res2: org.apache.spark.sql.DataFrame = [col_name: string, data_type: string ... 1 more field]

I'm working with Spark 2.3.2.
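As the deprecation warning hints, SQLContext is superseded by SparkSession in Spark 2.x. A minimal sketch of the equivalent calls, assuming the same mytable Hive table (in spark-shell a session named spark is already created for you; the application name below is hypothetical):

    // Standalone Scala sketch; in spark-shell the `spark` session already exists.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("describe-example")   // hypothetical application name
      .enableHiveSupport()           // needed so Hive tables such as mytable are visible
      .getOrCreate()

    import spark.implicits._         // replaces sqlContext.implicits._

    // Same query as above, issued through the SparkSession instead of SQLContext
    spark.sql("describe mytable").show(truncate = false)

The old sqlContext still works in 2.3.2, as the transcript above shows, but new code is usually written against the SparkSession API.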
01-29-2016
10:34 AM
3 Kudos
By and large, large ORC files are better. HDFS has a sweet spot for files that are 1-10 times the block size, but 20 GB should also be fine. There will be one map task for each block of the ORC file anyway, so the difference should not be big as long as your files are as big as, or bigger than, a block. Files significantly smaller than a block would be bad, though.

If you create a very big file, just keep an eye on the stripe sizes in the ORC file if you see any performance problems. I have sometimes seen very small stripes due to memory restrictions in the writer.

So if you want to aggregate a large amount of data as fast as possible, having a single big file is good. However, having one 20 GB ORC file also means you have loaded it with one task, so the load will normally be too slow. You may want a couple of reducers to increase load speed (a sketch follows below). Alternatively, you can use ALTER TABLE ... CONCATENATE to merge small ORC files together. More details on how to influence the load can be found here: http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
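A minimal Spark-side sketch of the same trade-off, with hypothetical table names (staging_table, mytable_orc): repartitioning before the write controls how many parallel tasks, and therefore how many ORC files, perform the load, which is the analog of choosing the number of reducers in a Hive insert.

    // Sketch only: staging_table and mytable_orc are hypothetical names.
    import org.apache.spark.sql.SparkSession

    object OrcLoadSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("orc-load-sketch")
          .enableHiveSupport()
          .getOrCreate()

        val df = spark.table("staging_table")

        // Eight partitions -> eight parallel write tasks -> eight ORC files,
        // instead of one slow task writing a single 20 GB file.
        df.repartition(8)
          .write
          .format("orc")
          .mode("overwrite")
          .saveAsTable("mytable_orc")

        // The Hive-side alternative mentioned above, run from Hive itself:
        //   ALTER TABLE mytable_orc CONCATENATE;
        // merges small ORC files of an existing table in place.

        spark.stop()
      }
    }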
01-28-2016
02:11 AM
1 Kudo
@Mehdi TAZI AFAIK, Ozone is a key-object store like AWS S3. Keys/objects are organized into buckets, each with a unique set of keys. Bucket data and Ozone metadata are stored in Storage Containers (SCs), which coexist with HDFS blocks on DataNodes in a separate block pool. Ozone metadata is distributed across SCs; there is no central NameNode. Buckets can be huge and are divided into partitions, which are also stored in SCs. Reads and writes are supported; append and update are not. The SC implementation is to use LevelDB or RocksDB. The Ozone architecture doc and all the details are here. So it's not on top of HDFS; it's going to coexist with HDFS and share DataNodes with it.
01-20-2016
10:45 PM
Thanks a lot for your answer once again 🙂 1 - What do you mean by source to destination? Is it some kind of ETL on raw data to put into a DW? 2.1 - Is there an MPP database recommended by Hortonworks? 2.2 - If there is no option, what other alternatives exist? Thanks 😉
01-21-2016
12:20 AM
4 Kudos
@Mehdi TAZI As @Arpit Agarwal mentioned, this is not related to the CAP theorem. HDFS and Cassandra expose different kinds of interfaces, so an apples-to-apples comparison is not possible. From the papers and benchmarking results that I have seen, Cassandra is often restricted to sub-1000-node clusters.

References:
- Planet Cassandra: http://www.planetcassandra.org/nosql-performance-benchmarks/
- Netflix Engineering Blog: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html and http://techblog.netflix.com/2014/07/revisiting-1-million-writes-per-second.html

It is typical to see HDFS clusters with well over 1,000 nodes, so the scale at which HDFS operates is very different from Cassandra's. Please keep in mind that 300-odd nodes dedicated to NoSQL storage can store large amounts of data. However, where HDFS shines is the diverse set of applications that you can run on it. Cassandra addresses a very focused scenario, whereas HDFS is very general purpose. You can run a set of applications, including HBase, which provides functionality comparable to what Cassandra provides. So if you are an enterprise, it is often the case that you have needs that can only be addressed by different tools, and HDFS will provide access to a set of tools that operate on your data. At this point in time, we have no data that says Cassandra can or cannot handle the same amount of data as HDFS; I think the only data point is that Cassandra benchmarks are typically run with a much smaller number of nodes.
01-20-2016
09:44 AM
First of all, thanks for your answer. The duplication wasn't about the date but more about the data in Parquet and HBase; otherwise, using Hive over HBase is not really as good as having a columnar format... Have a nice day 🙂
02-05-2016
07:48 PM
@Mehdi TAZI please accept the best answer or provide your own solution.
01-13-2016
11:05 AM
1 Kudo
You can earn points and build a great reputation on this site if you write a short article and post it. I am sure there are a lot of customers that would be interested in the same, @Mehdi TAZI.