Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
05-24-2016
04:28 PM
Yes, you would need to configure user sync with LDAP/AD in the Ranger UI. Alternatively, use UNIX user sync in Ranger to sync with the local operating system (works as well).
05-24-2016
12:48 PM
1 Kudo
Alternatively, use Kerberos and kerberize the HDFS UI. In that case only SPNEGO-enabled browsers will be able to access the UI, and users will have the same filesystem access restrictions as when accessing HDFS directly.
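As a sketch, SPNEGO for the Hadoop web UIs is driven by a few `core-site.xml` properties; the principal and keytab values below are environment-specific placeholders, not values from this thread:

```xml
<!-- core-site.xml: switch web UI authentication to Kerberos (SPNEGO).
     Principal and keytab paths are example values for your environment. -->
<property>
  <name>hadoop.http.authentication.type</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.http.authentication.kerberos.principal</name>
  <value>HTTP/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>hadoop.http.authentication.kerberos.keytab</name>
  <value>/etc/security/keytabs/spnego.service.keytab</value>
</property>
```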
06-09-2016
10:34 AM
Can you give me the top 50 keys plus the min, max, and average? Also, did you try the query? What was the behaviour? The reason I am asking is that if your query runs very long while using only a small number of reducers, for example, that may indicate skew, and one way to maximize usage of the cluster is to look at surrogate key creation.
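To make the skew question concrete, here is a small illustrative Python sketch (the per-key counts are made-up numbers, not from this thread): when the maximum key count dwarfs the average, one reducer ends up doing most of the work.

```python
# Hypothetical per-key row counts, e.g. the result of something like:
#   SELECT join_key, COUNT(*) FROM t GROUP BY join_key
key_counts = {"a": 12, "b": 9, "c": 950_000, "d": 15, "e": 11}

counts = sorted(key_counts.values(), reverse=True)
top50 = counts[:50]                      # top 50 keys by row count
lo, hi = min(counts), max(counts)
avg = sum(counts) / len(counts)

# A max far above the average suggests one reducer handles most rows.
skew_ratio = hi / avg
print(f"min={lo} max={hi} avg={avg:.1f} skew_ratio={skew_ratio:.1f}")
```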
05-16-2016
01:15 PM
Thanks Eric 🙂 I think I will have some "trouble" analyzing and segmenting the data in the Spark step, because I will need to create some rules to make that division.
05-16-2016
02:38 PM
2 Kudos
Get data into the cluster? The easiest way is to have a delimited file and do `hadoop fs -put file <hdfs location>`. You can then read those files with sc.textFile. I think you should go through a couple of basic Hadoop tutorials first: http://hortonworks.com/hadoop-tutorial/using-commandline-manage-files-hdfs/
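Once the file is in HDFS, the Spark side is just sc.textFile plus a split on the delimiter. As a runnable sketch of that parsing step (plain Python standing in for the RDD operations, with a made-up two-column file):

```python
# Stand-in for: rdd  = sc.textFile("hdfs:///data/people.csv")
#               rows = rdd.map(lambda line: line.split(","))
lines = ["alice,30", "bob,25", "carol,41"]   # textFile yields one string per line

rows = [line.split(",") for line in lines]   # same map(...) logic, locally
ages = [int(age) for _name, age in rows]

print(rows[0])    # ['alice', '30']
print(sum(ages))  # 96
```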
05-16-2016
11:14 AM
1) Yes, you can see the "Tez session was closed ..." message. 2) In anything after HDP 2, Tez is enabled by default; MapReduce might be going away as an option anyway. 3) You can still set the execution engine per query: `set hive.execution.engine=mr;` or `set hive.execution.engine=tez;`. 4) Not sure what you mean by utility. The Tez view in Ambari would provide that functionality; I am not completely sure about the out-of-the-box integration with the ResourceManager. https://www.youtube.com/watch?v=xyqct59LxLY
05-10-2016
05:11 PM
2 Kudos
Here is a great write-up on file compression in Hadoop: http://comphadoop.weebly.com/
05-10-2016
11:02 AM
Contrary to popular belief, Spark is not in-memory only.

a) Simple read, no shuffle (no joins, ...): For the initial reads, Spark, like MapReduce, reads the data in a stream and processes it as it comes along. I.e., unless there is a reason to, Spark will NOT materialize the full RDDs in memory (you can tell it to do so, however, if you want to cache a small dataset). An RDD is resilient because Spark knows how to recreate it (re-read a block from HDFS, for example), not because it is stored in memory in different locations (though that can be done too). So if you filter out most of your data, or do an efficient aggregation that aggregates on the map side, you will never have the full table in memory.

b) Shuffle: This is done very similarly to MapReduce: the map outputs are written to disc and the reducers read them through HTTP. However, Spark uses an aggressive filesystem buffer strategy on the Linux filesystem, so if the OS has memory available the data will not actually be written to physical disc.

c) After shuffle: RDDs after a shuffle are normally cached by the engine (otherwise a failed node or RDD would require a complete re-run of the job); however, as abdelkrim mentions, Spark can spill these to disc unless you overrule that.

d) Spark Streaming: This is a bit different. Spark Streaming expects all data to fit in memory unless you override settings.
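The point in (a), that a filter-heavy pipeline never holds the full dataset, can be illustrated outside Spark with plain Python generators, which pass one record at a time through the pipeline much like a narrow RDD lineage (the record shape and sizes here are invented for the sketch):

```python
# A lazy "read": yields one record at a time, similar to Spark streaming
# blocks in from HDFS. Nothing is materialized until a downstream
# stage pulls records through the chain.
def read_records(n):
    for i in range(n):
        yield {"id": i, "value": i % 10}

records = read_records(1_000_000)                     # no data read yet
filtered = (r for r in records if r["value"] == 0)    # drops 90% lazily
total = sum(r["id"] for r in filtered)                # data flows only now

print(total)
```

At no point does the full million-record dataset exist in memory; only the single record currently flowing through the chain does.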
05-10-2016
01:42 PM
Hi Ed, it would be useful to know whether you are aiming for HA or performance. Since it is a small cluster, you may use it as a POC and not care much about HA; I don't know. One option not mentioned below is going with 3 masters and 3 slaves in a small HA cluster setup. That allows you to balance services across the masters more and/or dedicate one to be mostly an edge node. If security is a topic, that may come in handy. Cheers, Christian