Member since 09-23-2015
800 Posts
898 Kudos Received
185 Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 7348 | 08-12-2016 01:02 PM |
| | 2707 | 08-08-2016 10:00 AM |
| | 3663 | 08-03-2016 04:44 PM |
| | 7204 | 08-03-2016 02:53 PM |
| | 1863 | 08-01-2016 02:38 PM |
04-29-2016
12:51 PM
1 Kudo
Didn't understand it 100%, but I assume you installed a cluster and forgot to install NodeManagers on some nodes? You can do that on the Hosts pages: in Ambari, go to Hosts, select the host you want to install the NodeManager on, press +Add and select NodeManager. Unfortunately you have to do this for every node one by one. If you want an automated way, you would need to use the REST API. (This assumes you installed YARN and just didn't select some nodes for NodeManagers; if you didn't install YARN at all, you can use the Add Service feature in the lower left of the main screen.)
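For the automated way, a rough sketch against the Ambari REST API could look like the following. The cluster name, host names, and credentials are placeholders, and the install/start calls return asynchronous request IDs, so a real script would poll each request before moving on.

```python
# Sketch only: add, install and start a NODEMANAGER on several hosts via the
# Ambari REST API. Cluster name, hosts and credentials are placeholders.
import requests

AMBARI = "http://ambari-server:8080/api/v1"
CLUSTER = "mycluster"
AUTH = ("admin", "admin")
HEADERS = {"X-Requested-By": "ambari"}

for host in ["worker1.example.com", "worker2.example.com"]:
    url = "{0}/clusters/{1}/hosts/{2}/host_components/NODEMANAGER".format(
        AMBARI, CLUSTER, host)
    # Register the NodeManager component on the host
    requests.post(url, auth=AUTH, headers=HEADERS)
    # Install it, then start it (both are async; poll the returned request
    # ids before issuing the next call in a real script)
    requests.put(url, auth=AUTH, headers=HEADERS,
                 json={"HostRoles": {"state": "INSTALLED"}})
    requests.put(url, auth=AUTH, headers=HEADERS,
                 json={"HostRoles": {"state": "STARTED"}})
```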
04-29-2016
12:25 PM
4 Kudos
The following has a good overview. Essentially, RDDs are directly implemented code: whatever you write gets executed as-is. DataFrames, on the other hand, get compiled into an execution plan and then executed by the same engine regardless of language. The DataFrame API is the same in Python, so you would expect the same performance as Scala unless you use big, heavy Python UDFs. https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

Essentially DataFrames have two advantages over RDDs: a) most people do not know how to write optimized Spark code, and b) the optimizer can do some tricks based on data characteristics that a user might not be aware of at write time (dataset size etc.). https://0x0fff.com/spark-dataframes-are-faster-arent-they/

Edit: I was curious and dug a bit deeper, and I think this is the best overview. Essentially, as said, as long as you use the basic DataFrame functions, performance is equal, because the DataFrame code (Python or Scala) gets translated into the same RDD code (Scala). But if you use heavy Python UDFs, you will see performance differences again. https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/ "This is still true if you want to use Dataframe’s User Defined Functions, you can write them in Java/Scala or Python and this will impact your computation performance – but if you manage to stay in a pure Dataframe computation – then nothing will get between you and the best computation performance you can possibly get."
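To make the UDF caveat concrete, here is a minimal PySpark sketch with made-up column names: the first variant stays entirely inside the optimized DataFrame engine, while the second pushes every row through a Python UDF and is typically much slower.

```python
# Sketch: built-in DataFrame function vs. a Python UDF (made-up column names).
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

sc = SparkContext(appName="df-vs-udf")
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([("alice",), ("bob",)], ["name"])

# Stays inside the DataFrame engine - same speed from Python or Scala
fast = df.withColumn("name_len", F.length(df["name"]))

# Ships every row out to a Python worker and back - this is where Python
# DataFrame code becomes slower than Scala again
py_len = F.udf(lambda s: len(s), IntegerType())
slow = df.withColumn("name_len", py_len(df["name"]))
```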
04-29-2016
10:37 AM
1 Kudo
Hello Pedro, Mark might be the person to ask. There are essentially dozens of ways to integrate SAS with Hadoop:
- Directly reading from HDFS into SAS
- Using Hive like any other database (a pretty good option)
- Using embedded processes for fast data loading and processing
- SAS Grid for Hadoop (running jobs in YARN using embedded processes)
- SAS LASR, an in-memory data store that uses HDFS as its storage and can connect to Hadoop in parallel

https://community.hortonworks.com/articles/4689/getting-started-with-sas-and-hadoop.html
04-29-2016
10:19 AM
2 Kudos
Hi Chokroma, so the action is still not supported. However, there is a tech note out on how to use it. It seems you used that? http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_HDP_RelNotes/content/community_features.html The method you mention was added in Spark 1.5, so it seems like you have an old jar file somewhere. It should be the spark-assembly jar. The one in the sharelib is 1.6, so that cannot be it; you must have an old version around somewhere. https://issues.apache.org/jira/browse/SPARK-3071
04-29-2016
10:02 AM
"There is no, out of the box approach, that allows discovery and routing of the client to the application after it starts or upon container failure." Is there any way apart from node labels to tell slider to request containers on some nodes of the cluster? I fear otherwise this is not very useful. However if you could say: Start containers on datanodes 1-4 and try to keep them up. It would be quite useful. You could have a load balancer in front of it for high availability. Without that I do not see the usecases. I mean you could do that with nodelabels I suppose but it would be a big effort.
04-28-2016
04:57 PM
3 Kudos
You would have to make sure that mapreduce.framework.name is set correctly (yarn, I suppose) and that the mapred config files are there, but first please verify that your nameNode parameter is set correctly. HDFS is very exact about it and requires the hdfs:// prefix, so hdfs://namenode:8020 instead of namenode:8020.
04-27-2016
02:49 PM
What do you mean by memory? As far as I know, a temporary table is just like any other table, with the one exception that it will be cleaned up when the session ends. So you can choose any storage format, but it will be on HDFS. So it depends: if you only need it once, I agree ORC is most likely not a good choice, but if you create a temp table once and then query it a couple of times, ORC definitely makes sense to me.

Edit: Interesting, you could use the HDFS storage policies here. Do you have a cluster that has been set up like this? You can still use any kind of storage you want, compressed or not, and I still think that ORC will be good if you use your temporary table a couple of times. Starting in Hive 1.1.0, the storage policy for temporary tables can be set to memory, ssd, or default with the hive.exec.temporary.table.storage configuration parameter (see HDFS Storage Types and Storage Policies).
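As a rough sketch of how that could look from a client session, assuming Hive 1.1.0+ with storage policies configured on the cluster and using the pyhive client against HiveServer2 (host, database, and table names are placeholders):

```python
# Sketch: session-level storage policy for temporary tables, then an ORC
# temporary table that is reused within the session. Names are placeholders.
from pyhive import hive

conn = hive.Connection(host="hiveserver2.example.com", port=10000,
                       database="default")
cur = conn.cursor()

# Hive 1.1.0+: keep temporary-table data on 'memory', 'ssd' or 'default' storage
cur.execute("SET hive.exec.temporary.table.storage=memory")

# Pay the ORC write cost once, then query the temp table several times
cur.execute("CREATE TEMPORARY TABLE tmp_events STORED AS ORC AS "
            "SELECT * FROM events WHERE event_day = 20160427")
cur.execute("SELECT COUNT(*) FROM tmp_events")
print(cur.fetchall())
```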
04-27-2016
01:38 PM
Which temporary tables are we talking about? Tables you create with CREATE TEMPORARY TABLE? These can have any storage format you want, so if you create one as ORC it definitely WILL be compressed. Or what do you mean by "compression is enabled"? https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/TruncateTable

There are also some internal structures, for example the dataset that is generated by the Tez job before HiveServer2 returns it to the client. This can be a text or sequence file (configurable), but I heard there is a JIRA to use ORC for it instead.
04-26-2016
06:02 PM
@Kevin Sievers Hi Kevin, your commands look good to me, but somehow it does not pick up the number of reduce tasks. You are right, Hadoop should be MUCH faster, but the single reduce task, and even weirder the single mapper, seem to be the problem. And I assure you it runs with a lot of mappers and 40 reducers and loads and transforms around 300 GB of data in 20 minutes on a 7-datanode cluster. So basically I have NO idea why it uses only one mapper, I have no idea why it has the second reducer AT ALL, and I have no idea why it ignores the mapred.reduce.tasks parameter. I think a support ticket might be in order.
set hive.tez.java.opts = "-Xmx3600m";
set hive.tez.container.size = 4096;
set mapred.reduce.tasks=120;
CREATE EXTERNAL TABLE STAGING ...
...
insert into TABLE TARGET partition (day = 20150811) SELECT * FROM STAGING distribute by DT ;
04-26-2016
04:51 PM
@Adnan Ahmed Actually, the default "block size" for WASB IS 512 MB, so that explains it. http://i1.blogs.msdn.com/b/bigdatasupport/archive/2015/02/17/sqoop-job-performance-tuning-in-hdinsight-hadoop.aspx "dfs.block.size which is represented by fs.azure.block.size in Windows Azure Storage Blob, WASB (set to 512 MB by default), max split size etc."