Member since: 09-23-2015
Posts: 800
Kudos Received: 897
Solutions: 185
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2089 | 08-12-2016 01:02 PM
 | 1233 | 08-08-2016 10:00 AM
 | 1214 | 08-03-2016 04:44 PM
 | 2543 | 08-03-2016 02:53 PM
 | 707 | 08-01-2016 02:38 PM
07-15-2016
11:06 AM
There is a control in Ambari under HDFS, right at the top, for the NameNode heap. You should not set the same heap size for all components, since most need much less memory than the NameNode. For the NameNode a good rule of thumb is 1 GB of heap per 100 TB of data in HDFS (plus a couple of GB as a base, so 4-8 GB minimum), but it needs to be tuned based on workload. If you suspect your memory settings are insufficient, look at the memory and GC behaviour of the JVM.
07-15-2016
10:09 AM
3 Kudos
Ambari applies some heuristics, but it never hurts to double-check.
- DataNode: 1 GB works, but for typical nodes (16 cores, 12-14 drives, 128-256 GB RAM) I normally set it to 4 GB.
- NameNode: depends on the HDFS size. A good rule of thumb is 1 GB of heap per 100 TB of data in HDFS.
- Spark: this depends entirely on your Spark needs (not talking about the history server, but the defaults for executors). The more power you need, the more executors and the more RAM in them (up to 32 GB per executor is apparently good).
- YARN: Ambari's heuristics are decent, but I normally tune them anyway. In the end you should adjust the sizes until your cluster has good CPU utilization, but that is a more elaborate exercise and depends on your use cases.
07-15-2016
09:55 AM
@mike harding Hmmm, I think your cluster has bigger problems. It shows that only 2 of the 700 tasks are actually running. How many slots do you have in your cluster, i.e. what are the YARN min/max container settings and the total memory in the cluster? The tasks still shouldn't take that long, because each should only get 1/700th of the data, so something is just weird. I think you might have to call support.
07-13-2016
11:20 AM
@mike harding Yeah, the problem is that Tez tries to create an appropriate number of map tasks. Too many is wasteful because of per-task overhead; too few and tasks are slow. Since your data volume is tiny, it essentially creates one task. However, since you use a cartesian join and blow up your 50k rows into 2.5 billion computations, that is not appropriate in your case. You need at least as many tasks as you have task slots in your cluster to properly utilize your processing capacity. To increase the number of tasks you have multiple options:

a) My option with the grouping parameters. You can tell Tez to use more mappers by changing the grouping parameters (essentially telling Tez how many bytes each map task should process). However, this is a bit of trial and error, since you need to specify a byte size per mapper, and at these small data volumes it gets a bit iffy.

b) Gopal's approach using a reducer. This is more elegant: instead of doing the join in the mappers, you add a shuffle before it and set the number of reducers to the number you want (Gopal used 4096, but your cluster is small, so I would use less). You add a subselect with a SORT BY, essentially forcing a reduce step before your computation. In the end you will have Map (1) -> Shuffle -> Reducer (SORT BY, 4096 tasks) -> your spatial computations -> Reduce (group by aggregate). A rough sketch follows below. Makes sense?
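A minimal sketch of option b), assuming hypothetical tables point_a/point_b and a dist() UDF (all names are illustrative; pick a reducer count roughly matching your task slots):

-- fix the number of reducers used by the SORT BY stage
set mapred.reduce.tasks=24;

select a.id, b.id
from (select * from point_a sort by id) a,   -- the SORT BY subselect forces a shuffle/reduce step
     point_b b
where dist(a.lat, a.lon, b.lat, b.lon) < 1000;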
07-13-2016
11:07 AM
1 Kudo
My first question would be: why do you want to do that? If you want to manage your cluster, you would normally install something like pssh, Ansible, or Puppet and use that to manage it. You put that on one control node, define a list of servers, and can then move data or execute scripts on all of them at the same time. You can also do something very simple like that with a one-line ssh loop.

To execute a script on all nodes: for i in server1 server2; do echo $i; ssh $i "$1"; done

To copy a file to all nodes: for i in server1 server2; do scp "$1" $i:"$2"; done

(Both require passwordless ssh from the control node to the cluster nodes.)

If, on the other hand, you want to ship job dependencies, something like the distributed MapReduce cache is normally a good idea. Oozie provides the <file> tag to upload files from HDFS into the execution directory of the job. So honestly, if you go into more detail about what you ACTUALLY want, we might be able to help more.
07-13-2016
08:04 AM
2 Kudos
In general the normal (thick) client is faster, unless your client server is heavily CPU constrained. The PQS is essentially a proxy in front of the thick client, so you add extra work and latency (especially with larger result sets, which the thick client reads in parallel from all regions but which have to be pulled through a single pipe from the PQS). If you can connect the clients directly to the region servers and the clients are not heavily CPU constrained, I would go with the normal client.
07-12-2016
11:57 PM
Good to have the higher power double-check 😉. The SORT BY is a nice trick, I will use that from now on. I didn't actually see the sum; I thought he returned a list of close coordinates.
07-12-2016
11:13 AM
5 Kudos
"from a, b" Unless I am mistaken ( I didn't look at it in too much detail ) You are doing a classic mistake. A cartesian join. There is no join condition in your join. So you are creating: 45,000 * 54000 = 2000000000 rows. This is supposed to take forever. If you want to make this faster you need to join the two tables by a join condition like a key or at the very least restrict the combinations a bit by joining regions with each other. If you HAVE to compare every row with every row you need to somehow get him to distribute data more. You can decrease the split size for example to make this work on more mappers. Tez has the min and max group size parameters to say how much data there will be in a mapper. You could reduce that to something VERY small to force him to split up the base data set into a couple rows each. But by and large cartesian joins are just mathematically terrible. You seem to be doing some geodetic stuff so perhaps you could generate a set of overlapping regions in your data set and make sure that you only compare values that are in the same regions there are a couple tricks here. Edit: a) If you HAVE to compare all rows with all rows. 2b comparisons are not too much for a cluster. Use the following tez parameters to split up your computations into a lot of small tasks. Per default this is up to a GB but you could force it to 100 kb or so and make sure you get at least 24 mappers or so out of it: tez.grouping.max-size tez.grouping.min-size b) Put a distance function in a join condition where distance < 1000 or so. So you discard all unwanted rows before the shuffle. He still needs to do 2b computations but at least he doesn't need to shuffle 2b rows around. c) if all of that fails you need to become creative. One thing I have seen for distance functions is to add quadratic regions of double your max. distance length to your data set ( overlapping ) assign them to the regions and only join them if the two points at least belong to one of the regions.
07-11-2016
05:16 PM
3 Kudos
Hmmm, an access issue? If it is in the root folder, only root can access it. Hive error messages can be pretty generic here and do not distinguish between missing access rights and missing files. If you run beeline, the read is executed by the Hive server, which runs under the hive user, so only that user could access the data. Try it in /tmp; if it works there, you know.
07-11-2016
04:28 PM
1 Kudo
I think the majority of people do not use ssh fencing at all. The reason is that NameNode HA works fine without it. The only issue can be that during a network partition, old client connections to the NameNode that is now supposed to be standby might still exist and return stale data on read-only operations.
- It cannot do any write transactions, since the JournalNode majority prohibits that.
- Normally, if ZKFC works correctly, an active NameNode does not go into zombie mode; it is either dead or it is not.
So the chances of a split brain are low and the impact is pretty limited. If you do use ssh fencing, the important part is that your script must not block, otherwise the failover is stopped; all scripts need to return in a sensible amount of time even if the node is not reachable. Fencing is by definition only an attempt, since most of the time the node is simply down, and the scripts need to return success in the end. So you need to fork with a timeout and then return true. https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html#Verifying_automatic_failover
07-11-2016
04:20 PM
1 Kudo
If you have a general location, you can only use time-based scheduling (every 15 minutes, for example); you cannot use retention or handle late arrivals etc. All the advanced "wait for data to arrive" features of Oozie are essentially out. It essentially mimics the datasets in Oozie: https://oozie.apache.org/docs/4.2.0/CoordinatorFunctionalSpec.html#a5._Dataset
07-11-2016
03:45 PM
1 Kudo
You need to look into the logs, most likely the YARN logs of the map task of your Oozie launcher. They contain the Sqoop command execution and any errors you would normally see on the command line. You can get them from the ResourceManager UI (click on your Oozie launcher job and drill through to the map task) or with yarn logs -applicationId <application id>. Any issues in the actual data transfer show up in the MapReduce job that Sqoop kicks off, which is a separate job.
07-11-2016
03:42 PM
1 Kudo
Pretty sure it's a community feature and not supported yet, even if it has found its way into the release. It is also very limited right now. Can you really replace your scheduling with it? It still seems to need a bit of time in the oven.
07-11-2016
10:31 AM
Regarding how: refer to Sunile. Pig is nice and flexible, Hive is good if you know SQL and your RFID data is already basically in a flat table format, and Spark also works well... But the question is whether you really want to process 100 GB of data on the sandbox. The memory settings are tiny, there is a single drive, data is not replicated... If you do it like this, you could just as well use Python on a local machine. If you want a decent environment, you might want to set up 3-4 nodes on a VMware server, perhaps with 32 GB of RAM each. That would give you a nice little environment, and you could actually do some fast processing.
07-11-2016
09:50 AM
@Sunile Manjee yeah, it works; Alex tested it as well. Falcon does not point to one RM. It goes to the yarn-site.xml, finds the RM pair that contains the one you specified, and then tries both.
07-07-2016
11:49 AM
Yeah, not sure why they call it multiple times. I think the record reader classes are simply instantiated multiple times during split creation and other steps, for various reasons. In the end your code needs to be able to survive empty calls and avoid duplicating connection objects; I know of no way to prevent the extra calls. In my example I could see relatively easily whether the conf object was valid, because my config fields (the storage handler parameters) were not always present in the object. I then simply initialized the connection object once and made sure not to create it a second time.
07-06-2016
12:03 PM
You mean to exclude two columns? That one would definitely work: (id1|id2)?+.+ Your version says: id1 once or not at all, followed by id2 once or not at all, followed by anything else. So it should work too, I think.
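For reference, a minimal sketch against a hypothetical table t with columns id1, id2, name, age (remember that quoted identifiers have to be disabled first):

set hive.support.quoted.identifiers=none;

-- selects every column of t except id1 and id2
select `(id1|id2)?+.+` from t;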
07-05-2016
09:45 PM
3 Kudos
It depends a bit on what you do with it. Each time you send the data over the network, a native datatype will be much faster: 999999999 is 4 bytes as an integer but about 10 bytes as a string (characters plus length). That holds for shuffles and any operation that needs to copy data (although some operations like a map might not be affected, since in Java the strings are not necessarily copied if you only assign them). It also costs more RAM in the executor, which can be a considerable factor in Spark when doing aggregations/joins/sorts etc. And finally, when you actually need to do computations on these columns, you have to cast them, which is a fairly costly operation you could otherwise save yourself. So depending on your use case, the performance and RAM difference can be anywhere between insignificant and considerable.
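As a tiny illustration in Spark SQL, assuming a hypothetical table events whose amount column is stored as a string:

-- the cast has to run on every row before the aggregation can use the value
select id, sum(cast(amount as int)) as total
from events
group by id;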
07-05-2016
07:26 PM
3 Kudos
One common problem in SQL is that when you join two tables you get duplicate column names from the two tables. When you then, for example, want to create a CTAS, you get "duplicate column name" errors. You also often want to exclude the join key from the result set, since it is by definition duplicated. Database schemas often prefix column names with a letter derived from the table to fix at least the first issue, like TPC-H: lineitem.l_key and orders.o_key. A common approach is to explicitly specify all column names in your SELECT list. However, Hive has a cool/dirty trick up its sleeve to make this easier: regular expressions to specify column names.

Test setup:

describe tb1;
id int
name string

describe tb2;
id int
age int

You then have to disable the use of quotes in identifiers, because that interferes with the regex:

set hive.support.quoted.identifiers=none;

Now you can use a Java regex to select all columns from the right table except the key, i.e. all columns but the duplicate. If you have non-join-key columns that are duplicated, you can exclude them too and re-add them renamed with an AS clause (there is a small sketch of that at the end of this article):

create table tb3 as
select tb1.*, tb2.`(id)?+.+`
from tb1, tb2
where tb1.id = tb2.id;

You can see that I select all columns from the left table and then use backquotes to specify a regular expression for the columns I want from the right side. The regex essentially asks for any string unless it starts with id (it means "the string id once or not at all (?+), followed by any string"). If the whole column name is id, it is not matched, because the remainder of the regex still needs to match something. You could also specify multiple columns: (id|id2)?+.+ or (id*)?+.+. This gives a result table with all columns from the left table and all columns but the key column from the right table:

describe tb3;
id int
name string
age int

Hive is really cool.
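As a quick illustration of the rename trick, a sketch assuming tb2 also had a duplicate non-key column called name: exclude it via the regex and re-add it under a new alias (this relies on hive.support.quoted.identifiers=none from above):

create table tb4 as
select tb1.*, tb2.`(id|name)?+.+`, tb2.name as tb2_name
from tb1, tb2
where tb1.id = tb2.id;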
Tags: Data Processing, Hive, How-To/Tutorial, sql
07-05-2016
04:14 PM
9 Kudos
I was about to say it's not possible. I assume both tables have a column with the same name, which is one of the reasons database schemas often prefix column names with a letter derived from the table, like TPC-H: lineitem.l_key and orders.o_key. In that case you would have to suck it up and name all columns in the join using AS. However, it looks like Hive has some cool/dirty tricks up its sleeve: regular expressions to specify column names. Here is my setup:

hive> describe tb1;
id int
name string

hive> describe tb2;
id int
age int

You then have to disable the use of quotes in identifiers, because that interferes with the regex:

hive> set hive.support.quoted.identifiers=none;

And now you can use a Java regex to select all columns from the right table except the key, i.e. all columns but the duplicate. If you have non-join-key columns that are duplicated, you can exclude them and rename them with AS afterwards:

hive> create table tb3 as select tb1.*, tb2.`(id)?+.+` from tb1, tb2 where tb1.id = tb2.id;

You can see that I select all columns from the left table and then use backquotes to specify a regular expression for the columns I want from the right side. The regex is a mean trick, essentially asking for any string unless it starts with id (it says "id once or not at all, followed by something"). If the whole column name is id, it is not matched, because the remainder of the regex still needs to match something. You could also do (id|id2)?+.+ or (id*)?+.+. This gives me a result table with all columns from the left table and all columns but the key column from the right table:

hive> describe tb3;
id int
name string
age int

Neat. You never stop learning something new. Hive is really cool.

Edit: I actually made a little article out of the question, because these regexes would have made my life much easier multiple times before, so giving them a bit more attention seems to make sense: https://community.hortonworks.com/articles/43510/excluding-duplicate-key-columns-from-hive-using-re.html
07-05-2016
02:23 PM
4 Kudos
The Oozie installation makes changes to the core-site.xml. Specifically, the oozie user needs to be able to impersonate other users so it can kick off a job as the user who owns the Oozie workflow. HDFS lets you configure such powerful users via the proxyuser settings, so to enable them you need to restart HDFS and YARN: hadoop.proxyuser.oozie.groups and hadoop.proxyuser.oozie.hosts.
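For reference, a minimal sketch of what typically ends up in core-site.xml; the wildcard values below are illustrative, and you can restrict them to the Oozie server host and specific groups instead:

hadoop.proxyuser.oozie.hosts=*
hadoop.proxyuser.oozie.groups=*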
07-05-2016
12:59 PM
1) Normally MapReduce reads and creates very large amounts of data. The framework is also parallel, and failed tasks can be rerun, so until all tasks have finished you cannot be sure what the output is. You can obviously write a program that returns data to the caller directly, but this is not the norm. Hive, for example, writes files to a temporary directory, and the HiveServer then uses the HDFS client to read the results. In Pig you have the options to store (save in HDFS) or dump (show on screen) data, but I am not sure whether Pig also uses a temp file here. In MapReduce you can do whatever you want. 2) MapReduce is used when you want to run computations in parallel on the cluster, which is why Pig/Hive utilize it. You can also just read the data directly using the client; in that case, however, you have a single-threaded read.
07-05-2016
11:16 AM
1 Kudo
Not sure what you mean by chunk. Essentially a stream of data is piped into the HDFS write API; every 128 MB a new block is created internally, and inside each block the client buffer sends data whenever a network packet is full (64 KB or so). So essentially a 1 GB file is written into the HDFS API as follows:
- Block1 is created on (ideally the local) node1, with copies on node2 and node3.
- Data is streamed into it in 64 KB packets from the client to node1. Whenever the datanode receives a 64 KB packet, it writes it to disk into the block, tells the client that the write was successful, and at the same time sends a copy to node2.
- node2 writes the packet to its replica of the block and sends the data on to node3.
- node3 writes the packet to its block on disk.
- The next 64 KB packet is sent from the client to node1, and so on.
- Once 128 MB are full, the next block is created.
The write is successful once the client has received notification from node1 that it successfully wrote the last block. If node1 dies during the write, the client rewrites the blocks on a different node.
07-05-2016
10:45 AM
7 Kudos
1. Where does this splitting of the huge file take place? A client is a (mostly Java) program using the HDFS filesystem API to write a file to HDFS. This can be the hadoop command line client or a program running in the cluster (like MapReduce, ...). In that case each mapper/reducer that writes to HDFS writes one file (you may have seen MapReduce output folders that contain part-0000, part-0001 files; these are the files written by each mapper/reducer, and MapReduce treats such a folder as if it were one big file). If you write a file with a client (let's say 1 TB), the file is written into the API and transparently chunked into 128 MB blocks by it.

2. Does the client form 3 pipelines for each block to replicate, which run in parallel? No, it's a chain. The block is committed once it is persisted on the first node, but it is written to the other two nodes in parallel along a chain: Client -> node1 -> node2 (different rack) -> node3 (same rack as node2).

3. Does DN1, which received B1, start sending the data to DN2 before 128 MB of its block is full? Yes, the HDFS API writes in buffered packets (I think 64 KB or so), and every packet is forwarded straight through.

"Doesn't that contradict the replication principle where we get the complete block of data and then start replicating?" I have never heard of that replication principle, and it is definitely not true in HDFS. A file doesn't even need three copies to be written successfully; a put operation succeeds once the data is persisted on ONE node. The NameNode makes sure that the correct replication level is reached eventually.

"Can you also provide the possible reasons why the flow is not the other way?" Because it's faster. If you had to wait for three copies sequentially, a put would take much longer to succeed. A lot in HDFS is about efficiency.
07-04-2016
11:42 AM
1 Kudo
OK, for an OLTP requirement with user lookups you are looking at HBase/Phoenix in the HDP distribution (other possibilities would be Cassandra, Gemfire, ...). HBase is a NoSQL datastore that provides a simple get/put/scan API on flat tables with a unique primary key and an arbitrary number of fields. Phoenix is a SQL layer on top of HBase, so it is very fast for key lookups/inserts of single rows and can also do aggregations over thousands to millions of rows very efficiently. Both scale out very well, since you can add region servers dynamically. Phoenix can also maintain secondary indexes, which might be helpful in your scenario (you can obviously do that directly in HBase by adding a translation table and maintaining it yourself).
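A minimal Phoenix sketch, assuming a hypothetical users table keyed by user id with a secondary index on email (all names are illustrative):

-- Phoenix table backed by HBase; the primary key becomes the HBase row key
create table if not exists users (
    user_id    varchar not null primary key,
    email      varchar,
    last_login timestamp
);

-- secondary index so lookups by email avoid a full table scan
create index if not exists users_email_idx on users (email);

-- point lookup on the key: a fast single-row get
select * from users where user_id = 'u-42';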
07-01-2016
01:06 PM
38 digits is the max for Decimal. You cannot do Decimal(38,2), but you could do Decimal(36,2) or Decimal(10,10) or whatever.
07-01-2016
11:17 AM
5 Kudos
The reason is that a distributed transactional system with Paxos or a similar algorithm needs a quorum (majority): a transaction is committed once more than 50% of the nodes confirm it. You could run 4 JournalNodes / ZooKeeper nodes as well, but you would not get any benefit over 3 nodes and you would add overhead: 4 nodes can still only survive 1 failed node, because 3 nodes are the smallest majority. Therefore you use an odd number. 3 nodes can survive 1 failure, 5 nodes can survive 2 failures, 7 nodes can survive 3 failures, and so on. https://en.wikipedia.org/wiki/Paxos_(computer_science)
06-30-2016
05:17 PM
2 Kudos
You should be able to simply cache your access object in a class variable and only create it if it has not been created already. You just need to be a bit careful, since the functions are sometimes called with empty conf objects first. That is what the JDBC storage handler does.
06-30-2016
02:42 PM
2 Kudos
This is something you should discuss with a sales person, especially licence costs: http://hortonworks.com/contact-us/ The big advantage of HDP is that it is 100% open source and always includes the full Apache version of all components. The components are also standard open Hadoop, so there is no lock-in like you have with the proprietary components of MapR (the MapR file system and their other proprietary components). The second big advantage of Hortonworks is that we have the widest range of committers across the different components, so you get the best support and the most influence when adding features. A simple feature-by-feature comparison doesn't really capture these fundamental differences.
06-30-2016
10:45 AM
2 Kudos
As said, it's a question of priorities. Connectors normally go from a community project into the actual product. Once they are in, they have to be tested and supported, and often need changes such as Kerberos support, so including something is not zero effort. It is just lower on the priority list than the Kafka connection, which is the standard approach for ingesting realtime data into a big data environment. If you need the integration, you could post an idea on the community board; our product management teams read them. Regarding the machine learning inquiry, please open a new question; that would allow people who have tried this before to see it more easily. Sorry for not being of more help 🙂