Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 5431 | 08-12-2016 01:02 PM
 | 2204 | 08-08-2016 10:00 AM
 | 2613 | 08-03-2016 04:44 PM
 | 5519 | 08-03-2016 02:53 PM
 | 1430 | 08-01-2016 02:38 PM
02-11-2016
04:05 PM
Did it also generate the logs? It's a bit weird, that is what I just downloaded from the tutorial link. For me it needed the return, it needs to be on the same line as the for, and I don't have the main at the top that you have.
02-11-2016
03:52 PM
Can you have a look at the file in the sandbox (with cat or vi)? It might be different from the Windows version (that's Notepad++, right?).
02-11-2016
03:02 PM
1 Kudo
And if you do and still have the problem, have a look into the script file. It is not that long, and it should be pretty obvious if a return is not lined up with the function before it. The comment says that he saw some indentation errors. Since I do not have the problem, they must have been added somehow, perhaps by opening the file in Windows or something. Or there was a bad version hosted that was since updated? Not sure.
02-11-2016
02:50 PM
2 Kudos
Just downloaded and ran the generate_logs file in the sandbox and it works fine for me. Are you sure you didn't modify the script somehow? Python is very sensitive to tab changes; it's easy for a return to suddenly end up a couple of characters further to the left or right and with that no longer be inside the function.
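To illustrate the kind of slip I mean (a hypothetical snippet, not the actual generate_logs script), compare a correctly indented return with one that has drifted out of the function:

```python
# Hypothetical example, not the tutorial's generate_logs.py.
def generate_lines(count):
    lines = []
    for i in range(count):
        lines.append("line %d" % i)
    return lines  # indented one level: still part of the function

# If the return drifts back to column 0 it is no longer inside the
# function and Python refuses to run the file:
#
# def generate_lines(count):
#     lines = []
#     for i in range(count):
#         lines.append("line %d" % i)
# return lines  # SyntaxError: 'return' outside function
```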
02-11-2016
11:44 AM
1 Kudo
Regarding the bug (with thanks to @Neeraj Sabharwal): https://community.hortonworks.com/questions/14383/dfsinputstream-has-been-closed-already.html So the get is simply a single get on an HDFS folder? Then a slow network connection would be my only guess.
02-11-2016
11:17 AM
3 Kudos
How do you copy the small files? Are you running one hadoop fs -put for every small file (for example in a shell script)? Then I would expect bad performance, because the hadoop client is a Java application and needs some setup time for each command. If you are copying everything with a single put command, then this would be very bad performance: I normally get 200-300 GB/hour, so 60 MB should be done in seconds. I would check the network speed by doing a simple scp from your client to a node of the cluster. Regarding small files:
- A put of small files is definitely slower than a put of one big file, but it shouldn't be 20 minutes. I once benchmarked it and I think it was 2-3 times slower to write very small files.
- Why do you copy such tiny files into HDFS at all? This is bad for Hadoop in general. Try to find a way to merge them (if they are data files; if they are Oozie definitions or similar it's obviously different).
The "input stream has been closed" message is by itself not dangerous; normal put commands can show it in many scenarios (a minor bug that was introduced into HDFS and has been fixed since).
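As a rough sketch of the put-per-file versus single-put difference (the paths and file pattern are made up, and it assumes the hadoop client is on the PATH):

```python
# Sketch only; "data/" and "/landing" are hypothetical paths.
import glob
import subprocess

# Slow variant: one hadoop client (a JVM with its own start-up time) per file.
# for f in glob.glob("data/*.log"):
#     subprocess.run(["hadoop", "fs", "-put", f, "/landing/"], check=True)

# Faster variant: a single put of the whole directory, one client start-up.
subprocess.run(["hadoop", "fs", "-put", "data", "/landing/"], check=True)
```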
02-11-2016
10:07 AM
6 Kudos
It really depends on your scenario which block size is better. As a simple example: let's assume your cluster has 50 task slots, and for simplicity that a task needs 5 minutes to analyze 128 MB and 1 minute to set up a map task.

If you want to analyze 1.28 GB of data, you need 10 tasks at 128 MB blocks, which can all run in the cluster in parallel, so in total your job takes 5+1 = 6 minutes. With 256 MB blocks you only need 5 tasks, but they take 10+1 = 11 minutes and will be slower. So here 128 MB blocks are faster.

If you have 128 GB of data, you need 1000 tasks at 128 MB block size, or 20 waves, which means 20 * 6 = 120 minutes. With 256 MB blocks you need 10 waves, or 10 * (10+1) = 110 minutes. So here your job is faster because you have less task setup time.

It all gets more complicated if you take into account Tez task reuse, compression, the type of analytics you run, etc. ORC for example already uses 256 MB blocks by default because it can normally skip a lot of data internally. On the other hand, if you run heavy analytic tasks on smaller data (like data mining), a smaller block size might be better because your tasks will be heavily CPU bound and a single block could take a long time.

So the answer, as usual, is: it depends, and you have to try out what works in your specific scenario. 128 MB is a good default, but 256 MB might work as well. Or not. For the rest: what Artem said, 3x replication, really small files are bad, and HAR files can be a good way to work around them.
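If you want to replay the arithmetic above, here is a small sketch; the 50 slots, 5 minutes per 128 MB and 1 minute of setup are the assumptions from the example, and 128 GB is rounded to 1000 blocks of 128 MB as in the text:

```python
# Back-of-the-envelope model from the example above; all numbers are assumptions.
import math

SLOTS = 50            # parallel task slots in the cluster
MIN_PER_128MB = 5.0   # processing minutes per 128 MB of input
SETUP_MIN = 1.0       # setup minutes per map task

def job_minutes(data_mb, block_mb):
    tasks = math.ceil(data_mb / block_mb)
    waves = math.ceil(tasks / SLOTS)
    minutes_per_task = MIN_PER_128MB * (block_mb / 128.0) + SETUP_MIN
    return waves * minutes_per_task

for data_mb in (1280, 128000):        # 1.28 GB and (roughly) 128 GB
    for block_mb in (128, 256):
        print("%6d MB data, %3d MB blocks -> %3.0f minutes"
              % (data_mb, block_mb, job_minutes(data_mb, block_mb)))
```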
02-11-2016
09:57 AM
2 Kudos
That does not sound like a Hive error at all, but like a library dependency you are missing: http://stackoverflow.com/questions/4928271/how-to-install-jstl-the-absolute-uri-http-java-sun-com-jstl-core-cannot-be-r
02-11-2016
01:23 AM
2 Kudos
In the Capacity Scheduler you could set up a high-priority queue at 90% of the cluster with elasticity (max capacity) up to 100%, and a low-priority queue at 10% with elasticity (max capacity) up to 100%. In this case, jobs in the first queue would always get 90% of the cluster if they need it, and the second queue would only get a small share of the cluster whenever the high-priority queue has queries. The low-priority queue could still monopolize the cluster if it has very long running tasks, but you could fix that with preemption (or by making sure tasks in your cluster don't run for too long, which they shouldn't anyway).
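As a rough sketch of the settings involved (the queue names "highprio" and "lowprio" are made up, and the exact property set depends on your version), the Capacity Scheduler side would look roughly like this, plus the preemption switch in yarn-site.xml:

```
# capacity-scheduler.xml (sketch; "highprio"/"lowprio" are hypothetical queue names)
yarn.scheduler.capacity.root.queues = highprio,lowprio
yarn.scheduler.capacity.root.highprio.capacity = 90
yarn.scheduler.capacity.root.highprio.maximum-capacity = 100
yarn.scheduler.capacity.root.lowprio.capacity = 10
yarn.scheduler.capacity.root.lowprio.maximum-capacity = 100

# yarn-site.xml: enable the scheduler monitor so long-running low-priority
# containers can be preempted back to the configured capacities
yarn.resourcemanager.scheduler.monitor.enable = true
```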
02-09-2016
01:32 PM
1 Kudo
Uber task sounds interesting, I need to check this out. Also, having a dedicated sub-queue for the Oozie actions is most likely the way to go for me, because I have one type of action that doesn't take a kill nicely (it would result in duplicated data). I just remember that at the summit last year someone mentioned they had some configuration where preemption resulted in essentially endlessly running tasks; I cannot remember the constellation in which this happens.