Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 5431 | 08-12-2016 01:02 PM
 | 2204 | 08-08-2016 10:00 AM
 | 2613 | 08-03-2016 04:44 PM
 | 5519 | 08-03-2016 02:53 PM
 | 1430 | 08-01-2016 02:38 PM
02-11-2016
04:05 PM
Did it also generate the logs? It's a bit weird, that is what I just downloaded from the tutorial link. For me it needed the return, it needs to be on the same line as the for, and I don't have the main at the top that you have.
02-11-2016
03:52 PM
Can you have a look at the file in the sandbox (with cat or vi)? It might be different from the Windows version (that's Notepad++, right?).
02-11-2016
03:02 PM
1 Kudo
And if you do and still have the problem, have a look into the script file. It is not that long, and it should be pretty obvious if a return is not lined up with the function before it. The comment says that he saw some indentation errors. Since I do not have the problem, they must have been added somehow, perhaps by opening the file in Windows or something. Or there was a bad version hosted that was since updated? Not sure.
02-11-2016
02:50 PM
2 Kudos
Just downloaded and ran the generate_logs file in the sandbox and it works fine for me. Are you sure you didn't modify the script somehow? Python is very sensitive to tab changes; it's easy for a return to suddenly end up a couple of characters further to the left or right and with that no longer be inside the function.
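To illustrate the kind of slip I mean (a hypothetical snippet, not the actual generate_logs script), compare a correctly indented return with one that has drifted out of the function:

```python
# Hypothetical example, not the tutorial's generate_logs.py.
def generate_lines(count):
    lines = []
    for i in range(count):
        lines.append("line %d" % i)
    return lines  # indented one level: still part of the function

# If the return drifts back to column 0 it is no longer inside the
# function and Python refuses to run the file:
#
# def generate_lines(count):
#     lines = []
#     for i in range(count):
#         lines.append("line %d" % i)
# return lines  # SyntaxError: 'return' outside function
```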
02-11-2016
11:44 AM
1 Kudo
Regarding the bug (with thanks to @Neeraj Sabharwal): https://community.hortonworks.com/questions/14383/dfsinputstream-has-been-closed-already.html So the get is simply a single get on an HDFS folder? Then a slow network connection would be my only guess.
02-11-2016
11:17 AM
3 Kudos
How do you copy the small files? Are you running one hadoop fs -put for every small file (for example in a shell script)? Then I would expect bad performance, because the hadoop client is a Java application and needs some setup time for each command. If you are copying everything with a single put command, then this would be very bad performance: I normally get 200-300 GB/hour, so 60 MB should be done in seconds. I would check the network speed by doing a simple scp from your client to a node of the cluster. Regarding small files:
- A put of small files is definitely slower than a put of one big file, but it shouldn't be 20 minutes. I once benchmarked it and I think it was 2-3 times slower to write very small files.
- Why do you copy such tiny files into HDFS at all? This is bad for Hadoop in general. Try to find a way to merge them (if they are data files; if they are Oozie definitions or similar it's obviously different).
The "input stream has been closed" message is by itself not dangerous; normal put commands can show it in many scenarios (a minor bug that was introduced into HDFS and has been fixed since).
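As a rough sketch of the put-per-file versus single-put difference (the paths and file pattern are made up, and it assumes the hadoop client is on the PATH):

```python
# Sketch only; "data/" and "/landing" are hypothetical paths.
import glob
import subprocess

# Slow variant: one hadoop client (a JVM with its own start-up time) per file.
# for f in glob.glob("data/*.log"):
#     subprocess.run(["hadoop", "fs", "-put", f, "/landing/"], check=True)

# Faster variant: a single put of the whole directory, one client start-up.
subprocess.run(["hadoop", "fs", "-put", "data", "/landing/"], check=True)
```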
02-11-2016
10:07 AM
6 Kudos
It really depends on your scenario which block size is better. As a simple example: let's assume your cluster has 50 task slots, and for simplicity that a task needs 5 minutes to analyze 128 MB and 1 minute to set up a map task.

If you want to analyze 1.28 GB of data, you need 10 tasks at 128 MB blocks, which can all run in the cluster in parallel, so in total your job takes 5+1 = 6 minutes. With 256 MB blocks you only need 5 tasks, but they take 10+1 = 11 minutes and will be slower. So here 128 MB blocks are faster.

If you have 128 GB of data, you need 1000 tasks at 128 MB block size, or 20 waves, which means 20 * 6 = 120 minutes. With 256 MB blocks you need 10 waves, or 10 * (10+1) = 110 minutes. So here your job is faster because you have less task setup time.

It all gets more complicated if you take into account Tez task reuse, compression, the type of analytics you run, etc. ORC for example already uses 256 MB blocks by default because it can normally skip a lot of data internally. On the other hand, if you run heavy analytic tasks on smaller data (like data mining), a smaller block size might be better because your tasks will be heavily CPU bound and a single block could take a long time.

So the answer, as usual, is: it depends, and you have to try out what works in your specific scenario. 128 MB is a good default, but 256 MB might work as well. Or not. For the rest: what Artem said, 3x replication, really small files are bad, and HAR files can be a good way to work around them.
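If you want to replay the arithmetic above, here is a small sketch; the 50 slots, 5 minutes per 128 MB and 1 minute of setup are the assumptions from the example, and 128 GB is rounded to 1000 blocks of 128 MB as in the text:

```python
# Back-of-the-envelope model from the example above; all numbers are assumptions.
import math

SLOTS = 50            # parallel task slots in the cluster
MIN_PER_128MB = 5.0   # processing minutes per 128 MB of input
SETUP_MIN = 1.0       # setup minutes per map task

def job_minutes(data_mb, block_mb):
    tasks = math.ceil(data_mb / block_mb)
    waves = math.ceil(tasks / SLOTS)
    minutes_per_task = MIN_PER_128MB * (block_mb / 128.0) + SETUP_MIN
    return waves * minutes_per_task

for data_mb in (1280, 128000):        # 1.28 GB and (roughly) 128 GB
    for block_mb in (128, 256):
        print("%6d MB data, %3d MB blocks -> %3.0f minutes"
              % (data_mb, block_mb, job_minutes(data_mb, block_mb)))
```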
02-11-2016
09:57 AM
2 Kudos
That does not sound like a Hive error at all, but like a library dependency you are missing: http://stackoverflow.com/questions/4928271/how-to-install-jstl-the-absolute-uri-http-java-sun-com-jstl-core-cannot-be-r
02-11-2016
01:23 AM
2 Kudos
In the Capacity Scheduler you could set up a high-priority queue at 90% of the cluster with elasticity (max capacity) up to 100%, and a low-priority queue at 10% with elasticity (max capacity) up to 100%. In this case, jobs in the first queue would always get 90% of the cluster if they need it, and the second queue would only get a small share of the cluster whenever the high-priority queue has queries. The low-priority queue could still monopolize the cluster if it has very long running tasks, but you could fix that with preemption (or by making sure tasks in your cluster don't run for too long, which they shouldn't anyway).
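As a rough sketch of the settings involved (the queue names "highprio" and "lowprio" are made up, and the exact property set depends on your version), the Capacity Scheduler side would look roughly like this, plus the preemption switch in yarn-site.xml:

```
# capacity-scheduler.xml (sketch; "highprio"/"lowprio" are hypothetical queue names)
yarn.scheduler.capacity.root.queues = highprio,lowprio
yarn.scheduler.capacity.root.highprio.capacity = 90
yarn.scheduler.capacity.root.highprio.maximum-capacity = 100
yarn.scheduler.capacity.root.lowprio.capacity = 10
yarn.scheduler.capacity.root.lowprio.maximum-capacity = 100

# yarn-site.xml: enable the scheduler monitor so long-running low-priority
# containers can be preempted back to the configured capacities
yarn.resourcemanager.scheduler.monitor.enable = true
```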
02-09-2016
01:32 PM
1 Kudo
Uber task sounds interesting, I need to check this out. Also, having a dedicated sub-queue for the Oozie actions is most likely the way to go for me, because I have one type of action that doesn't take a kill nicely (it would result in duplicated data). I just remember that at the summit last year someone mentioned they had some configuration where preemption resulted in essentially endlessly running tasks; I cannot remember the constellation in which this happens.