Member since: 08-16-2016
Posts: 642
Kudos Received: 131
Solutions: 68

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3458 | 10-13-2017 09:42 PM
 | 6227 | 09-14-2017 11:15 AM
 | 3183 | 09-13-2017 10:35 PM
 | 5114 | 09-13-2017 10:25 PM
 | 5760 | 09-13-2017 10:05 PM
02-03-2017
12:30 PM
saranvisa is correct: you should set a minimum, and the maximum should not exceed a single node's memory limits, since a single container cannot run across nodes. There is still the mismatch between what is in the configs and what YARN is using and reporting. On the RM machine, get the process ID for the RM (sudo su yarn -c "jps"), then get the process info for that ID (ps -ef | grep <id>). Does that show it is using the configs from the path that you changed? It should be listed in -classpath.
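If it helps, here is a minimal sketch of that check; the yarn service user and the /etc/hadoop/conf example path are assumptions, so adjust for your layout:

```bash
# Find the ResourceManager PID as the yarn user (assumes the service runs as 'yarn').
RM_PID=$(sudo su yarn -c "jps" | awk '/ResourceManager/ {print $1}')

# Dump the full command line and eyeball the -classpath entry; it should point at
# the config directory you actually edited (for example /etc/hadoop/conf).
ps -ef | grep "$RM_PID" | grep -v grep
```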
01-30-2017
12:07 AM
I'm not terribly familiar with Oozie, but I believe the launcher is separate from the actual job. Also, from the log "-Xmx4096m -Xmx4608m", it is launching with a 4 GB container size and the heap is set to 3 GB. Is it set in the Oozie job settings?
01-29-2017
09:31 PM
It will work. It will diminish the network throughput and could impact cluster performance if the typical workload is network-I/O bound. In my experience, with predominantly 10 GbE networks, I have not been bound by the network running at the default MTU of 1500.
01-29-2017
09:16 PM
Track down container container_e29_1484466365663_87038_02_000001. It is most likely a reducer. I say that because you said both the Map and AM container sizes were set to 2 GB, so the Reduce container size must be 3 GB. Well, in theory the user launching it could have overridden any of them. What is the value of mapreduce.reduce.memory.mb? Let's try another route as well: in the RM UI, for the job in question, does it have any failed maps or reducers? If yes, drill down to the failed one and view the logs. If not, then the AM container OOM'd. From my recollection, though, that is the line the AM logs concerning one of the containers it is responsible for. Anyway, the short of it is: either the Reduce container size is 3 GB, or the user set their own value to 3 GB, as the values in the cluster configs are only the defaults.
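A hedged sketch of where to look; the config path below is a typical gateway location and is an assumption, and the job itself can override any of these values at submit time:

```bash
# Cluster-wide default for reduce containers (the job's own config wins if it sets a value).
grep -A1 'mapreduce.reduce.memory.mb' /etc/hadoop/conf/mapred-site.xml

# Pull the logs for the failed container to confirm which task type it actually was.
# On older Hadoop releases you may also need -nodeAddress <nm_host:port>.
yarn logs -applicationId application_1484466365663_87038 \
  -containerId container_e29_1484466365663_87038_02_000001
```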
01-29-2017
09:05 PM
Does this MR job access HBase at all? This error indicates that the region for trade_all was not accessible. Any errors on the HBase RegionServers? Check the HBase Master UI to see which RegionServer is serving this region and whether it is in the middle of a split.
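If you have shell access to an HBase gateway, something like the sketch below can answer the same questions from the command line; the table name is taken from your error, everything else is illustrative:

```bash
# Regions in transition and which RegionServer hosts each region.
echo "status 'detailed'" | hbase shell

# Locate the region(s) for the table in hbase:meta (row keys start with the table name).
echo "scan 'hbase:meta', {ROWPREFIXFILTER => 'trade_all,'}" | hbase shell
```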
01-28-2017
12:21 PM
I don't know of any way. Hadoop in general doesn't care how long a job takes; it is more concerned with auto-recovery of the platform so that jobs can finish no matter what. You can limit the number of queries or jobs by user or group, and you can limit the resources available to users or groups. I just don't think there is a way to automatically kill jobs or queries running longer than X. I know other products, like Pepperdata, can track this and alert you, but it still requires manual intervention. Can we step back so you can explain what your issue is with long-running jobs? Maybe the root cause can be addressed there, so jobs do not run for so long or hold back others.
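There is no built-in timeout, but as a rough external workaround you could cron a script along these lines; the 12-hour cutoff is only an example, and it stays in dry-run mode until you swap the echo for the kill:

```bash
#!/bin/bash
# Report (or kill) YARN applications running longer than MAX_AGE_MS.
MAX_AGE_MS=$((12 * 60 * 60 * 1000))
NOW_MS=$(($(date +%s) * 1000))

yarn application -list -appStates RUNNING 2>/dev/null |
  awk '$1 ~ /^application_/ {print $1}' |
  while read -r APP; do
    START_MS=$(yarn application -status "$APP" 2>/dev/null |
               awk -F' : ' '/Start-Time/ {print $2}')
    if [ -n "$START_MS" ] && [ $((NOW_MS - START_MS)) -gt "$MAX_AGE_MS" ]; then
      echo "Would kill $APP (running longer than 12h)"
      # yarn application -kill "$APP"
    fi
  done
```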
01-28-2017
12:16 PM
Can you post the container logs for one of the containers that was killed? In the RM UI, drill down through the job until you get the list of Mappers/Reducers that succeeded or failed. Click through to a failed task and then open the logs. You should find an exception in them giving the reason. The code mentioned usually does indicate a heap issue, but I have seen it reported for other reasons a container was killed, such as when preemption strikes.
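As an alternative to clicking through the UI, a sketch like the one below pulls the aggregated logs and greps for the usual kill reasons; the application ID is a placeholder and the patterns are not exhaustive:

```bash
# Requires log aggregation to be enabled; substitute your real application ID.
yarn logs -applicationId <application_id> 2>/dev/null |
  grep -iE 'Killing container|beyond physical memory|beyond virtual memory|preempt|Exit code'
```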
01-26-2017
09:29 PM
1 Kudo
1. Yes, it could. I personally don't like the threshold; it is not a great indicator of there being a small-file issue.
2. The number reported by the DN is for all the replicas. It could mean a lot of small files or just a lot of data. At the defaults it could mean that the DN heap could use a boost, although I always end up bumping it sooner.
3. Yes.
4. Yes. Each file takes up one or more blocks, and the NN has to track each block and its replicas in memory, so a lot of small files can chew through the NN heap quickly. The DN heap is less concerned with the metadata associated with a block; it is more tied to the blocks being read, written, or replicated.
5. I'd worry less about the block count and more about the heap (see the sketch below).
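A quick way to see the file and block counts the NN heap actually has to carry; the unsecured web UI on port 50070 is an assumption (9870 on Hadoop 3, and Kerberos/SSL clusters need a different curl invocation):

```bash
# Total files and blocks tracked by the NameNode, straight from its JMX endpoint.
curl -s 'http://<namenode_host>:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState' |
  grep -E '"(FilesTotal|BlocksTotal)"'

# Per-directory dir/file/byte counts to hunt for small-file hot spots.
hdfs dfs -count /user/*
```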
01-26-2017
09:16 PM
The command will use the instance profile of the host it is launched from. So if you want access for some users but not all, you need to specify the keys in the S3 URI.
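A hedged sketch with the s3a connector: pass the keys per command instead of relying on the instance profile. The property names assume s3a; embedding the keys directly in the URI also works on older releases but leaks them into logs and job configs, so the -D form is usually safer:

```bash
# Placeholders throughout; the -D options must come before the subcommand.
hadoop fs -D fs.s3a.access.key=<ACCESS_KEY> \
          -D fs.s3a.secret.key=<SECRET_KEY> \
          -ls s3a://<bucket>/<path>
```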
01-19-2017
04:55 PM
Yes, check there. I don't know the Hive source code, but I do know that HDFS still does a username/group lookup against the OS.
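A quick way to compare the mapping HDFS actually resolves against what the OS reports; the username is a placeholder:

```bash
# Group mapping as resolved by the NameNode for this user.
hdfs groups <username>

# Group membership as the local OS sees it.
id <username>
```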