Member since: 01-25-2017
Posts: 396
Kudos Received: 28
Solutions: 11
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 835 | 10-19-2023 04:36 PM
 | 4366 | 12-08-2018 06:56 PM
 | 5458 | 10-05-2018 06:28 AM
 | 19877 | 04-19-2018 02:27 AM
 | 19899 | 04-18-2018 09:40 AM
02-27-2017
11:01 PM
Yes I did, but since I didn't catch the issue in time I got: "The filesystem under path '/user/dataint/.staging' has 0 CORRUPT files". I will try to catch the issue when it happens. I also suspect a bad disk might be causing it. Why would such a directory have only 1 replica? Is there a default for this? My whole cluster runs with replication factor 3.
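To see how many replicas the staging files actually have, and whether any blocks are under-replicated or corrupt at the moment the job fails, hdfs fsck can report per-file replication. A minimal sketch, assuming the /user/dataint/.staging path from the error message and a client with access to the cluster:

```shell
# Report files, block counts, and average/target replication for the staging dir
hdfs fsck /user/dataint/.staging -files -blocks

# List only the paths of files with corrupt or missing blocks under that directory
hdfs fsck /user/dataint/.staging -list-corruptfileblocks
```

Running the first command while the failing job is still active shows whether job.split was really written with a single replica or whether replicas exist but live on a suspect disk.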
02-27-2017
12:35 PM
The weird thing is that it happens sporadically, with the same error but on a different block each time. When I try to list the file in HDFS I can't find it. I suspect it happens on a specific disk on a specific DataNode, but it only happens with one job: after 3 failures on the same node the job blacklists that node, until it has blacklisted all DataNodes and then fails. On the next run it succeeds.

2017-02-27 13:36:03,460 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1486363199991_135199_m_000014_0: Error: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-76191351-10.160.96.6-1447247246852:blk_1158119244_84380902 file=/user/dataint/.staging/job_1486363199991_135199/job.split
at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:963)
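When the error recurs, fsck can show which DataNodes hold (or are missing) the replicas of the failing file, which would confirm whether one specific disk/node is involved. A sketch, using the job.split path from the log above; it only works while the staging file still exists:

```shell
# Show the block IDs of the split file and the DataNode location of each replica
hdfs fsck /user/dataint/.staging/job_1486363199991_135199/job.split \
  -files -blocks -locations
```

If every listed replica sits on the same DataNode, that matches the single-replica suspicion; if the block is reported missing, the DataNode logs for that host and time window are the next place to look.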
02-27-2017
07:09 AM
1 Kudo
Hi, can anyone help me understand this ERROR? The IP 10.160.96.6 is the standby NN.

2017-02-26 01:35:40,427 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1486363199991_126195_m_000026_3: Error: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-76191351-10.160.96.6-1447247246852:blk_1157585017_83846591 file=/user/dataint/.staging/job_1486363199991_126195/job.split
at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:963)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:610)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:851)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:904)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:704)
at java.io.DataInputStream.readByte(DataInputStream.java:265)
at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
at org.apache.hadoop.io.WritableUtils.readVIntInRange(WritableUtils.java:348)
at org.apache.hadoop.io.Text.readString(Text.java:471)
at org.apache.hadoop.io.Text.readString(Text.java:464)
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:426)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1707)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
2017-02-26 01:35:40,427 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1486363199991_126195_m_000026_3 TaskAttempt Transitioned from RUNNING to FAIL_FINISHING_CONTAINER
2017-02-26 01:35:40,429 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1486363199991_126195_m_000026 Task Transitioned from RUNNING to FAILED
2017-02-26 01:35:40,429 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 20
2017-02-26 01:35:40,429 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Job failed as tasks failed. failedMaps:1 failedReduces:0
2017-02-26 01:35:40,430 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1486363199991_126195Job Transitioned from RUNNING to FAIL_WAIT
2017-02-26 01:35:40,435 ERROR [Thread-53] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Could not deallocate container for task attemptId attempt_1486363199991_126195_r_000000_0
2017-02-26 01:35:40,435 INFO [Thread-53] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Processing the event EventType: CONTAINER_DEALLOCATE
2017-02-26 01:35:40,435 ERROR [Thread-53] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Could not deallocate container for task attemptId attempt_1486363199991_126195_r_000001_0
2017-02-26 01:35:40,435 INFO [Thread-53] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Processing the event EventType: CONTAINER_DEALLOCATE
2017-02-26 01:35:40,435 ERROR [Thread-53] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Could not deallocate container for task attemptId attempt_1486363199991_126195_r_000002_0
Labels:
- MapReduce
02-26-2017
08:07 AM
I managed to solve it by adding a mapred-site.xml on the Oozie server under /etc/hadoop/conf and overriding the submit replication there.
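For reference, the override described above would look roughly like the following property in the Oozie server's mapred-site.xml. This is a sketch, not the exact file used: the property name is the classic MRv1 `mapred.submit.replication`, and the value 2 matches the setting mentioned elsewhere in this thread.

```xml
<!-- /etc/hadoop/conf/mapred-site.xml on the Oozie server (illustrative) -->
<property>
  <name>mapred.submit.replication</name>
  <value>2</value>
</property>
```

Because submit replication is a client-side setting, it has to be present in the configuration of whichever host actually submits the job, which is why placing it on the Oozie server fixed it here.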
02-26-2017
04:18 AM
I want to find where I can disable submitting the job through a specific host. I see that the Oozie launcher for the job submits from slpr-mha01, which is the JT, NN, and Oozie node, but the job itself is submitted through a DN. The jobs are scheduled using Oozie.
02-26-2017
03:04 AM
Hi, can I enforce this at the cluster level? This is the coordinator job configuration for the running job; vlpr-mha01 acts as JT and NN.

<configuration>
  <property>
    <name>jobType</name>
    <value>rm</value>
  </property>
  <property>
    <name>dwhType</name>
    <value>da</value>
  </property>
  <property>
    <name>oozie.coord.application.path</name>
    <value>hdfs://vlpr-mha01:54310/liveperson/code/server_dataaccess_retention/lp-dataaccess-retention-1.0.0.1/sched/</value>
  </property>
  <property>
    <name>recycleBinDir</name>
    <value>hdfs://vlpr-mha01:54310/liveperson/data/server_dataaccess_retention/recycle_bin/</value>
  </property>
  <property>
    <name>freq</name>
    <value>1440</value>
  </property>
  <property>
    <name>workflowAppUri</name>
    <value>hdfs://vlpr-mha01:54310/liveperson/code/server_dataaccess_retention/lp-dataaccess-retention-1.0.0.1/sched/</value>
  </property>
  <property>
    <name>start</name>
    <value>2014-03-02T10:24Z</value>
  </property>
  <property>
    <name>user.name</name>
    <value>dataaccess</value>
  </property>
  <property>
    <name>jobRoot</name>
    <value>hdfs://vlpr-mha01:54310/liveperson/code/server_dataaccess_retention/lp-dataaccess-retention-1.0.0.1</value>
  </property>
  <property>
    <name>workingOnDir</name>
    <value>hdfs://vlpr-mha01:54310/liveperson/data/server_dataaccess_retention/recycle_bin/</value>
  </property>
  <property>
    <name>oozie.libpath</name>
    <value>hdfs://vlpr-mha01:54310/liveperson/code/server_dataaccess_retention/lp-dataaccess-retention-1.0.0.1/lib</value>
  </property>
  <property>
    <name>nameNode</name>
    <value>hdfs://vlpr-mha01:54310</value>
  </property>
  <property>
    <name>end</name>
    <value>2020-01-01T00:00Z</value>
  </property>
  <property>
    <name>jobTracker</name>
    <value>vlpr-mha01:54311</value>
  </property>
</configuration>

This is an old cluster on which I'm trying not to make changes at the job level; it should be dead in 6 months.
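On enforcing a value cluster-wide: Hadoop configuration supports marking a property as final, which makes client- or job-level overrides of that property be ignored when the configuration is loaded. A sketch, assuming `mapred.submit.replication` is the property in question and that it is set in the mapred-site.xml of the hosts that load it (it only binds on hosts where this file is read):

```xml
<!-- mapred-site.xml; <final>true</final> prevents per-job overrides (illustrative) -->
<property>
  <name>mapred.submit.replication</name>
  <value>2</value>
  <final>true</final>
</property>
```

This avoids touching individual coordinator/workflow definitions, which fits the constraint of not changing anything at the job level on a cluster that is near end-of-life.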
02-26-2017
02:05 AM
Hi, I see some jobs in my cluster that were submitted via the JobTracker node. Looking at all the DataNodes, mapred.submit.replication is 2, but in the JobTracker's mapred-site.xml there is no mapred.submit.replication property. I added it manually to the file and restarted the JobTracker, but in the job file for running jobs that have the JobTracker as the Submit Host, mapred.submit.replication is still 10 and not 2.
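To confirm which value a job was actually launched with, the effective configuration of each job is persisted as job.xml in its staging directory on HDFS, so it can be inspected after the fact. A sketch, where the user and job id placeholders are hypothetical and should be replaced with a real staging path and job id:

```shell
# Inspect the submit replication a running/submitted job actually picked up
# (<user> and <job_id> are placeholders, not real values from this cluster)
hdfs dfs -cat /user/<user>/.staging/<job_id>/job.xml | grep -A 1 mapred.submit.replication
```

If this still shows 10 (the usual default for that property) after the restart, the submitting client is reading its configuration from a different directory than the one that was edited.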
Labels:
- MapReduce
02-25-2017
12:35 PM
I'm just thinking of deleting the Spark service and then adding it back. Has anyone experienced the same issue?
02-23-2017
08:18 PM
Hi, is it planned to add this ability to the Express version of Cloudera Manager? Is there anything similar I can do with the Express version?
02-23-2017
12:49 PM
Digging into the cluster, I found that one of the applications running outside the Hadoop cluster has clients that do hdfs dfs -put to the cluster. These clients didn't have an hdfs-site.xml, so they got the cluster's default replication factor. What did I do? I tested hdfs dfs -put from a client server inside the cluster and from the client outside the cluster, and noticed that the client outside the cluster put files with replication factor 3. To solve the issue I added an hdfs-site.xml to each of the clients outside the cluster and overrode the default replication factor in that file.
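The diagnosis and fix above can also be done per command, without distributing hdfs-site.xml. A sketch, assuming a reachable cluster; the file names are hypothetical, /user/dataint is taken from earlier in this thread, and the target replication of 2 is an assumption:

```shell
# Check the replication factor of an existing file (%r prints replication, %n the name)
hdfs dfs -stat "%r %n" /user/dataint/some_file

# Override the client-side default for a single put, without an hdfs-site.xml on the client
hdfs dfs -D dfs.replication=2 -put localfile /user/dataint/

# Fix files that were already written with the wrong factor (-w waits for completion)
hdfs dfs -setrep -w 2 /user/dataint/some_file
```

The hdfs-site.xml approach is still the more robust fix, since it covers every client invocation rather than relying on each caller passing -D.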