Created 11-29-2020 02:22 PM
Hello,
What is the correct way to launch a distcp job (or any MapReduce job) from outside the Hadoop cluster?
Basic info about my issue:
hadoop distcp -Dmapreduce.map.maxattempts=5 -update -m 50 -skipcrccheck -i -numListstatusThreads 30 -strategy dynamic <source folder> <destination folder>
20/11/29 15:30:38 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/11/29 15:30:38 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
20/11/29 15:30:38 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[hdfs://ns-atg-hadoop01-dca19/user/weiping.he/weather], targetPath=hdfs://ns-atg-hadoop01-dca19/user/weiping.he, targetPathExists=true, preserveRawXattrs=false}
20/11/29 15:30:42 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
20/11/29 15:30:42 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
20/11/29 15:30:57 INFO mapreduce.JobSubmitter: number of splits:3
20/11/29 15:30:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1604716321987_0085
20/11/29 15:31:02 INFO impl.YarnClientImpl: Submitted application application_1604716321987_0085
20/11/29 15:31:02 INFO mapreduce.Job: The url to track the job: http://rm1-hadoop01.mydomain.net:8088/proxy/application_1604716321987_0085/
20/11/29 15:31:02 INFO tools.DistCp: DistCp job-id: job_1604716321987_0085
20/11/29 15:31:02 INFO mapreduce.Job: Running job: job_1604716321987_0085
20/11/29 15:31:08 INFO mapreduce.Job: Job job_1604716321987_0085 running in uber mode : false
20/11/29 15:31:08 INFO mapreduce.Job: map 0% reduce 0%
20/11/29 15:31:12 INFO mapreduce.Job: Task Id : attempt_1604716321987_0085_m_000000_0, Status : FAILED
Error: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readLong(DataInputStream.java:416)
at org.apache.hadoop.tools.CopyListingFileStatus.readFields(CopyListingFileStatus.java:366)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:2344)
at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2317)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue(SequenceFileRecordReader.java:78)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:562)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
...
<repeat of the same exception error message>
...
20/11/29 15:31:22 INFO mapreduce.Job: Task Id : attempt_1604716321987_0085_m_000002_2, Status : FAILED
Error: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readLong(DataInputStream.java:416)
at org.apache.hadoop.tools.CopyListingFileStatus.readFields(CopyListingFileStatus.java:366)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:2344)
at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2317)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue(SequenceFileRecordReader.java:78)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:562)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
20/11/29 15:31:22 INFO mapreduce.Job: Task Id : attempt_1604716321987_0085_m_000000_2, Status : FAILED
Error: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readLong(DataInputStream.java:416)
at org.apache.hadoop.tools.CopyListingFileStatus.readFields(CopyListingFileStatus.java:366)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:2344)
at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2317)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue(SequenceFileRecordReader.java:78)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:562)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
20/11/29 15:31:28 INFO mapreduce.Job: map 100% reduce 0%
20/11/29 15:31:29 INFO mapreduce.Job: Job job_1604716321987_0085 failed with state FAILED due to: Task failed task_1604716321987_0085_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
20/11/29 15:31:29 INFO mapreduce.Job: Counters: 12
Job Counters
Failed map tasks=10
Killed map tasks=2
Launched map tasks=12
Other local map tasks=12
Total time spent by all maps in occupied slots (ms)=36283
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=36283
Total vcore-milliseconds taken by all map tasks=36283
Total megabyte-milliseconds taken by all map tasks=37153792
Map-Reduce Framework
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
20/11/29 15:31:30 ERROR tools.DistCp: Exception encountered
java.io.IOException: DistCp failure: Job job_1604716321987_0085 has failed: Task failed task_1604716321987_0085_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
at org.apache.hadoop.tools.DistCp.waitForJobCompletion(DistCp.java:205)
at org.apache.hadoop.tools.DistCp.execute(DistCp.java:156)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:126)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:430)
Information about my laptop setup:
$ which hadoop
/Users/birdfly/dev/hadoop-2.7.2/bin/hadoop
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>mapreduce.jobhistory.address</name>
<value>rm1-hadoop01.mydomain.net:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>rm1-hadoop01.mydomain.net:19888</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.https.address</name>
<value>rm1-hadoop01.mydomain.net:19890</value>
</property>
<property>
<name>mapreduce.jobhistory.admin.address</name>
<value>rm1-hadoop01.mydomain.net:10033</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/user</value>
</property>
</configuration>
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>mydomain-hadoop01</value>
</property>
<property>
<name>yarn.resourcemanager.address.rm11</name>
<value>rm1-hadoop01.mydomain.net:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address.rm11</name>
<value>rm1-hadoop01.mydomain.net:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address.rm11</name>
<value>rm1-hadoop01.mydomain.net:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address.rm11</name>
<value>rm1-hadoop01.mydomain.net:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm11</name>
<value>rm1-hadoop01.mydomain.net:8088</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.https.address.rm11</name>
<value>rm1-hadoop01.mydomain.net:8090</value>
</property>
<property>
<name>yarn.resourcemanager.address.rm77</name>
<value>rm2-hadoop01.mydomain.net:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address.rm77</name>
<value>rm2-hadoop01.mydomain.net:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address.rm77</name>
<value>rm2-hadoop01.mydomain.net:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address.rm77</name>
<value>rm2-hadoop01.mydomain.net:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm77</name>
<value>rm2-hadoop01.mydomain.net:8088</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.https.address.rm77</name>
<value>rm2-hadoop01.mydomain.net:8090</value>
</property>
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm11,rm77</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>$HADOOP_CLIENT_CONF_DIR,$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/tools/lib/*</value>
</property>
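One thing that can be checked on the client side is whether every id listed in yarn.resourcemanager.ha.rm-ids has a matching yarn.resourcemanager.address.<id> entry. Below is a minimal Python sketch of such a check. The helper names (load_properties, missing_rm_addresses) and the SAMPLE string are made up for illustration, and SAMPLE only copies three of the properties shown above. It is a sanity check of the config, not a fix for the EOFException.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: three properties copied from the yarn.* settings above.
SAMPLE = """
<property><name>yarn.resourcemanager.ha.rm-ids</name><value>rm11,rm77</value></property>
<property><name>yarn.resourcemanager.address.rm11</name><value>rm1-hadoop01.mydomain.net:8032</value></property>
<property><name>yarn.resourcemanager.address.rm77</name><value>rm2-hadoop01.mydomain.net:8032</value></property>
"""


def load_properties(xml_text):
    """Return {name: value} for every <property> element in Hadoop-style XML."""
    text = xml_text.strip()
    # Drop a leading XML declaration so the text can be wrapped in a synthetic root.
    if text.startswith("<?xml"):
        text = text[text.index("?>") + 2:]
    # The synthetic root lets bare <property> fragments parse as well as full files.
    root = ET.fromstring("<root>" + text + "</root>")
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def missing_rm_addresses(props):
    """List the HA rm-ids that have no yarn.resourcemanager.address.<id> entry."""
    raw_ids = props.get("yarn.resourcemanager.ha.rm-ids", "")
    ids = [i.strip() for i in raw_ids.split(",") if i.strip()]
    return [i for i in ids if "yarn.resourcemanager.address." + i not in props]


if __name__ == "__main__":
    # Both ids in SAMPLE have an address entry, so this prints [].
    print(missing_rm_addresses(load_properties(SAMPLE)))
```

To run it against a real file, pass the file's contents to load_properties instead of SAMPLE. An empty list means each listed ResourceManager id has an address entry.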
I searched the web for "java.io.DataInputStream.readFully(DataInputStream.java:197)" but did not come across anything similar to my use case.
I also searched for "mapreduce job from outside cluster", and the closest thing I could find is this: https://stackoverflow.com/questions/29268845/running-mapreduce-remotely/41238404
Any suggestions are much appreciated! Thanks!
Created 01-28-2021 06:54 PM
Hi,
I just ran into the same issue. Has it been resolved?