
Launch distcp (or other MapReduce jobs) from outside the Hadoop cluster, such as from a laptop

New Contributor

Hello,

I'm wondering what the correct way is to launch a distcp job (or any MapReduce job) from outside the Hadoop cluster.

Basic info about my issue:

  • I can start a distcp from within the Hadoop cluster just fine: I can log into any Hadoop node (such as a YARN node) and launch it with the command

hadoop distcp -Dmapreduce.map.maxattempts=5 -update -m 50 -skipcrccheck -i -numListstatusThreads 30 -strategy dynamic <source folder> <destination folder>

  • If I launch the distcp job from my laptop with the same command, I get this error:

20/11/29 15:30:38 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/11/29 15:30:38 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
20/11/29 15:30:38 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[hdfs://ns-atg-hadoop01-dca19/user/weiping.he/weather], targetPath=hdfs://ns-atg-hadoop01-dca19/user/weiping.he, targetPathExists=true, preserveRawXattrs=false}
20/11/29 15:30:42 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
20/11/29 15:30:42 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
20/11/29 15:30:57 INFO mapreduce.JobSubmitter: number of splits:3
20/11/29 15:30:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1604716321987_0085
20/11/29 15:31:02 INFO impl.YarnClientImpl: Submitted application application_1604716321987_0085
20/11/29 15:31:02 INFO mapreduce.Job: The url to track the job: http://rm1-hadoop01.mydomain.net:8088/proxy/application_1604716321987_0085/
20/11/29 15:31:02 INFO tools.DistCp: DistCp job-id: job_1604716321987_0085
20/11/29 15:31:02 INFO mapreduce.Job: Running job: job_1604716321987_0085
20/11/29 15:31:08 INFO mapreduce.Job: Job job_1604716321987_0085 running in uber mode : false
20/11/29 15:31:08 INFO mapreduce.Job:  map 0% reduce 0%
20/11/29 15:31:12 INFO mapreduce.Job: Task Id : attempt_1604716321987_0085_m_000000_0, Status : FAILED
Error: java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:197)
	at java.io.DataInputStream.readLong(DataInputStream.java:416)
	at org.apache.hadoop.tools.CopyListingFileStatus.readFields(CopyListingFileStatus.java:366)
	at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
	at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
	at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:2344)
	at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2317)
	at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue(SequenceFileRecordReader.java:78)
	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:562)
	at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
	at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
...
<repeat of the same java.io.EOFException stack trace for other task attempts>
...
20/11/29 15:31:22 INFO mapreduce.Job: Task Id : attempt_1604716321987_0085_m_000002_2, Status : FAILED
Error: java.io.EOFException
	<same stack trace as above>

20/11/29 15:31:22 INFO mapreduce.Job: Task Id : attempt_1604716321987_0085_m_000000_2, Status : FAILED
Error: java.io.EOFException
	<same stack trace as above>

20/11/29 15:31:28 INFO mapreduce.Job:  map 100% reduce 0%
20/11/29 15:31:29 INFO mapreduce.Job: Job job_1604716321987_0085 failed with state FAILED due to: Task failed task_1604716321987_0085_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0

20/11/29 15:31:29 INFO mapreduce.Job: Counters: 12
	Job Counters
		Failed map tasks=10
		Killed map tasks=2
		Launched map tasks=12
		Other local map tasks=12
		Total time spent by all maps in occupied slots (ms)=36283
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=36283
		Total vcore-milliseconds taken by all map tasks=36283
		Total megabyte-milliseconds taken by all map tasks=37153792
	Map-Reduce Framework
		CPU time spent (ms)=0
		Physical memory (bytes) snapshot=0
		Virtual memory (bytes) snapshot=0
20/11/29 15:31:30 ERROR tools.DistCp: Exception encountered
java.io.IOException: DistCp failure: Job job_1604716321987_0085 has failed: Task failed task_1604716321987_0085_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0

	at org.apache.hadoop.tools.DistCp.waitForJobCompletion(DistCp.java:205)
	at org.apache.hadoop.tools.DistCp.execute(DistCp.java:156)
	at org.apache.hadoop.tools.DistCp.run(DistCp.java:126)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.tools.DistCp.main(DistCp.java:430)

 

Information about my laptop setup:

  • Hadoop installation:

 

$ which hadoop
/Users/birdfly/dev/hadoop-2.7.2/bin/hadoop
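
I'll note that my local client is Hadoop 2.7.2. Since the mappers fail while deserializing the copy listing that the client writes at submit time, I wonder whether a client/cluster version mismatch could matter here; comparing versions seems like a reasonable first check (the second command assumes shell access to a cluster node, and the hostname is only illustrative):

# On the laptop:
hadoop version

# On any cluster node, for comparison (hostname is illustrative):
ssh yarn-node01.mydomain.net hadoop version
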

 

  • mapred-site.xml on my laptop:

 

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>rm1-hadoop01.mydomain.net:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>rm1-hadoop01.mydomain.net:19888</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.https.address</name>
    <value>rm1-hadoop01.mydomain.net:19890</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.admin.address</name>
    <value>rm1-hadoop01.mydomain.net:10033</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.staging-dir</name>
    <value>/user</value>
  </property>
</configuration>
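
In case it helps, this is roughly how I launch it from the laptop — a minimal sketch, assuming the client configs shown in this post live in the default etc/hadoop directory of the 2.7.2 tarball:

# Point the Hadoop client at the directory holding the cluster client configs
# (core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml):
export HADOOP_CONF_DIR=/Users/birdfly/dev/hadoop-2.7.2/etc/hadoop

# Then run the same distcp command as on a cluster node:
hadoop distcp -Dmapreduce.map.maxattempts=5 -update -m 50 -skipcrccheck \
  -i -numListstatusThreads 30 -strategy dynamic <source folder> <destination folder>
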

 

  •  yarn-site.xml on my laptop (relevant properties only):

 

  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>mydomain-hadoop01</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address.rm11</name>
    <value>rm1-hadoop01.mydomain.net:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address.rm11</name>
    <value>rm1-hadoop01.mydomain.net:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address.rm11</name>
    <value>rm1-hadoop01.mydomain.net:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address.rm11</name>
    <value>rm1-hadoop01.mydomain.net:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm11</name>
    <value>rm1-hadoop01.mydomain.net:8088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.https.address.rm11</name>
    <value>rm1-hadoop01.mydomain.net:8090</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address.rm77</name>
    <value>rm2-hadoop01.mydomain.net:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address.rm77</name>
    <value>rm2-hadoop01.mydomain.net:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address.rm77</name>
    <value>rm2-hadoop01.mydomain.net:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address.rm77</name>
    <value>rm2-hadoop01.mydomain.net:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm77</name>
    <value>rm2-hadoop01.mydomain.net:8088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.https.address.rm77</name>
    <value>rm2-hadoop01.mydomain.net:8090</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm11,rm77</value>
  </property>
  <property>
    <name>yarn.application.classpath</name>
    <value>$HADOOP_CLIENT_CONF_DIR,$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/tools/lib/*</value>
  </property>
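
Given the HA ResourceManager settings above, submission itself clearly works (the job gets an application ID), so basic RM connectivity seems fine. For reference, a quick way to sanity-check that from the laptop — standard client commands, with the hostname taken from the config above:

# Confirm the YARN client resolves the HA rm-ids and reaches the active RM:
yarn application -list

# The ResourceManager REST API is also reachable directly:
curl -s http://rm1-hadoop01.mydomain.net:8088/ws/v1/cluster/info
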

 

 

I searched the web for "java.io.DataInputStream.readFully(DataInputStream.java:197)" and haven't come across anything similar to my use case.

I also searched for "mapreduce job from outside cluster", and the closest I found is this: https://stackoverflow.com/questions/29268845/running-mapreduce-remotely/41238404

 

Any suggestion is much appreciated! Thanks!

1 REPLY

New Contributor

Hi,

I just came across the same issue, and I would like to ask whether it has been resolved.