
config hdfs to distcp to/from google cloud storage


New Contributor

Can anyone guide me on how to add gcs-connector.jar to Hadoop on HDP 2.5 so that I can distcp to/from Google Cloud Storage?

I followed the "Manually installing the connector" article and got this error:

[centos@namenode ~]$ hadoop distcp gs://bucket/image.png /
17/01/24 07:33:22 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.0-hadoop2
17/01/24 07:33:24 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, overwrite=false, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=100, sslConfigurationFile='null', copyStrategy='uniformsize', preserveStatus=[], preserveRawXattrs=false, atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[gs://bucket/image.png], targetPath=/, targetPathExists=true, filtersFile='null'}
17/01/24 07:33:25 INFO impl.TimelineClientImpl: Timeline service address: http://internal:8188/ws/v1/timeline/
17/01/24 07:33:25 INFO client.RMProxy: Connecting to ResourceManager at internal/xxx.xxx.xxx.xxx:8050
17/01/24 07:33:26 INFO client.AHSProxy: Connecting to Application History server at internal/xxx.xxx.xxx.xxx:10200
17/01/24 07:33:28 WARN gcs.GoogleHadoopFileSystemBase: No working directory configured, using default: 'gs://bucket/'
17/01/24 07:33:30 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 1; dirCnt = 0
17/01/24 07:33:30 INFO tools.SimpleCopyListing: Build file listing completed.
17/01/24 07:33:30 INFO tools.DistCp: Number of paths in the copy list: 1
17/01/24 07:33:30 INFO tools.DistCp: Number of paths in the copy list: 1
17/01/24 07:33:31 INFO impl.TimelineClientImpl: Timeline service address: http://internal:8188/ws/v1/timeline/
17/01/24 07:33:31 INFO client.RMProxy: Connecting to ResourceManager at internal/xxx.xxx.xxx.xxx:8050
17/01/24 07:33:31 INFO client.AHSProxy: Connecting to Application History server at internal/xxx.xxx.xxx.xxx:10200
17/01/24 07:33:31 INFO mapreduce.JobSubmitter: number of splits:1
17/01/24 07:33:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1485241695662_0003
17/01/24 07:33:33 INFO impl.YarnClientImpl: Submitted application application_1485241695662_0003
17/01/24 07:33:33 INFO mapreduce.Job: The url to track the job: http://internal:8088/proxy/application_1485241695662_0003/
17/01/24 07:33:33 INFO tools.DistCp: DistCp job-id: job_1485241695662_0003
17/01/24 07:33:33 INFO mapreduce.Job: Running job: job_1485241695662_0003
17/01/24 07:33:39 INFO mapreduce.Job: Job job_1485241695662_0003 running in uber mode : false
17/01/24 07:33:39 INFO mapreduce.Job:  map 0% reduce 0%
17/01/24 07:33:47 INFO mapreduce.Job: Task Id : attempt_1485241695662_0003_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2214)
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2746)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2759)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2795)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2777)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:386)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:218)
	at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:52)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2120)
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2212)
	... 17 more




17/01/24 07:33:52 INFO mapreduce.Job: Task Id : attempt_1485241695662_0003_m_000000_1, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
	(same stack trace as above)

17/01/24 07:33:56 INFO mapreduce.Job: Task Id : attempt_1485241695662_0003_m_000000_2, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
	(same stack trace as above)




Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143




17/01/24 07:34:01 INFO mapreduce.Job:  map 100% reduce 0%
17/01/24 07:34:01 INFO mapreduce.Job: Job job_1485241695662_0003 failed with state FAILED due to: Task failed task_1485241695662_0003_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0




17/01/24 07:34:01 INFO mapreduce.Job: Counters: 8
	Job Counters 
		Failed map tasks=4
		Launched map tasks=4
		Other local map tasks=4
		Total time spent by all maps in occupied slots (ms)=14884
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=14884
		Total vcore-milliseconds taken by all map tasks=14884
		Total megabyte-milliseconds taken by all map tasks=15241216
17/01/24 07:34:01 ERROR tools.DistCp: Exception encountered 
java.io.IOException: DistCp failure: Job job_1485241695662_0003 has failed: Task failed task_1485241695662_0003_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0




	at org.apache.hadoop.tools.DistCp.waitForJobCompletion(DistCp.java:215)
	at org.apache.hadoop.tools.DistCp.execute(DistCp.java:158)
	at org.apache.hadoop.tools.DistCp.run(DistCp.java:128)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.hadoop.tools.DistCp.main(DistCp.java:462)

hadoop fs -cp gs://... works fine, but it's very slow when moving very large files.

I've added gcs-connector.jar to every node in the cluster (NameNode, Secondary NameNode, DataNodes) and also configured the classpath to include the jar file.

I've added this line to the "hadoop-env template" in the Ambari UI:

export HADOOP_CLASSPATH=/var/lib/gcs-connector/gcs-connector-latest-hadoop2.jar:$HADOOP_CLASSPATH

Result of running "hadoop classpath" on the NameNode:

/usr/hdp/2.5.3.0-37/hadoop/conf:/usr/hdp/2.5.3.0-37/hadoop/lib/*:/usr/hdp/2.5.3.0-37/hadoop/.//*:/usr/hdp/2.5.3.0-37/hadoop-hdfs/./:/usr/hdp/2.5.3.0-37/hadoop-hdfs/lib/*:/usr/hdp/2.5.3.0-37/hadoop-hdfs/.//*:/usr/hdp/2.5.3.0-37/hadoop-yarn/lib/*:/usr/hdp/2.5.3.0-37/hadoop-yarn/.//*:/usr/hdp/2.5.3.0-37/hadoop-mapreduce/lib/*:/usr/hdp/2.5.3.0-37/hadoop-mapreduce/.//*:/var/lib/gcs-connector/gcs-connector-latest-hadoop2.jar::/usr/hdp/2.5.3.0-37/tez/*:/usr/hdp/2.5.3.0-37/tez/lib/*:/usr/hdp/2.5.3.0-37/tez/conf

Result of running "hadoop classpath" on one of my DataNodes:

/usr/hdp/2.5.3.0-37/hadoop/conf:/usr/hdp/2.5.3.0-37/hadoop/lib/*:/usr/hdp/2.5.3.0-37/hadoop/.//*:/usr/hdp/2.5.3.0-37/hadoop-hdfs/./:/usr/hdp/2.5.3.0-37/hadoop-hdfs/lib/*:/usr/hdp/2.5.3.0-37/hadoop-hdfs/.//*:/usr/hdp/2.5.3.0-37/hadoop-yarn/lib/*:/usr/hdp/2.5.3.0-37/hadoop-yarn/.//*:/usr/hdp/2.5.3.0-37/hadoop-mapreduce/lib/*:/usr/hdp/2.5.3.0-37/hadoop-mapreduce/.//*:/var/lib/gcs-connector/gcs-connector-latest-hadoop2.jar::mysql-connector-java.jar:/usr/hdp/2.5.3.0-37/tez/*:/usr/hdp/2.5.3.0-37/tez/lib/*:/usr/hdp/2.5.3.0-37/tez/conf

I can confirm that the file /var/lib/gcs-connector/gcs-connector-latest-hadoop2.jar exists on every node.

I also added these three properties to Custom core-site in the Ambari UI:

fs.gs.project.id
fs.gs.impl
fs.AbstractFileSystem.gs.impl
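
For reference, a typical core-site.xml setup for those three properties looks roughly like the fragment below. The class names are the ones shipped in the GCS connector; the project ID value is a placeholder you would replace with your own:

```xml
<!-- GCS connector filesystem bindings (project ID below is a placeholder) -->
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>your-gcp-project-id</value>
</property>
```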

Any suggestions on how to make distcp work?

1 ACCEPTED SOLUTION


Re: config hdfs to distcp to/from google cloud storage

Cloudera Employee
@Phakin Cheangkrachange DistCp runs as a MapReduce job, and the issue seems to be with the JVMs created for the job. That is, "mapreduce.application.classpath" may not have picked up this jar file before the JVMs were created.

Could you please add /var/lib/gcs-connector/gcs-connector-latest-hadoop2.jar at the end of mapreduce.application.classpath in the MapReduce2 service from Ambari and restart the service so that the new JVMs pick up this jar?
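
For illustration, the change amounts to appending the jar path to the property's existing value in the MapReduce2 configuration. The leading entries below are a placeholder for whatever your cluster already has; only the trailing jar path is new:

```xml
<property>
  <name>mapreduce.application.classpath</name>
  <!-- keep all existing entries; append the GCS connector jar at the end -->
  <value>...existing classpath entries...:/var/lib/gcs-connector/gcs-connector-latest-hadoop2.jar</value>
</property>
```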

Let me know if it helps.

3 REPLIES


Re: config hdfs to distcp to/from google cloud storage

New Contributor

@ssivachandran thank you so much. It works now!!!

Re: config hdfs to distcp to/from google cloud storage

Cloudera Employee

@Phakin Cheangkrachange

Glad to know that it worked! Kindly upvote the answer since it helped you resolve the issue.