Created 01-24-2017 07:55 AM
Is there anyone who can guide me on how to add gcs-connector.jar to Hadoop on HDP 2.5 so that I can distcp from/to Google Cloud Storage?
I followed the "Manually installing the connector" article and got this error:
[centos@namenode ~]$ hadoop distcp gs://bucket/image.png /
17/01/24 07:33:22 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.0-hadoop2
17/01/24 07:33:24 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, overwrite=false, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=100, sslConfigurationFile='null', copyStrategy='uniformsize', preserveStatus=[], preserveRawXattrs=false, atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[gs://bucket/image.png], targetPath=/, targetPathExists=true, filtersFile='null'}
17/01/24 07:33:25 INFO impl.TimelineClientImpl: Timeline service address: http://internal:8188/ws/v1/timeline/
17/01/24 07:33:25 INFO client.RMProxy: Connecting to ResourceManager at internal/xxx.xxx.xxx.xxx:8050
17/01/24 07:33:26 INFO client.AHSProxy: Connecting to Application History server at internal/xxx.xxx.xxx.xxx:10200
17/01/24 07:33:28 WARN gcs.GoogleHadoopFileSystemBase: No working directory configured, using default: 'gs://bucket/'
17/01/24 07:33:30 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 1; dirCnt = 0
17/01/24 07:33:30 INFO tools.SimpleCopyListing: Build file listing completed.
17/01/24 07:33:30 INFO tools.DistCp: Number of paths in the copy list: 1
17/01/24 07:33:30 INFO tools.DistCp: Number of paths in the copy list: 1
17/01/24 07:33:31 INFO impl.TimelineClientImpl: Timeline service address: http://internal:8188/ws/v1/timeline/
17/01/24 07:33:31 INFO client.RMProxy: Connecting to ResourceManager at internal/xxx.xxx.xxx.xxx:8050
17/01/24 07:33:31 INFO client.AHSProxy: Connecting to Application History server at internal/xxx.xxx.xxx.xxx:10200
17/01/24 07:33:31 INFO mapreduce.JobSubmitter: number of splits:1
17/01/24 07:33:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1485241695662_0003
17/01/24 07:33:33 INFO impl.YarnClientImpl: Submitted application application_1485241695662_0003
17/01/24 07:33:33 INFO mapreduce.Job: The url to track the job: http://internal:8088/proxy/application_1485241695662_0003/
17/01/24 07:33:33 INFO tools.DistCp: DistCp job-id: job_1485241695662_0003
17/01/24 07:33:33 INFO mapreduce.Job: Running job: job_1485241695662_0003
17/01/24 07:33:39 INFO mapreduce.Job: Job job_1485241695662_0003 running in uber mode : false
17/01/24 07:33:39 INFO mapreduce.Job:  map 0% reduce 0%
17/01/24 07:33:47 INFO mapreduce.Job: Task Id : attempt_1485241695662_0003_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2214)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2746)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2759)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2795)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2777)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:386)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:218)
        at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:52)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2120)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2212)
        ... 17 more
17/01/24 07:33:52 INFO mapreduce.Job: Task Id : attempt_1485241695662_0003_m_000000_1, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
        (same stack trace as above)
17/01/24 07:33:56 INFO mapreduce.Job: Task Id : attempt_1485241695662_0003_m_000000_2, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
        (same stack trace as above)
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
17/01/24 07:34:01 INFO mapreduce.Job:  map 100% reduce 0%
17/01/24 07:34:01 INFO mapreduce.Job: Job job_1485241695662_0003 failed with state FAILED due to: Task failed task_1485241695662_0003_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
17/01/24 07:34:01 INFO mapreduce.Job: Counters: 8
        Job Counters
                Failed map tasks=4
                Launched map tasks=4
                Other local map tasks=4
                Total time spent by all maps in occupied slots (ms)=14884
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=14884
                Total vcore-milliseconds taken by all map tasks=14884
                Total megabyte-milliseconds taken by all map tasks=15241216
17/01/24 07:34:01 ERROR tools.DistCp: Exception encountered
java.io.IOException: DistCp failure: Job job_1485241695662_0003 has failed: Task failed task_1485241695662_0003_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
        at org.apache.hadoop.tools.DistCp.waitForJobCompletion(DistCp.java:215)
        at org.apache.hadoop.tools.DistCp.execute(DistCp.java:158)
        at org.apache.hadoop.tools.DistCp.run(DistCp.java:128)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.tools.DistCp.main(DistCp.java:462)
hadoop fs -cp gs://... works fine, but it's very slow when moving very large files.
I've added the gcs-connector.jar to every node in the cluster (NameNode, Secondary NameNode, DataNodes) and also configured the classpath to include the jar file.
I've added this line to the "hadoop-env template" in the Ambari UI:
export HADOOP_CLASSPATH=/var/lib/gcs-connector/gcs-connector-latest-hadoop2.jar:$HADOOP_CLASSPATH
Result from running "hadoop classpath" on the NameNode:
/usr/hdp/2.5.3.0-37/hadoop/conf:/usr/hdp/2.5.3.0-37/hadoop/lib/*:/usr/hdp/2.5.3.0-37/hadoop/.//*:/usr/hdp/2.5.3.0-37/hadoop-hdfs/./:/usr/hdp/2.5.3.0-37/hadoop-hdfs/lib/*:/usr/hdp/2.5.3.0-37/hadoop-hdfs/.//*:/usr/hdp/2.5.3.0-37/hadoop-yarn/lib/*:/usr/hdp/2.5.3.0-37/hadoop-yarn/.//*:/usr/hdp/2.5.3.0-37/hadoop-mapreduce/lib/*:/usr/hdp/2.5.3.0-37/hadoop-mapreduce/.//*:/var/lib/gcs-connector/gcs-connector-latest-hadoop2.jar::/usr/hdp/2.5.3.0-37/tez/*:/usr/hdp/2.5.3.0-37/tez/lib/*:/usr/hdp/2.5.3.0-37/tez/conf
Result from running "hadoop classpath" on one of my DataNodes:
/usr/hdp/2.5.3.0-37/hadoop/conf:/usr/hdp/2.5.3.0-37/hadoop/lib/*:/usr/hdp/2.5.3.0-37/hadoop/.//*:/usr/hdp/2.5.3.0-37/hadoop-hdfs/./:/usr/hdp/2.5.3.0-37/hadoop-hdfs/lib/*:/usr/hdp/2.5.3.0-37/hadoop-hdfs/.//*:/usr/hdp/2.5.3.0-37/hadoop-yarn/lib/*:/usr/hdp/2.5.3.0-37/hadoop-yarn/.//*:/usr/hdp/2.5.3.0-37/hadoop-mapreduce/lib/*:/usr/hdp/2.5.3.0-37/hadoop-mapreduce/.//*:/var/lib/gcs-connector/gcs-connector-latest-hadoop2.jar::mysql-connector-java.jar:/usr/hdp/2.5.3.0-37/tez/*:/usr/hdp/2.5.3.0-37/tez/lib/*:/usr/hdp/2.5.3.0-37/tez/conf
I can confirm that the file /var/lib/gcs-connector/gcs-connector-latest-hadoop2.jar exists on every node.
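For what it's worth, this is roughly how I checked (a minimal sketch; the hostnames are placeholders for my actual nodes, and it assumes passwordless ssh from the NameNode):

# hostnames below are placeholders; substitute your own nodes
for host in namenode snamenode datanode1 datanode2; do
  echo "--- $host ---"
  ssh "$host" ls -l /var/lib/gcs-connector/gcs-connector-latest-hadoop2.jar
done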
I also added these three properties to Custom core-site in the Ambari UI:
fs.gs.project.id
fs.gs.impl
fs.AbstractFileSystem.gs.impl
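For reference, here is what those look like, as given in the connector article (the project id below is a placeholder for my real one):

fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
fs.gs.project.id=my-project-id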
Any suggestions on how to make distcp work?
Created 01-24-2017 10:59 AM
HADOOP_CLASSPATH only affects the client-side JVM; the failing map tasks run in YARN containers, whose classpath is built from mapreduce.application.classpath. Could you please add /var/lib/gcs-connector/gcs-connector-latest-hadoop2.jar at the end of mapreduce.application.classpath in the MapReduce2 service from Ambari and restart the service so that the new JVMs pick up this jar.
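For example, the property would end up looking like this, with the jar path appended after whatever value is already there (the leading part is a placeholder for your current setting):

<existing mapreduce.application.classpath value>:/var/lib/gcs-connector/gcs-connector-latest-hadoop2.jar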
Let me know if it helps.
Created 01-24-2017 11:41 AM
@ssivachandran, thank you so much. It works now!
Created 01-24-2017 11:59 AM
Glad to know that it worked! Kindly vote for the answer since it helped you resolve the issue.