DistCp -update does an overwrite on s3a

Hi,
I am using CDH 6.3.2.
I am currently implementing a job that syncs a folder from HDFS to S3 daily. This folder can contain new files or modified files.
But the -update option doesn't seem to be working: all the files in my "test" folder are rewritten every time.
For example, if I run this command once:


 hadoop distcp -update /user/maurin/test s3a://test_bucket/test/

ERROR: Tools helper /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/bin/../lib/hadoop/libexec//tools/hadoop-distcp.sh was not found.
20/03/23 19:36:37 WARN impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
20/03/23 19:36:37 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
20/03/23 19:36:37 INFO impl.MetricsSystemImpl: s3a-file-system metrics system started
20/03/23 19:36:39 INFO Configuration.deprecation: fs.s3a.server-side-encryption-key is deprecated. Instead, use fs.s3a.server-side-encryption.key
20/03/23 19:36:40 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=true, deleteMissing=false, ignoreFailures=false, overwrite=false, append=false, useDiff=false, useRdiff=false, fromSnapshot=null, toSnapshot=null, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=0.0, copyStrategy='uniformsize', preserveStatus=[BLOCKSIZE], atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[/user/maurin/test], targetPath=s3a://test_bucket/test, filtersFile='null', blocksPerChunk=0, copyBufferSize=8192, verboseLog=false}, sourcePaths=[/user/maurin/test], targetPathExists=true, preserveRawXattrsfalse
20/03/23 19:36:42 INFO hdfs.DFSClient: Created token for maurin: HDFS_DELEGATION_TOKEN owner=maurin/lore_staff@net.getlore.io, renewer=yarn, realUser=, issueDate=1585017402455, maxDate=1585622202455, sequenceNumber=32271, masterKeyId=886 on ha-hdfs:nameservice1
20/03/23 19:36:42 INFO security.TokenCache: Got dt for hdfs://nameservice1; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:nameservice1, Ident: (token for maurin: HDFS_DELEGATION_TOKEN owner=maurin/lore_staff@net.getlore.io, renewer=yarn, realUser=, issueDate=1585017402455, maxDate=1585622202455, sequenceNumber=32271, masterKeyId=886)
20/03/23 19:36:42 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 4; dirCnt = 1
20/03/23 19:36:42 INFO tools.SimpleCopyListing: Build file listing completed.
20/03/23 19:36:42 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
20/03/23 19:36:42 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
20/03/23 19:36:42 INFO tools.DistCp: Number of paths in the copy list: 4
20/03/23 19:36:42 INFO tools.DistCp: Number of paths in the copy list: 4
20/03/23 19:36:43 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm756
20/03/23 19:36:43 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/maurin/.staging/job_1584390517558_0074
20/03/23 19:36:43 INFO mapreduce.JobSubmitter: number of splits:3
20/03/23 19:36:43 INFO Configuration.deprecation: yarn.resourcemanager.zk-address is deprecated. Instead, use hadoop.zk.address
20/03/23 19:36:43 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
20/03/23 19:36:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1584390517558_0074
20/03/23 19:36:43 INFO mapreduce.JobSubmitter: Executing with tokens: [Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:nameservice1, Ident: (token for maurin: HDFS_DELEGATION_TOKEN owner=maurin/lore_staff@net.getlore.io, renewer=yarn, realUser=, issueDate=1585017402455, maxDate=1585622202455, sequenceNumber=32271, masterKeyId=886)]
20/03/23 19:36:43 INFO conf.Configuration: resource-types.xml not found
20/03/23 19:36:43 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/03/23 19:36:44 INFO impl.YarnClientImpl: Submitted application application_1584390517558_0074
20/03/23 19:36:44 INFO mapreduce.Job: The url to track the job: http://cdhmaster3.net.cuberonlabs.com:8088/proxy/application_1584390517558_0074/
20/03/23 19:36:44 INFO tools.DistCp: DistCp job-id: job_1584390517558_0074
20/03/23 19:36:44 INFO mapreduce.Job: Running job: job_1584390517558_0074
20/03/23 19:36:52 INFO mapreduce.Job: Job job_1584390517558_0074 running in uber mode : false
20/03/23 19:36:52 INFO mapreduce.Job: map 0% reduce 0%
20/03/23 19:37:11 INFO mapreduce.Job: map 84% reduce 0%
20/03/23 19:37:13 INFO mapreduce.Job: map 100% reduce 0%
20/03/23 19:37:22 INFO mapreduce.Job: Job job_1584390517558_0074 completed successfully
20/03/23 19:37:22 INFO mapreduce.Job: Counters: 43
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=694053
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1656
HDFS: Number of bytes written=0
HDFS: Number of read operations=35
HDFS: Number of large read operations=0
HDFS: Number of write operations=6
HDFS: Number of bytes read erasure-coded=0
S3A: Number of bytes read=0
S3A: Number of bytes written=4
S3A: Number of read operations=44
S3A: Number of large read operations=0
S3A: Number of write operations=33
Job Counters
Launched map tasks=3
Other local map tasks=3
Total time spent by all maps in occupied slots (ms)=268400
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=53680
Total vcore-milliseconds taken by all map tasks=429440
Total megabyte-milliseconds taken by all map tasks=274841600
Map-Reduce Framework
Map input records=4
Map output records=0
Input split bytes=354
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=364
CPU time spent (ms)=19010
Physical memory (bytes) snapshot=1625214976
Virtual memory (bytes) snapshot=18979409920
Total committed heap usage (bytes)=6963068928
Peak Map Physical memory (bytes)=556732416
Peak Map Virtual memory (bytes)=6332137472
File Input Format Counters
Bytes Read=1298
File Output Format Counters
Bytes Written=0
DistCp Counters
Bandwidth in Btyes=0
Bytes Copied=4
Bytes Expected=4
Files Copied=3
DIR_COPY=1
20/03/23 19:37:22 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system...
20/03/23 19:37:22 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped.
20/03/23 19:37:22 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete.

We can see that it copied 3 files.
Then, if I trigger it again:

 hadoop distcp -update /user/maurin/test s3a://test_bucket/test/
ERROR: Tools helper /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/bin/../lib/hadoop/libexec//tools/hadoop-distcp.sh was not found.
20/03/23 19:41:38 WARN impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
20/03/23 19:41:38 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
20/03/23 19:41:38 INFO impl.MetricsSystemImpl: s3a-file-system metrics system started
20/03/23 19:41:41 INFO Configuration.deprecation: fs.s3a.server-side-encryption-key is deprecated. Instead, use fs.s3a.server-side-encryption.key
20/03/23 19:41:41 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=true, deleteMissing=false, ignoreFailures=false, overwrite=false, append=false, useDiff=false, useRdiff=false, fromSnapshot=null, toSnapshot=null, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=0.0, copyStrategy='uniformsize', preserveStatus=[BLOCKSIZE], atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[/user/maurin/test], targetPath=s3a://test_bucket/test, filtersFile='null', blocksPerChunk=0, copyBufferSize=8192, verboseLog=false}, sourcePaths=[/user/maurin/test], targetPathExists=true, preserveRawXattrsfalse
20/03/23 19:41:43 INFO hdfs.DFSClient: Created token for maurin: HDFS_DELEGATION_TOKEN owner=maurin/lore_staff@net.getlore.io, renewer=yarn, realUser=, issueDate=1585017703760, maxDate=1585622503760, sequenceNumber=32272, masterKeyId=886 on ha-hdfs:nameservice1
20/03/23 19:41:43 INFO security.TokenCache: Got dt for hdfs://nameservice1; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:nameservice1, Ident: (token for maurin: HDFS_DELEGATION_TOKEN owner=maurin/lore_staff@net.getlore.io, renewer=yarn, realUser=, issueDate=1585017703760, maxDate=1585622503760, sequenceNumber=32272, masterKeyId=886)
20/03/23 19:41:44 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 4; dirCnt = 1
20/03/23 19:41:44 INFO tools.SimpleCopyListing: Build file listing completed.
20/03/23 19:41:44 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
20/03/23 19:41:44 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
20/03/23 19:41:44 INFO tools.DistCp: Number of paths in the copy list: 4
20/03/23 19:41:44 INFO tools.DistCp: Number of paths in the copy list: 4
20/03/23 19:41:44 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm756
20/03/23 19:41:44 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/maurin/.staging/job_1584390517558_0075
20/03/23 19:41:44 INFO mapreduce.JobSubmitter: number of splits:2
20/03/23 19:41:44 INFO Configuration.deprecation: yarn.resourcemanager.zk-address is deprecated. Instead, use hadoop.zk.address
20/03/23 19:41:44 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
20/03/23 19:41:45 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1584390517558_0075
20/03/23 19:41:45 INFO mapreduce.JobSubmitter: Executing with tokens: [Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:nameservice1, Ident: (token for maurin: HDFS_DELEGATION_TOKEN owner=maurin/lore_staff@net.getlore.io, renewer=yarn, realUser=, issueDate=1585017703760, maxDate=1585622503760, sequenceNumber=32272, masterKeyId=886)]
20/03/23 19:41:45 INFO conf.Configuration: resource-types.xml not found
20/03/23 19:41:45 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/03/23 19:41:45 INFO impl.YarnClientImpl: Submitted application application_1584390517558_0075
20/03/23 19:41:45 INFO mapreduce.Job: The url to track the job: http://cdhmaster3.net.cuberonlabs.com:8088/proxy/application_1584390517558_0075/
20/03/23 19:41:45 INFO tools.DistCp: DistCp job-id: job_1584390517558_0075
20/03/23 19:41:45 INFO mapreduce.Job: Running job: job_1584390517558_0075
20/03/23 19:41:55 INFO mapreduce.Job: Job job_1584390517558_0075 running in uber mode : false
20/03/23 19:41:55 INFO mapreduce.Job: map 0% reduce 0%
20/03/23 19:42:14 INFO mapreduce.Job: map 65% reduce 0%
20/03/23 19:42:24 INFO mapreduce.Job: map 100% reduce 0%
20/03/23 19:42:33 INFO mapreduce.Job: Job job_1584390517558_0075 completed successfully
20/03/23 19:42:33 INFO mapreduce.Job: Counters: 43
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=462696
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1259
HDFS: Number of bytes written=0
HDFS: Number of read operations=27
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
HDFS: Number of bytes read erasure-coded=0
S3A: Number of bytes read=0
S3A: Number of bytes written=4
S3A: Number of read operations=43
S3A: Number of large read operations=0
S3A: Number of write operations=21
Job Counters
Launched map tasks=2
Other local map tasks=2
Total time spent by all maps in occupied slots (ms)=217900
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=43580
Total vcore-milliseconds taken by all map tasks=348640
Total megabyte-milliseconds taken by all map tasks=223129600
Map-Reduce Framework
Map input records=4
Map output records=0
Input split bytes=234
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=323
CPU time spent (ms)=14370
Physical memory (bytes) snapshot=1128972288
Virtual memory (bytes) snapshot=12650110976
Total committed heap usage (bytes)=4581752832
Peak Map Physical memory (bytes)=569790464
Peak Map Virtual memory (bytes)=6332850176
File Input Format Counters
Bytes Read=1021
File Output Format Counters
Bytes Written=0
DistCp Counters
Bandwidth in Btyes=0
Bytes Copied=4
Bytes Expected=4
Files Copied=3
DIR_COPY=1
20/03/23 19:42:33 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system...
20/03/23 19:42:33 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped.
20/03/23 19:42:33 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete.

We can see that it did the same operation again: Files Copied=3.
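
To compare the two runs without reading the whole output, the counters can be pulled out of saved copies of the logs. This is only a rough sketch: distcp_counter and compare_runs are hypothetical helper names (not part of Hadoop), and it assumes each run's output was saved to a file and that the counter lines look like the "Files Copied=3" lines above.

```shell
#!/bin/sh
# Hypothetical helpers for comparing two saved DistCp logs; not part of Hadoop.

# Print the value of one counter (e.g. "Files Copied") found in a saved log.
# $1 = counter name, $2 = path to the log file
distcp_counter() {
    # Match lines such as "Files Copied=3" (optionally indented) and
    # print only the number after "=". Prints nothing if the counter is absent.
    sed -n "s/^[[:space:]]*$1=\([0-9][0-9]*\)[[:space:]]*$/\1/p" "$2" | tail -n 1
}

# Report whether the second run copied fewer files than the first.
# $1 = log of the first run, $2 = log of the second run
compare_runs() {
    first=$(distcp_counter "Files Copied" "$1")
    second=$(distcp_counter "Files Copied" "$2")
    # A missing counter is treated as 0 (assumption).
    echo "first run: ${first:-0} files copied, second run: ${second:-0} files copied"
    if [ "${second:-0}" -lt "${first:-0}" ]; then
        echo "second run copied fewer files than the first"
    else
        echo "second run copied at least as many files as the first"
    fi
}
```

With the two logs above saved as, say, run1.log and run2.log (file names are assumptions), compare_runs run1.log run2.log would report 3 files copied for both runs, which is the behaviour I am asking about.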

Is there anything I am missing in order to copy only the newly created files and replace the modified ones?

Thanks
