Running TestDFSIO against S3A storage with HDP 2.6.2

Hello,

We are currently evaluating the different performance profiles of HDFS and S3 storage.

We think a good candidate tool for the job is TestDFSIO.

We currently use HDP 2.6.3 with correct fs.s3a.* settings: Spark inside Zeppelin works well against S3, and both hdfs dfs and distcp work as expected with s3a:// URIs.
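
For example, quick checks along these lines work without any problem (the local source path in the distcp command is just a placeholder):

$ hdfs dfs -ls s3a://benchmarks/
$ hadoop distcp /tmp/testdata s3a://benchmarks/testdata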

But when launching TestDFSIO with parameters to write into an S3A bucket, I get errors.

Looking at the code, there is a test.build.data parameter that can be changed at runtime:

$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -Dtest.build.data=/tmp/TestsDFSIO -write
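
For the run whose output is shown below, I pointed test.build.data at the bucket itself (hence baseDir = s3a://benchmarks/ in the log), i.e. an invocation along these lines:

$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -Dtest.build.data=s3a://benchmarks/ -write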

I'm getting this output:

18/01/24 14:10:38 INFO fs.TestDFSIO: TestDFSIO.1.8
18/01/24 14:10:38 INFO fs.TestDFSIO: nrFiles = 1
18/01/24 14:10:38 INFO fs.TestDFSIO: nrBytes (MB) = 1.0
18/01/24 14:10:38 INFO fs.TestDFSIO: bufferSize = 1000000
18/01/24 14:10:38 INFO fs.TestDFSIO: baseDir = s3a://benchmarks/
18/01/24 14:10:39 INFO fs.TestDFSIO: creating control file: 1048576 bytes, 1 files
java.lang.IllegalArgumentException: Wrong FS: s3a://benchmarks/io_control, expected: hdfs://experimentation2
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:214)
        at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:816)
        at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:812)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:812)
        at org.apache.hadoop.fs.TestDFSIO.createControlFile(TestDFSIO.java:304)
        at org.apache.hadoop.fs.TestDFSIO.run(TestDFSIO.java:815)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
        at org.apache.hadoop.fs.TestDFSIO.main(TestDFSIO.java:712)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
        at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
        at org.apache.hadoop.test.MapredTestDriver.run(MapredTestDriver.java:130)
        at org.apache.hadoop.test.MapredTestDriver.main(MapredTestDriver.java:138)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:233)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:148)

Ok, now let's try modifying fs.defaultFS:

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -Dfs.defaultFS=s3a://benchmarks/ -write

But now it fails because:

  • The test.build.data parameter sticks to its default value even when we change the fs.defaultFS parameter.
  • Some directories don't exist under my bucket, and they can't really exist on object storage: there are no directories, just objects identified by keys that merely look like a directory structure (see the listing commands right after this list).

    In this run they are:
    • /user/benchmarks/.staging/
    • /ats/active
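
The missing paths are easy to confirm with direct listings against the bucket, along these lines:

$ hdfs dfs -ls s3a://benchmarks/user/benchmarks/.staging
$ hdfs dfs -ls s3a://benchmarks/ats/active
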
18/01/24 14:12:34 INFO fs.TestDFSIO: TestDFSIO.1.8
18/01/24 14:12:34 INFO fs.TestDFSIO: nrFiles = 1
18/01/24 14:12:34 INFO fs.TestDFSIO: nrBytes (MB) = 1.0
18/01/24 14:12:34 INFO fs.TestDFSIO: bufferSize = 1000000
18/01/24 14:12:34 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
18/01/24 14:12:36 INFO fs.TestDFSIO: creating control file: 1048576 bytes, 1 files
18/01/24 14:12:38 INFO fs.TestDFSIO: created control files for: 1 files
18/01/24 14:12:39 INFO client.AHSProxy: Connecting to Application History server at node003.domain.com/10.40.128.20:10200
18/01/24 14:12:39 INFO client.AHSProxy: Connecting to Application History server at node003.domain.com/10.40.128.20:10200
18/01/24 14:12:40 INFO mapreduce.JobSubmissionFiles: Permissions on staging directory /user/benchmarks/.staging are incorrect: rwxrwxrwx. Fixing permissions to correct value rwx------
18/01/24 14:12:40 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
18/01/24 14:12:41 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
18/01/24 14:12:42 INFO mapred.FileInputFormat: Total input paths to process : 1
18/01/24 14:12:42 INFO mapreduce.JobSubmitter: number of splits:1
18/01/24 14:12:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1516378201653_0029
18/01/24 14:12:44 INFO impl.TimelineClientImpl: Timeline service address: http://node003.domain.com:8188/ws/v1/timeline/
18/01/24 14:12:44 INFO service.AbstractService: Service org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl failed in state STARTED; cause: java.io.FileNotFoundException: /ats/active does not exist
java.io.FileNotFoundException: /ats/active does not exist
        at org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.<init>(FileSystemTimelineWriter.java:118)
        at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.createTimelineWriter(TimelineClientImpl.java:320)
        at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStart(TimelineClientImpl.java:312)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getTimelineDelegationToken(YarnClientImpl.java:356)
        at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.addTimelineDelegationToken(YarnClientImpl.java:331)
        [...]
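
To be explicit about the goal: what I would ultimately like to get working is an invocation along these lines (the number and size of files are just example values):

$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -Dtest.build.data=s3a://benchmarks/TestDFSIO -write -nrFiles 4 -fileSize 128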

So, what am I doing wrong, and how can I run TestDFSIO against an S3 backend?

Thanks in advance
Bruno Lavoie