New Contributor
Posts: 1
Registered: ‎08-09-2013

Hadoop MR with S3

Hello,

 

Are there any guides, how-tos, or tutorials that explain how to configure Hadoop MapReduce to use Amazon S3 for input and output?

 

thanks a lot :)

Expert Contributor
Posts: 63
Registered: ‎08-06-2013

Re: Hadoop MR with S3

New Contributor
Posts: 1
Registered: ‎08-11-2013

Re: Hadoop MR with S3

Hi:

 

When we tried to start the JT using S3 as a replacement for HDFS, the following exceptions occurred:

2013-05-08 13:48:08,099 INFO org.apache.hadoop.mapred.JobTracker: Creating the system directory
2013-05-08 13:48:08,601 WARN org.jets3t.service.S3Service: Encountered 1 S3 Internal Server error(s), will retry in 50ms
2013-05-08 13:48:12,699 INFO org.apache.hadoop.mapred.JobTracker: problem cleaning system directory: s3n://bkt0424/mapred/system
org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: Encountered too many S3 Internal Server errors (6), aborting request.
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleServiceException(Jets3tNativeFileSystemStore.java:229)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.storeEmptyFile(Jets3tNativeFileSystemStore.java:97)
...
2013-05-08 13:48:37,873 WARN org.apache.hadoop.mapred.JobTracker: Failed to operate on mapred.system.dir (s3n://bkt0424/mapred/system) because of permissions.
2013-05-08 13:48:37,874 WARN org.apache.hadoop.mapred.JobTracker: This directory should be owned by the user 'mapred (auth:SIMPLE)'
2013-05-08 13:48:37,874 WARN org.apache.hadoop.mapred.JobTracker: Bailing out ...
org.apache.hadoop.security.AccessControlException: The systemdir s3n://bkt0424/mapred/system is not owned by mapred
2013-05-08 13:48:37,874 FATAL org.apache.hadoop.mapred.JobTracker: org.apache.hadoop.security.AccessControlException: The systemdir s3n://bkt0424/mapred/system is not owned by mapred
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1915)
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1724)
at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:297)
2013-05-08 13:48:37,875 INFO org.apache.hadoop.mapred.JobTracker: SHUTDOWN_MSG:

 

We then found that the JT checks the owner of the system directory as follows:

// Excerpt from org.apache.hadoop.mapred.JobTracker
public class JobTracker {
  FileSystem fs = null;  // org.apache.hadoop.fs.FileSystem

  JobTracker(final JobConf conf, String identifier, Clock clock) {
    ...
    FileStatus systemDirStatus = fs.getFileStatus(systemDir);
    ...
    // Startup is aborted unless the system directory is owned by the MR user.
    if (!systemDirStatus.getOwner().equals(getMROwner().getShortUserName())) {
      throw new AccessControlException("The systemdir " + systemDir +
          " is not owned by " + getMROwner().getShortUserName());
    }
    ...
  }
}

 

As described in HADOOP-8984, these checks won't work with the existing S3 implementation.
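
To make the failure concrete, here is a minimal stand-alone sketch of the comparison quoted above. It does not use any Hadoop classes. The class and method names are hypothetical, and it assumes that the S3 filesystem reports an empty owner string for the directory, which is my reading of the behaviour and not something confirmed by the log itself.

```java
import java.util.Objects;

// Hypothetical stand-in for the JobTracker owner check; not Hadoop code.
public class OwnerCheckSketch {

    // Same comparison as the excerpt: reported owner vs. the MR user name.
    static boolean ownerMatches(String reportedOwner, String mrUser) {
        return Objects.equals(reportedOwner, mrUser);
    }

    // Throws, like the JobTracker does, when the owner does not match.
    static void checkSystemDir(String systemDir, String reportedOwner, String mrUser) {
        if (!ownerMatches(reportedOwner, mrUser)) {
            throw new SecurityException(
                "The systemdir " + systemDir + " is not owned by " + mrUser);
        }
    }

    public static void main(String[] args) {
        // HDFS stores a real owner, so the check passes (URI is illustrative).
        checkSystemDir("hdfs://namenode/mapred/system", "mapred", "mapred");
        System.out.println("HDFS-style status: check passed");

        // ASSUMPTION: the S3 filesystem reports "" as the owner.
        try {
            checkSystemDir("s3n://bkt0424/mapred/system", "", "mapred");
        } catch (SecurityException e) {
            System.out.println("S3-style status: " + e.getMessage());
        }
    }
}
```

Under that assumption no value of the MR user name can satisfy the check, which would explain why the JT bails out regardless of bucket permissions.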

Do you have any suggestions for running MR jobs using S3 as a replacement for HDFS?

 

Thanks ~

 

 

Posts: 1,896
Kudos: 433
Solutions: 303
Registered: ‎07-31-2013

Re: Hadoop MR with S3

MR requires a proper distributed file system to run in a distributed environment and to be able to support security. S3 doesn't qualify as such a filesystem.

You can, however, use S3 for job input and output on an MR cluster that runs on HDFS. In that setup HDFS is used only for job jars and distributed cache files, so your data still lives in S3.
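
As a rough sketch of that setup, assuming the s3n connector that appears in the log above: the default filesystem stays on HDFS, and only the S3 credentials are added to core-site.xml. The values below are placeholders, not real keys.

```xml
<!-- core-site.xml fragment (sketch). fs.default.name is left pointing at HDFS;
     only the s3n credentials are added. Replace the placeholder values. -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>
```

With that in place, a job should be able to take s3n:// URIs as its input and output paths, for example s3n://my-bucket/input and s3n://my-bucket/output (hypothetical bucket name), while mapred.system.dir stays on HDFS.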