Hadoop MR with S3

New Contributor

Hello,

 

Are there any guides, how-tos, or tutorials that explain how to configure Hadoop MapReduce to use Amazon S3 for input and output?

 

thanks a lot :)

Re: Hadoop MR with S3

New Contributor

Hi:

 

When we tried to start the JobTracker (JT) using S3 as a replacement for HDFS, the following exceptions occurred:

2013-05-08 13:48:08,099 INFO org.apache.hadoop.mapred.JobTracker: Creating the system directory
2013-05-08 13:48:08,601 WARN org.jets3t.service.S3Service: Encountered 1 S3 Internal Server error(s), will retry in 50ms
2013-05-08 13:48:12,699 INFO org.apache.hadoop.mapred.JobTracker: problem cleaning system directory: s3n://bkt0424/mapred/system
org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: Encountered too many S3 Internal Server errors (6), aborting request.
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleServiceException(Jets3tNativeFileSystemStore.java:229)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.storeEmptyFile(Jets3tNativeFileSystemStore.java:97)
...
2013-05-08 13:48:37,873 WARN org.apache.hadoop.mapred.JobTracker: Failed to operate on mapred.system.dir (s3n://bkt0424/mapred/system) because of permissions.
2013-05-08 13:48:37,874 WARN org.apache.hadoop.mapred.JobTracker: This directory should be owned by the user 'mapred (auth:SIMPLE)'
2013-05-08 13:48:37,874 WARN org.apache.hadoop.mapred.JobTracker: Bailing out ...
org.apache.hadoop.security.AccessControlException: The systemdir s3n://bkt0424/mapred/system is not owned by mapred
2013-05-08 13:48:37,874 FATAL org.apache.hadoop.mapred.JobTracker: org.apache.hadoop.security.AccessControlException: The systemdir s3n://bkt0424/mapred/system is not owned by mapred
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1915)
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1724)
at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:297)
2013-05-08 13:48:37,875 INFO org.apache.hadoop.mapred.JobTracker: SHUTDOWN_MSG:

 

We then found that the JT checks the owner of the system directory as follows:

// Excerpt from org.apache.hadoop.mapred.JobTracker
public class JobTracker {
  FileSystem fs = null;  // org.apache.hadoop.fs.FileSystem

  JobTracker(final JobConf conf, String identifier, Clock clock) {
    ...
    FileStatus systemDirStatus = fs.getFileStatus(systemDir);
    ...
    // Startup aborts unless the system directory is owned by the MR user.
    if (!systemDirStatus.getOwner().equals(getMROwner().getShortUserName())) {
      throw new AccessControlException("The systemdir " + systemDir +
          " is not owned by " + getMROwner().getShortUserName());
    }
    ...
  }
}

 

As described in HADOOP-8984, these checks don't work with the existing S3 implementation.
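
To make the failure concrete, here is a minimal, self-contained sketch of the same comparison. It does not use Hadoop classes: checkOwner is a hypothetical stand-in for the check in the JobTracker constructor, and SecurityException stands in for Hadoop's AccessControlException. It also assumes that the s3n file status reports an empty owner string, which is my reading of the behavior and is not confirmed by the logs above.

```java
// Hypothetical stand-in for the owner check in the JobTracker constructor.
final class SystemDirOwnerCheck {

    // Throws if the owner reported by the filesystem differs from the MR user.
    static void checkOwner(String systemDir, String reportedOwner, String mrOwner) {
        if (!reportedOwner.equals(mrOwner)) {
            throw new SecurityException("The systemdir " + systemDir
                    + " is not owned by " + mrOwner);
        }
    }

    public static void main(String[] args) {
        // HDFS stores an owner with the directory, so the comparison can succeed.
        checkOwner("hdfs://namenode/mapred/system", "mapred", "mapred");
        System.out.println("HDFS-style status: check passed");

        // Assumption: the s3n status carries no owner (empty string),
        // so the comparison fails no matter which user starts the JT.
        try {
            checkOwner("s3n://bkt0424/mapred/system", "", "mapred");
        } catch (SecurityException e) {
            System.out.println("S3-style status: " + e.getMessage());
        }
    }
}
```

If that assumption holds, no ownership or permission change on the bucket would satisfy the check, because there is no owner value for it to match.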

Do you have any suggestions for running MR jobs using S3 as a replacement for HDFS?

 

Thanks ~

 

 

Re: Hadoop MR with S3

Master Guru
MR requires a proper distributed file system to run in a distributed environment and to support security. S3 doesn't qualify as such a filesystem.

You can, however, use S3 for job input and output on an MR cluster that runs on HDFS. Such jobs use HDFS only for job jars and distributed cache files, so the data itself stays in S3.
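
A rough sketch of what that setup could look like, written as plain Java so it runs without Hadoop on the classpath. The property names (fs.default.name, fs.s3n.awsAccessKeyId, fs.s3n.awsSecretAccessKey, mapred.input.dir, mapred.output.dir) are the MR1-era keys as I understand them. The class S3JobSettings, the namenode address, the credentials, and the input/output paths are placeholders I made up for illustration. In a real job you would set these on the JobConf, or pass them as -D options, rather than build a map.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects the settings for a job that reads and
// writes S3 while the cluster itself keeps running on HDFS.
final class S3JobSettings {

    static Map<String, String> build(String defaultFs, String accessKey,
                                     String secretKey, String input, String output) {
        requireScheme(defaultFs, "hdfs"); // default filesystem stays on HDFS
        requireScheme(input, "s3n");      // only job I/O goes to S3
        requireScheme(output, "s3n");

        Map<String, String> props = new LinkedHashMap<>();
        props.put("fs.default.name", defaultFs);
        props.put("fs.s3n.awsAccessKeyId", accessKey);
        props.put("fs.s3n.awsSecretAccessKey", secretKey);
        props.put("mapred.input.dir", input);
        props.put("mapred.output.dir", output);
        return props;
    }

    // Rejects a URI whose scheme is not the expected one.
    static void requireScheme(String uri, String expected) {
        String scheme = URI.create(uri).getScheme();
        if (!expected.equals(scheme)) {
            throw new IllegalArgumentException(
                    "Expected a " + expected + ":// URI but got: " + uri);
        }
    }

    public static void main(String[] args) {
        Map<String, String> props = build(
                "hdfs://namenode:8020",   // placeholder namenode address
                "YOUR_ACCESS_KEY",        // placeholder credentials
                "YOUR_SECRET_KEY",
                "s3n://bkt0424/input",    // placeholder paths in the bucket
                "s3n://bkt0424/output");
        // Print the settings in the form of generic -D options.
        props.forEach((k, v) -> System.out.println("-D" + k + "=" + v));
    }
}
```

Note that mapred.system.dir is deliberately not pointed at S3 here. It stays on HDFS, which is what lets the JobTracker's ownership check pass.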