Is it possible to use S3 for Falcon feeds?

I have not seen any example of using s3 in Falcon except for mirroring. Is it possible to use an S3-bucket as location path for a feed?

1 ACCEPTED SOLUTION

@Liam Murphy: Please find the details below:

1> Ensure that you have an account with Amazon S3 and a designated bucket for your data.

2> You must have an Access Key ID and a Secret Access Key for that account.

3> Configure HDFS for S3 storage by making the following changes to core-site.xml:

<property>
  <name>fs.default.name</name>
  <value>s3n://your-bucket-name</value>
</property>

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_S3_ACCESS_KEY</value>
</property>

<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_S3_SECRET_KEY</value>
</property>
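
If you are on a Hadoop version that ships the s3a connector, the equivalent credential properties would look roughly like this sketch (the endpoint value is only a placeholder and depends on the bucket's region):

<!-- Sketch only: s3a equivalents of the s3n credential properties above. -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_S3_ACCESS_KEY</value>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_S3_SECRET_KEY</value>
</property>

<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.amazonaws.com</value>
</property>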

4> In the Falcon feed.xml, specify the Amazon S3 location and schedule the feed:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="S3Replication" description="S3-Replication" xmlns="uri:falcon:feed:0.1">
    <frequency>hours(1)</frequency>
    <clusters>
        <cluster name="cluster1" type="source">
            <validity start="2016-09-01T00:00Z" end="2034-12-20T08:00Z"/>
            <retention limit="days(24)" action="delete"/>
        </cluster>
        <cluster name="cluster2" type="target">
            <validity start="2016-09-01T00:00Z" end="2034-12-20T08:00Z"/>
            <retention limit="days(90)" action="delete"/>
            <locations>
                <location type="data" path="s3://<bucket-name>/<path-folder>/${YEAR}-${MONTH}-${DAY}-${HOUR}/"/>
            </locations>
        </cluster>
    </clusters>
</feed>
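
Once the definition is saved, the feed would typically be submitted and scheduled with the Falcon CLI, roughly as sketched below (the file name is just an example):

# Sketch only: submit the feed definition, then schedule it by entity name.
falcon entity -type feed -submit -file s3-replication-feed.xml
falcon entity -type feed -schedule -name S3Replication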


13 REPLIES

Cloudera Employee

Documentation exists for WASB (http://falcon.apache.org/DataReplicationAzure.html); maybe just use s3a instead.


Thanks for that Sowmya,

This is definitely trying to do something! But I now see an exception in the Oozie logs which says:

160908110420441-oozie-oozi-W] ACTION[0000034-160908110420441-oozie-oozi-W@eviction] Launcher exception: com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain

org.apache.oozie.action.hadoop.JavaMainException: com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain

at org.apache.oozie.action.hadoop.JavaMain.run(JavaMain.java:59)

at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:47)

..

Explorer

I see that Sowmya already answered. Yes, we can specify S3 as the source/destination cluster(s) with paths (we support Azure as well). Here is a Falcon screenshot.

falcon1.png
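
To illustrate the Azure side, a target location would point at a wasb:// URL instead of s3://, roughly like the sketch below, where the container, storage account, and path are placeholders:

<!-- Sketch only: an Azure Blob Storage location for a Falcon feed. -->
<locations>
    <location type="data" path="wasb://<container>@<account>.blob.core.windows.net/<path-folder>/${YEAR}-${MONTH}-${DAY}-${HOUR}/"/>
</locations>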

The full exception in the Oozie log is as follows:

org.apache.oozie.action.hadoop.JavaMainException: com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain

at org.apache.oozie.action.hadoop.JavaMain.run(JavaMain.java:59)

at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:47)

at org.apache.oozie.action.hadoop.JavaMain.main(JavaMain.java:35)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:236)

at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)

at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)

at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)

at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:415)

at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)

at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

Caused by: com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain

at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)

at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)

at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)

at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)

at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)

at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)

at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)

at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)

at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)

at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)

at org.apache.falcon.hadoop.HadoopClientFactory$1.run(HadoopClientFactory.java:200)

at org.apache.falcon.hadoop.HadoopClientFactory$1.run(HadoopClientFactory.java:198)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:415)

at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)

at org.apache.falcon.hadoop.HadoopClientFactory.createFileSystem(HadoopClientFactory.java:198)

at org.apache.falcon.hadoop.HadoopClientFactory.createProxiedFileSystem(HadoopClientFactory.java:153)

at org.apache.falcon.hadoop.HadoopClientFactory.createProxiedFileSystem(HadoopClientFactory.java:145)

at org.apache.falcon.entity.FileSystemStorage.fileSystemEvictor(FileSystemStorage.java:317)

at org.apache.falcon.entity.FileSystemStorage.evict(FileSystemStorage.java:300)

at org.apache.falcon.retention.FeedEvictor.run(FeedEvictor.java:76)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

at org.apache.falcon.retention.FeedEvictor.main(FeedEvictor.java:52)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at org.apache.oozie.action.hadoop.JavaMain.run(JavaMain.java:56)

... 15 more

I have defined fs.s3a.access.key, fs.s3a.secret.key, and fs.s3a.endpoint in hdfs-site.xml. I can use hdfs dfs -ls s3a://<my-bucket> from the command line, and it works. I've also set the path in the feed example to be s3a://<my-bucket>...

But this exception would seem to say Oozie can't see the AWS access/secret key for some reason?

Regards,

Liam

Rising Star

If you are using multiple clusters, you need to make sure that the Hadoop configuration Oozie uses for the target cluster (see the oozie.service.HadoopAccessorService.hadoop.configurations property in oozie-site.xml) is correctly configured. In a single-cluster environment, Oozie will point to the local core-site.xml for this by default.
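
For reference, that mapping in oozie-site.xml looks roughly like the sketch below; "*=/etc/hadoop/conf" is the usual single-cluster default, and additional authority=conf-dir pairs would be added for other clusters:

<!-- Sketch only: Oozie's mapping from cluster authorities to Hadoop conf directories. -->
<property>
  <name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
  <value>*=/etc/hadoop/conf</value>
</property>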

Hi Venkat,

The property is set to *=/etc/hadoop/conf. This is just a simple single-node cluster (HDP 2.3 sandbox). The s3a properties have been added to both the core-site and hdfs-site files, but it's still the same problem, I'm afraid.
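
For what it's worth, a quick way to double-check that the keys actually landed in the conf directory Oozie reads is something like the sketch below (paths assume the default /etc/hadoop/conf):

# Sketch only: confirm the s3a credential properties are present in the
# Hadoop conf directory that Oozie is pointed at.
grep -A1 "fs.s3a" /etc/hadoop/conf/core-site.xml
grep -A1 "fs.s3a" /etc/hadoop/conf/hdfs-site.xml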

@Liam Murphy: Can you attach the feed XML and the Falcon and Oozie logs? It looks like eviction is failing. Can you see if the replication succeeded? The Oozie bundle created will have one coordinator for retention and another for replication. Thanks!

Hi Sowmya,

The attached file contains the feed definition and the Falcon and Oozie logs. I submitted and scheduled the feed around the 14:40 timestamp.

file.tar.gz

Thanks for your help

Liam

Hi Sowmya,

Is there any other debug information I can provide to help find the cause of the problem?

Kind Regards,

Liam

@Liam Murphy: In the Oozie log I can see that the replication paths don't exist. Can you make sure the files exist?

Eviction fails because of a credentials issue. Can you make sure core-site and hdfs-site have the required configs, restart the services, and resubmit the feed? Thanks!

2016-09-09 14:44:43,680  INFO CoordActionInputCheckXCommand:520 - SERVER[sandbox.hortonworks.com] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000058-160909120521096-oozie-oozi-C] ACTION[0000058-160909120521096-oozie-oozi-C@10] [0000058-160909120521096-oozie-oozi-C@10]::ActionInputCheck:: File:hftp://192.168.39.108:50070/falcon/2016-09-09-01, Exists? :false
2016-09-09 14:44:43,817  INFO CoordActionInputCheckXCommand:520 - SERVER[sandbox.hortonworks.com] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000058-160909120521096-oozie-oozi-C] ACTION[0000058-160909120521096-oozie-oozi-C@11] [0000058-160909120521096-oozie-oozi-C@11]::CoordActionInputCheck:: Missing deps:hftp://192.168.39.108:50070/falcon/2016-09-09-01 
2016-09-09 14:44:43,818  INFO CoordActionInputCheckXCommand:520 - SERVER[sandbox.hortonworks.com] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000058-160909120521096-oozie-oozi-C] ACTION[0000058-160909120521096-oozie-oozi-C@11] [0000058-160909120521096-oozie-oozi-C@11]::ActionInputCheck:: In checkListOfPaths: hftp://192.168.39.108:50070/falcon/2016-09-09-01 is Missing.

I just noticed that when a path does not exist for a given hour, Falcon/Oozie just gets stuck rather than checking the next hour. My misunderstanding, I guess. I have it working now.

Explorer

Hi Team / @Sowmya Ramesh, I am trying to use Falcon to replicate HDFS to S3. I have tried the above steps, and the HDFStoS3 replication job status is KILLED after running the workflow. After launching Oozie, I can see the workflow change status from RUNNING to KILLED. Is there a way to troubleshoot this? I can run hadoop fs -ls commands on my S3 bucket, so I definitely have access. I suspect it's the S3 URL. I tried downloading the XML, changing the URL to drop the s3.region.amazonaws.com part, and uploading it again, with no luck. Any other suggestions? I appreciate all your help/support in advance. Regards

Anil
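
A general first step for a KILLED workflow like this is to pull the job details and error log from the Oozie CLI, roughly as below; the workflow ID is a placeholder, and OOZIE_URL (or -oozie <url>) is assumed to be set:

# Sketch only: generic Oozie CLI checks for a KILLED workflow.
oozie job -info <workflow-id>    # lists the actions and shows which one failed
oozie job -log <workflow-id>     # retrieves the job log, including the launcher error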