Reading files from s3 bucket sub folders

New Contributor

Hi all,

I am trying to read files from an S3 bucket that contains many subdirectories. At the moment I pass the physical path to read the files. How can I read them without hard-coded values?

File path : S3 bucket name/Folder/1005/SoB/20180722_zpsx3Gcc7J2MlNnViVp61/JPR_DM2_ORG/ *.gz files

The "S3 bucket name/Folder/" part of the path is fixed, and the client ID (1005) has to be passed as a parameter.

Under the SoB folder we have month-wise folders, and I need only the latest two months of data.

Please help me read the data without hard-coding the path.

Many thanks for your help.

1 ACCEPTED SOLUTION

Re: Reading files from s3 bucket sub folders

@Lakshmi Prathyusha,

You can write a simple Python snippet like the one below to read the subfolders. I have put a print statement in the code, but you can replace it with a subprocess call to run the command.

from datetime import date, timedelta
from dateutil.relativedelta import relativedelta

today = date.today()
two_months_back = today - relativedelta(months=2)

# Number of days between the two dates
delta = today - two_months_back

# One listing command per day, with the date formatted as yyyymmdd
for i in range(delta.days + 1):
    dt = str(two_months_back + timedelta(i)).replace("-", "")
    print("hdfs dfs -ls s3a://bucket/Folder/1005/SoB/%s" % dt)

-Aditya
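
To avoid hard-coding the client ID, the same idea can be wrapped in a function that takes it as a parameter. What follows is only a sketch under a few assumptions: the names `months_back`, `sob_patterns` and `list_folders` are made up for illustration, `bucket` is the placeholder used in the snippet above, and the trailing `*` assumes the folders start with a yyyymmdd date followed by a suffix, as in the example path from the question. `months_back` is a standard-library stand-in for `relativedelta`, used here so the sketch has no third-party dependency.

```python
import calendar
import subprocess
from datetime import date, timedelta

# Placeholder base path, as in the snippet above.
BASE = "s3a://bucket/Folder"


def months_back(d, months):
    """Step back whole months, clamping the day (31 March -> 28/29 February)."""
    year, month0 = divmod(d.year * 12 + (d.month - 1) - months, 12)
    month = month0 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def sob_patterns(client_id, today=None, months=2):
    """Return one glob pattern per day, from `months` months ago up to today."""
    today = today or date.today()
    start = months_back(today, months)
    days = (today - start).days
    return [
        "%s/%s/SoB/%s*" % (BASE, client_id, (start + timedelta(i)).strftime("%Y%m%d"))
        for i in range(days + 1)
    ]


def list_folders(client_id):
    """Run `hdfs dfs -ls` for each pattern; needs the hdfs CLI on the PATH."""
    for pattern in sob_patterns(client_id):
        # Days without a matching folder make the command exit non-zero,
        # so the return code is deliberately ignored in this sketch.
        subprocess.call(["hdfs", "dfs", "-ls", pattern])
```

Calling `list_folders("1005")` would issue one listing per day. Passing the arguments as a list means the pattern reaches `hdfs` unexpanded by the local shell.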

3 REPLIES 3

Re: Reading files from s3 bucket sub folders

New Contributor

Hi Aditya,

Thanks a lot for your help. Is it possible to do this in Scala? I don't have any knowledge of Python.

Re: Reading files from s3 bucket sub folders

@Lakshmi Prathyusha,

I'm not sure how to do this in Scala. I guess Scala has similar date-time functions, so you can apply the same logic there.
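
If the folder names under SoB are already available as a list, the selection can also be done without any date arithmetic, which may make it easier to port. The sketch below reads "latest two months" as the two newest yyyymm prefixes that actually appear in the listing, which is one possible interpretation of the question. The function name `latest_months` is hypothetical, and the code assumes folder names begin with a yyyymmdd date as in the example path. It relies only on string slicing and sorting.

```python
def latest_months(folder_names, months=2):
    """Keep folders whose yyyymm prefix is among the `months` newest prefixes."""
    # Ignore anything that does not start with eight digits (yyyymmdd).
    dated = [n for n in folder_names if len(n) >= 8 and n[:8].isdigit()]
    # Distinct yyyymm prefixes, newest first, cut to the requested count.
    newest = sorted({n[:6] for n in dated}, reverse=True)[:months]
    return sorted(n for n in dated if n[:6] in newest)
```

Each returned name can then be appended to the fixed part of the path together with the client ID.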