Created 10-09-2018 03:52 AM
Hi all,
I am trying to read files from an S3 bucket (which contains many subdirectories). As of now I am giving the physical path to read the files. How can I read the files without hard-coded values?
File path: S3 bucket name/Folder/1005/SoB/20180722_zpsx3Gcc7J2MlNnViVp61/JPR_DM2_ORG/*.gz files
The "S3 bucket name/Folder/" part of the path is fixed, and the client ID (1005) is passed as a parameter.
Under the SoB folder we have monthly folders, and I have to take only the latest two months of data.
Please help me read the data without hard-coding the paths.
Many thanks for your help.
Created 10-09-2018 09:52 AM
You can write a simple Python snippet like the one below to read the subfolders. I have put a print statement in the code, but you can replace it with a subprocess call to actually run each command.
from datetime import date, timedelta
from dateutil.relativedelta import relativedelta

today = date.today()
two_months_back = today - relativedelta(months=2)
delta = today - two_months_back

# Walk every day from two months back up to today and print an
# ls command for the matching dated subfolder (yyyymmdd)
for i in range(delta.days + 1):
    dt = str(two_months_back + timedelta(i)).replace("-", "")
    print("hdfs dfs -ls s3a://bucket/Folder/1005/SoB/%s" % dt)
-Aditya
Created 10-09-2018 01:04 PM
Hi Aditya,
Thanks a lot for your help. Is it possible to do this in Scala? I don't have any knowledge of Python.
Created 10-09-2018 02:20 PM
I'm not sure exactly how to do this in Scala, but Scala has similar date/time functions, so you can apply the same logic there.
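For what it's worth, a rough, untested sketch of the same logic in Scala using the java.time API (Java 8+) might look like this; the bucket path is the same placeholder used in the Python snippet above:

import java.time.LocalDate
import java.time.format.DateTimeFormatter

val today = LocalDate.now()
val twoMonthsBack = today.minusMonths(2)
val fmt = DateTimeFormatter.ofPattern("yyyyMMdd")

// Walk every day from two months back up to today and print the
// hdfs ls command for the matching dated subfolder
Iterator.iterate(twoMonthsBack)(_.plusDays(1))
  .takeWhile(!_.isAfter(today))
  .foreach(d => println(s"hdfs dfs -ls s3a://bucket/Folder/1005/SoB/${d.format(fmt)}"))

As with the Python version, you can swap the println for a call that actually executes the command (e.g. via scala.sys.process).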