Hello everyone,
I have a situation where I would appreciate the community's advice and perspective. I'm working with PySpark 2.0 and Python 3.6 in an AWS environment with Glue.
I need to fetch historical information covering many years and then join the results of a number of earlier queries. So I decided to create a DF for every query, so that I can easily iterate over the years and months I want to go back and create the DFs on the fly.
The problem arises when I need to join the DFs created in the loop: I reuse the same DF variable name on every iteration, and if I try to build a DF name inside the loop, that name is just a string rather than a DF, so I cannot join them later.
So far my code looks like:
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import ApplyMapping

# spark and glueContext come from the usual Glue job setup
query_text = 'SELECT * FROM TABLE WHERE MONTH = {}'
months = [1, 2]
frame_list = []
for item in months:
    df = 'cohort_2013_{}'.format(item)
    query = query_text.format(item)
    frame_list.append(df)  # I intend to keep the DF name in a list so I can recall it later
    df = spark.sql(query)
    df = DynamicFrame.fromDF(df, glueContext, "df")
    applyformat = ApplyMapping.apply(frame = df, mappings =
        [("field1","string","field1","string"),
         ("field2","string","field2","string")],
        transformation_ctx = "applyformat")
for df in frame_list:
    # TODO: create a join query for all the DFs created above
    pass
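One possible direction, as a minimal sketch that has not been verified on Glue: store the DataFrame objects themselves (for example in a dict keyed by the generated name) instead of the name strings, then fold the collection into a single joined DataFrame with functools.reduce. The helper names build_frames and join_all are hypothetical, and joining on field1 is only an assumption about the key. The sketch relies only on each frame having a join(other, on=..., how=...) method, which the PySpark DataFrame provides.

```python
from functools import reduce


def build_frames(make_frame, months, prefix="cohort_2013_"):
    """Return {name: frame}, keeping the frame objects rather than name strings.

    make_frame is a hypothetical callable that builds one frame per month.
    """
    return {"{}{}".format(prefix, month): make_frame(month) for month in months}


def join_all(frames, on, how="inner"):
    """Fold a sequence of frames into one through successive .join calls."""
    return reduce(lambda left, right: left.join(right, on=on, how=how), frames)


# Hypothetical usage inside the Glue job (assumes spark and query_text exist,
# and that field1 is the join key):
# frames = build_frames(lambda m: spark.sql(query_text.format(m)), [1, 2])
# joined = join_all(list(frames.values()), on="field1")
```

Because the dict keeps each name next to its frame, a single cohort can still be looked up by name. Note that non-key columns with the same name in several frames will appear more than once in the joined result, so they may need renaming before the join.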
If anyone knows how I could achieve this, please share your ideas.
Thanks so much.