
PySpark: Create DataFrames in a loop and then run a join across all of them

Expert Contributor

Hello everyone,

I have a situation where I would like the community's advice and perspective. I'm working with PySpark 2.0 and Python 3.6 in an AWS Glue environment.

I need to fetch historical information spanning many years and then join the results of a number of queries. So I decided to create a DataFrame for each query, so that I could easily iterate over the years and months I want to go back and create the DataFrames on the fly.

The problem comes up when I need to join the DataFrames created in the loop: I reuse the same variable name inside the loop, and when I tried to generate a distinct name per iteration, that name is just a string, not a reference to the DataFrame, so I cannot join them later.
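To show the string-vs-object issue outside of Spark (a minimal pure-Python sketch, no Glue involved): a list of name strings carries nothing you can join, but a list holding the objects themselves keeps them usable after the loop.

```python
# Plain-Python illustration: storing names vs. storing the objects.
names = []
frames = []

for month in [1, 2]:
    data = {"month": month}                           # stand-in for a DataFrame
    names.append('cohort_2013_{}'.format(month))      # just a string label
    frames.append(data)                               # the object itself

# names[0] is only the text 'cohort_2013_1' -- there is nothing to join.
# frames[0] is the real object, still accessible after the loop.
```

The same applies to DataFrames: append the DataFrame object to the list, not the formatted name.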

So far my code looks like:

query_text = 'SELECT * FROM TABLE WHERE MONTH = {}'
months = [1, 2]
frame_list = []


for item in months:
    name = 'cohort_2013_{}'.format(item)
    query = query_text.format(item)
    df = spark.sql(query)
    dyf = DynamicFrame.fromDF(df, glueContext, name)
    applyformat = ApplyMapping.apply(frame = dyf, mappings =
        [("field1","string","field1","string"),
         ("field2","string","field2","string")],
        transformation_ctx = "applyformat")
    frame_list.append(applyformat)  # keep the frame itself, not just its name

# Then chain a join across every frame collected above.
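Joining everything in the list is a fold over the frames. In PySpark this would be something like `reduce(lambda l, r: l.join(r, "field1"), [f.toDF() for f in frame_list])`, assuming `"field1"` is the shared key. The pattern itself can be sketched in plain Python (dicts stand in for DataFrames, dict-merge stands in for the join):

```python
from functools import reduce

# Stand-ins for the per-month frames collected in the loop above.
frames = [
    {"field1": "a", "jan_count": 10},
    {"field1": "a", "feb_count": 20},
]

def join(left, right):
    """Merge two 'frames' sharing a key -- the analogue of left.join(right, 'field1')."""
    merged = dict(left)
    merged.update(right)
    return merged

# Fold the whole list into one joined result.
joined = reduce(join, frames)
```

The field names and the join key here are assumptions for illustration; substitute the real key column from your tables.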

If anyone knows how I could achieve this, please share your ideas.

Thanks so much!
