Looping through chunks of massive Hive data, transforming in Spark dataframe, sending back to Hive

New Contributor

I am new to Spark, so apologies if this is a simple question.

I have a Python program that runs on Spark. The program pulls data from a Hive table, processes it ('func_1' in the code below), and writes the result back to a new Hive table. It works fine on a small sample, but the actual input Hive table is massive, and the other inputs to func_1 also involve a lot of data and memory, so when I run it on the entire input table I hit memory issues on the Spark cluster and the job ultimately fails.

This is the skeleton of the pertinent PySpark code:

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)  # sc is the existing SparkContext

# Recreate the output table
sqlContext.sql("drop table if exists hivetable_out")
sqlContext.sql("create table hivetable_out(id bigint, date bigint, a string, b string)")

# Pull the input table and apply func_1 row by row
t = sqlContext.sql("select * from hivetable_in")
rdd = t.rdd.map(lambda x: func_1(x))

# Convert back to a DataFrame and insert into the output table
columns = ['id', 'date', 'a', 'b']
df = sqlContext.createDataFrame(rdd, columns)
df.registerTempTable("newdata")
sqlContext.sql("insert into table hivetable_out select id, date, a, b from newdata")

How would I loop through chunks of the input data (e.g., predetermined percentages, ntiles, or even x rows at a time) until complete, iteratively writing each chunk back to the output Hive table? Would I use a SQL window function? Or would I just pull the data over to the Spark context, iterate func_1 on it, and unpersist the RDD after writing each chunk to the output table?
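For concreteness, here is a rough sketch of the ntile idea I have in mind (assuming the table has an id column to order by; the chunk count of 10 is arbitrary):

n_chunks = 10

# Tag every row with a chunk number 1..n_chunks via a window function
# (note the window itself still sorts the whole table once)
t = sqlContext.sql(
    "select *, ntile({0}) over (order by id) as chunk "
    "from hivetable_in".format(n_chunks))

# Process and write back one chunk at a time
for i in range(1, n_chunks + 1):
    chunk = t.where(t.chunk == i).drop('chunk')
    rdd = chunk.rdd.map(lambda x: func_1(x))
    df = sqlContext.createDataFrame(rdd, ['id', 'date', 'a', 'b'])
    df.registerTempTable("newdata")
    sqlContext.sql("insert into table hivetable_out "
                   "select id, date, a, b from newdata")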

Many thanks for any assistance.

1 REPLY

Super Collaborator

Hi @D Mortimer,

I presume the problem is converting the DataFrame to an RDD and trying to process it in chunks.

Instead of that, I would register the custom function as a UDF and apply it directly to the DataFrame you retrieved into t, without persisting the data in memory. That way the UDF is applied across multiple executors as the data streams through, and the result is written back to the table (with shuffles only if needed).
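A minimal sketch of what I mean is below (assuming func_1 takes a row and returns the four output values as a tuple; the schema and column names just mirror your code):

from pyspark.sql.functions import udf, struct, col
from pyspark.sql.types import StructType, StructField, LongType, StringType

# Declare the return type of func_1 so Spark can apply it as a UDF
# (hypothetical: func_1 must return (id, date, a, b))
out_schema = StructType([
    StructField("id", LongType()),
    StructField("date", LongType()),
    StructField("a", StringType()),
    StructField("b", StringType()),
])

func_1_udf = udf(func_1, out_schema)

t = sqlContext.sql("select * from hivetable_in")

# Apply the UDF row by row; Spark distributes this across executors
# instead of materializing anything on the driver
result = (t.withColumn("out", func_1_udf(struct(*[t[c] for c in t.columns])))
           .select(col("out.id"), col("out.date"), col("out.a"), col("out.b")))

result.registerTempTable("newdata")
sqlContext.sql("insert into table hivetable_out select id, date, a, b from newdata")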