
Looping through chunks of massive Hive data, transforming in Spark dataframe, sending back to Hive

New Contributor

I am new to Spark, so apologies if this is a simple question.

I have a Python program that runs through Spark. The program pulls data from a Hive table, processes it ('func_1' in the code below), and then writes the result back to a new Hive table. The program works fine on a small sample, but the actual input Hive table is massive, and the other inputs to func_1 also involve a lot of data and memory, so I run into memory issues on the Spark cluster when I run against the entire input table, and ultimately the job fails.

This is a skeleton of the pertinent PySpark code:

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)  # sc is the existing SparkContext

# Recreate the output table
sqlContext.sql("drop table if exists hivetable_out")
sqlContext.sql("create table hivetable_out(id bigint, date bigint, a string, b string)")

# Pull the entire input table, apply func_1 row by row, and write back
t = sqlContext.sql("select * from hivetable_in")
rdd = t.rdd.map(lambda x: func_1(x))
columns = ['id', 'date', 'a', 'b']
df = sqlContext.createDataFrame(rdd, columns)
df.registerTempTable("newdata")
sqlContext.sql("insert into table hivetable_out select id, date, a, b from newdata")

How would I loop through chunks (e.g., predetermined percentages, ntiles, or even x number of rows) of the input data until complete, iteratively sending each chunk back to the output Hive table? Would I use a SQL window function? Or would I just pull the data over to the Spark context, iterate func_1 over it, and unpersist the RDD after sending each chunk to the output Hive table?
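
For example, I was picturing something along these lines (an untested sketch; the chunk count of 10 and the ordering by id are arbitrary choices on my part):

# Untested sketch: bucket the input with the ntile() window function,
# then process and write one bucket at a time.
n_chunks = 10
t = sqlContext.sql(
    "select *, ntile(%d) over (order by id) as chunk from hivetable_in"
    % n_chunks)

for i in range(1, n_chunks + 1):
    chunk = t.where(t.chunk == i).drop('chunk')
    rdd = chunk.rdd.map(lambda x: func_1(x))
    df = sqlContext.createDataFrame(rdd, ['id', 'date', 'a', 'b'])
    df.registerTempTable("newdata")
    sqlContext.sql("insert into table hivetable_out select id, date, a, b from newdata")
    sqlContext.dropTempTable("newdata")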

Many thanks for any assistance.

1 REPLY

Super Collaborator

Hi @D Mortimer,

I presume the problem comes from converting to an RDD and processing it in chunks.

Instead of that, consider registering your custom function as a UDF and applying the logic directly to the DataFrame you retrieved into t, without persisting the data in memory. That way the UDF is applied across multiple executors as the data streams through, and the result is written back to the table (shuffling if needed).
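
A minimal sketch of that idea, assuming the logic inside func_1 can be split into column-level functions (derive_a and derive_b here are hypothetical placeholders for pieces of your func_1, each returning a string):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Express the row-level logic as column-level UDFs so Spark can stream
# the data through the executors instead of materializing an RDD.
# derive_a / derive_b are hypothetical stand-ins for your func_1 logic.
derive_a_udf = udf(derive_a, StringType())
derive_b_udf = udf(derive_b, StringType())

t = sqlContext.sql("select * from hivetable_in")
out = t.select(
    t.id,
    t.date,
    derive_a_udf(t.id, t.date).alias('a'),
    derive_b_udf(t.id, t.date).alias('b'))

# Append into the existing Hive table; nothing is collected to the driver.
out.write.insertInto("hivetable_out")

This way there is no manual chunking: Spark partitions the work across the executors and spills to disk if a shuffle is needed.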