I am new to Spark, so apologies if this is a simple question.
I have a Python program that runs through Spark. It pulls data from a Hive table, processes each row (with 'func_1' in the code below), and writes the results to a new Hive table. The program works fine on a small sample, but the actual input Hive table is massive, and the other inputs to func_1 and the processing itself also involve a lot of data and memory, so when I run it on the entire input table I hit memory issues on the Spark cluster and the job ultimately fails.
This is the skeleton of the pertinent PySpark code:
from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)

# recreate the output table
sqlContext.sql("drop table if exists hivetable_out")
sqlContext.sql("create table hivetable_out(id bigint, date bigint, a string, b string)")

# pull the input table and run func_1 on every row
t = sqlContext.sql("select * from hivetable_in")
rdd = t.rdd.map(lambda x: func_1(x))

# convert back to a DataFrame and insert into the output table
columns = ['id', 'date', 'a', 'b']
df = sqlContext.createDataFrame(rdd, columns)
df.registerTempTable("newdata")
sqlContext.sql("insert into table hivetable_out select id, date, a, b from newdata")
How would I loop through chunks of the input data (e.g., predetermined percentages, ntiles, or even x number of rows) until it is all processed, iteratively writing each chunk back to the output Hive table? Would I use a SQL window function? Or would I just pull the data over to the Spark context, iterate func_1 over it, and unpersist the RDD after sending each chunk to the output Hive table?
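For example, would something along these lines be the right approach? This is just a rough sketch of what I have in mind, not working code; num_chunks and the ntile() ordering on id are placeholders I made up:

# rough sketch: split hivetable_in into num_chunks pieces with ntile()
# (num_chunks and ordering by id are just placeholders), then process
# and insert one chunk at a time
num_chunks = 10
t = sqlContext.sql(
    "select *, ntile({0}) over (order by id) as chunk_id "
    "from hivetable_in".format(num_chunks))

columns = ['id', 'date', 'a', 'b']
for i in range(1, num_chunks + 1):
    chunk = t.filter(t.chunk_id == i).drop('chunk_id')
    rdd = chunk.rdd.map(lambda x: func_1(x))
    df = sqlContext.createDataFrame(rdd, columns)
    df.registerTempTable("newdata")
    sqlContext.sql("insert into table hivetable_out "
                   "select id, date, a, b from newdata")
    df.unpersist()  # not sure this helps if df was never cached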
Many thanks for any assistance.