I am new to Spark, so apologies if this is a simple question.
I have a Python program that runs on Spark. The program pulls data from a Hive table, processes it ('func_1' in the code below), and writes the result back to a new Hive table. It works fine on a small sample, but the actual input Hive table is massive, and the other inputs to func_1 and its processing also consume a lot of data and memory. As a result, when I run the program on the entire input table I hit memory issues on the Spark cluster and the job ultimately fails.
This is the skeleton of the pertinent PySpark code:
from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
sqlContext.sql("drop table if exists hivetable_out")
sqlContext.sql("create table hivetable_out(id bigint, date bigint, a string, b string)")

t = sqlContext.sql("select * from hivetable_in")
rdd = t.rdd.map(lambda x: func_1(x))
columns = ['id', 'date', 'a', 'b']
df = sqlContext.createDataFrame(rdd, columns)
df.registerTempTable("newdata")

sqlContext.sql("insert into table hivetable_out select id, date, a, b from newdata")
How would I loop through chunks of the input data (e.g., predetermined percentages, ntiles, or a fixed number of rows) until complete, iteratively writing each chunk back to the output Hive table? Would I use a SQL window function? Or would I just pull the data over to the Spark context, iterate func_1 over it, and unpersist the RDD after writing each chunk to the output table?
Many thanks for any assistance.
Hi @D Mortimer,
I presume the problem is converting to an RDD and processing the data in chunks.
Instead of that, I would port the custom function into a registered UDF and apply the logic directly to the DataFrame you retrieved into t, without persisting the data into memory. That way the UDF gets applied across multiple executors as the data streams through, and the result is written back to the table (shuffling only if needed).