How to fix a size limit error when working with a Hive table in PySpark

New Contributor

I have a Hive table with 4 billion rows that I need to work with in PySpark:

my_table = sqlContext.table('my_hive_table')

When I try to perform any action against that table, such as a count, I get the following exception (followed by TaskKilled exceptions):

my_table.count()
Py4JJavaError: An error occurred while calling o89.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6732 in stage 13.0 failed 4 times, most recent failure: Lost task 6732.3 in stage 13.0 (TID 30759, some_server.XX.net, executor 38): org.apache.hive.com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit.

Is there some way I can get around this issue without upgrading anything, for example by modifying an environment variable or a config setting somewhere, or by passing an argument to pyspark on the command line? I've sketched below the kind of thing I'm hoping for.
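
To make the question concrete, the snippet below shows the sort of workaround I have in mind. The property name is a placeholder of my own invention; I don't know which setting, if any, actually raises this protobuf message size limit, and that is exactly what I'm asking.

# Hypothetical: 'hypothetical.protobuf.size.limit.property' is a made-up key;
# the real question is which property (if any) raises the protobuf size limit
# hit while reading this table's metadata.
sqlContext.setConf('hypothetical.protobuf.size.limit.property', '268435456')

Presumably the same value could also be passed at launch time with something like pyspark --conf "spark.hadoop.hypothetical.protobuf.size.limit.property=268435456", since, as I understand it, spark.hadoop.* properties get copied into the Hadoop configuration.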

1 REPLY

Guru