I have a hive table with 4 billion rows that I need to work with in pyspark:
my_table = sqlContext.table('my_hive_table')
When I try to perform any action against that table, such as counting, I get the following exception (followed by TaskKilled exceptions):
my_table.count()
Py4JJavaError: An error occurred while calling o89.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6732 in stage 13.0 failed 4 times, most recent failure: Lost task 6732.3 in stage 13.0 (TID 30759, some_server.XX.net, executor 38): org.apache.hive.com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit.
Is there some way I can get around this issue without upgrading anything, maybe by modifying an environment variable or config somewhere, or by passing an argument to pyspark via the command line?
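For reference, these are the mechanisms I have available for passing settings through to Spark and the underlying Hadoop/Hive configuration. The property name below is only a placeholder, since I don't know which setting (if any) actually controls this protobuf size limit:

# All of the following are sketches; 'some.property.name' is a placeholder,
# not a setting I know to exist.

# 1) On the command line; the spark.hadoop. prefix forwards the value
#    into the underlying Hadoop/Hive configuration:
#    pyspark --conf spark.hadoop.some.property.name=268435456

# 2) At runtime, through the SQL context:
sqlContext.setConf('some.property.name', '268435456')

# 3) When building the context from scratch:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().set('spark.hadoop.some.property.name', '268435456')
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
my_table = sqlContext.table('my_hive_table')

If one of these routes can reach whatever calls CodedInputStream.setSizeLimit(), that would be exactly the kind of workaround I'm looking for.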