I have a hive table with 4 billion rows that I need to work with in pyspark:
my_table = sqlContext.table('my_hive_table')
When I try to perform any action against that table, such as counting, I get the following exception (followed by TaskKilled exceptions):
my_table.count()
Py4JJavaError: An error occurred while calling o89.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6732 in stage 13.0 failed 4 times, most recent failure: Lost task 6732.3 in stage 13.0 (TID 30759, some_server.XX.net, executor 38): org.apache.hive.com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit.
Is there some way I can get around this issue without upgrading anything, maybe by modifying an environment variable or config somewhere, or by passing an argument to pyspark via the command line?
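For reference, these are the mechanisms I have available for passing settings through to Spark and the underlying Hadoop/Hive configuration. The property name below is only a placeholder, since I don't know which setting (if any) actually controls this protobuf size limit:

# All of the following are sketches; 'some.property.name' is a placeholder,
# not a setting I know to exist.

# 1) On the command line; the spark.hadoop. prefix forwards the value
#    into the underlying Hadoop/Hive configuration:
#    pyspark --conf spark.hadoop.some.property.name=268435456

# 2) At runtime, through the SQL context:
sqlContext.setConf('some.property.name', '268435456')

# 3) When building the context from scratch:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().set('spark.hadoop.some.property.name', '268435456')
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
my_table = sqlContext.table('my_hive_table')

If one of these routes can reach whatever calls CodedInputStream.setSizeLimit(), that would be exactly the kind of workaround I'm looking for.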