07-26-2017 01:55 PM
I have a Hive table with 4 billion rows that I need to work with in PySpark:

my_table = sqlContext.table('my_hive_table')
When I run any action against that table, such as counting, I get the following exception (followed by TaskKilled exceptions):
my_table.count()
Py4JJavaError: An error occurred while calling o89.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6732 in stage 13.0 failed 4 times, most recent failure: Lost task 6732.3 in stage 13.0 (TID 30759, some_server.XX.net, executor 38): org.apache.hive.com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit.
Is there some way I can get around this issue without upgrading anything, maybe by modifying an environment variable or config somewhere, or by passing an argument to pyspark via the command line?
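For reference, this is the general mechanism I mean by "passing an argument to pyspark via the command line" — a minimal sketch, assuming the property I'd need is settable this way. The specific keys shown are illustrative only; I have not confirmed that any of them raises this protobuf limit:

```shell
# Spark properties can be passed when launching the shell via --conf
# (spark.driver.memory is just an example key, not a claimed fix):
pyspark --conf spark.driver.memory=8g

# Hive session properties can also be set from inside the session, e.g.:
#   sqlContext.sql("SET hive.exec.orc.split.strategy=BI")
# (hive.exec.orc.split.strategy is a real Hive property, but whether it
# affects this error is an assumption, not something I've verified.)
```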
Labels:
- Apache Hive
- Apache Spark