07-26-2017 01:55 PM
I have a Hive table with 4 billion rows that I need to work with in PySpark:

my_table = sqlContext.table('my_hive_table')
When I run any action against that table, such as counting, I get the following exception (followed by TaskKilled exceptions):
my_table.count()
Py4JJavaError: An error occurred while calling o89.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6732 in stage 13.0 failed 4 times, most recent failure: Lost task 6732.3 in stage 13.0 (TID 30759, some_server.XX.net, executor 38): org.apache.hive.com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit.
Is there some way I can get around this issue without upgrading anything, maybe by modifying an environment variable or config somewhere, or by passing an argument to pyspark via the command line?
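For reference, this is the general mechanism I mean by "passing an argument to pyspark via the command line" — a minimal sketch, assuming the property I'd need is settable this way. The specific keys shown are illustrative only; I have not confirmed that any of them raises this protobuf limit:

```shell
# Spark properties can be passed when launching the shell via --conf
# (spark.driver.memory is just an example key, not a claimed fix):
pyspark --conf spark.driver.memory=8g

# Hive session properties can also be set from inside the session, e.g.:
#   sqlContext.sql("SET hive.exec.orc.split.strategy=BI")
# (hive.exec.orc.split.strategy is a real Hive property, but whether it
# affects this error is an assumption, not something I've verified.)
```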
Labels:
- Apache Hive
- Apache Spark