Support Questions

Find answers, ask questions, and share your expertise

Unable to fetch more data like more than 7 million rows using Pyspark from the hive tables

Explorer

Resources we have in place:

7 nodes with each having 250 gb memory

vcpu = 32 per each node

configuration specified in spark-defaults.conf :

spark.executor.memory = 100g
spark.executor.memoryOverhead = 49g
spark.driver.memoryOverhead=200g
spark.driver.memory = 500g

Query tried to execute :

hive_context.sql("select * from 5mcsv CROSS JOIN 2mcsv").show(8000000)

 

So, we are facing below issue when trying to fetch 8 million rows with the above mentioned query. However , we are not facing any issue with fetching 7 million rows

 

Traceback (most recent call last):
  File "/root/hivespark.py", line 29, in <module>
    hive_context.sql("select * from 5mcsv CROSS JOIN 2mcsv").show(8000000)
  File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 381, in show
  File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o71.showString.
: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
        at java.lang.StringBuilder.append(StringBuilder.java:141)
        at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
        at scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:364)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:357)
        at scala.collection.AbstractTraversable.addString(Traversable.scala:104)
        at org.apache.spark.sql.Dataset$$anonfun$showString$2.apply(Dataset.scala:330)
        at org.apache.spark.sql.Dataset$$anonfun$showString$2.apply(Dataset.scala:330)
        at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
        at org.apache.spark.sql.Dataset.showString(Dataset.scala:330)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:750)

 

 

Got below error message when tried to fetch 9 million rows

 

22/06/08 02:44:04 WARN hdfs.DataStreamer: Exception for BP-1037869773-172.18.105.90-1650524469800:blk_1073833296_92560
java.io.EOFException: Unexpected EOF while trying to read response from server
        at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:552)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
        at org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1101)
22/06/08 02:44:04 WARN hdfs.DataStreamer: Error Recovery for BP-1037869773-172.18.105.90-1650524469800:blk_1073833296_92560 in pipeline [DatanodeInfoWithStorage[172.18.105.88:9866,DS-df61a542-f662-46db-9fc6-4c0b325e2e68,DISK], DatanodeInfoWithStorage[172.18.105.83:9866,DS-b781a5d9-5114-4807-9c91-0170578a8bb6,DISK], DatanodeInfoWithStorage[172.18.105.56:9866,DS-12f0be58-8862-4606-9b0c-0b3d6f77ce42,DISK]]: datanode 0(DatanodeInfoWithStorage[172.18.105.88:9866,DS-df61a542-f662-46db-9fc6-4c0b325e2e68,DISK]) is bad.
22/06/08 02:45:12 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on b7-38.lab.archivas.com:44696 in memory (size: 39.6 KB, free: 266.5 GB)
22/06/08 02:45:12 INFO spark.ContextCleaner: Cleaned accumulator 1
22/06/08 02:45:12 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on b7-38.lab.archivas.com:44696 in memory (size: 39.6 KB, free: 266.5 GB)
22/06/08 02:45:12 INFO spark.ContextCleaner: Cleaned accumulator 2
22/06/08 02:45:12 INFO spark.ContextCleaner: Cleaned accumulator 4
22/06/08 02:45:12 INFO spark.ContextCleaner: Cleaned accumulator 3
22/06/08 02:50:55 WARN hdfs.DataStreamer: Exception for BP-1037869773-172.18.105.90-1650524469800:blk_1073833296_92573
java.io.EOFException: Unexpected EOF while trying to read response from server
        at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:552)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
        at org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1101)
22/06/08 02:50:55 WARN hdfs.DataStreamer: Error Recovery for BP-1037869773-172.18.105.90-1650524469800:blk_1073833296_92573 in pipeline [DatanodeInfoWithStorage[172.18.105.83:9866,DS-b781a5d9-5114-4807-9c91-0170578a8bb6,DISK], DatanodeInfoWithStorage[172.18.105.56:9866,DS-12f0be58-8862-4606-9b0c-0b3d6f77ce42,DISK], DatanodeInfoWithStorage[172.18.105.84:9866,DS-aaf0af68-8eaf-4bba-99f1-5d641ddfe726,DISK]]: datanode 0(DatanodeInfoWithStorage[172.18.105.83:9866,DS-b781a5d9-5114-4807-9c91-0170578a8bb6,DISK]) is bad.
Traceback (most recent call last):
  File "/root/hivespark.py", line 29, in <module>
    hive_context.sql("select * from 5mcsv CROSS JOIN 2mcsv").show(9000000)
  File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 381, in show
  File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o71.showString.
: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
        at java.lang.StringBuilder.append(StringBuilder.java:141)
        at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
        at scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:364)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

 

 

Can anyone help me with exact memory configurations based on the available resources i provided or if there is any additional parameters to be set ?

1 REPLY 1

Expert Contributor

Hi @mmk 

 

I think you have shared the following information.

 

7 nodes with each having 250 gb memory and vcpu = 32 per each node

 

spark-defaults.conf 

spark.executor.memory = 100g
spark.executor.memoryOverhead = 49g
spark.driver.memoryOverhead=200g
spark.driver.memory = 500g

 

You have maximum of 250 gb for node and you have specified driver memory is (500gb and 200gb overhead). How it possible to driver to get 700gb?

 

Generally you should not exceed the driver/executor memory beyond yarn physical memory.

 

Coming to the actual problem, please avoid the show() to print 8000000 records. If you need to get the print the all values, then implement a logic to 1000 records at once and next 1000 records for another iteration. 

 

https://stackoverflow.com/questions/29227949/how-to-implement-spark-sql-pagination-query

 

Take a Tour of the Community
Don't have an account?
Your experience may be limited. Sign in to explore more.