Unable to fetch more data like more than 7 million rows using Pyspark from the hive tables

mmk — Wed, 08 Jun 2022 07:26:52 GMT

Resources we have in place:

7 nodes with each having 250 gb memory

vcpu = 32 per each node

configuration specified in spark-defaults.conf :

spark.executor.memory = 100g
spark.executor.memoryOverhead = 49g
spark.driver.memoryOverhead=200g
spark.driver.memory = 500g

Query tried to execute :

hive_context.sql("select * from 5mcsv CROSS JOIN 2mcsv").show(8000000)

So, we are facing below issue when trying to fetch 8 million rows with the above mentioned query. However , we are not facing any issue with fetching 7 million rows

Traceback (most recent call last): File "/root/hivespark.py", line 29, in <module> hive_context.sql("select * from 5mcsv CROSS JOIN 2mcsv").show(8000000) File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 381, in show File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__ File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o71.showString. : java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:3332) at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124) at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448) at java.lang.StringBuilder.append(StringBuilder.java:141) at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200) at scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:364) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:357) at scala.collection.AbstractTraversable.addString(Traversable.scala:104) at org.apache.spark.sql.Dataset$$anonfun$showString$2.apply(Dataset.scala:330) at org.apache.spark.sql.Dataset$$anonfun$showString$2.apply(Dataset.scala:330) at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74) at org.apache.spark.sql.Dataset.showString(Dataset.scala:330) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:750)

Got below error message when tried to fetch 9 million rows

22/06/08 02:44:04 WARN hdfs.DataStreamer: Exception for BP-1037869773-172.18.105.90-1650524469800:blk_1073833296_92560 java.io.EOFException: Unexpected EOF while trying to read response from server at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:552) at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213) at org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1101) 22/06/08 02:44:04 WARN hdfs.DataStreamer: Error Recovery for BP-1037869773-172.18.105.90-1650524469800:blk_1073833296_92560 in pipeline [DatanodeInfoWithStorage[172.18.105.88:9866,DS-df61a542-f662-46db-9fc6-4c0b325e2e68,DISK], DatanodeInfoWithStorage[172.18.105.83:9866,DS-b781a5d9-5114-4807-9c91-0170578a8bb6,DISK], DatanodeInfoWithStorage[172.18.105.56:9866,DS-12f0be58-8862-4606-9b0c-0b3d6f77ce42,DISK]]: datanode 0(DatanodeInfoWithStorage[172.18.105.88:9866,DS-df61a542-f662-46db-9fc6-4c0b325e2e68,DISK]) is bad. 22/06/08 02:45:12 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on b7-38.lab.archivas.com:44696 in memory (size: 39.6 KB, free: 266.5 GB) 22/06/08 02:45:12 INFO spark.ContextCleaner: Cleaned accumulator 1 22/06/08 02:45:12 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on b7-38.lab.archivas.com:44696 in memory (size: 39.6 KB, free: 266.5 GB) 22/06/08 02:45:12 INFO spark.ContextCleaner: Cleaned accumulator 2 22/06/08 02:45:12 INFO spark.ContextCleaner: Cleaned accumulator 4 22/06/08 02:45:12 INFO spark.ContextCleaner: Cleaned accumulator 3 22/06/08 02:50:55 WARN hdfs.DataStreamer: Exception for BP-1037869773-172.18.105.90-1650524469800:blk_1073833296_92573 java.io.EOFException: Unexpected EOF while trying to read response from server at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:552) at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213) at org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1101) 22/06/08 02:50:55 WARN hdfs.DataStreamer: Error Recovery for BP-1037869773-172.18.105.90-1650524469800:blk_1073833296_92573 in pipeline [DatanodeInfoWithStorage[172.18.105.83:9866,DS-b781a5d9-5114-4807-9c91-0170578a8bb6,DISK], DatanodeInfoWithStorage[172.18.105.56:9866,DS-12f0be58-8862-4606-9b0c-0b3d6f77ce42,DISK], DatanodeInfoWithStorage[172.18.105.84:9866,DS-aaf0af68-8eaf-4bba-99f1-5d641ddfe726,DISK]]: datanode 0(DatanodeInfoWithStorage[172.18.105.83:9866,DS-b781a5d9-5114-4807-9c91-0170578a8bb6,DISK]) is bad. Traceback (most recent call last): File "/root/hivespark.py", line 29, in <module> hive_context.sql("select * from 5mcsv CROSS JOIN 2mcsv").show(9000000) File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 381, in show File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__ File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o71.showString. : java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:3332) at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124) at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448) at java.lang.StringBuilder.append(StringBuilder.java:141) at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200) at scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:364) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

Can anyone help me with exact memory configurations based on the available resources i provided or if there is any additional parameters to be set ?

Re: Unable to fetch more data like more than 7 million rows using Pyspark from the hive tables

RangaReddy — Thu, 01 Sep 2022 05:48:46 GMT

Hi @mmk

I think you have shared the following information.

7 nodes with each having 250 gb memory and vcpu = 32 per each node

spark-defaults.conf

spark.executor.memory = 100g
spark.executor.memoryOverhead = 49g
spark.driver.memoryOverhead=200g
spark.driver.memory = 500g

You have maximum of 250 gb for node and you have specified driver memory is (500gb and 200gb overhead). How it possible to driver to get 700gb?

Generally you should not exceed the driver/executor memory beyond yarn physical memory.

Coming to the actual problem, please avoid the show() to print 8000000 records. If you need to get the print the all values, then implement a logic to 1000 records at once and next 1000 records for another iteration.

https://stackoverflow.com/questions/29227949/how-to-implement-spark-sql-pagination-query

question Unable to fetch more data like more than 7 million rows using Pyspark from the hive tables in Support Questions

Unable to fetch more data like more than 7 million rows using Pyspark from the hive tables

Re: Unable to fetch more data like more than 7 million rows using Pyspark from the hive tables