Created 06-08-2022 12:26 AM
Resources we have in place:
7 nodes, each with 250 GB of memory
32 vCPUs per node
Configuration specified in spark-defaults.conf:
spark.executor.memory = 100g
spark.executor.memoryOverhead = 49g
spark.driver.memoryOverhead = 200g
spark.driver.memory = 500g
Query we tried to execute:
hive_context.sql("select * from 5mcsv CROSS JOIN 2mcsv").show(8000000)
We are facing the issue below when trying to fetch 8 million rows with the above query. However, we do not face any issue when fetching 7 million rows.
Traceback (most recent call last):
File "/root/hivespark.py", line 29, in <module>
hive_context.sql("select * from 5mcsv CROSS JOIN 2mcsv").show(8000000)
File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 381, in show
File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o71.showString.
: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
at java.lang.StringBuilder.append(StringBuilder.java:141)
at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
at scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:364)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:357)
at scala.collection.AbstractTraversable.addString(Traversable.scala:104)
at org.apache.spark.sql.Dataset$$anonfun$showString$2.apply(Dataset.scala:330)
at org.apache.spark.sql.Dataset$$anonfun$showString$2.apply(Dataset.scala:330)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:330)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
We got the error message below when we tried to fetch 9 million rows:
22/06/08 02:44:04 WARN hdfs.DataStreamer: Exception for BP-1037869773-172.18.105.90-1650524469800:blk_1073833296_92560
java.io.EOFException: Unexpected EOF while trying to read response from server
at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:552)
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
at org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1101)
22/06/08 02:44:04 WARN hdfs.DataStreamer: Error Recovery for BP-1037869773-172.18.105.90-1650524469800:blk_1073833296_92560 in pipeline [DatanodeInfoWithStorage[172.18.105.88:9866,DS-df61a542-f662-46db-9fc6-4c0b325e2e68,DISK], DatanodeInfoWithStorage[172.18.105.83:9866,DS-b781a5d9-5114-4807-9c91-0170578a8bb6,DISK], DatanodeInfoWithStorage[172.18.105.56:9866,DS-12f0be58-8862-4606-9b0c-0b3d6f77ce42,DISK]]: datanode 0(DatanodeInfoWithStorage[172.18.105.88:9866,DS-df61a542-f662-46db-9fc6-4c0b325e2e68,DISK]) is bad.
22/06/08 02:45:12 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on b7-38.lab.archivas.com:44696 in memory (size: 39.6 KB, free: 266.5 GB)
22/06/08 02:45:12 INFO spark.ContextCleaner: Cleaned accumulator 1
22/06/08 02:45:12 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on b7-38.lab.archivas.com:44696 in memory (size: 39.6 KB, free: 266.5 GB)
22/06/08 02:45:12 INFO spark.ContextCleaner: Cleaned accumulator 2
22/06/08 02:45:12 INFO spark.ContextCleaner: Cleaned accumulator 4
22/06/08 02:45:12 INFO spark.ContextCleaner: Cleaned accumulator 3
22/06/08 02:50:55 WARN hdfs.DataStreamer: Exception for BP-1037869773-172.18.105.90-1650524469800:blk_1073833296_92573
java.io.EOFException: Unexpected EOF while trying to read response from server
at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:552)
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
at org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1101)
22/06/08 02:50:55 WARN hdfs.DataStreamer: Error Recovery for BP-1037869773-172.18.105.90-1650524469800:blk_1073833296_92573 in pipeline [DatanodeInfoWithStorage[172.18.105.83:9866,DS-b781a5d9-5114-4807-9c91-0170578a8bb6,DISK], DatanodeInfoWithStorage[172.18.105.56:9866,DS-12f0be58-8862-4606-9b0c-0b3d6f77ce42,DISK], DatanodeInfoWithStorage[172.18.105.84:9866,DS-aaf0af68-8eaf-4bba-99f1-5d641ddfe726,DISK]]: datanode 0(DatanodeInfoWithStorage[172.18.105.83:9866,DS-b781a5d9-5114-4807-9c91-0170578a8bb6,DISK]) is bad.
Traceback (most recent call last):
File "/root/hivespark.py", line 29, in <module>
hive_context.sql("select * from 5mcsv CROSS JOIN 2mcsv").show(9000000)
File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 381, in show
File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o71.showString.
: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
at java.lang.StringBuilder.append(StringBuilder.java:141)
at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
at scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:364)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
Can anyone help me with the exact memory configuration for the available resources I listed, or tell me whether there are any additional parameters that need to be set?
Created 08-31-2022 10:48 PM
Hi @mmk
You have shared the following information:
7 nodes, each with 250 GB of memory and 32 vCPUs per node
spark-defaults.conf
spark.executor.memory = 100g
spark.executor.memoryOverhead = 49g
spark.driver.memoryOverhead = 200g
spark.driver.memory = 500g
You have a maximum of 250 GB per node, but you have specified 500 GB of driver memory plus a 200 GB overhead. How can the driver get 700 GB?
In general, driver/executor memory (including overhead) should not exceed the physical memory YARN has available on a node.
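As an illustration only (not values tuned to your workload), settings that stay inside a 250 GB / 32 vCPU node, assuming YARN is allowed to use roughly 230 GB per NodeManager, could look like this:
spark.driver.memory = 32g
spark.driver.memoryOverhead = 4g
spark.executor.memory = 32g
spark.executor.memoryOverhead = 4g
spark.executor.cores = 5
spark.executor.instances = 34
With five executors of about 36 GB each per node, the total is around 180 GB per node, which leaves headroom for the OS and other services; 34 instances is 7 nodes x 5 executors minus one slot for the driver/ApplicationMaster.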
Coming to the actual problem, avoid using show() to print 8,000,000 records: show() builds the entire formatted output as one string on the driver (you can see StringBuilder.append inside showString in your stack trace), which is what triggers "Requested array size exceeds VM limit". If you need to print all of the values, implement logic that fetches 1,000 records at a time and then the next 1,000 records in another iteration (pagination), as sketched after the link below.
https://stackoverflow.com/questions/29227949/how-to-implement-spark-sql-pagination-query
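A rough sketch of that pagination idea in PySpark follows. The table names come from your query; the page size, the "rn" column name, and ordering by a constant are placeholders to adapt, and a window with no partitioning pulls all rows into a single partition, so a real ordering key should be used in practice.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.sql("select * from 5mcsv CROSS JOIN 2mcsv")

# Number every row so the result can be read back one page at a time.
# Ordering by a real column is strongly preferred; lit(1) is only for illustration.
numbered = df.withColumn("rn", row_number().over(Window.orderBy(lit(1))))

page_size = 1000
total = numbered.count()
for start in range(1, total + 1, page_size):
    page = numbered.filter((col("rn") >= start) & (col("rn") < start + page_size))
    page.show(page_size, truncate=False)

Each iteration only materializes one page on the driver, so the single giant string that showString() was building never has to fit in memory.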