Resources we have in place:
7 nodes, each with 250 GB of memory
32 vCPUs per node
Configuration specified in spark-defaults.conf:
spark.executor.memory=100g
spark.executor.memoryOverhead=49g
spark.driver.memoryOverhead=200g
spark.driver.memory=500g
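For reference, a minimal sketch of how the session in hivespark.py is set up (the script's contents beyond line 29 are not shown above, so this reconstruction and the app name are assumptions; note that spark.driver.memory and spark.driver.memoryOverhead only take effect when set before the driver JVM launches, e.g. in spark-defaults.conf or on the spark-submit command line, not from inside the script):

from pyspark.sql import SparkSession

# Assumed reconstruction of /root/hivespark.py. The executor settings
# mirror our spark-defaults.conf; the driver settings are omitted here
# because they are ignored once the driver JVM is already running.
spark = (
    SparkSession.builder
    .appName("hivespark")
    .config("spark.executor.memory", "100g")
    .config("spark.executor.memoryOverhead", "49g")
    .enableHiveSupport()
    .getOrCreate()
)
hive_context = spark  # in Spark 2.x, SparkSession.sql() replaces the old HiveContext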
Query we tried to execute:
hive_context.sql("select * from 5mcsv CROSS JOIN 2mcsv").show(8000000)
We hit the error below when trying to fetch 8 million rows with the above query. However, fetching 7 million rows works without any issue.
Traceback (most recent call last):
File "/root/hivespark.py", line 29, in <module>
hive_context.sql("select * from 5mcsv CROSS JOIN 2mcsv").show(8000000)
File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 381, in show
File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o71.showString.
: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
at java.lang.StringBuilder.append(StringBuilder.java:141)
at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
at scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:364)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:357)
at scala.collection.AbstractTraversable.addString(Traversable.scala:104)
at org.apache.spark.sql.Dataset$$anonfun$showString$2.apply(Dataset.scala:330)
at org.apache.spark.sql.Dataset$$anonfun$showString$2.apply(Dataset.scala:330)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:330)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
We got the error message below when trying to fetch 9 million rows:
22/06/08 02:44:04 WARN hdfs.DataStreamer: Exception for BP-1037869773-172.18.105.90-1650524469800:blk_1073833296_92560
java.io.EOFException: Unexpected EOF while trying to read response from server
at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:552)
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
at org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1101)
22/06/08 02:44:04 WARN hdfs.DataStreamer: Error Recovery for BP-1037869773-172.18.105.90-1650524469800:blk_1073833296_92560 in pipeline [DatanodeInfoWithStorage[172.18.105.88:9866,DS-df61a542-f662-46db-9fc6-4c0b325e2e68,DISK], DatanodeInfoWithStorage[172.18.105.83:9866,DS-b781a5d9-5114-4807-9c91-0170578a8bb6,DISK], DatanodeInfoWithStorage[172.18.105.56:9866,DS-12f0be58-8862-4606-9b0c-0b3d6f77ce42,DISK]]: datanode 0(DatanodeInfoWithStorage[172.18.105.88:9866,DS-df61a542-f662-46db-9fc6-4c0b325e2e68,DISK]) is bad.
22/06/08 02:45:12 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on b7-38.lab.archivas.com:44696 in memory (size: 39.6 KB, free: 266.5 GB)
22/06/08 02:45:12 INFO spark.ContextCleaner: Cleaned accumulator 1
22/06/08 02:45:12 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on b7-38.lab.archivas.com:44696 in memory (size: 39.6 KB, free: 266.5 GB)
22/06/08 02:45:12 INFO spark.ContextCleaner: Cleaned accumulator 2
22/06/08 02:45:12 INFO spark.ContextCleaner: Cleaned accumulator 4
22/06/08 02:45:12 INFO spark.ContextCleaner: Cleaned accumulator 3
22/06/08 02:50:55 WARN hdfs.DataStreamer: Exception for BP-1037869773-172.18.105.90-1650524469800:blk_1073833296_92573
java.io.EOFException: Unexpected EOF while trying to read response from server
at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:552)
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
at org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1101)
22/06/08 02:50:55 WARN hdfs.DataStreamer: Error Recovery for BP-1037869773-172.18.105.90-1650524469800:blk_1073833296_92573 in pipeline [DatanodeInfoWithStorage[172.18.105.83:9866,DS-b781a5d9-5114-4807-9c91-0170578a8bb6,DISK], DatanodeInfoWithStorage[172.18.105.56:9866,DS-12f0be58-8862-4606-9b0c-0b3d6f77ce42,DISK], DatanodeInfoWithStorage[172.18.105.84:9866,DS-aaf0af68-8eaf-4bba-99f1-5d641ddfe726,DISK]]: datanode 0(DatanodeInfoWithStorage[172.18.105.83:9866,DS-b781a5d9-5114-4807-9c91-0170578a8bb6,DISK]) is bad.
Traceback (most recent call last):
File "/root/hivespark.py", line 29, in <module>
hive_context.sql("select * from 5mcsv CROSS JOIN 2mcsv").show(9000000)
File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 381, in show
File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o71.showString.
: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
at java.lang.StringBuilder.append(StringBuilder.java:141)
at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
at scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:364)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
Can anyone help with the exact memory configuration for the resources listed above, or with any additional parameters we should set?
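In the meantime, here is a rough sketch of what we are considering to avoid building the entire formatted table on the driver, since the trace shows the OOM inside Dataset.showString (the output path below is hypothetical):

# Sketch: avoid df.show(8000000), which formats every requested row
# into a single in-memory string on the driver.
df = hive_context.sql("select * from 5mcsv CROSS JOIN 2mcsv")

# Option 1: stream rows to the driver one at a time instead of all at once.
for i, row in enumerate(df.toLocalIterator()):
    if i >= 8000000:
        break
    # inspect row here ...

# Option 2: write the rows out and inspect them from HDFS
# (/tmp/crossjoin_sample is a hypothetical path).
df.limit(8000000).write.mode("overwrite").csv("/tmp/crossjoin_sample")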