<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Unable to fetch more data like more than 7 million rows using Pyspark from the hive tables in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Unable-to-fetch-more-data-like-more-than-7-million-rows/m-p/345187#M234409</link>
    <description>&lt;P&gt;&lt;STRONG&gt;Resources we have in place:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;7 nodes, each with 250 GB of memory&lt;/P&gt;&lt;P&gt;32 vCPUs per node&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Configuration specified in spark-defaults.conf:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;spark.executor.memory = 100g&lt;BR /&gt;spark.executor.memoryOverhead = 49g&lt;BR /&gt;spark.driver.memoryOverhead = 200g&lt;BR /&gt;spark.driver.memory = 500g&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Query we tried to execute:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;hive_context.sql("select * from 5mcsv CROSS JOIN 2mcsv").show(8000000)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We hit the error below when trying to fetch 8 million rows with the query above. Fetching 7 million rows works without any issue.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Traceback (most recent call last):
  File "/root/hivespark.py", line 29, in &amp;lt;module&amp;gt;
    hive_context.sql("select * from 5mcsv CROSS JOIN 2mcsv").show(8000000)
  File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 381, in show
  File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o71.showString.
: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
        at java.lang.StringBuilder.append(StringBuilder.java:141)
        at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
        at scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:364)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:357)
        at scala.collection.AbstractTraversable.addString(Traversable.scala:104)
        at org.apache.spark.sql.Dataset$$anonfun$showString$2.apply(Dataset.scala:330)
        at org.apache.spark.sql.Dataset$$anonfun$showString$2.apply(Dataset.scala:330)
        at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
        at org.apache.spark.sql.Dataset.showString(Dataset.scala:330)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:750)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We got the error below when we tried to fetch 9 million rows:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;22/06/08 02:44:04 WARN hdfs.DataStreamer: Exception for BP-1037869773-172.18.105.90-1650524469800:blk_1073833296_92560
java.io.EOFException: Unexpected EOF while trying to read response from server
        at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:552)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
        at org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1101)
22/06/08 02:44:04 WARN hdfs.DataStreamer: Error Recovery for BP-1037869773-172.18.105.90-1650524469800:blk_1073833296_92560 in pipeline [DatanodeInfoWithStorage[172.18.105.88:9866,DS-df61a542-f662-46db-9fc6-4c0b325e2e68,DISK], DatanodeInfoWithStorage[172.18.105.83:9866,DS-b781a5d9-5114-4807-9c91-0170578a8bb6,DISK], DatanodeInfoWithStorage[172.18.105.56:9866,DS-12f0be58-8862-4606-9b0c-0b3d6f77ce42,DISK]]: datanode 0(DatanodeInfoWithStorage[172.18.105.88:9866,DS-df61a542-f662-46db-9fc6-4c0b325e2e68,DISK]) is bad.
22/06/08 02:45:12 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on b7-38.lab.archivas.com:44696 in memory (size: 39.6 KB, free: 266.5 GB)
22/06/08 02:45:12 INFO spark.ContextCleaner: Cleaned accumulator 1
22/06/08 02:45:12 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on b7-38.lab.archivas.com:44696 in memory (size: 39.6 KB, free: 266.5 GB)
22/06/08 02:45:12 INFO spark.ContextCleaner: Cleaned accumulator 2
22/06/08 02:45:12 INFO spark.ContextCleaner: Cleaned accumulator 4
22/06/08 02:45:12 INFO spark.ContextCleaner: Cleaned accumulator 3
22/06/08 02:50:55 WARN hdfs.DataStreamer: Exception for BP-1037869773-172.18.105.90-1650524469800:blk_1073833296_92573
java.io.EOFException: Unexpected EOF while trying to read response from server
        at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:552)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
        at org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1101)
22/06/08 02:50:55 WARN hdfs.DataStreamer: Error Recovery for BP-1037869773-172.18.105.90-1650524469800:blk_1073833296_92573 in pipeline [DatanodeInfoWithStorage[172.18.105.83:9866,DS-b781a5d9-5114-4807-9c91-0170578a8bb6,DISK], DatanodeInfoWithStorage[172.18.105.56:9866,DS-12f0be58-8862-4606-9b0c-0b3d6f77ce42,DISK], DatanodeInfoWithStorage[172.18.105.84:9866,DS-aaf0af68-8eaf-4bba-99f1-5d641ddfe726,DISK]]: datanode 0(DatanodeInfoWithStorage[172.18.105.83:9866,DS-b781a5d9-5114-4807-9c91-0170578a8bb6,DISK]) is bad.
Traceback (most recent call last):
  File "/root/hivespark.py", line 29, in &amp;lt;module&amp;gt;
    hive_context.sql("select * from 5mcsv CROSS JOIN 2mcsv").show(9000000)
  File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 381, in show
  File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o71.showString.
: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
        at java.lang.StringBuilder.append(StringBuilder.java:141)
        at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
        at scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:364)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Can anyone suggest the right memory configuration for the resources listed above, or tell me whether any additional parameters need to be set?&lt;/P&gt;</description>
    <pubDate>Wed, 08 Jun 2022 07:26:52 GMT</pubDate>
    <dc:creator>mmk</dc:creator>
    <dc:date>2022-06-08T07:26:52Z</dc:date>
    <item>
      <title>Unable to fetch more data like more than 7 million rows using Pyspark from the hive tables</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Unable-to-fetch-more-data-like-more-than-7-million-rows/m-p/345187#M234409</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Resources we have in place:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;7 nodes, each with 250 GB of memory&lt;/P&gt;&lt;P&gt;32 vCPUs per node&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Configuration specified in spark-defaults.conf:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;spark.executor.memory = 100g&lt;BR /&gt;spark.executor.memoryOverhead = 49g&lt;BR /&gt;spark.driver.memoryOverhead = 200g&lt;BR /&gt;spark.driver.memory = 500g&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Query we tried to execute:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;hive_context.sql("select * from 5mcsv CROSS JOIN 2mcsv").show(8000000)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We hit the error below when trying to fetch 8 million rows with the query above. Fetching 7 million rows works without any issue.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Traceback (most recent call last):
  File "/root/hivespark.py", line 29, in &amp;lt;module&amp;gt;
    hive_context.sql("select * from 5mcsv CROSS JOIN 2mcsv").show(8000000)
  File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 381, in show
  File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o71.showString.
: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
        at java.lang.StringBuilder.append(StringBuilder.java:141)
        at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
        at scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:364)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:357)
        at scala.collection.AbstractTraversable.addString(Traversable.scala:104)
        at org.apache.spark.sql.Dataset$$anonfun$showString$2.apply(Dataset.scala:330)
        at org.apache.spark.sql.Dataset$$anonfun$showString$2.apply(Dataset.scala:330)
        at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
        at org.apache.spark.sql.Dataset.showString(Dataset.scala:330)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:750)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We got the error below when we tried to fetch 9 million rows:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;22/06/08 02:44:04 WARN hdfs.DataStreamer: Exception for BP-1037869773-172.18.105.90-1650524469800:blk_1073833296_92560
java.io.EOFException: Unexpected EOF while trying to read response from server
        at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:552)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
        at org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1101)
22/06/08 02:44:04 WARN hdfs.DataStreamer: Error Recovery for BP-1037869773-172.18.105.90-1650524469800:blk_1073833296_92560 in pipeline [DatanodeInfoWithStorage[172.18.105.88:9866,DS-df61a542-f662-46db-9fc6-4c0b325e2e68,DISK], DatanodeInfoWithStorage[172.18.105.83:9866,DS-b781a5d9-5114-4807-9c91-0170578a8bb6,DISK], DatanodeInfoWithStorage[172.18.105.56:9866,DS-12f0be58-8862-4606-9b0c-0b3d6f77ce42,DISK]]: datanode 0(DatanodeInfoWithStorage[172.18.105.88:9866,DS-df61a542-f662-46db-9fc6-4c0b325e2e68,DISK]) is bad.
22/06/08 02:45:12 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on b7-38.lab.archivas.com:44696 in memory (size: 39.6 KB, free: 266.5 GB)
22/06/08 02:45:12 INFO spark.ContextCleaner: Cleaned accumulator 1
22/06/08 02:45:12 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on b7-38.lab.archivas.com:44696 in memory (size: 39.6 KB, free: 266.5 GB)
22/06/08 02:45:12 INFO spark.ContextCleaner: Cleaned accumulator 2
22/06/08 02:45:12 INFO spark.ContextCleaner: Cleaned accumulator 4
22/06/08 02:45:12 INFO spark.ContextCleaner: Cleaned accumulator 3
22/06/08 02:50:55 WARN hdfs.DataStreamer: Exception for BP-1037869773-172.18.105.90-1650524469800:blk_1073833296_92573
java.io.EOFException: Unexpected EOF while trying to read response from server
        at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:552)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
        at org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1101)
22/06/08 02:50:55 WARN hdfs.DataStreamer: Error Recovery for BP-1037869773-172.18.105.90-1650524469800:blk_1073833296_92573 in pipeline [DatanodeInfoWithStorage[172.18.105.83:9866,DS-b781a5d9-5114-4807-9c91-0170578a8bb6,DISK], DatanodeInfoWithStorage[172.18.105.56:9866,DS-12f0be58-8862-4606-9b0c-0b3d6f77ce42,DISK], DatanodeInfoWithStorage[172.18.105.84:9866,DS-aaf0af68-8eaf-4bba-99f1-5d641ddfe726,DISK]]: datanode 0(DatanodeInfoWithStorage[172.18.105.83:9866,DS-b781a5d9-5114-4807-9c91-0170578a8bb6,DISK]) is bad.
Traceback (most recent call last):
  File "/root/hivespark.py", line 29, in &amp;lt;module&amp;gt;
    hive_context.sql("select * from 5mcsv CROSS JOIN 2mcsv").show(9000000)
  File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 381, in show
  File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o71.showString.
: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
        at java.lang.StringBuilder.append(StringBuilder.java:141)
        at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
        at scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:364)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Can anyone suggest the right memory configuration for the resources listed above, or tell me whether any additional parameters need to be set?&lt;/P&gt;</description>
      <pubDate>Wed, 08 Jun 2022 07:26:52 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Unable-to-fetch-more-data-like-more-than-7-million-rows/m-p/345187#M234409</guid>
      <dc:creator>mmk</dc:creator>
      <dc:date>2022-06-08T07:26:52Z</dc:date>
    </item>
    <item>
      <title>Re: Unable to fetch more data like more than 7 million rows using Pyspark from the hive tables</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Unable-to-fetch-more-data-like-more-than-7-million-rows/m-p/351292#M236215</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/97792"&gt;@mmk&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You have shared the following information:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;7 nodes, each with 250 GB of memory and 32 vCPUs&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;spark-defaults.conf&amp;nbsp;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;spark.executor.memory = 100g&lt;BR /&gt;spark.executor.memoryOverhead = 49g&lt;BR /&gt;spark.driver.memoryOverhead = 200g&lt;BR /&gt;spark.driver.memory = 500g&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Each node has at most 250 GB of memory, yet you have requested 500 GB of driver memory plus 200 GB of overhead. How could the driver possibly get 700 GB?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Generally, you should not set the &lt;STRONG&gt;driver/executor&lt;/STRONG&gt; memory beyond the physical memory available to YARN.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Coming to the actual problem: please avoid using show() to print&amp;nbsp;&lt;SPAN&gt;8000000 records. If you need to print all the values, implement logic that fetches 1000 records at a time and moves on to the next 1000 records in each subsequent iteration.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;A href="https://stackoverflow.com/questions/29227949/how-to-implement-spark-sql-pagination-query" target="_blank"&gt;https://stackoverflow.com/questions/29227949/how-to-implement-spark-sql-pagination-query&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
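The batching suggestion in the reply above could look roughly like the sketch below. It is only an illustration under stated assumptions, not code from the thread: iter_batches and print_in_batches are hypothetical helper names, and the batch size of 1000 is taken from the reply. It relies on the PySpark DataFrame method toLocalIterator(), which streams rows to the driver one partition at a time, so no single multi-million-row string is built the way show(8000000) builds one.

```python
from itertools import islice


def iter_batches(rows, batch_size=1000):
    """Yield lists of up to batch_size items from any iterable of rows.

    Hypothetical helper: it is plain Python and knows nothing about Spark.
    """
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def print_in_batches(df, batch_size=1000):
    """Print every row of df, holding one batch in Python at a time.

    Assumption: df is a PySpark DataFrame. toLocalIterator() pulls rows
    to the driver partition by partition instead of all at once.
    """
    for batch in iter_batches(df.toLocalIterator(), batch_size):
        for row in batch:
            print(row)
```

With the query from the question, a call might be print_in_batches(hive_context.sql("select * from 5mcsv CROSS JOIN 2mcsv").limit(8000000)). If the rows are needed for later processing rather than for reading on a console, writing the DataFrame out to a table or to files is likely a better fit than printing.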
      <pubDate>Thu, 01 Sep 2022 05:48:46 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Unable-to-fetch-more-data-like-more-than-7-million-rows/m-p/351292#M236215</guid>
      <dc:creator>RangaReddy</dc:creator>
      <dc:date>2022-09-01T05:48:46Z</dc:date>
    </item>
  </channel>
</rss>

