Support Questions


Processing a 1 GB file with PySpark on my HDP cluster fails with java.lang.OutOfMemoryError: Java heap space

Expert Contributor
$ pyspark

>>> json_file = sqlContext.read.json(sc.wholeTextFiles('/user/admin/emp/*').values())

18/01/08 15:34:36 ERROR Utils: Uncaught exception in thread stdout writer for python2.7
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3236)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at org.spark_project.guava.io.ByteStreams.copy(ByteStreams.java:211)
    at org.spark_project.guava.io.ByteStreams.toByteArray(ByteStreams.java:252)
    at org.apache.spark.input.WholeTextFileRecordReader.nextKeyValue(WholeTextFileRecordReader.scala:79)
    at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:65)
    at org.apache.spark.rdd.NewHadoopRDD$anon$1.hasNext(NewHadoopRDD.scala:182)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
    at org.apache.spark.api.python.PythonRunner$WriterThread$anonfun$run$3.apply(PythonRDD.scala:328)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1877)
    at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Exception in thread "stdout writer for python2.7" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3236)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at org.spark_project.guava.io.ByteStreams.copy(ByteStreams.java:211)
    at org.spark_project.guava.io.ByteStreams.toByteArray(ByteStreams.java:252)
    at org.apache.spark.input.WholeTextFileRecordReader.nextKeyValue(WholeTextFileRecordReader.scala:79)
    at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:65)
    at org.apache.spark.rdd.NewHadoopRDD$anon$1.hasNext(NewHadoopRDD.scala:182)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
    at org.apache.spark.api.python.PythonRunner$WriterThread$anonfun$run$3.apply(PythonRDD.scala:328)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1877)
    at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)

2 REPLIES

Expert Contributor

This is a known limitation of wholeTextFiles as reported in https://issues.apache.org/jira/browse/SPARK-18965. Try using binaryFiles as suggested in https://issues.apache.org/jira/browse/SPARK-22225.
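A rough sketch of that binaryFiles approach is below. The path, the wildcard, and the UTF-8 decoding are carried over from the question as assumptions; each file is read as a single record of raw bytes, decoded to a string, and handed to the JSON reader as an RDD of JSON strings:

# Sketch only: binaryFiles reads each file as (path, bytes) instead of going
# through the ByteArrayOutputStream copy that wholeTextFiles makes per file.
raw = sc.binaryFiles('/user/admin/emp/*')   # RDD of (file path, file contents as bytes)

# Decode each file's bytes into one JSON string (UTF-8 is an assumption here).
json_strings = raw.values().map(lambda content: content.decode('utf-8'))

# read.json also accepts an RDD of JSON strings, so the DataFrame can be built
# the same way as with wholeTextFiles(...).values().
json_file = sqlContext.read.json(json_strings)
json_file.printSchema()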

Expert Contributor

Then how do I solve that issue and process the file? I also tried json_file = sqlContext.read.json('/user/admin/emp/empData.json'), but that does not work either; the same OutOfMemoryError comes up.