BlockReaderFactory error reading data from HDFS in PySpark

New Contributor

I have some JSON data stored in HDFS, partitioned into multiple part files. When I try to load this data into a DataFrame using PySpark, I get the following error:

WARN BlockReaderFactory: BlockReaderFactory(fileName=/user/admin/lukas/sensordata/sensordata20170907.json/part-r-00129-3e3f35d1-1569-4dea-9394-b596a5e9dbd8, block=BP-411713710-192.168.128.16-1500637232474:blk_1073846150_114130): error creating ShortCircuitReplica.
java.io.IOException: Illegal seek
at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
at sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:52)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:220)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:741)
at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:727)
at org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.preadHeader(BlockMetadataHeader.java:124)
at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitReplica.<init>(ShortCircuitReplica.java:126)
at org.apache.hadoop.hdfs.BlockReaderFactory.requestFileDescriptors(BlockReaderFactory.java:556)
at org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:488)
at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:784)
at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:718)
at org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:422)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:333)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:662)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:898)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:955)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:62)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:94)
at org.apache.hadoop.mapred.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:208)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:246)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:254)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:208)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.fold(TraversableOnce.scala:212)
at scala.collection.AbstractIterator.fold(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$20.apply(RDD.scala:1063)
at org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$20.apply(RDD.scala:1063)
at org.apache.spark.SparkContext$$anonfun$32.apply(SparkContext.scala:1935)
at org.apache.spark.SparkContext$$anonfun$32.apply(SparkContext.scala:1935)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Does anybody have an idea what this means?
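
For reference, the post does not show the exact load call. A minimal PySpark sketch of the kind of read that exercises this code path might look like the following; the HDFS path is taken from the warning above, while the SparkSession setup and app name are only assumptions:

from pyspark.sql import SparkSession

# Minimal sketch -- the app name and session setup are assumed, not taken from the post.
spark = SparkSession.builder.appName("sensordata-load").getOrCreate()

# spark.read.json() accepts a directory and reads all part-r-* files under it
# into a single DataFrame; the path matches the fileName in the warning above.
df = spark.read.json("/user/admin/lukas/sensordata/sensordata20170907.json")

# Any action (schema inference already scans the files) can surface the warning.
df.printSchema()
print(df.count())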
