Generally, this kind of directory layout (a table location that contains subdirectories instead of plain files) is generated by UNION ALL operations in Hive, as sketched below.
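For example, when Hive runs on Tez, an INSERT whose SELECT contains a UNION ALL typically writes each branch of the union into its own subdirectory under the table location. The following is only a sketch; the table and column names are hypothetical:

-- Run in Hive (on Tez). Each branch of the UNION ALL is written to
-- its own subdirectory under the table location (e.g. /tmp/test/1,
-- /tmp/test/2) rather than as plain files directly under /tmp/test.
INSERT OVERWRITE TABLE test
SELECT id, name FROM source_a
UNION ALL
SELECT id, name FROM source_b;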
If we then try to load this Hive table's data using Spark, we get the following exception:
scala> spark.sql("SELECT * FROM test").show()
java.io.IOException: Not a file: hdfs://localhost:8020/tmp/test/dir1
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:340)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
at scala.Option.getOrElse(Option.scala:121)
.....
By default, Spark will not recurse into subdirectories while reading the table data, so split computation fails as soon as it encounters a directory entry such as /tmp/test/dir1. To solve this issue, we need to set the following parameter:
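A minimal spark-shell sketch, assuming the parameter in question is the standard Hadoop recursive-listing setting mapreduce.input.fileinputformat.input.dir.recursive:

scala> // Enable recursive listing on the Hadoop configuration
scala> spark.sparkContext.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

scala> spark.sql("SELECT * FROM test").show()

With this setting enabled, FileInputFormat.getSplits descends into subdirectories such as /tmp/test/dir1 instead of failing on them, so the query reads the files in every subdirectory of the table location.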