Let's assume we have a Hive table named test whose data is stored under the /tmp directory. The table's data is laid out on HDFS as follows:

 

hdfs dfs -ls -R /tmp/test
drwxr-xr-x   - hive hive          0 2022-08-24 09:15 /tmp/test/dir1
-rw-r--r--   3 hive hive        685 2022-08-24 09:15 /tmp/test/dir1/000000_0
drwxr-xr-x   - hive hive          0 2022-08-24 09:15 /tmp/test/dir2
-rw-r--r--   3 hive hive        685 2022-08-24 09:15 /tmp/test/dir2/000000_0

 

Directory layouts like this are typically generated by UNION ALL operations in Hive, which can write the output of each branch of the union into its own subdirectory under the table location, as sketched below.
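
For illustration, a statement along the following lines can produce such a layout. The source tables test_a and test_b are hypothetical, and whether subdirectories are created, and how they are named, depends on the Hive version and execution engine:

hive> -- each branch of the UNION ALL may be written to its own subdirectory under /tmp/test
hive> INSERT OVERWRITE TABLE test
    > SELECT * FROM test_a
    > UNION ALL
    > SELECT * FROM test_b;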

 

If we try to read this Hive table's data using Spark, we will get the following exception:

 

scala> spark.sql("SELECT * FROM test").show()
java.io.IOException: Not a file: hdfs://localhost:8020/tmp/test/dir1
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:340)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
  at scala.Option.getOrElse(Option.scala:121)
 .....

 

By default, Spark does not read table data that is stored in subdirectories and fails with the above exception. To solve this issue, we need to set the following parameter:

 

spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
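
Depending on how the job is launched, the same Hadoop property can also be set on the SparkContext's Hadoop configuration, or passed at startup through the spark.hadoop. prefix. The spark-shell session below is a minimal sketch using the test table from the example above:

scala> // set the property directly on the underlying Hadoop configuration
scala> spark.sparkContext.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

scala> // the query should now read the files under /tmp/test/dir1 and /tmp/test/dir2
scala> spark.sql("SELECT * FROM test").show()

Alternatively, at launch time: spark-shell --conf spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive=true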

 

We can also get the same kind of exception while reading the table data through Hive itself. To solve this issue in Hive, we need to set the following two parameters:

 

hive> set mapred.input.dir.recursive=true;
hive> set hive.mapred.supports.subdirectories=true;
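
With both parameters set in the session, a query over the test table should pick up the files in both subdirectories, for example:

hive> SELECT * FROM test;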

 

We can also set the above two parameters permanently in hive-site.xml, as sketched below.
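
As a sketch, the corresponding entries inside the <configuration> element of hive-site.xml would look like this:

<property>
  <name>mapred.input.dir.recursive</name>
  <value>true</value>
</property>
<property>
  <name>hive.mapred.supports.subdirectories</name>
  <value>true</value>
</property>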
