Generally, this kind of directory layout (a table location that contains subdirectories instead of plain files) is generated by UNION ALL operations in Hive, as sketched below.
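For example, when Hive runs on Tez, an INSERT whose SELECT contains a UNION ALL typically writes each branch of the union into its own subdirectory under the table location. The following is only a sketch; the table and column names are hypothetical:

-- Run in Hive (on Tez). Each branch of the UNION ALL is written to
-- its own subdirectory under the table location (e.g. /tmp/test/1,
-- /tmp/test/2) rather than as plain files directly under /tmp/test.
INSERT OVERWRITE TABLE test
SELECT id, name FROM source_a
UNION ALL
SELECT id, name FROM source_b;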
If we then try to load this Hive table's data using Spark, we get the following exception:
scala> spark.sql("SELECT * FROM test").show()
java.io.IOException: Not a file: hdfs://localhost:8020/tmp/test/dir1
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:340)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
at scala.Option.getOrElse(Option.scala:121)
.....
By default, Spark will not recurse into subdirectories while reading the table data, so split computation fails as soon as it encounters a directory entry such as /tmp/test/dir1. To solve this issue, we need to set the following parameter:
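A minimal spark-shell sketch, assuming the parameter in question is the standard Hadoop recursive-listing setting mapreduce.input.fileinputformat.input.dir.recursive:

scala> // Enable recursive listing on the Hadoop configuration
scala> spark.sparkContext.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

scala> spark.sql("SELECT * FROM test").show()

With this setting enabled, FileInputFormat.getSplits descends into subdirectories such as /tmp/test/dir1 instead of failing on them, so the query reads the files in every subdirectory of the table location.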