Let's assume we have a Hive table named test, stored under the /tmp directory. The data in the test table is laid out as follows:
hdfs dfs -ls -R /tmp/test
drwxr-xr-x - hive hive 0 2022-08-24 09:15 /tmp/test/dir1
-rw-r--r-- 3 hive hive 685 2022-08-24 09:15 /tmp/test/dir1/000000_0
drwxr-xr-x - hive hive 0 2022-08-24 09:15 /tmp/test/dir2
-rw-r--r-- 3 hive hive 685 2022-08-24 09:15 /tmp/test/dir2/000000_0
A layout like this is typically produced by UNION ALL operations in Hive, where each branch of the union writes its output into its own subdirectory under the table directory.
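For illustration, here is a hypothetical sequence of Hive statements (the source tables src1 and src2 and their columns are invented for this sketch) that can produce such a layout. The exact subdirectory names depend on the Hive version and execution engine (on Tez they look like HIVE_UNION_SUBDIR_1); dir1 and dir2 above stand in for them:

hive> CREATE TABLE test (id INT, name STRING) LOCATION '/tmp/test';
hive> INSERT OVERWRITE TABLE test
      SELECT id, name FROM src1
      UNION ALL
      SELECT id, name FROM src2;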
If we try to load the Hive table's data using Spark, we get the following exception:
scala> spark.sql("SELECT * FROM test").show()
java.io.IOException: Not a file: hdfs://localhost:8020/tmp/test/dir1
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:340)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
at scala.Option.getOrElse(Option.scala:121)
.....
By default, Spark does not recurse into subdirectories when reading a table's data, so the scan fails as soon as it encounters a directory where it expects a file. To solve this issue, we need to set the following parameter:
spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
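Once this property is set in the session, the same query succeeds; a minimal sketch, assuming the test table from above:

scala> spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
scala> spark.sql("SELECT * FROM test").show()

The property can also be supplied at launch time, for example spark-shell --conf spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive=true; the spark.hadoop. prefix forwards the setting into the Hadoop configuration that Spark uses for the file scan.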
The same kind of exception can also occur when reading this data in Hive. To solve the issue in Hive, we need to set the following two parameters:
hive> set mapred.input.dir.recursive=true;
hive> set hive.mapred.supports.subdirectories=true;
We can also set the above two parameters permanently in hive-site.xml, for example:
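<property>
  <name>mapred.input.dir.recursive</name>
  <value>true</value>
</property>
<property>
  <name>hive.mapred.supports.subdirectories</name>
  <value>true</value>
</property>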