Created 12-04-2015 06:45 PM
Hi,
I am currently trying to use SparkSQL to query an external Hive table that points to a directory. When I attempt a SELECT * on the table, I get the following error:
15/11/30 15:25:01 INFO DefaultExecutionContext: Created broadcast 3 from broadcast at TableReader.scala:68
15/11/30 15:25:01 INFO FileInputFormat: Total input paths to process : 2
java.io.IOException: Not a file: hdfs://clster/data/raw/EDW/PROD/Prod_DB/test/20151124/2014
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:320)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:218)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:218)
    at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:218)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:218)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:218)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:218)
    at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
Created 12-04-2015 08:34 PM
@Andrew Watson Is it a partitioned table?
Created 12-05-2015 03:12 PM
@Andrew Watson Do you see the following location?
hdfs dfs -ls /data/raw/EDW/PROD/Prod_DB/test/20151124/2014
Created on 12-04-2015 08:36 PM - edited 08-19-2019 05:43 AM
I would love to reproduce this in my environment if you can share the table DDL.
In the meantime, please see this thread and thread2.
@Andrew Watson More details:
[hdfs@phdns02 ~]$ hdfs dfs -ls /tmp/bat
Found 1 items
-rw-r--r-- 1 hdfs hdfs 621765 2015-12-05 07:07 /tmp/bat/BattingPost.csv
[hdfs@phdns02 ~]$
scala> sqlContext.sql("CREATE external TABLE IF NOT EXISTS batting ( col_value STRING) location '/tmp/bat/'")
sqlContext.sql("select count(*) from batting").collect().foreach(println)
select * did work.
Created 12-08-2015 03:04 PM
@Neeraj Sabharwal See my comment below. If you want to reproduce it, create an external table that references a directory higher than the directory that actually contains the data, don't specify partitions, and try running a query.
CREATE EXTERNAL TABLE TEST1 (COL1 STRING) LOCATION '/location/to/parentdirectory';
Put data in /location/to/parentdirectory/2015/01 and then try to query the table.
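For example (same paths as in the DDL above; the data file name here is made up for illustration):
hdfs dfs -mkdir -p /location/to/parentdirectory/2015/01
hdfs dfs -put data.txt /location/to/parentdirectory/2015/01/
A SELECT * against TEST1 from SparkSQL then fails with the "Not a file" IOException shown in the original post, even though Hive itself can read the table.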
Created 12-07-2015 11:38 PM
Note: In Spark with Scala, you can use reflection to infer the schema of an RDD, register it as a table, and then run SQL queries against it through the DataFrames API.
From Spark example file: RDDRelation.scala
// Any RDD containing case classes can be registered as a table. The schema of the table is
// automatically inferred using scala reflection.
df.registerTempTable("records")
// Once tables have been registered, you can run SQL queries over them.
println("Result of SELECT *:")
sqlContext.sql("SELECT * FROM records").collect().foreach(println)
Created 12-08-2015 03:01 PM
Okay, we figured it out. I was talking about creating an external table in Hive and then using Spark SQL to query it. The external table had sub-directories (e.g. ParentDirectory/2015/01/data.txt) that Hive was easily able to traverse and query. However, SparkSQL (and Presto) weren't able to; SparkSQL would give the error mentioned above.
It wasn't until we properly defined the sub-directories as partitions in Hive (e.g. ParentDirectory/year=2015/month=01) and added them to the metastore (alter table ... add partition) that SparkSQL (and Presto) were finally able to query the table without issues.
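For anyone hitting the same thing, the layout and Hive DDL that ended up working looked roughly like this (table, column, and partition names are illustrative, not our real ones):
-- illustrative names only
CREATE EXTERNAL TABLE test1 (col1 STRING)
PARTITIONED BY (year STRING, month STRING)
LOCATION '/location/to/parentdirectory';
ALTER TABLE test1 ADD PARTITION (year='2015', month='01')
LOCATION '/location/to/parentdirectory/year=2015/month=01';
Once the partitions are registered in the metastore, SparkSQL only lists the leaf directories the metastore points to, so it no longer trips over sub-directories when computing input splits.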
Created 12-08-2015 03:40 PM
Thanks @Andrew Watson. It was related to the setup.
Created 10-16-2017 06:43 AM
This can be achieved by setting the following property in Spark:
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true")
Note that the property is set using sqlContext instead of sparkContext; use sqlContext for DataFrames created out of Hive tables. I tested this in Spark 1.6.2.
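In the spark-shell it looks something like this (reusing the TEST1 table from the earlier example):
scala> sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")
scala> sqlContext.sql("SELECT COUNT(*) FROM TEST1").collect().foreach(println)
With the property set, the input format recurses into the sub-directories instead of failing with the "Not a file" IOException.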