
Spark HiveContext - Querying External Hive Table

avatar

Hi,

I am currently trying to use SparkSQL to query an external Hive table whose location points to a directory. When I attempt to do a SELECT * FROM TABLE, I get the following error:

15/11/30 15:25:01 INFO DefaultExecutionContext: Created broadcast 3 from broadcast at TableReader.scala:68

15/11/30 15:25:01 INFO FileInputFormat: Total input paths to process : 2

java.io.IOException: Not a file: hdfs://clster/data/raw/EDW/PROD/Prod_DB/test/20151124/2014

        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:320)

        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)

        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)

        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:218)

        at scala.Option.getOrElse(Option.scala:120)

        at org.apache.spark.rdd.RDD.partitions(RDD.scala:218)

        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)

        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)

        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:218)

        at scala.Option.getOrElse(Option.scala:120)

        at org.apache.spark.rdd.RDD.partitions(RDD.scala:218)

        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)

        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)

        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:218)

        at scala.Option.getOrElse(Option.scala:120)

        at org.apache.spark.rdd.RDD.partitions(RDD.scala:218)

        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)

        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)

1 ACCEPTED SOLUTION

avatar

Okay, we figured it out. I was talking about creating an external table in Hive and then using Spark SQL to query it. The external table had sub-directories (e.g. ParentDirectory/2015/01/data.txt) that Hive was easily able to traverse and query. However, SparkSQL (and Presto) weren't able to; SparkSQL would give the error mentioned above.

It wasn't until we properly defined the sub-directories as partitions in Hive (e.g. ParentDirectory/year=2015/month=01) and added them to the metastore (ALTER TABLE ... ADD PARTITION) that SparkSQL (and Presto) were finally able to query the table without issues.
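For illustration, a rough sketch of that fix run through sqlContext, as elsewhere in this thread (the table name, column, and paths are placeholders, not the real production DDL):

// Define the table with explicit partition columns instead of bare sub-directories.
sqlContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS parent_table (col_value STRING) PARTITIONED BY (year STRING, month STRING) LOCATION '/path/to/ParentDirectory'")

// Register each existing sub-directory as a partition in the metastore.
sqlContext.sql("ALTER TABLE parent_table ADD PARTITION (year='2015', month='01') LOCATION '/path/to/ParentDirectory/year=2015/month=01'")

// With the partitions registered, this no longer fails with "Not a file".
sqlContext.sql("SELECT * FROM parent_table").collect().foreach(println)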


8 REPLIES

avatar
Master Mentor

@Andrew Watson Is it a partitioned table?

avatar
Master Mentor

@Andrew Watson Do you see the following location?

hdfs dfs -ls /data/raw/EDW/PROD/Prod_DB/test/20151124/2014

avatar
Master Mentor

@Andrew Watson

I would love to reproduce this in my env if you can share the table DDL.

In the meantime, please see this thread and thread2

@Andrew Watson More details:

[hdfs@phdns02 ~]$ hdfs dfs -ls /tmp/bat

Found 1 items

-rw-r--r-- 1 hdfs hdfs 621765 2015-12-05 07:07 /tmp/bat/BattingPost.csv

[hdfs@phdns02 ~]$

scala> sqlContext.sql("CREATE external TABLE IF NOT EXISTS batting ( col_value STRING) location '/tmp/bat/'")

sqlContext.sql("select count(*) from batting").collect().foreach(println)

select * did work.

(Screenshot attached: 628-screen-shot-2015-12-05-at-101044-am.png)

avatar

@Neeraj Sabharwal See my comment below. If you want to reproduce it, create an external table that references a directory higher than the directory containing the data, don't specify partitions, and try running a query.

CREATE EXTERNAL TABLE TEST1 (COL1 STRING) LOCATION '/location/to/parentdirectory';

Put data in /location/to/parentdirectory/2015/01, then try to query.
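For completeness, a sketch of that reproduction from spark-shell (assuming the shell's built-in sqlContext is a HiveContext; the paths and table name are the placeholders above):

// Data lives two levels below the table location, e.g. after:
//   hdfs dfs -mkdir -p /location/to/parentdirectory/2015/01
//   hdfs dfs -put data.txt /location/to/parentdirectory/2015/01/
sqlContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS TEST1 (COL1 STRING) LOCATION '/location/to/parentdirectory'")

// Hive itself can traverse the nested directories, but this Spark SQL query
// fails in FileInputFormat.getSplits with "java.io.IOException: Not a file: ...".
sqlContext.sql("SELECT * FROM TEST1").collect().foreach(println)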

avatar

Note: In Spark (Scala), you can use reflection to infer the schema of an RDD of case classes, register it as a table, and then run SQL queries against it via the DataFrames API.

From the Spark example file RDDRelation.scala:

// Any RDD containing case classes can be registered as a table. The schema of the table is

// automatically inferred using scala reflection.

df.registerTempTable("records")

// Once tables have been registered, you can run SQL queries over them.

println("Result of SELECT *:")

sqlContext.sql("SELECT * FROM records").collect().foreach(println)


avatar
Master Mentor

Thanks @Andrew Watson. It was related to the setup.

avatar
New Contributor

This can be achieved by setting the following property in Spark:

sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")

Note that the property is set using sqlContext rather than sparkContext; it is the setting to use for DataFrames created from Hive tables. I tested this in Spark 1.6.2.

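Put together, a minimal sketch of that approach from spark-shell (assuming the built-in sqlContext is a HiveContext and reusing the placeholder table from the reproduction above):

// Make the underlying Hadoop FileInputFormat recurse into sub-directories
// instead of throwing "java.io.IOException: Not a file: ...".
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")

// The external table backed by nested directories can now be queried.
sqlContext.sql("SELECT COUNT(*) FROM TEST1").collect().foreach(println)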