
Spark HiveContext - Querying External Hive Table


Hi,

I am currently trying to query, via Spark SQL, an external Hive table that points to a directory. When I attempt a SELECT * FROM the table, I get the following error:

15/11/30 15:25:01 INFO DefaultExecutionContext: Created broadcast 3 from broadcast at TableReader.scala:68
15/11/30 15:25:01 INFO FileInputFormat: Total input paths to process : 2
java.io.IOException: Not a file: hdfs://clster/data/raw/EDW/PROD/Prod_DB/test/20151124/2014
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:320)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:218)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:218)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:218)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:218)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:218)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:218)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)

8 REPLIES

Master Mentor

@Andrew Watson Is it a partitioned table?

Master Mentor

@Andrew Watson Do you see the following location?

hdfs dfs -ls /data/raw/EDW/PROD/Prod_DB/test/20151124/2014

Master Mentor

@Andrew Watson

I would love to reproduce this in my environment if you can share the table DDL.

In the meantime, please see this thread and thread2

@Andrew Watson Here are more details:

[hdfs@phdns02 ~]$ hdfs dfs -ls /tmp/bat

Found 1 items

-rw-r--r-- 1 hdfs hdfs 621765 2015-12-05 07:07 /tmp/bat/BattingPost.csv

[hdfs@phdns02 ~]$

scala> sqlContext.sql("CREATE external TABLE IF NOT EXISTS batting ( col_value STRING) location '/tmp/bat/'")

sqlContext.sql("select count(*) from batting").collect().foreach(println)

SELECT * worked as well.



@Neeraj Sabharwal see my comment below. If you want to reproduce it: create an external table that references a directory above the directory that actually contains the data, don't specify any partitions, and try running a query.

CREATE EXTERNAL TABLE TEST1 (COL1 STRING) LOCATION '/location/to/parentdirectory';

Put data in /location/to/parentdirectory/2015/01

Then try to query the table.
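
Putting those steps together, here is a minimal spark-shell sketch of the reproduction; the table name, paths, and data file are illustrative assumptions, not the original setup:

// Repro sketch for a Spark 1.x HiveContext session; names and paths are illustrative.
// Shell prep: put a data file one level below the table location, e.g.
//   hdfs dfs -mkdir -p /location/to/parentdirectory/2015/01
//   hdfs dfs -put data.csv /location/to/parentdirectory/2015/01/

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

// External, non-partitioned table pointing at the parent directory
sqlContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS TEST1 (COL1 STRING) LOCATION '/location/to/parentdirectory'")

// Fails with java.io.IOException: Not a file: .../2015, because FileInputFormat
// finds the nested subdirectory where it expects plain data files
sqlContext.sql("SELECT * FROM TEST1").collect().foreach(println)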


Note: In Spark (Scala), you can use reflection to infer the schema of an RDD of case classes, register it as a table, and then run SQL commands against it through the DataFrames API.

From the Spark example file RDDRelation.scala:

// Any RDD containing case classes can be registered as a table. The schema of the table is

// automatically inferred using scala reflection.

df.registerTempTable("records")

// Once tables have been registered, you can run SQL queries over them.

println("Result of SELECT *:")

sqlContext.sql("SELECT * FROM records").collect().foreach(println)


Master Mentor

Thanks @Andrew Watson. It was related to the setup.

New Contributor

This can be achieved by setting the following property in Spark:

sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")

Note that the property is set via sqlContext rather than sparkContext; use the SQLContext for DataFrames created from Hive tables. I tested this in Spark 1.6.2.

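For reference, a minimal sketch showing where that setting fits in a spark-shell session; the table name TEST1 is an illustrative assumption:

// Sketch for a Spark 1.6.x spark-shell session; TEST1 is illustrative.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Set the property on the SQLContext (not on the SparkContext) so nested
// subdirectories under the table location are read recursively
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")

// The query over the external table with nested subdirectories now succeeds
sqlContext.sql("SELECT * FROM TEST1").collect().foreach(println)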