
Spark HiveContext - Querying External Hive Table

Solved

Hi,

I am currently trying to query, via Spark SQL, an external Hive table that points to a directory. When I attempt a SELECT * FROM the table, I get the following error:

15/11/30 15:25:01 INFO DefaultExecutionContext: Created broadcast 3 from broadcast at TableReader.scala:68
15/11/30 15:25:01 INFO FileInputFormat: Total input paths to process : 2
java.io.IOException: Not a file: hdfs://clster/data/raw/EDW/PROD/Prod_DB/test/20151124/2014
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:320)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:218)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:218)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:218)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:218)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:218)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:218)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)

1 ACCEPTED SOLUTION

Re: Spark HiveContext - Querying External Hive Table

Okay, we figured it out. I was creating an external table in Hive and then using Spark SQL to query it. The external table had sub-directories (e.g. ParentDirectory/2015/01/data.txt) that Hive was easily able to traverse and query. However, Spark SQL (and Presto) weren't able to; Spark SQL would give the error mentioned above.

It wasn't until we properly defined the sub-directories as partitions in Hive (e.g. ParentDirectory/year=2015/month=01) and added them to the metastore (ALTER TABLE ... ADD PARTITION) that Spark SQL (and Presto) were finally able to query the table without issues.
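A minimal sketch of that fix, run through a HiveContext. Table name, column, and paths here are illustrative placeholders, not the actual layout from this thread:

```scala
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Declare the partition columns instead of relying on bare sub-directories.
sqlContext.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS test_table (col_value STRING)
    |PARTITIONED BY (year STRING, month STRING)
    |LOCATION '/data/parentdirectory'""".stripMargin)

// Register each partition directory with the metastore so Spark SQL can find it.
sqlContext.sql(
  """ALTER TABLE test_table ADD IF NOT EXISTS
    |PARTITION (year='2015', month='01')
    |LOCATION '/data/parentdirectory/year=2015/month=01'""".stripMargin)

// Spark SQL now enumerates only the leaf partition directories, so
// FileInputFormat.getSplits no longer hits a directory entry.
sqlContext.sql("SELECT * FROM test_table").collect().foreach(println)
```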

8 REPLIES

Re: Spark HiveContext - Querying External Hive Table

@Andrew Watson Is it a partitioned table?

Re: Spark HiveContext - Querying External Hive Table

@Andrew Watson Do you see the following location?

hdfs dfs -ls /data/raw/EDW/PROD/Prod_DB/test/20151124/2014

Re: Spark HiveContext - Querying External Hive Table

@Andrew Watson

I would love to reproduce this in my environment if you can share the table DDL.

In the meantime, please see this thread and thread2.

@Andrew Watson Here are more details:

[hdfs@phdns02 ~]$ hdfs dfs -ls /tmp/bat
Found 1 items
-rw-r--r--   1 hdfs hdfs     621765 2015-12-05 07:07 /tmp/bat/BattingPost.csv
[hdfs@phdns02 ~]$

scala> sqlContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS batting (col_value STRING) LOCATION '/tmp/bat/'")
scala> sqlContext.sql("select count(*) from batting").collect().foreach(println)

select * did work.

(Screenshot attached.)

Re: Spark HiveContext - Querying External Hive Table

@Neeraj Sabharwal See my comment below. If you want to reproduce it: create an external table whose location is a directory above the directory that actually holds the data, don't specify any partitions, and then run a query.

CREATE EXTERNAL TABLE TEST1 (COL1 STRING) LOCATION '/location/to/parentdirectory';

Put data in /location/to/parentdirectory/2015/01, then try to query.

Re: Spark HiveContext - Querying External Hive Table

Note: In Spark (Scala), you can use reflection to infer a schema, register the result as a table, and then run SQL queries against it through the DataFrame API.

From Spark example file: RDDRelation.scala

// Any RDD containing case classes can be registered as a table. The schema of the table is
// automatically inferred using scala reflection.
df.registerTempTable("records")

// Once tables have been registered, you can run SQL queries over them.
println("Result of SELECT *:")
sqlContext.sql("SELECT * FROM records").collect().foreach(println)
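For completeness, a more self-contained sketch of that pattern against the Spark 1.x API; the case class, app name, and data here are illustrative, not from RDDRelation.scala itself:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Case class whose fields become the table's columns via Scala reflection.
case class Record(key: Int, value: String)

object RDDRelationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RDDRelationSketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Build a DataFrame from an RDD of case classes; the schema is inferred.
    val df = sc.parallelize(1 to 5).map(i => Record(i, s"val_$i")).toDF()
    df.registerTempTable("records")

    // Registered tables can be queried with plain SQL.
    sqlContext.sql("SELECT * FROM records WHERE key < 3")
      .collect().foreach(println)

    sc.stop()
  }
}
```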


Re: Spark HiveContext - Querying External Hive Table

Thanks @Andrew Watson. It was related to the setup.

Re: Spark HiveContext - Querying External Hive Table

This can be achieved by setting the following property in Spark:

sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")

Note that the property is set using sqlContext instead of sparkContext: do not set this on the Spark context, but on the sqlContext used for DataFrames created from Hive tables. I tested this in Spark 1.6.2.
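A short sketch of how that workaround would be applied before querying; the table name is hypothetical:

```scala
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

// Tell the underlying Hadoop InputFormat to descend into sub-directories
// instead of failing with "Not a file" on directory entries.
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")

// Hypothetical Hive table whose LOCATION contains nested sub-directories.
sqlContext.sql("SELECT COUNT(*) FROM test1").collect().foreach(println)
```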
