<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Spark HiveContext - Querying External Hive Table in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/98000#M61150</link>
    <description>&lt;P&gt;Thanks &lt;A rel="user" href="https://community.cloudera.com/users/200/awatson.html" nodeid="200"&gt;@Andrew Watson&lt;/A&gt;. It was related to the setup.&lt;/P&gt;</description>
    <pubDate>Tue, 08 Dec 2015 23:40:10 GMT</pubDate>
    <dc:creator>nsabharwal</dc:creator>
    <dc:date>2015-12-08T23:40:10Z</dc:date>
    <item>
      <title>Spark HiveContext - Querying External Hive Table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97993#M61143</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I am currently trying to use SparkSQL to query an external Hive table that points to a directory. When I attempt a SELECT * FROM TABLE, I get the following error:&lt;/P&gt;&lt;PRE&gt;15/11/30 15:25:01 INFO DefaultExecutionContext: Created broadcast 3 from broadcast at TableReader.scala:68
15/11/30 15:25:01 INFO FileInputFormat: Total input paths to process : 2
java.io.IOException: Not a file: hdfs://clster/data/raw/EDW/PROD/Prod_DB/test/20151124/2014
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:320)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:218)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:218)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:218)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:218)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:218)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:218)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
&lt;/PRE&gt;</description>
      <pubDate>Sat, 05 Dec 2015 02:45:48 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97993#M61143</guid>
      <dc:creator>awatson</dc:creator>
      <dc:date>2015-12-05T02:45:48Z</dc:date>
    </item>
    <item>
      <title>Re: Spark HiveContext - Querying External Hive Table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97994#M61144</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/200/awatson.html" nodeid="200"&gt;@Andrew Watson&lt;/A&gt; Is it a partitioned table?&lt;/P&gt;</description>
      <pubDate>Sat, 05 Dec 2015 04:34:48 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97994#M61144</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2015-12-05T04:34:48Z</dc:date>
    </item>
    <item>
      <title>Re: Spark HiveContext - Querying External Hive Table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97995#M61145</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/200/awatson.html" nodeid="200" target="_blank"&gt;@Andrew Watson&lt;/A&gt; I would love to reproduce this in my environment if you can share the table DDL.&lt;/P&gt;&lt;P&gt;In the meantime, please see this &lt;A target="_blank" href="http://stackoverflow.com/questions/32921510/how-to-avoid-not-a-file-exceptions-when-reading-from-hdfs-with-spark" rel="nofollow noopener noreferrer"&gt;thread&lt;/A&gt; and &lt;A target="_blank" href="http://stackoverflow.com/questions/28817940/recursively-fetch-file-contents-from-subdirectories-using-sc-textfile" rel="nofollow noopener noreferrer"&gt;thread2&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/200/awatson.html" nodeid="200" target="_blank"&gt;@Andrew Watson&lt;/A&gt; more details:&lt;/P&gt;&lt;PRE&gt;[hdfs@phdns02 ~]$ hdfs dfs -ls /tmp/bat
Found 1 items
-rw-r--r--   1 hdfs hdfs     621765 2015-12-05 07:07 /tmp/bat/BattingPost.csv
&lt;/PRE&gt;&lt;PRE&gt;scala&amp;gt; sqlContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS batting (col_value STRING) LOCATION '/tmp/bat/'")
scala&amp;gt; sqlContext.sql("SELECT COUNT(*) FROM batting").collect().foreach(println)
&lt;/PRE&gt;&lt;P&gt;SELECT * worked as well.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="628-screen-shot-2015-12-05-at-101044-am.png" style="width: 1486px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/23933i2248B6307B1BD566/image-size/medium?v=v2&amp;amp;px=400" role="button" title="628-screen-shot-2015-12-05-at-101044-am.png" alt="628-screen-shot-2015-12-05-at-101044-am.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 19 Aug 2019 12:43:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97995#M61145</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2019-08-19T12:43:43Z</dc:date>
    </item>
    <item>
      <title>Re: Spark HiveContext - Querying External Hive Table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97996#M61146</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/200/awatson.html" nodeid="200"&gt;@Andrew Watson&lt;/A&gt; Do you see the following location?&lt;/P&gt;&lt;PRE&gt;hdfs dfs -ls /data/raw/EDW/PROD/Prod_DB/test/20151124/2014&lt;/PRE&gt;</description>
      <pubDate>Sat, 05 Dec 2015 23:12:49 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97996#M61146</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2015-12-05T23:12:49Z</dc:date>
    </item>
    <item>
      <title>Re: Spark HiveContext - Querying External Hive Table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97997#M61147</link>
      <description>&lt;P&gt;Note: in Spark Scala you can use reflection to infer the schema of an RDD, register it as a table, and then run SQL queries against it using the DataFrames API.&lt;/P&gt;&lt;P&gt;From the Spark example file RDDRelation.scala:&lt;/P&gt;&lt;PRE&gt;// Any RDD containing case classes can be registered as a table.  The schema of the table is
// automatically inferred using Scala reflection.
df.registerTempTable("records")

// Once tables have been registered, you can run SQL queries over them.
println("Result of SELECT *:")
sqlContext.sql("SELECT * FROM records").collect().foreach(println)
&lt;/PRE&gt;</description>
      <pubDate>Tue, 08 Dec 2015 07:38:18 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97997#M61147</guid>
      <dc:creator>phargis</dc:creator>
      <dc:date>2015-12-08T07:38:18Z</dc:date>
    </item>
    <item>
      <title>Re: Spark HiveContext - Querying External Hive Table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97998#M61148</link>
      <description>&lt;P&gt;Okay, we figured it out. I was talking about creating an external table in Hive and then using Spark SQL to query it. The external table had sub-directories (e.g. ParentDirectory/2015/01/data.txt) that Hive was easily able to traverse and query. However, SparkSQL (and Presto) weren't able to; SparkSQL would give the error mentioned above.&lt;/P&gt;&lt;P&gt;It wasn't until we properly defined the sub-directories as partitions in Hive (e.g. ParentDirectory/year=2015/month=01) and added them to the metastore (ALTER TABLE ... ADD PARTITION) that SparkSQL (and Presto) were finally able to query the table without issues.&lt;/P&gt;</description>
      <pubDate>Tue, 08 Dec 2015 23:01:18 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97998#M61148</guid>
      <dc:creator>awatson</dc:creator>
      <dc:date>2015-12-08T23:01:18Z</dc:date>
    </item>
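The fix described in the reply above (laying the data out with Hive partition naming and registering each directory in the metastore) can be sketched as follows. This is a sketch only, assuming Spark 1.x with a HiveContext; the table name, column, and paths are hypothetical placeholders, not the poster's actual DDL:

```scala
// Sketch only: hypothetical table, column, and paths; assumes Spark 1.x
// with an existing SparkContext `sc` and a Hive metastore available.
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

// Declare partition columns in the DDL instead of relying on bare
// nested directories such as ParentDirectory/2015/01.
sqlContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS test1 (col1 STRING)
  PARTITIONED BY (year INT, month INT)
  LOCATION '/location/to/parentdirectory'
""")

// Register each partition directory with the metastore so SparkSQL
// enumerates leaf files rather than hitting "Not a file" on directories.
sqlContext.sql("""
  ALTER TABLE test1 ADD IF NOT EXISTS
  PARTITION (year = 2015, month = 1)
  LOCATION '/location/to/parentdirectory/year=2015/month=01'
""")

// Queries now succeed from SparkSQL (and Presto) as well as Hive.
sqlContext.sql("SELECT COUNT(*) FROM test1").collect().foreach(println)
```

One ALTER TABLE statement is needed per sub-directory; Hive's MSCK REPAIR TABLE can register all partition-named directories in one pass if the layout already follows the year=/month= convention.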
    <item>
      <title>Re: Spark HiveContext - Querying External Hive Table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97999#M61149</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/140/nsabharwal.html" nodeid="140"&gt;@Neeraj Sabharwal&lt;/A&gt; see my comment below. If you want to reproduce it, create an external table that references a directory above the directories that actually contain the data, don't specify partitions, and try running a query:&lt;/P&gt;&lt;PRE&gt;CREATE EXTERNAL TABLE TEST1 (COL1 STRING) LOCATION '/location/to/parentdirectory';&lt;/PRE&gt;&lt;P&gt;Put data in /location/to/parentdirectory/2015/01, then try to query the table.&lt;/P&gt;</description>
      <pubDate>Tue, 08 Dec 2015 23:04:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97999#M61149</guid>
      <dc:creator>awatson</dc:creator>
      <dc:date>2015-12-08T23:04:16Z</dc:date>
    </item>
    <item>
      <title>Re: Spark HiveContext - Querying External Hive Table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/98000#M61150</link>
      <description>&lt;P&gt;Thanks &lt;A rel="user" href="https://community.cloudera.com/users/200/awatson.html" nodeid="200"&gt;@Andrew Watson&lt;/A&gt;. It was related to the setup.&lt;/P&gt;</description>
      <pubDate>Tue, 08 Dec 2015 23:40:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/98000#M61150</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2015-12-08T23:40:10Z</dc:date>
    </item>
    <item>
      <title>Re: Spark HiveContext - Querying External Hive Table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/98001#M61151</link>
      <description>&lt;P&gt;This can be achieved by setting the following property in Spark: &lt;CODE&gt;sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")&lt;/CODE&gt;&lt;/P&gt;&lt;P&gt;Note that the property is set using the sqlContext instead of the sparkContext; for DataFrames created from Hive tables, it must be set via the sqlContext. I tested this in Spark 1.6.2.&lt;/P&gt;</description>
      <pubDate>Mon, 16 Oct 2017 13:43:20 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/98001#M61151</guid>
      <dc:creator>sadiga</dc:creator>
      <dc:date>2017-10-16T13:43:20Z</dc:date>
    </item>
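Alternatively, the recursive-read property from the last reply can be applied without redefining the table as partitioned. A sketch assuming Spark 1.6.x with a HiveContext and a hypothetical table name:

```scala
// Sketch only: assumes Spark 1.6.x, an existing SparkContext `sc`,
// and a hypothetical table `test1` whose location has sub-directories.
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

// Set the property on the SQL context, not the SparkContext, so reads
// of Hive tables recurse into sub-directories instead of failing with
// "java.io.IOException: Not a file".
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")

sqlContext.sql("SELECT COUNT(*) FROM test1").collect().foreach(println)
```

Unlike the partition-based fix, this reads every file under the table location on each query, so it gives up partition pruning.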
  </channel>
</rss>

