<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Spark HiveContext - Querying External Hive Table in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/98000#M61150</link>
    <description>&lt;P&gt;Thanks &lt;A rel="user" href="https://community.cloudera.com/users/200/awatson.html" nodeid="200"&gt;@Andrew Watson&lt;/A&gt;. It was related to the setup.&lt;/P&gt;</description>
    <pubDate>Tue, 08 Dec 2015 23:40:10 GMT</pubDate>
    <dc:creator>nsabharwal</dc:creator>
    <dc:date>2015-12-08T23:40:10Z</dc:date>
    <item>
      <title>Spark HiveContext - Querying External Hive Table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97993#M61143</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I am currently trying to use SparkSQL to query an external Hive table that points to a directory. When I attempt a SELECT * FROM TABLE, I get the following error:&lt;/P&gt;&lt;PRE&gt;15/11/30 15:25:01 INFO DefaultExecutionContext: Created broadcast 3 from broadcast at TableReader.scala:68
15/11/30 15:25:01 INFO FileInputFormat: Total input paths to process : 2
java.io.IOException: Not a file: hdfs://clster/data/raw/EDW/PROD/Prod_DB/test/20151124/2014
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:320)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:218)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:218)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:218)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:218)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:218)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:218)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
&lt;/PRE&gt;</description>
      <pubDate>Sat, 05 Dec 2015 02:45:48 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97993#M61143</guid>
      <dc:creator>awatson</dc:creator>
      <dc:date>2015-12-05T02:45:48Z</dc:date>
    </item>
    <item>
      <title>Re: Spark HiveContext - Querying External Hive Table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97994#M61144</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/200/awatson.html" nodeid="200"&gt;@Andrew Watson&lt;/A&gt; Is it a partitioned table?&lt;/P&gt;</description>
      <pubDate>Sat, 05 Dec 2015 04:34:48 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97994#M61144</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2015-12-05T04:34:48Z</dc:date>
    </item>
    <item>
      <title>Re: Spark HiveContext - Querying External Hive Table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97995#M61145</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/200/awatson.html" nodeid="200" target="_blank"&gt;@Andrew Watson&lt;/A&gt; I would love to reproduce this in my environment if you can share the table DDL.&lt;/P&gt;&lt;P&gt;In the meantime, please see this &lt;A target="_blank" href="http://stackoverflow.com/questions/32921510/how-to-avoid-not-a-file-exceptions-when-reading-from-hdfs-with-spark" rel="nofollow noopener noreferrer"&gt;thread&lt;/A&gt; and &lt;A target="_blank" href="http://stackoverflow.com/questions/28817940/recursively-fetch-file-contents-from-subdirectories-using-sc-textfile" rel="nofollow noopener noreferrer"&gt;thread2&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/200/awatson.html" nodeid="200" target="_blank"&gt;@Andrew Watson&lt;/A&gt; more details:&lt;/P&gt;&lt;PRE&gt;[hdfs@phdns02 ~]$ hdfs dfs -ls /tmp/bat
Found 1 items
-rw-r--r--   1 hdfs hdfs     621765 2015-12-05 07:07 /tmp/bat/BattingPost.csv
&lt;/PRE&gt;&lt;PRE&gt;scala&amp;gt; sqlContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS batting (col_value STRING) LOCATION '/tmp/bat/'")
scala&amp;gt; sqlContext.sql("SELECT COUNT(*) FROM batting").collect().foreach(println)
&lt;/PRE&gt;&lt;P&gt;SELECT * worked as well.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="628-screen-shot-2015-12-05-at-101044-am.png" style="width: 1486px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/23933i2248B6307B1BD566/image-size/medium?v=v2&amp;amp;px=400" role="button" title="628-screen-shot-2015-12-05-at-101044-am.png" alt="628-screen-shot-2015-12-05-at-101044-am.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 19 Aug 2019 12:43:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97995#M61145</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2019-08-19T12:43:43Z</dc:date>
    </item>
    <item>
      <title>Re: Spark HiveContext - Querying External Hive Table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97996#M61146</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/200/awatson.html" nodeid="200"&gt;@Andrew Watson&lt;/A&gt; Do you see the following location?&lt;/P&gt;&lt;PRE&gt;hdfs dfs -ls /data/raw/EDW/PROD/Prod_DB/test/20151124/2014&lt;/PRE&gt;</description>
      <pubDate>Sat, 05 Dec 2015 23:12:49 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97996#M61146</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2015-12-05T23:12:49Z</dc:date>
    </item>
    <item>
      <title>Re: Spark HiveContext - Querying External Hive Table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97997#M61147</link>
      <description>&lt;P&gt;Note: in Spark Scala you can use reflection to infer the schema of an RDD, register it as a table, and then run SQL queries against it using the DataFrames API.&lt;/P&gt;&lt;P&gt;From the Spark example file RDDRelation.scala:&lt;/P&gt;&lt;PRE&gt;// Any RDD containing case classes can be registered as a table.  The schema of the table is
// automatically inferred using Scala reflection.
df.registerTempTable("records")

// Once tables have been registered, you can run SQL queries over them.
println("Result of SELECT *:")
sqlContext.sql("SELECT * FROM records").collect().foreach(println)
&lt;/PRE&gt;</description>
      <pubDate>Tue, 08 Dec 2015 07:38:18 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97997#M61147</guid>
      <dc:creator>phargis</dc:creator>
      <dc:date>2015-12-08T07:38:18Z</dc:date>
    </item>
    <item>
      <title>Re: Spark HiveContext - Querying External Hive Table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97998#M61148</link>
      <description>&lt;P&gt;Okay, we figured it out. I was talking about creating an external table in Hive and then using Spark SQL to query it. The external table had sub-directories (e.g. ParentDirectory/2015/01/data.txt) that Hive was easily able to traverse and query. However, SparkSQL (and Presto) weren't able to; SparkSQL would give the error mentioned above.&lt;/P&gt;&lt;P&gt;It wasn't until we properly defined the sub-directories as partitions in Hive (e.g. ParentDirectory/year=2015/month=01) and added them to the metastore (ALTER TABLE ... ADD PARTITION) that SparkSQL (and Presto) were finally able to query the table without issues.&lt;/P&gt;</description>
      <pubDate>Tue, 08 Dec 2015 23:01:18 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97998#M61148</guid>
      <dc:creator>awatson</dc:creator>
      <dc:date>2015-12-08T23:01:18Z</dc:date>
    </item>
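The fix described in the reply above (laying the data out with Hive partition naming and registering each directory in the metastore) can be sketched as follows. This is a sketch only, assuming Spark 1.x with a HiveContext; the table name, column, and paths are hypothetical placeholders, not the poster's actual DDL:

```scala
// Sketch only: hypothetical table, column, and paths; assumes Spark 1.x
// with an existing SparkContext `sc` and a Hive metastore available.
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

// Declare partition columns in the DDL instead of relying on bare
// nested directories such as ParentDirectory/2015/01.
sqlContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS test1 (col1 STRING)
  PARTITIONED BY (year INT, month INT)
  LOCATION '/location/to/parentdirectory'
""")

// Register each partition directory with the metastore so SparkSQL
// enumerates leaf files rather than hitting "Not a file" on directories.
sqlContext.sql("""
  ALTER TABLE test1 ADD IF NOT EXISTS
  PARTITION (year = 2015, month = 1)
  LOCATION '/location/to/parentdirectory/year=2015/month=01'
""")

// Queries now succeed from SparkSQL (and Presto) as well as Hive.
sqlContext.sql("SELECT COUNT(*) FROM test1").collect().foreach(println)
```

One ALTER TABLE statement is needed per sub-directory; Hive's MSCK REPAIR TABLE can register all partition-named directories in one pass if the layout already follows the year=/month= convention.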
    <item>
      <title>Re: Spark HiveContext - Querying External Hive Table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97999#M61149</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/140/nsabharwal.html" nodeid="140"&gt;@Neeraj Sabharwal&lt;/A&gt; see my comment below. If you want to reproduce it, create an external table that references a directory above the directories that actually contain the data, don't specify partitions, and try running a query:&lt;/P&gt;&lt;PRE&gt;CREATE EXTERNAL TABLE TEST1 (COL1 STRING) LOCATION '/location/to/parentdirectory';&lt;/PRE&gt;&lt;P&gt;Put data in /location/to/parentdirectory/2015/01, then try to query the table.&lt;/P&gt;</description>
      <pubDate>Tue, 08 Dec 2015 23:04:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/97999#M61149</guid>
      <dc:creator>awatson</dc:creator>
      <dc:date>2015-12-08T23:04:16Z</dc:date>
    </item>
    <item>
      <title>Re: Spark HiveContext - Querying External Hive Table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/98000#M61150</link>
      <description>&lt;P&gt;Thanks &lt;A rel="user" href="https://community.cloudera.com/users/200/awatson.html" nodeid="200"&gt;@Andrew Watson&lt;/A&gt;. It was related to the setup.&lt;/P&gt;</description>
      <pubDate>Tue, 08 Dec 2015 23:40:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/98000#M61150</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2015-12-08T23:40:10Z</dc:date>
    </item>
    <item>
      <title>Re: Spark HiveContext - Querying External Hive Table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/98001#M61151</link>
      <description>&lt;P&gt;This can be achieved by setting the following property in Spark: &lt;CODE&gt;sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")&lt;/CODE&gt;&lt;/P&gt;&lt;P&gt;Note that the property is set using the sqlContext instead of the sparkContext; for DataFrames created from Hive tables, it must be set via the sqlContext. I tested this in Spark 1.6.2.&lt;/P&gt;</description>
      <pubDate>Mon, 16 Oct 2017 13:43:20 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-HiveContext-Querying-External-Hive-Table/m-p/98001#M61151</guid>
      <dc:creator>sadiga</dc:creator>
      <dc:date>2017-10-16T13:43:20Z</dc:date>
    </item>
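Alternatively, the recursive-read property from the last reply can be applied without redefining the table as partitioned. A sketch assuming Spark 1.6.x with a HiveContext and a hypothetical table name:

```scala
// Sketch only: assumes Spark 1.6.x, an existing SparkContext `sc`,
// and a hypothetical table `test1` whose location has sub-directories.
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

// Set the property on the SQL context, not the SparkContext, so reads
// of Hive tables recurse into sub-directories instead of failing with
// "java.io.IOException: Not a file".
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")

sqlContext.sql("SELECT COUNT(*) FROM test1").collect().foreach(println)
```

Unlike the partition-based fix, this reads every file under the table location on each query, so it gives up partition pruning.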
  </channel>
</rss>

