How to read table into Spark using the Hive tablename, not HDFS filename?

New Contributor

I successfully worked through Tutorial 400 (Using Hive with ORC from Apache Spark). But what I would really like to do is read established Hive ORC tables into Spark without having to know the HDFS paths and filenames. I created an ORC table in Hive, then ran the following commands from the tutorial in Scala, but judging from the exception, the read/load expects an HDFS filename. How do I read directly from the Hive table, not HDFS? I searched but could not find an existing answer.

Thanks much!

-Greg

hive> create table test_enc_orc stored as ORC as select * from test_enc;
hive> select count(*) from test_enc_orc; 
OK 
10

spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val test_enc_orc = hiveContext.read.format("orc").load("test_enc_orc")

java.io.FileNotFoundException: File does not exist: 
hdfs://sandbox.hortonworks.com:8020/user/xxxx/test_enc_orc
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1319)
1 ACCEPTED SOLUTION

@Greg Polanchyck, if you have an existing ORC table in the Hive metastore and you want to load the whole table into a Spark DataFrame, you can use the sql method on the hiveContext to run:

val test_enc_orc = hiveContext.sql("select * from test_enc_orc")
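
The result is an ordinary DataFrame, so the usual operations work on it. For example (a minimal sketch against the 10-row table created above):

test_enc_orc.printSchema()      // schema comes from the Hive metastore, not from a file path
println(test_enc_orc.count())   // should print 10, matching the Hive count above
test_enc_orc.show()             // preview the rows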

3 REPLIES

New Contributor

I like this better:

val test_enc_orc = hiveContext.table("test_enc_orc")
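
One nicety of table is that it takes the table name directly, optionally qualified with a database. A small sketch, assuming the table lives in a database named mydb (a hypothetical name):

val df = hiveContext.table("mydb.test_enc_orc")   // database-qualified lookup, no HDFS path needed
df.show()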

New Contributor

@slachterman Thank you very much! That worked well! -Greg

New Contributor

I'm also having the same problem; it gives this error:

INFO PerfLogger: </PERFLOG method=OrcGetSplits start=1492763204120 end=1492763204592 duration=472 from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>

Exception in thread "main" java.util.NoSuchElementException: next on empty iterator

at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64)
at scala.collection.IterableLike$class.head(IterableLike.scala:91)
at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:108)
at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120)
at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:108)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1422)
at org.apache.spark.sql.DataFrame.first(DataFrame.scala:1429)
at com.apollobit.jobs.TestData$.main(TestData.scala:32)
at com.apollobit.jobs.TestData.main(TestData.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Can anybody please help?
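
Judging from the stack trace, DataFrame.first() at TestData.scala:32 is being called on an empty result set; in Spark 1.x, first() throws exactly this "next on empty iterator" error when the DataFrame has no rows. A defensive sketch, assuming a HiveContext named hiveContext as in the examples above (the query itself is illustrative):

val df = hiveContext.sql("select * from test_enc_orc")
val rows = df.take(1)                 // take(1) returns an empty Array instead of throwing
if (rows.isEmpty) {
  println("Query returned no rows")   // handle the empty case explicitly
} else {
  println(rows.head)                  // safe equivalent of first()
}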