How to read table into Spark using the Hive tablename, not HDFS filename?

New Contributor

I successfully worked through Tutorial 400 (Using Hive with ORC from Apache Spark), but what I would really like to do is read established Hive ORC tables into Spark without having to know the HDFS path and filenames. I created an ORC table in Hive, then ran the following commands from the tutorial in Scala, but from the exception it appears that read/load expects an HDFS filename. How do I read directly from the Hive table, not from HDFS? I searched but could not find an existing answer.

Thanks much!

-Greg

hive> create table test_enc_orc stored as ORC as select * from test_enc;
hive> select count(*) from test_enc_orc; 
OK 
10

spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val test_enc_orc = hiveContext.read.format("orc").load("test_enc_orc")

java.io.FileNotFoundException: File does not exist: 
hdfs://sandbox.hortonworks.com:8020/user/xxxx/test_enc_orc
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1319)
1 ACCEPTED SOLUTION

@Greg Polanchyck, if you have an existing ORC table in the Hive metastore and you want to load the whole table into a Spark DataFrame, you can use the sql method on the hiveContext. (read.format("orc").load("test_enc_orc") treats its argument as a filesystem path, not a table name, which is why Spark resolved it against your HDFS home directory; going through the metastore avoids paths entirely.) Run:

val test_enc_orc = hiveContext.sql("select * from test_enc_orc")
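As a quick sanity check that the read went through the metastore rather than a path (a minimal sketch; the expected count comes from the Hive session above):

test_enc_orc.printSchema()   // schema comes from the metastore definition
test_enc_orc.count()         // should return 10, matching count(*) in Hive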

REPLIES

New Contributor

I like this better:

val test_enc_orc = hiveContext.table("test_enc_orc")
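table() asks the catalog for the table directly and returns it as a DataFrame, so there is no SQL string to parse. For anyone on Spark 2.x, where a Hive-enabled SparkSession replaces HiveContext, the equivalent would be (a sketch, assuming the session is built with enableHiveSupport()):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val test_enc_orc = spark.table("test_enc_orc")   // resolves the name via the metastore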

New Contributor

@slachterman Thank you very much! That worked well! -Greg

New Contributor

I'm also having the same problem; it gives this error:

INFO PerfLogger: </PERFLOG method=OrcGetSplits start=1492763204120 end=1492763204592 duration=472 from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>

Exception in thread "main" java.util.NoSuchElementException: next on empty iterator

at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64)
at scala.collection.IterableLike$class.head(IterableLike.scala:91)
at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:108)
at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120)
at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:108)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1422)
at org.apache.spark.sql.DataFrame.first(DataFrame.scala:1429)
at com.apollobit.jobs.TestData$.main(TestData.scala:32)
at com.apollobit.jobs.TestData.main(TestData.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Can anybody help, please?
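
Judging from the trace, the ORC read itself succeeded (the PerfLogger line shows OrcGetSplits completing); the NoSuchElementException comes from DataFrame.first() at TestData.scala:32, which throws this "next on empty iterator" error when the DataFrame has no rows. A defensive pattern is to test for rows before taking the first one (a sketch; the table name is only an example):

val df = hiveContext.table("test_enc_orc")
// take(1) checks for emptiness without scanning the whole table
if (df.take(1).isEmpty) {
  println("Query returned no rows")
} else {
  println(df.first())
}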