Created on 07-10-2016 01:59 PM - edited 09-16-2022 03:29 AM
I successfully worked through Tutorial 400 (Using Hive with ORC from Apache Spark). But what I would really like to do is read existing Hive ORC tables into Spark without having to know the HDFS path and filenames. I created an ORC table in Hive, then ran the following commands from the tutorial in Scala, but from the exception it appears that the read/load is expecting an HDFS filename. How do I read directly from the Hive table, not from HDFS? I searched but could not find an existing answer.
Thanks much!
-Greg
hive> create table test_enc_orc stored as ORC as select * from test_enc;
hive> select count(*) from test_enc_orc;
OK
10

spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m

import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val test_enc_orc = hiveContext.read.format("orc").load("test_enc_orc")

java.io.FileNotFoundException: File does not exist: hdfs://sandbox.hortonworks.com:8020/user/xxxx/test_enc_orc
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1319)
Created 07-10-2016 10:02 PM
@Greg Polanchyck if you have an existing ORC table in the Hive metastore, and you want to load the whole table into a Spark DataFrame, you can use the sql method on the hiveContext to run:
val test_enc_orc = hiveContext.sql("select * from test_enc_orc")
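As a minimal sketch of how that fits together (the printSchema/count calls below are just illustrative sanity checks, not part of the original answer):

// Query the Hive metastore table directly; Spark resolves the HDFS location itself
val test_enc_orc = hiveContext.sql("select * from test_enc_orc")
// Quick checks that the table was picked up from the metastore
test_enc_orc.printSchema()
test_enc_orc.count()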
Created 07-27-2017 02:45 PM
I like this more:
val test_enc_orc = hiveContext.table("test_enc_orc")
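A hedged usage sketch to go with it (the limit value is arbitrary, just to show that table() returns an ordinary DataFrame):

// table() reads the Hive metastore entry and returns a DataFrame,
// so the usual transformations apply directly
val subset = hiveContext.table("test_enc_orc").limit(10)
subset.show()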
Created 07-11-2016 03:03 AM
@slachterman Thank you very much! That worked well! -Greg
Created 04-21-2017 01:03 PM
I am also having the same problem; it gives this error:
INFO PerfLogger: </PERFLOG method=OrcGetSplits start=1492763204120 end=1492763204592 duration=472 from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
Exception in thread "main" java.util.NoSuchElementException: next on empty iterator
    at scala.collection.Iterator$anon$2.next(Iterator.scala:39)
    at scala.collection.Iterator$anon$2.next(Iterator.scala:37)
    at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64)
    at scala.collection.IterableLike$class.head(IterableLike.scala:91)
    at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$super$head(ArrayOps.scala:108)
    at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120)
    at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:108)
    at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1422)
    at org.apache.spark.sql.DataFrame.first(DataFrame.scala:1429)
    at com.apollobit.jobs.TestData$.main(TestData.scala:32)
    at com.apollobit.jobs.TestData.main(TestData.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Can anybody please help?
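Judging from the stack trace (DataFrame.first at TestData.scala:32), this looks like first() being called on a DataFrame that came back empty, which throws "next on empty iterator". A minimal guard, as a sketch only (testDf is a hypothetical name for whatever DataFrame TestData.scala builds):

// first() throws NoSuchElementException on an empty DataFrame,
// so check for at least one row before calling it
if (testDf.take(1).nonEmpty) {
  val firstRow = testDf.first()
  println(firstRow)
} else {
  println("Query returned no rows")
}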