Member since: 07-05-2017
Posts: 7
Kudos Received: 3
Solutions: 0
03-20-2018 09:17 PM
@ekoifman we have tried with

val sparkLLAP = "com.hortonworks.spark" % "spark-llap-assembly_2.11" % "1.1.3-2.1"
libraryDependencies += sparkLLAP
resolvers += "Hortonworks Group" at "http://repo.hortonworks.com/content/groups/public/"

with the following sample code:

import org.apache.spark.SparkConf
import org.apache.spark.sql.{SQLContext, SparkSession}

val table1 = "transactional_table"

val sparkConf = new SparkConf()
sparkConf.set("spark.sql.warehouse.dir", "<<dir>>")
sparkConf.set("hive.exec.dynamic.partition", "true")
sparkConf.set("hive.exec.dynamic.partition.mode", "nonstrict")
sparkConf.set("hive.enforce.bucketing", "true")
sparkConf.set("spark.sql.hive.llap", "true")
sparkConf.set("spark.sql.hive.hiveserver2.jdbc.url", "jdbc:hive2://host1:2181,host2:2181,host3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;principal=hive/_HOST@R2.MADM.NET")

val spark = SparkSession.builder.appName("test").enableHiveSupport().config(sparkConf).getOrCreate()
val sqlContext: SQLContext = spark.sqlContext
val df = spark.sql(s"SELECT * FROM $table1")

We are getting the following error:

Exception in thread "main" java.sql.SQLException: Could not open client transport for any of the Server URI's in ZooKeeper: Unable to read HiveServer2 uri from ZooKeeper

Whereas without LLAP we are able to run Spark SQL.
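One way to narrow this down (a hedged sketch, not a confirmed fix): take ZooKeeper discovery out of the picture and point a plain Hive JDBC connection at a single HiveServer2 Interactive instance. The host and port 10500 below are assumptions; substitute your own. If a direct URL works while the discovery URL does not, it is worth checking whether LLAP registers under a different znode than plain HiveServer2 (e.g. zooKeeperNamespace=hiveserver2-hive2 on some HDP releases).

import java.sql.DriverManager

// Hypothetical direct connection, bypassing serviceDiscoveryMode=zooKeeper.
// host1 and port 10500 (a common HiveServer2 Interactive port) are assumptions.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection(
  "jdbc:hive2://host1:10500/;principal=hive/_HOST@R2.MADM.NET")
println(conn.getMetaData.getDatabaseProductName) // sanity check: prints the Hive product name
conn.close()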
11-08-2017 06:42 AM
We have a requirement where we have to stream DDL statements from Kafka and apply them to a Hive table. Can we use Spark Streaming with Hive JDBC to do this, given that Spark 2.1.x does not support "ALTER TABLE"?
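A minimal sketch of that approach, assuming the spark-streaming-kafka-0-10 integration and the Hive JDBC driver are on the classpath. The topic name, broker list, and JDBC URL are placeholders, and each Kafka record is assumed to carry one complete DDL statement:

import java.sql.DriverManager
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object DdlApplier {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("ddl-applier"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka1:9092", // placeholder
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "ddl-applier")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("hive-ddl"), kafkaParams))

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // One Hive JDBC connection per partition; the Hive driver, not Spark SQL,
        // executes the ALTER TABLE statements, so the 2.1.x limitation is avoided.
        val conn = DriverManager.getConnection("jdbc:hive2://hs2-host:10000/default") // placeholder
        val stmt = conn.createStatement()
        try records.foreach(r => stmt.execute(r.value()))
        finally { stmt.close(); conn.close() }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}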
07-06-2017 02:11 AM
Thanks. Overall, what I understood is that an example like the one described at https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest can't run outside of the edge node unless that machine is part of the HDP network. We are working with Akka Streams (http://doc.akka.io/docs/akka/snapshot/scala/stream/index.html) to get data from Kafka and sink it into Hive using the HCatalog Streaming API, so it can scale horizontally with multiple pods (Docker). That will allow us to scale on demand. So ideally we would be running a program similar to the one at https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest outside of the edge node.
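A minimal sketch of that pipeline shape with akka-stream-kafka, under stated assumptions: the broker list and topic are placeholders, and writeToHive is a hypothetical helper, not a real API:

import akka.actor.ActorSystem
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.Consumer
import akka.stream.ActorMaterializer
import org.apache.kafka.common.serialization.StringDeserializer

implicit val system = ActorSystem("hive-ingest")
implicit val mat = ActorMaterializer()

// Hypothetical helper that writes a batch of JSON records with the HCatalog
// Streaming API (see the HiveEndPoint sketch further down this page).
def writeToHive(records: Seq[String]): Unit = ???

val settings = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
  .withBootstrapServers("kafka1:9092") // assumption
  .withGroupId("hive-ingest")

// Every pod runs this same graph; Kafka's partition assignment spreads the
// load across pods, which is what makes the horizontal scaling work.
Consumer.plainSource(settings, Subscriptions.topics("events")) // topic is an assumption
  .map(_.value)
  .grouped(1000) // batch records so each Hive transaction batch does useful work
  .runForeach(writeToHive)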
07-06-2017 01:49 AM
This is a really good example of how the streaming API can be used with Hive. I tried to run this example code outside of the edge node, and it throws the following error:

hdfs.DFSClient: Exception in createBlockOutputStream
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=xxxx/xxx.yy.zz.135:50010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1601)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1342)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1295)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:463)
hdfs.DFSClient: Abandoning BP-

Does this mean that Hive streaming can work only from the edge node?

Thanks,
Arun
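A quick way to confirm this is purely a network-reachability problem (a sketch; "xxxx" stands for the DataNode host shown in the log):

import java.net.{InetSocketAddress, Socket}

// If this times out from the client machine, the HDFS write pipeline can never
// be built, which matches the ConnectTimeoutException above.
val sock = new Socket()
try sock.connect(new InetSocketAddress("xxxx", 50010), 5000) // DataNode transfer port
finally sock.close()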
07-06-2017 01:11 AM
2 Kudos
Thanks Constantin Stanca. Hive streaming (using the Hive endpoint) internally accesses the data nodes directly, so either that aspect of the Hive streaming design is the issue (the storage descriptor, SDS, maps to data nodes) or there must be a reason such a restriction is imposed that these programs can run only from edge nodes. Somewhere on the internet I read the following:

"Typically, in case you have a multi-tenant cluster (which most Hadoop clusters are bound to be), ideally no one other than administrators has access to the machines that are part of the cluster. Developers set up their own 'edge nodes'. Edge nodes basically have the Hadoop libraries and the client configuration deployed to them (the various XML files that tell the local installation where the NameNode, JobTracker, ZooKeeper, etc. are: core-site.xml, mapred-site.xml, hdfs-site.xml). But the edge node does not have any role as such in the cluster, i.e. no persistent Hadoop services are running on this node."

Can you please guide me on how I can add the edge-node clients on the Spark nodes? Can't we use the Kerberos principal and a firewall (IP address restrictions) to provide the extra security?

Thanks & Regards,
Arun
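On adding edge-node clients to the Spark nodes: in code terms, an "edge-node client" is little more than the Hadoop client jars plus the cluster's *-site.xml files. A minimal sketch, assuming those files have been copied to /etc/hadoop/conf on the Spark node:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Load the cluster's client configuration explicitly; normally this happens
// automatically when the config directory is on the classpath (HADOOP_CONF_DIR).
val conf = new Configuration()
conf.addResource(new Path("file:///etc/hadoop/conf/core-site.xml"))
conf.addResource(new Path("file:///etc/hadoop/conf/hdfs-site.xml"))

val fs = FileSystem.get(conf)
println(fs.getUri) // prints the NameNode URI if the configs were picked up

Note that this only makes a node a client; it does not remove the need for network reachability to the DataNodes, which is the actual blocker in this thread.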
07-06-2017 12:10 AM
1 Kudo
Hi, I have a Spark cluster running on separate nodes (other than the data nodes), and I am trying to use Hive streaming (using HiveEndPoint), which goes through the edge node. The issue is that Hive streaming internally accesses the data nodes (based on the storage descriptor defined in the metastore) to push data to them. DataNode port 50010 is not accessible from outside the edge node, so it throws the following exception:

Exception in thread "main" org.apache.hive.hcatalog.streaming.StreamingIOFailure: Unable to flush recordUpdater
at org.apache.hive.hcatalog.streaming.AbstractRecordWriter.flush(AbstractRecordWriter.java:168)
at org.apache.hive.hcatalog.streaming.StrictJsonWriter.flush(StrictJsonWriter.java:41)
at org.apache.hive.hcatalog.streaming.HiveEndPoint$TransactionBatchImpl.commitImpl(HiveEndPoint.java:858)
at org.apache.hive.hcatalog.streaming.HiveEndPoint$TransactionBatchImpl.commit(HiveEndPoint.java:833)
at ...

I am not sure what security risk is involved in opening port 50010 to the Spark cluster (the firewall can be configured to restrict which IP addresses may access port 50010). I see this as a limitation in not being able to run Hive streaming outside of the edge node.

Thanks & Regards,
Arun
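For reference, a minimal sketch of the client-side flow (the metastore URI, database, table, partition value, and record are assumptions). It also shows why the client needs DataNode access: the record writer writes ORC delta files straight into HDFS rather than routing data through HiveServer2:

import org.apache.hive.hcatalog.streaming.{HiveEndPoint, StrictJsonWriter}

val endPoint = new HiveEndPoint("thrift://metastore-host:9083", // assumption
  "default", "alerts", java.util.Arrays.asList("2017-07-06"))
val conn = endPoint.newConnection(true) // true = create the partition if absent
val writer = new StrictJsonWriter(endPoint)

val batch = conn.fetchTransactionBatch(10, writer)
batch.beginNextTransaction()
// write() goes through a record updater that holds an open HDFS output stream,
// i.e. a direct connection to DataNode port 50010; the flush in the stack
// trace above is where that connection gets exercised.
batch.write("""{"id": 1, "msg": "hello"}""".getBytes("UTF-8"))
batch.commit()
batch.close()
conn.close()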
Labels:
- Apache Hadoop
- Apache Hive