Member since: 07-05-2017
Posts: 7
Kudos Received: 3
Solutions: 0
03-20-2018 09:17 PM
@ekoifman we have tried with

val sparkLLAP = "com.hortonworks.spark" % "spark-llap-assembly_2.11" % "1.1.3-2.1"
libraryDependencies += sparkLLAP
resolvers += "Hortonworks Group" at "http://repo.hortonworks.com/content/groups/public/"

with the following sample code:

import org.apache.spark.SparkConf
import org.apache.spark.sql.{SQLContext, SparkSession}

val table1 = "transactional_table"

val sparkConf = new SparkConf()
sparkConf.set("spark.sql.warehouse.dir", "<<dir>>")
sparkConf.set("hive.exec.dynamic.partition", "true")
sparkConf.set("hive.exec.dynamic.partition.mode", "nonstrict")
sparkConf.set("hive.enforce.bucketing", "true")
sparkConf.set("spark.sql.hive.llap", "true")
sparkConf.set("spark.sql.hive.hiveserver2.jdbc.url", "jdbc:hive2://host1:2181,host2:2181,host3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;principal=hive/_HOST@R2.MADM.NET")

val spark = SparkSession.builder.appName("test").enableHiveSupport().config(sparkConf).getOrCreate()
val sqlContext: SQLContext = spark.sqlContext
val df = spark.sql(s"SELECT * FROM $table1")

We are getting the following error:

Exception in thread "main" java.sql.SQLException: Could not open client transport for any of the Server URI's in ZooKeeper: Unable to read HiveServer2 uri from ZooKeeper

Whereas without LLAP we are able to run Spark SQL.
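One way to narrow this down (a hedged sketch, not a confirmed fix): take ZooKeeper discovery out of the picture and point a plain Hive JDBC connection at a single HiveServer2 Interactive instance. The host and port 10500 below are assumptions; substitute your own. If a direct URL works while the discovery URL does not, it is worth checking whether LLAP registers under a different znode than plain HiveServer2 (e.g. zooKeeperNamespace=hiveserver2-hive2 on some HDP releases).

import java.sql.DriverManager

// Hypothetical direct connection, bypassing serviceDiscoveryMode=zooKeeper.
// host1 and port 10500 (a common HiveServer2 Interactive port) are assumptions.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection(
  "jdbc:hive2://host1:10500/;principal=hive/_HOST@R2.MADM.NET")
println(conn.getMetaData.getDatabaseProductName) // sanity check: prints the Hive product name
conn.close()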
11-08-2017 06:42 AM
We have a requirement where we have to stream DDL statements from Kafka and apply them to a Hive table. Can we use Spark Streaming with Hive JDBC to do this, given that Spark 2.1.x does not support "ALTER TABLE"?
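A minimal sketch of that approach, assuming the spark-streaming-kafka-0-10 integration and the Hive JDBC driver are on the classpath. The topic name, broker list, and JDBC URL are placeholders, and each Kafka record is assumed to carry one complete DDL statement:

import java.sql.DriverManager
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object DdlApplier {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("ddl-applier"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka1:9092", // placeholder
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "ddl-applier")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("hive-ddl"), kafkaParams))

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // One Hive JDBC connection per partition; the Hive driver, not Spark SQL,
        // executes the ALTER TABLE statements, so the 2.1.x limitation is avoided.
        val conn = DriverManager.getConnection("jdbc:hive2://hs2-host:10000/default") // placeholder
        val stmt = conn.createStatement()
        try records.foreach(r => stmt.execute(r.value()))
        finally { stmt.close(); conn.close() }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}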
07-06-2017 02:11 AM
Thanks. Overall, what I understood is that an example like the one described at https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest can't run outside of the edge node unless that machine is part of the HDP network. We are working with Akka Streams (http://doc.akka.io/docs/akka/snapshot/scala/stream/index.html) to get data from Kafka and sink it into Hive using the HCatalog Streaming API, so it can scale horizontally with multiple pods (Docker). That will allow us to scale on demand. So ideally we would be running a program similar to the one at https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest outside of the edge node.
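A minimal sketch of that pipeline shape with akka-stream-kafka, under stated assumptions: the broker list and topic are placeholders, and writeToHive is a hypothetical helper, not a real API:

import akka.actor.ActorSystem
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.Consumer
import akka.stream.ActorMaterializer
import org.apache.kafka.common.serialization.StringDeserializer

implicit val system = ActorSystem("hive-ingest")
implicit val mat = ActorMaterializer()

// Hypothetical helper that writes a batch of JSON records with the HCatalog
// Streaming API (see the HiveEndPoint sketch further down this page).
def writeToHive(records: Seq[String]): Unit = ???

val settings = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
  .withBootstrapServers("kafka1:9092") // assumption
  .withGroupId("hive-ingest")

// Every pod runs this same graph; Kafka's partition assignment spreads the
// load across pods, which is what makes the horizontal scaling work.
Consumer.plainSource(settings, Subscriptions.topics("events")) // topic is an assumption
  .map(_.value)
  .grouped(1000) // batch records so each Hive transaction batch does useful work
  .runForeach(writeToHive)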
07-06-2017 01:49 AM
This is a really good example of how the streaming API can be used with Hive. I tried to run this example code outside of the edge node, and it throws the following error:

hdfs.DFSClient: Exception in createBlockOutputStream
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=xxxx/xxx.yy.zz.135:50010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1601)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1342)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1295)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:463)
hdfs.DFSClient: Abandoning BP-

Does this mean that Hive streaming can work only from the edge node?

Thanks,
Arun
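A quick way to confirm this is purely a network-reachability problem (a sketch; "xxxx" stands for the DataNode host shown in the log):

import java.net.{InetSocketAddress, Socket}

// If this times out from the client machine, the HDFS write pipeline can never
// be built, which matches the ConnectTimeoutException above.
val sock = new Socket()
try sock.connect(new InetSocketAddress("xxxx", 50010), 5000) // DataNode transfer port
finally sock.close()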
07-06-2017 01:11 AM
2 Kudos
Thanks Constantin Stanca. Hive streaming (using the Hive endpoint) internally accesses the data nodes directly, so either that aspect of the Hive streaming design is the issue (the storage descriptor, SDS, maps to data nodes) or there must be a reason such a restriction is imposed that these programs can run only from edge nodes. Somewhere on the internet I read the following:

"Typically, in case you have a multi-tenant cluster (which most Hadoop clusters are bound to be), ideally no one other than administrators has access to the machines that are part of the cluster. Developers set up their own 'edge nodes'. Edge nodes basically have the Hadoop libraries and the client configuration deployed to them (the various XML files that tell the local installation where the NameNode, JobTracker, ZooKeeper, etc. are: core-site.xml, mapred-site.xml, hdfs-site.xml). But the edge node does not have any role as such in the cluster, i.e. no persistent Hadoop services are running on this node."

Can you please guide me on how I can add the edge-node clients on the Spark nodes? Can't we use the Kerberos principal and a firewall (IP address restrictions) to provide the extra security?

Thanks & Regards,
Arun
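On adding edge-node clients to the Spark nodes: in code terms, an "edge-node client" is little more than the Hadoop client jars plus the cluster's *-site.xml files. A minimal sketch, assuming those files have been copied to /etc/hadoop/conf on the Spark node:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Load the cluster's client configuration explicitly; normally this happens
// automatically when the config directory is on the classpath (HADOOP_CONF_DIR).
val conf = new Configuration()
conf.addResource(new Path("file:///etc/hadoop/conf/core-site.xml"))
conf.addResource(new Path("file:///etc/hadoop/conf/hdfs-site.xml"))

val fs = FileSystem.get(conf)
println(fs.getUri) // prints the NameNode URI if the configs were picked up

Note that this only makes a node a client; it does not remove the need for network reachability to the DataNodes, which is the actual blocker in this thread.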
07-06-2017 12:10 AM
1 Kudo
Hi, I have a Spark cluster running on separate nodes (other than the data nodes), and I am trying to use Hive streaming (using HiveEndPoint), which goes through the edge node. The issue is that Hive streaming internally accesses the data nodes (based on the storage descriptor defined in the metastore) to push data to them. DataNode port 50010 is not accessible from outside the edge node, so it throws the following exception:

Exception in thread "main" org.apache.hive.hcatalog.streaming.StreamingIOFailure: Unable to flush recordUpdater
at org.apache.hive.hcatalog.streaming.AbstractRecordWriter.flush(AbstractRecordWriter.java:168)
at org.apache.hive.hcatalog.streaming.StrictJsonWriter.flush(StrictJsonWriter.java:41)
at org.apache.hive.hcatalog.streaming.HiveEndPoint$TransactionBatchImpl.commitImpl(HiveEndPoint.java:858)
at org.apache.hive.hcatalog.streaming.HiveEndPoint$TransactionBatchImpl.commit(HiveEndPoint.java:833)
at ...

I am not sure what security risk is involved in opening port 50010 to the Spark cluster (the firewall can be configured to restrict which IP addresses may access port 50010). I see this as a limitation in not being able to run Hive streaming outside of the edge node.

Thanks & Regards,
Arun
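For reference, a minimal sketch of the client-side flow (the metastore URI, database, table, partition value, and record are assumptions). It also shows why the client needs DataNode access: the record writer writes ORC delta files straight into HDFS rather than routing data through HiveServer2:

import org.apache.hive.hcatalog.streaming.{HiveEndPoint, StrictJsonWriter}

val endPoint = new HiveEndPoint("thrift://metastore-host:9083", // assumption
  "default", "alerts", java.util.Arrays.asList("2017-07-06"))
val conn = endPoint.newConnection(true) // true = create the partition if absent
val writer = new StrictJsonWriter(endPoint)

val batch = conn.fetchTransactionBatch(10, writer)
batch.beginNextTransaction()
// write() goes through a record updater that holds an open HDFS output stream,
// i.e. a direct connection to DataNode port 50010; the flush in the stack
// trace above is where that connection gets exercised.
batch.write("""{"id": 1, "msg": "hello"}""".getBytes("UTF-8"))
batch.commit()
batch.close()
conn.close()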
Labels:
- Apache Hadoop
- Apache Hive