Member since
05-11-2017
11
Posts
0
Kudos Received
0
Solutions
05-16-2018
12:40 PM
We are using HDP version 2.6.0.3 with Spark Streaming (version 2.1.0) to read data from Kafka and persist it to Hive. We are seeing unusual behavior. The Spark job failed with an error and we had to restart it. Upon restart, it read everything from the beginning, ignoring the previous commits for the group id. This happens roughly once a month. Spark Streaming stores offsets in Kafka, and we can see from Kafka Manager that the last committed offset is properly reflected in Kafka. We are not sure why Spark Streaming is not picking it up from Kafka and instead starts reading from the very beginning. Can you please help?
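For reference, a minimal sketch of how offsets are committed back to Kafka with the spark-streaming-kafka-0-10 integration (broker address, group id, and topic name are placeholders, not values from the post). One common cause of the behavior described above is that Kafka expires committed offsets after `offsets.retention.minutes`; once they are gone, the consumer falls back to `auto.offset.reset`, which would explain a restart reading from the beginning:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val ssc = new StreamingContext(new SparkConf().setAppName("kafka-to-hive"), Seconds(30))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:6667",             // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-group",                          // placeholder
  // Only consulted when the group has NO committed offset in Kafka,
  // e.g. after offsets.retention.minutes has expired them.
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("mytopic"), kafkaParams))

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... persist the batch to Hive here ...
  // Commit only after the batch has been written successfully.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```

Checking the broker's `offsets.retention.minutes` against the roughly-monthly failure window may be worthwhile.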
11-01-2017
05:46 AM
We need to run CREATE TABLE and ALTER TABLE statements against Hive from within a Spark Streaming application. The Spark version is 2.1.x on HDP 2.6.2.
spark.sqlContext.sql("CREATE ...")
spark.sqlContext.sql("ALTER ...")
The CREATE statement works, but ALTER fails with the Spark error "Operation not allowed". As a workaround we thought: why can't we use Hive JDBC to issue the ALTER statements if Spark does not allow them? But the main problem is that Hive JDBC fails with Kerberos authentication when tried from within a Spark application. The same program (Hive JDBC) works with Kerberos when run as a standalone Java application. What is the way to supply Kerberos credentials to Hive JDBC when it is invoked inside a Spark application? Class.forName("org.apache.hive.jdbc.HiveDriver")
val conf: Configuration = new org.apache.hadoop.conf.Configuration();
conf.set("hadoop.security.authentication", "Kerberos");
UserGroupInformation.setConfiguration(conf);
UserGroupInformation.loginUserFromKeytab("user1@Domain", "hdfs://user/user1/user1.keytab");
println(s"********************** before connection")
val conn=DriverManager.getConnection("jdbc:hive2://domain:port/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;principal=hive/_HOST@Domain")
println(s"********************** ${conn}")
This code works when run as a standalone Java program but fails with a Kerberos error when called within a Spark application. It gives the error javax.security.auth.login.LoginException: Unable to obtain password from user. Can you please help in getting this working from inside a Spark application? The reason we are going with this approach is that Spark 2.1 does not support these ALTER TABLE statements.
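A minimal sketch of the usual pattern for this situation, with two assumptions flagged: the keytab path here is a hypothetical local file (loginUserFromKeytab expects a path on the local filesystem of the JVM doing the login, not an hdfs:// URI as in the snippet above), and the JDBC URL is the one from the post. The JDBC call is wrapped in `doAs` so it runs with the logged-in user's credentials:

```scala
import java.security.PrivilegedExceptionAction
import java.sql.{Connection, DriverManager}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

val conf = new Configuration()
conf.set("hadoop.security.authentication", "kerberos")
UserGroupInformation.setConfiguration(conf)

// Assumed local keytab path; when running on YARN, ship the keytab with
// --files and reference it by its file name on the executor's working dir.
val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
  "user1@Domain", "/etc/security/keytabs/user1.keytab")

Class.forName("org.apache.hive.jdbc.HiveDriver")

// Run the JDBC connection attempt as the keytab-authenticated user.
val conn = ugi.doAs(new PrivilegedExceptionAction[Connection] {
  override def run(): Connection =
    DriverManager.getConnection(
      "jdbc:hive2://domain:port/;serviceDiscoveryMode=zooKeeper;" +
      "zooKeeperNamespace=hiveserver2;principal=hive/_HOST@Domain")
})
```

The "Unable to obtain password from user" error typically indicates the keytab file was not readable at the given path on the node where the login ran, which is consistent with the hdfs:// path in the original code.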
11-01-2017
05:36 AM
We need to run CREATE TABLE and ALTER TABLE statements against Hive from within a Spark Streaming application. The Spark version is 2.1.x on HDP 2.6.2.
spark.sqlContext.sql("CREATE ...")
spark.sqlContext.sql("ALTER ...")
The CREATE statement works, but ALTER fails with the Spark error "Operation not allowed". We are stuck and not able to proceed further. Is there any way we can run ALTER TABLE commands against Hive on HDP 2.6.2 from within a Spark application? From the SparkSession we cannot get hold of the HiveContext, so that approach is also not possible. Any help in this matter is much appreciated.
08-17-2017
08:24 AM
@Eugene Koifman We have tested with Hive JDBC and Hive streaming. The behavior seems to be the same when we run compaction along with Hive JDBC: with compaction enabled, we don't see much difference between the two. It would be of great help if you could share more details on the advantages of Hive streaming compared to Hive JDBC.
08-03-2017
04:31 AM
Is the streaming API integrated with Spark? When we tried to use the HiveEndPoint classes within a Spark context, many weird class loader issues came up.
08-01-2017
08:47 AM
I am using a Spark session to save a data frame to a Hive table. The code is as below.
df.write.mode(SaveMode.Append).format("orc").insertInto("table")
The data comes to Spark from Kafka, and this can be a huge amount of data arriving throughout the day. Does the Spark dataframe save internally trigger Hive compaction? If not, what is the best way to run compaction at regular intervals without affecting inserts into the table?
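For context: `insertInto` on an ORC table does not trigger Hive compaction; compaction is a Hive ACID (transactional table) feature run by the metastore's compactor, either automatically or on request. A hedged sketch of requesting one manually over Hive JDBC, assuming the table is transactional (the URL, database, table, and partition spec are placeholders):

```scala
import java.sql.DriverManager

// Placeholder connection details; requires an ACID (transactional) table.
val conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default")
val stmt = conn.createStatement()
// Ask the Hive compactor to schedule a major compaction for one partition.
stmt.execute("ALTER TABLE mydb.mytable PARTITION (dt='2017-08-01') COMPACT 'major'")
stmt.close()
conn.close()
```

For a plain (non-ACID) table written by Spark, compaction does not apply at all; in that case the usual approach is to reduce the number of files at write time, e.g. `df.coalesce(n)` before `insertInto`.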
07-03-2017
11:37 AM
We are trying to run Hive streaming from outside the edge node; the name nodes are in HA. We are able to get table metadata during connection, but committing the transaction fails because a host is not reachable; the error is "org.apache.hive.hcatalog.streaming.StreamingIOFailure: Unable to flush recordUpdater". The program works when run from the Hadoop edge node but fails with the above exception when run from any other machine. The root cause is shown below:
Caused by: java.io.IOException: DataStreamer Exception:
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:577)
Caused by: java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:101)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1601)
The logs indicate it is able to connect to the Hive metastore, but flushing the recordUpdater fails with the exception. Code:
val hiveEP: HiveEndPoint = new HiveEndPoint(conf.getVar(ConfVars.METASTOREURIS), dbName, tableName, partitionVals.asJava)
val conn: StreamingConnection = hiveEP.newConnection(true, conf, "HiveStreamProcessor")
System.err.println("Got new connection")
val jsonWriter: StrictJsonWriter = new StrictJsonWriter(hiveEP, conf, conn)
val start = System.currentTimeMillis()
Is there anything specific we need to do to make it work from outside the edge node?
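The `UnresolvedAddressException` in the DataStreamer usually means the client can reach the metastore and namenode, but cannot resolve the datanode hostnames the namenode hands back when the write pipeline is built, which fits "works on the edge node, fails elsewhere". A possible client-side fix, not verified against this cluster, is to make the DFS client address datanodes by hostname and ensure those hostnames resolve on the client (e.g. via /etc/hosts):

```xml
<!-- hdfs-site.xml on the client machine (assumed fix, not cluster-verified) -->
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
```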
05-23-2017
11:03 AM
What is the advantage of Hive streaming over Hive JDBC? If we batch inserts over JDBC, what extra advantage does Hive streaming have over Hive JDBC?
05-14-2017
12:06 PM
Has anybody used Hive streaming inside Spark and deployed it in a cluster? Is this correct or wrong usage? Is there any URL that shows Hive streaming being used inside a Spark program in cluster mode?
05-14-2017
12:05 PM
Has anybody used Hive streaming inside Spark and deployed it in a cluster? Is this correct or wrong usage?
05-11-2017
05:11 PM
Is it a valid scenario to use Hive streaming inside a Spark program? I have seen examples of Hive streaming as a standalone program and of Spark Streaming writing to Hive, but I have never seen a program where Hive streaming is used inside a Spark application and submitted to a cluster. Does Hive streaming work inside a Spark application, or is this totally wrong usage? Please share your thoughts.