Created 07-06-2017 12:10 AM
Hi,
I have a Spark cluster running on separate nodes (other than the data nodes), and I am trying to use Hive streaming (via HiveEndPoint) through an edge node. The issue is that Hive streaming internally accesses the data nodes (based on the storage descriptor defined in the metastore) to push the data. Port 50010 on the data nodes is not accessible from outside the edge node, so it throws the following exception:
Exception in thread "main" org.apache.hive.hcatalog.streaming.StreamingIOFailure: Unable to flush recordUpdater
    at org.apache.hive.hcatalog.streaming.AbstractRecordWriter.flush(AbstractRecordWriter.java:168)
    at org.apache.hive.hcatalog.streaming.StrictJsonWriter.flush(StrictJsonWriter.java:41)
    at org.apache.hive.hcatalog.streaming.HiveEndPoint$TransactionBatchImpl.commitImpl(HiveEndPoint.java:858)
    at org.apache.hive.hcatalog.streaming.HiveEndPoint$TransactionBatchImpl.commit(HiveEndPoint.java:833)
    at ...
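For context, the writer follows the standard HCatalog streaming pattern; a minimal sketch of that path (the metastore host, database/table names, and the JSON record below are placeholders, not the actual application code):

import java.nio.charset.StandardCharsets;

import org.apache.hive.hcatalog.streaming.HiveEndPoint;
import org.apache.hive.hcatalog.streaming.StreamingConnection;
import org.apache.hive.hcatalog.streaming.StrictJsonWriter;
import org.apache.hive.hcatalog.streaming.TransactionBatch;

public class HiveStreamingSketch {
    public static void main(String[] args) throws Exception {
        // The endpoint points at the metastore Thrift service reachable from the edge node.
        HiveEndPoint endPoint = new HiveEndPoint(
                "thrift://edge-node:9083", "HIVE_DATABASE", "HIVE_TABLE_NAME", null);
        StreamingConnection connection = endPoint.newConnection(true);
        StrictJsonWriter writer = new StrictJsonWriter(endPoint);

        TransactionBatch txnBatch = connection.fetchTransactionBatch(10, writer);
        txnBatch.beginNextTransaction();
        txnBatch.write("{\"id\": 1, \"msg\": \"hello\"}".getBytes(StandardCharsets.UTF_8));
        // commit() flushes the ORC delta files, which means writing directly to the
        // DataNodes over their transfer port (50010) -- the call that fails in the
        // stack trace above.
        txnBatch.commit();

        txnBatch.close();
        connection.close();
    }
}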
I am not sure what security risk is involved in opening port 50010 to the Spark cluster (the firewall can be configured to restrict which IP addresses are allowed to access port 50010).
This seems to be a limitation: Hive streaming cannot be performed from outside the edge node.
Thanks & Regards,
Arun
Created 07-06-2017 12:23 AM
If you open that port, you will not only open it for the Spark cluster, but also for anybody to exploit it for good or bad reasons. An edge node acts as a trusted proxy. This is part of the architecture, and the folks enforcing data security policies in your organization may not like to break it.
Created 07-06-2017 12:27 AM
Edge nodes also have the clients installed. You could add the clients to each Spark node.
Created 07-06-2017 01:11 AM
Thanks Constantin Stanca. Running Hive streaming (using HiveEndPoint) internally accesses the Hive data nodes, so either the Hive streaming design is the issue (the storage descriptor - SDS ID - is mapped to data nodes), or I don't see why the restriction that such programs can run only from edge nodes is imposed.
Somewhere on the internet I read the following:
"Typically in case you have a multi tenant cluster (which most hadoop clusters are bound to be) then ideally no one other than administrators have access to the machines that are the part of the cluster. Developers setup their own "edge-nodes". Edge Nodes basically have hadoop libraries and have the client configuration deployed to them (various xml files which tell the local installation where namenode, job tracker, zookeeper etc are core-site, mapred-site, hdfs-site.xml). But the edge node does not have any role as such in the cluster i.e. no persistent hadoop services are running on this node"
Please can you guide me on how to add the edge node clients to the Spark nodes? And can't we use kerberos.principal and a firewall (IP address restrictions) to provide extra security?
Thanks & Regards,
Arun
Created 07-06-2017 01:57 AM
Clients that run Hive, Pig and potentially M/R jobs that use HCatalog won't have this problem. This is about Spark.
I guess your app accesses the endpoint something like this:
new HiveEndPoint("thrift://" + hostname + ":9083", "HIVE_DATABASE", "HIVE_TABLE_NAME", null);
Your app uses the Hive metastore Thrift service installed on the edge node.
As in MapReduce, this service just tells Spark where the data is; the executors then parallelize the data operations and need to reach each individual data node directly, so port 50010 is a requirement. For clients that sit on the edge node and use HCatalog this is not a problem, but it is for Spark. If your Spark cluster is inside your HDP perimeter, then opening port 50010 on all data nodes should not be a security concern. You may need to work with your admin to open that port on all data nodes. That seems the better approach.
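If you go that route, a quick way to verify the result is to probe the DataNode transfer port from the executors themselves. A minimal sketch (the DataNode hostnames are placeholders; the probe is just a plain TCP connect, not an HDFS operation):

import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class DataNodePortCheck {
    public static void main(String[] args) {
        // Placeholder DataNode hostnames; replace with the cluster's actual nodes.
        List<String> dataNodes = Arrays.asList("datanode1.example.com", "datanode2.example.com");

        SparkConf conf = new SparkConf().setAppName("datanode-port-check");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Attempt a TCP connect to port 50010 from the executors,
            // which is where the Hive streaming writes would originate.
            List<String> results = sc.parallelize(dataNodes)
                    .map(host -> {
                        try (Socket socket = new Socket()) {
                            socket.connect(new InetSocketAddress(host, 50010), 5000);
                            return host + ":50010 reachable";
                        } catch (Exception e) {
                            return host + ":50010 NOT reachable (" + e.getMessage() + ")";
                        }
                    })
                    .collect();
            results.forEach(System.out::println);
        }
    }
}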
If your Spark cluster is outside the HDP perimeter (a truly separate cluster), then that is a bit more difficult. I am not aware of a successful implementation that got the security right.
I am not sure what the reasoning was for using Spark for this Hive ingest use case. NiFi would have been a better candidate.
Created 07-06-2017 02:11 AM
Thanks. Overall, what I understood is that an example like the one described here https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest can't run outside of the edge node unless that machine is part of the HDP network.
We are working with Akka Streams (http://doc.akka.io/docs/akka/snapshot/scala/stream/index.html) to read data from Kafka and sink it into Hive using the HCatalog Streaming API, so it can scale horizontally with multiple pods (Docker). That will allow us to scale on demand. So ideally we would be running a program similar to what is written at https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest from outside of the edge node.
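For what it's worth, a rough sketch of what each such pod would run (the Akka Streams plumbing is omitted and replaced with a plain KafkaConsumer poll loop; broker, topic, metastore host, and table names are placeholders, and the one-transaction-per-poll arrangement is just one possible choice):

import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Properties;

import org.apache.hive.hcatalog.streaming.HiveEndPoint;
import org.apache.hive.hcatalog.streaming.StreamingConnection;
import org.apache.hive.hcatalog.streaming.StrictJsonWriter;
import org.apache.hive.hcatalog.streaming.TransactionBatch;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaToHiveStreamingPod {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092");   // placeholder
        props.put("group.id", "hive-streaming-sink");          // shared group => partitions spread across pods
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Each pod opens its own streaming connection through the metastore.
        HiveEndPoint endPoint = new HiveEndPoint(
                "thrift://edge-node:9083", "HIVE_DATABASE", "HIVE_TABLE_NAME", null);
        StreamingConnection connection = endPoint.newConnection(true);
        StrictJsonWriter writer = new StrictJsonWriter(endPoint);
        TransactionBatch txnBatch = connection.fetchTransactionBatch(10, writer);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));   // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                if (records.isEmpty()) {
                    continue;
                }
                if (txnBatch.remainingTransactions() == 0) {
                    txnBatch.close();
                    txnBatch = connection.fetchTransactionBatch(10, writer);
                }
                // One Hive transaction per poll; commit() writes the ORC deltas
                // directly to the DataNodes, so each pod needs port 50010 open.
                txnBatch.beginNextTransaction();
                for (ConsumerRecord<String, String> record : records) {
                    txnBatch.write(record.value().getBytes(StandardCharsets.UTF_8));
                }
                txnBatch.commit();
            }
        } finally {
            txnBatch.close();
            connection.close();
        }
    }
}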