Member since: 03-16-2016 | 707 Posts | 1753 Kudos Received | 203 Solutions
07-19-2017
09:26 PM
9 Kudos
Overview

The following versions of Apache Kafka have been incorporated in HDP 2.2.0 through 2.6.1: 0.8.1, 0.8.2, 0.9.0, 0.10.0, and 0.10.1. Apache Kafka itself is now at 0.11. Hortonworks is working to make Kafka easier for enterprises to use. New focus areas include a Kafka Admin Panel to create/delete topics and manage user permissions, easier and safer distribution of security tokens, and support for multiple ways of publishing/consuming data via a Kafka REST server/API. Here are a few areas of strong contribution:

Operations:
- Rack awareness for increased resilience and availability: replicas are isolated so they are guaranteed to span multiple racks or availability zones.
- Automated replica leader election for an even distribution of leaders in a cluster: uneven distribution, where some brokers serve more data than others, is detected and adjusted automatically.
- Message timestamps: every message in Kafka now has a timestamp field that indicates the time at which the message was produced.
- SASL improvements, including external authentication servers and support for multiple types of SASL authentication on one server.
- Ambari Views for visualization of Kafka operational metrics.

Security: Kafka security encompasses multiple needs: encrypting the data flowing through Kafka, preventing rogue agents from publishing data to Kafka, and managing access to specific topics at an individual or group level. As a result, the latest updates in Kafka support wire encryption via SSL, Kerberos-based authentication, and granular authorization options via Apache Ranger or another pluggable authorization system. A minimal client-side sketch of these settings follows below.
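To make the client-side security and timestamp settings above concrete, here is a minimal producer sketch (not part of the original article). It assumes a Kerberized cluster reached over SASL_SSL; the broker list, topic name, truststore path, and password below are placeholders to adapt to your environment:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SecureProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker list; adjust to your cluster.
        props.put("bootstrap.servers", "broker1:6667,broker2:6667");
        // Wire encryption plus Kerberos (SASL/GSSAPI) -- assumes the cluster is Kerberized.
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "GSSAPI");
        props.put("sasl.kerberos.service.name", "kafka");
        props.put("ssl.truststore.location", "/etc/security/truststore.jks"); // placeholder path
        props.put("ssl.truststore.password", "changeit");                     // placeholder password
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Since 0.10, every record carries a timestamp; pass one explicitly
            // or let the client/broker assign it.
            ProducerRecord<String, String> record =
                new ProducerRecord<>("test-topic", null, System.currentTimeMillis(), "key-1", "hello kafka");
            producer.send(record);
        }
    }
}
```

The same security.protocol, SASL, and SSL properties apply to consumers and other clients.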
This article lists new features beyond the Hortonworks contribution. At a high level, the following have been added by the overall community:

- Kafka Streams API
- Kafka Connect API
- New unified Consumer API
- Transport encryption using TLS/SSL
- Kerberos/SASL authentication support
- Access Control Lists
- Timestamps on messages
- Reduced client dependence on ZooKeeper (offsets stored in a Kafka topic)
- Client interceptors

A short sketch of the new unified Consumer API follows below.
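For reference, here is a minimal sketch of the new unified Consumer API mentioned above (assuming a recent kafka-clients dependency; the broker address, group id, and topic name are placeholders):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class NewConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:6667"); // placeholder broker
        props.put("group.id", "demo-group");            // placeholder consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("test-topic"));
            for (int i = 0; i < 10; i++) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Each record exposes its timestamp alongside key, value, and offset.
                    System.out.printf("offset=%d timestamp=%d value=%s%n",
                            record.offset(), record.timestamp(), record.value());
                }
            }
        }
    }
}
```

Note that the group's offsets are committed to the internal __consumer_offsets topic rather than to ZooKeeper, which is the "reduced client dependence on ZooKeeper" item above.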
New Features Since HDP 2.2

Here is the list of NEW FEATURES as they have been included in the release notes.

Kafka 0.8.1: https://archive.apache.org/dist/kafka/0.8.1/RELEASE_NOTES.html
- [KAFKA-330] - Add delete topic support
- [KAFKA-554] - Move all per-topic configuration into ZK and add to the CreateTopicCommand
- [KAFKA-615] - Avoid fsync on log segment roll
- [KAFKA-657] - Add an API to commit offsets
- [KAFKA-925] - Add optional partition key override in producer
- [KAFKA-1092] - Add server config parameter to separate bind address and ZK hostname
- [KAFKA-1117] - tool for checking the consistency among replicas

Kafka 0.8.2: https://archive.apache.org/dist/kafka/0.8.2.0/RELEASE_NOTES.html
- [KAFKA-1000] - Inbuilt consumer offset management feature for kafka
- [KAFKA-1227] - Code dump of new producer
- [KAFKA-1384] - Log Broker state
- [KAFKA-1443] - Add delete topic to topic commands and update DeleteTopicCommand
- [KAFKA-1512] - Limit the maximum number of connections per ip address
- [KAFKA-1597] - New metrics: ResponseQueueSize and BeingSentResponses
- [KAFKA-1784] - Implement a ConsumerOffsetClient library

Kafka 0.9.0: https://archive.apache.org/dist/kafka/0.9.0.0/RELEASE_NOTES.html
- [KAFKA-1499] - Broker-side compression configuration
- [KAFKA-1785] - Consumer offset checker should show the offset manager and offsets partition
- [KAFKA-2120] - Add a request timeout to NetworkClient
- [KAFKA-2187] - Introduce merge-kafka-pr.py script

Kafka 0.10.0: https://archive.apache.org/dist/kafka/0.10.0.0/RELEASE_NOTES.html
- [KAFKA-2832] - support exclude.internal.topics in new consumer
- [KAFKA-3046] - add ByteBuffer Serializer&Deserializer
- [KAFKA-3490] - Multiple version support for ducktape performance tests

Kafka 0.10.0.1: https://archive.apache.org/dist/kafka/0.10.0.1/RELEASE_NOTES.html
- [KAFKA-3538] - Abstract the creation/retrieval of Producer for stream sinks for unit testing

Kafka 0.10.1: https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html
- [KAFKA-1464] - Add a throttling option to the Kafka replication tool
- [KAFKA-3176] - Allow console consumer to consume from particular partitions when new consumer is used
- [KAFKA-3492] - support quota based on authenticated user name
- [KAFKA-3776] - Unify store and downstream caching in streams
- [KAFKA-3858] - Add functions to print stream topologies
- [KAFKA-3909] - Queryable state for Kafka Streams
- [KAFKA-4015] - Change cleanup.policy config to accept a list of valid policies
- [KAFKA-4093] - Cluster id

Final Notes

Apache Kafka shines in use cases like:
- replacement for a more traditional message broker
- user activity tracking pipeline as a set of real-time publish-subscribe feeds (the original use case)
- operational monitoring data
- log aggregation
- stream processing (see the sketch below)
- event sourcing
- commit log

Apache Kafka continues to be a dynamic and extremely popular project with more and more adoption.
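The stream-processing use case maps to the Kafka Streams API and the queryable state feature (KAFKA-3909) listed earlier. As a hedged illustration (not from the original article), here is a minimal word-count sketch using the current StreamsBuilder API, which differs slightly from the 0.10.x API shipped with HDP; the application id, broker address, and topic names are placeholders:

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-sketch"); // placeholder app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:6667");  // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("text-input");       // placeholder topic
        KTable<String, Long> counts = lines
                .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
                .groupBy((key, word) -> word)
                .count();                                                   // backed by a local state store
        counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The state store behind count() is what the "Queryable state for Kafka Streams" feature exposes at runtime through interactive queries.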
07-17-2017
04:29 PM
5 Kudos
Assume Zeppelin is used to execute Hive queries. See: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_zeppelin-component-guide/content/config-hive-access.html Would auditing capture the actions as the actual Zeppelin user, or as the single Hive user defined for the Zeppelin connection that is shared among many Zeppelin users?
Labels:
- Apache Hive
- Apache Zeppelin
07-06-2017
01:57 AM
3 Kudos
@Arun Sethia Clients that run Hive, Pig, and potentially M/R jobs that use HCatalog won't have this problem. This is about Spark. I guess your app accesses the endpoint with something like this: HiveEndPoint("thrift://"+hostname+":9083","HIVE_DATABASE","HIVE_TABLE_NAME",null); Your app uses the Thrift service installed on the edge node. As in MapReduce, this service only tells Spark where the data is; the executors then parallelize the data action and need to access each individual data node, so port 50010 is a requirement. That is not a problem for clients using HCatalog, but it is for Spark (see the sketch below). If your Spark cluster is inside your HDP perimeter, then opening port 50010 on all data nodes should not be a security concern. You may need to work with your admin to open that port for all data nodes; that seems the better approach. If your Spark cluster is outside the HDP perimeter (a truly separate cluster), then it is a bit more difficult, and I am not aware of a successful implementation that got the security right. I am also not sure what the reasoning was for using Spark for this ingest-to-Hive use case. NiFi would have been a better candidate.
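For context, here is roughly what that ingest path looks like with the hive-hcatalog-streaming API the HiveEndPoint call above comes from. This is only a sketch: the metastore host, database, table, and column names are placeholders, and it assumes a transactional ORC table.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
import org.apache.hive.hcatalog.streaming.HiveEndPoint;
import org.apache.hive.hcatalog.streaming.StreamingConnection;
import org.apache.hive.hcatalog.streaming.TransactionBatch;

public class HiveStreamingSketch {
    public static void main(String[] args) throws Exception {
        // The metastore URI (port 9083) is only used to look up table/partition metadata.
        HiveEndPoint endPoint = new HiveEndPoint(
                "thrift://metastore-host:9083", "HIVE_DATABASE", "HIVE_TABLE_NAME", null);

        StreamingConnection connection = endPoint.newConnection(true);
        DelimitedInputWriter writer =
                new DelimitedInputWriter(new String[]{"id", "msg"}, ",", endPoint); // placeholder columns

        // The writes below go straight to the DataNodes holding the table's ORC delta
        // files, which is why port 50010 must be reachable from the executors.
        TransactionBatch txnBatch = connection.fetchTransactionBatch(10, writer);
        txnBatch.beginNextTransaction();
        txnBatch.write("1,hello".getBytes(StandardCharsets.UTF_8));
        txnBatch.commit();
        txnBatch.close();
        connection.close();
    }
}
```

Only the metadata lookups go through port 9083; the record writes land directly on the DataNodes, which is where port 50010 comes in.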
07-06-2017
12:27 AM
3 Kudos
@Arun Sethia Edge nodes also have the clients installed. You could add the clients to each Spark node.
07-06-2017
12:23 AM
3 Kudos
@Arun Sethia If you open that port, you open it not only for the Spark cluster but also for anybody to exploit, for good or bad reasons. An edge node acts as a trusted proxy. This is part of the architecture, and the folks enforcing data security policies in your organization may not want to break it.
06-12-2017
06:25 PM
2 Kudos
@Alvin Jin Also, let's keep in mind that Confluent Schema Registry is for Avro and for Kafka only. Hortonworks Schema Registry is meant to manage much more than Avro for Kafka: it targets all consumers across the HDF and HDP platforms, not only Kafka, and will gradually support all commonly used formats. Stay tuned.
06-07-2017
08:48 PM
2 Kudos
@Matt Burgess Thank you so much.
06-07-2017
01:52 AM
5 Kudos
Assume a data file written with an Avro schema, "avro schema short". Assume an "avro schema full" that includes everything in "avro schema short" plus some additional fields with default values. The short-schema file can have its fields in any order; field-wise it is a subset of the super-schema, and the fields are not necessarily in the same order. How would one use NiFi out-of-the-box processors to transform the first data file into "avro schema full", setting the additional fields to the default values specified in "avro schema full"? It could be a creative solution involving one of the Execute ... or Scripted ... processors.
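For what it's worth, the core of such a scripted solution would likely be plain Avro schema resolution, independent of NiFi: reading with the full schema as the reader schema reorders fields and fills the missing ones with their declared defaults. A minimal sketch, with hypothetical file names ("full.avsc" standing in for "avro schema full"):

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroUpconvertSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder paths; full.avsc is the super-schema with defaults for the extra fields.
        Schema fullSchema = new Schema.Parser().parse(new File("full.avsc"));

        // Writer schema comes from the input file header; the full schema is the reader schema,
        // so Avro schema resolution handles field order and applies the declared defaults.
        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>(null, fullSchema);
        try (DataFileReader<GenericRecord> in =
                     new DataFileReader<>(new File("short.avro"), datumReader);
             DataFileWriter<GenericRecord> out =
                     new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(fullSchema))
                             .create(fullSchema, new File("full.avro"))) {
            for (GenericRecord record : in) {
                out.append(record); // each record now conforms to the full schema
            }
        }
    }
}
```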
Labels:
- Apache NiFi
05-30-2017
07:56 PM
2 Kudos
@Jitendra Yadav Is Atlas in the mix? Is any data tagged? This indicates that one of the base tables involved in the view join is passing a null value in a join field where a non-null value is expected. I've seen this type of behavior with tagged data on the base tables of a view. Is there more in the Hive log? The error excerpt above is just a symptom of passing a null object where one is required, and not very helpful on its own. To test this assumption, check whether you can query the base tables step by step with the same where clause, but without the join.
05-02-2017
03:56 AM
This is a KB at most. It is not even complete. Just read "More investigations to follow up ..."