Member since
03-16-2016
707
Posts
1753
Kudos Received
203
Solutions
04-15-2018
08:53 PM
13 Kudos
Abstract

The objective of this article is to cover key design and deployment considerations for a highly available Apache Kafka service.

Introduction

Through its distributed design, Kafka supports a large number of permanent or ad-hoc consumers, is highly available and resilient to node failures, and supports automatic recovery. These characteristics make Kafka an ideal fit for communication and integration between components of large-scale data systems. Kafka's own application-level resilience, however, is not enough for a true HA system: the consumers, the producers, and the network design also need to be HA. ZooKeeper nodes and brokers need to be able to communicate among themselves, and producers and consumers need to be able to reach the Kafka API. A fast car can reach its maximum speed only on a good track; a swamp is a swamp for a fast or a slow car.

Kafka Design

Kafka
utilizes Zookeeper for storing metadata information about the brokers, topics,
and partitions. Writes to Zookeeper are only performed on changes to the
membership of consumer groups or on changes to the Kafka cluster itself. This
amount of traffic is minimal, and it does not justify the use of a dedicated
Zookeeper ensemble for a single Kafka cluster. Many deployments will use a
single Zookeeper ensemble for multiple Kafka clusters. Prior
to Apache Kafka 0.9.0.0, consumers, in addition to the brokers, utilized
Zookeeper to directly store information about the composition of the consumer
group, what topics it was consuming, and to periodically commit offsets for
each partition being consumed (to enable failover between consumers in the
group). With version 0.9.0.0, a new consumer interface was introduced which
allows this to be managed directly with the Kafka brokers. However, there is a
concern with consumers and Zookeeper under certain configurations. Consumers
have a configurable choice to use either Zookeeper or Kafka brokers for
committing offsets, and they can also configure the interval between commits.
If the consumer uses Zookeeper for offsets, each consumer will perform a
Zookeeper write at every interval for every partition it consumes. A reasonable
interval for offset commits is 1 minute, as this is the period of time over
which a consumer group will read duplicate messages in the case of a consumer
failure. These commits can be a significant amount of Zookeeper traffic,
especially in a cluster with many consumers, and will need to be taken into
account. It may be necessary to use a longer commit interval if the ZooKeeper ensemble is not able to handle the traffic. However, it is recommended that consumers using the latest Kafka libraries use the Kafka brokers for committing offsets, removing the dependency on ZooKeeper.
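For consumers based on the new consumer API, committing offsets to the Kafka brokers is purely a client configuration concern. A minimal sketch (the broker addresses, group name, and interval value below are illustrative, not from the article):

# consumer configuration: offsets are committed to Kafka, not ZooKeeper
bootstrap.servers=broker1:9092,broker2:9092,broker3:9092
group.id=my-consumer-group
enable.auto.commit=true
auto.commit.interval.ms=60000   # commit once a minute, per the discussion above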
Outside of using a single ensemble for multiple Kafka clusters, it is not recommended to share the ensemble with other applications, if it can be avoided.

Kafka
is sensitive to Zookeeper latency and timeouts, and an interruption in
communications with the ensemble will cause the brokers to behave
unpredictably. This can easily cause multiple brokers to go offline at the same
time, should they lose Zookeeper connections, which will result in offline
partitions. It also puts stress on the cluster controller, which can show up as
subtle errors long after the interruption has passed, such as when trying to
perform a controlled shutdown of a broker. Other applications that can put
stress on the Zookeeper ensemble, either through heavy usage or improper
operations, should be segregated to their own ensemble.

Infrastructure and Network Design Challenges

The application layer sits at the top of the OSI stack, above the other six layers, including the Network, Data Link, and Physical layers. A power source that is not redundant can take out the rack switch, leaving none of the servers in that rack accessible. There are at least two issues with that implementation: a non-redundant power source for the switch, and a lack of redundancy for the rack switch itself. Add to that a single communication path from consumers and producers, and when exactly that path is down, your mission-critical system does not deliver the service that is perhaps your main stream of revenue. Add to that rack-awareness not being implemented, so all of a topic's partitions reside in the infrastructure hosted by the exact rack that failed due to a bad switch or a bad power source. Has this happened to you? What price did you have to pay when that bad day came?

Network Design and Deployment Considerations
Implementing a resilient Kafka cluster is similar to implementing a resilient HDFS cluster. Kafka or HDFS reliability mechanisms protect the data and the application, at most compensating for limited server failure, but not for network failure, especially when the infrastructure and network have many single points of failure coupled with a bad deployment of Kafka ZooKeepers, brokers, or replicated topics. Dual NICs, dedicated core switches, redundant top-of-rack switches, and balancing replicas across racks are common elements of a good network design for HA. Kafka, like HDFS, supports rack awareness. A single point of failure should not impact more than one ZooKeeper node, one broker, or one replica of any topic partition. ZooKeepers and brokers should have HA communication: if one path is down, another path is used. They should ideally be distributed across different racks. Network redundancy needs to provide an alternate access path to the brokers for producers and consumers when a failure arises. Also, as a good practice, brokers should be distributed across multiple racks.

Configure Network Correctly

Network
configuration with Kafka is similar to other distributed systems, with a few
caveats mentioned below. Infrastructure, whether on-premises or cloud, offers a
variety of different IP and DNS options. Choose an option that keeps
inter-broker network traffic on the private subnet and allows clients to
connect to the brokers. Inter-broker and client communication use the same
network interface and port. When
a broker is started, it registers its hostname with ZooKeeper. The producer
since Kafka 0.8.1 and the consumer since 0.9.0 are configured with a
bootstrapped (or “starting”) list of Kafka brokers. Prior versions were
configured with ZooKeeper. In both cases, the client makes a request (either to
a broker in the bootstrap list, or to ZooKeeper) to fetch all broker hostnames
and begin interacting with the cluster. Depending
on how the operating system’s hostname and network are configured, brokers on server
instances may register hostnames with ZooKeeper that aren’t reachable by
clients. The purpose of advertised.listeners is to address exactly this problem: the protocol, hostname, and port configured in advertised.listeners are registered with ZooKeeper instead of the operating system's hostname.
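As an illustration (the hostname and port below are placeholders, not values from the article), the relevant settings in a broker's server.properties might look like:

listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://broker1.example.com:9092

With this, the broker binds to all interfaces but registers the externally resolvable name broker1.example.com with ZooKeeper, so clients can reach it even if the OS hostname is not resolvable from their network.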
In a multi-datacenter architecture, careful consideration has to be made for MirrorMaker. Under the covers, MirrorMaker is simply a consumer and a producer joined together. If MirrorMaker is configured to consume from a single static IP, only the broker tied to that static IP will be reachable, but the other brokers in the source cluster won't be. MirrorMaker needs access to all brokers in the source and destination data centers, which in most cases is best implemented with a VPN between data centers.

Client
service discovery can be implemented in a number of different ways. One option
is to use HAProxy on each
client machine, proxying localhost requests to an available
broker. Synapse works
well for this. Another option is to use a Load Balancer appliance. In this
configuration, ensure the load balancer is not exposed to the public internet. Sessions and stickiness do not need to be configured, because Kafka clients only make a request to the load balancer at startup. A health check can be a ping or a telnet connection.
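To make the HAProxy option concrete, a minimal sketch on each client machine could look like the following (the broker hostnames and ports are placeholders I have assumed, not values from the article):

listen kafka
  bind 127.0.0.1:9092
  mode tcp
  option tcplog
  server broker1 broker1.example.com:9092 check
  server broker2 broker2.example.com:9092 check
  server broker3 broker3.example.com:9092 check

Clients then bootstrap against localhost:9092 and discover the rest of the cluster from whichever broker answers, which matches the note above that the proxy is only hit at startup.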
Distribute Kafka brokers across multiple racks

Kafka was designed to run within a single data center. As such, we discourage distributing brokers in a single cluster across multiple data centers. However, we recommend "stretching" brokers in a single cluster across multiple racks, whether in the same private network or over multiple private networks within the same data center. Ultimately, each enterprise is responsible for implementing resilient systems. A multi-rack cluster offers stronger fault tolerance, because a failed rack won't cause Kafka downtime. However, in this configuration, prior to Kafka 0.10, you must assign partition replicas manually to ensure that replicas for each partition are spread across racks (or availability zones). Replicas can be assigned manually either when a topic is created, or by using the kafka-reassign-partitions command line tool. Kafka 0.10 or later supports rack awareness, which makes spreading replicas across racks much easier to configure.
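For example, with rack awareness (Kafka 0.10+), each broker declares its rack in server.properties and replicas of newly created topics are then spread across racks (the rack name and topic below are illustrative):

# server.properties on a broker located in rack 1
broker.rack=rack-1

# create a replicated topic; replicas of each partition are spread across the declared racks
bin/kafka-topics.sh --create --zookeeper zk1:2181 --replication-factor 3 --partitions 6 --topic events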
At the minimum, for the HA of your Kafka-based service:
- Use dedicated "Top of Rack" (TOR) switches (these can be shared with the Hadoop cluster).
- Use dedicated core switching blades or switches.
- If deployed to a physical environment, make certain to place the cluster in a VLAN.
- Establish HA connections for the switches communicating between the racks.
- Implement the Kafka rack-awareness configuration. Note that it does not apply retroactively to existing ZooKeepers, brokers, or topics.
- Test your infrastructure periodically for resiliency and for the impact on Kafka service availability, e.g. disconnect one rack switch impacting a single broker. If the network design and topic replication were implemented correctly, producers and consumers should be able to work as usual; the only acceptable impact is from reduced capacity (producers may accumulate a processing lag, or consumers may see some performance degradation), but everything else should be 100% functional.
- Implement continuous monitoring of producers and consumers to detect failure events, with alerts and possibly automated corrective actions.
See below a simplified example of a logical architecture for HA. The diagram shows an HA implementation for an enterprise that uses 5 racks: 5 ZooKeepers distributed across multiple racks, multiple brokers distributed across those 5 racks, replicated topics, and HA producers and consumers. This is similar to an HDFS HA network design.

Distribute ZooKeeper nodes across multiple racks

ZooKeeper
should be distributed across multiple racks as well, to increase fault tolerance. To tolerate a rack failure, ZooKeeper must be running in at least three different racks. Networks can fail too: an enterprise can place ZooKeeper nodes in separate networks, but that comes at the price of latency; it is a conscious trade-off between performance and reliability, and separating the nodes across racks is a good compromise. In a configuration where three ZooKeepers are running in two racks, if the rack with two ZooKeepers fails, ZooKeeper will not have quorum and will not be available.

Monitor broker performance and terminate poorly performing brokers

Kafka broker performance can decrease unexpectedly over time for unknown reasons. It is a good practice to terminate and replace a broker if, for example, the 99th percentile of produce/fetch request latency is higher than is tolerable for your application.

Datacenter Layout

For
development systems, the physical location of the Kafka brokers within a
datacenter is not as much of a concern, as there is not as severe an impact if
the cluster is partially or completely unavailable for short periods of time. When serving production traffic, however, downtime means dollars lost, whether
through loss of services to users or loss of telemetry on what the users are
doing. This is when it becomes critical to configure replication within the
Kafka cluster, which is also when it is important to consider the physical
location of brokers in their racks in the datacenter. If a deployment model across multiple racks and a rack-awareness configuration were not implemented prior to deploying Kafka, expensive maintenance to move servers around may be needed. Without rack awareness, the Kafka broker does not consider rack placement when assigning new partitions to brokers, so everything created before rack-awareness was implemented could be brittle to failure. It cannot take into account that two brokers may be located in the same physical rack, or in the same network, and can therefore easily assign all replicas for a partition to brokers that share the same power and network connections in the same rack. Should that rack have a failure, these partitions would be offline and inaccessible to clients. In addition, it can result in additional lost data on recovery due to an unclean
leader election. The best practice is to have each Kafka broker in a cluster
installed in a different rack, or at the very least not share single points of
failure for infrastructure services such as power and network. This typically
means at least deploying the servers that will run brokers with dual power
connections (to two different circuits) and dual network switches (with a
bonded interface on the servers themselves to failover seamlessly). Even with
dual connections, there is a benefit to having brokers in completely separate
racks. From time to time, it may be necessary to perform physical maintenance
on a rack or cabinet that requires it to be offline (such as moving servers
around, or rewiring power connections).

Other Good Practices

I am sure I missed other good practices, so here are a few more:
- Produce to a replicated topic.
- Consume from a replicated topic (the consumers must be in the same consumer group).
- Each partition gets assigned one consumer from the group, so you need at least as many partitions as consumers.
- Resilient producers: spill data to local disk until Kafka is available again.
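To make the "produce to a replicated topic" practice concrete, durability-oriented producer and topic settings along these lines are common (the values are illustrative, not prescriptive):

# producer configuration
acks=all                                   # wait for all in-sync replicas to acknowledge
retries=10                                 # retry transient failures instead of dropping records
max.in.flight.requests.per.connection=1    # preserve ordering while retrying

# topic/broker configuration
min.insync.replicas=2                      # with replication factor 3, tolerate one replica being down

These settings trade some throughput and latency for durability; the right balance depends on the application.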
Other Useful References

Mirroring data between clusters: http://kafka.apache.org/documentation.html#basic_ops_mirror_maker
Data centers: http://kafka.apache.org/documentation.html#datacenters

Thanks

I'd like to thank all Apache Kafka contributors and the community that drives the innovation. I also want to thank everyone who contributes to improving the documentation and publications on how to design and deploy Kafka as a service in HA mode.
02-05-2018
04:12 AM
14 Kudos
The evolution of Apache NiFi from version 1.2 (included in HDF 3.0) to version 1.5 (included in the latest HDF release) is significant. I quite often find myself puzzled when asked to describe the differences between releases: just reading the release notes history at https://cwiki.apache.org/confluence/display/NIFI/Release+Notes and looking at the latest list of NiFi processors makes it non-trivial to determine which new processors were added. I put together a matrix which I hope will help developers take advantage of the new processors to improve old flows and develop new ones. In a nutshell, the main functionality added is around:
- AzureEventHub
- Kafka 0.11 and 1.0 processors
- Record processors
- RethinkDB
- Flatten Json
- Execute Spark Interactive
- Execute Groovy Script

My favorite improvements are around the record processors, flattening JSON, and executing Spark
interactively.

The following is the matrix table for NiFi 1.5, 1.4, 1.3, and 1.2, arranged alphabetically from A-D; a blank cell means the processor is not present in that release. See here for the Matrix Table from E-J. See here for the Matrix Table from K-Z.

| NiFi 1.5 | NiFi 1.4 | NiFi 1.3 | NiFi 1.2 |
| --- | --- | --- | --- |
| AttributeRollingWindow | AttributeRollingWindow | AttributeRollingWindow | AttributeRollingWindow |
| AttributesToJSON | AttributesToJSON | AttributesToJSON | AttributesToJSON |
| Base64EncodeContent | Base64EncodeContent | Base64EncodeContent | Base64EncodeContent |
| CaptureChangeMySQL | CaptureChangeMySQL | CaptureChangeMySQL | CaptureChangeMySQL |
| CompareFuzzyHash | CompareFuzzyHash | CompareFuzzyHash | CompareFuzzyHash |
| CompressContent | CompressContent | CompressContent | CompressContent |
| ConnectWebSocket | ConnectWebSocket | ConnectWebSocket | ConnectWebSocket |
| ConsumeAMQP | ConsumeAMQP | ConsumeAMQP | ConsumeAMQP |
| ConsumeAzureEventHub | | | |
| ConsumeEWS | ConsumeEWS | ConsumeEWS | ConsumeEWS |
| ConsumeIMAP | ConsumeIMAP | ConsumeIMAP | ConsumeIMAP |
| ConsumeJMS | ConsumeJMS | ConsumeJMS | ConsumeJMS |
| ConsumeKafka | ConsumeKafka | ConsumeKafka | ConsumeKafka |
| ConsumeKafka_0_10 | ConsumeKafka_0_10 | ConsumeKafka_0_10 | ConsumeKafka_0_10 |
| ConsumeKafka_0_11 | ConsumeKafka_0_11 | | |
| ConsumeKafkaRecord_0_10 | ConsumeKafkaRecord_0_10 | ConsumeKafkaRecord_0_10 | ConsumeKafkaRecord_0_10 |
| ConsumeKafkaRecord_0_11 | ConsumeKafkaRecord_0_11 | | |
| ConsumeKafka_1_0 | | | |
| ConsumeKafkaRecord_1_0 | | | |
| ConsumeMQTT | ConsumeMQTT | ConsumeMQTT | ConsumeMQTT |
| ConsumePOP3 | ConsumePOP3 | ConsumePOP3 | ConsumePOP3 |
| ConsumeWindowsEventLog | ConsumeWindowsEventLog | ConsumeWindowsEventLog | ConsumeWindowsEventLog |
| ControlRate | ControlRate | ControlRate | ControlRate |
| ConvertAvroSchema | ConvertAvroSchema | ConvertAvroSchema | ConvertAvroSchema |
| ConvertAvroToJSON | ConvertAvroToJSON | ConvertAvroToJSON | ConvertAvroToJSON |
| ConvertAvroToORC | ConvertAvroToORC | ConvertAvroToORC | ConvertAvroToORC |
| ConvertCharacterSet | ConvertCharacterSet | ConvertCharacterSet | ConvertCharacterSet |
| ConvertCSVToAvro | ConvertCSVToAvro | ConvertCSVToAvro | ConvertCSVToAvro |
| ConvertExcelToCSVProcessor | ConvertExcelToCSVProcessor | ConvertExcelToCSVProcessor | ConvertExcelToCSVProcessor |
| ConvertJSONToAvro | ConvertJSONToAvro | ConvertJSONToAvro | ConvertJSONToAvro |
| ConvertJSONToSQL | ConvertJSONToSQL | ConvertJSONToSQL | ConvertJSONToSQL |
| ConvertRecord | ConvertRecord | ConvertRecord | ConvertRecord |
| CreateHadoopSequenceFile | CreateHadoopSequenceFile | CreateHadoopSequenceFile | CreateHadoopSequenceFile |
| CountText | | | |
| DebugFlow | DebugFlow | DebugFlow | DebugFlow |
| DeleteDynamoDB | DeleteDynamoDB | DeleteDynamoDB | DeleteDynamoDB |
| DeleteGCSObject | DeleteGCSObject | DeleteGCSObject | DeleteGCSObject |
| DeleteHDFS | DeleteHDFS | DeleteHDFS | DeleteHDFS |
| DeleteElasticsearch5 | DeleteElasticsearch5 | | |
| DeleteRethinkDB | DeleteRethinkDB | | |
| DeleteS3Object | DeleteS3Object | DeleteS3Object | DeleteS3Object |
| DeleteMongo | | | |
| DeleteSQS | DeleteSQS | DeleteSQS | DeleteSQS |
| DetectDuplicate | DetectDuplicate | DetectDuplicate | DetectDuplicate |
| DistributeLoad | DistributeLoad | DistributeLoad | DistributeLoad |
| DuplicateFlowFile | DuplicateFlowFile | DuplicateFlowFile | DuplicateFlowFile |
10-06-2017
07:20 PM
6 Kudos
Introduction

This is a continuation of an article I wrote about 1 year ago: https://community.hortonworks.com/articles/60580/jmeter-setup-for-hive-load-testing-draft.html (see also https://www.blazemeter.com/blog/windows-authentication-apache-jmeter).

Steps

1) Enable Kerberos on your cluster

Perform all steps specified here: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.2/bk_security/content/configuring_amb_hdp_for_kerberos.html and connect successfully to the Hive service via the command line using your user keytab. That implies a valid ticket.

2) Install JMeter

See the previous article mentioned in the Introduction.

3) Set the Hive user keytab in JMETER_HOME/bin/jaas.conf

Your jaas.conf should look something like this:

JMeter {
com.sun.security.auth.module.Krb5LoginModule required
useTicketCache=false
doNotPrompt=true
useKeyTab=true
keyTab="/etc/security/keytabs/hive.service.keytab"
principal="hive/server.example.com@EXAMPLE.COM"
debug=true;
};

4) JMeter Setup

There are 2 files under the /bin folder of the JMeter installation which are used for Kerberos configuration:

krb5.conf - a file in .ini format which contains the Kerberos configuration details
jaas.conf - a file which holds the configuration details of the Java Authentication and Authorization Service

These files aren't used by default, so you have to tell JMeter where they are via system properties such as:

-Djava.security.krb5.conf=krb5.conf
-Djava.security.auth.login.config=jaas.conf

Alternatively, you can add the next two lines to the system.properties file located in the same /bin folder:

java.security.krb5.conf=krb5.conf
java.security.auth.login.config=jaas.conf

I suggest using full paths to the files.

5) Manage Issues

If you encounter any issues:

- Enable debug by adding the following to your command:

-Dsun.security.krb5.debug=true
-Djava.security.debug=gssloginconfig,configfile,configparser,logincontext

- Check jmeter.log to see whether all properties are set as expected and map to existing file paths.

6) Turn off Subject Credentials

-Djavax.security.auth.useSubjectCredsOnly=false

7) Example of JMeter Command

JVM_ARGS="-Xms1024m
-Xmx1024m" bin/jmeter -Dsun.security.krb5.debug=true
-Djavax.security.auth.useSubjectCredsOnly=false
-Djava.security.debug=gssloginconfig,configfile,configparser,logincontext
-Djava.security.krb5.conf=/path/to/krb5.conf
-Djava.security.auth.login.config=/path/to/jaas.conf -n -t t1.jmx -l results -e
-o output This could be simplified if you add those two lines mentioned earlier to be added to system.properties.
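For completeness, since the setup above references krb5.conf without showing one, a minimal krb5.conf typically looks like this (the realm and KDC host are placeholders):

[libdefaults]
  default_realm = EXAMPLE.COM

[realms]
  EXAMPLE.COM = {
    kdc = kdc.example.com
    admin_server = kdc.example.com
  }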
10-06-2017
06:21 PM
@uday kv See this new article: https://community.hortonworks.com/articles/141035/jmeter-kerberos-setup-for-hive-load-testing.html
07-26-2017
05:34 PM
10 Kudos
Introduction GeoMesa is an Apache licensed open source suite of tools that
enables large-scale geospatial analytics on cloud and distributed computing
systems, letting you manage and analyze the huge spatio-temporal datasets that
IoT, social media, tracking, and mobile phone applications seek to take
advantage of today. GeoMesa does this by providing spatio-temporal data persistence
on top of the Accumulo, HBase, and Cassandra distributed column-oriented
databases for massive storage of point, line, and polygon data. It allows rapid
access to this data via queries that take full advantage of geographical
properties to specify distance and area. GeoMesa also provides support for near
real time stream processing of spatio-temporal data by layering spatial
semantics on top of the Apache Kafka messaging system.

GeoMesa features include the ability to:
- Store gigabytes to petabytes of spatial data (tens of billions of points or more)
- Serve up tens of millions of points in seconds
- Ingest data faster than 10,000 records per second per node
- Scale horizontally easily (add more servers to add more capacity)
- Support Spark analytics
- Drive a map through GeoServer or other OGC clients

Installation

GeoMesa supports traditional HBase installations as well as
HBase running on Amazon’s EMR and Hortonworks’ Data
Platform (HDP).
For instructions on bootstrapping an EMR cluster, please read this tutorial: Bootstrapping GeoMesa HBase on AWS S3.

Tutorial Overview

The code in this tutorial only does a few things:
- Establishes a new (static) SimpleFeatureType
- Prepares the HBase table to store this type of data
- Creates 1000 example SimpleFeatures
- Writes these SimpleFeatures to the HBase table
- Queries for a given geographic rectangle, time range, and attribute filter, and writes out the entries in the result set

Prerequisites
- Java Development Kit 1.8
- Apache Maven
- a GitHub client
- HBase 1.2.x (optional)
- GeoServer 2.9.1 (optional)

An existing HBase 1.1.x installation is helpful but not necessary. The tutorial described will work either with an existing HBase server or by downloading the HBase binary distribution and running it in "standalone" mode (described below). GeoServer is only required for visualizing the HBase data. Setting up GeoServer is beyond the scope of this tutorial.

Download and Build the Tutorial

Clone the
geomesa-tutorials distribution from GitHub: $ git clone https://github.com/geomesa/geomesa-tutorials.git
$ cd geomesa-tutorials The pom.xml file contains an explicit list of dependent libraries that
will be bundled together into the final tutorial. You should confirm that the
versions of HBase and Hadoop match what you are running; if it does not match,
change the values of the hbase.version and hbase.hadoop.version properties. The version of GeoMesa that this tutorial
targets matches the project version of the pom.xml . (Note that this tutorial has been
tested with GeoMesa 1.2.2 or later). The only reason these libraries are bundled into the final JAR is that it is easier for most people than setting the classpath when running the tutorial. If you would rather not bundle these dependencies, mark them as provided in the POM and update your classpath as appropriate. GeoMesa's HBaseDataStore searches for a file called hbase-site.xml, which among other things configures the ZooKeeper host(s) and
port. If this file is not present on the classpath, the hbase-default.xml provided by hbase-common sets the default zookeeper quorum
to "localhost" and port to 2181, which is what is used by the
standalone HBase described in "Setting up HBase in standalone mode"
above. If you have an existing HBase installation, you should copy your hbase-site.xml file into geomesa-quickstart-hbase/src/main/resources (or otherwise add it to the classpath when you run the tutorial).
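For illustration, a minimal hbase-site.xml pointing the quickstart at an existing HBase/ZooKeeper could look like this (the hostname is a placeholder):

<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>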
To build the tutorial code:

$ cd geomesa-quickstart-hbase
$ mvn clean install

When this is complete,
it should have built a JAR file that contains all of the code you need to run
the tutorial.

Run the Tutorial

First, make sure that the hbase.table.sanity.check property is set to false in hbase-site.xml. On the command line, run:

$ java -cp target/geomesa-quickstart-hbase-$VERSION.jar com.example.geomesa.hbase.HBaseQuickStart --bigtable_table_name geomesa
The only argument passed
is the name of the HBase table where GeoMesa will store the feature type
information. It will also create a table called <tablename>_<featuretype>_z3 which will store the
Z3-indexed features. In our specific case, the table name will be geomesa_QuickStart_z3. You should see output
similar to the following (not including some of Maven's output and log4j's
warnings), which lists the features that match the specified query in the
tutorial:

Creating feature-type (schema): QuickStart
Creating new features
Inserting new features
Submitting query
1. Bierce|676|Fri Jul 18 08:22:03 EDT 2014|POINT (-78.08495724535888 37.590866849120395)|null
2. Bierce|190|Sat Jul 26 19:06:19 EDT 2014|POINT (-78.1159944062711 37.64226959044015)|null
3. Bierce|550|Mon Aug 04 08:27:52 EDT 2014|POINT (-78.01884511971093 37.68814732634964)|null
4. Bierce|307|Tue Sep 09 11:23:22 EDT 2014|POINT (-78.18782181976381 37.6444865782879)|null
5. Bierce|781|Wed Sep 10 01:14:16 EDT 2014|POINT (-78.0250604717695 37.58285696304815)|null
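The query that the quickstart submits combines a bounding box, a time range, and an attribute filter. Expressed as (E)CQL against the schema above, a filter of that shape looks like the following (the exact values are illustrative, not the ones hard-coded in the quickstart):

BBOX(Where, -78.5, 37.5, -77.5, 38.1) AND When DURING 2014-07-01T00:00:00Z/2014-09-30T23:59:59Z AND (Who = 'Bierce')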
To see how the data is
stored in HBase, use the HBase shell:

$ /path/to/hbase-1.2.3/bin/hbase shell

The type information is in the geomesa table (or whatever name you specified on the command line):

hbase> scan 'geomesa'
ROW COLUMN+CELL
QuickStart column=M:schema, timestamp=1463593804724, value=Who:String,What:Long,When:Date,*Where:Point:s
rid=4326,Why:String
The features are stored
in <tablename>_<featuretype>_z3 ( geomesa_QuickStart_z3 in this example): hbase> scan 'geomesa_QuickStart_z3', { LIMIT => 3 }
ROW COLUMN+CELL \x08\xF7\x0F#\x83\x91\xAE\xA2\x column=D:\x0F#\x83\x91\xAE\xA2\xA8PObservation.452, timestamp=1463593805801, value=\x02\x00\x A8P 00\x00@Observation.45\xB2Clemen\xF3\x01\x00\x00\x00\x00\x00\x00\x01\xC4\x01\x00\x00\x01CM8\x0 E\xA0\x01\x01\xC0S!\x93\xBCSg\x00\xC0CG\xBF$\x0DO\x7F\x80\x14\x1B$-? \x08\xF8\x06\x03\x19\xDFf\xA3p\ column=D:\x06\x03\x19\xDFf\xA3p\x0CObservation.362, timestamp=1463593805680, value=\x02\x00\x x0C 00\x00@Observation.36\xB2Clemen\xF3\x01\x00\x00\x00\x00\x00\x00\x01j\x01\x00\x00\x01CQ\x17wh\ x01\x01\xC0S\x05\xA5b\xD49"\xC0B\x88*~\xD1\xA0}\x80\x14\x1B$-? \x08\xF8\x06\x07\x19S\xD0\xA21> column=D:\x06\x07\x19S\xD0\xA21>Observation.35, timestamp=1463593805664, value=\x02\x00\x00\x 00?Observation.3\xB5Clemen\xF3\x01\x00\x00\x00\x00\x00\x00\x00#\x01\x00\x00\x01CS?`x\x01\x01\
xC0S_\xA7+G\xADH\xC0B\x90\xEB\xF7`\xC2T\x80\x13\x1A#,>
Recommendations There are many reasons that GeoMesa can provide the best solution to your spatio-temporal database needs:
You have Big Spatial Data sets and are reaching performance limitations of relational database systems. Perhaps you are looking at sharding strategies and wondering if now is the time to look for a new storage solution. You have very high-velocity data and need high read and write speeds. Your analytics operate in the cloud, perhaps using Spark, and you want to enable spatial analytics. You are looking for a supported, open-source alternative to expensive proprietary solutions. You are looking for a Platform as a Service (PaaS) database where you can store Big Spatial Data. You want to filter data using the rich Common Query Language (CQL) defined by the OGC. Reference: https://github.com/geomesa/geomesa-tutorials/tree/master/geomesa-quickstart-hbase
07-19-2017
09:26 PM
9 Kudos
Overview

The following versions of Apache Kafka have been incorporated in HDP 2.2.0 to 2.6.1: 0.8.1, 0.8.2, 0.9.0, 0.10.0, 0.10.1. Apache Kafka is now at 0.11. Hortonworks is working to make Kafka easier for enterprises to use. New focus areas include the creation of a Kafka Admin Panel to create/delete topics and manage user permissions, easier and safer distribution of security tokens, and support for multiple ways of publishing/consuming data via a Kafka REST server/API. Here are a few areas of strong contribution:

Operations:
- Rack awareness for increased resilience and availability, such that replicas are isolated and guaranteed to span multiple racks or availability zones.
- Automated replica leader election for an automated, even distribution of leaders in a cluster, detecting uneven distribution (some brokers serving more data than others) and making adjustments.
- Message timestamps, so every message in Kafka now has a timestamp field that indicates the time at which the message was produced.
- SASL improvements, including external authentication servers and support for multiple types of SASL authentication on one server.
- Ambari Views for visualization of Kafka operational metrics.

Security:
- Kafka security encompasses multiple needs: the need to encrypt the data flowing through Kafka and to prevent rogue agents from publishing data to Kafka, as well as the ability to manage access to specific topics on an individual or group level. As a result, the latest updates in Kafka support wire encryption via SSL, Kerberos-based authentication, and granular authorization options via Apache Ranger or another pluggable authorization system.

This article lists below the new features beyond the Hortonworks contribution. At a high level, the following have been added by the overall community:
- Kafka Streams API
- Kafka Connect API
- New unified Consumer API
- Transport encryption using TLS/SSL
- Kerberos/SASL authentication support
- Access Control Lists
- Timestamps on messages
- Reduced client dependence on ZooKeeper (offsets stored in a Kafka topic)
- Client interceptors

New Features Since HDP 2.2

Here is the list of NEW FEATURES as they have been included in the release notes.

Kafka 0.8.1:
https://archive.apache.org/dist/kafka/0.8.1/RELEASE_NOTES.html
[KAFKA-330] -
Add delete topic support [KAFKA-554] -
Move all per-topic configuration into ZK and add to the CreateTopicCommand [KAFKA-615] -
Avoid fsync on log segment roll [KAFKA-657] -
Add an API to commit offsets [KAFKA-925] -
Add optional partition key override in producer [KAFKA-1092] -
Add server config parameter to separate bind address and ZK hostname [KAFKA-1117] -
tool for checking the consistency among replicas Kafka 0.8.2: https://archive.apache.org/dist/kafka/0.8.2.0/RELEASE_NOTES.html
[KAFKA-1000] -
Inbuilt consumer offset management feature for kakfa [KAFKA-1227] -
Code dump of new producer [KAFKA-1384] -
Log Broker state [KAFKA-1443] -
Add delete topic to topic commands and update DeleteTopicCommand [KAFKA-1512] -
Limit the maximum number of connections per ip address [KAFKA-1597] -
New metrics: ResponseQueueSize and BeingSentResponses [KAFKA-1784] -
Implement a ConsumerOffsetClient library Kafka 0.9.0: https://archive.apache.org/dist/kafka/0.9.0.0/RELEASE_NOTES.html
[KAFKA-1499] -
Broker-side compression configuration [KAFKA-1785] -
Consumer offset checker should show the offset manager and offsets
partition [KAFKA-2120] -
Add a request timeout to NetworkClient [KAFKA-2187] -
Introduce merge-kafka-pr.py script Kafka 0.10.0: https://archive.apache.org/dist/kafka/0.10.0.0/RELEASE_NOTES.html
[KAFKA-2832] -
support exclude.internal.topics in new consumer [KAFKA-3046] -
add ByteBuffer Serializer&Deserializer [KAFKA-3490] -
Multiple version support for ducktape performance tests Kafka 0.10.0.1: https://archive.apache.org/dist/kafka/0.10.0.1/RELEASE_NOTES.html
[KAFKA-3538] -
Abstract the creation/retrieval of Producer for stream sinks for unit
testing Kafka 0.10.1: https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html
[KAFKA-1464] - Add a throttling option to the
Kafka replication tool [KAFKA-3176] - Allow console consumer to
consume from particular partitions when new consumer is used. [KAFKA-3492] - support quota based on
authenticated user name [KAFKA-3776] - Unify store and downstream
caching in streams [KAFKA-3858] - Add functions to print stream
topologies [KAFKA-3909] - Queryable state for Kafka
Streams [KAFKA-4015] - Change cleanup.policy config
to accept a list of valid policies [KAFKA-4093] - Cluster id Final Notes Apache Kafka shines in use cases like: replacement for a more traditional message broker user activity tracking pipeline as a set of real-time publish-subscribe feeds (the original use case) operational monitoring data log aggregation stream processing event sourcing commit log Apache Kafka continues to be a dynamic and extremely popular project with more and more adoption.
05-02-2017
03:56 AM
This is a KB at most. It is not even complete. Just read "More investigations to follow up ..."
04-18-2017
03:45 PM
@rbiswas What about adding a new host and assigning it at the same time to a specific rack? I'd like to avoid adding the host (data node) and then having to set the rack and restart again. Is there a way via the Ambari UI?
04-03-2017
12:47 AM
@Rohan Pednekar This is true also for any scan that requires evaluation before retrieving anything. I am not sure why this would be an HCC article. This is merely one paragraph of what could have been a well-written article about tips and tricks when dealing with HBase. I recommend looking at some of the featured articles in HCC and writing at that level of quality. The section you published could be very useful in a larger article. Thanks for your efforts.
03-06-2017
10:16 PM
13 Kudos
Demonstrate how easy it is to create a simple data flow with NiFi, stream it to Hive, and visualize it via Zeppelin.

Pre-requisites
- Apache NiFi 1.1.0.2.1.0.0-165, included with Hortonworks DataFlow 2.1.0
- Apache Zeppelin 0.6.0.2.5.0.0-1245, included with Hortonworks Data Platform 2.5.0
- My repo for the Apache NiFi "CSVToHive.xml" template, the customer demographics data (customer_demographics.header, customer_demographics.csv), the "Customer Demographics.json" Apache Zeppelin notebook, and the customer_demographics_orc_table_ddl.hql database and table DDLs
- Apache Hive 1.2.1, included with HDP 2.5.0
- Hive configured to support ACID transactions, with the demo database and customer_demographics table created using customer_demographics_orc_table_ddl.hql

Steps

Import NiFi Template

Assuming NiFi is started and the UI is available at <NiFiMasterHostName>:8086/nifi, import the template CSVToHive.xml (screen-shot-2017-03-06-at-74106-pm.png).

Create Data Folder and Upload Data Files

In your home directory, create /home/username/customer_demographics and upload the data files specified above. Grant your NiFi user appropriate access so it can read and process them via the GetFile processor. Change the directory path specified in the GetFile processor to match your path. Also, change the "Keep Source File" property of the GetFile processor to false, so that each file is processed once and then deleted (for test reasons, I kept it as true). You will also have to adjust the Hive Metastore URI to match your environment's host name.

Import Zeppelin Notebook

Execute NiFi Flow

Start all processors, or start one processor at a time and follow the flow. The outcome is that each record of your CSV file is posted to the Hive demo.customer_demographics table via the Hive Streaming API. As you noticed from the DDL, the Hive table is transactional; enabling the global ACID feature of Hive and creating the table as transactional and bucketed is a requirement for this to work (see the DDL sketch at the end of this article). Also, the data format required by the PutHiveStreaming processor is Avro, so we converted the CSV to Avro. At one of the intermediary steps we could either infer the Avro schema or define the CSV file header; the latter option was selected for this demo.

Execute Zeppelin Notebook

During the demo you can switch from NiFi to Zeppelin, showing how the data is posted to Hive and how it is reflected in Zeppelin by re-executing the HiveQL blocks. The markdown (md) and shell (sh) blocks were included only for demonstration purposes, showing how a data engineer, a data analyst, or a data scientist can benefit from the use of Zeppelin.
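As a sketch of the kind of DDL referenced above, a transactional, bucketed ORC table suitable for PutHiveStreaming could be created as follows (the column list is a placeholder, not the actual customer_demographics schema from customer_demographics_orc_table_ddl.hql):

CREATE DATABASE IF NOT EXISTS demo;
CREATE TABLE IF NOT EXISTS demo.customer_demographics (
  cd_demo_sk INT,
  cd_gender STRING,
  cd_education_status STRING
)
CLUSTERED BY (cd_demo_sk) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

Remember that the global Hive ACID settings (hive.support.concurrency, hive.txn.manager, etc.) must also be enabled for streaming ingest to work.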