Member since: 11-19-2015
Posts: 158
Kudos Received: 25
Solutions: 21

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 11724 | 09-01-2018 01:27 AM
 | 1096 | 09-01-2018 01:18 AM
 | 3666 | 08-20-2018 09:39 PM
 | 484 | 07-20-2018 04:51 PM
 | 1461 | 07-16-2018 09:41 PM
03-18-2019
07:08 PM
@Junfeng Chen, as mentioned, it depends on your use of it. It will run okay in most deployment patterns, and it can run fine in VMs, but of course having dedicated hardware is always preferred.
... View more
03-18-2019
07:05 PM
@Gaurang Shah If you see broker metrics, then that is where you exposed the port to. Kafka Connect is meant to be run on separate machines from the brokers, but if you are running it on the same ones, then you must expose and monitor two different JMX ports.
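As a rough sketch (ports and file paths below are only placeholders, adjust for your HDP layout), each process gets its own port by exporting JMX_PORT before it starts:
export JMX_PORT=9990        # example port for the broker
kafka-server-start.sh server.properties
export JMX_PORT=9991        # a different example port for Connect on the same host
connect-distributed.sh connect-distributed.properties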
... View more
03-11-2019
07:35 PM
"Can they" - Yes. "Should they" - I would say no. Kafka is very memory and disk sensitive. Depending on your use of it, it could even use more I/O than the combination of the DataNode and NodeManager on the same machine. Personally, I would recommend installing Kafka brokers on dedicated hardware, even separate from the Zookeeper servers it needs, if at all possible. The Spark executors do not need to be running on the Kafka brokers, they should work fine pulling remotely from the YARN NodeManagers.
... View more
03-11-2019
07:31 PM
JMX is not exposed in a property file. It is toggled from two environment variables. Confluent and Apache Kafka installations share the same variables for this. Source code here - https://github.com/apache/kafka/blob/trunk/bin/kafka-run-class.sh#L166-L174 Basically, if you export JMX_PORT to a valid port number, it will open that port for JMX monitoring of any Kafka-related script that you run. The recommendation, however, is to use a cluster of machines running Connect Distributed, as it is a long-running process, and you can scale your metrics collection using tools like Prometheus JMX Exporter combined with a Grafana server for dashboarding.
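For example, a minimal sketch (the port, agent JAR, and YAML paths are placeholders for wherever you install the exporter):
export JMX_PORT=9999
connect-distributed.sh connect-distributed.properties
Or, to scrape directly with Prometheus, attach the JMX Exporter as a Java agent via KAFKA_OPTS instead:
export KAFKA_OPTS="-javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7071:/opt/jmx_exporter/kafka-connect.yml"
connect-distributed.sh connect-distributed.properties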
... View more
10-09-2018
07:35 PM
btw, these questions are copied from https://data-flair.training/forums/topic/how-client-can-interact-with-hive
... View more
10-09-2018
07:30 PM
Can you please verify / show the Kafka server properties? When you run Kafka within a container, you need to make sure that clients are getting the external address from Zookeeper. A simple port forward is not enough. https://rmoff.net/2018/08/02/kafka-listeners-explained/
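As a sketch of the pattern from that post (hostnames and ports below are placeholders, not your actual values), the broker needs separate internal and external listeners in server.properties:
listeners=INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:9094
advertised.listeners=INTERNAL://kafka-container:9092,EXTERNAL://host.example.com:9094
listener.security.protocol.map=INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
inter.broker.listener.name=INTERNAL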
... View more
10-06-2018
06:41 PM
It is a Java class for reading Hadoop SequenceFiles: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/SequenceFileInputFormat.html
... View more
10-03-2018
10:36 PM
Hello @nicole wells Please find the question that you copied (and my answer) here https://stackoverflow.com/a/51687883/2308683
... View more
09-30-2018
01:31 AM
You only need to use a Schema Registry if you plan on using Confluent's AvroConverter. Note: NiFi can also be used to do CDC from MySQL https://community.hortonworks.com/articles/113941/change-data-capture-cdc-with-apache-nifi-version-1-1.html
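If you do use the AvroConverter, a minimal sketch of the Connect worker settings (the Schema Registry address is a placeholder):
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://schema-registry:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://schema-registry:8081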
... View more
09-30-2018
01:27 AM
When brokers terminate, they remove themselves from Zookeeper.
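The broker IDs are registered as ephemeral znodes, so you can watch them disappear with the bundled shell (the Zookeeper address is a placeholder):
zookeeper-shell.sh zk-host:2181 ls /brokers/ids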
... View more
09-24-2018
06:35 PM
Hi @Zach, please see my answer on StackOverflow here: https://stackoverflow.com/a/52266219/2308683 Burrow does essentially the same thing, but in Golang. How you read the data and perform the lag calculations also depends on what is currently being consumed, however, which is not stored immediately within the offsets topic.
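For a quick point-in-time view of lag, the tool shipped with Kafka also works (broker address and group name are placeholders):
kafka-consumer-groups.sh --bootstrap-server broker:9092 --describe --group my-consumer-group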
... View more
09-17-2018
07:48 PM
@ssarkar Is it not possible to use Ambari to install a separate Zookeeper host group, then configure a Kafka host group to use that secondary Zookeeper quorum?
... View more
09-17-2018
07:45 PM
These are spam accounts, by the way. Look at all the "answers" from the other users for every question, and they all link back to dataflair's website.
... View more
09-10-2018
06:55 PM
If you are running Kafka 0.10 or newer, connect-distributed.sh exists somewhere under /usr/hdp/current/kafka already. You can run that process on multiple machines to create a Kafka Connect cluster.
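Every worker that shares the same group.id joins the same Connect cluster, so a minimal sketch of connect-distributed.properties looks something like this (broker addresses and topic names are placeholders):
bootstrap.servers=broker1:9092,broker2:9092
group.id=connect-cluster-1
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter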
... View more
09-10-2018
06:46 PM
I think you are asking about adding directories to DataNodes.
dfs.datanode.data.dir in the hdfs-site.xml file is a comma-delimited list of directories where the DataNode will store blocks for HDFS. See also: https://community.hortonworks.com/questions/89786/file-uri-required-for-dfsdatanodedatadir.html
Property | Default | Description
---|---|---
dfs.datanode.data.dir | file://${hadoop.tmp.dir}/dfs/data | Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
Otherwise, I'm afraid your question doesn't make sense other than running the hdfs dfs -mkdir command to "add a new directory in HDFS"
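For reference, a sketch of both interpretations (paths below are only examples): adding a second local storage directory for the DataNode is just another comma-delimited entry, while creating a directory inside HDFS is a one-liner.
dfs.datanode.data.dir=file:///grid/0/hadoop/hdfs/data,file:///grid/1/hadoop/hdfs/data
hdfs dfs -mkdir -p /data/new_directory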
... View more
09-04-2018
06:58 PM
@Manish Tiwari, perhaps you can look at https://docs.hortonworks.com/HDPDocuments/Cloudbreak/Cloudbreak-2.7.1/content/data-lake/index.html Otherwise, you can search https://docs.hortonworks.com/ for the keywords you are looking for.
... View more
09-01-2018
01:36 AM
If you expose Kafka via HTTP, then I don't see the downside of exposing Kafka itself. If you did enable HTTPS on the "Kafka REST API" (via Knox, for example: https://knox.apache.org/books/knox-1-1-0/user-guide.html#Kafka), then you should be enabling TLS/SSL on Kafka as well, in which case certificates would be needed to make external clients secure. Kafka should realistically not be treated as a "walled off" service behind the Hadoop network, and you cannot proxy requests through another server without manually setting up that TLS tunnel yourself. Kafka is a common access point for getting data into Hadoop as well, so it should be treated as a first-class "edge ingestion layer" itself. You should take similar care to set up authentication and access rules around every single broker, just like you've done for the Hadoop "edge node". You could alternatively use NiFi to listen on some other random port and forward to a Kafka producer processor; then someone scanning open ports wouldn't be able to detect that it's Kafka responding, it would be NiFi, though you would still have the same problem that people can send arbitrary messages to that socket if it doesn't require authentication.
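If you do enable TLS on the brokers, a rough sketch of the server.properties pieces involved (keystore paths and passwords are placeholders):
listeners=SSL://0.0.0.0:9093
advertised.listeners=SSL://broker1.example.com:9093
ssl.keystore.location=/etc/security/kafka.server.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit
ssl.truststore.location=/etc/security/kafka.server.truststore.jks
ssl.truststore.password=changeit
security.inter.broker.protocol=SSL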
... View more
09-01-2018
01:27 AM
Nagios / OpsView / Sensu are popular options I've seen. StatsD / CollectD / MetricBeat are daemon metric collectors that run on each server (MetricBeat is somewhat tied to an Elasticsearch cluster, though). Prometheus is a popular option nowadays that scrapes metrics exposed by each local service. I have played around a bit with netdata, though I'm not sure if it can be applied to Hadoop monitoring use cases. DataDog is a vendor that offers lots of integrations such as Hadoop, YARN, Kafka, Zookeeper, etc. Realistically, you need some JMX + system monitoring tool, and a bunch exist.
... View more
09-01-2018
01:18 AM
1 Kudo
A Data Lake is not tied to a platform or technology. Hadoop is not a requirement for a data lake either. IMO, a "data lake project" should not be a project description or the end goal; you can say you got your data from "source X", using "code Y", transformed and analyzed using "framework Z", but the combinations of tools on the market that support such statements are so broad that it really depends on what business use cases you are trying to solve. For example, S3 is replaceable with HDFS or GCS or Azure Storage. Redshift is replaceable with Postgres (and you really should use Athena anyway if the data you want to query is in S3, where Athena is replaceable by PrestoDB), and those can be compared to Google BigQuery. My suggestion would be not to tie yourself to a certain toolset, but if you are in AWS, their own documentation pages are very extensive. Since you are not asking a Hortonworks-specific question, I'm not sure what information you are looking for from this site.
... View more
08-24-2018
06:30 PM
You can enable JMX for metrics + Grafana for visualization, then Ambari Infra for log collection. However, you will not have visibility into consumer lag like Confluent Control Center offers, and you will need to find an external tool to do that for you, such as LinkedIn Burrow. If you are not satisfied with that, Confluent Control Center can be added to an HDP cluster with manual setup: https://docs.confluent.io/current/control-center/docs/installation/install-apache-kafka.html You will need to copy the Confluent Metrics Reporter JARs from the Confluent Enterprise download over onto your HDP Kafka nodes under /usr/hdp/current/kafka
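Once those JARs are on the broker classpath, the reporter is turned on with broker properties roughly like the following, per the linked Confluent docs (the bootstrap address is a placeholder):
metric.reporters=io.confluent.metrics.reporter.ConfluentMetricsReporter
confluent.metrics.reporter.bootstrap.servers=broker1:9092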
... View more
08-21-2018
08:07 PM
This is a very broad topic, and it might make sense to use a vendor-supported tool like EMR or Qubole. Cloudbreak and Hortonworks themselves don't offer very well-defined backup tools. For example, Hadoop DistCp, mysqldump/pg_dump, and Hive/HBase export only get you so far.
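As an illustration of how far the built-in tools go, a DistCp copy between clusters is roughly (NameNode addresses and paths are placeholders):
hadoop distcp hdfs://active-nn:8020/apps/hive/warehouse hdfs://backup-nn:8020/backups/hive/warehouse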
... View more
08-21-2018
08:05 PM
Hive Streaming tables need to be ORC, right? Do the Avro records automatically get converted?
... View more
08-21-2018
07:53 PM
@Shobhna Dhami It is not listed after "available connectors", so you have not set up the classpath correctly, as I linked to. In Kafka 0.10, you need to run
$ export CLASSPATH=/path/to/extracted-debezium-folder/*.jar # Replace with the real address
$ connect-distributed ... # Start the Connect server
You can also perform a request to the /connector-plugins URL before sending any configuration to verify the Debezium connector was correctly installed.
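For instance, a quick check against the worker's REST API (localhost:8083 is the default listen address, adjust for your setup):
curl http://localhost:8083/connector-plugins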
... View more
08-21-2018
07:46 PM
@Vamshi Reddy Yes, "Confluent" is not some custom version of Apache Kafka. In fact, this process is very repeatable for all other Kafka Connect plugins: download the code, build it against the Kafka version you run, move the package to the Connect server, extract the JAR files onto the Connect server CLASSPATH, and run/restart Connect.
... View more
08-20-2018
11:26 PM
From a non-Hadoop machine, install Java+Maven+Git
git clone https://github.com/confluentinc/kafka-connect-hdfs
cd kafka-connect-hdfs
git fetch --all --tags --prune
git checkout tags/v4.1.2 # This is a Confluent Release number, which corresponds to a Kafka release number
mvn clean install -DskipTests
This should generate some files under the target folder in that directory.
So, using the 4.1.2 example, I would
ZIP up the "target/kafka-connect-hdfs-4.1.2-package/share/java/" folder that was built, then copy that archive and extract it onto every HDP server that I want to run Kafka Connect on, for example under /opt/kafka-connect-hdfs/share/java
From there, you would find your "connect-distributed.properties" file and add a line for
plugin.path=/opt/kafka-connect-hdfs/share/java
Now, run something like this (I don't know the full location of the property files)
connect-distributed /usr/hdp/current/kafka/.../connect-distributed.properties
Once that starts, you can attempt to hit http://connect-server:8083/connector-plugins, and you should see an item for "io.confluent.connect.hdfs.HdfsSinkConnector"
If successful, continue to read the HDFS Connector documentation, then POST the JSON configuration body to the Connect Server endpoint. (or use Landoop's Connect UI tool)
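If that endpoint lists the connector, the final step is a POST roughly like the one below (topic name, HDFS URL, and flush size are only example values; see the HDFS Connector documentation for the full option list):
curl -X POST -H "Content-Type: application/json" http://connect-server:8083/connectors \
  -d '{"name": "hdfs-sink", "config": {"connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector", "tasks.max": "1", "topics": "test_topic", "hdfs.url": "hdfs://namenode:8020", "flush.size": "100"}}'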
... View more
08-20-2018
09:39 PM
@Shobhna Dhami Somewhere under /usr/hdp/current/kafka there is a connect-distributed script. You run this and provide a connect-distributed.properties file. Assuming you are running a recent Kafka version (above 0.11.0), in the properties file you would add a line for "plugin.path" that points to a directory containing the extracted package of the Debezium connector. As mentioned in the Debezium documentation: "Simply download the connector’s plugin archive, extract the JARs into your Kafka Connect environment, and add the directory with the JARs to Kafka Connect’s classpath. Restart your Kafka Connect process to pick up the new JARs." Kafka documentation - http://kafka.apache.org/documentation/#connect Confluent documentation - https://docs.confluent.io/current/connect/index.html (note: Confluent is not a "custom version" of Kafka, they just provide a stronger ecosystem around it)
... View more
08-19-2018
05:37 AM
If the end goal is to move data from Kafka to Postgres, you have access to NiFi. Otherwise, exposing the internal Postgres server that is used for Hive, Ambari, Oozie, and other services is probably not a good idea. It would be recommended to run a standalone Postgres server to minimize the blast radius of a failure and maintain service uptime.
... View more
07-31-2018
10:26 PM
@Michael Bronson - Well, the obvious: Kafka leader election would fail if only one Zookeeper stops responding. Your consumers and producers wouldn't be able to determine which topic partition should serve any requests. Hardware fails for a variety of reasons, and it would be better if you converted two of the 160 available worker nodes into dedicated Zookeeper servers.
... View more
07-31-2018
10:23 PM
Load balancers would help in cases where you want a friendlier name than raw DNS records, or where IPs are dynamic. Besides that, remembering one address is easier than remembering a long list of 3-5 servers.
... View more
07-30-2018
06:53 PM
@Michael Bronson - The terms "master/worker" don't really mean anything in Kafka terms. 17 Kafka brokers seems like a lot (we have about that many brokers in AWS handling about 2 million messages per day), but yes, a minimum of 5 ZKs is encouraged to account for maintenance and hardware failure, as mentioned.
... View more