Member since: 11-19-2015
Posts: 158
Kudos Received: 25
Solutions: 21
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 11899 | 09-01-2018 01:27 AM |
| | 1143 | 09-01-2018 01:18 AM |
| | 3798 | 08-20-2018 09:39 PM |
| | 509 | 07-20-2018 04:51 PM |
| | 1521 | 07-16-2018 09:41 PM |
09-24-2018
06:35 PM
Hi @Zach, please see my answer on StackOverflow here: https://stackoverflow.com/a/52266219/2308683 Burrow does essentially the same thing, but in Golang. However, how you read the data and perform the lag calculations also depends on what is currently being consumed, and that information is not stored immediately within the offsets topic.
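For a quick manual check (as opposed to Burrow or reading the offsets topic directly), the stock consumer-groups tool can report per-partition lag. This is only a sketch, assuming Kafka 0.10+ and a typical HDP install path; the broker address and group name are placeholders:

```
# Sketch: show current offset, log-end offset, and lag for each partition of a consumer group.
# The script location can differ between installs.
/usr/hdp/current/kafka-broker/bin/kafka-consumer-groups.sh \
  --bootstrap-server broker1.example.com:9092 \
  --describe --group my-consumer-group
```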
09-17-2018
07:48 PM
@ssarkar Is it not possible to use Ambari to install a separate ZooKeeper host group, then configure a Kafka host group to use that secondary ZooKeeper quorum?
09-17-2018
07:45 PM
These are spam accounts, by the way. Look at all the "answers" from the other users for every question, and they all link back to dataflair's website.
09-10-2018
06:55 PM
If you are running Kafka 0.10 or newer, connect-distributed.sh exists somewhere under /usr/hdp/current/kafka already. You can run that process on multiple machines to create a Kafka Connect cluster.
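As a rough sketch (host names, paths, and the properties file location below are assumptions, not HDP defaults), every machine in the cluster runs the same worker script against a properties file that shares one group.id:

```
# Sketch: start a distributed Kafka Connect worker; repeat on each machine.
# Workers with the same group.id (set in the properties file) join the same Connect cluster.
# Adjust bootstrap.servers, group.id, and the internal topic names before starting.
/usr/hdp/current/kafka-broker/bin/connect-distributed.sh \
  /etc/kafka/conf/connect-distributed.properties
```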
09-10-2018
06:46 PM
I think you are asking about adding directories to DataNodes.
dfs.datanode.data.dir in the hdfs-site.xml file is a comma-delimited list of directories where the DataNode will store blocks for HDFS. See also https://community.hortonworks.com/questions/89786/file-uri-required-for-dfsdatanodedatadir.html
| Property | Default | Description |
|---|---|---|
| dfs.datanode.data.dir | file://${hadoop.tmp.dir}/dfs/data | Determines where on the local filesystem a DFS DataNode should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored. |
Otherwise, I'm afraid your question doesn't make sense, other than running the HDFS mkdir command to "add a new directory in HDFS".
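If creating a directory inside HDFS is really all you need, a minimal sketch (the path is a placeholder):

```
# Sketch: create a directory inside HDFS itself (not on the DataNode's local disks).
hdfs dfs -mkdir -p /data/new_directory
hdfs dfs -ls /data
```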
09-04-2018
06:58 PM
@Manish Tiwari, perhaps you can look at https://docs.hortonworks.com/HDPDocuments/Cloudbreak/Cloudbreak-2.7.1/content/data-lake/index.html. Otherwise, you can search https://docs.hortonworks.com/ for the keywords you are looking for.
09-01-2018
01:36 AM
If you expose Kafka via HTTP, then I don't see the downside of exposing Kafka itself. If you did enable HTTPS on the "Kafka REST API" (via Knox, for example: https://knox.apache.org/books/knox-1-1-0/user-guide.html#Kafka), then you should also be enabling TLS/SSL on Kafka, in which case certificates would be needed to make external clients secure.

Kafka should realistically not be treated as a "walled off" service behind the Hadoop network, and you cannot proxy requests through another server without manually setting up that TLS tunnel yourself. Kafka is a common access point for getting data into Hadoop as well, so it should be treated as a first-class "edge ingestion layer" itself. You should take similar care to set up authentication and access rules around every single broker, just like you've done for the Hadoop "edge node".

You could alternatively use NiFi to listen on some other random port and forward to a Kafka producer processor; then someone scanning open ports wouldn't be able to detect that it's Kafka responding, it would be NiFi. You would still have the same problem, though: people can send random messages into that socket if it doesn't require authentication.
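To illustrate the TLS point above, here is a minimal sketch of broker-side SSL settings. The hostnames, ports, and keystore paths are placeholders, and on an HDP cluster you would normally set these through Ambari rather than appending to server.properties by hand:

```
# Sketch only: enable an SSL listener on a Kafka broker (all values are placeholders).
# On Ambari-managed clusters, make these changes in Ambari -> Kafka -> Configs instead.
cat >> /etc/kafka/conf/server.properties <<'EOF'
listeners=SSL://0.0.0.0:9093
advertised.listeners=SSL://broker1.example.com:9093
security.inter.broker.protocol=SSL
ssl.keystore.location=/etc/security/kafka/broker1.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit
ssl.truststore.location=/etc/security/kafka/truststore.jks
ssl.truststore.password=changeit
ssl.client.auth=required
EOF
```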
09-01-2018
01:27 AM
Nagios / OpsView / Sensu are popular options I've seen. StatsD / CollectD / MetricBeat are daemon metric collectors that run on each server (MetricBeat is somewhat tied to an Elasticsearch cluster, though). Prometheus is a popular option nowadays that would scrape metrics exposed by each local service. I have played around a bit with netdata, though I'm not sure if it can be applied to Hadoop monitoring use cases. DataDog is a vendor that offers lots of integrations such as Hadoop, YARN, Kafka, ZooKeeper, etc.

Realistically, you need some JMX + system monitoring tool, and a bunch exist.
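As a small illustration of the Prometheus approach, here is a sketch of a scrape config, assuming a JMX exporter (or similar agent) already exposes metrics on port 7071 of each host; the hostnames, port, and file path are all placeholders:

```
# Sketch: have Prometheus scrape JMX-exported metrics from a few Hadoop/Kafka hosts.
cat > /etc/prometheus/prometheus.yml <<'EOF'
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: hadoop-and-kafka-jmx
    static_configs:
      - targets:
          - namenode1.example.com:7071
          - broker1.example.com:7071
          - broker2.example.com:7071
EOF
```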
09-01-2018
01:18 AM
1 Kudo
A Data Lake is not tied to a platform or technology, and Hadoop is not a requirement for a data lake either. IMO, a "data lake project" should not be a project description or the end goal; you can say you got your data from "source X", using "code Y", transformed and analyzed using "framework Z", but the combinations of tools out in the market that support such statements are so broad and vague that it really depends on what business use cases you are trying to solve.

For example, S3 is replaceable with HDFS or GCS or Azure Storage. Redshift is replaceable with Postgres (and you really should use Athena anyway if the data you want to query is in S3, where Athena is replaceable by PrestoDB), and those can be compared to Google BigQuery.

My suggestion would be not to tie yourself to a certain toolset, but if you are in AWS, their own documentation pages are very extensive. Since you are not asking a Hortonworks-specific question, I'm not sure what information you are looking for from this site.
08-24-2018
06:30 PM
You can enable JMX for metrics + Grafana for visualization, then Ambari Infra for log collection. However, you will not have visibility into consumer lag like Confluent Control Center offers, and you will need to find some external tool to do that for you, such as LinkedIn Burrow.

If you are not satisfied with that, Confluent Control Center can be added to an HDP cluster with manual setup: https://docs.confluent.io/current/control-center/docs/installation/install-apache-kafka.html You will need to copy the Confluent Metrics Reporter JARs from the Confluent Enterprise download over onto your HDP Kafka nodes under /usr/hdp/current/kafka.
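For the JMX part, a minimal sketch of exposing the broker's JMX port (the port number and paths are placeholders; on an Ambari-managed cluster you would set this in the kafka-env template instead of starting the broker by hand):

```
# Sketch: kafka-run-class.sh enables the JMX remote agent when JMX_PORT is set,
# so a collector (jmxtrans, Prometheus JMX exporter, etc.) can then read broker metrics.
export JMX_PORT=9999
/usr/hdp/current/kafka-broker/bin/kafka-server-start.sh /etc/kafka/conf/server.properties
```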