Member since: 11-19-2015
Posts: 158
Kudos Received: 25
Solutions: 21
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 11713 | 09-01-2018 01:27 AM
 | 1096 | 09-01-2018 01:18 AM
 | 3659 | 08-20-2018 09:39 PM
 | 484 | 07-20-2018 04:51 PM
 | 1459 | 07-16-2018 09:41 PM
07-25-2018
06:58 PM
1 Kudo
There is no such rule for Kafka brokers. Zookeeper should maintain a quorum of (n/2 + 1) machines (out of n total) that agree on leader-election values and locks, which is why the total is kept odd: to tolerate hardware and network failures. According to "Kafka: The Definitive Guide", as well as the Apache Zookeeper site, you will generally see negative side effects from running more than 5 or 7 Zookeeper servers in total. You should have more than 3 Zookeepers because if one goes down, you are left with only 2, and a single further failure would cost you the quorum. With 5 servers, two can go down and you still have the 3 needed for a majority vote. With 7, you can lose up to 3 Zookeepers and still be fine.
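As a sanity check on those numbers, here is a small illustrative shell snippet (not from the original answer) that computes the quorum size and the number of tolerable failures for each ensemble size:

# Illustrative only: majority-quorum math for a Zookeeper ensemble of n servers
for n in 3 5 7; do
  quorum=$(( n / 2 + 1 ))       # servers needed to elect a leader
  can_lose=$(( n - quorum ))    # servers you can lose and keep quorum
  echo "n=$n quorum=$quorum can_lose=$can_lose"
done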
07-24-2018
06:26 PM
Contrary to the answer by @Harshali Patel, exhaustion is not defined as an uneven distribution; rather, it is a cause of one. A datanode has a property you can set that defines how much space must be reserved for the OS on that server. Once that limit is exceeded, the datanode process will stop and log an error telling you to delete some files from it. HDFS will continue to function with the other datanodes. The balancer can be run to keep storage space healthy and even.
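For reference (not part of the original answer), the property in question is typically dfs.datanode.du.reserved in hdfs-site.xml, and both it and the balancer can be driven from the shell; the 10% threshold below is just an example value:

hdfs getconf -confKey dfs.datanode.du.reserved   # bytes reserved for non-DFS use on each datanode
hdfs balancer -threshold 10                      # move blocks until nodes are within 10% of average utilization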
07-20-2018
04:52 PM
TaskTracker & JobTracker don't exist with YARN. The default replication factor is 3.
07-20-2018
04:51 PM
1 Kudo
What component are you asking about? What are you trying to achieve? The components typically call each other over combinations of separate protocols:
- HDFS and YARN interact via RPC/IPC.
- Ambari Server and Agents communicate over HTTP & REST. Ambari also needs JDBC connections to its backing database.
- Hive, HBase, and Spark can use a Thrift server. The Hive metastore uses JDBC.
- Kafka has its own TCP protocol.
I would suggest starting with a specific component for the use case(s) you have in mind. Hadoop itself comprises only HDFS & YARN + MapReduce.
07-16-2018
09:41 PM
1 Kudo
@Sambasivam Subramanian
By definition, an edge node is just a host with only clients installed and configured. If you install no server services on a host in Ambari, you will end up with an edge node carrying the clients that you selected.
06-20-2018
06:55 PM
Please see previous question - https://community.hortonworks.com/questions/167618/how-to-specify-more-than-one-path-for-the-storage.html
05-19-2018
05:13 AM
The configs are on the top line. It will show an empty "Configs:" if none are customized.
$ kafka-topics --describe --topic $TOPIC --zookeeper $ZOOKEEPER
Topic:******** PartitionCount:20 ReplicationFactor:3 Configs:retention.ms=10800000
05-19-2018
05:10 AM
There is no support for renaming topics - https://issues.apache.org/jira/browse/KAFKA-2333 If you want to clone a topic instead, then use MirrorMaker - https://community.hortonworks.com/articles/79891/kafka-mirror-maker-best-practices.html
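A minimal illustrative invocation (the config file names and topic pattern are placeholders, not from the original post):

kafka-mirror-maker.sh --consumer.config source-cluster.properties \
  --producer.config target-cluster.properties \
  --whitelist 'topic-to-clone'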
05-14-2018
03:31 AM
@Michael Bronson Kafka stores the latest offsets in memory before they are flushed to disk; therefore, the more memory the better, up to a maximum of about 8G of heap. I would assume the heap properties can be set from Ambari rather than individually on each broker, but I don't use Kafka from HDP, so I can't say.
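For reference, on a plain Apache Kafka installation the broker heap is normally set through an environment variable before starting the broker; the sizing below simply mirrors the 8G figure above:

export KAFKA_HEAP_OPTS="-Xms8g -Xmx8g"
bin/kafka-server-start.sh config/server.properties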
05-11-2018
01:16 AM
1 Kudo
The recommendation here would be to increase the heap space allocated to the Kafka process, or to reduce the number of other processes running on the same server. For example, in a production environment the Kafka brokers should be standalone servers -- not on the same hardware as Zookeeper or other Hadoop processes.
04-18-2018
09:41 PM
Probably a better question for the Ambari mailing lists. https://ambari.apache.org/mail-lists.html Also lots of issues tagged in the Ambari JIRA for 3.0.0 - https://issues.apache.org/jira/browse/AMBARI-23611?jql=project%20%3D%20AMBARI%20AND%20fixVersion%20%3D%203.0.0
04-10-2018
08:35 PM
Yes, the commands work the same, assuming you have winutils.exe on your PATH and HADOOP_HOME and HADOOP_CONF_DIR defined as environment variables. Windows is not as stable or as well supported as Linux, however.
04-09-2018
03:12 PM
Hello, it seems you duplicated this post. https://community.hortonworks.com/questions/183845/hdpcd-exam-issue.html Please remove this one.
04-09-2018
03:09 PM
@bob rabih sh scripts are meant to be run from *nix machines, not from Windows. Use the corresponding bat files in the kafka\bin\windows folder.
04-06-2018
04:47 PM
@Rajesh Reddy - No, wget only tests HTTP/S connections, not the plain TCP that Kafka and Zookeeper use.
04-05-2018
07:57 PM
Well, can you issue "kafka-topics --list --zookeeper $ZK_VIP" or "kafka-console-producer --broker-list $KAFKA_VIP --topic <some-topic>" commands without error?
04-03-2018
06:57 PM
You should consider just using Kafka for all ingestion. Run Kafka Connect locally and point it at a directory. http://kafka.apache.org/documentation/#connect https://github.com/jcustenborder/kafka-connect-spooldir Alternative solutions include Fluentd or Filebeat.
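As a rough sketch of the spooldir approach (the connector class and property names come from the linked project, but verify them against its docs; all paths and the topic name are placeholders):

# spooldir-source.properties
name=spooldir-source
connector.class=com.github.jcustenborder.kafka.connect.spooldir.SpoolDirCsvSourceConnector
topic=ingested-files
input.path=/data/incoming
finished.path=/data/finished
error.path=/data/error

# run a single-node Connect worker with that connector
bin/connect-standalone.sh config/connect-standalone.properties spooldir-source.properties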
04-02-2018
09:12 PM
If you are using a Hortonworks offering, then you can use Apache NiFi. https://community.hortonworks.com/articles/97773/how-to-retrieve-files-from-a-sftp-server-using-nif.html Otherwise, according to that documentation, if you have Sqoop 1.99.7, then yes, it appears to be possible.
03-30-2018
03:06 AM
@Geoffrey Shelton Okot - That requires installing the sh module into the Ambari Python installation, which I do not want to maintain. If I could do that, then I would just use a Python Hive driver.
03-28-2018
08:37 PM
Ambari alerts only check process and port health, yes? There are no smoke tests being run by Ambari unless a manual service check is run?
I was able to write a bash script that detects a timeout from a simple COUNT query, but later found out that Ambari only accepts Python scripts as alerts.
What are my options if I would like to periodically run a Hive query and do a validity check against a Hive table?
The script below works on its own in bash, but I need help calling it / redoing it in Python if I want to use Ambari alerts.
#!/usr/bin/env bash
set -u

# Hive "canary": run a COUNT query through beeline and alert CRITICAL if it
# does not finish within TIMEOUT_SECONDS.
TIMEOUT_SECONDS=60
BEELINE=/usr/hdp/current/hive-client/bin/beeline
RUNAS=hive
OUTPUT_FILE=/tmp/hivecanary.out
CONNECT_URI='jdbc:hive2://localhost:10000?tez.queue.name=infrastructure'

if [[ ! -x $BEELINE ]]; then
  echo CRITICAL
  exit 2
fi

rm -f "$OUTPUT_FILE"

# Launch the query in the background so we can watch for a hang.
"$BEELINE" --showHeader=false --outputformat=tsv2 -n "$RUNAS" \
  -u "$CONNECT_URI" -e 'SELECT COUNT(*) FROM default.customers' \
  > "$OUTPUT_FILE" 2>&1 &
QUERY_PID=$!

elapsed=0
while kill -0 "$QUERY_PID" 2>/dev/null; do
  if [[ $elapsed -ge $TIMEOUT_SECONDS ]]; then
    kill "$QUERY_PID" 2>/dev/null   # query is hung; give up
    echo CRITICAL
    exit 2
  fi
  sleep 1
  elapsed=$((elapsed + 1))
done

# Collect the exit status of the background beeline process.
if wait "$QUERY_PID"; then
  echo OK
  exit 0
fi
echo CRITICAL
exit 2
Labels:
- Apache Ambari
- Apache Hive
03-20-2018
06:57 PM
2 Kudos
You will need to query the Hive metastore to:
1. Filter on external tables
2. JOIN all tables (TBLS table) with all databases (DBS table)
3. Select the path
See here, for example - https://stackoverflow.com/questions/44151670/search-a-table-in-all-databases-in-hive Or if you cannot connect to the metastore, you will need to scan over Hive tables - https://stackoverflow.com/questions/35004455/how-to-get-all-table-definitions-in-a-database-in-hive
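For a MySQL-backed metastore, the query looks roughly like the sketch below (the TBLS/DBS/SDS table and column names match common metastore schemas, but treat them as assumptions to verify against your Hive version):

mysql -u hive -p hive -e "
SELECT d.NAME AS db_name, t.TBL_NAME, s.LOCATION
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
JOIN SDS s ON t.SD_ID = s.SD_ID
WHERE t.TBL_TYPE = 'EXTERNAL_TABLE';"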
03-15-2018
03:50 PM
You need to install the Ambari Agent on the machines, then edit the agent configuration (under the /var/lib/ambari-agent/ folder) to point at the Ambari Server; after that you can add each host and manually register it. At this point Ambari can monitor the hosts, but not the Hadoop services. You can follow those steps for any server, not just Hadoop ones.

The caveat for managing the Hadoop services is that the HDP stack requires a specific layout for the Java and Hadoop packages. JAVA_HOME needs to be symlinked into /usr/default/java/ and, if you want to manage a Hadoop installation, it needs to follow the layout of the stack that Ambari is managing. For example, in HDP the Hadoop libraries are all installed under /usr/hdp/current and /usr/hdp/<hdp.version>. For Ambari to start/stop/configure your services, the files must be located there. Ambari also prefers that you have a user account for each service, so hadoop, oozie, yarn, and hbase are all user accounts that own their corresponding installation locations.

So, without knowing how you've installed things, I would say it's not easily possible to add Ambari to a non-Ambari-installed cluster. However, that doesn't mean you can't back up the Ambari database, the Oozie database, & the Hive metastore, unmount the datanode volumes, snapshot the HBase tables, etc. Then you can restore those onto an Ambari-installed cluster.
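The "point the agent at the server" step usually comes down to one line in the agent's ini file; a hedged sketch (the ini path is the common default, and the hostname is illustrative):

sed -i 's/^hostname=.*/hostname=ambari-server.example.com/' /etc/ambari-agent/conf/ambari-agent.ini
ambari-agent restart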
03-12-2018
08:05 PM
I would suggest you use HDFS Connect rather than Spark Streaming, as it is more fault tolerant. Kafka Connect is built into the base Kafka libraries, but you need to compile HDFS Connect separately and add it to the classpath of Connect. Build from here: https://github.com/confluentinc/kafka-connect-hdfs and use a tagged branch rather than master, as the tagged releases build against publicly available libraries rather than SNAPSHOT builds that require you to compile Kafka from source.
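A hedged sketch of that build (the tag name is illustrative; pick a real release tag from the repo):

git clone https://github.com/confluentinc/kafka-connect-hdfs.git
cd kafka-connect-hdfs
git checkout v5.0.0              # a released tag, not master
mvn clean package -DskipTests
# Then put the jars produced under target/ on the Connect worker's classpath,
# e.g. via the plugin.path setting in the worker properties.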
03-06-2018
10:11 PM
Anything you can do in a VM, you can deploy on a physical machine. If you want to manage the services via Ambari, you need both Ambari Agent and Server. However, I would not really recommend running all those services on one machine. It's called "distributed" computing for a reason 😉
03-01-2018
10:57 PM
For loops are not part of Hive's syntax. The Sqoop "export" command or SparkSQL are alternative solutions to what you are doing, but all of them will be slow, depending on the size of the database tables. A single CPU and network interface can only process data so fast.
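For reference, a Sqoop export of a Hive-warehouse directory into an RDBMS table looks roughly like this (the connection string, table, and paths are all placeholders):

sqoop export \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl -P \
  --table customers \
  --export-dir /apps/hive/warehouse/sales.db/customers \
  --num-mappers 8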
03-01-2018
07:48 PM
Recursion in what? There is no reason you can't write a recursive function in MapReduce or Spark programming. I've personally done it. What matters is what data you have in the current task / executor that can be processed recursively.
02-24-2018
07:02 PM
2 Kudos
@Tom C That's just a warning that you have existing processes on the machines. If you let it uninstall packages or delete user accounts, you'll have downtime on the cluster, and services might not stop gracefully, so you risk additional corruption. I've added machines like this that were provisioned by Puppet, so there are some extra background services running, but I just ignore that warning, and Ambari has set them up fine. Regarding the Hive metastore: if you have set it up to use an external Postgres/MySQL database (recommended), I would probably first let Ambari install the embedded Derby database for Hive, then manually edit the hive-site XML to point at the old one.
02-24-2018
06:56 PM
Answered - https://stackoverflow.com/questions/48961391/container-allocation-container-size-in-hadoop-cluster-read-the-scenario-below/48964023#48964023
02-24-2018
03:25 AM
1 Kudo
Ambari doesn't control any data residing on the datanodes, so you should be safe there. What I would do is let all the Hadoop components keep running "in the dark" by stopping all ambari-agents in the cluster, maybe even uninstalling them. Then install and set up a new Ambari server and add a cluster, but register no hosts. Configure each of the stopped ambari-agents to point at the new Ambari server address, and start them. Add the hosts in the Ambari server UI, selecting the "manual registration" option at the bottom of the dialog. Hopefully all the hosts register successfully.

After that, you are given the option of installing clients and servers. Now, you could try to "reinstall" what is already there, but you might want to deselect all the servers on the datanode column. In theory, it will try to perform the OS package installation, see that the service already exists, and not error out. If it does error, restart the install process but deselect everything -- at that point it should continue, and you now have Ambari back up and running with all the hosts monitored, just with no processes to configure.

To add the services back, you would need to use the Ambari REST API to add the respective Services, Components, and Host Components that you have running on the cluster. If you can't remember what those are from all the things you have the option of installing in HDP, go to each host and do a process check to see what's running.
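A hedged sketch of those REST calls (the cluster name, host, and credentials are all illustrative; Ambari requires the X-Requested-By header on writes):

AMBARI=http://ambari.example.com:8080/api/v1/clusters/mycluster
curl -u admin:admin -H 'X-Requested-By: ambari' -X POST "$AMBARI/services/HDFS"
curl -u admin:admin -H 'X-Requested-By: ambari' -X POST "$AMBARI/services/HDFS/components/DATANODE"
curl -u admin:admin -H 'X-Requested-By: ambari' -X POST "$AMBARI/hosts/worker01.example.com/host_components/DATANODE"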
02-22-2018
08:19 PM
@Rakesh AN I have worked for at least three companies trying to follow Agile/Scrum, and their code development cycles do follow it. It's hard, though, to upgrade hundreds of Hadoop nodes and software versions, and to make sure they all work with the other components of the cluster, without breaking other pieces in two-week sprints. Stand-up meetings are all about perception management between team members and management. Again, none of this has any special relationship to Hadoop development versus web or mobile development, etc.