Member since
11-19-2015
158
Posts
25
Kudos Received
21
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 11730 | 09-01-2018 01:27 AM |
 | 1096 | 09-01-2018 01:18 AM |
 | 3668 | 08-20-2018 09:39 PM |
 | 485 | 07-20-2018 04:51 PM |
 | 1461 | 07-16-2018 09:41 PM |
02-21-2018
06:51 PM
It might be beneficial to put this in a Docker container and run
make prod tarball
Then you can run the Docker container for the respective environment and simply copy the tarball out to external clusters.
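For example, a rough sketch of that flow (image name, container name, and tarball path are hypothetical, assuming your Dockerfile provides the build toolchain):
# Build the image with the toolchain, run the build inside a container,
# then copy the resulting tarball out for distribution to other clusters.
docker build -t myapp-build .
docker run --name myapp-build-1 myapp-build make prod tarball
docker cp myapp-build-1:/opt/myapp/build/release.tar.gz ./release.tar.gz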
... View more
02-21-2018
06:49 PM
I don't believe hardware or infrastructure setup should follow any such workflow. Maybe you need separate environments to isolate workloads, but other than that, it's the code itself that follows development patterns, the same as anything else. Examples of such code include MapReduce jobs, Hive scripts, Oozie jobs, Spark processes, NiFi dataflows, etc. For MapReduce or Spark, you can use CI/CD processes to build the code and push it to HDFS, then submit it to YARN to run once, or submit it to Oozie to run on a schedule. Hadoop itself just offers HDFS, YARN, and MapReduce; everything else is very specific to your needs and processes.
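For example, a minimal sketch of that build-and-run flow for a Spark job (jar name, HDFS path, and class name are hypothetical):
# CI builds the artifact
mvn clean package
# Publish the jar to HDFS so every environment runs the same artifact
hdfs dfs -put -f target/myjob.jar /apps/myteam/myjob.jar
# Run it once on YARN (or wire the same jar into an Oozie coordinator for a schedule)
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.MyJob hdfs:///apps/myteam/myjob.jar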
... View more
02-13-2018
07:42 PM
Can you describe what issues you are having, or what you have tried already? Kafka Connect JDBC or Apache NiFi should be able to set JDBC properties for SQL Server and give you the ability to produce to Kafka.
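For example, a rough sketch using the Confluent JDBC source connector through the Kafka Connect REST API (hostnames, credentials, table, and column names are placeholders):
# POST a JDBC source connector config to a Connect worker (default REST port 8083)
curl -X POST -H "Content-Type: application/json" http://connect-worker:8083/connectors -d '{
  "name": "sqlserver-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:sqlserver://sqlserver-host:1433;databaseName=mydb",
    "connection.user": "myuser",
    "connection.password": "mypassword",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "table.whitelist": "my_table",
    "topic.prefix": "sqlserver-"
  }
}'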
... View more
02-12-2018
06:25 PM
@Mahesh Jadhav Hue is deprecated in the latest HDP, but see http://gethue.com/hadoop-hue-3-on-hdp-installation-tutorial/ and http://gethue.com/how-to-configure-hue-in-your-hadoop-cluster Note: Hue does not need to be part of the cluster. It is completely detachable from the cluster, and it communicates with the appropriate network ports for each service.
... View more
02-09-2018
08:09 PM
It might be beneficial to search before you ask. https://stackoverflow.com/questions/22769129/differences-between-hadoop-jar-and-yarn-jar
... View more
02-07-2018
08:55 PM
Ideally, you would not be removing DataNodes, only NodeManagers (YARN) or Spark executors (standalone Spark). The need to do this depends on your hardware resources. For example, in the cloud, such as AWS EMR, you can scale up a job to add more compute via AWS auto-scaling groups. The data is persisted long-term in S3 and only exists briefly on HDFS for running the necessary actions quickly. You pay for the run time of these clusters, and if you only run a job every morning, then you don't need the cluster sitting idle in the afternoon.
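For example, a hedged sketch of shrinking YARN capacity without touching HDFS, assuming yarn.resourcemanager.nodes.exclude-path points at the exclude file shown here:
# Add the NodeManager host to the YARN exclude file, then tell the ResourceManager
# to re-read its node lists; the DataNode on that host is left alone.
echo "worker05.example.com" >> /etc/hadoop/conf/yarn.exclude
yarn rmadmin -refreshNodes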
... View more
02-07-2018
08:49 PM
1 Kudo
You could write an alias/wrapper around the hadoop (and hdfs) CLI commands that would block this.
For example, spark-submit in HDP is not the real "spark-submit"; it detects whether you have exported SPARK_MAJOR_VERSION with a value of 1 or 2, then forwards to the real Spark bin folder.
Essentially, put this script somewhere (named hadoop), then make sure it's first in the $PATH for all users.
#!/usr/bin/env bash
# Wrapper that sits ahead of the real hadoop binary in $PATH and blocks accidental formatting
if [[ "$1 $2" = "namenode -format" ]]; then
echo "ERROR: Namenode Formatting disabled"
exit 1
fi
# Forward everything else to the real binary by full path (adjust for your layout,
# e.g. /usr/hdp/current/hadoop-client/bin/hadoop); calling plain "hadoop" here would
# just re-invoke this wrapper.
exec /usr/bin/hadoop "$@"
Sample usage:
$ ./hadoop namenode -format
ERROR: Namenode Formatting disabled
... View more
02-07-2018
08:37 PM
A "backup node" isn't a concept here. Several components have a High-Availability setup; for example, see your other question about the NameNode: https://community.hortonworks.com/questions/171755/what-is-active-and-passive-namenode-in-hadoop.html The ResourceManager, HiveServer, HBase Masters, and other components have similar availability considerations.
... View more
02-02-2018
09:58 PM
I assume this returns a limited result set, though, for large tables?
... View more
02-02-2018
09:56 PM
Well, Oozie just executes spark-submit. Also, this leaves out that Spark runs "over" YARN as a resource manager, or "over" HDFS as a filesystem.
... View more
02-02-2018
09:55 PM
Can you clarify what information you have not learned from the Spark documentation?
... View more
02-02-2018
09:45 PM
1 Kudo
I believe you meant the Spark "Thrift Server", @kgautam.
https://community.hortonworks.com/articles/29928/using-spark-to-virtually-integrate-hadoop-with-ext.html
http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server
The alternative would be to use Apache Livy: http://livy.apache.org/docs/latest/programmatic-api.html
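For example, once the Thrift Server is running you can connect to it with beeline just like HiveServer2 (hostname and port are assumptions; check your own config for the Spark Thrift Server port):
beeline -u jdbc:hive2://sparkthrift.example.com:10016 -n myuser -e "show tables"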
... View more
02-02-2018
04:32 PM
@Junfeng Chen - Sorry, it would appear you are correct. I think your best option would be to mirror the Apache Bigtop repo on a machine that does have internet access; then you can install Hue via the package manager of your OS. I'm not sure what version of Hue that would be, though. Another option is to build Hue somewhere with internet access and copy it over. As long as the Python version is the same, it should be okay, since the make command builds a portable virtualenv. I have done this before and used the FPM tool to package an RPM.
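For reference, a rough sketch of that build-and-package approach (version and install prefix are assumptions; check the Hue build docs for your release):
# On a machine with internet access: build Hue into a self-contained prefix
PREFIX=/usr/local make install
# Package the built tree as an RPM with fpm, then copy the RPM to the offline hosts
fpm -s dir -t rpm -n hue -v 4.1.0 /usr/local/hue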
... View more
02-02-2018
04:07 PM
@Carlton Patterson, you don't need to understand beeline. If you take the command that was given to you, it is mostly copy-paste; the JDBC URL is even on the Ambari dashboard for you to copy exactly.
beeline -u <URL> --outputformat=<FORMAT> -f <YOUR_SCRIPT> > <DESTINATION FILE ON LOCAL DISK>
The only confusing part, if you are unfamiliar with shell commands, is the output redirection to a file. The rest is very similar to any terminal-based execution of a SQL script. As far as I know, this is the only free (as in money) way to get the data out to a file in full. The alternative is to download a trial version of RazorSQL or pay for a tool like Tableau that can export the SQL results. Depending on your data size, Excel or LibreOffice might work.
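For example (hostname, database, script, and output file are placeholders):
beeline -u jdbc:hive2://hiveserver.example.com:10000/default --outputformat=csv2 -f my_query.sql > /tmp/results.csv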
... View more
02-02-2018
03:56 PM
Use "-getmerge" to combine all files into one.
... View more
02-02-2018
12:24 AM
The Download link on this page should give you a pre-built version, so you don't need to clone from GitHub and run "make apps": http://gethue.com/hue-4-1-is-out/
... View more
01-30-2018
09:31 PM
"Add Service" in Ambari creates multiple Zeppelin Servers. You would need an external load balancer like HAProxy, Nginx, etc to get a single URL to switch between all instances. Cluster work loads are typically running in YARN, and should be distributed on their own, with or without Zeppelin.
... View more
01-26-2018
11:05 PM
Did you mean version 0.7? https://cwiki.apache.org/confluence/display/RANGER/Support+for+%24username+variable
... View more
01-25-2018
04:15 PM
The default ports are as follows:
Kafka: 9092
NiFi: 8080
ZooKeeper: 2181
You can access NiFi via the web UI running on port 8080. You can access Kafka from your local machine, outside of a broker, by downloading the Kafka package for your respective broker version, then running a command like this to see all the topics in the cluster:
kafka-topics --list --zookeeper yourZkServer:2181
Perhaps the Kafka Quickstart guide would be a good start for you, and you can extend that produce/consume knowledge over to the NiFi UI. http://kafka.apache.org/quickstart
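For example, from that same downloaded package you can produce and consume messages against the cluster (broker hostname and topic are placeholders):
bin/kafka-console-producer.sh --broker-list yourKafkaBroker:9092 --topic test
bin/kafka-console-consumer.sh --bootstrap-server yourKafkaBroker:9092 --topic test --from-beginning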
... View more
01-23-2018
06:50 PM
@phil gib If you want to try using Control Center, you can use version 3.1.2, which is for brokers running 0.10.1.1 https://docs.confluent.io/3.1.2/control-center/docs/index.html
... View more
01-23-2018
06:46 PM
I'm not sure what "beautiful GUI" you are referring to, whether that is NiFi, or SAM, but these tools did not always exist, and they only exist as part of the HDF package, not native HDP. Pentaho works with all Hadoop environments, not only HDP. As for why people use it, you have to ask them, but if I had to guess, they were sold it by some vendor/consultant, or it was marketed to them through other channels.
... View more
01-21-2018
01:52 AM
Confluent, the company that supports Kafka, has this documentation: https://docs.confluent.io/current/kafka/deployment.html#multi-node-configuration
- Each broker must connect to the same ZooKeeper ensemble at the same chroot via the zookeeper.connect configuration.
- Each broker must have a unique value for broker.id set explicitly in the configuration, OR broker.id.generation.enable must be set to true.
- Each broker must be able to communicate with every other broker directly via one of the methods specified in the listeners or advertised.listeners configuration.
Word of advice: don't store ZooKeeper and Kafka data on the same volume. Also store the OS and process logs separately from the actual ZooKeeper and Kafka data.
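For example, a minimal sketch of those per-broker settings (hostnames, chroot, and file path are assumptions), appended to each broker's server.properties:
cat >> /etc/kafka/server.properties <<'EOF'
# Same ZooKeeper ensemble and chroot on every broker
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181/kafka
# Unique on each broker, or set broker.id.generation.enable=true instead
broker.id=1
# How clients and the other brokers reach this broker
listeners=PLAINTEXT://broker1.example.com:9092
EOF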
... View more
01-20-2018
04:15 AM
Confluent Control Center requires an enterprise license in the long term; if you are going to use it, you should probably install and maintain your own Kafka installation outside of HDP, maybe sharing a ZooKeeper. In any case, Confluent 3.3.0 is for Kafka version 0.11.0.0, which means your brokers are old. Also, the Kafka topic that Control Center listens to requires having Confluent's packages on your Kafka classpath. Therefore, while it may be possible, those packages need to be distributed to all Kafka brokers and Connect workers, and Ambari isn't going to do that just by changing some configurations. If you want some type of alerting and performance monitoring, simply exposing JMX metrics can provide lots of useful information such as in-sync replicas, partition count per broker, total bytes in/out, and message throughput. For example, see https://www.robustperception.io/monitoring-kafka-with-prometheus/
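As a hedged example of exposing JMX when starting a broker by hand (port and working directory are assumptions; the stock Kafka scripts pick up JMX_PORT):
export JMX_PORT=9999
bin/kafka-server-start.sh config/server.properties
# Then point jconsole, jmxtrans, or the Prometheus JMX exporter at port 9999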
... View more
01-20-2018
04:04 AM
Ambari is only okay if the agents are healthy and responding. You will at least need something like Nagios to check when services are down, disks are dead or full, fans have stopped working, RAM is bad, etc.

Personally, I'm a big fan of Ansible for running distributed SSH commands across the entire cluster. Ansible uses Jinja2 templates, just like Ambari, for templating out config files; it can start/stop services, sync files across machines, etc. Much better than SSH-ing to each host one by one. With the recent release of Ansible Tower, you can make a centralized location for all your Ansible scripts. Alternative tools such as Puppet and Chef exist, and many older infrastructures already have those tools in place elsewhere. If you have RHEL, then Satellite might be worth using.

For tracing problems, you absolutely need some log collection framework and JMX enabled on every single Java/Hadoop process. You can pay for Splunk, or you can roll your own setup using Solr or Elasticsearch. Ambari recently added Ambari Infra and Log Search, which are backed by Solr. Lucidworks has a project named Banana that adds a nice dashboarding UI on top of Solr, although Grafana is also nice for dashboarding. If you go with Elasticsearch, it offers the Logstash and Beats products, which integrate well with many other external systems.
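For example, a couple of hedged Ansible ad-hoc commands (inventory, group, mount point, and service names are assumptions for a packaged install):
# Check data-disk usage on every worker
ansible workers -i hosts.ini -m shell -a "df -h /grid"
# Make sure the DataNode service is running everywhere (needs sudo)
ansible workers -i hosts.ini -b -m service -a "name=hadoop-hdfs-datanode state=started"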
... View more
01-17-2018
03:52 AM
1 Kudo
Hi Micheal. I trust your ability to make your own PowerPoint with the following information. Most importantly, Ambari has nothing to do with Kafka; I strongly suggest you explain Kafka on its own, without ever mentioning Ambari.

Moving on, at a high-level view, there is the Ambari Server (the web UI you log in to) and the agents (the hosts that you can add services to, manage, and monitor). Ambari has no concept of workers. The Ambari Server requires a running relational database: PostgreSQL, MySQL, or Oracle. Perhaps you should start here, but I will try to continue. https://cwiki.apache.org/confluence/display/AMBARI/Ambari+Design

Ambari uses widgets to display the dashboards and graphs. Services running on external systems are configured via SSH communication to the Ambari agents. Ambari gives you a central location to define configuration files for any environment. Hadoop is not required for Ambari to work; while it is commonly used for it, Ambari is fully extendable via what are called "stacks." The HDP stack includes Hadoop, Hive, HBase, Pig, Spark, Ranger, etc.

When you first log in to a fresh Ambari server, you have a default login account, and you must define a cluster and add hosts before you can do anything useful with Ambari. It is preferred to use Ambari itself to set up and manage services on new hosts rather than attempting to add existing hosts with pre-installed services to Ambari. For example, you should not attempt to install Hadoop with Puppet/Chef/Ansible and then add that server to Ambari. You should use those tools to manage the Ambari Agent installation, then continue with a typical Ambari "Add Host" operation. The agents periodically send heartbeats to the Ambari Server to let it know they are alive and able to accept requests.

Ambari offers different account access restrictions via its login methods. For example, if you want administrators to change and restart services, as well as read-only users to view overall cluster usage or access the HDFS file browser, you can selectively allow these actions. Ambari also has "Ambari Views," which allow you to extend and expose your own type of "web portal" to any system running in your environment.

Hope this gets you started, but the Ambari wiki page is a fine resource for more information.
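As a small illustration of the server/agent model (credentials and hostname are assumptions), the same REST API that backs the web UI will show you which hosts have registered with the server:
curl -u admin:admin http://ambari-server.example.com:8080/api/v1/hosts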
... View more
01-17-2018
03:19 AM
@Tu Nguyen I suggest you post a new question rather than hijack this one. Your error does not relate directly to transactional tables, but rather to the ORC splits generated for your table. What happens if you try spark.read.format("orc") against the table's files on the filesystem directly?
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998)
... 111 more
Caused by: java.lang.NumberFormatException: For input string: "0248155_0000"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
... View more
01-16-2018
02:47 AM
@Guillaume Roger And what will the end user do with those zipped CSV files once they get them? Load them into Excel? Surely, you can expose some SQL interface or BI tool to allow the datasets to be queried and explored as they were meant to be within the hadoop ecosystem.
... View more
01-16-2018
02:29 AM
@Tu Nguyen Where are you reading that you need to use JDBC from Spark to communicate with Hive? It isn't in the SparkSQL documentation. https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
1. Try using an alternative JDBC client and see if you get similar results.
2. What happens when you simply use the following?
import org.apache.spark.sql.SparkSession
val warehouseLocation = "/apps/hive/warehouse" // adjust to your Hive warehouse directory
val spark = SparkSession
  .builder()
  .appName("Spark Transactional Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()
spark.table("tnguy.table_transactional_test").count()
... View more