Member since: 09-21-2015
Posts: 85
Kudos Received: 75
Solutions: 7

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1953 | 04-21-2016 12:22 PM
 | 4812 | 03-12-2016 02:19 PM
 | 1706 | 10-29-2015 07:50 PM
 | 2077 | 10-02-2015 04:21 PM
 | 5759 | 09-29-2015 03:08 PM
06-02-2016
06:03 PM
As far as I know, Kerberos is for authentication, not for encrypting Hive communication.
06-02-2016
05:07 PM
@Sri Bandaru - No, that's for Ambari HTTPS. I'm referring to SSL for HiveServer2 connections.
06-02-2016
02:57 PM
1 Kudo
What configuration is required in the Hive Ambari View for supporting Hive SSL?
Labels:
- Apache Ambari
- Apache Hive
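For reference, a rough sketch of what an SSL connection to HiveServer2 looks like from beeline; the host, port, and truststore path are placeholders, and the exact properties the Hive View needs may differ:

# Hypothetical example only: host, port, and truststore values are placeholders.
beeline -u "jdbc:hive2://hiveserver.example.com:10000/default;ssl=true;sslTrustStore=/etc/pki/hive-truststore.jks;trustStorePassword=changeit"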
04-21-2016
12:43 PM
@Ali Bajwa A simplified approach. On the Ambari Server:

yum -y install git
git clone https://github.com/seanorama/ambari-bootstrap
cd ambari-bootstrap
export ambari_server_custom_script=${ambari_server_custom_script:-~/ambari-bootstrap/ambari-extras.sh}
export install_ambari_server=true
./ambari-bootstrap.sh

Then deploy the cluster. The "extras" script above takes care of all the tedious stuff automatically (cloning Zeppelin, the blueprint defaults, the role command order, ...):

yum -y install python-argparse
cd deploy
export ambari_services="HDFS MAPREDUCE2 YARN ZOOKEEPER HIVE SPARK ZEPPELIN"
bash ./deploy-recommended-cluster.bash
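While the deployment runs, one way to check on it (assuming Ambari is local on port 8080 with the default admin/admin credentials, which may not match your setup) is the Ambari REST API:

# Assumption: Ambari on localhost:8080 with default admin/admin credentials.
# Lists registered clusters; request progress is also visible in the Ambari web UI.
curl -s -u admin:admin http://localhost:8080/api/v1/clusters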
04-21-2016
12:22 PM
1 Kudo
The Google Cloud Storage Connector for Hadoop is configured at the cluster level without any knowledge of Kerberos, so the output you showed is what I would expect. But some thoughts:
- In secure environments, ideally a user can never even reach Hadoop without authenticating against Kerberos or the directory. With that assumed, you would never get the chance to run 'hadoop fs -ls ...' anyway. So lock down all access to the environment and network so that only authorized users can even run the commands.
- It couldn't hurt to submit a feature request for a configuration option that disables 'gs' unless the user is authenticated to Hadoop. Personally I see this as a bug report, but technically it's a feature request. You would have to raise it with Google, since the Connector is not currently part of Apache Hadoop; Google maintains it separately.
- Why it's not a bug: Kerberos governs communication between services, not the execution of commands. Since GS doesn't do Kerberos, it works as intended, since its authentication is done separately.
- I've not done it, but you could check whether individual users/applications can pass the GCS token. If possible, you would remove it from the cluster-wide configuration and users would be required to do this themselves. It would still not be using Kerberos, but it would be another layer of security. s3a://, swift://, and wasb:// support this method.
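As a rough sketch of that last approach with s3a:// (the bucket and keys are placeholders; whether the GCS connector accepts equivalent per-command properties would need to be confirmed against Google's documentation):

# Hypothetical example: per-command s3a credentials instead of cluster-wide config.
# The bucket name and keys below are placeholders.
hadoop fs \
  -Dfs.s3a.access.key=MY_ACCESS_KEY \
  -Dfs.s3a.secret.key=MY_SECRET_KEY \
  -ls s3a://my-bucket/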
04-04-2016
03:06 PM
2 Kudos
Prerequisites:
- Launch Sandbox on Azure
  - VM Size: minimum of A4 or A5
- A Twitter App
  - You'll use the API credentials
  - The "Application Details" don't matter

Prepare the Sandbox
Connect to SSH & Ambari
- Connect to the Sandbox using SSH, or the web console: http://<<ip>>:4200/
- Become root: sudo su -
- Reset the Ambari password: ambari-admin-password-reset
- Login to Ambari: http://<<ip>>:8080 (User: admin)
Before moving to the next steps, ensure all services on the left are started (green) or in maintenance mode (black).
Install NiFi
- In Ambari, click "Actions" (bottom left) -> Add Service.
- Choose NiFi and continue through the dialogs. You shouldn't need to change anything.
- NiFi should now be accessible at http://<<ip>>:9090/nifi/
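If the UI doesn't come up, a quick check from the Sandbox shell (assuming NiFi is on port 9090 as configured above):

# Should print 200 once NiFi has finished starting.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9090/nifi/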
Tune Sandbox
The Sandbox is tuned to run on minimal hardware. We need to update the Hive, Tez & YARN configuration for our use case.
This could take up to 15 minutes to complete:
bash <(curl -sSL https://git.io/vVRPs)
Solr & Banana
Solr enables search across large corpora of information through specialized indexing techniques.
Banana is a dashboard visualization tool for Solr.
Download the Banana Dashboard:
curl -L https://git.io/vVRP3 -o /opt/lucidworks-hdpsearch/solr/server/solr-webapp/webapp/banana/app/dashboards/default.json
Update Solr to support Twitter's timestamp format:
curl -L https://git.io/vVRPz -o /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf/solrconfig.xml
Start Solr:
JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64 /opt/lucidworks-hdpsearch/solr/bin/solr start -c -z localhost:2181
Create a Solr collection for tweets:
/opt/lucidworks-hdpsearch/solr/bin/solr create -c tweets -d data_driven_schema_configs -s 1 -rf 1
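To sanity-check the result, you can list collections through Solr's Collections API (assuming Solr's default port of 8983):

# Optional check: confirm the 'tweets' collection exists.
curl "http://localhost:8983/solr/admin/collections?action=LIST&wt=json"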
03-12-2016
02:19 PM
17 Kudos
As with many topics, "it depends".

For slave/worker/data hosts which only run distributed services, you can likely disable swap. With distributed services it's preferred to let the process/host be killed rather than swap. The killing of that process or host shouldn't affect cluster availability. Said another way: you want to "fail fast", not "slowly degrade". Just one bad process/host can greatly degrade performance of the whole cluster. For example, in a 350-host cluster, removal of 2 bad nodes improved throughput by ~2x: http://www.slideshare.net/t3rmin4t0r/tez8-ui-walkthrough/23 http://pages.cs.wisc.edu/~thanhdo/pdf/talk-socc-limplock.pdf

For masters, swap is also often disabled, though it's not a set rule from Hortonworks and I assume there will be some discussion/disagreement. Masters can be treated somewhat like you'd treat masters in other, non-Hadoop, environments. The fear with disabling swap on masters is that an OOM (out of memory) event could affect cluster availability. But that will still happen even with swap configured; it will just take slightly longer. Good administrator/operator practice is to monitor RAM availability and fix any issues before running out of memory, thus maintaining availability without affecting performance. No swap is needed then.

Scenarios where you might want swap:
- Playing with or testing functionality, not performance, on hosts with very little RAM, which will likely need to swap.
- If you need, or expect to need, more memory than the amount of RAM which has been purchased, and can accept severe degradation on failure. In this case you would need a lot of swap configured. You're better off buying the right amount of memory.

Extra thoughts:
- If you want to disable swap, but your organization requires there to be a swap partition, set swappiness=0.
- If you choose to have swap, set swappiness=1 to avoid swapping until all physical memory has been used.
- Most cloud/virtualization providers disable swap by default. Don't change that.
- Some advise avoiding swap on SSDs because it reduces their lifespan.
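For reference, a minimal sketch of how swappiness is set on a RHEL/CentOS-style host (pick 0 or 1 per the guidance above):

# Check the current value.
cat /proc/sys/vm/swappiness
# Apply immediately without a reboot.
sysctl -w vm.swappiness=1
# Persist across reboots (path assumes an /etc/sysctl.d directory is in use).
echo "vm.swappiness=1" >> /etc/sysctl.d/90-swappiness.conf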
03-11-2016
08:38 PM
3 Kudos
The questions will be:
1. Should there be a swap partition at all (i.e. swappiness=0)?
2. Do recommendations vary between masters, workers, or certain components?
3. If swappiness>=1, what should the amount be?
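When answering, it may help to note how the current swap configuration can be inspected, e.g.:

# Show configured swap devices and current memory/swap usage.
swapon -s
free -m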
03-11-2016
08:36 PM
2 Kudos
David - Thanks for posting. As discussed separately, the 2xRAM recommendation is definitely out of date.
I'm working toward consensus with my team on their recommendations, and I look forward to others' comments coming in below.
01-12-2016
06:53 PM
1 Kudo
Mind if we convert this to an Article and update it together, since no answer will be correct for more than a couple of months?