Member since: 02-21-2019
Posts: 69
Kudos Received: 45
Solutions: 11
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1452 | 06-06-2018 02:51 PM
 | 4372 | 10-12-2017 02:48 PM
 | 1367 | 08-01-2017 08:58 PM
 | 29891 | 06-12-2017 02:36 PM
 | 4657 | 02-16-2017 04:58 PM
08-11-2016
08:22 PM
Well, this is the basis of security in Hadoop. In a nutshell, the following separate authorization policies apply:
- beeline -> Hive policies in Ranger (or SQL-based authorization)
- hive CLI and sqoop -> HDFS policies in Ranger (or HDFS POSIX permissions)
Authorization is meaningless without authentication (Kerberos), because anyone can impersonate any other user, including the admin (hdfs) user.
You can think of HiveServer2 and beeline as similar to how a "normal" database operates: a process plus a user that owns the process and all the files it writes - in this case the hive user owns all files under /apps/hive/warehouse. In Hadoop, however, other users can also write those files via Pig, Sqoop, the Hive CLI, etc., bypassing the HiveServer2 "database service". The only way to prevent that is with HDFS permissions - for example, don't allow the user running sqoop or the Hive CLI to access certain Hive database folders. That again is meaningless without Kerberos, since anyone can become the hdfs user (and you cannot simply block the Hive CLI, because anyone with shell access can run hdfs commands and still read those database files).
You can also think in terms of network access and types of users. The users running sqoop or hdfs commands are typically data engineers/scientists, or a scheduled Oozie service user, who normally have access to most of the data and have shell access to the edge or other nodes. Users that only consume the data (for example analysts using Tableau) would not have shell access and would only reach the HiveServer2 port, so enforcing permissions is easier for them. By default there is no Hive authentication, but with this specific access pattern you could configure LDAP authentication just for HiveServer2 (or Knox) and not need Kerberos, since these users cannot access the cluster other than through the HiveServer2 port anyway.
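To make the HDFS side concrete, here is a minimal sketch of locking down one Hive database folder so it is effectively only reachable through HiveServer2; the path, group and database name are illustrative, not taken from the thread:

```bash
# Assumption: default HDP warehouse layout and a database called secure_db.
# Make the folder owned by hive and unreadable to everyone else, so Hive CLI,
# Pig or Sqoop jobs running as other users hit an HDFS permission error,
# while queries through beeline/HiveServer2 (running as hive) still work.
sudo -u hdfs hdfs dfs -chown -R hive:hadoop /apps/hive/warehouse/secure_db.db
sudo -u hdfs hdfs dfs -chmod -R 700 /apps/hive/warehouse/secure_db.db
```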
08-09-2016
02:46 PM
6 Kudos
Hello, You'll find some useful information at: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_dataintegration/content/beeline-vs-hive-cli.html Essentially, beeline uses the JDBC/Thrift (or alternatively HTTP) protocol to communicate with HiveServer2, and HiveServer2 then handles the Hive logic (finding the table definition in the metastore, reading the data from HDFS, etc.). The hive shell, on the other hand, accesses the Hive metastore and the HDFS data directly, bypassing HiveServer2. The biggest practical difference in your situation is security: Hive security is implemented in HiveServer2, so the hive shell bypasses any Hive access policies you might have set on specific databases using Ranger or SQL-based authorization (only HDFS policies apply in that case).
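As a quick illustration of the two access paths (the host, user and query are placeholders, assuming the default HiveServer2 port 10000):

```bash
# beeline: everything goes through HiveServer2, so Ranger Hive policies apply
beeline -u "jdbc:hive2://hiveserver2-host:10000/default" -n myuser -e "SHOW DATABASES;"

# hive CLI: talks to the metastore and HDFS directly, bypassing HiveServer2
hive -e "SHOW DATABASES;"
```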
08-05-2016
03:05 PM
For the Ranger Audit tab to work and display information, you first need to install Solr (HDP Search) as described at https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_hdp_search/content/ch_hdp-search-install.html and then follow the steps from the link provided by Sunile: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_installing_manually_book/content/enabling_audit_logging_hdfs_solr.html
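The second step essentially points each Ranger plugin's audit destination at the Solr collection; the property names below follow the HDP 2.x convention and the URL/port are placeholders, so double-check them against the linked guide:

```properties
# Example for HDFS auditing, set in the ranger-hdfs-audit configuration via Ambari
xasecure.audit.destination.solr=true
xasecure.audit.destination.solr.urls=http://SOLR_HOST:8983/solr/ranger_audits
```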
08-05-2016
01:53 PM
1 Kudo
Hi @Bhavin Tandel This is a valid question. Ambari 2.2.1.1 introduced a bug by adding upstart support for all flavours of Linux: https://issues.apache.org/jira/browse/AMBARI-14842 Unfortunately, the same /etc/init/ambari-agent.conf file was used for every flavour of Linux,
and on CentOS/RHEL 6 the line 'kill signal SIGKILL' in that file is not compatible with the older version of upstart shipped with CentOS/RHEL 6. Ansible's service module always prefers upstart over SysVinit to start services, hence the error, as the upstart stop/start of ambari-agent does not work on CentOS/RHEL 6. This is fixed in Ambari 2.2.2.0, so I suggest you use Ambari 2.2.2.0.
If you can't, then you'll need to run the following Ansible task before your service task (a one-off manual alternative is shown below it):
- name: Fix for upstart script in RHEL6
  lineinfile:
    dest: /etc/init/ambari-agent.conf
    state: absent
    regexp: '^kill(.*)'
  when: ansible_os_family == "RedHat" and ansible_distribution_major_version == "6"
Kind regards, Alexandru
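If you only need to patch a single machine by hand rather than via Ansible, deleting the offending line directly should have the same effect (same file as in the task above; a .bak backup is kept):

```bash
sudo sed -i.bak '/^kill signal SIGKILL/d' /etc/init/ambari-agent.conf
```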
06-10-2016
05:16 PM
1 Kudo
Hi Alex, Just in case someone else has this issue, I'll try to expand.
Yes, indeed, beeline should be the tool of choice when connecting remotely or even locally, as it supports more features and things like security work as expected. Beeline just opens a TCP connection to HiveServer2 (on port 10000) and everything else is handled by the HiveServer2 process. The Hive CLI, on the other hand, starts an embedded HiveServer, so it needs direct access to the Hive Metastore (running on port 9083) and to HDFS. This bypasses things like Ranger Hive policies, since it can access the HDFS files directly. The latest versions of the Hive CLI don't even include the -h option. However, there might be reasons why you want the CLI, for example to avoid overloading a production HiveServer2 with an experimental query that uses MapJoins.
To connect to a different cluster using the Hive CLI, copy the hive-site.xml file from the remote cluster to any local folder and point the HIVE_CONF_DIR variable at that folder: export HIVE_CONF_DIR=/home/alex/remote This allows the Hive CLI to load the configuration variables needed to access the remote metastore. Make sure all cluster nodes can resolve the hostnames of all nodes in the remote cluster (update /etc/hosts if you're not using a DNS server). Then set the fs.defaultFS variable to the remote NameNode address: hive --hiveconf fs.defaultFS=hdfs://<REMOTE>:8020 (both steps are shown together below). Best, Alex
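Putting the two steps together (the local folder and the <REMOTE> host are the placeholders from the post above):

```bash
# 1. Point the Hive CLI at the remote cluster's configuration
#    (the folder holds the hive-site.xml copied from the remote cluster)
export HIVE_CONF_DIR=/home/alex/remote

# 2. Run the CLI against the remote NameNode
hive --hiveconf fs.defaultFS=hdfs://<REMOTE>:8020
```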
06-10-2016
03:26 PM
Hello, Since it's an exit code [1] error, you'll need more information about what is causing that exit code. Find the application ID of the job that was launched: yarn application -list -appTypes ALL Then su - to the user that launched the job and run: yarn logs -applicationId <application ID>
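For example, as interactive steps (the user name and application ID are placeholders):

```bash
# List recent applications (all types) and note the Application-Id of the failed job
yarn application -list -appTypes ALL

# As the user that submitted the job, fetch the aggregated container logs
su - submitting_user
yarn logs -applicationId application_1465551234567_0042 | less
```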
06-10-2016
12:35 PM
If you want to use a standard tool rather than managing the API/HTTP calls via scripts, you can use Ansible. We've enabled such a feature in the Rackspace deployment playbooks: https://github.com/rackerlabs/ansible-hadoop/blob/master/playbooks/roles/ambari-server/tasks/main.yml#L127 I've created a gist with just that function: https://gist.github.com/alexandruanghel/68a16994028563be12cee4e3b93f7e89 if you want to use it straight away. Once you've downloaded statuscheck.yml, just set the variables and run it:
AMBARI_HOST=127.0.0.1
AMBARI_PASSWORD=admin
CLUSTER_NAME=hadoop-poc
ansible-playbook -e "ansible_nodename=$AMBARI_HOST cluster_name=$CLUSTER_NAME ambari_password=$AMBARI_PASSWORD wait_timeout=1800" statuscheck.yml
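For comparison, the underlying Ambari REST call that such a check wraps looks roughly like this (assuming the admin user, the default port 8080 and the variables set above):

```bash
# Ask Ambari for the state of every service in the cluster
curl -s -u admin:"$AMBARI_PASSWORD" \
  "http://$AMBARI_HOST:8080/api/v1/clusters/$CLUSTER_NAME/services?fields=ServiceInfo/state"
```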
06-10-2016
10:44 AM
Hi Antony, By default, impersonation is enabled (hive.server2.enable.doAs is set to true), so the job appears to be running as the user who submitted it. However, when Ranger is enabled this is turned off, so queries are submitted as the system user that the HiveServer2 process runs under (which is hive). You can use either setting depending on how your users access the data. If HiveServer2 is the only way the data is accessed and all tables are managed by Hive, you can leave impersonation turned off. However, if the Hive data is also read and written by other tools (such as Pig or MR jobs), then you can turn impersonation back on and use Ranger to configure the correct HDFS permissions (the property itself is shown below for reference). More about this and the best practice for each use case: http://hortonworks.com/blog/best-practices-for-hive-authorization-using-apache-ranger-in-hdp-2-2/ Regards, Alex
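If you prefer to check or change the setting directly rather than through the Ambari UI, this is the hive-site.xml property in question (HiveServer2 needs a restart after changing it):

```xml
<property>
  <name>hive.server2.enable.doAs</name>
  <!-- true: run queries as the submitting user; false: run them as the hive user -->
  <value>true</value>
</property>
```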
06-10-2016
10:30 AM
1 Kudo
Hi Mohan, It looks like you're missing the following from the hdfs-site properties: "dfs.nameservices" : "mycluster" This is quite important, as it defines the logical name of the HA HDFS service. Hopefully it works once you add it (a sketch of how it fits with the other HA properties is below). Alex
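For context, a minimal sketch of the hdfs-site block it belongs to, assuming two NameNodes named nn1/nn2; the host group names are placeholders and the other HA properties from your blueprint stay as they are:

```json
"hdfs-site": {
  "properties": {
    "dfs.nameservices": "mycluster",
    "dfs.ha.namenodes.mycluster": "nn1,nn2",
    "dfs.namenode.rpc-address.mycluster.nn1": "%HOSTGROUP::master_1%:8020",
    "dfs.namenode.rpc-address.mycluster.nn2": "%HOSTGROUP::master_2%:8020"
  }
}
```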
06-10-2016
09:32 AM
Hi Mohan, Have you started from the example blueprint at https://cwiki.apache.org/confluence/download/attachments/55151584/hdfs_ha_blueprint.json?version=4&modificationDate=1434548806000&api=v2 ? You'll need a ZKFC component in each host group that has the NAMENODE component:
{ "name": "NAMENODE" },
{ "name": "ZKFC" },
and three JOURNALNODE components:
{ "name": "JOURNALNODE" },
If that's the case, can you attach your blueprint and cluster creation template please? (A sketch of a full NameNode host group is below for reference.) Alex
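A rough sketch of what such a host group could look like in the blueprint; the group name, the extra components and the cardinality are illustrative:

```json
{
  "name": "master_1",
  "components": [
    { "name": "NAMENODE" },
    { "name": "ZKFC" },
    { "name": "JOURNALNODE" },
    { "name": "ZOOKEEPER_SERVER" }
  ],
  "cardinality": "1"
}
```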