Member since: 02-21-2019
Posts: 69
Kudos Received: 45
Solutions: 11
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1452 | 06-06-2018 02:51 PM
 | 4372 | 10-12-2017 02:48 PM
 | 1367 | 08-01-2017 08:58 PM
 | 29891 | 06-12-2017 02:36 PM
 | 4657 | 02-16-2017 04:58 PM
08-11-2016
08:22 PM
Well, this is the basis of security in Hadoop. In a nutshell, the following separate authorization policies apply:
- beeline -> Hive policies in Ranger (or SQL-based authorization)
- hive CLI and sqoop -> HDFS policies in Ranger (or HDFS POSIX permissions)
Authorization is meaningless without authentication (Kerberos), because anyone can impersonate any other user, including the admin (hdfs) user.
You can think of HiveServer2 and beeline as similar to how a "normal" database operates: a process plus a user that owns the process and all the files it writes - in this case the hive user owns all files under /apps/hive/warehouse. In Hadoop, however, other users can also write those files via Pig, Sqoop, the Hive CLI, etc., bypassing the HiveServer2 "database service". The only way to prevent that is with HDFS permissions - for example, don't allow the user running sqoop or the Hive CLI to access certain Hive database folders. That again is meaningless without Kerberos, since anyone can become the hdfs user (and you cannot simply block the Hive CLI, because anyone with shell access can run hdfs commands and still read those database files).
You can also think in terms of network access and types of users. The users running sqoop or hdfs commands are typically data engineers/scientists, or a scheduled Oozie service user, who normally have access to most of the data and have shell access to the edge or other nodes. Users that only consume the data (for example analysts using Tableau) would not have shell access and would only reach the HiveServer2 port, so enforcing permissions is easier for them. By default there is no Hive authentication, but with this specific access pattern you could configure LDAP authentication just for HiveServer2 (or Knox) and not need Kerberos, since these users cannot access the cluster other than through the HiveServer2 port anyway.
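To make the HDFS side concrete, here is a minimal sketch of locking down one Hive database folder so it is effectively only reachable through HiveServer2; the path, group and database name are illustrative, not taken from the thread:

```bash
# Assumption: default HDP warehouse layout and a database called secure_db.
# Make the folder owned by hive and unreadable to everyone else, so Hive CLI,
# Pig or Sqoop jobs running as other users hit an HDFS permission error,
# while queries through beeline/HiveServer2 (running as hive) still work.
sudo -u hdfs hdfs dfs -chown -R hive:hadoop /apps/hive/warehouse/secure_db.db
sudo -u hdfs hdfs dfs -chmod -R 700 /apps/hive/warehouse/secure_db.db
```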
08-09-2016
02:46 PM
6 Kudos
Hello, You'll find some useful information at: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_dataintegration/content/beeline-vs-hive-cli.html Essentially, beeline uses the JDBC/Thrift (or alternatively HTTP) protocol to communicate with HiveServer2, and HiveServer2 then handles the Hive logic (finding the table definition in the metastore, reading the data from HDFS, etc.). The hive shell, on the other hand, accesses the Hive metastore and the HDFS data directly, bypassing HiveServer2. The biggest practical difference in your situation is security: Hive security is implemented in HiveServer2, so the hive shell bypasses any Hive access policies you might have set on specific databases using Ranger or SQL-based authorization (only HDFS policies apply in that case).
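As a quick illustration of the two access paths (the host, user and query are placeholders, assuming the default HiveServer2 port 10000):

```bash
# beeline: everything goes through HiveServer2, so Ranger Hive policies apply
beeline -u "jdbc:hive2://hiveserver2-host:10000/default" -n myuser -e "SHOW DATABASES;"

# hive CLI: talks to the metastore and HDFS directly, bypassing HiveServer2
hive -e "SHOW DATABASES;"
```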
08-05-2016
03:05 PM
For the Ranger Audit tab to work and display information, you first need to install Solr (HDP Search) as described at https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_hdp_search/content/ch_hdp-search-install.html and then follow the steps from the link provided by Sunile: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_installing_manually_book/content/enabling_audit_logging_hdfs_solr.html
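The second step essentially points each Ranger plugin's audit destination at the Solr collection; the property names below follow the HDP 2.x convention and the URL/port are placeholders, so double-check them against the linked guide:

```properties
# Example for HDFS auditing, set in the ranger-hdfs-audit configuration via Ambari
xasecure.audit.destination.solr=true
xasecure.audit.destination.solr.urls=http://SOLR_HOST:8983/solr/ranger_audits
```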
08-05-2016
01:53 PM
1 Kudo
Hi @Bhavin Tandel This is a valid question. Ambari 2.2.1.1 introduced a bug by adding upstart support for all flavours of Linux: https://issues.apache.org/jira/browse/AMBARI-14842 Unfortunately, the same /etc/init/ambari-agent.conf file was used for every flavour of Linux,
and on CentOS/RHEL 6 the line 'kill signal SIGKILL' in that file is not compatible with the older version of upstart shipped with CentOS/RHEL 6. Ansible's service module always prefers upstart over SysVinit to start services, hence the error, as the upstart stop/start of ambari-agent does not work on CentOS/RHEL 6. This is fixed in Ambari 2.2.2.0, so I suggest you use Ambari 2.2.2.0.
If you can't, then you'll need to run the following Ansible task before your service task (a one-off manual alternative is shown below it):
- name: Fix for upstart script in RHEL6
  lineinfile:
    dest: /etc/init/ambari-agent.conf
    state: absent
    regexp: '^kill(.*)'
  when: ansible_os_family == "RedHat" and ansible_distribution_major_version == "6"
Kind regards, Alexandru
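If you only need to patch a single machine by hand rather than via Ansible, deleting the offending line directly should have the same effect (same file as in the task above; a .bak backup is kept):

```bash
sudo sed -i.bak '/^kill signal SIGKILL/d' /etc/init/ambari-agent.conf
```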
06-10-2016
05:16 PM
1 Kudo
Hi Alex, Just in case someone else has this issue, I'll try to expand.
Yes, indeed, beeline should be the tool of choice when connecting remotely or even locally, as it supports more features and things like security work as expected. Beeline just opens a TCP connection to HiveServer2 (on port 10000) and everything else is handled by the HiveServer2 process. The Hive CLI, on the other hand, starts an embedded HiveServer, so it needs direct access to the Hive Metastore (running on port 9083) and to HDFS. This bypasses things like Ranger Hive policies, since it can access the HDFS files directly. The latest versions of the Hive CLI don't even include the -h option. However, there might be reasons why you want the CLI, for example to avoid overloading a production HiveServer2 with an experimental query that uses MapJoins.
To connect to a different cluster using the Hive CLI, copy the hive-site.xml file from the remote cluster to any local folder and point the HIVE_CONF_DIR variable at that folder: export HIVE_CONF_DIR=/home/alex/remote This allows the Hive CLI to load the configuration variables needed to access the remote metastore. Make sure all cluster nodes can resolve the hostnames of all nodes in the remote cluster (update /etc/hosts if you're not using a DNS server). Then set the fs.defaultFS variable to the remote NameNode address: hive --hiveconf fs.defaultFS=hdfs://<REMOTE>:8020 (both steps are shown together below). Best, Alex
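Putting the two steps together (the local folder and the <REMOTE> host are the placeholders from the post above):

```bash
# 1. Point the Hive CLI at the remote cluster's configuration
#    (the folder holds the hive-site.xml copied from the remote cluster)
export HIVE_CONF_DIR=/home/alex/remote

# 2. Run the CLI against the remote NameNode
hive --hiveconf fs.defaultFS=hdfs://<REMOTE>:8020
```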
06-10-2016
03:26 PM
Hello, Since it's an exit code [1] error, you'll need more information about what is causing that exit code. Find the application ID of the job that was launched: yarn application -list -appTypes ALL Then su - to the user that launched the job and run: yarn logs -applicationId <application ID>
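For example, as interactive steps (the user name and application ID are placeholders):

```bash
# List recent applications (all types) and note the Application-Id of the failed job
yarn application -list -appTypes ALL

# As the user that submitted the job, fetch the aggregated container logs
su - submitting_user
yarn logs -applicationId application_1465551234567_0042 | less
```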
06-10-2016
12:35 PM
If you want to use a standard tool rather than managing the API/HTTP calls via scripts, you can use Ansible. We've enabled such a feature in the Rackspace deployment playbooks: https://github.com/rackerlabs/ansible-hadoop/blob/master/playbooks/roles/ambari-server/tasks/main.yml#L127 I've created a gist with just that function: https://gist.github.com/alexandruanghel/68a16994028563be12cee4e3b93f7e89 if you want to use it straight away. Once you've downloaded statuscheck.yml, just set the variables and run it:
AMBARI_HOST=127.0.0.1
AMBARI_PASSWORD=admin
CLUSTER_NAME=hadoop-poc
ansible-playbook -e "ansible_nodename=$AMBARI_HOST cluster_name=$CLUSTER_NAME ambari_password=$AMBARI_PASSWORD wait_timeout=1800" statuscheck.yml
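For comparison, the underlying Ambari REST call that such a check wraps looks roughly like this (assuming the admin user, the default port 8080 and the variables set above):

```bash
# Ask Ambari for the state of every service in the cluster
curl -s -u admin:"$AMBARI_PASSWORD" \
  "http://$AMBARI_HOST:8080/api/v1/clusters/$CLUSTER_NAME/services?fields=ServiceInfo/state"
```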
06-10-2016
10:44 AM
Hi Antony, By default, impersonation is enabled (hive.server2.enable.doAs is set to true), so the job appears to be running as the user who submitted it. However, when Ranger is enabled this is turned off, so queries are submitted as the system user that the HiveServer2 process runs under (which is hive). You can use either setting depending on how your users access the data. If HiveServer2 is the only way the data is accessed and all tables are managed by Hive, you can leave impersonation turned off. However, if the Hive data is also read and written by other tools (such as Pig or MR jobs), then you can turn impersonation back on and use Ranger to configure the correct HDFS permissions (the property itself is shown below for reference). More about this and the best practice for each use case: http://hortonworks.com/blog/best-practices-for-hive-authorization-using-apache-ranger-in-hdp-2-2/ Regards, Alex
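If you prefer to check or change the setting directly rather than through the Ambari UI, this is the hive-site.xml property in question (HiveServer2 needs a restart after changing it):

```xml
<property>
  <name>hive.server2.enable.doAs</name>
  <!-- true: run queries as the submitting user; false: run them as the hive user -->
  <value>true</value>
</property>
```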
06-10-2016
10:30 AM
1 Kudo
Hi Mohan, It looks like you're missing the following from the hdfs-site properties: "dfs.nameservices" : "mycluster" This is quite important, as it defines the logical name of the HA HDFS service. Hopefully it works once you add it (a sketch of how it fits with the other HA properties is below). Alex
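For context, a minimal sketch of the hdfs-site block it belongs to, assuming two NameNodes named nn1/nn2; the host group names are placeholders and the other HA properties from your blueprint stay as they are:

```json
"hdfs-site": {
  "properties": {
    "dfs.nameservices": "mycluster",
    "dfs.ha.namenodes.mycluster": "nn1,nn2",
    "dfs.namenode.rpc-address.mycluster.nn1": "%HOSTGROUP::master_1%:8020",
    "dfs.namenode.rpc-address.mycluster.nn2": "%HOSTGROUP::master_2%:8020"
  }
}
```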
06-10-2016
09:32 AM
Hi Mohan, Have you started from the example blueprint at https://cwiki.apache.org/confluence/download/attachments/55151584/hdfs_ha_blueprint.json?version=4&modificationDate=1434548806000&api=v2 ? You'll need a ZKFC component in each host group that has the NAMENODE component:
{ "name": "NAMENODE" },
{ "name": "ZKFC" },
and three JOURNALNODE components:
{ "name": "JOURNALNODE" },
If that's the case, can you attach your blueprint and cluster creation template please? (A sketch of a full NameNode host group is below for reference.) Alex
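A rough sketch of what such a host group could look like in the blueprint; the group name, the extra components and the cardinality are illustrative:

```json
{
  "name": "master_1",
  "components": [
    { "name": "NAMENODE" },
    { "name": "ZKFC" },
    { "name": "JOURNALNODE" },
    { "name": "ZOOKEEPER_SERVER" }
  ],
  "cardinality": "1"
}
```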