Member since: 01-09-2016
Posts: 11
Kudos Received: 33
Solutions: 1
My Accepted Solutions
Title | Views | Posted |
---|---|---|
| 1861 | 04-20-2016 10:46 PM |
05-09-2019 08:34 AM
I have run into the same issue. It works with FairScheduler but not with CapacityScheduler. To add to the instructions above for those who normally use CapacityScheduler (99.99% of the Hadoop population :-)) but want to try FairScheduler: remember to also disable other CapacityScheduler-specific features, such as preemption, as the ResourceManager won't start otherwise.
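For anyone switching schedulers just to test this, here is a minimal sketch of the yarn-site.xml properties involved; it assumes the standard Hadoop property names and that the preemption monitor is what blocks the ResourceManager from starting:
yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
yarn.resourcemanager.scheduler.monitor.enable=false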
04-28-2016 09:42 AM
@Blanca Sanz As a workaround, if the groups you want to sync are associated with the users through the memberOf (or ismemberof) attribute, you can simply disable Group Sync (set Enable Group Sync to No). Groups will then be synced based on the User Search Filter via the memberOf attribute. For example:
User Search Filter: (|(memberOf=CN=Group1,OU=Users,DC=example,DC=es)(memberOf=CN=Group2,OU=Users,DC=example,DC=es))
User Group Name Attribute: memberOf
That will sync those groups into Ranger along with all users that are members of those groups.
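For reference, a rough sketch of how the equivalent settings might look directly in ranger-ugsync-site if you are not driving them through Ambari; treat the exact property keys as an assumption to verify against your Ranger version:
ranger.usersync.group.searchenabled=false
ranger.usersync.ldap.user.searchfilter=(|(memberOf=CN=Group1,OU=Users,DC=example,DC=es)(memberOf=CN=Group2,OU=Users,DC=example,DC=es))
ranger.usersync.ldap.user.groupnameattribute=memberOf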
04-27-2016 06:57 PM
13 Kudos
Introduction
By default, Zookeeper runs without the option of becoming a superuser to administer znodes in the ZK ensemble, for example to fix ACLs, remove znodes that are no longer required, or create new ones in specific locations. Zookeeper grants permissions through ACLs via different schemes or authentication methods, such as 'world', 'digest', or 'sasl' if we use Kerberos. We could potentially lock ourselves out if we were to grant everyone only read permission on a znode, as we would no longer be able to delete or modify it.
Example of the problem
For example, we connect to Zookeeper through zkCli:
/usr/hdp/current/zookeeper-client/bin/zkCli.sh -server `hostname -f`:2181
[zk: sandbox.hortonworks.com:2181(CONNECTED) 1] getAcl /config/topics
'world,'anyone
: r
If we need to modify that znode so that, for example, user 'kafka' can access it to create new topics:
[zk: sandbox.hortonworks.com:2181(CONNECTED) 2] setAcl /config/topics world:anyone:r sasl:kafka:cdrwa
Authentication is not valid : /config/topics
Using superDigest to become a Zookeeper superuser
The following can be done to run as a Zookeeper superuser and be able to make ACL changes or delete/modify znodes. We can run DigestAuthenticationProvider to get the digest of a given password. For example, if we want our superuser 'super' to have the password 'super123':
export ZK_CLASSPATH=/etc/zookeeper/conf/:/usr/hdp/current/zookeeper-server/lib/*:/usr/hdp/current/zookeeper-server/*
java -cp $ZK_CLASSPATH org.apache.zookeeper.server.auth.DigestAuthenticationProvider super:super123
OUTPUT:
super:super123->super:UdxDQl4f9v5oITwcAsO9bmWgHSI=
From the output, we just add the following to SERVER_JVMFLAGS and restart Zookeeper:
SERVER_JVMFLAGS=-Dzookeeper.DigestAuthenticationProvider.superDigest=super:UdxDQl4f9v5oITwcAsO9bmWgHSI=
Then, in zkCli:
[zk: sandbox.hortonworks.com:2181(CONNECTED) 1] addauth digest super:super123
Now we can perform the action we couldn't before:
[zk: sandbox.hortonworks.com:2181(CONNECTED) 1] setAcl /config/topics world:anyone:cdrwa,sasl:kafka:cdrwa
cZxid = 0x29
ctime = Tue Jul 21 18:46:36 UTC 2015
mZxid = 0x29
mtime = Tue Jul 21 18:46:36 UTC 2015
pZxid = 0x22410
cversion = 4
dataVersion = 0
aclVersion = 9
ephemeralOwner = 0x0
dataLength = 0
numChildren = 4
[zk: sandbox.hortonworks.com:2181(CONNECTED) 3] getAcl /config/topics
'world,'anyone
: cdrwa
'sasl,'kafka
: cdrwa
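As a side note on where the flag usually lives: on HDP the SERVER_JVMFLAGS setting typically ends up in zookeeper-env.sh (the zookeeper-env template in Ambari). A minimal sketch, assuming that file is used on your cluster and appending to any flags already defined there:
export SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Dzookeeper.DigestAuthenticationProvider.superDigest=super:UdxDQl4f9v5oITwcAsO9bmWgHSI="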
04-24-2016 07:15 AM
@Guilherme Braccialli Indeed, the hook is for the case where initialised sessions are not reused. Thanks very much for sharing the REST calls; they look quite useful to have handy.
04-23-2016 07:34 AM
2 Kudos
Introduction
HBase replication provides a way of replicating HBase data from one cluster to another by adding the remote Zookeeper quorum as a remote peer.
Configuration on the cluster
First of all, it is necessary to set the hbase.replication property to true. Then add the remote peer through the hbase shell. The peer id can be any short name. For example:
add_peer '1', "hdpdstzk01.machine.domain,hdpdstzk02.machine.domain,hdpdstzk03.machine.domain:2181:/hbase-secure"
(If using Kerberos, the right JAAS configuration needs to be used, or the hbase service keytab needs to be in the cache, to authenticate correctly against Zookeeper through SASL.)
Configuration on the tables
Replication is set at table and column family level by setting the property REPLICATION_SCOPE to '1'. The default value tables are created with if not specified is '0', which means no replication. If applying this to already existing tables, they need to be disabled, the property added through alter, and then re-enabled (see the shell session sketched below). For example:
alter "product:user", {NAME => 'document', REPLICATION_SCOPE => '1'}
Copying existing data across
If there is already data in the source table, it can be copied over initially with the CopyTable command:
bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=hdpdstzk01.machine.domain,hdpdstzk02.machine.domain,hdpdstzk03.machine.domain:2181:/hbase-secure mytable [--new.name=mytableCopy] [--starttime=abc --endtime=xyz]
new.name is only used when the destination table name is different from the source one.
starttime and endtime can be used when we want to replicate a specific interval of HBase timestamps.
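A minimal sketch of the hbase shell session for enabling replication on an existing table (the disable/alter/enable sequence mentioned above), plus a quick check of the replication status; the table and column family names are taken from the example above:
disable 'product:user'
alter 'product:user', {NAME => 'document', REPLICATION_SCOPE => '1'}
enable 'product:user'
status 'replication'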
04-20-2016 11:40 PM
1 Kudo
If you want to test it, even on HDP 2.3.x (Spark 1.5.2), you can use the following library, which has been compiled with the necessary changes: https://github.com/beto983/Streaming-Kafka-Spark-1.5.2/blob/master/spark-streaming-kafka_2.10-1.5.2.jar
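In case it helps, this is roughly how the jar could be passed to a streaming job with spark-submit; the application class and jar names here are placeholders, not something from the linked repository:
spark-submit --class com.example.MyStreamingApp --jars spark-streaming-kafka_2.10-1.5.2.jar my-streaming-app.jar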
04-20-2016 10:46 PM
4 Kudos
Hi Sunile, you can use a Hive hook to determine who submitted the query, even with doAs=false. As an example, you can even use that to direct the query to a specific YARN queue. See the following for more info: https://github.com/beto983/Hive-Utils/blob/master/README.md
03-26-2016 12:58 PM
4 Kudos
Introduction
MirrorMaker can be used to replicate messages for a defined list of topics across clusters.
Configuration
We need to configure both the producer and the consumer through config files. The config file for the producer defines the destination cluster's Kafka brokers, whereas the consumer one points to the Zookeeper servers of the source cluster. The following configuration also assumes that security is enabled (Kerberos), so the PLAINTEXTSASL protocol will be used.
mm_producer-1.properties:
bootstrap.servers=dest-kafkabrk1:6667,dest-kafkabrk2:6667,dest-kafkabrk3:6667
producer.type=async
queue.time=1000000
queue.enqueueTimeout.ms=-1
security.protocol=PLAINTEXTSASL
#sasl.kerberos.service.name=kafka
mm_consumer-1.properties:
zookeeper.connect=src-zkhost1:2181,src-zkhost2:2181,src-zkhost3:2181
zk.connectiontimeout.ms=1000000
bootstrap.servers=src-kafkabrk1:6667,src-kafkabrk2:6667,src-kafkabrk3:6667
consumer.timeout.ms=-1
security.protocol=PLAINTEXTSASL
group.id=kafka-mirror
#sasl.kerberos.service.name=kafka
Running MirrorMaker
MirrorMaker takes either a whitelist or a blacklist to define which topics need mirroring or which ones don't, but only one of these options can be specified. Topics can be comma-separated, and wildcards (*) can be used in the topic names. For example:
bin/kafka-run-class.sh kafka.tools.MirrorMaker --consumer.config mm_consumer-1.properties --producer.config mm_producer-1.properties --whitelist mm-topic-1,mm-topic-2
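If you want to mirror a whole family of topics instead of listing them one by one, the whitelist also accepts a regular expression; a small sketch, assuming topics named mm-topic-1, mm-topic-2, and so on:
bin/kafka-run-class.sh kafka.tools.MirrorMaker --consumer.config mm_consumer-1.properties --producer.config mm_producer-1.properties --whitelist 'mm-topic-.*'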
03-22-2016 07:32 PM
8 Kudos
Introduction
Security best practices when using Ranger dictate that Hive jobs should ideally run as user 'hive', so that only Ranger Hive policies apply for end-user access to data and 'hive' owns the whole directory/file structure for Hive on HDFS. This is achieved by setting hive.server2.enable.doAs to 'false'. It also improves performance, as it enables container pre-warming for Tez, which only applies to jobs started by 'hive' and not by other end users.
Problem
The problem introduced by doAs = false is that, if YARN Capacity Scheduler queue mappings have been defined on a user/group basis, the mappings will no longer apply, since all jobs are started as the same user (i.e. 'hive'), making those queue definitions useless.
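To make the problem concrete, this is the kind of Capacity Scheduler mapping that stops applying once every query runs as 'hive'; the user, group, and queue names below are hypothetical:
yarn.scheduler.capacity.queue-mappings=u:alice:analytics,g:etl_users:etl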
Solution
One solution is to use a Hive hook that detects the real user who started the query, so that the job can be submitted to the right queue even if it still runs as user 'hive'. The hook then finds the list of groups the user belongs to and tries to match them against a group-mappings file (with the structure groupname:queuename). When it finds one of the user's groups, it automatically submits the job to the matched queue. The Hive hook can be found at: https://github.com/beto983/Hive-Utils
This Hive hook is able to detect the user who started the Hive session, find the groups that user belongs to, and send the job to the corresponding queue depending on that group and the mappings we define in the group-mappings file. It is based on this other hook, which submits the job to a queue named after the user's primary group: https://github.com/gbraccialli/HiveUtils
Steps to follow:
On all HiveServer2 servers do:
mkdir /usr/hdp/current/hive-client/auxlib/ && wget https://github.com/beto983/Hive-Utils/blob/master/Hive-Utils-1.0-jar-with-dependencies.jar -O /usr/hdp/current/hive-client/auxlib/Hive-Utils-1.0-jar-with-dependencies.jar
Add the following setting to hive-site.xml (Custom hiveserver2-site in Ambari):
hive.semantic.analyzer.hook=com.github.beto983.hive.hooks.YARNQueueHook
Create a "group-mappings" file in /etc/hive/conf/ with the structure below (a concrete example is sketched after these steps):
groupname:queuename
groupname:queuename
groupname:queuename
...
Restart Hive
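For illustration, a group-mappings file could look like the following; the group and queue names are hypothetical and need to match your own groups and Capacity Scheduler queues:
etl_users:etl
analysts:analytics
data_science:adhoc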