Member since: 02-16-2016
Posts: 176
Kudos Received: 197
Solutions: 17
04-25-2016 08:33 PM (15 Kudos)
Easily convert any XML document to JSON format using the TransformXML processor. Save the following stylesheet in a file, then configure a TransformXML processor to use it as the XSLT stylesheet; it will convert any XML document to JSON format.

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">{
<xsl:apply-templates select="*"/>}
</xsl:template>
<!-- Object or Element Property-->
<xsl:template match="*">
"<xsl:value-of select="name()"/>" : <xsl:call-template name="Properties"/>
</xsl:template>
<!-- Array Element -->
<xsl:template match="*" mode="ArrayElement">
<xsl:call-template name="Properties"/>
</xsl:template>
<!-- Object Properties -->
<xsl:template name="Properties">
<xsl:variable name="childName" select="name(*[1])"/>
<xsl:choose>
<xsl:when test="not(*|@*)">"<xsl:value-of select="."/>"</xsl:when>
<xsl:when test="count(*[name()=$childName]) > 1">{ "<xsl:value-of select="$childName"/>" :[<xsl:apply-templates select="*"
mode="ArrayElement"/>] }</xsl:when>
<xsl:otherwise>{
<xsl:apply-templates select="@*"/>
<xsl:apply-templates select="*"/>
}</xsl:otherwise>
</xsl:choose>
<xsl:if test="following-sibling::*">,</xsl:if>
</xsl:template>
<!-- Attribute Property -->
<xsl:template match="@*">"<xsl:value-of select="name()"/>" : "<xsl:value-of select="."/>",
</xsl:template>
</xsl:stylesheet>

That's it!
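For illustration (this sample input and output are mine, not part of the original post), feeding an XML document such as

<person>
  <name>Jane</name>
  <phone>555-1234</phone>
</person>

through TransformXML with this stylesheet produces output roughly like

{ "person" : { "name" : "Jane", "phone" : "555-1234" } }

(whitespace will differ; repeated child elements are handled by the ArrayElement template and emitted as a JSON array).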
04-08-2016 03:51 AM (5 Kudos)
With the HDF 1.1.2.1 release, HDF supports accessing Kerberos-enabled Kafka topics. For a standalone NiFi node, the following instructions can be used.

1. Create a new JAAS file with the following entries:

Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="./conf/nifi.keytab"
useTicketCache=false
principal="nifi@EXAMPLE.COM”;
};
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useTicketCache=false
renewTicket=true
serviceName="kafka"
useKeyTab=true
keyTab="./conf/nifi.keytab"
principal="nifi@EXAMPLE.COM";
};
*** Both the KafkaClient and Client configs should use the same principal and keytab ***

2. Change the bootstrap.conf file to include the following line, then restart the NiFi node:

java.arg.15=-Djava.security.auth.login.config=/<path>/zookeeper-jaas.conf

3. Modify the GetKafka/PutKafka processors to include a new property, security.protocol, and set its value to PLAINTEXTSASL (see kafka.png for the processor configuration).

That's it! Now NiFi can read from Kerberos-enabled Kafka topics. Thank you @rgarcia for helping me with these configurations.
02-24-2016 12:37 PM (6 Kudos)
DBVisualizer is a popular free tool that allows developers to organize development tools for RDBMS development. With Apache Phoenix, which provides SQL-like capability for HBase, we can use DBVisualizer to connect to the Phoenix layer on top of HBase.

Verified with the following versions:
DBVisualizer 9.2.12
hbase-client-1.1.2.2.3.2.0-2950.jar
phoenix-4.4.0.2.3.2.0-2950-client.jar

First, add the Phoenix driver to DBVisualizer. Go to Tools -> Driver Manager and add a new driver, including both the hbase-client and phoenix-client jars. This will add a new Phoenix driver.

1. Connecting to a non-Kerberos cluster

Use jdbc:phoenix:<zookeeper host>:<zookeeper port>:<hbase_z_node> as the connection string, where hbase_z_node is /hbase by default.

2. Connecting to a Kerberos cluster using a cached ticket

a. Add the following files to the DBVisualizer resources directory:
hdfs-site.xml
hbase-site.xml
core-site.xml
b. Copy the krb5.conf file to the local workstation.
c. Create a jaas file with the following entry:

Client {
com.sun.security.auth.module.Krb5LoginModule required
useTicketCache=true
renewTicket=true
serviceName="zookeeper";
};

Modify the dbvisgui.bat file to add the following parameters for launching DBVisualizer:

-Djava.security.auth.login.config="<path-to-jaas-file>"
-Djava.security.krb5.conf="<path-to-krb5-file>"

d. The connection string for the cached-ticket case will be:

jdbc:phoenix:<zookeeper host>:<zookeeper port>:/hbase-secure:<path-to-jaas file>

3. Connecting to a Kerberos cluster using a keytab

a. Add the following files to the DBVisualizer resources directory:
hdfs-site.xml
hbase-site.xml
core-site.xml
b. Copy the krb5.conf file to the local workstation.
c. Copy the keytab file used for connecting to HBase.
d. Create a jaas file with the following entry:

Client {
com.sun.security.auth.module.Krb5LoginModule required
useTicketCache=false
useKeyTab=true
serviceName="zookeeper";
};

The connection string for this case will be:

jdbc:phoenix:<zookeeper host>:<zookeeper port>:/hbase-secure:<Principal>:<path-to-keytab>

Sample connection string:

jdbc:phoenix:host0001:2181:/hbase-secure:<principal>:\users\z_hadoop_test.keytab

Test your connection!
02-19-2016 06:00 AM (8 Kudos)
There are two different ways of accessing HDFS over HTTP.

Using WebHDFS:
http://<active-namenode-server>:<namenode-port>/webhdfs/v1/<file-path>?op=OPEN

Using HttpFs:
http://<hadoop-httpfs-server>:<httpfs-port>/webhdfs/v1/<file-path>?op=OPEN

WebHDFS
Pros: Built in with the default Hadoop installation. Efficient, as load is streamed from each data node.
Cons: Does not work as-is if high availability is enabled on the cluster; the active namenode must be specified to use WebHDFS.

HttpFs
Pros: Works with HA-enabled clusters.
Cons: Needs to be installed as an additional service. Impacts performance because data is streamed from a single node, which also creates a single point of failure.

Additional performance implications of WebHDFS vs HttpFs:
https://www.linkedin.com/today/post/article/20140717115238-176301000-accessing-hdfs-using-the-webhdfs-rest-api-vs-httpfs

The major difference between WebHDFS and HttpFs: WebHDFS needs access to all nodes of the cluster, and when data is read it is transmitted from that node directly, whereas with HttpFs a single node acts as a "gateway" and is the single point of data transfer to the client node. So HttpFs could be choked during a large file transfer, but the good thing is that we minimize the footprint required to access HDFS.
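As an illustration (my addition, not part of the original post), a minimal Java sketch that reads a file through the WebHDFS OPEN operation on a non-secure cluster; the namenode host, port, file path, and user name are placeholder assumptions, and the same URL shape works against an HttpFs endpoint.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsOpenExample {
    public static void main(String[] args) throws Exception {
        // Placeholder namenode host/port, file path, and user name; adjust for your cluster.
        // On a non-secure cluster, simple authentication is passed via the user.name parameter.
        URL url = new URL("http://namenode.example.com:50070"
                + "/webhdfs/v1/tmp/sample.txt?op=OPEN&user.name=hdfs");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            // The namenode answers OPEN with a redirect to a datanode,
            // which HttpURLConnection follows automatically.
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } finally {
            conn.disconnect();
        }
    }
}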
02-17-2016 06:43 PM (4 Kudos)
If Kerberos for the Hadoop cluster is implemented using enterprise AD, any Windows machine where users sign on with AD credentials has a cached ticket available. This cached ticket is available to Windows applications by default, but Java applications can't access it. To access the cached ticket from a Java application, the following registry entry should be set on the Windows machine:

Key: HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Lsa\Kerberos\Parameters
Value Name: AllowTGTSessionKey
Type: REG_DWORD
Value: 1

Using the klist command on the Windows machine, verify that the username and REALM are in the correct case as specified in the Kerberos settings on the cluster. Then, create a jaas.conf file with the following entry:

Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=false
useTicketCache=true;
};

To access the Kerberized cluster, the Java program should be launched with the following parameter:

-Djava.security.auth.login.config="<path-to-jaas-conf>/jaas.conf"

This allows the Java program to access the cached ticket and pass the user's own credentials to the Kerberized cluster.
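As a quick check (my addition, not from the original post), a minimal Java sketch that logs in through the "Client" entry in the jaas.conf above; if it succeeds, the cached Windows ticket is being picked up.

import javax.security.auth.Subject;
import javax.security.auth.login.LoginContext;

public class CachedTicketCheck {
    public static void main(String[] args) throws Exception {
        // "Client" is the entry name defined in the jaas.conf above.
        LoginContext lc = new LoginContext("Client");
        lc.login();  // succeeds only if the cached Windows ticket can be read

        Subject subject = lc.getSubject();
        System.out.println("Logged in as: " + subject.getPrincipals());
    }
}

Run it with the same parameter, e.g. java -Djava.security.auth.login.config="<path-to-jaas-conf>/jaas.conf" CachedTicketCheck.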
02-17-2016 03:39 AM
Sqoop can be used to bring data from an RDBMS, but a limitation of Sqoop is that the data in HDFS is stored in one folder. If a partitioned table needs to be created in Hive for further queries, users need to write a Hive script to distribute the data to the appropriate partitions; there is no direct Sqoop option for creating partitioned Hive tables. However, we can use Sqoop's ability to write output to a specific directory to simulate a partitioned table structure in HDFS. Since any partitioned table has an HDFS layout where each partition is <table name>/<partition column name=value>, we can use the following Sqoop structure to select the appropriate data for each partition and move it to the correct HDFS location:

sqoop --table <table1> --where <where clause for pt=0> --target-dir /home/user1/table1/pt=0
sqoop --table <table1> --where <where clause for pt=1> --target-dir /home/user1/table1/pt=1

Now, an external Hive table can be created that points to the /home/user1/table1 directory with pt as the partition column:

CREATE EXTERNAL TABLE <hive_table_name>
--Column definitions---
PARTITIONED BY (pt string)
LOCATION '/home/user1/table1'

This approach gets the data into HDFS in a structure that is appropriate for a partitioned Hive table. Some advantages of this approach:

It is independent of the source table's partition structure; the source table may not even be partitioned.
It can be extended to use cases where Hive partitioning is based on multiple columns.
The Hive table partitioning scheme can be different from the source table's partitioning scheme.
The multiple Sqoop commands and the Hive table creation script can be combined into one script to allow creation of any partitioned Hive table from an RDBMS.
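One caveat worth noting (my addition, not spelled out in the original post): an external partitioned table does not automatically discover the pt=0 and pt=1 directories, so each partition has to be registered before it can be queried, for example:

ALTER TABLE <hive_table_name> ADD PARTITION (pt='0') LOCATION '/home/user1/table1/pt=0';
ALTER TABLE <hive_table_name> ADD PARTITION (pt='1') LOCATION '/home/user1/table1/pt=1';
-- or discover all partition directories at once:
MSCK REPAIR TABLE <hive_table_name>;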
02-17-2016 03:34 AM (8 Kudos)
Choosing an approach for Kerberos implementation on a Hadoop cluster is critical from a long-term maintenance point of view. Enterprises have their own security policies and guidelines, and a successful Kerberos implementation needs to adhere to the enterprise security architecture. There are multiple guides available on how to implement Kerberos, but I couldn't find information on which approach to choose and the pros and cons associated with each. In a Hortonworks Hadoop cluster, there are three different ways of generating and managing keytabs and principals.

a. Use an MIT KDC specific to the Hadoop cluster - automated keytab management using Ambari

A KDC specific to the Hadoop cluster can be installed and maintained on one of the Hadoop nodes. All users/keytabs required for the Kerberos implementation are automatically managed using Ambari.

Pros:
Enterprise security teams are not involved with the KDC setup; Hadoop administrators have complete control of the KDC installation.
Automated keytab management using Ambari; no need to manually manage any keytabs during cluster configuration or topology changes.
Non-expiring keytabs can be generated and distributed to Hadoop developers, so developers can have a copy of keytabs attached to their own IDs.
One-way trust can be set up so the enterprise Active Directory can recognize Hadoop users.

Cons:
May be against enterprise security policies.
Hadoop administrators have the additional responsibility of managing the KDC, and any security vulnerabilities are their responsibility.
Ensuring the KDC is set up for high availability and disaster recovery is the responsibility of Hadoop administrators.
Requires manual keytab generation for developers; for any new developers, new keytabs need to be generated and distributed by Hadoop administrators, and procedures are needed for lost keytabs.

b. Use an existing enterprise Active Directory - manual setup

An alternative to having a local KDC for the Hadoop cluster is to manually generate the usernames and principals required for Kerberos using Ambari and then use the corporate AD to create these users.

Pros:
Meets enterprise security standards by leveraging the existing corporate AD infrastructure.
Developers are part of the existing AD, and no keytab generation is required for them.

Cons:
Manually managing keytabs in a large cluster becomes tedious and difficult to maintain with continuous changes to the cluster structure. Any change in the Hadoop cluster structure (add/delete node, add/delete service on a node) requires new keytabs to be generated and distributed.

c. Use the existing enterprise AD with automated management using Ambari

In this approach a new OU is created in the enterprise AD, and an AD account is created with complete administrative privileges on the new OU. This account and OU are then used during the automated setup in Ambari. This allows Ambari to automatically manage all keytab/principal generation and keytab distribution. The OU holds all keytabs and principals for the Hadoop internal users required for Kerberos functionality.

Pros:
Satisfies corporate security policies, since complete auditing of user creation/maintenance is available within AD.
All developers and users are part of the enterprise AD and already have Kerberos tickets issued to them; existing tickets are used for any communication with the Kerberized cluster.
Backup, high availability, and other administrative tasks for the KDC are taken care of by the enterprise teams managing AD.
A separate OU within AD ensures Hadoop internal users are not mixed with other users in AD.
Any existing Active Directory groups are available in Ranger to implement security policies.
Automated management of all Hadoop internal users for keytab generation/distribution; changes to cluster topology and configuration are handled by Ambari.

Cons:
Any manual service users (with non-expiring passwords) for the Hadoop cluster need to be added to Active Directory manually and their keytabs distributed manually (this may require service requests to other enterprise groups to generate new IDs and keytabs).
Developers do not have access to keytabs associated with their own IDs; keytabs tied to developer IDs are invalidated by password change policy rules (password expiration after a certain number of days). Developers can instead use the ticket issued to their ID by Active Directory.
Some Java applications/tools require a copy of keytab files, and it may be difficult to find a workaround to use cached tickets with these applications/tools.

This is a preliminary guide based on my experience with implementing Kerberos. Any other suggestions/ideas are welcome.