Introduction

The Thrift proxy is a modern microservice framework compared with existing alternatives such as SOAP, JSON-RPC, and REST proxies. The Thrift proxy API offers higher performance and better scalability than those alternatives, and it supports many languages (C++, Java, Python, PHP, Ruby, Perl, C#, Objective-C, JavaScript, Node.js, and others).

Applications can interact with HBase through the Thrift proxy.

This article discusses how to use the correct libraries and methods to interact with HBase through the Thrift proxy.

Outline

  1. The basic concept of Thrift proxy and how the thrift language bindings are generated.
  2. How Python thrift functions align with the correct settings of HBase configurations from Cloudera Manager.
  3. Sample client code for HBase clusters with security disabled or enabled.
  4. Some known bugs when using TSaslClientTransport with Kerberos enabled in some CDP versions.

The basic concept of Thrift proxy and how the Thrift cross-language bindings are generated

The Apache Thrift library provides cross-language client-server remote procedure calls (RPCs) using Thrift bindings. A Thrift binding is client code that the Apache Thrift compiler generates for a target language (such as Python); clients use that code to communicate with the Thrift server. HBase includes an Apache Thrift Proxy API, which allows you to write HBase applications in Python, C, C++, or any other language that Thrift supports. The Thrift Proxy API is slower than the native Java API and may have fewer features. To use the Thrift Proxy API, you need to configure and run the HBase Thrift server on your cluster. You also need to install the Apache Thrift compiler on your development system.

[Figure omitted. Image credit: Programmer’s Guide to Apache Thrift]

The IDL file, Hbase.thrift, ships in the CDP parcels. To locate it, run:

 

find / -name "Hbase.thrift"

 

Install the IDL compiler by following the steps in Building Apache Thrift on CentOS 6.5.

Then follow this article to generate the Python library bindings (server stubs). After that, you should be able to import the generated Python libraries into your client code.
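As a rough illustration of the two steps above (locating the IDL file and invoking the compiler), here is a minimal Python sketch. The helper names find_idl and gen_py_command and the search root /opt/cloudera/parcels are assumptions made for this example, not part of the original procedure. The thrift options used are the standard --gen py (Python generator) and -o (output directory).

```python
import os
import shlex


def find_idl(root, name="Hbase.thrift"):
    """Return the first path under `root` whose file name equals `name`, or None."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if name in filenames:
            return os.path.join(dirpath, name)
    return None


def gen_py_command(idl_path, out_dir="."):
    """Build the Thrift compiler command that generates Python bindings."""
    # `--gen py` selects the Python generator; `-o` sets where gen-py/ is written.
    return ["thrift", "-o", out_dir, "--gen", "py", idl_path]


if __name__ == "__main__":
    # "/opt/cloudera/parcels" is an assumed search root; adjust it to your layout.
    idl = find_idl("/opt/cloudera/parcels")
    if idl is None:
        print("Hbase.thrift not found")
    else:
        print(" ".join(shlex.quote(part) for part in gen_py_command(idl)))
```

The compiler writes its output to a gen-py directory under the chosen output directory. That directory has to be on the Python path before the generated modules can be imported.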

How Python functions align with the HBase Configurations from Cloudera Manager

Most examples use several classes and functions to interact with Thrift. The underlying concepts (transport, socket, and protocol) are described in the book Programmer’s Guide to Apache Thrift.

[Figure omitted. Image credit: Programmer’s Guide to Apache Thrift]

This section describes how the Python Thrift classes map to the HBase configuration.

The client must match these server-side settings:

  • Is SSL enabled? (Search “SSL” in CM > HBase configuration; it is usually auto-enabled by CM.)
    • If yes, use TSSLSocket; otherwise use TSocket.
  • Is hbase.thrift.security.qop=auth-conf? This means Kerberos is enabled.
    • If yes, use TSaslClientTransport.
  • Is hbase.regionserver.thrift.compact=true?
    • If yes, use TCompactProtocol; otherwise use TBinaryProtocol.
  • Is hbase.regionserver.thrift.framed=true?
    • If yes, use TFramedTransport; otherwise use TBufferedTransport.
  • Are hbase.regionserver.thrift.http=true and hbase.thrift.support.proxyuser=true?
    • If yes, the doAs implementation is required, so use THttpClient. HTTP mode cannot coexist with framed mode.
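The checklist above can be read as a small decision procedure. The sketch below is one possible encoding of it. The function name choose_thrift_stack is hypothetical, and it returns class names as strings so that it runs without the Thrift library installed. Giving SASL precedence over the framed/buffered choice is an assumption of this sketch, based on the Kerberos sample in this article wrapping the socket directly in TSaslClientTransport.

```python
def choose_thrift_stack(ssl_enabled=False, qop=None, compact=False,
                        framed=False, http=False):
    """Map HBase Thrift server settings to Python Thrift class names."""
    if http and framed:
        raise ValueError("HTTP mode cannot be combined with framed mode")

    protocol = "TCompactProtocol" if compact else "TBinaryProtocol"

    if http:
        # THttpClient takes an http:// or https:// URL, so no separate socket.
        return {"socket": None, "transport": "THttpClient", "protocol": protocol}

    socket = "TSSLSocket" if ssl_enabled else "TSocket"
    if qop:
        # A QOP value such as "auth-conf" means the server expects SASL/Kerberos.
        transport = "TSaslClientTransport"
    elif framed:
        transport = "TFramedTransport"
    else:
        transport = "TBufferedTransport"
    return {"socket": socket, "transport": transport, "protocol": protocol}
```

For example, choose_thrift_stack(qop="auth-conf") corresponds to the first sample in the next section: TSocket, TSaslClientTransport, and TBinaryProtocol.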

Sample client code for HBase clusters with security disabled or enabled

Kerberos enabled / SSL disabled:

Settings:

  • SSL disabled
  • hbase.thrift.security.qop=auth-conf
  • hbase.regionserver.thrift.compact=false
  • hbase.regionserver.thrift.framed=false
  • hbase.regionserver.thrift.http=false
  • hbase.thrift.support.proxyuser=false

 

from subprocess import call

from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol
from hbase import Hbase
# Kerberos and SASL bindings; not called directly below.
import kerberos
import sasl

thrift_host = '<thrift host>'
thrift_port = 9090

# Call kinit to get the Kerberos ticket.
krb_service = 'hbase'
principal = 'hbase/<host>'
keytab = '/path/to/hbase.keytab'
call(['kinit', '-kt', keytab, principal])

# Plain socket (SSL disabled) wrapped in a SASL/GSSAPI transport.
socket = TSocket.TSocket(thrift_host, thrift_port)
transport = TTransport.TSaslClientTransport(socket, host=thrift_host, service=krb_service, mechanism='GSSAPI')
protocol = TBinaryProtocol.TBinaryProtocol(transport)
transport.open()
client = Hbase.Client(protocol)
print(client.getTableNames())
transport.close()

 

This works in CDH 6, but it does not work in some CDP versions because of a known bug described in the last section of this article.
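Both samples obtain a Kerberos ticket by calling kinit. A slightly more defensive way to do that is to pass the arguments as a list (so no shell is involved) and to fail loudly when kinit exits with an error. The helper names kinit_command and obtain_ticket are hypothetical, and the sketch is only one way to structure this step.

```python
import subprocess


def kinit_command(keytab, principal):
    """Return the kinit argument list for a keytab-based login."""
    return ["kinit", "-kt", keytab, principal]


def obtain_ticket(keytab, principal, runner=subprocess.check_call):
    """Run kinit; check_call raises CalledProcessError on a non-zero exit."""
    # `runner` is injectable so the function can be exercised without Kerberos.
    return runner(kinit_command(keytab, principal))
```

A call such as obtain_ticket('/path/to/hbase.keytab', 'hbase/<host>') would replace the plain call(...) line. If kinit fails, the script stops there instead of failing later when the transport is opened.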

Kerberos enabled / SSL enabled:

Settings:

  • SSL enabled
  • hbase.thrift.security.qop=auth-conf
  • hbase.regionserver.thrift.compact=false
  • hbase.regionserver.thrift.framed=false
  • hbase.regionserver.thrift.http=true
  • hbase.thrift.support.proxyuser=true

The following code is adapted from @manjilhk's post and has been tested.

 

from thrift.transport import THttpClient
from thrift.protocol import TBinaryProtocol
from hbase.Hbase import Client
from subprocess import call
import ssl
import kerberos


def kerberos_auth():
    # Drop any cached ticket, then log in from the client keytab.
    call(['kdestroy'])
    clientPrincipal = 'hbase@<DOMAIN.COM>'
    # The hbase client keytab is copied from /keytabs/hbase.keytab;
    # you can find the location using "find".
    keytab = '/path/to/hbase.keytab'
    call(['kinit', '-kt', keytab, clientPrincipal])
    # This is the HTTP service principal of HBase; check with
    # klist -kt /var/run/cloudera-scm-agent/process/<latest-thrift-process>/hbase.keytab
    hbaseService = 'HTTP/<host>@<DOMAIN.COM>'
    __, krb_context = kerberos.authGSSClientInit(hbaseService)
    kerberos.authGSSClientStep(krb_context, '')
    negotiate_details = kerberos.authGSSClientResponse(krb_context)
    headers = {'Authorization': 'Negotiate ' + negotiate_details, 'Content-Type': 'application/binary'}
    return headers


# cert_file is copied from CDP; use "find" to get the location, then scp it to your app server.
# ssl._create_unverified_context() skips SSL verification; use it only if no verification is required.
httpClient = THttpClient.THttpClient('https://<thrift server fqdn>:9090/', cert_file='/root/certs/localhost.crt', key_file='/root/certs/localhost.key', ssl_context=ssl._create_unverified_context())
httpClient.setCustomHeaders(headers=kerberos_auth())
protocol = TBinaryProtocol.TBinaryProtocol(httpClient)
httpClient.open()
client = Client(protocol)
tables = client.getTableNames()
print(tables)
httpClient.close()

 

Security (SSL and Kerberos) matters whenever applications interact with databases, and popular services such as Knox and Hue already interact with HBase through the Thrift server over an HTTP client. We therefore recommend the second method.
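The HTTPS sample above disables certificate verification with ssl._create_unverified_context(). Where verification is wanted, a context built with the standard ssl.create_default_context() can be passed to THttpClient instead. The helper name build_ssl_context and the CA file path in the usage note are assumptions for illustration. ssl.create_default_context() requires Python 2.7.9 or later, or Python 3.4 or later.

```python
import ssl


def build_ssl_context(ca_file=None, verify=True):
    """Return an SSLContext for the HTTPS connection to the Thrift server."""
    if verify:
        # Verifies the server certificate and host name; ca_file=None uses
        # the system trust store.
        return ssl.create_default_context(cafile=ca_file)
    context = ssl.create_default_context()
    context.check_hostname = False  # must be disabled before setting CERT_NONE
    context.verify_mode = ssl.CERT_NONE
    return context
```

It could then be used as ssl_context=build_ssl_context(ca_file='/path/to/ca.pem') in the THttpClient constructor, where /path/to/ca.pem stands for the CA certificate that signed the Thrift server certificate.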

Some known bugs when using TSaslClientTransport with Kerberos enabled in some CDP versions

The bug was introduced by upstream Jira HBASE-21652 and relates to Kerberos principal handling.

That change refactored the Thrift server so that the thrift2 server inherits from the thrift1 server. In the process, ThriftServerRunner and ThriftServer were merged, and the principal-switching step was omitted.

Before the refactoring, everything ran inside a doAs() block in ThriftServerRunner.run().

References

  1. Programmer’s Guide to Apache Thrift
  2. Python3 connection to Kerberos Hbase thrift HTTPS
  3. Use the Apache Thrift Proxy API
  4. How-to: Use the HBase Thrift Interface, Part 1
  5. How-to: Use the HBase Thrift Interface, Part 2: Inserting/Getting Rows

Disclaimer

Not every version was tested for this article; both methods were tested with Python 2.7.5 and Python 3.6.8.

If you encounter an issue, adapt the code to your needs. Posting questions to the Community and raising cases with Cloudera support are also recommended.
