Created on 09-19-2021 08:38 PM - edited on 09-29-2021 10:17 PM by subratadas
The Thrift proxy is a modern microservice framework compared with alternatives such as SOAP, JSON-RPC, and REST proxies. The Thrift proxy API offers higher performance, scales better, and supports many languages (C++, Java, Python, PHP, Ruby, Perl, C#, Objective-C, JavaScript, Node.js, and others).
Applications can interact with HBase through the Thrift proxy.
This article discusses how to use the correct libraries and methods to interact with HBase through the Thrift proxy.
The Apache Thrift library provides cross-language client-server remote procedure calls (RPCs) using Thrift bindings. A Thrift binding is client code generated by the Apache Thrift compiler for a target language (such as Python); clients use that code to communicate with the Thrift server. HBase includes an Apache Thrift Proxy API, which allows you to write HBase applications in Python, C, C++, or another language that Thrift supports. The Thrift Proxy API is slower than the Java API and may have fewer features. To use the Thrift Proxy API, you need to configure and run the HBase Thrift server on your cluster. You also need to install the Apache Thrift compiler on your development system.
Image credits: The above figure is copied from Programmer’s Guide to Apache Thrift
The IDL file named Hbase.thrift is in CDP parcels.
find / -name "Hbase.thrift"
Install the IDL compiler by following the steps in Building Apache Thrift on CentOS 6.5.
Follow this article to generate Python library bindings (Server stubs). Now, you should be able to import Python libraries into your client code.
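The steps above (locate the IDL, run the compiler, import the result) can be sketched in Python as follows. This is a minimal sketch, not the procedure from the linked article: the helper names find_idl, gen_py_command, and add_bindings_to_path are hypothetical, and it assumes the thrift compiler is on your PATH and that the generated package lands in a gen-py directory under the chosen output directory.

```python
import os
import sys


def find_idl(root, name="Hbase.thrift"):
    """Return the first path under root whose file name is `name`, or None."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if name in filenames:
            return os.path.join(dirpath, name)
    return None


def gen_py_command(idl_path, out_dir="."):
    """Build the compiler invocation that writes bindings to <out_dir>/gen-py."""
    return ["thrift", "-o", out_dir, "--gen", "py", idl_path]


def add_bindings_to_path(out_dir="."):
    """Put <out_dir>/gen-py on sys.path so client code can import the bindings."""
    gen_py = os.path.abspath(os.path.join(out_dir, "gen-py"))
    if gen_py not in sys.path:
        sys.path.insert(0, gen_py)
    return gen_py
```

A possible use is to pass the result of gen_py_command(find_idl(<parcel root>)) to subprocess.check_call, then call add_bindings_to_path() before importing the generated module. The parcel root is left as a placeholder because it depends on your installation.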
Many examples use several functions to interact with Thrift. The concepts of transport, socket, and protocol are described in the book Programmer’s Guide to Apache Thrift.
Image credits: The above figure is copied from Programmer’s Guide to Apache Thrift
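To make the layering concrete, here is a toy sketch of how a protocol sits on top of a transport. It does not use the Thrift library: BufferTransport and TinyBinaryProtocol are hypothetical stand-ins written only for illustration. The byte layout (a big-endian 32-bit integer, and a string written as a length prefix followed by its bytes) is intended to mirror how Thrift's binary protocol encodes these two types.

```python
import io
import struct


class BufferTransport(object):
    """Hypothetical stand-in for a transport: it only moves raw bytes."""

    def __init__(self):
        self._buf = io.BytesIO()

    def write(self, data):
        self._buf.write(data)

    def getvalue(self):
        return self._buf.getvalue()


class TinyBinaryProtocol(object):
    """Hypothetical stand-in for a protocol: it turns values into bytes."""

    def __init__(self, transport):
        self._trans = transport

    def write_i32(self, value):
        # 32-bit signed integer in big-endian (network) byte order
        self._trans.write(struct.pack("!i", value))

    def write_string(self, text):
        raw = text.encode("utf-8")
        self.write_i32(len(raw))  # length prefix
        self._trans.write(raw)    # payload bytes


transport = BufferTransport()
protocol = TinyBinaryProtocol(transport)
protocol.write_string("t1")
```

The protocol never touches a socket and the transport never interprets the bytes, which is why the real client code can swap a SASL socket transport for an HTTP transport while keeping the same protocol class.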
We will discuss how the functions work with HBase configurations.
These parameters are taken into consideration:
Kerberos enabled / SSL disabled:
Settings:
from thrift.transport import TSocket
from thrift.protocol import TBinaryProtocol
from thrift.transport import TTransport
from hbase import Hbase
import kerberos
import sasl
from subprocess import call
thrift_host='<thrift host>'
thrift_port=9090
# call kinit to get the Kerberos ticket.
krb_service='hbase'
principal='hbase/<host>'
keytab="/path/to/hbase.keytab"
kinitCommand="kinit"+" "+"-kt"+" "+keytab+" "+principal
call(kinitCommand,shell=True)
# SASL/GSSAPI transport layered over a plain TCP socket
socket = TSocket.TSocket(thrift_host, thrift_port)
transport = TTransport.TSaslClientTransport(socket,host=thrift_host,service=krb_service,mechanism='GSSAPI')
protocol = TBinaryProtocol.TBinaryProtocol(transport)
transport.open()
client = Hbase.Client(protocol)
print(client.getTableNames())
transport.close()
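The kinit step in the script above builds one command string and hands it to a shell. A minimal alternative sketch, assuming kinit is on the PATH, passes the arguments as a list so that no shell is involved and a failed kinit raises an error instead of passing silently. The function names build_kinit_command and obtain_ticket are hypothetical.

```python
import subprocess


def build_kinit_command(keytab, principal):
    """Return the kinit argument list for a keytab-based login."""
    return ["kinit", "-kt", keytab, principal]


def obtain_ticket(keytab, principal):
    """Run kinit; raises subprocess.CalledProcessError on a non-zero exit."""
    subprocess.check_call(build_kinit_command(keytab, principal))
```

Calling obtain_ticket("/path/to/hbase.keytab", "hbase/<host>") before opening the transport would replace the call(kinitCommand, ...) line.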
This works in CDH 6, but does not work in some CDP versions due to a known bug described in the next section.
Settings:
The following code is adapted from @manjilhk's post here and has been tested.
from thrift.transport import THttpClient
from thrift.protocol import TBinaryProtocol
from hbase.Hbase import Client
from subprocess import call
import ssl
import kerberos
def kerberos_auth():
    # drop any cached ticket, then log in from the client keytab
    call("kdestroy",shell=True)
    clientPrincipal='hbase@<DOMAIN.COM>'
    # hbase client keytab is copied from /keytabs/hbase.keytab
    # you can find the location using "find"
    keytab="/path/to/hbase.keytab"
    kinitCommand="kinit"+" "+"-kt"+" "+keytab+" "+clientPrincipal
    call(kinitCommand,shell=True)
    # this is the hbase service principal of HTTP, check with
    # klist -kt /var/run/cloudera-scm-agent/process/<latest-thrift-process>/hbase.keytab
    hbaseService="HTTP/<host>@<DOMAIN.COM>"
    __, krb_context = kerberos.authGSSClientInit(hbaseService)
    kerberos.authGSSClientStep(krb_context, "")
    negotiate_details = kerberos.authGSSClientResponse(krb_context)
    # SPNEGO token sent as an HTTP Authorization header
    headers = {'Authorization': 'Negotiate ' + negotiate_details,'Content-Type':'application/binary'}
    return headers
# cert_file is copied from CDP; use "find" to get the location, then scp it to your app server.
# ssl._create_unverified_context() skips certificate verification; use it only if no SSL verification is required.
httpClient = THttpClient.THttpClient('https://<thrift server fqdn>:9090/', cert_file='/root/certs/localhost.crt',key_file='/root/certs/localhost.key', ssl_context=ssl._create_unverified_context())
httpClient.setCustomHeaders(headers=kerberos_auth())
protocol = TBinaryProtocol.TBinaryProtocol(httpClient)
httpClient.open()
client = Client(protocol)
tables=client.getTableNames()
print(tables)
httpClient.close()
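Both scripts above close the transport only when every preceding call succeeds. One way to guarantee the close, sketched here under the assumption that the transport object exposes open() and close() as in the examples, is a small context manager. The name opened is hypothetical and not part of the Thrift library.

```python
from contextlib import contextmanager


@contextmanager
def opened(transport):
    """Open the transport, yield it, and always close it afterwards."""
    transport.open()
    try:
        yield transport
    finally:
        # runs on normal exit and when the body raises
        transport.close()
```

With this helper, the last lines of the HTTP example could be written as a with opened(httpClient): block that contains the client.getTableNames() call.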
Security (SSL and Kerberos) is essential when applications interact with databases, and many popular services such as Knox and Hue interact with HBase through the Thrift server over an HTTP client. We therefore recommend the second method.
The known bug was introduced by upstream Jira HBASE-21652 and relates to Kerberos principal handling.
During that refactoring of the Thrift server, which made the thrift2 server inherit from the thrift1 server, ThriftServerRunner and ThriftServer were merged and the principal-switching step was omitted.
Before the refactoring, everything was run in a doAs() block in ThriftServerRunner.run().
Not all versions were tested for this article; both methods were tested with Python 2.7.5 and Python 3.6.8.
If you encounter an issue, adapt the code to your needs. Posting questions to the Community or raising a case with Cloudera support is also recommended.