Community Articles

Find and share helpful community-sourced technical articles.
Announcements
Celebrating as our community reaches 100,000 members! Thank you!
Labels (1)
avatar
Super Guru

Objective

This tutorial will walk you through the process of using the PyHive Python module from Dropbox to query HiveServer2. You can read more about PyHive here: PyHive

Prerequisites

  • You should already have Python 2.7 installed.
  • You should already have a version of the Hortonworks Sandbox 2.5 setup.

Scope

This tutorial was tested using the following environment and components:

  • Mac OS X 10.12.3
  • Anaconda 4.3.1 (Python 2.7.13)
  • Hortonworks HDP Sandbox 2.5
  • PyHive 0.1.5

Steps

Install PyHive and Dependancies

Before we can query Hive using Python, we have to install the PyHive module and associated dependancies. Because I'm using Anaconda, I chose to use the conda command to install PyHive. Because the PyHive module is provided by a third party, Blaze, you must specify -c blaze with the command line.

You can read more about Blaze PyHive for Anaconda here: Blaze PyHive

We need to instal PyHive using the following command:

$ conda install -c blaze pyhive

You will be doing this installation on your local computer. You should see something similar to the following:

$ conda install -c blaze pyhive
Fetching package metadata ...........
Solving package specifications: .

Package plan for installation in environment /Users/myoung/anaconda:

The following NEW packages will be INSTALLED:

    pyhive: 0.1.5-py27_0 blaze
    sasl:   0.1.3-py27_0 blaze
    thrift: 0.9.2-py27_0 blaze

Proceed ([y]/n)? y

thrift-0.9.2-p 100% |#####################################################################################################################################| Time: 0:00:00   3.07 MB/s
sasl-0.1.3-py2 100% |#####################################################################################################################################| Time: 0:00:00  15.18 MB/s
pyhive-0.1.5-p 100% |#####################################################################################################################################| Time: 0:00:00  10.92 MB/s

As you can see, PyHive is dependant on the SASL and Thrift modules. Both of these modules were installed.

Create Python Script

Now that our local computer has the PyHive module installed, we can create a very simple Python script which will query Hive.

Edit a file called pyhive-test.py. You can do this anywhere you like, but I prefer to create a directory under ~/Development for this.

$ mkdir ~/Development/pyhive
cd ~/Development/pyhive

Now copy and paste the following test into your file. You can use any text editor you like. I usually use Microsoft Visual Studio Code or Atom.

from pyhive import hive

cursor = hive.connect('sandbox.hortonworks.com').cursor()
cursor.execute('SELECT * FROM sample_07 LIMIT 50')
print cursor.fetchall()

The sample07 database is already on the Sandbox, so this query should work without any problems.

Start Hortonworks HDP Sandbox

Before we can run our Python script, we have to make sure the Sandbox is started. Go ahead and do that now.

Run Python Script

Now that the Sandbox is runnig, we can run our script to execute the query.

$ python pyhive-test.py

You should see something similar to the following:

$ python pyhive-test.py
[[u'00-0000', u'All Occupations', 134354250, 40690], [u'11-0000', u'Management occupations', 6003930, 96150], [u'11-1011', u'Chief executives', 299160, 151370], [u'11-1021', u'General and operations managers', 1655410, 103780], [u'11-1031', u'Legislators', 61110, 33880], [u'11-2011', u'Advertising and promotions managers', 36300, 91100], [u'11-2021', u'Marketing managers', 165240, 113400], [u'11-2022', u'Sales managers', 322170, 106790], [u'11-2031', u'Public relations managers', 47210, 97170], [u'11-3011', u'Administrative services managers', 239360, 76370], [u'11-3021', u'Computer and information systems managers', 264990, 113880], [u'11-3031', u'Financial managers', 484390, 106200], [u'11-3041', u'Compensation and benefits managers', 41780, 88400], [u'11-3042', u'Training and development managers', 28170, 90300], [u'11-3049', u'Human resources managers, all other', 58100, 99810], [u'11-3051', u'Industrial production managers', 152870, 87550], [u'11-3061', u'Purchasing managers', 65600, 90430], [u'11-3071', u'Transportation, storage, and distribution managers', 92790, 81980], [u'11-9011', u'Farm, ranch, and other agricultural managers', 3480, 61030], [u'11-9012', u'Farmers and ranchers', 340, 42480], [u'11-9021', u'Construction managers', 216120, 85830], [u'11-9031', u'Education administrators, preschool and child care center/program', 47980, 44430], [u'11-9032', u'Education administrators, elementary and secondary school', 218820, 82120], [u'11-9033', u'Education administrators, postsecondary', 101160, 85870], [u'11-9039', u'Education administrators, all other', 28640, 74230], [u'11-9041', u'Engineering managers', 184410, 115610], [u'11-9051', u'Food service managers', 191460, 48660], [u'11-9061', u'Funeral directors', 24020, 57660], [u'11-9071', u'Gaming managers', 3740, 69600], [u'11-9081', u'Lodging managers', 31890, 51140], [u'11-9111', u'Medical and health services managers', 242640, 84980], [u'11-9121', u'Natural sciences managers', 39370, 113170], [u'11-9131', u'Postmasters and mail superintendents', 26500, 57850], [u'11-9141', u'Property, real estate, and community association managers', 159660, 53530], [u'11-9151', u'Social and community service managers', 112330, 59070], [u'11-9199', u'Managers, all other', 356690, 91990], [u'13-0000', u'Business and financial operations occupations', 6015500, 62410], [u'13-1011', u'Agents and business managers of artists, performers, and athletes', 11680, 82730], [u'13-1021', u'Purchasing agents and buyers, farm products', 12930, 53980], [u'13-1022', u'Wholesale and retail buyers, except farm products', 132550, 53580], [u'13-1023', u'Purchasing agents, except wholesale, retail, and farm products', 281950, 56060], [u'13-1031', u'Claims adjusters, examiners, and investigators', 279400, 55470], [u'13-1032', u'Insurance appraisers, auto damage', 12150, 52020], [u'13-1041', u'Compliance officers, except agriculture, construction, health and safety, and transportation', 231910, 52740], [u'13-1051', u'Cost estimators', 219070, 58640], [u'13-1061', u'Emergency management specialists', 11610, 51470], [u'13-1071', u'Employment, recruitment, and placement specialists', 193620, 52710], [u'13-1072', u'Compensation, benefits, and job analysis specialists', 109870, 55740], [u'13-1073', u'Training and development specialists', 202820, 53040], [u'13-1079', u'Human resources, training, and labor relations specialists, all other', 211770, 56740]]

Review

As you can see, using Python to query Hive is fairly straight forward. We were able to install the required Python modules in a single command, create a quick Python script and run the script to get 50 records from the sample07 database in Hive.

75,801 Views
Comments
avatar
Master Guru

Traceback (most recent call last): File "testhive.py", line 1, in <module> from pyhive import hive File "/usr/local/lib/python2.7/site-packages/pyhive/hive.py", line 10, in <module> from TCLIService import TCLIService File "/usr/local/lib/python2.7/site-packages/TCLIService/TCLIService.py", line 9, in <module> from thrift.Thrift import TType, TMessageType, TFrozenDict, TException, TApplicationException ImportError: No module named thrift.Thrift

Any ideas?

avatar

Great article...BUT...this only seems to work for small queries "show databases", "show tables", "SELECT * FROM X",....when I tried a JOIN like...

'SELECT * from spin1.patient JOIN cohortdefinitions.cohort_first25_spin1db ON patient.id=cohort_first25_spin1db.id'

It fails. with

errorCode=1, errorMessage='Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask'

ANY IDEAS????

avatar
Master Guru

Login with username

avatar
Contributor

Hi @TimothySpann ,

 

Could you please let me know if you was able to resolve issue, since I am also facing similar issue.

 

"

Traceback (most recent call last): File "testhive.py", line 1, in <module> from pyhive import hive File "/usr/local/lib/python2.7/site-packages/pyhive/hive.py", line 10, in <module> from TCLIService import TCLIService File "/usr/local/lib/python2.7/site-packages/TCLIService/TCLIService.py", line 9, in <module> from thrift.Thrift import TType, TMessageType, TFrozenDict, TException, TApplicationException ImportError: No module named thrift.Thrift

Any ideas?

"

avatar
Master Guru

Hive doesn't use thrift anymore.   new Python APIs work fine.

 

https://github.com/tebeka/hiver

avatar
New Contributor

Can I use pyhive to connect to Hive using Hive JDBC string instead of a single hostname? The following doesn't work for me. 

 

from pyhive import hive

hive_conn = hive.Connection(host=<JDBC STRING>, configuration {'serviceDiscoveryMode':'zooKeeper','zooKeeperNamespace':'hiveserver2'})

Version history
Last update:
‎04-24-2017 07:47 PM
Updated by:
Contributors