Created on 04-24-2017 07:47 PM
This tutorial will walk you through the process of using the PyHive Python module from Dropbox to query HiveServer2. You can read more about PyHive here: PyHive
Scope
This tutorial was tested using the following environment and components:
Before we can query Hive using Python, we have to install the PyHive module and associated dependancies. Because I'm using Anaconda, I chose to use the conda
command to install PyHive. Because the PyHive module is provided by a third party, Blaze, you must specify -c blaze
with the command line.
You can read more about Blaze PyHive for Anaconda here: Blaze PyHive
We need to instal PyHive using the following command:
$ conda install -c blaze pyhive
You will be doing this installation on your local computer. You should see something similar to the following:
$ conda install -c blaze pyhive Fetching package metadata ........... Solving package specifications: . Package plan for installation in environment /Users/myoung/anaconda: The following NEW packages will be INSTALLED: pyhive: 0.1.5-py27_0 blaze sasl: 0.1.3-py27_0 blaze thrift: 0.9.2-py27_0 blaze Proceed ([y]/n)? y thrift-0.9.2-p 100% |#####################################################################################################################################| Time: 0:00:00 3.07 MB/s sasl-0.1.3-py2 100% |#####################################################################################################################################| Time: 0:00:00 15.18 MB/s pyhive-0.1.5-p 100% |#####################################################################################################################################| Time: 0:00:00 10.92 MB/s
As you can see, PyHive is dependant on the SASL and Thrift modules. Both of these modules were installed.
Now that our local computer has the PyHive module installed, we can create a very simple Python script which will query Hive.
Edit a file called pyhive-test.py
. You can do this anywhere you like, but I prefer to create a directory under ~/Development for this.
$ mkdir ~/Development/pyhive cd ~/Development/pyhive
Now copy and paste the following test into your file. You can use any text editor you like. I usually use Microsoft Visual Studio Code or Atom.
from pyhive import hive cursor = hive.connect('sandbox.hortonworks.com').cursor() cursor.execute('SELECT * FROM sample_07 LIMIT 50') print cursor.fetchall()
The sample07 database is already on the Sandbox, so this query should work without any problems.
Before we can run our Python script, we have to make sure the Sandbox is started. Go ahead and do that now.
Now that the Sandbox is runnig, we can run our script to execute the query.
$ python pyhive-test.py
You should see something similar to the following:
$ python pyhive-test.py [[u'00-0000', u'All Occupations', 134354250, 40690], [u'11-0000', u'Management occupations', 6003930, 96150], [u'11-1011', u'Chief executives', 299160, 151370], [u'11-1021', u'General and operations managers', 1655410, 103780], [u'11-1031', u'Legislators', 61110, 33880], [u'11-2011', u'Advertising and promotions managers', 36300, 91100], [u'11-2021', u'Marketing managers', 165240, 113400], [u'11-2022', u'Sales managers', 322170, 106790], [u'11-2031', u'Public relations managers', 47210, 97170], [u'11-3011', u'Administrative services managers', 239360, 76370], [u'11-3021', u'Computer and information systems managers', 264990, 113880], [u'11-3031', u'Financial managers', 484390, 106200], [u'11-3041', u'Compensation and benefits managers', 41780, 88400], [u'11-3042', u'Training and development managers', 28170, 90300], [u'11-3049', u'Human resources managers, all other', 58100, 99810], [u'11-3051', u'Industrial production managers', 152870, 87550], [u'11-3061', u'Purchasing managers', 65600, 90430], [u'11-3071', u'Transportation, storage, and distribution managers', 92790, 81980], [u'11-9011', u'Farm, ranch, and other agricultural managers', 3480, 61030], [u'11-9012', u'Farmers and ranchers', 340, 42480], [u'11-9021', u'Construction managers', 216120, 85830], [u'11-9031', u'Education administrators, preschool and child care center/program', 47980, 44430], [u'11-9032', u'Education administrators, elementary and secondary school', 218820, 82120], [u'11-9033', u'Education administrators, postsecondary', 101160, 85870], [u'11-9039', u'Education administrators, all other', 28640, 74230], [u'11-9041', u'Engineering managers', 184410, 115610], [u'11-9051', u'Food service managers', 191460, 48660], [u'11-9061', u'Funeral directors', 24020, 57660], [u'11-9071', u'Gaming managers', 3740, 69600], [u'11-9081', u'Lodging managers', 31890, 51140], [u'11-9111', u'Medical and health services managers', 242640, 84980], [u'11-9121', u'Natural sciences managers', 39370, 113170], [u'11-9131', u'Postmasters and mail superintendents', 26500, 57850], [u'11-9141', u'Property, real estate, and community association managers', 159660, 53530], [u'11-9151', u'Social and community service managers', 112330, 59070], [u'11-9199', u'Managers, all other', 356690, 91990], [u'13-0000', u'Business and financial operations occupations', 6015500, 62410], [u'13-1011', u'Agents and business managers of artists, performers, and athletes', 11680, 82730], [u'13-1021', u'Purchasing agents and buyers, farm products', 12930, 53980], [u'13-1022', u'Wholesale and retail buyers, except farm products', 132550, 53580], [u'13-1023', u'Purchasing agents, except wholesale, retail, and farm products', 281950, 56060], [u'13-1031', u'Claims adjusters, examiners, and investigators', 279400, 55470], [u'13-1032', u'Insurance appraisers, auto damage', 12150, 52020], [u'13-1041', u'Compliance officers, except agriculture, construction, health and safety, and transportation', 231910, 52740], [u'13-1051', u'Cost estimators', 219070, 58640], [u'13-1061', u'Emergency management specialists', 11610, 51470], [u'13-1071', u'Employment, recruitment, and placement specialists', 193620, 52710], [u'13-1072', u'Compensation, benefits, and job analysis specialists', 109870, 55740], [u'13-1073', u'Training and development specialists', 202820, 53040], [u'13-1079', u'Human resources, training, and labor relations specialists, all other', 211770, 56740]]
As you can see, using Python to query Hive is fairly straight forward. We were able to install the required Python modules in a single command, create a quick Python script and run the script to get 50 records from the sample07 database in Hive.
Created on 05-05-2017 05:00 PM
Traceback (most recent call last): File "testhive.py", line 1, in <module> from pyhive import hive File "/usr/local/lib/python2.7/site-packages/pyhive/hive.py", line 10, in <module> from TCLIService import TCLIService File "/usr/local/lib/python2.7/site-packages/TCLIService/TCLIService.py", line 9, in <module> from thrift.Thrift import TType, TMessageType, TFrozenDict, TException, TApplicationException ImportError: No module named thrift.Thrift
Any ideas?
Created on 05-08-2017 04:30 PM
Great article...BUT...this only seems to work for small queries "show databases", "show tables", "SELECT * FROM X",....when I tried a JOIN like...
'SELECT * from spin1.patient JOIN cohortdefinitions.cohort_first25_spin1db ON patient.id=cohort_first25_spin1db.id'
It fails. with
errorCode=1, errorMessage='Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask'
ANY IDEAS????
Created on 05-08-2017 05:40 PM
Login with username
Created on 12-27-2019 06:12 AM
Hi @TimothySpann ,
Could you please let me know if you was able to resolve issue, since I am also facing similar issue.
"
Traceback (most recent call last): File "testhive.py", line 1, in <module> from pyhive import hive File "/usr/local/lib/python2.7/site-packages/pyhive/hive.py", line 10, in <module> from TCLIService import TCLIService File "/usr/local/lib/python2.7/site-packages/TCLIService/TCLIService.py", line 9, in <module> from thrift.Thrift import TType, TMessageType, TFrozenDict, TException, TApplicationException ImportError: No module named thrift.Thrift
Any ideas?
"
Created on 12-27-2019 06:18 AM
Created on 03-30-2020 11:45 AM
Can I use pyhive to connect to Hive using Hive JDBC string instead of a single hostname? The following doesn't work for me.
from pyhive import hive
hive_conn = hive.Connection(host=<JDBC STRING>, configuration {'serviceDiscoveryMode':'zooKeeper','zooKeeperNamespace':'hiveserver2'})