Member since: 11-24-2015
Posts: 56
Kudos Received: 57
Solutions: 4
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1169 | 05-21-2016 02:32 PM |
| | 1714 | 04-26-2016 05:22 AM |
| | 3018 | 01-15-2016 06:23 PM |
| | 6044 | 12-24-2015 04:52 PM |
12-01-2018
03:07 PM
Thank you for your answer. I wish I could buy you a beer. 🙂 My solution was slightly different but staring me right in the face. Your answer provided the perfect clue. In my case, there was no "universal" directory. However, I noticed there was a CA bundle file. It DID NOT come preconfigured with the GeoTrust CA we have for the SSL certificate we purchased for our F5. It did have some other GeoTrust CA, and many others as well. I simply opened the cacerts.pem file, added that GeoTrust CA to the end, saved the file, and ran my test query. It then worked!
[mpetronic@vmwhnsqsrclnt01 ~]$ echo "show tables" | isql -d, -b -v f5 mpetronic $(cat ~/.pw.dat)
cro_capacity_extract_tmp
cro_capacity_ranked_tmp
cro_capacity_report_final
cro_efficiency_extract_tmp
cro_efficiency_hourly_tmp
cro_efficiency_report_final
j1_total_user_counts
san_data_2
test
Here is what my directory structure looks like for the ODBC driver version I am using:
[root@vmwhnsqsrclnt01 lib]# tree /usr/lib/hive/
hive
└── lib
└── native
├── hiveodbc
│ ├── ErrorMessages
│ │ └── en-US
│ │ ├── DSMessages.xml
│ │ ├── HiveODBCMessages.xml
│ │ ├── ODBCMessages.xml
│ │ ├── SQLEngineMessages.xml
│ │ └── ThriftExtensionMessages.xml
│ ├── EULA.txt
│ ├── Hortonworks\ Hive\ ODBC\ Driver\ User\ Guide.pdf
│ ├── Release\ Notes.txt
│ └── Setup
│ ├── odbc.ini
│ └── odbcinst.ini
└── Linux-amd64-64
├── api.prod.quasar.nadops.net.pem
├── cacerts.pem <<< Added GeoTrust CA to end of this file
├── cacerts.pem.orig
├── HiveODBC.did
├── hortonworks.hiveodbc.ini
└── libhortonworkshiveodbc64.so
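For anyone else hitting this, the change itself amounted to appending the PEM-encoded GeoTrust CA to the driver's bundle shown above (a sketch only; the CA file name and path below are hypothetical):
cd /usr/lib/hive/lib/native/Linux-amd64-64
# Keep a copy of the shipped bundle, then append the GeoTrust CA used by the F5
cp cacerts.pem cacerts.pem.orig
cat /path/to/GeoTrustRSACA2018.pem >> cacerts.pem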
11-12-2018
11:39 PM
HDP 2.3.4.7, hive-odbc-native-2.6.1.1001-1.x86_64.rpm. I have a working setup using this ODBC driver on CentOS 7.2 via Knox. Now we have added an F5 proxy in front of Knox. I can successfully connect to Hive using beeline through the F5, as well as connect with other JDBC apps (DbVisualizer, for example, works fine). The F5 is using a GeoTrust CA for its SSL connection. My odbc.ini file looks like this, where the [Knox] section is for connecting directly via Knox and the [F5] section is for going through the F5 and then through Knox:
[ODBC Data Sources]
Knox=DSN for access to the production cluster using Beeline via Knox
[Knox]
Description=Hortonworks Knox Production DSN
Driver=/usr/lib/hive/lib/native/Linux-amd64-64/libhortonworkshiveodbc64.so
Host=api01.prod.quasar.nadops.net
port=8443
HttpPath=quasar/jupstats/hive
schema=default
ServiceDiscoveryMode=0
HiveServerType=2
AuthMech=3
ThriftTransport=2
SSL=1
TwoWaySSL=0
ClientCert=/users/mpetronic/keystore/keystore.jks
trustStorePassword=######
AllowSelfSignedServerCert=1
UseNativeQuery=1
[F5]
Description=Hortonworks Knox Production DSN (F5)
Driver=/usr/lib/hive/lib/native/Linux-amd64-64/libhortonworkshiveodbc64.so
Host=api.prod.quasar.nadops.net
port=443
# The knox/ part gets stripped by an F5 rewrite rule before the request is forwarded to Knox
HttpPath=knox/quasar/jupstats/hive
schema=default
ServiceDiscoveryMode=0
HiveServerType=2
AuthMech=3
ThriftTransport=2
SSL=1
TwoWaySSL=0
ClientCert=/users/mpetronic/keystore/keystore.jks
trustStorePassword=######
AllowSelfSignedServerCert=1
UseNativeQuery=1
I have a valid GeoTrust CA in the OS certificate store, as proven by this test:
[mpetronic@vmwhnsqsrclnt01 ~]$ openssl s_client -connect api.prod.quasar.nadops.net:443 -state
CONNECTED(00000003)
SSL_connect:before/connect initialization
SSL_connect:SSLv2/v3 write client hello A
SSL_connect:SSLv3 read server hello A
depth=2 C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert Global Root CA
verify return:1
depth=1 C = US, O = DigiCert Inc, OU = www.digicert.com, CN = GeoTrust RSA CA 2018
verify return:1
depth=0 CN = api.prod.quasar.nadops.net
verify return:1
SSL_connect:SSLv3 read server certificate A
SSL_connect:SSLv3 read server key exchange A
SSL_connect:SSLv3 read server done A
SSL_connect:SSLv3 write client key exchange A
SSL_connect:SSLv3 write change cipher spec A
SSL_connect:SSLv3 write finished A
SSL_connect:SSLv3 flush data
SSL_connect:SSLv3 read finished A
---
Certificate chain
0 s:/CN=api.prod.quasar.nadops.net
i:/C=US/O=DigiCert Inc/OU=www.digicert.com/CN=GeoTrust RSA CA 2018
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIEuTCCA6GgAwIBAgIQDFoE4dcFI+tHxeZBx3DCKDANBgkqhkiG9w0BAQsFADBe
MQswCQYDVQQGEwJVUzEVMBMGA1UEChMMRGlnaUNlcnQgSW5jMRkwFwYDVQQLExB3
d3cuZGlnaWNlcnQuY29tMR0wGwYDVQQDExRHZW9UcnVzdCBSU0EgQ0EgMjAxODAe
Fw0xODAxMTcwMDAwMDBaFw0yMTAxMTYxMjAwMDBaMCUxIzAhBgNVBAMTGmFwaS5w
cm9kLnF1YXNhci5uYWRvcHMubmV0MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIB
CgKCAQEAvzSCRUpL3S3XF82i7A/qZPOgwAbGagJCZUss5ZAU2Xb0bNi2FFGT5rhq
6sC59tAVF15BoY24KfMvxbp86B0n7OCvvVCJ4IY21/i2MNrCOM2t09UOMFdSuKhD
IdKLcYNrw8xhvMqGPeyFgHYZhPFfvB2XrwO9K8hjwgq01AHB490W6PylU+S0cedW
J1pqC57gXne3sQd20+DRvLZ9dP/phMofeMoO5zrcUZIMZ1cjP32LEVjwQ93CEgbG
BZ+cdHNJcyDtoP5t4Z2of8+f5mp2WK583vQVpKdZYw/AT1ZdEQrcePDVt+wAROHJ
x2SRtk2ouIcBUCcwMH4goIFparN4lQIDAQABo4IBqjCCAaYwHwYDVR0jBBgwFoAU
kFj/sJx1qFFUd7Ht8qNDFjiebMUwHQYDVR0OBBYEFJwKxfT5OhncAC2KcYLtJVnu
0ajNMCUGA1UdEQQeMByCGmFwaS5wcm9kLnF1YXNhci5uYWRvcHMubmV0MA4GA1Ud
DwEB/wQEAwIFoDAdBgNVHSUEFjAUBggrBgEFBQcDAQYIKwYBBQUHAwIwPgYDVR0f
BDcwNTAzoDGgL4YtaHR0cDovL2NkcC5nZW90cnVzdC5jb20vR2VvVHJ1c3RSU0FD
QTIwMTguY3JsMEwGA1UdIARFMEMwNwYJYIZIAYb9bAECMCowKAYIKwYBBQUHAgEW
HGh0dHBzOi8vd3d3LmRpZ2ljZXJ0LmNvbS9DUFMwCAYGZ4EMAQIBMHUGCCsGAQUF
BwEBBGkwZzAmBggrBgEFBQcwAYYaaHR0cDovL3N0YXR1cy5nZW90cnVzdC5jb20w
PQYIKwYBBQUHMAKGMWh0dHA6Ly9jYWNlcnRzLmdlb3RydXN0LmNvbS9HZW9UcnVz
dFJTQUNBMjAxOC5jcnQwCQYDVR0TBAIwADANBgkqhkiG9w0BAQsFAAOCAQEAqv5b
CxKeHDV8+j86I9zvXoPwMq+a+SpjQHeTgNjU3zsnQWxyDVH6gqnA9xmt/c36AcTd
mqeLZMsdDNMAimeEk7hCCaQcuOp0mLlxqlxNnPas6kTzvpM3KpBKUeFbCW77jkFZ
Q1WU/oq9jZZCtvl5SZadwmSzXAp9PxDuRIJEYtpHs/Mo6RQWZJcOmIN1aGfUOuav
+xOLhUK/FQ+e4riUcuLiTcxLBHtqkMORHP4t3GWCNv1Sj/i8qwI/SpU3DBvryp3Z
RXKA4okXFPQ8fKfq5HHyhpmGRhWlCUIHSnyjo9QzKvH0A18f+DDtQ/1KWU9lQHcO
BaZOw378zMc0PHHceg==
-----END CERTIFICATE-----
subject=/CN=api.prod.quasar.nadops.net
issuer=/C=US/O=DigiCert Inc/OU=www.digicert.com/CN=GeoTrust RSA CA 2018
---
No client certificate CA names sent
Peer signing digest: SHA1
Server Temp Key: ECDH, P-256, 256 bits
---
SSL handshake has read 1718 bytes and written 415 bytes
---
New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES256-GCM-SHA384
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
Protocol : TLSv1.2
Cipher : ECDHE-RSA-AES256-GCM-SHA384
Session-ID: 4F057AE288401B87506EF68BA0F2AFC3B3182E20100F5E0778493198BD2C6984
Session-ID-ctx:
Master-Key: 04CE701B1414899E854E136F858D7907BDA2ED52BDDB41EBA65F48A7102847B222AC0CAA1B3E733D6945139A752E0331
Key-Arg : None
Krb5 Principal: None
PSK identity: None
PSK identity hint: None
Start Time: 1542064692
Timeout : 300 (sec)
Verify return code: 0 (ok)
---
I can successfully do this directly against Knox using the [Knox] section configuration in odbc.ini:
[mpetronic@vmwhnsqsrclnt01 ~]$ echo "show tables" | isql -d, -b -v knox mpetronic $(cat ~/.pw.dat)
cro_capacity_extract_tmp
cro_capacity_ranked_tmp
cro_capacity_report_final
cro_efficiency_extract_tmp
cro_efficiency_hourly_tmp
cro_efficiency_report_final
j1_total_user_counts
my_vsat_info_ht1100
san_data_2
test
But this fails using the [F5] section:
[mpetronic@vmwhnsqsrclnt01 ~]$ echo "show tables" | isql -d, -b -v f5 mpetronic $(cat ~/.pw.dat)
[S1000][unixODBC][Hortonworks][DriverSupport] (1100) SSL certificate verification failed because the certificate is missing or incorrect.
[ISQL]ERROR: Could not SQLConnect
I have also added the GeoTrust CA into the keystore being used, but it made no difference. Here you can see the Knox certificate (alias=mykey) and the GeoTrust CA (alias=geotrust):
(test) [mpetronic@vmwhnsqsrclnt01 keystore]$ keytool -list -keystore keystore.jks
Enter keystore password:
Keystore type: jks
Keystore provider: SUN
Your keystore contains 2 entries
geotrust, Nov 12, 2018, trustedCertEntry,
Certificate fingerprint (SHA1): 7C:CC:2A:87:E3:94:9F:20:57:2B:18:48:29:80:50:5F:A9:0C:AC:3B
mykey, Oct 9, 2017, trustedCertEntry,
Certificate fingerprint (SHA1): F1:75:91:59:8D:BE:B1:79:AC:02:7F:66:FE:0A:53:E2:89:98:75:BA
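For completeness, adding the GeoTrust CA to that keystore would have looked something along these lines (a sketch; the CA file name is hypothetical):
keytool -importcert -alias geotrust -file GeoTrustRSACA2018.pem -keystore keystore.jks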
At this point, I am stumped. I am unclear as to what cannot find the certificate, or what certificate it is looking for. I know that some applications do not look for CA certificates in the OS system CA bundles, so maybe isql is not looking there? However, I also get the same error with a test Python app using pyodbc, so I doubt that is the issue. Any help would be greatly appreciated!
04-24-2018
12:24 PM
Heartbeats work fine from an ambari-agent host with this:
rpm -qa openssl
openssl-1.0.1e-51.el7_2.5.x86_64
But not with this:
rpm -qa openssl
openssl-1.0.2k-8.el7.x86_64
With the newer version of openssl, the Ambari agent attempts to connect to the Ambari server using HTTPS instead of HTTP. In our setup, Ambari is restricted to internal cluster users (admins) only and therefore is not set up for HTTPS. This results in lost heartbeats. You can work around it by changing the default verification rule for Python on each agent host like this:
sed -i 's/verify=platform_default/verify=disable/' /etc/python/cert-verification.cfg
ambari-agent restart
I know, not the best solution, because it changes the security default for Python host-wide. But, as an interim fix, it works.
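To confirm the setting took effect on an agent host, a quick check like this works (a sketch; same file path as in the sed command above):
# Expect to see verify=disable after the change above
grep 'verify=' /etc/python/cert-verification.cfg
ambari-agent status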
08-18-2017
09:11 PM
You should be able to export an environment variable and place the odbc.ini file anywhere you want - somewhere you do have write access. Can you try this and then run your test? I never tried making it work completely with just a connect string.
export ODBCINI=/path/to/your/odbc.ini
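For example, something like this (the path and DSN name here are placeholders):
# Point unixODBC at a custom odbc.ini location, then test the DSN with isql
export ODBCINI=/home/<your-user>/odbc/odbc.ini
echo "show databases" | isql -v <your-dsn-name> <your-username> <your-password>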
05-04-2017
05:23 AM
I just started playing around with Airflow to orchestrate some Hive jobs, so I am not an expert. However, I would say you should use an "insert overwrite directory 'some/dir' select ..." type of query to land the results in files, then use another task in the DAG to read that data from HDFS - maybe with an "hdfs dfs -get" or a WebHDFS REST call - to wherever you need it, or continue to process it further.
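A rough sketch of that idea, assuming the beeline JDBC URL is in $JDBC_URL and using made-up paths and table names:
# Task 1: land the query results in an HDFS directory
beeline -u "$JDBC_URL" -e "insert overwrite directory '/tmp/airflow/my_results' select col1, col2 from my_db.my_table where year=2016"
# Task 2: pull the result files to the local worker (or read them via a WebHDFS REST call instead)
hdfs dfs -get /tmp/airflow/my_results /home/airflow/staging/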
10-06-2016
12:00 PM
4 Kudos
This article outlines the steps needed to set up ODBC access to Hive via Apache Knox from a Linux workstation. While there seems to be some "reasonable" documentation out there on setting up this access from Windows, it took quite some time to figure out how to do this from Linux. For this example, I am installing the Hortonworks Linux ODBC driver on CentOS Linux release 7.2.1511 and then verifying it works using both the isql ODBC command-line tool and PyODBC.
Installation Steps
1) Install the UnixODBC Dependencies
sudo yum install unixODBC
2) Download and Install Hortonworks ODBC Linux Driver
I developed this guide using the version of the driver shown below. Note that the following URL is for the CentOS 7 version of the driver. If you are running an older version of the OS, visit https://hortonworks.com/downloads/ and use the CentOS 6 driver. Also note that the URL below will likely change as new versions of the driver are released, so it is a good idea to check the Hortonworks download page for the latest version when you run this procedure. I can tell you that a colleague used this procedure on RHEL 6.4 with the CentOS 6 driver and it worked fine.
wget http://public-repo-1.hortonworks.com/HDP/hive-odbc/2.1.5.1006/centos7/hive-odbc-native-2.1.5.1006-1.el7.x86_64.rpm
sudo yum install hive-odbc-native-2.1.5.1006-1.el7.x86_64.rpm
3) Create $HOME/.odbc.ini
You don't necessarily need this file. For example, you could build up a DSN string in code in a Python program that you use to connect to Hive using PyODBC. But you can also just specify a DSN name that PyODBC then maps to an entry in your $HOME/.odbc.ini file. Either way, here are the settings I used to build up the minimal $HOME/.odbc.ini used in this guide. Simply copy the following values into $HOME/.odbc.ini. If you want to know the details of what each field means and its optional values, you can find it all at the end of the ODBC installation document. Note that the following sample entries have some fields that you MUST change because they are environment and user specific - namely the Host, trustStorePassword, and ClientCert fields. All other fields should be left as they are.
[ODBC Data Sources]
# The below key 'Knox' can be anything but it MUST MATCH the key name used
# in the section below. The right side is just descriptive and can be anything
Knox=DSN for access to the production cluster using Beeline via Knox
[Knox]
# Optional description of this DSN
Description=Hortonworks Knox QA DSN
# Where the actual ODBC driver library is loaded from. This is the default
# location where Hortonworks installs the driver
Driver=/usr/lib/hive/lib/native/Linux-amd64-64/libhortonworkshiveodbc64.so
# The hostname of the server running the Knox gateway
Host=api01.quasar.local
# The port used to connect to the Knox gateway
port=8443
# The same httppath value used in a typical beeline connection string that
# connects via Knox. For our cluster, it is always "quasar/jupstats/hive"
HttpPath=quasar/jupstats/hive
# The database you want to be on once beeline connects to Hive. This is basically
# like executing the HQL "use my_database" command after the connection is made.
schema=default
# Must be 0 because we want to connect directly to Hiveserver2 inside the cluster.
ServiceDiscoveryMode=0
# Must be 2 because we are using hiveserver2 inside the cluster.
HiveServerType=2
# This indicates we need to provide a username and password in the connect string
AuthMech=3
# This indicates the transport mode for Hive is HTTP, which is required for Knox
ThriftTransport=2
# Indicates the connection to Knox uses SSL. Note that our Hive configuration has
# hive.server2.use.SSL=false but, the connection to hiverserver2 is via Knox and,
# to connect to Knox, you must use SSL. So, the ODBC driver is first connecting
# to Knox hence we need SSL=1 here. If you were using ODBC to directly connect
# to hiveserver2 (not via Knox), then, for the configuration we have in our cluster,
# this value would need to be 0
SSL=1
# But NOT two-way SSL meaning no client side authentication - only server-side
TwoWaySSL=0
# These two are the same values used in the beeline connect string. The password is
# the password you used to add the Knox certificate to your local user key store.
# The client certificate path is just the path to that key store
trustStorePassword=<your-trust-store-password>
ClientCert=/home/mpetronic/keystore/keystore.jks
# Must be 1 to support using self-signed certificates. Since we generated a
# self-signed certificate in our cluster, we need this.
AllowSelfSignedServerCert=1
# This prevents ODBC from doing query preprocessing which is not required since we are
# running against a Hive-aware end point
UseNativeQuery=1
Regarding the use of the UseNativeQuery=1 option, the ODBC docs say:
This option specifies whether the driver uses native HiveQL queries, or converts them into an equivalent form in HiveQL. Enabled (1): The driver does not transform the queries emitted by an application, and executes HiveQL queries directly. Disabled (0): The driver transforms the queries emitted by an application and converts them into an equivalent form in HiveQL.
Note: If the application is Hive-aware and already emits HiveQL, then enable this option to avoid the extra overhead of query transformation.
Hive logs the queries that are run in various places, like log files and the Tez View. Admins need to be able to read these to help debug query execution problems. When UseNativeQuery=1 is set, the logged queries look like what you would normally expect - this:
select count(*) from my_db.some_table where year=2016 and month=10 and day=2
But if UseNativeQuery=0 is used, then the same query ends up looking like this:
SELECT COUNT(1) AS `C_455850525f31` FROM `my_db`.`some_table` `C_6a7570315f73746174735f63726f5f6361706163697479` WHERE
((`C_6a7570315f73746174735f63726f5f6361706163697479`.`day` = 2) AND (`C_6a7570315f73746174735f63726f5f6361706163697479`.`year` = 2016)
AND (`C_6a7570315f73746174735f63726f5f6361706163697479`.`month` = 10))
If you had to debug queries in logs, which would you prefer? Right! ALWAYS set UseNativeQuery=1. Plus, per the docs, this also improves performance by reducing the overhead of unnecessary query transformation.
4) Test your Connection via Command Line
When you installed the unixODBC packages, they came with isql, an interactive SQL CLI. Run the following command to see if you can connect to Hive via ODBC through Knox using the DSN you just created above. You should get some valid results from your query:
[mpetronic@mpws Downloads]$ echo "show databases" | isql -v -d, Knox <your-Linux-username> <your-Linux-password>
+---------------------------------------+
| Connected! |
| |
| sql-statement |
| help [tablename] |
| quit |
| |
+---------------------------------------+
SQL> show databases
default
test_db
another_db
5) Test your Connection via PyODBC
The following is a simple Python app that uses PyODBC to access the same database using the same DSN. Copy the sample code below to your workstation and read the comments in the script, which explain what you need to do to run the program. Then run it to verify you can successfully connect and execute a query via Knox. It will print the query results to stdout.
#!/usr/bin/env python
import pyodbc
import argparse
'''
Demonstrates the bare bones code needed to access Hive via Beeline via Knox.
For this sample to work, you must provide either the full ODBC configuration
via a connection string or reference an odbc.ini file that already has a DSN
defined for the connection to Knox you are trying to make.
To go the odbc.ini route, do the following:
1. Create .odbc.ini in your $HOME directory. Note this is a hidden file and that
is what the ODBC driver expects by default
2. Populate $HOME/.odbc.ini with the following values. These are the minimum values
needed for the file to work with Knox. I am not going to explain all the values
in detail. If you want to know what each one means, look them up in this doc:
https://hortonworks.com/wp-content/uploads/2016/03/Hortonworks-Hive-ODBC-Driver-User-Guide.pdf.
Also, change the host, trustStorePassword, and ClientCert values as needed for your setup.
[ODBC Data Sources]
# This key 'Knox' can be anything but it MUST MATCH the key name used in the
# section below. The right side is just descriptive and can be anything
Knox=DNS for access to Hive via Knox
[Knox]
# Optional description of this DSN
Description=Hortonworks Knox QA DSN
# Where the actual ODBC driver library is loaded from. This is the default
# location where Hortonworks installs the driver
Driver=/usr/lib/hive/lib/native/Linux-amd64-64/libhortonworkshiveodbc64.so
# The hostname of the server running the Knox gateway
Host=api01.qa
# The port used to connect to the Knox gateway
port=8443
# The same Http path value used in a typical beeline connect string that
# connects to Knox. For Quasar, it is always "quasar/jupstats/hive"
HttpPath=quasar/jupstats/hive
# The default database you want to be on once beeline connects to Hive
schema=jup1_stats
# Must be zero as we want to connect directly to Hiveserver2 inside the cluster.
ServiceDiscoveryMode=0
# Must be 2 because we are using hiveserver2 which is what beeline uses as well
HiveServerType=2
# This indicates we need to provide a username and password in the connect string
AuthMech=3
# This indicates the transport mode for Hive is HTTP, which is required for Knox
ThriftTransport=2
# Indicates the connection to Knox uses SSL
SSL=1
# But NOT two-way SSL meaning no client side authentication - only server-side
TwoWaySSL=0
# These two are the same values used in the beeline connect string. This is the
# password you used when you added the Knox certificate to your own trust store
trustStorePassword=<your-trust-store-password>
ClientCert=/home/mpetronic/keystore/keystore.jks
# Must be 1 to support the self-signed certificate mode we use for Knox
AllowSelfSignedServerCert=1
3. Install dependencies for this script to run. First, install the unixODBC-devel
package (via yum if you can, or download the RPM and install it that way). This
is needed for the header files that are used to build pyodbc. Then install PyODBC using
"pip install pyodbc==3.0.10". Note that the pyodbc installation needs to build
some source code, so you will also need the gcc toolchain installed.
4. Run this script and provide the required command line options to test your connection.
To instead NOT require the $HOME/.odbc.ini file, you can build a DSN string directly
in code. Below is an example of how you can build a connection string to directly pass
into the pyodbc.connect() method. Using this approach, you DO NOT need to have a .odbc.ini
file present. This gives you full programmatic control of the process if you wish to
use it this way. This simply builds a string like "A=1;B=2;C=3" containing all the
exact same parameters that could be defined in the .odbc.ini file.
'''
CONNECTION_STRING_EXAMPLE = ';'.join('''
Description=Hortonworks Knox DSN
Driver=/usr/lib/hive/lib/native/Linux-amd64-64/libhortonworkshiveodbc64.so
Host=api01
port=8443
HttpPath=quasar/jupstats/hive
schema=default
ServiceDiscoveryMode=0
HiveServerType=2
AuthMech=3
ThriftTransport=2
SSL=1
TwoWaySSL=0
trustStorePassword=<your-trust-store-password>
ClientCert=/home/mpetronic/keystore/keystore.jks
AllowSelfSignedServerCert=1
uid=<your-linux-username>
pwd=<your-linux-password>
'''.splitlines())
#===============================================================================
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--user', '-u', type=str, required=True,
        help='User name used to connect to Hive with beeline via Knox.')
    parser.add_argument('--password', '-p', type=str, required=True,
        help='Password used to connect to Hive with beeline via Knox.')
    parser.add_argument('--dsn', '-d', type=str, required=True,
        help='The DSN name from $HOME/.odbc.ini to use for this connection.')
    return parser.parse_args()
#===============================================================================
def execute_query(sql, dsn, user, password):
    conn = None
    cursor = None
    try:
        connect_string = 'DSN=%(dsn)s;uid=%(user)s;pwd=%(password)s' % locals()
        conn = pyodbc.connect(connect_string, autocommit=True)
        cursor = conn.cursor()
        cursor.execute(sql)
        for row in cursor:
            print row
    except Exception as e:
        print e
    finally:
        # Only close what was actually opened so a failed connect does not
        # raise a NameError here
        if cursor is not None:
            cursor.close()
        if conn is not None:
            conn.close()
###########################################################
# MAIN
###########################################################
if __name__ == '__main__':
    args = parse_args()
    execute_query('show databases', args.dsn, args.user, args.password)
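Assuming the script above is saved as test_odbc.py (the name is arbitrary), a run against the Knox DSN looks something like this:
chmod +x test_odbc.py
./test_odbc.py --dsn Knox --user mpetronic --password <your-password>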
09-18-2016
03:57 PM
1 Kudo
I am trying to get ODBC via Knox working on CentOS 7 using this Hortonworks ODBC driver (which support said is the right one for my Linux OS version):
hive-odbc-native-2.1.4.1004-1.el7.x86_64.rpm
After installing that, I am able to use this driver to connect directly to Hive using a DSN string like this in odbc.ini:
[testdsn]
Description=Test
Driver=/usr/lib/hive/lib/native/Linux-amd64-64/libhortonworkshiveodbc64.so
Host=sn01.qa.quasar.local:2181
Schema=default
ServiceDiscoveryMode=1
ZKNamespace=/hiveserver2
HiveServerType=2
AuthMech=3
ThriftTransport=2
Port=10001
HttpPath=cliservice
And I can test access like this and all works well. I can also use this via pyodbc and that works too.
[mpetronic@mpws ~]$ echo "show tables" | isql testdsn mpetronic anypwd
+---------------------------------------+
| Connected! |
| |
| sql-statement |
| help [tablename] |
| quit |
| |
+---------------------------------------+
SQL> show tables
+---------------------------------------+
| tab_name |
+---------------------------------------+
| test |
+---------------------------------------+
SQLRowCount returns -1
1 rows fetched
Now, I need to make this work via Knox. I have read everything I could possibly find on setting up Hive access via ODBC via Knox. Most postings are about Windows; I need help with the Linux configuration. There does not seem to be any clear documentation on how to set this up on Linux. Can anyone provide the actual, specific odbc.ini settings or DSN you are using to connect to Hive via Knox from a Linux client over ODBC? Here is the connect string that I can use to connect to Hive via Knox directly with beeline (without ODBC). I just cannot figure out how to make the equivalent work via ODBC.
beeline -u "jdbc:hive2://api01.qa:8443/;ssl=true;sslTrustStore=/home/mpetronic/keystore/keystore.jks;trustStorePassword=password;transportMode=http;httpPath=quasar/jupstats/hive" -n mpetronic -w ~/pw.txt
09-09-2016
10:18 PM
Thanks @Constantin Stanca. Parsing that log will work just fine for my use case. Appreciate your help!!!
09-06-2016
08:22 PM
2 Kudos
As an administrator, you can easily use the Tez view in Ambari to see the actual full HQL query that was executed by any user. How do you obtain the actual query that was run when hive.execution.engine=mr?
09-01-2016
03:15 PM
We have lots of partitioned tables and need to write queries whose partition clauses include year, month, and day values that are not as simple as:
where year=2016 and month=8 and day between 7 and 14
Often, they require non-contiguous ranges of days over different months and years, etc. So, I am trying to come up with a way to help users craft those partition clauses more easily and programmatically (to the greatest extent possible). I don't want users to have to write wrapper scripts around queries, or have to write queries into files first, as they can already do something like this:
beeline 'connect-string' --hivevar part_string=$(./make_part.py 2015-12-25 2016-01-07) -f some.hql
Where make_part.py might be a Python script that takes two dates and forms the full partition clause string that can then simply be referenced in some.hql like this, which I know would work:
select * from table where ${hivevar:part_string};
What I would like to do is something "functionally" like this, but from within beeline, so users can work more interactively in the beeline shell. For example, I wish you could do this from within beeline:
set hivevar:part_string=!sh ./make_part.py 2015-12-25 2016-01-07
And have the output of the !sh command become the value of the Hive variable. That does not work, of course. So I was wondering: is it possible to create a UDF that could be used in a WHERE clause and return the partition clause string, which would get evaluated by Hive properly and work as expected - meaning proper partition pruning? Something like this:
select * from table where udf_make_part("2015-12-25", "2016-01-07");
Where udf_make_part would do the same thing as make_part.py - take some date arguments and return a generated partition string. I've not worked with UDFs so far, but I am wondering if they could be used in this context in the WHERE clause. Or does anyone have another useful approach for dealing with long, complicated partition clauses?
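Purely for illustration, here is a rough shell sketch of the kind of clause a make_part.py-style helper could emit (GNU date assumed; the real helper would likely be Python, and the script name is hypothetical):
#!/usr/bin/env bash
# Emit a Hive partition predicate covering every day from $1 to $2 inclusive, e.g.
#   ./make_part.sh 2015-12-25 2016-01-07
# produces "(year=2015 and month=12 and day=25) or ... or (year=2016 and month=1 and day=7)"
start=$1
end=$2
clauses=()
d=$start
while [[ ! $d > $end ]]; do
    clauses+=("(year=$(date -d "$d" +%Y) and month=$(date -d "$d" +%-m) and day=$(date -d "$d" +%-d))")
    d=$(date -d "$d + 1 day" +%F)
done
# Join the per-day clauses with " or " and strip the leading separator
predicate=$(printf ' or %s' "${clauses[@]}")
echo "${predicate# or }"
The output could then be passed via --hivevar part_string=... exactly as in the beeline example above.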
07-06-2016
01:49 AM
Step 1 requires the following syntax tweak to include the alias, otherwise you get an error like:
keytool error: java.lang.Exception: Alias <mykey> does not exist
keytool -exportcert -file knox.crt -keystore ./gateway.jks -storepass <master-secret-password> -alias <alias-used-to-create-key>
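If you are not sure which alias was used when the key was created, listing the keystore shows the entries (a sketch using the same master secret placeholder):
keytool -list -keystore ./gateway.jks -storepass <master-secret-password>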
05-21-2016
02:32 PM
1 Kudo
I stumbled onto my own answer. It appears that you have to provide a partition spec to the alter command. I figured this out after poking around in MySQL to look at my Hive metastore to see if that would give me a clue. These queries are what made me think that the serde information is kept on a partition-by-partition basis:
mysql> select * from TBLS where TBL_NAME='rm'\G
*************************** 1. row ***************************
TBL_ID: 170
CREATE_TIME: 1463833647
DB_ID: 11
LAST_ACCESS_TIME: 0
OWNER: mpetronic
RETENTION: 0
SD_ID: 227
TBL_NAME: rm
TBL_TYPE: EXTERNAL_TABLE
VIEW_EXPANDED_TEXT: NULL
VIEW_ORIGINAL_TEXT: NULL
LINK_TARGET_ID: NULL
1 row in set (0.00 sec)
mysql> select * from SDS where CD_ID=170\G
*************************** 1. row ***************************
SD_ID: 227
CD_ID: 170
INPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
IS_COMPRESSED:
IS_STOREDASSUBDIRECTORIES:
LOCATION: hdfs://mpws:8020/jup1_stats/external_tables/rm
NUM_BUCKETS: -1
OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
SERDE_ID: 227
*************************** 2. row ***************************
SD_ID: 228
CD_ID: 170
INPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
IS_COMPRESSED:
IS_STOREDASSUBDIRECTORIES:
LOCATION: hdfs://mpws:8020/jup1_stats/external_tables/rm/year=2016/month=5/day=10
NUM_BUCKETS: -1
OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
SERDE_ID: 228
*************************** 3. row ***************************
SD_ID: 229
CD_ID: 170
INPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
IS_COMPRESSED:
IS_STOREDASSUBDIRECTORIES:
LOCATION: hdfs://mpws:8020/jup1_stats/external_tables/rm/year=2016/month=5/day=11
NUM_BUCKETS: -1
OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
SERDE_ID: 229
Once I saw all those LOCATION values in the SDS table, I tried the following command to alter the table, and then the query worked. Interesting. Does this imply that you could have a different schema for each partition? Can anyone comment as to why the Avro schema is tied to a partition and not simply to the whole table?
alter table rm partition (year=2016) set serdeproperties ('avro.schema.url' = 'hdfs://mpws:8020/jup1_stats/avro/rm_1.avsc');
Since all my data is in partitions under the "year=2016" partition, I was able to specify just that one partition and it applied the change to all partitions under it.
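A quick way to see which schema URL a given partition is actually pointing at (a sketch; the JDBC URL is a placeholder, and the partition values are from this example):
# The storage/serde parameters in the output include avro.schema.url for that partition
beeline -u "<your-jdbc-url>" -e "describe formatted rm partition (year=2016, month=5, day=10)" | grep avro.schema.url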
05-21-2016
12:46 PM
1 Kudo
I have Hive external tables using Avro serdes that look like this:
CREATE EXTERNAL TABLE `rm`(
`gw_id` bigint COMMENT '',
`rm_id` string COMMENT '',
`start_time` bigint COMMENT '',
`end_time` bigint COMMENT '',
`switch_failure_count` bigint COMMENT '')
PARTITIONED BY (
`year` int,
`month` int,
`day` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.url'='hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
'hdfs://mpws:8020/jup1_stats/external_tables/rm'
TBLPROPERTIES (
'last_modified_by'='mpetronic',
'last_modified_time'='1463833730',
'transient_lastDdlTime'='1463833730')
The Avro schemas are stored on HDFS in /jup1_stats/schema/. I want to change that path from /jup1_stats/schema/ to /jup1_stats/avro/. So, I copied my Avro schema files from /jup1_stats/schema to /jup1_stats/avro and did this:
alter table rm set serdeproperties ('avro.schema.url'='hdfs://mpws:8020/stats/avro/rm_1.avsc');
I ran a test query and it worked. But not so fast! Then I did the cleanup, removed the /jup1_stats/schema directory, ran the test query again, and got tons of stack traces - the query failed.
ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1463688329951_0007_4_00, diagnostics=[Task failed, taskId=task_1463688329951_0007_4_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.io.IOException: java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:196)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:135)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:101)
at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:149)
at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:80)
at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:650)
at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:621)
at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:145)
at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:109)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:406)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:128)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:149)
... 14 more
Caused by: java.io.IOException: java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:251)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:193)
... 25 more
It appears that Hive is still looking for the schema in the old location, based on this part of the error trace:
Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
I also tried this, but the queries still fail:
msck repair table rm;
If I recreate the schema files in the old location (/jup1_stats/schema), the query then works again. These are external tables, and I did try dropping the table and recreating it with the new location in avro.schema.url, and that works. But I would like to understand why I cannot use "alter table" to achieve the same result. Thanks!
05-19-2016
07:45 PM
I'm setting up a tiny pseudo-distributed cluster for testing. I have only one datanode. It works and I can load and query data in Hive. Great. Then I changed the HDFS replication factor from 3 to 1, and also changed the max replication factor to 1. I restarted all HDFS, YARN, and MR processes (that's all that Ambari indicated needed to be restarted). Now, when I run a Hive query, I see this in the hadoop-hdfs-namenode log:
2016-05-19 15:26:03,322 INFO ipc.Server (Server.java:run(2172)) - IPC Server handler 103 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 172.19.64.3:57450 Call#27745 Retry#0
java.io.IOException: file /tmp/hive/mpetronic/_tez_session_dir/5e16aba9-d9d3-4138-afda-58c1d0027e4d/hive-hcatalog-core.jar on client 172.19.64.3. Requested replication 3 exceeds maximum 1
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.verifyReplication(BlockManager.java:988)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2374)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2335)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:688)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:397)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
And my query fails with this...
0: jdbc:hive2://mpws:10000/default> select count(device_id) from vsat_modc_batch where device_id like 'DSS%' limit 10;
INFO : Tez session hasn't been created yet. Opening session
ERROR : Failed to execute tez graph.
java.io.IOException: Previous writer likely failed to write hdfs://mpws:8020/tmp/hive/mpetronic/_tez_session_dir/9216a3de-e9fc-4940-a4b3-21e49bcca4b5/hive-hcatalog-core.jar. Failing because I am unlikely to write too.
at org.apache.hadoop.hive.ql.exec.tez.DagUtils.localizeResource(DagUtils.java:982)
at org.apache.hadoop.hive.ql.exec.tez.DagUtils.addTempResources(DagUtils.java:862)
at org.apache.hadoop.hive.ql.exec.tez.DagUtils.localizeTempFilesFromConf(DagUtils.java:805)
at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.refreshLocalResourcesFromConf(TezSessionState.java:233)
at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:158)
at org.apache.hadoop.hive.ql.exec.tez.TezTask.updateSession(TezTask.java:271)
at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:151)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:89)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1720)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1477)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1254)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1118)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1113)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:154)
at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:71)
at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:206)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:218)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask (state=08S01,code=1)
So, it seems that something did not get the memo about the reduced replication factor. Did I miss some additional configuration needed to run at replication factor = 1? I reconfigured back to 3 and everything works again. I could not find anything indicating I need to configure something in the hive or tez clients that is related to replication.
05-07-2016
11:54 AM
3 Kudos
I spoke to two different HWX support engineers this week who are Ranger experts. I gained a bit more understanding on some points of confusion, so I thought I would pass this on in case anyone has the same questions:
Usersync can be installed on any node, as @vperiasamy noted in his comment, because, in Unix sync mode, it simply reads in the /etc/passwd and /etc/group information from the node where it is installed. In fact, per my discussion with the support guys, it is now clear that Ranger simply ONLY READS in the users and groups from the Unix settings to prepopulate the Ranger Admin DB, more for "convenience", so that you have those names present in the UI to add to policies. If you add more users and groups, Usersync will pull them in. However, deletions of users and groups are NOT reflected in the UI. I can understand why, as that would require much more complex synchronization logic that is probably being avoided for the sake of simplicity in design. They said there are scripts you can run manually to remove users and groups. The same is true if configured for AD/LDAP: it ONLY READS in users that match your search specifications against your domain of users.
The Linux users and groups should be defined on all cluster nodes so that the user/group names, UIDs, and GIDs are all the same. However, I have some doubts that you really need to do that on the data nodes. I can see this helping on the master nodes where clients are running. The support guys said that, generally speaking, having this all defined ONLY on the name nodes "should" be sufficient for most Hadoop applications to work, but indicated there are "corner cases" where some application process might look at the local node's user/group definitions and, if they are out of sync with other nodes, could result in strange behavior. So they recommend syncing up those settings on all nodes to be safe.
05-04-2016
01:38 AM
Oh, and I am running HDP 2.3.4/Ambari 2.2.1.1.
05-03-2016
10:57 PM
Ok, I was thinking that must be the case. So, I decided to just remove Ranger and start over with a clean install. I used the Ambari REST APIs to remove the service. Then I dropped all the ranger users and databases from MySQL. However, when I try to reinstall Ranger via Ambari, it will not allow me to install on any node except one - the place where I first installed it - which is NOT the NN where I want to install it. Somehow, I believe Ambari has some left over config that is causing it to lock my choice to just one node. Both Ranger Admin and UserSync are forced to the same node. All the pick lists that you normally use to select a different node to install the service on are all disabled from selection. Any idea how to release Ranger to be installed where I want to?
05-03-2016
06:54 PM
Thanks, I'm working through this but still have a question regarding my question #2. Can you please explain how usersync obtains the users/groups? Does it read the /etc/passwd and /etc/group files from the node that the usersync process is running on, or does it use some RPC call to the namenode to obtain that information from those same files on the namenode? I believe that the ONLY place I need to create users and groups is on the namenode (or both namenodes in my case, since I am running NN HA). I would just like to understand the process by which Ranger discovers the users and groups so that I can be sure where to create them in the first place. It's a bit of a mystery to me. That's why I asked whether usersync should be running on the NNs or whether it can run anywhere. You seem to state anywhere is ok.
05-03-2016
06:29 AM
1 Kudo
HDP 2.3.4/Ambari 2.2.1.1
I am having some trouble understanding how the various Ranger users are used and where to configure their passwords. What I believe to be true (based on some other posts on Ranger in this forum) is:
Ranger admin user - I understand that the password for this has to be the same in the Ranger Admin UI and under Ranger Advanced config for the field "admin_password" in the Advanced ranger-env configuration
Ranger amb_ranger_admin user - I understand that the password for this has to be the same in the Ranger Admin UI and under Ranger Advanced config for the field "Ranger Admin username for Ambari" in the Advanced Ranger Admin configuration
Ranger rangerusersync user - This one seems to be my problem. It appears, based on the error trace below, that the password for this user needs to be configured somewhere but is not correct, and I cannot figure out where I need to make changes.
Note that I have Ranger configured for Unix sync mode. The Ranger admin and usersync processes are running on host api01, and my two HA name nodes are running on hosts nn01 and nn02. So, what I am looking for is three things:
1. An explanation of how these three users are used to make Ranger operate
2. Should the usersync process be running on one of the name nodes instead of a different host? I believe that I only need to configure users and groups on the two name nodes, and that other services, like Hive, rely on the equivalent of "hdfs groups <user>" to obtain group information, and that that information comes from the Linux OS-configured user/group information on the namenode. Since I am running usersync on a non-NN host, I am thinking this might be wrong.
3. Some help in solving the error related to rangerusersync, namely this one (full traces below):
03 May 2016 06:04:41 INFO PasswordValidator [Thread-8] - Response [FAILED: [rangerusersync] does not exists.] for user: rangerusersync
03 May 2016 06:04:09 INFO UnixAuthenticationService [main] - Starting User Sync Service!
03 May 2016 06:04:09 INFO UnixAuthenticationService [main] - Enabling Unix Auth Service!
03 May 2016 06:04:09 INFO UserGroupSync [UnixUserSyncThread] - initializing sink: org.apache.ranger.unixusersync.process.PolicyMgrUserGroupBuilder
03 May 2016 06:04:10 WARN NativeCodeLoader [main] - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
03 May 2016 06:04:10 INFO UnixAuthenticationService [main] - Enabling Protocol: [SSLv2Hello]
03 May 2016 06:04:10 INFO UnixAuthenticationService [main] - Enabling Protocol: [TLSv1]
03 May 2016 06:04:10 INFO UnixAuthenticationService [main] - Enabling Protocol: [TLSv1.1]
03 May 2016 06:04:10 INFO UnixAuthenticationService [main] - Enabling Protocol: [TLSv1.2]
03 May 2016 06:04:41 INFO PasswordValidator [Thread-8] - Response [FAILED: [rangerusersync] does not exists.] for user: rangerusersync
03 May 2016 06:04:41 ERROR UserGroupSync [UnixUserSyncThread] - Failed to initialize UserGroup source/sink. Will retry after 60000 milliseconds. Error details:
com.sun.jersey.api.client.UniformInterfaceException: GET http://api01.qa.quasar.local:6080/service/xusers/groups/?pageSize=1000&startIndex=0 returned a response status of 401 Unauthorized
at com.sun.jersey.api.client.WebResource.handle(WebResource.java:686)
at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
at com.sun.jersey.api.client.WebResource$Builder.get(WebResource.java:507)
at org.apache.ranger.unixusersync.process.PolicyMgrUserGroupBuilder.buildGroupList(PolicyMgrUserGroupBuilder.java:346)
at org.apache.ranger.unixusersync.process.PolicyMgrUserGroupBuilder.buildUserGroupInfo(PolicyMgrUserGroupBuilder.java:156)
at org.apache.ranger.unixusersync.process.PolicyMgrUserGroupBuilder.init(PolicyMgrUserGroupBuilder.java:152)
at org.apache.ranger.usergroupsync.UserGroupSync.run(UserGroupSync.java:51)
at java.lang.Thread.run(Thread.java:745)
Thanks in advance...
05-02-2016
09:27 PM
@Neeraj Sabharwal, I'm a bit confused on this impersonation concept. I understand the resource reuse benefits part. But if I have different users running Hive queries and they all run as the "hive" user, then what is to prevent one user's "insert overwrite directory "some_relative_dir" select from ...." results from overwriting some other user's query results that just happen to land in the same directory path? If a relative path is supplied, then Hive will write the results to "/user/hive/some_relative_dir". I tried to read the best practices link at the end of this post, but the page is not loading.
04-27-2016
03:26 AM
Actually, I tried restarting them before the reboot. Restarted everything. Still had the errors. Then I did the reboot and they cleared. Oh well, it is what it is!
04-26-2016
05:22 AM
Well, I would still love to understand how this worked, but a reboot of the 4 DNs made this error go away. Never would have thunk it! That's pretty strange... Anyway, maybe this will help some other poor soul who hits this same condition. 🙂
04-26-2016
03:53 AM
So, I was installing a new cluster for our development QA testing and failed to adjust the default configuration that Ambari gave me for dfs.datanode.data.dir. It ended up putting in every partition it found. I really only wanted one - /grid/1, which was a dedicated disk partition for HDFS block storage. I discovered this blunder after the installation completed. I did not want to just redo the whole install, opting instead to try to manually fix this as a good deep-dive learning experience. I got it all worked out, and it was a good learning exercise, but I have one lingering issue that I cannot solve. I am getting 4 Ambari errors (one for each DN) that state: "Detected data dir(s) that became unmounted and are now writing to the root partition: /grid/1/hadoop/hdfs/data". I figured out that Ambari agents monitor (and remember) which HDFS directories were previously mounted, check whether a mounted disk goes away for some reason, and display an error if so. That's all fine - I get that. However, my setup seems to be correct, yet Ambari is still complaining. As part of correcting the configuration issue I laid upon myself, I did edit the following file (on each DN) to remove all those previously cached mount points that I did not want and left just the one I did want. I ended up stopping HDFS, removing all the /opt/hadoop/hdfs, /tmp/hadoop/hdfs, etc. directories, removing the namenode metadata directories, reformatting the namenode, and starting up HDFS. The file system is up and working. But can anyone tell me why I cannot get rid of this Ambari error? Here are the contents of one of the dfs_data_dir_mount.hist files. All 4 are exactly the same. Below that is the mount where I have a disk for HDFS data storage. It all looks good. I must be missing something obvious. I did restart everything - nothing clears this error. Thanks in advance...
[root@vmwqsrqadn01 ~]# cat /var/lib/ambari-agent/data/datanode/dfs_data_dir_mount.hist
# This file keeps track of the last known mount-point for each DFS data dir.
# It is safe to delete, since it will get regenerated the next time that the DataNode starts.
# However, it is not advised to delete this file since Ambari may
# re-create a DFS data dir that used to be mounted on a drive but is now mounted on the root.
# Comments begin with a hash (#) symbol
# data_dir,mount_point
/grid/1/hadoop/hdfs/data,/grid/1
[root@vmwqsrqadn01 ~]# mount -l | grep grid
/dev/sdb1 on /grid/1 type ext4 (rw,noatime,nodiratime,seclabel,data=ordered)
04-26-2016
03:27 AM
1 Kudo
Ya, it was fun undoing that. Ultimately, I just blew away the data and namenode dirs and recreated them from scratch after I first removed all the undesired partitions from the HDFS configs. IMO, the Ambari setup wizard should make NO guesses on critical configurations like this. It should REQUIRE you to enter the dfs.namenode.name.dir and dfs.datanode.data.dir lists of directories before you can proceed. I can think of a few other path-based parameters that it should NOT try to default for you. Anyway, thanks for the response!
04-23-2016
06:06 PM
1 Kudo
Good point, and thanks for the quick response. I believe the directories are there because I failed to set up some configs completely in the earlier steps of the install. So, for example, my dfs.namenode.name.dir is set to the following, which is bogus. But I am going to convert to NN HA immediately anyway, so all of this gets fixed. As you noted, though, the /usr/hdp/hadoop dir is one of the entries below contributing to the error:
/usr/hdp/hadoop/hdfs/namenode,/tmp/hadoop/hdfs/namenode,/zk/hadoop/hdfs/namenode,/qjm/hadoop/hdfs/namenode,/opt/hadoop/hdfs/namenode,/var/hadoop/hdfs/namenode
That sure seems like a strange concoction of directories for defaults. Anyway, it all has to change for the HA config anyway. I was just trying to get to a basic base install. The next phase is a complete config review. This is why I am building this on 8 VMs first before building it out in PROD on real HW. Working through the details on this test set first! 🙂
04-23-2016
05:22 PM
2 Kudos
I am doing a clean install of HDP 2.3.4.7 using Ambari 2.2.1.1. By default, the HDP repo URL does not point to the latest version of HDP right now, which is 2.3.4.7. I conferred with @Neeraj Sabharwal to determine if it is feasible to change that URL to 2.3.4.7 and directly do a clean install of that latest version, avoiding the need to install the earlier 2.3.4 release and then immediately have to upgrade. He said yes, so that's what I did. I encountered errors in the final "Install, Start, and Test" step of the install, where all the services are installed on the nodes. I was seeing errors from the scripts trying to obtain the HDP versions, so I wondered if my URL tweak was actually causing a problem. In the end it was not. Here is a sample error trace:
stderr:
Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/after-INSTALL/scripts/hook.py", line 37, in <module>
AfterInstallHook().execute()
File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 219, in execute
method(env)
File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/after-INSTALL/scripts/hook.py", line 31, in hook
setup_hdp_symlinks()
File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/after-INSTALL/scripts/shared_initialization.py", line 44, in setup_hdp_symlinks
hdp_select.select_all(version)
File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/hdp_select.py", line 122, in select_all
Execute(command, only_if = only_if_command)
File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 154, in __init__
self.env.run()
File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 158, in run
self.run_action(resource, action)
File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 121, in run_action
provider_action()
File "/usr/lib/python2.6/site-packages/resource_management/core/providers/system.py", line 238, in action_run
tries=self.resource.tries, try_sleep=self.resource.try_sleep)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 70, in inner
result = function(command, **kwargs)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 92, in checked_call
tries=tries, try_sleep=try_sleep)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 140, in _call_wrapper
result = _call(command, **kwargs_copy)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 291, in _call
raise Fail(err_msg)
resource_management.core.exceptions.Fail: Execution of 'ambari-sudo.sh /usr/bin/hdp-select set all `ambari-python-wrap /usr/bin/hdp-select versions | grep ^2.3 | tail -1`' returned 1. Traceback (most recent call last):
File "/usr/bin/hdp-select", line 378, in <module>
printVersions()
File "/usr/bin/hdp-select", line 235, in printVersions
result[tuple(map(int, versionRegex.split(f)))] = f
ValueError: invalid literal for int() with base 10: 'hadoop'
ERROR: set command takes 2 parameters, instead of 1
usage: hdp-select [-h] [<command>] [<package>] [<version>]
Set the selected version of HDP.
positional arguments:
<command> One of set, status, versions, or packages
<package> the package name to set
<version> the HDP version to set
optional arguments:
-h, --help show this help message and exit
-r, --rpm-mode if true checks if there is symlink exists and creates the symlink if it doesn't
Commands:
set : set the package to a specified version
status : show the version of the package
versions : show the currently installed versions
packages : show the individual package names
Being a Python guy, I dug into /usr/bin/hdp-select and found where the error was occurring. There are two issues:
1. The function is not very bulletproof and is prone to errors.
2. The immediate cause of the error is an unexpected directory in /usr/hdp, the directory this function scans to determine the installed versions.

The function in question is:

# Print the installed packages
def printVersions():
    result = {}
    for f in os.listdir(root):
        if f not in [".", "..", "current", "share", "lost+found"]:
            result[tuple(map(int, versionRegex.split(f)))] = f
    keys = result.keys()
    keys.sort()
    for k in keys:
        print result[k]
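As a quick illustration of why a stray name blows this up, here is a minimal reproduction. The split pattern is my assumption of what versionRegex roughly looks like (dots and dashes, matching directory names such as 2.3.4.7-4), not the exact pattern from hdp-select:

import re

# Assumed stand-in for hdp-select's versionRegex (split on '.' and '-')
versionRegex = re.compile(r'[-.]')

# A version directory name parses cleanly into a sortable tuple of ints
print(tuple(map(int, versionRegex.split('2.3.4.7-4'))))  # (2, 3, 4, 7, 4)

# A stray entry like 'hadoop' does not, and int() raises the
# "invalid literal for int() with base 10: 'hadoop'" seen in the traceback
print(tuple(map(int, versionRegex.split('hadoop'))))     # raises ValueError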
The problem is that if the scan finds any directory name other than the excluded ones, and that name is not a version string, the function raises an exception when it tries to map the regex output through int(), because the output is a non-numeric string such as "hadoop". On some of my nodes, the /usr/hdp directory looks like this (which does NOT make the function throw):

[nn01.qa] out: drwxr-xr-x. 13 root root 4096 Apr 23 07:09 2.3.4.7-4
[nn01.qa] out: drwxr-xr-x. 2 root root 4096 Apr 23 16:32 current
[nn01.qa] out: drwx------. 2 root root 16384 Apr 19 16:09 lost+found

But others look like this (note the unexpected hadoop directory that was causing the exception):

[nn02.qa] out: drwxr-xr-x. 13 root root 4096 Apr 23 07:09 2.3.4.7-4
[nn02.qa] out: drwxr-xr-x. 2 root root 4096 Apr 23 16:32 current
[nn02.qa] out: drwxr-xr-x. 4 root root 4096 Apr 23 16:12 hadoop
[nn02.qa] out: drwx------. 2 root root 16384 Apr 19 16:09 lost+found

I fixed the problem by changing the function as follows (adding exception handling), copying the update to all my nodes, and retrying the install. It completed successfully after that.

# Print the installed packages
def printVersions():
    result = {}
    for f in os.listdir(root):
        if f not in [".", "..", "current", "share", "lost+found"]:
            try:
                result[tuple(map(int, versionRegex.split(f)))] = f
            except:
                # skip anything that does not parse as a version string (e.g. 'hadoop')
                pass
    keys = result.keys()
    keys.sort()
    for k in keys:
        print result[k]
I don't recommend this code as a final fix. Instead of a blacklist condition, I would get rid of that code and use a regex filter to verify that each directory name looks like "^\d+\.\d+.*", i.e., require it to at least resemble the expected version format of 2.3.4.7-n. Then it does not matter how many unexpected entries the scan finds; you just gracefully skip over them. (A minimal sketch of that approach follows at the end of this post.)

One interesting point about this bug: the first time I ran the "Install, Start, and Test" step in the Ambari GUI, all 8 nodes failed, one with a red failure and the others orange and stopped due to warnings. I guess when one node fails, the others abort as well with warnings. Just for the heck of it, I clicked the "Retry" button; one node completed (turned blue), but another failed and the rest again stopped with warnings. I thought that was weird, so I kept retrying. I got to the point where 4 of the 8 nodes turned blue, but the rest would never complete no matter how many times I retried. When I looked at the nodes that made it to blue, they did NOT have a "hadoop" subdirectory in /usr/hdp, which is why the function above did not raise an exception for them. It seems that, at some stage of the install, /usr/hdp/hadoop gets created when certain applications need to be installed there, and if you have to retry after that point, you are stuck. However, after my fixed code was deployed and I retried, all nodes installed completely, and EVERY node now has a /usr/hdp/hadoop directory. @Ryan Chapin
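Here is a minimal sketch of that whitelist idea, assuming the same root directory and the same sort/print behavior as the original function; the regexes are illustrative and not taken from hdp-select:

import os
import re

root = "/usr/hdp"                           # the directory hdp-select scans
versionDirRegex = re.compile(r'^\d+\.\d+')  # only accept version-like names, e.g. 2.3.4.7-4
versionRegex = re.compile(r'[-.]')          # assumed split pattern, as above

def printVersions():
    result = {}
    for f in os.listdir(root):
        # Whitelist names that look like versions instead of blacklisting known non-versions
        if versionDirRegex.match(f):
            result[tuple(map(int, versionRegex.split(f)))] = f
    for k in sorted(result.keys()):
        print(result[k])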
04-22-2016
06:31 PM
Thanks @Jonas Straub. That helps clarify the issue a bit, but I am still left wondering. The Ranger manager GUI lets you add services (the big fat plus-sign icon), and services must have different names, so they cannot all be <cluster>_<service>. Am I understanding this right? Say I have two Hive-based applications, A and B, with different users and different policies. I am thinking I would create two services under the Hive section in Ranger, one for A and one for B, to neatly organize these non-overlapping policies, and then define the users and policies for each. Is that not how services are meant to be used, i.e., to organize users and policies? Am I off track here? If not, are you saying that the current version of Ranger in Ambari can only support ONE service per service type (one for Hive, one for HDFS, etc.) and that the names are locked down, as you already indicated? I can leave the service names at the defaults for now so that it works, but I am trying to understand Ranger's overall configuration concepts for future planning.
04-22-2016
04:53 AM
@Ryan Chapin
04-22-2016
04:51 AM
Just installed Ranger on 2.4.0 to start experimenting with it. I basically got it working and started playing around with policy changes, checking how they affect access to prove it works as I anticipate. By default, it created three services named nadcluster_hive, nadcluster_hdfs, and nadcluster_yarn, presumably because my cluster name is nadcluster. So I decided to rename them to be more reflective of the service, and renamed nadcluster_hive to jupstats_hive. I then started to see errors like the ones below in the hiveserver2 log. It seems the hiveserver2 Ranger plugin is still trying to download policy data from a URL like:

http://vmwhaddev01:6080/service/plugins/policies/download/nadcluster_hive

when it should be using the newly named service:

http://vmwhaddev01:6080/service/plugins/policies/download/jupstats_hive

And, in fact, I manually hit the REST endpoint from a browser with a URL like the following and did get a proper response:

http://vmwhaddev01:6080/service/plugins/policies/download/jupstats_hive?lastKnownVersion=4&pluginId=hiveServer2@vmwhaddev01-jupstats_hive

Here's the error in the log:

2016-04-22 04:38:17,290 ERROR [Thread-9]: client.RangerAdminRESTClient (RangerAdminRESTClient.java:getServicePoliciesIfUpdated(81)) - Error getting policies. request=http://vmwhaddev01:6080/service/plugins/policies/download/nadcluster_hive?lastKnownVersion=4&pluginId=hiveServer2@vmwhaddev01-nadcluster_hive, response={"httpStatusCode":400,"statusCode":1,"msgDesc":"Serivce:nadcluster_hive not found","messageList":[{"name":"DATA_NOT_FOUND","rbKey":"xa.error.data_not_found","message":"Data not found"}]}, serviceName=nadcluster_hive
2016-04-22 04:38:17,290 ERROR [Thread-9]: util.PolicyRefresher (PolicyRefresher.java:loadPolicyfromPolicyAdmin(228)) - PolicyRefresher(serviceName=nadcluster_hive): failed to refresh policies. Will continue to use last known version of policies (4)
java.lang.Exception: Serivce:nadcluster_hive not found
at org.apache.ranger.admin.client.RangerAdminRESTClient.getServicePoliciesIfUpdated(RangerAdminRESTClient.java:83)
at org.apache.ranger.plugin.util.PolicyRefresher.loadPolicyfromPolicyAdmin(PolicyRefresher.java:205)
at org.apache.ranger.plugin.util.PolicyRefresher.loadPolicy(PolicyRefresher.java:175)
at org.apache.ranger.plugin.util.PolicyRefresher.run(PolicyRefresher.java:154)

And here is when it started working, after I renamed the service back to what it was looking for:

2016-04-22 04:38:47,375 INFO [Thread-9]: util.PolicyRefresher (PolicyRefresher.java:loadPolicyfromPolicyAdmin(218)) - PolicyRefresher(serviceName=nadcluster_hive): found updated version. lastKnownVersion=4; newVersion=18

Questions:
1. What has to be restarted when you change something in a Ranger policy via the Ranger GUI? I restarted Ranger admin, usersync, and hiveserver2, but the wrong service name was still being used in the plugin URL.
2. Where is the hiveserver2 Ranger plugin learning this URL in the first place?

[UPDATE] OK, I spent some more time playing around. I figured out that the cached policy configuration is located here (note that the service name "nadcluster_hive" is in both the path and the filename):

/etc/ranger/nadcluster_hive/policycache/hiveServer2_nadcluster_hive.json

I performed some testing. With the service name in the Ranger UI set to "nadcluster_hive", I made various changes to one of the policies, like adding a new user and enabling/disabling table or column permissions. I tailed the json file above, and every time I made a change and saved it, within 30 seconds I would see the json file rewritten with the updates. Cool, that seems right. Next, I renamed the service to nadcluster_hive_1 and repeated the tests. The json file never changed once, as expected, because the wrong REST URL is being used. But I would have expected a brand new json file with the new service name to appear at a path like:

/etc/ranger/nadcluster_hive_1/policycache/hiveServer2_nadcluster_hive_1.json

It never did. So, is this expected behavior or a bug?
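For anyone who wants to repeat that REST check without a browser, here is a small Python sketch that hits the policy download endpoint for a given service name. The host, port, and pluginId are just the ones from my environment above, lastKnownVersion=0 is my guess at a value that always returns the full payload, and the endpoint is assumed to be reachable without authentication (it was when I hit it from the browser):

import urllib2  # Python 2, matching the cluster's stock interpreter

ranger_admin = "http://vmwhaddev01:6080"
service_name = "jupstats_hive"  # try both the old and the new service name

# Same endpoint the hiveserver2 plugin calls in the log excerpts above
url = ("%s/service/plugins/policies/download/%s"
       "?lastKnownVersion=0&pluginId=hiveServer2@vmwhaddev01-%s"
       % (ranger_admin, service_name, service_name))

try:
    response = urllib2.urlopen(url)
    print("HTTP %s" % response.getcode())
    print(response.read()[:500])   # start of the policy payload for a known service
except urllib2.HTTPError as e:
    # An unknown service name comes back as an HTTP error, e.g. the 400 in my log
    print("HTTP %s: %s" % (e.code, e.read()))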