Member since: 11-24-2015
Posts: 56
Kudos Received: 58
Solutions: 4
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2257 | 05-21-2016 02:32 PM
 | 3314 | 04-26-2016 05:22 AM
 | 7816 | 01-15-2016 06:23 PM
 | 10754 | 12-24-2015 04:52 PM
12-01-2018
03:07 PM
1 Kudo
Thank you for your answer. I wish I could buy you a beer. 🙂 My solution was slightly different but staring me right in the face. Your answer provided the perfect clue. In my case, there was no "universal" directory. However, I noticed there was a CA bundle file. It DID NOT come preconfigured with the GeoTrust CA we have for the SSL certificate we purchased for our F5. It did have some other GeoTrust CA and many others, as well. I simply opened up the cacerts.pem file, added that GeoTrust CA to the end, saved the file, and ran my test query. It then worked!
[mpetronic@vmwhnsqsrclnt01 ~]$ echo "show tables" | isql -d, -b -v f5 mpetronic $(cat ~/.pw.dat)
cro_capacity_extract_tmp
cro_capacity_ranked_tmp
cro_capacity_report_final
cro_efficiency_extract_tmp
cro_efficiency_hourly_tmp
cro_efficiency_report_final
j1_total_user_counts
san_data_2
test
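For anyone hitting the same error, this is roughly what the fix looked like on my system - the paths match the driver layout shown below, and /path/to/geotrust_ca.pem is just a placeholder for wherever you keep the CA certificate in PEM format:
cd /usr/lib/hive/lib/native/Linux-amd64-64
cp cacerts.pem cacerts.pem.orig        # keep a copy of the bundle that shipped with the driver
cat /path/to/geotrust_ca.pem >> cacerts.pem        # append the missing GeoTrust CA to the trusted bundle
# quick sanity check that the endpoint now validates against the updated bundle
openssl s_client -connect api.prod.quasar.nadops.net:443 -CAfile cacerts.pem < /dev/null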
Here is what my directory structure looks like for the ODBC driver version I am using:
[root@vmwhnsqsrclnt01 lib]# tree /usr/lib/hive/
hive
└── lib
└── native
├── hiveodbc
│ ├── ErrorMessages
│ │ └── en-US
│ │ ├── DSMessages.xml
│ │ ├── HiveODBCMessages.xml
│ │ ├── ODBCMessages.xml
│ │ ├── SQLEngineMessages.xml
│ │ └── ThriftExtensionMessages.xml
│ ├── EULA.txt
│ ├── Hortonworks\ Hive\ ODBC\ Driver\ User\ Guide.pdf
│ ├── Release\ Notes.txt
│ └── Setup
│ ├── odbc.ini
│ └── odbcinst.ini
└── Linux-amd64-64
├── api.prod.quasar.nadops.net.pem
├── cacerts.pem <<< Added GeoTrust CA to end of this file
├── cacerts.pem.orig
├── HiveODBC.did
├── hortonworks.hiveodbc.ini
└── libhortonworkshiveodbc64.so
11-12-2018
11:39 PM
HDP 2.3.4.7
hive-odbc-native-2.6.1.1001-1.x86_64.rpm
I have a working setup using this ODBC driver on CentOS 7.2 via Knox. Now we have added an F5 proxy in front of Knox. I can successfully connect to Hive using beeline through the F5, as well as with other JDBC apps (dbvisualizer, for example, works fine). The F5 is using a GeoTrust CA for its SSL connection. My odbc.ini file looks like this, where the [Knox] section is for connecting directly via Knox and the [F5] section is for going through the F5 and then through Knox:
[ODBC Data Sources]
Knox=DSN for access to the production cluster using Beeline via Knox
[Knox]
Description=Hortonworks Knox Production DSN
Driver=/usr/lib/hive/lib/native/Linux-amd64-64/libhortonworkshiveodbc64.so
Host=api01.prod.quasar.nadops.net
port=8443
HttpPath=quasar/jupstats/hive
schema=default
ServiceDiscoveryMode=0
HiveServerType=2
AuthMech=3
ThriftTransport=2
SSL=1
TwoWaySSL=0
ClientCert=/users/mpetronic/keystore/keystore.jks
trustStorePassword=######
AllowSelfSignedServerCert=1
UseNativeQuery=1
[F5]
Description=Hortonworks Knox Production DSN (F5)
Driver=/usr/lib/hive/lib/native/Linux-amd64-64/libhortonworkshiveodbc64.so
Host=api.prod.quasar.nadops.net
port=443
# The knox/ part gets stripped by an F5 rewrite rule and then forwarded to Knox
HttpPath=knox/quasar/jupstats/hive
schema=default
ServiceDiscoveryMode=0
HiveServerType=2
AuthMech=3
ThriftTransport=2
SSL=1
TwoWaySSL=0
ClientCert=/users/mpetronic/keystore/keystore.jks
trustStorePassword=######
AllowSelfSignedServerCert=1
UseNativeQuery=1
I have a valid GeoTrust CA in the OS certificate store, proven by this test:
[mpetronic@vmwhnsqsrclnt01 ~]$ openssl s_client -connect api.prod.quasar.nadops.net:443 -state
CONNECTED(00000003)
SSL_connect:before/connect initialization
SSL_connect:SSLv2/v3 write client hello A
SSL_connect:SSLv3 read server hello A
depth=2 C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert Global Root CA
verify return:1
depth=1 C = US, O = DigiCert Inc, OU = www.digicert.com, CN = GeoTrust RSA CA 2018
verify return:1
depth=0 CN = api.prod.quasar.nadops.net
verify return:1
SSL_connect:SSLv3 read server certificate A
SSL_connect:SSLv3 read server key exchange A
SSL_connect:SSLv3 read server done A
SSL_connect:SSLv3 write client key exchange A
SSL_connect:SSLv3 write change cipher spec A
SSL_connect:SSLv3 write finished A
SSL_connect:SSLv3 flush data
SSL_connect:SSLv3 read finished A
---
Certificate chain
0 s:/CN=api.prod.quasar.nadops.net
i:/C=US/O=DigiCert Inc/OU=www.digicert.com/CN=GeoTrust RSA CA 2018
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIEuTCCA6GgAwIBAgIQDFoE4dcFI+tHxeZBx3DCKDANBgkqhkiG9w0BAQsFADBe
MQswCQYDVQQGEwJVUzEVMBMGA1UEChMMRGlnaUNlcnQgSW5jMRkwFwYDVQQLExB3
d3cuZGlnaWNlcnQuY29tMR0wGwYDVQQDExRHZW9UcnVzdCBSU0EgQ0EgMjAxODAe
Fw0xODAxMTcwMDAwMDBaFw0yMTAxMTYxMjAwMDBaMCUxIzAhBgNVBAMTGmFwaS5w
cm9kLnF1YXNhci5uYWRvcHMubmV0MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIB
CgKCAQEAvzSCRUpL3S3XF82i7A/qZPOgwAbGagJCZUss5ZAU2Xb0bNi2FFGT5rhq
6sC59tAVF15BoY24KfMvxbp86B0n7OCvvVCJ4IY21/i2MNrCOM2t09UOMFdSuKhD
IdKLcYNrw8xhvMqGPeyFgHYZhPFfvB2XrwO9K8hjwgq01AHB490W6PylU+S0cedW
J1pqC57gXne3sQd20+DRvLZ9dP/phMofeMoO5zrcUZIMZ1cjP32LEVjwQ93CEgbG
BZ+cdHNJcyDtoP5t4Z2of8+f5mp2WK583vQVpKdZYw/AT1ZdEQrcePDVt+wAROHJ
x2SRtk2ouIcBUCcwMH4goIFparN4lQIDAQABo4IBqjCCAaYwHwYDVR0jBBgwFoAU
kFj/sJx1qFFUd7Ht8qNDFjiebMUwHQYDVR0OBBYEFJwKxfT5OhncAC2KcYLtJVnu
0ajNMCUGA1UdEQQeMByCGmFwaS5wcm9kLnF1YXNhci5uYWRvcHMubmV0MA4GA1Ud
DwEB/wQEAwIFoDAdBgNVHSUEFjAUBggrBgEFBQcDAQYIKwYBBQUHAwIwPgYDVR0f
BDcwNTAzoDGgL4YtaHR0cDovL2NkcC5nZW90cnVzdC5jb20vR2VvVHJ1c3RSU0FD
QTIwMTguY3JsMEwGA1UdIARFMEMwNwYJYIZIAYb9bAECMCowKAYIKwYBBQUHAgEW
HGh0dHBzOi8vd3d3LmRpZ2ljZXJ0LmNvbS9DUFMwCAYGZ4EMAQIBMHUGCCsGAQUF
BwEBBGkwZzAmBggrBgEFBQcwAYYaaHR0cDovL3N0YXR1cy5nZW90cnVzdC5jb20w
PQYIKwYBBQUHMAKGMWh0dHA6Ly9jYWNlcnRzLmdlb3RydXN0LmNvbS9HZW9UcnVz
dFJTQUNBMjAxOC5jcnQwCQYDVR0TBAIwADANBgkqhkiG9w0BAQsFAAOCAQEAqv5b
CxKeHDV8+j86I9zvXoPwMq+a+SpjQHeTgNjU3zsnQWxyDVH6gqnA9xmt/c36AcTd
mqeLZMsdDNMAimeEk7hCCaQcuOp0mLlxqlxNnPas6kTzvpM3KpBKUeFbCW77jkFZ
Q1WU/oq9jZZCtvl5SZadwmSzXAp9PxDuRIJEYtpHs/Mo6RQWZJcOmIN1aGfUOuav
+xOLhUK/FQ+e4riUcuLiTcxLBHtqkMORHP4t3GWCNv1Sj/i8qwI/SpU3DBvryp3Z
RXKA4okXFPQ8fKfq5HHyhpmGRhWlCUIHSnyjo9QzKvH0A18f+DDtQ/1KWU9lQHcO
BaZOw378zMc0PHHceg==
-----END CERTIFICATE-----
subject=/CN=api.prod.quasar.nadops.net
issuer=/C=US/O=DigiCert Inc/OU=www.digicert.com/CN=GeoTrust RSA CA 2018
---
No client certificate CA names sent
Peer signing digest: SHA1
Server Temp Key: ECDH, P-256, 256 bits
---
SSL handshake has read 1718 bytes and written 415 bytes
---
New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES256-GCM-SHA384
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
Protocol : TLSv1.2
Cipher : ECDHE-RSA-AES256-GCM-SHA384
Session-ID: 4F057AE288401B87506EF68BA0F2AFC3B3182E20100F5E0778493198BD2C6984
Session-ID-ctx:
Master-Key: 04CE701B1414899E854E136F858D7907BDA2ED52BDDB41EBA65F48A7102847B222AC0CAA1B3E733D6945139A752E0331
Key-Arg : None
Krb5 Principal: None
PSK identity: None
PSK identity hint: None
Start Time: 1542064692
Timeout : 300 (sec)
Verify return code: 0 (ok)
---
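(Side note: the check above validates against the OS trust store. To test against a specific bundle instead - for example, the cacerts.pem that ships with the ODBC driver - something like this should work; the -CAfile path is just an example:)
openssl s_client -connect api.prod.quasar.nadops.net:443 -CAfile /usr/lib/hive/lib/native/Linux-amd64-64/cacerts.pem < /dev/null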
I can successfully do this directly against Knox using the [Knox] section configuration of the odbc.ini:
[mpetronic@vmwhnsqsrclnt01 ~]$ echo "show tables" | isql -d, -b -v knox mpetronic $(cat ~/.pw.dat)
cro_capacity_extract_tmp
cro_capacity_ranked_tmp
cro_capacity_report_final
cro_efficiency_extract_tmp
cro_efficiency_hourly_tmp
cro_efficiency_report_final
j1_total_user_counts
my_vsat_info_ht1100
san_data_2
test
But this fails using the [F5] section:
[mpetronic@vmwhnsqsrclnt01 ~]$ echo "show tables" | isql -d, -b -v f5 mpetronic $(cat ~/.pw.dat)
[S1000][unixODBC][Hortonworks][DriverSupport] (1100) SSL certificate verification failed because the certificate is missing or incorrect.
[ISQL]ERROR: Could not SQLConnect
I have also added the GeoTrust CA into the keystore being used, but it made no difference. Here you can see the Knox certificate (alias=mykey) and the GeoTrust CA (alias=geotrust):
(test) [mpetronic@vmwhnsqsrclnt01 keystore]$ keytool -list -keystore keystore.jks
Enter keystore password:
Keystore type: jks
Keystore provider: SUN
Your keystore contains 2 entries
geotrust, Nov 12, 2018, trustedCertEntry,
Certificate fingerprint (SHA1): 7C:CC:2A:87:E3:94:9F:20:57:2B:18:48:29:80:50:5F:A9:0C:AC:3B
mykey, Oct 9, 2017, trustedCertEntry,
Certificate fingerprint (SHA1): F1:75:91:59:8D:BE:B1:79:AC:02:7F:66:FE:0A:53:E2:89:98:75:BA
At this point, I am stumped. I am unclear as to what cannot find the certificate or what certificate it is looking for. I know that some applications do not look for CA certificates in the OS system CA bundles, so maybe isql is not looking there? I also get the same error with a test Python app using pyodbc, so I doubt that is the issue. Any help would be greatly appreciated!
Labels:
- Apache Hive
- Apache Knox
08-18-2017
09:11 PM
You should be able to export an environment variable and place the odbc.ini file anywhere you want - anywhere you have write access. Can you try this and then run your test? I have never tried making it work completely with just a connection string.
export ODBCINI=/path/to/your/odbc.ini
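For example, assuming a DSN named Knox is defined in that file (path and credentials are placeholders):
export ODBCINI=/home/<your-user>/odbc/odbc.ini
echo "show databases" | isql -v Knox <your-username> <your-password>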
10-06-2016
12:00 PM
4 Kudos
This article outlines the steps needed to set up ODBC access to Hive via Apache Knox from a Linux workstation. While there seems to be some "reasonable" documentation out there on setting up this access from Windows, it took quite some time to figure out how to do this from Linux. For this example, I am installing the Hortonworks Linux ODBC driver on CentOS Linux release 7.2.1511 and then verifying it works using both the isql ODBC command line tool and PyODBC.
Installation Steps
1) Install the UnixODBC Dependencies
sudo yum install unixODBC
2) Download and Install the Hortonworks ODBC Linux Driver
I developed this guide using the below version of the driver. Note that the following URL is for the CentOS 7 version of the driver. If you are running an older version of the OS, then visit https://hortonworks.com/downloads/ and use the CentOS 6 driver. Also note that the below URL will likely change as new versions of the driver are released, so it is a good idea to check the Hortonworks download page for the latest version when you run this procedure. I can tell you that a colleague used this procedure on RHEL 6.4 with the CentOS 6 driver and it worked fine.
wget http://public-repo-1.hortonworks.com/HDP/hive-odbc/2.1.5.1006/centos7/hive-odbc-native-2.1.5.1006-1.el7.x86_64.rpm
sudo yum install hive-odbc-native-2.1.5.1006-1.el7.x86_64.rpm
3) Create $HOME/.odbc.ini
You don't necessarily need this file. For example, you could build up a DSN string in code in a Python program that you use to connect to Hive using PyODBC. But you can also just specify a DSN name that PyODBC then maps to an entry in your $HOME/.odbc.ini file. Either way, here are the settings that I used to build up the minimal $HOME/.odbc.ini used in this guide. Simply copy the following values into $HOME/.odbc.ini. If you want to know the details of what each field means and its optional values, you can find it all at the end of the ODBC installation document. Note that the following sample entries have some fields that you MUST change because they are environment and user specific, namely the Host, trustStorePassword, and ClientCert fields. All other fields should be left as they are.
[ODBC Data Sources]
# The below key 'Knox' can be anything but it MUST MATCH the key name used
# in the section below. The right side is just descriptive and can be anything
Knox=DSN for access to the production cluster using Beeline via Knox
[Knox]
# Optional description of this DSN
Description=Hortonworks Knox QA DSN
# Where the actual ODBC driver library is loaded from. This is the default
# location where Hortonworks installs the driver
Driver=/usr/lib/hive/lib/native/Linux-amd64-64/libhortonworkshiveodbc64.so
# The hostname of the server running the Knox gateway
Host=api01.quasar.local
# The port used to connect to the Knox gateway
port=8443
# The same httppath value used in a typical beeline connection string that
# connects via Knox. For our cluster, it is always "quasar/jupstats/hive"
HttpPath=quasar/jupstats/hive
# The database you want to be on once beeline connects to Hive. This is basically
# like executing the HQL "use my_database" command after the connection is made.
schema=default
# Must be 0 because we want to connect directly to Hiveserver2 inside the cluster.
ServiceDiscoveryMode=0
# Must be 2 because we are using hiveserver2 inside the cluster.
HiveServerType=2
# This indicates we need to provide a username and password in the connect string
AuthMech=3
# This indicates the transport mode for Hive is HTTP, which is required for Knox
ThriftTransport=2
# Indicates the connection to Knox uses SSL. Note that our Hive configuration has
# hive.server2.use.SSL=false but, the connection to hiveserver2 is via Knox and,
# to connect to Knox, you must use SSL. So, the ODBC driver is first connecting
# to Knox hence we need SSL=1 here. If you were using ODBC to directly connect
# to hiveserver2 (not via Knox), then, for the configuration we have in our cluster,
# this value would need to be 0
SSL=1
# But NOT two-way SSL meaning no client side authentication - only server-side
TwoWaySSL=0
# These two are the same values used in the beeline connect string. The password is
# the password you used to add the Knox certificate to your local user key store.
# The client certificate path is just the path to that key store
trustStorePassword=<your-trust-store-password>
ClientCert=/home/mpetronic/keystore/keystore.jks
# Must be 1 to support using self-signed certificates. Since we generated a
# self-signed certificate in our cluster, we need this.
AllowSelfSignedServerCert=1
# This prevents ODBC from doing query preprocessing which is not required since we are
# running against a Hive-aware end point
UseNativeQuery=1
Regarding the use of the UseNativeQuery=1 option, the ODBC docs say:
This option specifies whether the driver uses native HiveQL queries or converts them into an equivalent form in HiveQL. Enabled (1): The driver does not transform the queries emitted by an application, and executes HiveQL queries directly. Disabled (0): The driver transforms the queries emitted by an application and converts them into an equivalent form in HiveQL.
Note: If the application is Hive-aware and already emits HiveQL, then enable this option to avoid the extra overhead of query transformation.
Hive logs the queries that are run in various places, like log files and the Tez View. Admins need to be able to read these to help debug query execution problems. When UseNativeQuery=1 is set, the logged queries look like what you would normally expect - this:
select count(*) from my_db.some_table where year=2016 and month=10 and day=2
But if UseNativeQuery=0 is used, then the same query ends up looking like this:
SELECT COUNT(1) AS `C_455850525f31` FROM `my_db`.`some_table` `C_6a7570315f73746174735f63726f5f6361706163697479` WHERE
((`C_6a7570315f73746174735f63726f5f6361706163697479`.`day` = 2) AND (`C_6a7570315f73746174735f63726f5f6361706163697479`.`year` = 2016)
AND (`C_6a7570315f73746174735f63726f5f6361706163697479`.`month` = 10))
If you had to debug queries in logs, which would you prefer? Right! ALWAYS set UseNativeQuery=1. Plus, per the docs, this also improves performance by reducing the overhead of unnecessary query transformation.
4) Test your Connection via the Command Line
When you installed the unixODBC packages, they came with "isql", an interactive SQL CLI. Run the following command to see if you can connect to Hive via ODBC through Knox using the DSN you just created above. You should get some valid results from your query:
[mpetronic@mpws Downloads]$ echo "show databases" | isql -v -d, Knox <your-Linux-username> <your-Linux-password>
+---------------------------------------+
| Connected! |
| |
| sql-statement |
| help [tablename] |
| quit |
| |
+---------------------------------------+
SQL> show databases
default
test_db
another_db
5) Test your Connection via PyODBC
The following is a simple Python app that uses PyODBC to access the same database using the same DSN. Copy the below sample code to your workstation and read the comments in the script, which explain what you need to do to run the program. Then run it to verify you can successfully connect and execute a query via Knox. It will print the query results to stdout.
#!/usr/bin/env python
import pyodbc
import argparse
'''
Demonstrates the bare bones code needed to access Hive via Beeline via Knox.
For this sample to work, you must provide either the full ODBC configuration
via a connection string or reference an odbc.ini file that already has a DSN
defined for the connection to Knox you are trying to make.
To go the odbc.ini route, do the following:
1. Create .odbc.ini in your $HOME directory. Note this is a hidden file and that
is what the ODBC driver expects by default
2. Populate $HOME/.odbc.ini with the following values. These are the minimum values
needed for the file to work with Knox. I am not going to explain all the values
in detail. If you want to know what each one means, look them up in this doc:
https://hortonworks.com/wp-content/uploads/2016/03/Hortonworks-Hive-ODBC-Driver-User-Guide.pdf.
Also, change the host, trustStorePassword, and ClientCert values as needed for your setup.
[ODBC Data Sources]
# This key 'Knox' can be anything but it MUST MATCH the key name used in the
# section below. The right side is just descriptive and can be anything
Knox=DSN for access to Hive via Knox
[Knox]
# Optional description of this DSN
Description=Hortonworks Knox QA DSN
# Where the actual ODBC driver library is loaded from. This is the default
# location where Hortonworks installs the driver
Driver=/usr/lib/hive/lib/native/Linux-amd64-64/libhortonworkshiveodbc64.so
# The hostname of the server running the Knox gateway
Host=api01.qa
# The port used to connect to the Knox gateway
port=8443
# The same Http path value used in a typical beeline connect string that
# connects to Knox. For Quasar, it is always "quasar/jupstats/hive"
HttpPath=quasar/jupstats/hive
# The default database you want to be on once beeline connects to Hive
schema=jup1_stats
# Must be zero as we want to connect directly to Hiveserver2 inside the cluster.
ServiceDiscoveryMode=0
# Must be 2 because we are using hiveserver2 which is what beeline uses as well
HiveServerType=2
# This indicates we need to provide a username and password in the connect string
AuthMech=3
# This indicates the transport mode for Hive is HTTP, which is required for Knox
ThriftTransport=2
# Indicates the connection to Knox uses SSL
SSL=1
# But NOT two-way SSL meaning no client side authentication - only server-side
TwoWaySSL=0
# These two are the same values used in the beeline connect string. This is the
# password you used when you added the Knox certificate to your own trust store
trustStorePassword=<your-trust-store-password>
ClientCert=/home/mpetronic/keystore/keystore.jks
# Must be 1 to support the self-signed certificate mode we use for Knox
AllowSelfSignedServerCert=1
3. Install dependencies for this script to run. First, install the unixODBC-devel
package (via yum if you can, or download the RPM and install it that way). This
is needed for the header files that are used to build pyodbc. Then install PyODBC using
"pip install pyodbc==3.0.10". Note that the pyodbc installation needs to build
some source code, so you will also need the gcc toolchain installed.
4. Run this script and provide the required command line options to test your connection
To instead NOT require the $HOME/.odbc.ini file, you can build a DSN string directly
in code. Below is an example of how you can build a connection string to directly pass
into the pyodbc.connect() method. Using this approach, you DO NOT need to have a .odbc.ini
file present. This gives you full programmatic control of the process if you wish to
use it this way. This simply builds a string like "A=1;B=2;C=3" containing all the
exact same parameters that could be defined in the .odbc.ini file.
'''
CONNECTION_STRING_EXAMPLE = ';'.join('''
Description=Hortonworks Knox DSN
Driver=/usr/lib/hive/lib/native/Linux-amd64-64/libhortonworkshiveodbc64.so
Host=api01
port=8443
HttpPath=quasar/jupstats/hive
schema=default
ServiceDiscoveryMode=0
HiveServerType=2
AuthMech=3
ThriftTransport=2
SSL=1
TwoWaySSL=0
trustStorePassword=<your-trust-store-password>
ClientCert=/home/mpetronic/keystore/keystore.jks
AllowSelfSignedServerCert=1
uid=<your-linux-username>
pwd=<your-linux-password>
'''.splitlines())
#===============================================================================
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--user', '-u', type=str, required=True,
                        help='User name used to connect to Hive with beeline via Knox.')
    parser.add_argument('--password', '-p', type=str, required=True,
                        help='Password used to connect to Hive with beeline via Knox.')
    parser.add_argument('--dsn', '-d', type=str, required=True,
                        help='The DSN name from $HOME/.odbc.ini to use for this connection.')
    return parser.parse_args()
#===============================================================================
def execute_query(sql, dsn, user, password):
    conn = None
    cursor = None
    try:
        connect_string = 'DSN=%(dsn)s;uid=%(user)s;pwd=%(password)s' % locals()
        conn = pyodbc.connect(connect_string, autocommit=True)
        cursor = conn.cursor()
        cursor.execute(sql)
        for row in cursor:
            print(row)
    except Exception as e:
        print(e)
    finally:
        # Guard the cleanup so a failed connect does not raise a NameError here
        if cursor is not None:
            cursor.close()
        if conn is not None:
            conn.close()
###########################################################
# MAIN
###########################################################
if __name__ == '__main__':
    args = parse_args()
    execute_query('show databases', args.dsn, args.user, args.password)
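To run the sample, assuming you saved it as, say, hive_odbc_test.py and made it executable:
./hive_odbc_test.py --dsn Knox --user <your-linux-username> --password <your-linux-password>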
Find more articles tagged with:
- Data Processing
- Hive
- How-To/Tutorial
- Installation
- Linux
- odbc
- pyodbc
09-09-2016
10:18 PM
Thanks @Constantin Stanca. Parsing that log will work just fine for my use case. Appreciate your help!!!
09-06-2016
08:22 PM
2 Kudos
As an administrator, you can easily use the Tez view in Ambari to see the actual full HQL query that was executed by any user. How do you obtain the actual query that was run when hive.execution.engine=mr?
Tags:
- Data Processing
- Hive
Labels:
- Apache Hive
07-06-2016
01:49 AM
Step 1 requires the following syntax tweak to include the alias, otherwise you get an error like:
keytool error: java.lang.Exception: Alias <mykey> does not exist
keytool -exportcert -file knox.crt -keystore ./gateway.jks -storepass <master-secret-password> -alias <alias-used-to-create-key>
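For completeness, a rough sketch of the matching import on the client side, into the keystore referenced by ClientCert in the odbc.ini above (alias and paths are just examples):
keytool -importcert -file knox.crt -keystore /home/<your-user>/keystore/keystore.jks -alias knox
keytool -list -keystore /home/<your-user>/keystore/keystore.jks    # confirm the entry was added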
05-21-2016
02:32 PM
1 Kudo
I stumbled onto my own answer. It appears that you have to provide a partition spec to the alter command. I figured this out after poking around in MySQL to look at my Hive metastore to see if that would give me a clue. These queries are what made me think that the serde information is kept on a partition-by-partition basis:
mysql> select * from TBLS where TBL_NAME='rm'\G
*************************** 1. row ***************************
TBL_ID: 170
CREATE_TIME: 1463833647
DB_ID: 11
LAST_ACCESS_TIME: 0
OWNER: mpetronic
RETENTION: 0
SD_ID: 227
TBL_NAME: rm
TBL_TYPE: EXTERNAL_TABLE
VIEW_EXPANDED_TEXT: NULL
VIEW_ORIGINAL_TEXT: NULL
LINK_TARGET_ID: NULL
1 row in set (0.00 sec)
mysql> select * from SDS where CD_ID=170\G
*************************** 1. row ***************************
SD_ID: 227
CD_ID: 170
INPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
IS_COMPRESSED:
IS_STOREDASSUBDIRECTORIES:
LOCATION: hdfs://mpws:8020/jup1_stats/external_tables/rm
NUM_BUCKETS: -1
OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
SERDE_ID: 227
*************************** 2. row ***************************
SD_ID: 228
CD_ID: 170
INPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
IS_COMPRESSED:
IS_STOREDASSUBDIRECTORIES:
LOCATION: hdfs://mpws:8020/jup1_stats/external_tables/rm/year=2016/month=5/day=10
NUM_BUCKETS: -1
OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
SERDE_ID: 228
*************************** 3. row ***************************
SD_ID: 229
CD_ID: 170
INPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
IS_COMPRESSED:
IS_STOREDASSUBDIRECTORIES:
LOCATION: hdfs://mpws:8020/jup1_stats/external_tables/rm/year=2016/month=5/day=11
NUM_BUCKETS: -1
OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
SERDE_ID: 229
Once I saw all those 'LOCATION' values in the SDS table, I tried the following command to alter the table, and then the query worked. Interesting. Does this imply that you could have a different schema for each partition? Can anyone comment as to why the Avro schema is tied to a partition and not simply to the whole table?
alter table rm partition (year=2016) set serdeproperties ('avro.schema.url' = 'hdfs://mpws:8020/jup1_stats/avro/rm_1.avsc');
Since all my data is in partitions under the "year=2016" partition, I was able to just specify that one partition and it applied the change to all partitions under it.
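If your partitions are not all nested under one top-level partition like mine were, a brute-force sketch along these lines could apply the new URL to every partition individually. This assumes bash, the hive CLI on the PATH, and that all partition values are numeric so they need no quoting - adjust for your own layout (it also spawns one hive session per partition, so it is slow):
# point the serde of every partition of 'rm' at the new schema location
hive -e "show partitions rm" | while IFS='/' read -r y m d; do
    hive -e "alter table rm partition (${y}, ${m}, ${d}) set serdeproperties ('avro.schema.url'='hdfs://mpws:8020/jup1_stats/avro/rm_1.avsc');"
done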
05-21-2016
12:46 PM
1 Kudo
I have Hive external tables using Avro serdes that look like this:
CREATE EXTERNAL TABLE `rm`(
`gw_id` bigint COMMENT '',
`rm_id` string COMMENT '',
`start_time` bigint COMMENT '',
`end_time` bigint COMMENT '',
`switch_failure_count` bigint COMMENT '')
PARTITIONED BY (
`year` int,
`month` int,
`day` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.url'='hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
'hdfs://mpws:8020/jup1_stats/external_tables/rm'
TBLPROPERTIES (
'last_modified_by'='mpetronic',
'last_modified_time'='1463833730',
'transient_lastDdlTime'='1463833730')
The Avro schemas are stored on HDFS in /jup1_stats/schema/. I want to change that path from /jup1_stats/schema/ to /jup1_stats/avro/. So, I copied my Avro schema files from /jup1_stats/schema to /jup1_stats/avro and did this:
alter table rm set serdeproperties ('avro.schema.url'='hdfs://mpws:8020/stats/avro/rm_1.avsc');
I ran a test query and it worked. But not so fast! Then I did the cleanup, removed the /jup1_stats/schema directory, ran the test query again, and got tons of stack traces and the query failed.
ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1463688329951_0007_4_00, diagnostics=[Task failed, taskId=task_1463688329951_0007_4_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.io.IOException: java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:196)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:135)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:101)
at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:149)
at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:80)
at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:650)
at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:621)
at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:145)
at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:109)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:406)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:128)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:149)
... 14 more
Caused by: java.io.IOException: java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:251)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:193)
... 25 more
It appears that Hive is still looking for the schema in the old location, based on this error trace:
Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
I also tried this, but the queries still fail:
msck repair table rm;
If I recreate the schema files in the old location (/jup1_stats/schema), the query then works again. These are external tables, and I did try dropping the table and recreating it with the new location in avro.schema.url, and that works. But I would like to understand why I cannot use "alter table" to achieve this same result. Thanks!
Labels:
- Apache Hive
05-02-2016
09:27 PM
@Neeraj Sabharwal, I'm a bit confused on this impersonation concept. I understand the resource reuse benefits part. But if I have different users running Hive queries and they all run as the "hive" user, then what is to prevent one user's "insert overwrite directory "some_relative_dir" select from ...." results from overwriting some other user's results from a query that just happens to pick the same directory path? If a relative path is supplied, then Hive will write the results to "/user/hive/some_relative_dir". I tried to read the best practices link at the end of this post, but the page is not loading.
04-27-2016
03:26 AM
Actually, I tried restarting them before the reboot. Restarted everything. Still had the errors. Then did the reboot and they cleared. Oh well, it is what it is!
04-26-2016
05:22 AM
Well, I would still love to understand how this worked, but a reboot of the 4 DNs made this error go away. Never would have thunk it! That's pretty strange... Anyway, maybe this will help some other poor soul who hits this same condition. 🙂
04-26-2016
03:53 AM
So, I was installing a new cluster for our development QA testing and failed to adjust the default configuration that Ambari gave me for dfs.datanode.data.dir. It ended up putting in every partition it found. I really only wanted one - /grid/1, which was a dedicated disk partition for HDFS block storage. I discovered this blunder after the installation completed. I did not want to just redo the whole install, opting instead to try to manually fix this as a good deep dive learning experience. I got it all worked out, and it was a good learning exercise, but I have one lingering issue that I cannot solve. I am getting 4 Ambari errors (one for each DN) that state: Detected data dir(s) that became unmounted and are now writing to the root partition: /grid/1/hadoop/hdfs/data. I figured out that Ambari agents monitor (and remember) which HDFS directories were previously mounted, and they check to see if a mounted disk goes away for some reason and display an error. That's all fine - I get that. However, my setup seems to be correct, yet Ambari is still complaining. As part of correcting the configuration issue I laid upon myself, I did edit the following file (on each DN) to remove all those previously cached mount points that I did not want, and I just left the one I did want. I ended up just stopping HDFS, removing all the /opt/hadoop/hdfs, /tmp/hadoop/hdfs, etc. directories, removing the name node metadata directories, reformatting the namenode, and starting up HDFS. The file system is up and working. But can anyone tell me why I cannot get rid of this Ambari error? Here's the contents of one of the dfs_data_dir_mount.hist files. All 4 are exactly the same. Below, you can see the mount where I have a disk for HDFS data storage. It all looks good. I must be missing something obvious. I did restart everything - nothing clears this error. Thanks in advance...
[root@vmwqsrqadn01 ~]# cat /var/lib/ambari-agent/data/datanode/dfs_data_dir_mount.hist
# This file keeps track of the last known mount-point for each DFS data dir.
# It is safe to delete, since it will get regenerated the next time that the DataNode starts.
# However, it is not advised to delete this file since Ambari may
# re-create a DFS data dir that used to be mounted on a drive but is now mounted on the root.
# Comments begin with a hash (#) symbol
# data_dir,mount_point
/grid/1/hadoop/hdfs/data,/grid/1
[root@vmwqsrqadn01 ~]# mount -l | grep grid
/dev/sdb1 on /grid/1 type ext4 (rw,noatime,nodiratime,seclabel,data=ordered)
Labels:
- Apache Ambari
- Apache Hadoop
04-18-2016
08:35 PM
@Ancil McBarnett, Could you please explain why you propose /dev as a separate 32G partition?
03-19-2016
07:52 PM
4 Kudos
@Ryan Chapin Looking for some suggestions. We have a query that was hanging indefinitely in Hive on Tez. We are running with HDP 2.4.0. After some debugging, we narrowed it down to a single ORC file in a Hive partition that contained that file plus about 10 others. If we move this one file out of the partition and test the query, it now completes. If we include ONLY that one file in the partition, the query hangs. Even a simple "select * from table" hangs. The query never gets beyond the first map task. I then discovered the ORC file dump feature of Hive and ran the following on this file: hive --orcfiledump --skip-dump --recover -d hdfs://vmwhaddev01:8020/tmp/000002_0 > orc.dump This command never returns to the command line and hangs, similar to what Hive does. I Tested this on a known good file and the dump completes successfully and returns control to the command line as expected. So, even this dump test is hanging. If I tail orc.dump during the hang, the last line of the file looks complete. So, I am wondering if this line is the end of a stripe and the next stripe is corrupt? The ORC reader seems to get into some infinite loop at this point. Once the dumped output file size stops increasing, at about 382 MB, top command shows the Hive dump process continuously using about 99.9% CPU until I CTRL+C the command. It dumps about 382 MB of data before the dump command hangs. Here's last line which is complete: {"_col0":"DSN001000325021","_col1":10784199,"_col2":1457431200,"_col3":1457434800,"_col4":20209,"_col5":0,"_col6":60,"_col7":10,"_col8":1456222331,"_col9":120,"_col10":117,"_col11":114,"_col12":0,"_col13":0,"_col14":0,"_col15":0,"_col16":0,"_col17":0,"_col18":120,"_col19":121,"_col20":123,"_col21":124,"_col22":125,"_col23":0,"_col24":0,"_col25":15296815,"_col26":2,"_col27":0,"_col28":163528,"_col29":88,"_col30":1498,"_col31":29082,"_col32":874908,"_col33":51565,"_col34":104138,"_col35":149,"_col36":6,"_col37":0,"_col38":0,"_col39":1508,"_col40":0,"_col41":2248,"_col42":46961,"_col43":624,"_col44":1732,"_col45":0,"_col46":0,"_col47":0,"_col48":0,"_col49":41,"_col50":159,"_col51":12,"_col52":30,"_col53":0,"_col54":0,"_col55":0,"_col56":0,"_col57":0,"_col58":0,"_col59":0,"_col60":0,"_col61":1398668,"_col62":5916915,"_col63":5916855,"_col64":5986115,"_col65":249,"_col66":66,"_col67":547,"_col68":76,"_col69":132618,"_col70":17398,"_col71":140325,"_col72":19012,"_col73":0,"_col74":0,"_col75":0,"_col76":0,"_col77":"TUC04HNSIGW63B002Adv","_col78":1456805959,"_col79":0,"_col80":158,"_col81":136,"_col82":0,"_col83":0,"_col84":1,"_col85":0,"_col86":12,"_col87":12,"_col88":0,"_col89":0,"_col90":0,"_col91":0,"_col92":12,"_col93":12,"_col94":0,"_col95":0,"_col96":0,"_col97":0,"_col98":18,"_col99":14,"_col100":0,"_col101":0,"_col102":0,"_col103":0,"_col104":12,"_col105":12,"_col106":0,"_col107":0,"_col108":0,"_col109":0,"_col110":51565,"_col111":37383,"_col112":0,"_col113":402,"_col114":0,"_col115":449,"_col116":3126,"_col117":46256,"_col118":28682,"_col119":4,"_col120":0,"_col121":0,"_col122":0,"_col123":0,"_col124":0,"_col125":0,"_col126":0,"_col127":0,"_col128":0,"_col129":0,"_col130":0,"_col131":0,"_col132":0,"_col133":0,"_col134":0,"_col135":0,"_col136":0,"_col137":0,"_col138":0,"_col139":0,"_col140":0,"_col141":0,"_col142":0,"_col143":"20.2.1-3569","_col144":0,"_col145":0,"_col146":0,"_col147":0,"_col148":0,"_col149":0,"_col150":0,"_col151":0,"_col152":0,"_col153":0,"_col154":0,"_col155":0,"_col156":0,"_col157":0,"_col158":0,"_col159":0,"_col160":0,"_col161":0,"_col162":0,"_col163":0,"_col164"
:0,"_col165":0,"_col166":0,"_col167":0,"_col168":0,"_col169":0,"_col170":0,"_col171":0,"_col172":0,"_col173":4,"_col174":"12-AUG-15","_col175":12,"_col176":"HT1100","_col177":"1","_col178":"B4WB16S2","_col179":"CORE_DSN_PROD_HT1100_50K","_col180":"SW_DSN_PROD_HT1100_50K","_col181":"3.2.0.24","_col182":"1000"}
I am trying to determine whether I have uncovered a bug or the data I am inserting into the ORC somehow resulted in this condition. Either way, it seems like a bug if you can insert data that causes an ORC file to become corrupted. The ingest pipeline for this data is as follows. I convert raw CSV files into Avro and land them in an HDFS directory. There could be multiple Avro schemas in play here, as there are multiple versions of these CSV files in flight. The Avro schemas are designed such that I can include all versions in the same Hive table. Typically, newer versions of these stats files add more columns of stats. Once a day, I move all the Avro files that have accumulated to a temp directory and create an external table over the files in that directory. I run a query that selects * from the external table and inserts all the results into another Hive managed table that is in ORC format, effectively using Hive to perform the Avro to ORC conversion. This query also performs a join with some data from one other table to enrich the data landing in the ORC table. This table is partitioned by year/month/day. Because the resulting ORC files are relatively small for HDFS, I perform one final step after the ORC insert query completes. I run a Hive query against the newly created partition to effectively compact the ORC files. Typically, the reduce part generates around 70 ORC files. I run a query like the following for the appropriate year, month, and day of the partition just created, which typically compacts all 70 ORC files into about 5 much larger ones that are about 2-3 HDFS blocks (128 MB) in size each.
alter table table_name partition (year=2016, month=3, day=16) concatenate;
This is the first such issue we've seen in over two months of ingesting such files in this manner. Does anyone have any ideas of where to look further to possibly understand the root cause of this problem? Maybe the concatenate operation happened to cause the file corruption in this case? Anyone heard of such a thing? Should I file a bug report and provide this corrupt ORC file for some forensic analysis? I don't really want to start trying to hex dump and decode ORC to figure out what happened.
Labels:
- Apache Hive
- Apache Tez
03-07-2016
11:11 PM
1 Kudo
Thank you @vpoornalingam. That helps demystify this step. I tried the same query as @Rich Raposa and it did not list any of my FSRoots. Then I realized the --config path in the example is wrong for my install of Ambari 2.2 and HDP 2.3.0. It should be "/etc/hive/conf/conf.server". Using that, I do get the expected listing:
[hive@vmwhaddev01 ~]$ hive --config /etc/hive/conf/conf.server --service metatool -listFSRoot
Initializing HiveMetaTool..
<info traces>
Listing FS Roots..
hdfs://vmwhaddev01:8020/apps/hive/warehouse
hdfs://vmwhaddev01:8020/apps/hive/warehouse/jupstats.db
hdfs://vmwhaddev01:8020/apps/hive/warehouse/aggregates.db
hdfs://vmwhaddev01:8020/apps/hive/warehouse/imports.db
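(Related, in case it helps anyone following the same HA guide: the same metatool can also rewrite those FS roots after the nameservice changes. A sketch with placeholder URIs - verify the flags against your Hive version; -dryRun only previews the change:)
hive --config /etc/hive/conf/conf.server --service metatool -updateLocation hdfs://mycluster hdfs://vmwhaddev01:8020 -dryRun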
03-05-2016
06:16 PM
1 Kudo
@Ancil McBarnett Hey, man, thanks for this write up. Very helpful in gaining insight into the big picture. I am working on the requirements for my prod cluster. Question for you... Based on my somewhat novice knowledge, it seems like overkill to use RAID-10 and 4 1TB drives for the HA NNs when you are running QJM (which I assume is the case, as the diagram also shows ZKs and JNs). All the edits go to the JN JBOD disks. So, that just leaves a couple of fsimage files for the RAID-10 arrangement against 1TB - isn't that a bit of overkill on storage? Another confusion point for me is that the NN disk layout diagram above shows fsimage and edits going to the RAID-10 disks. Edits are written to the JNs against the one JBOD disk, right? So, is the diagram misleading or am I missing some intended message there? Here's a question I just asked related to this comment.
03-05-2016
06:14 PM
3 Kudos
I am trying to determine and plan the best disk layout for my active/standby NNs for a new production rollout that is going to run with QJM NN HA. I plan to have three servers, each running an instance of ZK and JN. The plan, per recommendations from another question I asked on this forum, is to have a dedicated RAID-1 disk on each server for the JNs to use for edit logs. I expect that array to use 256-512 GB sized disks. Each server will also have a dedicated disk for the OS, logs, tmp, etc., also RAID-1, using two 1.2TB drives. Each ZK instance will also have dedicated disks (spindles) per recommendations here. I am having a hard time answering this question... Where to store the fsimage files? Could I, for example, store them on the same RAID-1 disk that the JNs are using? I do plan to collocate two of the three JNs on the same two servers running the NNs and the third JN on a third server, so the NN would have access to the same drives used by the JNs. This collocation seems to be a commonly recommended arrangement. Or, should the fsimage files be pointed to a separate RAID-1 disk array just for that purpose? Another option would be to point the fsimage files to a separately sized partition on the OS RAID-1 disk array. These questions do NOT come from a sizing perspective, but more from a workload perspective. Fitting the files somewhere is easy to figure out. The real question is about the performance impacts of mixing, for example, the background checkpointing operations done by the standby namenode with the work being done by the JNs to save edits, and putting all that onto the same spindle. I see clearly that ZK should be kept on its own spindle due to how it uses a write-ahead log and how latency, in that case, is a huge concern for ZK performance. I just don't have a good feel for mixing the two workloads of checkpointing and edit log updates. Can someone please make some recommendation here?
Labels:
- Apache Hadoop
03-05-2016
05:16 PM
2 Kudos
I am currently planning ahead in preparation to deploy a production cluster that will run using QJM NN HA. Reading this, all looks reasonable except when I get to the last step (15). It seems to instruct you to change configuration directly on the hosts. Won't that be overwritten by Ambari in subsequent pushes of other config updates? Shouldn't these changes be made via Ambari config changes?
Labels:
- Apache Ambari
- Apache Hadoop
02-13-2016
06:53 PM
1 Kudo
Thanks for the clarifications. All makes sense now.
02-12-2016
08:23 PM
1 Kudo
+1 for shameless plugs. Nice article and thank you for it!
02-12-2016
08:22 PM
1 Kudo
@Benjamin Leonhardi, Nice writeup. Thank you for taking the time to be so thorough!!! All the links were very helpful, too. You read my mind on the Yahoo performance link - that was the next topic I was going to research. 🙂 A couple of follow-up clarification questions/comments...
Q1 - Chris' blog (thanks @Chris Nauroth) answered the remaining point. Only the "edits in progress" changes need to be applied to the fsimage by the failover NN when it takes over; all the completed edits on the JNs should have already been applied.
Q2 - So, my focus was on the JNs writing to each of their own disks, and I completely missed the point that the NN needs some place to build the fsimage file. So, would you just point that to the same disk used by the JNs (assuming that I am going to collocate the JNs on the same hosts as the NNs)?
Q3 - I was thinking more about a disk failure, where a failed disk means a failed JN. So separate disks for each JN means more reliability. Do you recommend some other arrangement?
Q5 - "So he writes a checkpoint regularly and distributes it to the active namenode." You mean "distributes it" through the JNs in the normal HA manner of publishing edits, right, or am I missing something here?
02-11-2016
09:08 PM
6 Kudos
I am trying to put together a hardware specification for name nodes running in HA mode. That made me have to think about disk allocation for name nodes. I pretty much get it with non-HA: use one RAID drive and another NFS mount for redundancy, the SNN incrementally applies changes in the edit log to the fsimage, etc. But I want to run HA. And I want to use Journal Nodes (JN) and the Quorum Journal Manager (QJM) approach. So, that made me think about this scenario, and I was not sure I was getting it right and wanted to ask some gurus for input. Here's what I think... Can you please confirm or correct? I think a scenario-type question will help me more easily ask the questions, so here goes. Assume a clean install. Primary and failover NNs both have empty fsimage files. The primary starts running and writing changes to all three JNs. As I understand it, the failover NN will be reading all those changes, via the JNs, and applying them to its empty fsimage to prepare it to be 100% complete should it be called to take over (faster startup time). Now the primary fails. The failover NN starts up, reads in the fsimage file, and starts accepting client requests as normal. It now starts to write edits to the JNs. But the formerly primary NN is still down, so it is NOT reading updates from the JNs. So, its fsimage remains empty, essentially. Next, I fix the formerly primary NN and start it up. It now becomes the failover NN. At this point, I guess it starts reading changes from the JNs and building up its empty fsimage with all changes to date, in hopes that it will once again rule the world and become active should the other NN fail some day.
Q1 - Is it true that the failover NN will NEVER have to apply any edit log changes at startup but simply loads its fsimage and starts running, because it assumes the fsimage is already 100% up to date via recent JN reads?
Q2 - In a setup with 3 JNs as a quorum, what should the disk layout look like on the three servers hosting those JNs? Because the edits are now distributed x3, should I just have a single disk per JN host dedicated to the JNs? No need for the one-RAID-plus-second-NFS type arrangement used in non-HA mode? Specifically, the disk resources typically used for a non-HA NN, where the NN writes edit log changes, now become disk resources used exclusively by the JNs, right? Meaning, the NNs never read/write anything directly to disk (except for configuration, I assume) but rather ALL goes through the JNs.
Q3 - I believe I still should have one dedicated disk for each JN on each host to isolate that unique workload from other processes. So, for example, there might be one disk for the OS, one for JNs, and another for the ZK instances that are sharing the same server to support the ZKFC. Correct?
Q4 - Because JNs are distributed, it makes me think I should treat these disks like I do disks on the DNs, meaning no RAID, just plain old JBOD. Does that sound right?
Q5 - Is it the NN on the failover server that actually does the JN reads and fsimage updates now in HA mode, given that there is no SNN in such a configuration?
Thanks in advance for confirmation or any insight on this...
Labels:
- Apache Hadoop
02-09-2016
04:42 PM
https://community.hortonworks.com/questions/15422/hive-and-avro-schema-defined-in-tblproperties-vs-s.html
02-09-2016
04:41 PM
4 Kudos
I built a DWH-style application on Hive that is centered on Avro to handle schema evolution. My source data is CSV, and the files change when new releases of the applications are deployed (like adding more columns, removing columns, etc.). So, I decided to convert everything to Avro on ingest and then strictly deal with Avro inside Hadoop. For most of my data, I convert to Avro and drop it into HDFS into date partitions and allow users to access that data using external Hive tables. For some of these CSV files, there are up to 6 versions in flight at once (each defined using a separate Avro schema), all landing in the same Hive table. So, like you, I wanted to be able to query over all the data no matter what the version. The approach seems to be working well. I did, however, want to chime in on TBLPROPERTIES. I just posted a question related to this. It seems like we should be using SERDEPROPERTIES, not TBLPROPERTIES, when defining the URL to the schema file. All my Avro schema files are in an HDFS directory - I did not want to use the literal approach of defining the schema inline in TBLPROPERTIES. I was creating my Hive tables using TBLPROPERTIES pointing to the URL of the newest schema, which is defined with defaults and properly designed to be a superset of all earlier schemas, allowing them all to coexist in the same Hive table. However, I recently tried to build a Spark SQL application using HiveContext to read these tables and was surprised to find that Spark threw Avro exceptions. Creating the table using SERDEPROPERTIES to define the avsc URL was the solution to make the data accessible from both Hive and Spark. Just thought I would mention it to save you some hassles down the road if you ever need Spark SQL access to that data.
01-19-2016
08:41 PM
1 Kudo
Interestingly, I just upgraded Ambari from 2.1 to 2.2 as part of my upgrade plans and the Hive service check now passes. The stack trace does show Ambari running various command scripts that implement this check.
01-19-2016
02:20 AM
1 Kudo
Well, I solved it another way - I just upgraded to Ambari 2.2. I will watch to see if this comes back as I run with this version and open a ticket if that happens. I was in the midst of upgrading anyway when this started to happen. I need to move to HDP 2.3.4 and Spark 1.5. But, 8 GB, really? Wow! That seems ridiculously expensive for a monitoring framework.
01-18-2016
08:44 PM
My installation differs from that link's instructions, but I did not change these from what the base install of HDP created. Why the discrepancy? For "/apps/hive", your link says:
hdfs dfs -chown -R $HIVE_USER:$HDFS_USER /apps/hive
hdfs dfs -chmod -R 775 /apps/hive
My setup is:
drwxr-xr-x 3 hdfs hdfs 96 Sep 30 15:09 hive
For "/tmp/hive", your link says:
hdfs dfs -chmod -R 777 /tmp/hive
My setup is:
drwx-wx-wx 11 ambari-qa hdfs 352 Jan 15 22:13 hive
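(If the documented permissions are what the service check expects, a sketch of bringing the directories in line with the linked guide would look like this - run as the HDFS superuser, and check it against your own security requirements first:)
sudo -u hdfs hdfs dfs -chown -R hive:hdfs /apps/hive
sudo -u hdfs hdfs dfs -chmod -R 775 /apps/hive
sudo -u hdfs hdfs dfs -chmod -R 777 /tmp/hive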
01-18-2016
07:49 PM
I'm working through this procedure to upgrade Ambari from 2.1.1 to 2.2.0 before I start an HDP upgrade from 2.3.0 to 2.3.4: https://docs.hortonworks.com/HDPDocuments/Ambari-2.2.0.0/bk_upgrading_Ambari/content/_preparing_to_upgrade_ambari_and_hdp.html
It says to run the service checks on installed components first. They all passed except the Hive check, for which I get this access error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.security.AccessControlException: Permission denied: user=ambari-qa, access=WRITE, inode="/apps/hive/warehouse":hive:hdfs:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
The /apps/hive/warehouse user:group is set to hive:hdfs. The ambari-qa user that runs the service checks, on the node where Ambari says the checks are being run, belongs to ambari-qa:hadoop:
[jupstats@vmwhadnfs01 hive]$ id ambari-qa
uid=1001(ambari-qa) gid=501(hadoop) groups=501(hadoop),100(users)
So, ambari-qa is a member of the hadoop group, but that group has no write permission to Hive managed tables, which are owned by hive and readable only by hdfs users. I'm not sure what the Ambari service check is trying to do, but it is clearly trying to write something in that managed table space. As I understand it, the hadoop superuser is the user that starts the namenode, user "hdfs" in my case. So, my questions are:
1. Should "hdfs" really be the group for /apps/hive/warehouse, or would it be better to have that be the "hadoop" group?
2. What are the best practice recommendations for the user:group permissions on /apps/hive/warehouse? For example, I have some Java and Python apps that run every 30 minutes to ingest data into Hive managed and external tables. Those processes run as a service user "jupstats" and group "ingest". My /apps/hive/warehouse/jupstats.db directory is where the managed tables live, and that directory is set to jupstats:ingest to restrict access appropriately. This seems right to me. Do you experts agree? Same for the directories where I also write some HDFS data that is accessed by external Hive tables. Those files are owned as jupstats:ingest.
3. I think I am generally lacking knowledge in how best to set up access to the various Hive tables that are eventually going to need to be accessed by various users. My thought was that all my jupstats.db tables, which are readable only by group ingest, will be made readable by these users by adding those users to the "ingest" group. Does that approach seem reasonable?
4. This still leaves me with the question of how I set up Hive so that this Ambari service check can pass. Should I add ambari-qa to the "hdfs" group? That feels wrong and dangerous in that it is like adding ambari-qa to a root-like account, since user "hdfs" is the hadoop superuser and can whack a lot of stuff.
Thanks for any help/tips on this...
Labels:
- Apache Ambari
- Apache Hive
01-18-2016
06:15 PM
@Neeraj Sabharwal Hey Neeraj! I'm having this same issue now, too. Do you have any input regarding how support troubleshot it for you, or do I need to hit them up with a ticket as well to get an answer?