Member since: 11-24-2015
Posts: 56
Kudos Received: 58
Solutions: 4
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2257 | 05-21-2016 02:32 PM
 | 3314 | 04-26-2016 05:22 AM
 | 7816 | 01-15-2016 06:23 PM
 | 10754 | 12-24-2015 04:52 PM
12-01-2018
03:07 PM
1 Kudo
Thank you for your answer. I wish I could buy you a beer. 🙂 My solution was slightly different but staring me right in the face. Your answer provided the perfect clue. In my case, there was no "universal" directory. However, I noticed there was a CA bundle file. It DID NOT come preconfigured with the GeoTrust CA we have for the SSL certificate we purchased for our F5. It did have some other GeoTrust CA and many others, as well. I simply opened up the cacerts.pem file, added that GeoTrust CA to the end, saved the file, and ran my test query. It then worked!
[mpetronic@vmwhnsqsrclnt01 ~]$ echo "show tables" | isql -d, -b -v f5 mpetronic $(cat ~/.pw.dat)
cro_capacity_extract_tmp
cro_capacity_ranked_tmp
cro_capacity_report_final
cro_efficiency_extract_tmp
cro_efficiency_hourly_tmp
cro_efficiency_report_final
j1_total_user_counts
san_data_2
test
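For anyone hitting the same error, this is roughly what the fix looked like on my system - the paths match the driver layout shown below, and /path/to/geotrust_ca.pem is just a placeholder for wherever you keep the CA certificate in PEM format:
cd /usr/lib/hive/lib/native/Linux-amd64-64
cp cacerts.pem cacerts.pem.orig        # keep a copy of the bundle that shipped with the driver
cat /path/to/geotrust_ca.pem >> cacerts.pem        # append the missing GeoTrust CA to the trusted bundle
# quick sanity check that the endpoint now validates against the updated bundle
openssl s_client -connect api.prod.quasar.nadops.net:443 -CAfile cacerts.pem < /dev/null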
Here is what my directory structure looks like for the ODBC driver version I am using:
[root@vmwhnsqsrclnt01 lib]# tree /usr/lib/hive/
hive
└── lib
└── native
├── hiveodbc
│ ├── ErrorMessages
│ │ └── en-US
│ │ ├── DSMessages.xml
│ │ ├── HiveODBCMessages.xml
│ │ ├── ODBCMessages.xml
│ │ ├── SQLEngineMessages.xml
│ │ └── ThriftExtensionMessages.xml
│ ├── EULA.txt
│ ├── Hortonworks\ Hive\ ODBC\ Driver\ User\ Guide.pdf
│ ├── Release\ Notes.txt
│ └── Setup
│ ├── odbc.ini
│ └── odbcinst.ini
└── Linux-amd64-64
├── api.prod.quasar.nadops.net.pem
├── cacerts.pem <<< Added GeoTrust CA to end of this file
├── cacerts.pem.orig
├── HiveODBC.did
├── hortonworks.hiveodbc.ini
└── libhortonworkshiveodbc64.so
11-12-2018
11:39 PM
HDP 2.3.4.7
hive-odbc-native-2.6.1.1001-1.x86_64.rpm
I have a working setup using this ODBC driver on CentOS 7.2 via Knox. Now we have added an F5 proxy in front of Knox. I can successfully connect to Hive using beeline through the F5, as well as with other JDBC apps (dbvisualizer, for example, works fine). The F5 is using a GeoTrust CA for its SSL connection. My odbc.ini file looks like this, where the [Knox] section is for connecting directly via Knox and the [F5] section is for going through the F5 and then through Knox:
[ODBC Data Sources]
Knox=DSN for access to the production cluster using Beeline via Knox
[Knox]
Description=Hortonworks Knox Production DSN
Driver=/usr/lib/hive/lib/native/Linux-amd64-64/libhortonworkshiveodbc64.so
Host=api01.prod.quasar.nadops.net
port=8443
HttpPath=quasar/jupstats/hive
schema=default
ServiceDiscoveryMode=0
HiveServerType=2
AuthMech=3
ThriftTransport=2
SSL=1
TwoWaySSL=0
ClientCert=/users/mpetronic/keystore/keystore.jks
trustStorePassword=######
AllowSelfSignedServerCert=1
UseNativeQuery=1
[F5]
Description=Hortonworks Knox Production DSN (F5)
Driver=/usr/lib/hive/lib/native/Linux-amd64-64/libhortonworkshiveodbc64.so
Host=api.prod.quasar.nadops.net
port=443
# The knox/ part gets stripped by an F5 rewrite rule and then forwarded to Knox
HttpPath=knox/quasar/jupstats/hive
schema=default
ServiceDiscoveryMode=0
HiveServerType=2
AuthMech=3
ThriftTransport=2
SSL=1
TwoWaySSL=0
ClientCert=/users/mpetronic/keystore/keystore.jks
trustStorePassword=######
AllowSelfSignedServerCert=1
UseNativeQuery=1
I have a valid GeoTrust CA in the OS certificate store, proven by this test:
[mpetronic@vmwhnsqsrclnt01 ~]$ openssl s_client -connect api.prod.quasar.nadops.net:443 -state
CONNECTED(00000003)
SSL_connect:before/connect initialization
SSL_connect:SSLv2/v3 write client hello A
SSL_connect:SSLv3 read server hello A
depth=2 C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert Global Root CA
verify return:1
depth=1 C = US, O = DigiCert Inc, OU = www.digicert.com, CN = GeoTrust RSA CA 2018
verify return:1
depth=0 CN = api.prod.quasar.nadops.net
verify return:1
SSL_connect:SSLv3 read server certificate A
SSL_connect:SSLv3 read server key exchange A
SSL_connect:SSLv3 read server done A
SSL_connect:SSLv3 write client key exchange A
SSL_connect:SSLv3 write change cipher spec A
SSL_connect:SSLv3 write finished A
SSL_connect:SSLv3 flush data
SSL_connect:SSLv3 read finished A
---
Certificate chain
0 s:/CN=api.prod.quasar.nadops.net
i:/C=US/O=DigiCert Inc/OU=www.digicert.com/CN=GeoTrust RSA CA 2018
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIEuTCCA6GgAwIBAgIQDFoE4dcFI+tHxeZBx3DCKDANBgkqhkiG9w0BAQsFADBe
MQswCQYDVQQGEwJVUzEVMBMGA1UEChMMRGlnaUNlcnQgSW5jMRkwFwYDVQQLExB3
d3cuZGlnaWNlcnQuY29tMR0wGwYDVQQDExRHZW9UcnVzdCBSU0EgQ0EgMjAxODAe
Fw0xODAxMTcwMDAwMDBaFw0yMTAxMTYxMjAwMDBaMCUxIzAhBgNVBAMTGmFwaS5w
cm9kLnF1YXNhci5uYWRvcHMubmV0MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIB
CgKCAQEAvzSCRUpL3S3XF82i7A/qZPOgwAbGagJCZUss5ZAU2Xb0bNi2FFGT5rhq
6sC59tAVF15BoY24KfMvxbp86B0n7OCvvVCJ4IY21/i2MNrCOM2t09UOMFdSuKhD
IdKLcYNrw8xhvMqGPeyFgHYZhPFfvB2XrwO9K8hjwgq01AHB490W6PylU+S0cedW
J1pqC57gXne3sQd20+DRvLZ9dP/phMofeMoO5zrcUZIMZ1cjP32LEVjwQ93CEgbG
BZ+cdHNJcyDtoP5t4Z2of8+f5mp2WK583vQVpKdZYw/AT1ZdEQrcePDVt+wAROHJ
x2SRtk2ouIcBUCcwMH4goIFparN4lQIDAQABo4IBqjCCAaYwHwYDVR0jBBgwFoAU
kFj/sJx1qFFUd7Ht8qNDFjiebMUwHQYDVR0OBBYEFJwKxfT5OhncAC2KcYLtJVnu
0ajNMCUGA1UdEQQeMByCGmFwaS5wcm9kLnF1YXNhci5uYWRvcHMubmV0MA4GA1Ud
DwEB/wQEAwIFoDAdBgNVHSUEFjAUBggrBgEFBQcDAQYIKwYBBQUHAwIwPgYDVR0f
BDcwNTAzoDGgL4YtaHR0cDovL2NkcC5nZW90cnVzdC5jb20vR2VvVHJ1c3RSU0FD
QTIwMTguY3JsMEwGA1UdIARFMEMwNwYJYIZIAYb9bAECMCowKAYIKwYBBQUHAgEW
HGh0dHBzOi8vd3d3LmRpZ2ljZXJ0LmNvbS9DUFMwCAYGZ4EMAQIBMHUGCCsGAQUF
BwEBBGkwZzAmBggrBgEFBQcwAYYaaHR0cDovL3N0YXR1cy5nZW90cnVzdC5jb20w
PQYIKwYBBQUHMAKGMWh0dHA6Ly9jYWNlcnRzLmdlb3RydXN0LmNvbS9HZW9UcnVz
dFJTQUNBMjAxOC5jcnQwCQYDVR0TBAIwADANBgkqhkiG9w0BAQsFAAOCAQEAqv5b
CxKeHDV8+j86I9zvXoPwMq+a+SpjQHeTgNjU3zsnQWxyDVH6gqnA9xmt/c36AcTd
mqeLZMsdDNMAimeEk7hCCaQcuOp0mLlxqlxNnPas6kTzvpM3KpBKUeFbCW77jkFZ
Q1WU/oq9jZZCtvl5SZadwmSzXAp9PxDuRIJEYtpHs/Mo6RQWZJcOmIN1aGfUOuav
+xOLhUK/FQ+e4riUcuLiTcxLBHtqkMORHP4t3GWCNv1Sj/i8qwI/SpU3DBvryp3Z
RXKA4okXFPQ8fKfq5HHyhpmGRhWlCUIHSnyjo9QzKvH0A18f+DDtQ/1KWU9lQHcO
BaZOw378zMc0PHHceg==
-----END CERTIFICATE-----
subject=/CN=api.prod.quasar.nadops.net
issuer=/C=US/O=DigiCert Inc/OU=www.digicert.com/CN=GeoTrust RSA CA 2018
---
No client certificate CA names sent
Peer signing digest: SHA1
Server Temp Key: ECDH, P-256, 256 bits
---
SSL handshake has read 1718 bytes and written 415 bytes
---
New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES256-GCM-SHA384
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
Protocol : TLSv1.2
Cipher : ECDHE-RSA-AES256-GCM-SHA384
Session-ID: 4F057AE288401B87506EF68BA0F2AFC3B3182E20100F5E0778493198BD2C6984
Session-ID-ctx:
Master-Key: 04CE701B1414899E854E136F858D7907BDA2ED52BDDB41EBA65F48A7102847B222AC0CAA1B3E733D6945139A752E0331
Key-Arg : None
Krb5 Principal: None
PSK identity: None
PSK identity hint: None
Start Time: 1542064692
Timeout : 300 (sec)
Verify return code: 0 (ok)
---
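(Side note: the check above validates against the OS trust store. To test against a specific bundle instead - for example, the cacerts.pem that ships with the ODBC driver - something like this should work; the -CAfile path is just an example:)
openssl s_client -connect api.prod.quasar.nadops.net:443 -CAfile /usr/lib/hive/lib/native/Linux-amd64-64/cacerts.pem < /dev/null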
I can successfully do this directly against Knox using the [Knox] section configuration of the odbc.ini:
[mpetronic@vmwhnsqsrclnt01 ~]$ echo "show tables" | isql -d, -b -v knox mpetronic $(cat ~/.pw.dat)
cro_capacity_extract_tmp
cro_capacity_ranked_tmp
cro_capacity_report_final
cro_efficiency_extract_tmp
cro_efficiency_hourly_tmp
cro_efficiency_report_final
j1_total_user_counts
my_vsat_info_ht1100
san_data_2
test
But this fails using the [F5] section:
[mpetronic@vmwhnsqsrclnt01 ~]$ echo "show tables" | isql -d, -b -v f5 mpetronic $(cat ~/.pw.dat)
[S1000][unixODBC][Hortonworks][DriverSupport] (1100) SSL certificate verification failed because the certificate is missing or incorrect.
[ISQL]ERROR: Could not SQLConnect
I have also added the GeoTrust CA into the keystore being used, but it made no difference. Here you can see the Knox certificate (alias=mykey) and the GeoTrust CA (alias=geotrust):
(test) [mpetronic@vmwhnsqsrclnt01 keystore]$ keytool -list -keystore keystore.jks
Enter keystore password:
Keystore type: jks
Keystore provider: SUN
Your keystore contains 2 entries
geotrust, Nov 12, 2018, trustedCertEntry,
Certificate fingerprint (SHA1): 7C:CC:2A:87:E3:94:9F:20:57:2B:18:48:29:80:50:5F:A9:0C:AC:3B
mykey, Oct 9, 2017, trustedCertEntry,
Certificate fingerprint (SHA1): F1:75:91:59:8D:BE:B1:79:AC:02:7F:66:FE:0A:53:E2:89:98:75:BA
At this point, I am stumped. I am unclear as to what cannot find the certificate or what certificate it is looking for. I know that some applications do not look for CA certificates in the OS system CA bundles, so maybe isql is not looking there? I also get the same error with a test Python app using pyodbc, so I doubt that is the issue. Any help would be greatly appreciated!
Labels:
- Apache Hive
- Apache Knox
08-18-2017
09:11 PM
You should be able to export an environment variable and place the odbc.ini file anywhere you want - anywhere you have write access. Can you try this and then run your test? I have never tried making it work completely with just a connection string.
export ODBCINI=/path/to/your/odbc.ini
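For example, assuming a DSN named Knox is defined in that file (path and credentials are placeholders):
export ODBCINI=/home/<your-user>/odbc/odbc.ini
echo "show databases" | isql -v Knox <your-username> <your-password>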
10-06-2016
12:00 PM
4 Kudos
This article outlines the steps needed to set up ODBC access to Hive via Apache Knox from a Linux workstation. While there seems to be some "reasonable" documentation out there on setting up this access from Windows, it took quite some time to figure out how to do this from Linux. For this example, I am installing the Hortonworks Linux ODBC driver on CentOS Linux release 7.2.1511 and then verifying it works using both the isql ODBC command line tool and PyODBC.
Installation Steps
1) Install the UnixODBC Dependencies
sudo yum install unixODBC
2) Download and Install the Hortonworks ODBC Linux Driver
I developed this guide using the below version of the driver. Note that the following URL is for the CentOS 7 version of the driver. If you are running an older version of the OS, then visit https://hortonworks.com/downloads/ and use the CentOS 6 driver. Also note that the below URL will likely change as new versions of the driver are released, so it is a good idea to check the Hortonworks download page for the latest version when you run this procedure. I can tell you that a colleague used this procedure on RHEL 6.4 with the CentOS 6 driver and it worked fine.
wget http://public-repo-1.hortonworks.com/HDP/hive-odbc/2.1.5.1006/centos7/hive-odbc-native-2.1.5.1006-1.el7.x86_64.rpm
sudo yum install hive-odbc-native-2.1.5.1006-1.el7.x86_64.rpm
3) Create $HOME/.odbc.ini
You don't necessarily need this file. For example, you could build up a DSN string in code in a Python program that you use to connect to Hive using PyODBC. But you can also just specify a DSN name that PyODBC then maps to an entry in your $HOME/.odbc.ini file. Either way, here are the settings that I used to build up the minimal $HOME/.odbc.ini used in this guide. Simply copy the following values into $HOME/.odbc.ini. If you want to know the details of what each field means and its optional values, you can find it all at the end of the ODBC installation document. Note that the following sample entries have some fields that you MUST change because they are environment and user specific, namely the Host, trustStorePassword, and ClientCert fields. All other fields should be left as they are.
[ODBC Data Sources]
# The below key 'Knox' can be anything but it MUST MATCH the key name used
# in the section below. The right side is just descriptive and can be anything
Knox=DSN for access to the production cluster using Beeline via Knox
[Knox]
# Optional description of this DSN
Description=Hortonworks Knox QA DSN
# Where the actual ODBC driver library is loaded from. This is the default
# location where Hortonworks installs the driver
Driver=/usr/lib/hive/lib/native/Linux-amd64-64/libhortonworkshiveodbc64.so
# The hostname of the server running the Knox gateway
Host=api01.quasar.local
# The port used to connect to the Knox gateway
port=8443
# The same httppath value used in a typical beeline connection string that
# connects via Knox. For our cluster, it is always "quasar/jupstats/hive"
HttpPath=quasar/jupstats/hive
# The database you want to be on once beeline connects to Hive. This is basically
# like executing the HQL "use my_database" command after the connection is made.
schema=default
# Must be 0 because we want to connect directly to Hiveserver2 inside the cluster.
ServiceDiscoveryMode=0
# Must be 2 because we are using hiveserver2 inside the cluster.
HiveServerType=2
# This indicates we need to provide a username and password in the connect string
AuthMech=3
# This indicates the transport mode for Hive is HTTP, which is required for Knox
ThriftTransport=2
# Indicates the connection to Knox uses SSL. Note that our Hive configuration has
# hive.server2.use.SSL=false but, the connection to hiveserver2 is via Knox and,
# to connect to Knox, you must use SSL. So, the ODBC driver is first connecting
# to Knox hence we need SSL=1 here. If you were using ODBC to directly connect
# to hiveserver2 (not via Knox), then, for the configuration we have in our cluster,
# this value would need to be 0
SSL=1
# But NOT two-way SSL meaning no client side authentication - only server-side
TwoWaySSL=0
# These two are the same values used in the beeline connect string. The password is
# the password you used to add the Knox certificate to your local user key store.
# The client certificate path is just the path to that key store
trustStorePassword=<your-trust-store-password>
ClientCert=/home/mpetronic/keystore/keystore.jks
# Must be 1 to support using self-signed certificates. Since we generated a
# self-signed certificate in our cluster, we need this.
AllowSelfSignedServerCert=1
# This prevents ODBC from doing query preprocessing which is not required since we are
# running against a Hive-aware end point
UseNativeQuery=1
Regarding the use of the UseNativeQuery=1 option, the ODBC docs say:
This option specifies whether the driver uses native HiveQL queries or converts them into an equivalent form in HiveQL. Enabled (1): The driver does not transform the queries emitted by an application, and executes HiveQL queries directly. Disabled (0): The driver transforms the queries emitted by an application and converts them into an equivalent form in HiveQL.
Note: If the application is Hive-aware and already emits HiveQL, then enable this option to avoid the extra overhead of query transformation.
Hive logs the queries that are run in various places, like log files and the Tez View. Admins need to be able to read these to help debug query execution problems. When UseNativeQuery=1 is set, the logged queries look like what you would normally expect - this:
select count(*) from my_db.some_table where year=2016 and month=10 and day=2
But if UseNativeQuery=0 is used, then the same query ends up looking like this:
SELECT COUNT(1) AS `C_455850525f31` FROM `my_db`.`some_table` `C_6a7570315f73746174735f63726f5f6361706163697479` WHERE
((`C_6a7570315f73746174735f63726f5f6361706163697479`.`day` = 2) AND (`C_6a7570315f73746174735f63726f5f6361706163697479`.`year` = 2016)
AND (`C_6a7570315f73746174735f63726f5f6361706163697479`.`month` = 10))
If you had to debug queries in logs, which would you prefer? Right! ALWAYS set UseNativeQuery=1. Plus, per the docs, this also improves performance by reducing the overhead of unnecessary query transformation.
4) Test your Connection via the Command Line
When you installed the unixODBC packages, they came with "isql", an interactive SQL CLI. Run the following command to see if you can connect to Hive via ODBC through Knox using the DSN you just created above. You should get some valid results from your query:
[mpetronic@mpws Downloads]$ echo "show databases" | isql -v -d, Knox <your-Linux-username> <your-Linux-password>
+---------------------------------------+
| Connected! |
| |
| sql-statement |
| help [tablename] |
| quit |
| |
+---------------------------------------+
SQL> show databases
default
test_db
another_db
5) Test your Connection via PyODBC
The following is a simple Python app that uses PyODBC to access the same database using the same DSN. Copy the below sample code to your workstation and read the comments in the script, which explain what you need to do to run the program. Then run it to verify you can successfully connect and execute a query via Knox. It will print the query results to stdout.
#!/usr/bin/env python
import pyodbc
import argparse
'''
Demonstrates the bare bones code needed to access Hive via Beeline via Knox.
For this sample to work, you must provide either the full ODBC configuration
via a connection string or reference an odbc.ini file that already has a DSN
defined for the connection to Knox you are trying to make.
To go the odbc.ini route, do the following:
1. Create .odbc.ini in your $HOME directory. Note this is a hidden file and that
is what the ODBC driver expects by default
2. Populate $HOME/.odbc.ini with the following values. These are the minimum values
needed for the file to work with Knox. I am not going to explain all the values
in detail. If you want to know what each one means, look them up in this doc:
https://hortonworks.com/wp-content/uploads/2016/03/Hortonworks-Hive-ODBC-Driver-User-Guide.pdf.
Also, change the host, trustStorePassword, and ClientCert values as needed for your setup.
[ODBC Data Sources]
# This key 'Knox' can be anything but it MUST MATCH the key name used in the
# section below. The right side is just descriptive and can be anything
Knox=DSN for access to Hive via Knox
[Knox]
# Optional description of this DSN
Description=Hortonworks Knox QA DSN
# Where the actual ODBC driver library is loaded from. This is the default
# location where Hortonworks installs the driver
Driver=/usr/lib/hive/lib/native/Linux-amd64-64/libhortonworkshiveodbc64.so
# The hostname of the server running the Knox gateway
Host=api01.qa
# The port used to connect to the Knox gateway
port=8443
# The same Http path value used in a typical beeline connect string that
# connects to Knox. For Quasar, it is always "quasar/jupstats/hive"
HttpPath=quasar/jupstats/hive
# The default database you want to be on once beeline connects to Hive
schema=jup1_stats
# Must be zero as we want to connect directly to Hiveserver2 inside the cluster.
ServiceDiscoveryMode=0
# Must be 2 because we are using hiveserver2 which is what beeline uses as well
HiveServerType=2
# This indicates we need to provide a username and password in the connect string
AuthMech=3
# This indicates the transport mode for Hive is HTTP, which is required for Knox
ThriftTransport=2
# Indicates the connection to Knox uses SSL
SSL=1
# But NOT two-way SSL meaning no client side authentication - only server-side
TwoWaySSL=0
# These two are the same values used in the beeline connect string. This is the
# password you used when you added the Knox certificate to your own trust store
trustStorePassword=<your-trust-store-password>
ClientCert=/home/mpetronic/keystore/keystore.jks
# Must be 1 to support the self-signed certificate mode we use for Knox
AllowSelfSignedServerCert=1
3. Install dependencies for this script to run. First, install the unixODBC-devel
package (via yum if you can, or download the RPM and install it that way). This
is needed for the header files that are used to build pyodbc. Then install PyODBC using
"pip install pyodbc==3.0.10". Note that the pyodbc installation needs to build
some source code, so you will also need the gcc toolchain installed.
4. Run this script and provide the required command line options to test your connection
To instead NOT require the $HOME/.odbc.ini file, you can build a DSN string directly
in code. Below is an example of how you can build a connection string to directly pass
into the pyodbc.connect() method. Using this approach, you DO NOT need to have a .odbc.ini
file present. This gives you full programmatic control of the process if you wish to
use it this way. This simply builds a string like "A=1;B=2;C=3" containing all the
exact same parameters that could be defined in the .odbc.ini file.
'''
CONNECTION_STRING_EXAMPLE = ';'.join('''
Description=Hortonworks Knox DSN
Driver=/usr/lib/hive/lib/native/Linux-amd64-64/libhortonworkshiveodbc64.so
Host=api01
port=8443
HttpPath=quasar/jupstats/hive
schema=default
ServiceDiscoveryMode=0
HiveServerType=2
AuthMech=3
ThriftTransport=2
SSL=1
TwoWaySSL=0
trustStorePassword=<your-trust-store-password>
ClientCert=/home/mpetronic/keystore/keystore.jks
AllowSelfSignedServerCert=1
uid=<your-linux-username>
pwd=<your-linux-password>
'''.splitlines())
#===============================================================================
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--user', '-u', type=str, required=True,
                        help='User name used to connect to Hive with beeline via Knox.')
    parser.add_argument('--password', '-p', type=str, required=True,
                        help='Password used to connect to Hive with beeline via Knox.')
    parser.add_argument('--dsn', '-d', type=str, required=True,
                        help='The DSN name from $HOME/.odbc.ini to use for this connection.')
    return parser.parse_args()
#===============================================================================
def execute_query(sql, dsn, user, password):
    conn = None
    cursor = None
    try:
        connect_string = 'DSN=%(dsn)s;uid=%(user)s;pwd=%(password)s' % locals()
        conn = pyodbc.connect(connect_string, autocommit=True)
        cursor = conn.cursor()
        cursor.execute(sql)
        for row in cursor:
            print(row)
    except Exception as e:
        print(e)
    finally:
        # Guard the cleanup so a failed connect does not raise a NameError here
        if cursor is not None:
            cursor.close()
        if conn is not None:
            conn.close()
###########################################################
# MAIN
###########################################################
if __name__ == '__main__':
    args = parse_args()
    execute_query('show databases', args.dsn, args.user, args.password)
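To run the sample, assuming you saved it as, say, hive_odbc_test.py and made it executable:
./hive_odbc_test.py --dsn Knox --user <your-linux-username> --password <your-linux-password>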
Find more articles tagged with:
- Data Processing
- Hive
- How-To/Tutorial
- Installation
- Linux
- odbc
- pyodbc
09-09-2016
10:18 PM
Thanks @Constantin Stanca. Parsing that log will work just fine for my use case. Appreciate your help!!!
09-06-2016
08:22 PM
2 Kudos
As an administrator, you can easily use the Tez view in Ambari to see the actual full HQL query that was executed by any user. How do you obtain the actual query that was run when hive.execution.engine=mr?
Tags:
- Data Processing
- Hive
Labels:
- Apache Hive
07-06-2016
01:49 AM
Step 1 requires the following syntax tweak to include the alias, otherwise you get an error like:
keytool error: java.lang.Exception: Alias <mykey> does not exist
keytool -exportcert -file knox.crt -keystore ./gateway.jks -storepass <master-secret-password> -alias <alias-used-to-create-key>
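For completeness, a rough sketch of the matching import on the client side, into the keystore referenced by ClientCert in the odbc.ini above (alias and paths are just examples):
keytool -importcert -file knox.crt -keystore /home/<your-user>/keystore/keystore.jks -alias knox
keytool -list -keystore /home/<your-user>/keystore/keystore.jks    # confirm the entry was added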
05-21-2016
02:32 PM
1 Kudo
I stumbled onto my own answer. It appears that you have to provide a partition spec to the alter command. I figured this out after poking around in MySQL to look at my Hive metastore to see if that would give me a clue. These queries are what made me think that the serde information is kept on a partition-by-partition basis:
mysql> select * from TBLS where TBL_NAME='rm'\G
*************************** 1. row ***************************
TBL_ID: 170
CREATE_TIME: 1463833647
DB_ID: 11
LAST_ACCESS_TIME: 0
OWNER: mpetronic
RETENTION: 0
SD_ID: 227
TBL_NAME: rm
TBL_TYPE: EXTERNAL_TABLE
VIEW_EXPANDED_TEXT: NULL
VIEW_ORIGINAL_TEXT: NULL
LINK_TARGET_ID: NULL
1 row in set (0.00 sec)
mysql> select * from SDS where CD_ID=170\G
*************************** 1. row ***************************
SD_ID: 227
CD_ID: 170
INPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
IS_COMPRESSED:
IS_STOREDASSUBDIRECTORIES:
LOCATION: hdfs://mpws:8020/jup1_stats/external_tables/rm
NUM_BUCKETS: -1
OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
SERDE_ID: 227
*************************** 2. row ***************************
SD_ID: 228
CD_ID: 170
INPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
IS_COMPRESSED:
IS_STOREDASSUBDIRECTORIES:
LOCATION: hdfs://mpws:8020/jup1_stats/external_tables/rm/year=2016/month=5/day=10
NUM_BUCKETS: -1
OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
SERDE_ID: 228
*************************** 3. row ***************************
SD_ID: 229
CD_ID: 170
INPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
IS_COMPRESSED:
IS_STOREDASSUBDIRECTORIES:
LOCATION: hdfs://mpws:8020/jup1_stats/external_tables/rm/year=2016/month=5/day=11
NUM_BUCKETS: -1
OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
SERDE_ID: 229
Once I saw all those 'LOCATION' values in the SDS table, I tried the following command to alter the table, and then the query worked. Interesting. Does this imply that you could have a different schema for each partition? Can anyone comment as to why the Avro schema is tied to a partition and not simply to the whole table?
alter table rm partition (year=2016) set serdeproperties ('avro.schema.url' = 'hdfs://mpws:8020/jup1_stats/avro/rm_1.avsc');
Since all my data is in partitions under the "year=2016" partition, I was able to just specify that one partition and it applied the change to all partitions under it.
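If your partitions are not all nested under one top-level partition like mine were, a brute-force sketch along these lines could apply the new URL to every partition individually. This assumes bash, the hive CLI on the PATH, and that all partition values are numeric so they need no quoting - adjust for your own layout (it also spawns one hive session per partition, so it is slow):
# point the serde of every partition of 'rm' at the new schema location
hive -e "show partitions rm" | while IFS='/' read -r y m d; do
    hive -e "alter table rm partition (${y}, ${m}, ${d}) set serdeproperties ('avro.schema.url'='hdfs://mpws:8020/jup1_stats/avro/rm_1.avsc');"
done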
05-21-2016
12:46 PM
1 Kudo
I have Hive external tables using Avro serdes that look like this:
CREATE EXTERNAL TABLE `rm`(
`gw_id` bigint COMMENT '',
`rm_id` string COMMENT '',
`start_time` bigint COMMENT '',
`end_time` bigint COMMENT '',
`switch_failure_count` bigint COMMENT '')
PARTITIONED BY (
`year` int,
`month` int,
`day` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.url'='hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
'hdfs://mpws:8020/jup1_stats/external_tables/rm'
TBLPROPERTIES (
'last_modified_by'='mpetronic',
'last_modified_time'='1463833730',
'transient_lastDdlTime'='1463833730')
The Avro schemas are stored on HDFS in /jup1_stats/schema/. I want to change that path from /jup1_stats/schema/ to /jup1_stats/avro/. So, I copied my Avro schema files from /jup1_stats/schema to /jup1_stats/avro and did this:
alter table rm set serdeproperties ('avro.schema.url'='hdfs://mpws:8020/stats/avro/rm_1.avsc');
I ran a test query and it worked. But not so fast! Then I did the cleanup, removed the /jup1_stats/schema directory, ran the test query again, and got tons of stack traces and the query failed.
ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1463688329951_0007_4_00, diagnostics=[Task failed, taskId=task_1463688329951_0007_4_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.io.IOException: java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:196)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:135)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:101)
at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:149)
at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:80)
at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:650)
at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:621)
at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:145)
at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:109)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:406)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:128)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:149)
... 14 more
Caused by: java.io.IOException: java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:251)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:193)
... 25 more
It appears that Hive is still looking for the schema in the old location, based on this error trace:
Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
I also tried this, but the queries still fail:
msck repair table rm;
If I recreate the schema files in the old location (/jup1_stats/schema), the query then works again. These are external tables, and I did try dropping the table and recreating it with the new location in avro.schema.url, and that works. But I would like to understand why I cannot use "alter table" to achieve this same result. Thanks!
Labels:
- Apache Hive
05-02-2016
09:27 PM
@Neeraj Sabharwal, I'm a bit confused on this impersonation concept. I understand the resource reuse benefits part. But if I have different users running Hive queries and they all run as the "hive" user, then what is to prevent one user's "insert overwrite directory "some_relative_dir" select from ...." results from overwriting some other user's results from a query that just happens to pick the same directory path? If a relative path is supplied, then Hive will write the results to "/user/hive/some_relative_dir". I tried to read the best practices link at the end of this post, but the page is not loading.
04-27-2016
03:26 AM
Actually, I tried restarting them before the reboot. Restarted everything. Still had the errors. Then did the reboot and they cleared. Oh well, it is what it is!
04-26-2016
05:22 AM
Well, I would still love to understand how this worked, but a reboot of the 4 DNs made this error go away. Never would have thunk it! That's pretty strange... Anyway, maybe this will help some other poor soul who hits this same condition. 🙂
04-26-2016
03:53 AM
So, I was installing a new cluster for our development QA testing and failed to adjust the default configuration that Ambari gave me for dfs.datanode.data.dir. It ended up putting in every partition it found. I really only wanted one - /grid/1, which was a dedicated disk partition for HDFS block storage. I discovered this blunder after the installation completed. I did not want to just redo the whole install, opting instead to try to manually fix this as a good deep dive learning experience. I got it all worked out, and it was a good learning exercise, but I have one lingering issue that I cannot solve. I am getting 4 Ambari errors (one for each DN) that state: Detected data dir(s) that became unmounted and are now writing to the root partition: /grid/1/hadoop/hdfs/data. I figured out that Ambari agents monitor (and remember) which HDFS directories were previously mounted, and they check to see if a mounted disk goes away for some reason and display an error. That's all fine - I get that. However, my setup seems to be correct, yet Ambari is still complaining. As part of correcting the configuration issue I laid upon myself, I did edit the following file (on each DN) to remove all those previously cached mount points that I did not want, and I just left the one I did want. I ended up just stopping HDFS, removing all the /opt/hadoop/hdfs, /tmp/hadoop/hdfs, etc. directories, removing the name node metadata directories, reformatting the namenode, and starting up HDFS. The file system is up and working. But can anyone tell me why I cannot get rid of this Ambari error? Here's the contents of one of the dfs_data_dir_mount.hist files. All 4 are exactly the same. Below, you can see the mount where I have a disk for HDFS data storage. It all looks good. I must be missing something obvious. I did restart everything - nothing clears this error. Thanks in advance...
[root@vmwqsrqadn01 ~]# cat /var/lib/ambari-agent/data/datanode/dfs_data_dir_mount.hist
# This file keeps track of the last known mount-point for each DFS data dir.
# It is safe to delete, since it will get regenerated the next time that the DataNode starts.
# However, it is not advised to delete this file since Ambari may
# re-create a DFS data dir that used to be mounted on a drive but is now mounted on the root.
# Comments begin with a hash (#) symbol
# data_dir,mount_point
/grid/1/hadoop/hdfs/data,/grid/1
[root@vmwqsrqadn01 ~]# mount -l | grep grid
/dev/sdb1 on /grid/1 type ext4 (rw,noatime,nodiratime,seclabel,data=ordered)
Labels:
- Apache Ambari
- Apache Hadoop
04-18-2016
08:35 PM
@Ancil McBarnett, Could you please explain why you propose /dev as a separate 32G partition?
03-19-2016
07:52 PM
4 Kudos
@Ryan Chapin Looking for some suggestions. We have a query that was hanging indefinitely in Hive on Tez. We are running with HDP 2.4.0. After some debugging, we narrowed it down to a single ORC file in a Hive partition that contained that file plus about 10 others. If we move this one file out of the partition and test the query, it now completes. If we include ONLY that one file in the partition, the query hangs. Even a simple "select * from table" hangs. The query never gets beyond the first map task. I then discovered the ORC file dump feature of Hive and ran the following on this file: hive --orcfiledump --skip-dump --recover -d hdfs://vmwhaddev01:8020/tmp/000002_0 > orc.dump This command never returns to the command line and hangs, similar to what Hive does. I Tested this on a known good file and the dump completes successfully and returns control to the command line as expected. So, even this dump test is hanging. If I tail orc.dump during the hang, the last line of the file looks complete. So, I am wondering if this line is the end of a stripe and the next stripe is corrupt? The ORC reader seems to get into some infinite loop at this point. Once the dumped output file size stops increasing, at about 382 MB, top command shows the Hive dump process continuously using about 99.9% CPU until I CTRL+C the command. It dumps about 382 MB of data before the dump command hangs. Here's last line which is complete: {"_col0":"DSN001000325021","_col1":10784199,"_col2":1457431200,"_col3":1457434800,"_col4":20209,"_col5":0,"_col6":60,"_col7":10,"_col8":1456222331,"_col9":120,"_col10":117,"_col11":114,"_col12":0,"_col13":0,"_col14":0,"_col15":0,"_col16":0,"_col17":0,"_col18":120,"_col19":121,"_col20":123,"_col21":124,"_col22":125,"_col23":0,"_col24":0,"_col25":15296815,"_col26":2,"_col27":0,"_col28":163528,"_col29":88,"_col30":1498,"_col31":29082,"_col32":874908,"_col33":51565,"_col34":104138,"_col35":149,"_col36":6,"_col37":0,"_col38":0,"_col39":1508,"_col40":0,"_col41":2248,"_col42":46961,"_col43":624,"_col44":1732,"_col45":0,"_col46":0,"_col47":0,"_col48":0,"_col49":41,"_col50":159,"_col51":12,"_col52":30,"_col53":0,"_col54":0,"_col55":0,"_col56":0,"_col57":0,"_col58":0,"_col59":0,"_col60":0,"_col61":1398668,"_col62":5916915,"_col63":5916855,"_col64":5986115,"_col65":249,"_col66":66,"_col67":547,"_col68":76,"_col69":132618,"_col70":17398,"_col71":140325,"_col72":19012,"_col73":0,"_col74":0,"_col75":0,"_col76":0,"_col77":"TUC04HNSIGW63B002Adv","_col78":1456805959,"_col79":0,"_col80":158,"_col81":136,"_col82":0,"_col83":0,"_col84":1,"_col85":0,"_col86":12,"_col87":12,"_col88":0,"_col89":0,"_col90":0,"_col91":0,"_col92":12,"_col93":12,"_col94":0,"_col95":0,"_col96":0,"_col97":0,"_col98":18,"_col99":14,"_col100":0,"_col101":0,"_col102":0,"_col103":0,"_col104":12,"_col105":12,"_col106":0,"_col107":0,"_col108":0,"_col109":0,"_col110":51565,"_col111":37383,"_col112":0,"_col113":402,"_col114":0,"_col115":449,"_col116":3126,"_col117":46256,"_col118":28682,"_col119":4,"_col120":0,"_col121":0,"_col122":0,"_col123":0,"_col124":0,"_col125":0,"_col126":0,"_col127":0,"_col128":0,"_col129":0,"_col130":0,"_col131":0,"_col132":0,"_col133":0,"_col134":0,"_col135":0,"_col136":0,"_col137":0,"_col138":0,"_col139":0,"_col140":0,"_col141":0,"_col142":0,"_col143":"20.2.1-3569","_col144":0,"_col145":0,"_col146":0,"_col147":0,"_col148":0,"_col149":0,"_col150":0,"_col151":0,"_col152":0,"_col153":0,"_col154":0,"_col155":0,"_col156":0,"_col157":0,"_col158":0,"_col159":0,"_col160":0,"_col161":0,"_col162":0,"_col163":0,"_col164"
:0,"_col165":0,"_col166":0,"_col167":0,"_col168":0,"_col169":0,"_col170":0,"_col171":0,"_col172":0,"_col173":4,"_col174":"12-AUG-15","_col175":12,"_col176":"HT1100","_col177":"1","_col178":"B4WB16S2","_col179":"CORE_DSN_PROD_HT1100_50K","_col180":"SW_DSN_PROD_HT1100_50K","_col181":"3.2.0.24","_col182":"1000"}
I am trying to determine whether I have uncovered a bug or the data I am inserting into the ORC somehow resulted in this condition. Either way, it seems like a bug if you can insert data that causes an ORC file to become corrupted. The ingest pipeline for this data is as follows. I convert raw CSV files into Avro and land them in an HDFS directory. There could be multiple Avro schemas in play here, as there are multiple versions of these CSV files in flight. The Avro schemas are designed such that I can include all versions in the same Hive table. Typically, newer versions of these stats files add more columns of stats. Once a day, I move all the Avro files that have accumulated to a temp directory and create an external table over the files in that directory. I run a query that selects * from the external table and inserts all the results into another Hive managed table that is in ORC format, effectively using Hive to perform the Avro to ORC conversion. This query also performs a join with some data from one other table to enrich the data landing in the ORC table. This table is partitioned by year/month/day. Because the resulting ORC files are relatively small for HDFS, I perform one final step after the ORC insert query completes. I run a Hive query against the newly created partition to effectively compact the ORC files. Typically, the reduce part generates around 70 ORC files. I run a query like the following for the appropriate year, month, and day of the partition just created, which typically compacts all 70 ORC files into about 5 much larger ones that are about 2-3 HDFS blocks (128 MB) in size each.
alter table table_name partition (year=2016, month=3, day=16) concatenate;
This is the first such issue we've seen in over two months of ingesting such files in this manner. Does anyone have any ideas of where to look further to possibly understand the root cause of this problem? Maybe the concatenate operation happened to cause the file corruption in this case? Anyone heard of such a thing? Should I file a bug report and provide this corrupt ORC file for some forensic analysis? I don't really want to start trying to hex dump and decode ORC to figure out what happened.
Labels:
- Apache Hive
- Apache Tez
03-07-2016
11:11 PM
1 Kudo
Thank you @vpoornalingam. That helps demystify this step. I tried the same query as @Rich Raposa and it did not list any of my FSRoots. Then I realized the --config path in the example is wrong for my install of Ambari 2.2 and HDP 2.3.0. It should be "/etc/hive/conf/conf.server". Using that, I do get the expected listing:
[hive@vmwhaddev01 ~]$ hive --config /etc/hive/conf/conf.server --service metatool -listFSRoot
Initializing HiveMetaTool..
<info traces>
Listing FS Roots..
hdfs://vmwhaddev01:8020/apps/hive/warehouse
hdfs://vmwhaddev01:8020/apps/hive/warehouse/jupstats.db
hdfs://vmwhaddev01:8020/apps/hive/warehouse/aggregates.db
hdfs://vmwhaddev01:8020/apps/hive/warehouse/imports.db
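(Related, in case it helps anyone following the same HA guide: the same metatool can also rewrite those FS roots after the nameservice changes. A sketch with placeholder URIs - verify the flags against your Hive version; -dryRun only previews the change:)
hive --config /etc/hive/conf/conf.server --service metatool -updateLocation hdfs://mycluster hdfs://vmwhaddev01:8020 -dryRun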
03-05-2016
06:16 PM
1 Kudo
@Ancil McBarnett Hey, man, thanks for this write up. Very helpful in gaining insight into the big picture. I am working on the requirements for my prod cluster. Question for you... Based on my somewhat novice knowledge, it seems like overkill to use RAID-10 and 4 1TB drives for the HA NNs when you are running QJM (which I assume is the case, as the diagram also shows ZKs and JNs). All the edits go to the JN JBOD disks. So, that just leaves a couple of fsimage files for the RAID-10 arrangement against 1TB - isn't that a bit of overkill on storage? Another confusion point for me is that the NN disk layout diagram above shows fsimage and edits going to the RAID-10 disks. Edits are written to the JNs against the one JBOD disk, right? So, is the diagram misleading or am I missing some intended message there? Here's a question I just asked related to this comment.
03-05-2016
06:14 PM
3 Kudos
I am trying to determine and plan the best disk layout for my active/standby NNs for a new production rollout that is going to run with QJM NN HA. I plan to have three servers, each running an instance of ZK and JN. The plan, per recommendations from another question I asked on this forum, is to have a dedicated RAID-1 disk on each server for the JNs to use for edit logs. I expect that array to use 256-512 GB sized disks. Each server will also have a dedicated disk for the OS, logs, tmp, etc., also RAID-1, using two 1.2TB drives. Each ZK instance will also have dedicated disks (spindles) per recommendations here. I am having a hard time answering this question... Where to store the fsimage files? Could I, for example, store them on the same RAID-1 disk that the JNs are using? I do plan to collocate two of the three JNs on the same two servers running the NNs and the third JN on a third server, so the NN would have access to the same drives used by the JNs. This collocation seems to be a commonly recommended arrangement. Or, should the fsimage files be pointed to a separate RAID-1 disk array just for that purpose? Another option would be to point the fsimage files to a separately sized partition on the OS RAID-1 disk array. These questions do NOT come from a sizing perspective, but more from a workload perspective. Fitting the files somewhere is easy to figure out. The real question is about the performance impacts of mixing, for example, the background checkpointing operations done by the standby namenode with the work being done by the JNs to save edits, and putting all that onto the same spindle. I see clearly that ZK should be kept on its own spindle due to how it uses a write-ahead log and how latency, in that case, is a huge concern for ZK performance. I just don't have a good feel for mixing the two workloads of checkpointing and edit log updates. Can someone please make some recommendation here?
Labels:
- Apache Hadoop
03-05-2016
05:16 PM
2 Kudos
I am currently planning ahead in preparation to deploy a production cluster that will run using QJM NN HA. Reading this, all looks reasonable except when I get to the last step (15). It seems to instruct you to change configuration directly on the hosts. Won't that be overwritten by Ambari in subsequent pushes of other config updates? Shouldn't these changes be made via Ambari config changes?
Labels:
- Apache Ambari
- Apache Hadoop
02-13-2016
06:53 PM
1 Kudo
Thanks for the clarifications. All makes sense now.
02-12-2016
08:23 PM
1 Kudo
+1 for shameless plugs. Nice article and thank you for it!
02-12-2016
08:22 PM
1 Kudo
@Benjamin Leonhardi, Nice writeup. Thank you for taking the time to be so thorough!!! All the links were very helpful, too. You read my mind on the Yahoo performance link - that was the next topic I was going to research. 🙂 A couple of follow-up clarification questions/comments...
Q1 - Chris' blog (thanks @Chris Nauroth) answered the remaining point. Only the "edits in progress" changes need to be applied to the fsimage by the failover NN when it takes over; all the completed edits on the JNs should have already been applied.
Q2 - So, my focus was on the JNs writing to each of their own disks, and I completely missed the point that the NN needs some place to build the fsimage file. So, would you just point that to the same disk used by the JNs (assuming that I am going to collocate the JNs on the same hosts as the NNs)?
Q3 - I was thinking more about a disk failure, where a failed disk means a failed JN. So separate disks for each JN means more reliability. Do you recommend some other arrangement?
Q5 - "So he writes a checkpoint regularly and distributes it to the active namenode." You mean "distributes it" through the JNs in the normal HA manner of publishing edits, right, or am I missing something here?
02-11-2016
09:08 PM
6 Kudos
I am trying to put together a hardware specification for name nodes running in HA mode. That made me have to think about disk allocation for name nodes. I pretty much get it with non-HA: use one RAID drive and another NFS mount for redundancy, the SNN incrementally applies changes in the edit log to the fsimage, etc. But I want to run HA. And I want to use Journal Nodes (JN) and the Quorum Journal Manager (QJM) approach. So, that made me think about this scenario, and I was not sure I was getting it right and wanted to ask some gurus for input. Here's what I think... Can you please confirm or correct? I think a scenario-type question will help me more easily ask the questions, so here goes. Assume a clean install. Primary and failover NNs both have empty fsimage files. The primary starts running and writing changes to all three JNs. As I understand it, the failover NN will be reading all those changes, via the JNs, and applying them to its empty fsimage to prepare it to be 100% complete should it be called to take over (faster startup time). Now the primary fails. The failover NN starts up, reads in the fsimage file, and starts accepting client requests as normal. It now starts to write edits to the JNs. But the formerly primary NN is still down, so it is NOT reading updates from the JNs. So, its fsimage remains empty, essentially. Next, I fix the formerly primary NN and start it up. It now becomes the failover NN. At this point, I guess it starts reading changes from the JNs and building up its empty fsimage with all changes to date, in hopes that it will once again rule the world and become active should the other NN fail some day.
Q1 - Is it true that the failover NN will NEVER have to apply any edit log changes at startup but simply loads its fsimage and starts running, because it assumes the fsimage is already 100% up to date via recent JN reads?
Q2 - In a setup with 3 JNs as a quorum, what should the disk layout look like on the three servers hosting those JNs? Because the edits are now distributed x3, should I just have a single disk per JN host dedicated to the JNs? No need for the one-RAID-plus-second-NFS type arrangement used in non-HA mode? Specifically, the disk resources typically used for a non-HA NN, where the NN writes edit log changes, now become disk resources used exclusively by the JNs, right? Meaning, the NNs never read/write anything directly to disk (except for configuration, I assume) but rather ALL goes through the JNs.
Q3 - I believe I still should have one dedicated disk for each JN on each host to isolate that unique workload from other processes. So, for example, there might be one disk for the OS, one for JNs, and another for the ZK instances that are sharing the same server to support the ZKFC. Correct?
Q4 - Because JNs are distributed, it makes me think I should treat these disks like I do disks on the DNs, meaning no RAID, just plain old JBOD. Does that sound right?
Q5 - Is it the NN on the failover server that actually does the JN reads and fsimage updates now in HA mode, given that there is no SNN in such a configuration?
Thanks in advance for confirmation or any insight on this...
Labels:
- Apache Hadoop
02-09-2016
04:42 PM
https://community.hortonworks.com/questions/15422/hive-and-avro-schema-defined-in-tblproperties-vs-s.html
02-09-2016
04:41 PM
4 Kudos
I built a DWH-style application on Hive that is centered on Avro to handle schema evolution. My source data is CSV, and the files change when new releases of the applications are deployed (like adding more columns, removing columns, etc.). So, I decided to convert everything to Avro on ingest and then strictly deal with Avro inside Hadoop. For most of my data, I convert to Avro and drop it into HDFS into date partitions and allow users to access that data using external Hive tables. For some of these CSV files, there are up to 6 versions in flight at once (each defined using a separate Avro schema), all landing in the same Hive table. So, like you, I wanted to be able to query over all the data no matter what the version. The approach seems to be working well. I did, however, want to chime in on TBLPROPERTIES. I just posted a question related to this. It seems like we should be using SERDEPROPERTIES, not TBLPROPERTIES, when defining the URL to the schema file. All my Avro schema files are in an HDFS directory - I did not want to use the literal approach of defining the schema inline in TBLPROPERTIES. I was creating my Hive tables using TBLPROPERTIES pointing to the URL of the newest schema, which is defined with defaults and properly designed to be a superset of all earlier schemas, allowing them all to coexist in the same Hive table. However, I recently tried to build a Spark SQL application using HiveContext to read these tables and was surprised to find that Spark threw Avro exceptions. Creating the table using SERDEPROPERTIES to define the avsc URL was the solution to make the data accessible from both Hive and Spark. Just thought I would mention it to save you some hassles down the road if you ever need Spark SQL access to that data.
01-19-2016
08:41 PM
1 Kudo
Interestingly, I just upgraded Ambari from 2.1 to 2.2 as part of my upgrade plans and the Hive service check now passes. The stack trace does show Ambari running various command scripts that implement this check.
01-19-2016
02:20 AM
1 Kudo
Well, I solved it another way - I just upgraded to Ambari 2.2. I will watch to see if this comes back as I run with this version and open a ticket if that happens. I was in the midst of upgrading anyway when this started to happen. I need to move to HDP 2.3.4 and Spark 1.5. But, 8 GB, really? Wow! That seems ridiculously expensive for a monitoring framework.
01-18-2016
08:44 PM
My installation differs from that link's instructions, but I did not change these from what the base install of HDP created. Why the discrepancy? For "/apps/hive", your link says:
hdfs dfs -chown -R $HIVE_USER:$HDFS_USER /apps/hive
hdfs dfs -chmod -R 775 /apps/hive
My setup is:
drwxr-xr-x 3 hdfs hdfs 96 Sep 30 15:09 hive
For "/tmp/hive", your link says:
hdfs dfs -chmod -R 777 /tmp/hive
My setup is:
drwx-wx-wx 11 ambari-qa hdfs 352 Jan 15 22:13 hive
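(If the documented permissions are what the service check expects, a sketch of bringing the directories in line with the linked guide would look like this - run as the HDFS superuser, and check it against your own security requirements first:)
sudo -u hdfs hdfs dfs -chown -R hive:hdfs /apps/hive
sudo -u hdfs hdfs dfs -chmod -R 775 /apps/hive
sudo -u hdfs hdfs dfs -chmod -R 777 /tmp/hive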
01-18-2016
07:49 PM
I'm working through this procedure to upgrade Ambari from 2.1.1 to 2.2.0 before I start an HDP upgrade from 2.3.0 to 2.3.4: https://docs.hortonworks.com/HDPDocuments/Ambari-2.2.0.0/bk_upgrading_Ambari/content/_preparing_to_upgrade_ambari_and_hdp.html
It says to run the service checks on installed components first. They all passed except the Hive check, for which I get this access error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.security.AccessControlException: Permission denied: user=ambari-qa, access=WRITE, inode="/apps/hive/warehouse":hive:hdfs:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
The /apps/hive/warehouse user:group is set to hive:hdfs. The ambari-qa user that runs the service checks, on the node where Ambari says the checks are being run, belongs to ambari-qa:hadoop:
[jupstats@vmwhadnfs01 hive]$ id ambari-qa
uid=1001(ambari-qa) gid=501(hadoop) groups=501(hadoop),100(users)
So, ambari-qa is a member of the hadoop group, but that group has no write permission to Hive managed tables, which are owned by hive and readable only by hdfs users. I'm not sure what the Ambari service check is trying to do, but it is clearly trying to write something in that managed table space. As I understand it, the hadoop superuser is the user that starts the namenode, user "hdfs" in my case. So, my questions are:
1. Should "hdfs" really be the group for /apps/hive/warehouse, or would it be better to have that be the "hadoop" group?
2. What are the best practice recommendations for the user:group permissions on /apps/hive/warehouse? For example, I have some Java and Python apps that run every 30 minutes to ingest data into Hive managed and external tables. Those processes run as a service user "jupstats" and group "ingest". My /apps/hive/warehouse/jupstats.db directory is where the managed tables live, and that directory is set to jupstats:ingest to restrict access appropriately. This seems right to me. Do you experts agree? Same for the directories where I also write some HDFS data that is accessed by external Hive tables. Those files are owned as jupstats:ingest.
3. I think I am generally lacking knowledge in how best to set up access to the various Hive tables that are eventually going to need to be accessed by various users. My thought was that all my jupstats.db tables, which are readable only by group ingest, will be made readable by these users by adding those users to the "ingest" group. Does that approach seem reasonable?
4. This still leaves me with the question of how I set up Hive so that this Ambari service check can pass. Should I add ambari-qa to the "hdfs" group? That feels wrong and dangerous in that it is like adding ambari-qa to a root-like account, since user "hdfs" is the hadoop superuser and can whack a lot of stuff.
Thanks for any help/tips on this...
Labels:
- Apache Ambari
- Apache Hive
01-18-2016
06:15 PM
@Neeraj Sabharwal Hey Neeraj! I'm having this same issue now, too. Do you have any input regarding how support troubleshot it for you, or do I need to hit them up with a ticket as well to get an answer?