The networking would have been whatever the Cloudera installer set up within EC2 - I didn't specifically set anything up. The wizard that runs from the initial cloud instance sets up the Hadoop instances, their keypairs, their security groups etc. So, in theory, it should just work.
When I run python -c "import socket; print socket.getfqdn(); print socket.gethostbyname(socket.getfqdn())" while SSH'd into one of the instances, it returns the hostname and the internal IP address (internal to EC2). I wonder - do I have to create an SSH tunnel to get access to the EC2 machines from outside the AWS network, like this - http://www.toadworld.com/products/toad-for-cloud-databases/f/102/t/5773.aspx - I'm thinking not, though, as port 10000 is open in those instances' security groups.
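One way to check this from the client side, before bothering with a tunnel, is to probe port 10000 directly. A minimal sketch (the EC2 hostname in the example is a placeholder - substitute the instance's public DNS name):

```python
import socket

def probe(host, port=10000, timeout=5.0):
    """Return 0 if the port accepted a TCP connection, else the socket errno."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        return s.connect_ex((host, port))
    finally:
        s.close()

# Example (placeholder hostname):
# print(probe("ec2-xx-xx-xx.compute-1.amazonaws.com"))
```

If this returns 0 from the external machine, the security group and networking are fine and the problem is on the server side; a nonzero "connection refused" errno (10061 on Windows, 111 on Linux) means nothing is listening on that port.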
Another thing - the other thread I linked to had the same problem, and he was using the Quickstart VM. I've also had the same problem (with Hive, not Impala) on Quickstart too, so is it a generic issue with machines set up using this installer/CDH4?
I'm getting the impression, from the lack of responses and the lack of content on the forums etc., that Hive is effectively deprecated in favour of Impala for SQL-like access to CDH. However, it's the only way I can access Hadoop from ETL tools such as Oracle Data Integrator, which would rule out this platform as a source for "hadoop ETL".
If I come up with a solution I'll post it here, but it's disappointing that I can't get it to work.
I've raised this thread with some of our subject matter experts internally in hopes somebody has some insight for you.
What is the error you are seeing when trying to connect to HS2?
Do you have security configured on your cluster, or is it unsecured? If it is unsecured, I have seen a problem when using the "No Authentication" mechanism in the DSN setup. There is a workaround for this. First, choose the "User name" mechanism. Then type something in the User name option - it doesn't matter what it is.
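On Windows those are options in the driver's DSN dialog, but on a Linux client the same workaround maps to keys in an odbc.ini entry. A rough sketch, assuming the driver's documented AuthMech key (2 = "User name") - treat the host, path and UID values as placeholders:

```ini
; Hypothetical odbc.ini DSN entry for the workaround above:
; "User name" mechanism (AuthMech=2) with an arbitrary UID.
[HiveCDH4]
Driver=/opt/cloudera/hiveodbc/lib/64/libclouderahiveodbc64.so
HOST=my-hs2-host
PORT=10000
AuthMech=2
UID=anything
```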
Can you connect through beeline locally?
The error message that the Cloudera ODBC Driver for Apache Hive Data Source dialog shows, when testing out a new system DSN on Windows Server 2008 R2 64-bit, is:
"Driver Version: V220.127.116.116
Running connectivity tests...
Failed to establish connection
SQLSTATE: HY000[Cloudera][HiveODBC] (34) Error from Hive: connect() failed: errno = 10061.
TESTS COMPLETED WITH ERROR."
Going over to CDH4, even testing it against the VMWare Quickstart VM using beeline on that VM, I get the error message:
"Error: Could not establish connection to jdbc:hive2://localhost:10000: java.net.ConnectException: Connection refused (state=08S01,code=0)"
I don't think the Hiveserver2 service is actually installed or running on the VM (or on the EC2 installs). If I try:
sudo service hive-server2 start
hive-server2: unrecognized service
and if I list the services on the machine, it's not there either - which makes me think hiveserver2 isn't running (or thrift), and that's why nothing can connect.
I believe in your first post you mention that you are using CM. If you're using CM to manage the cluster then you won't see the hive-server2 service from a command line. You'll have to add the instance and start it from CM. The default settings for HiveServer2 are listed in the configuration, but by default the instance is not added or started. Here is the documentation for adding a role instance. Once you have added the hiveserver2 instance then you can start it and should be able to access it straight away.
Hopefully this will get you going. Please let me know your results.
You can also use the following commands on the quickstart vm or your ec2 setup to verify that port 10000 is in use once you start hiveserver2:
sudo netstat -tulpn | grep 10000
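If netstat isn't to hand, the same check can be done from Python - a minimal sketch that simply attempts a loopback connection to port 10000:

```python
import socket

def is_listening(port, host="127.0.0.1", timeout=2.0):
    # True if something accepts a TCP connection on host:port.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(is_listening(10000))  # True once HiveServer2 is up and listening
```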
Perfect, that was it. The Hiveserver2 service, as you say, isn't installed by default, but adding it via those steps in CM enabled it, and now I can connect OK. Thanks for your help.
Is there any reason that this service isn't installed and enabled by default? As you say, looking at the instance details in CM, it looks like it should be there, and nobody would be able to connect via the Hive ODBC/JDBC drivers from an external machine until this is set up.
HiveServer2 is only available in CDH4.2+, and is considered generally optional in CDH4.x. Many customers use the old Hive CLI on CDH4, since HiveServer2 is somewhat new. CM generally tries not to install optional roles, just required ones.
In CDH5, CM treats HiveServer2 as a required role for Hive, since the old Hive CLI will be deprecated in CDH5 and all clients should instead go through the beeline CLI and/or drivers that talk to HS2.
Hope this clears it up for you!
Thanks Darren. So does that mean that hiveserver1 (with thrift) is installed by default? My take is that there's no thrift server and no hiveserver2 by default, so by default there's no way you can connect via ODBC or JDBC to Hive? Is this correct?
CM doesn't manage hiveserver1 because it should basically never be used. It notably has no real support for concurrent operations, and you may get corruption in such scenarios.
Since HS2 is not set up by default, there's no way to use O/JDBC to connect to hive in CDH4 by default. You can, however, tell CM to initially install the HS2 role if you click "Inspect Roles" (or something similar) in the setup wizard, on the page where you select which service types you want to install. And as you know, you can also just add the role later.