
Cannot connect to Hive on CDH4.5 EC2 installation using Cloudera ODBC 2.5.5 Windows x64 drivers

Explorer

Hi All,

 

I've installed a CDH4.5 Hadoop cluster on Amazon EC2 using the instructions here:

 

http://blog.cloudera.com/blog/2013/03/how-to-create-a-cdh-cluster-on-amazon-ec2-via-cloudera-manager...

 

All seems to be working OK, but I can't connect to it from a Windows VM on my laptop using either the Hive or Impala ODBC drivers. I've connected this VM to the Quickstart VM in the past via the Impala ODBC drivers, but I can't seem to connect to CDH4 running on EC2 at all. Checking one of the EC2 instances, it doesn't even look as if port 10000 (the Hive port) is in use, yet Hive is running and the HiveServer2 configuration properties in CM say it's using port 10000.


Ports are open within the EC2 security group. Is there something obvious I'm missing here?

 

Mark



Explorer

The networking would have been whatever the Cloudera installer set up within EC2 - I didn't specifically set anything up; the wizard that runs from the initial cloud instance sets up the Hadoop instances, their keypairs, their security groups and so on. So, in theory, it should presumably just work.


When I SSH into one of the instances and run python -c "import socket; print socket.getfqdn(); print socket.gethostbyname(socket.getfqdn())", it returns the hostname and the internal IP address (internal to EC2). I wonder - do I have to create an SSH tunnel to get access to the EC2 machines within the AWS network, like this: http://www.toadworld.com/products/toad-for-cloud-databases/f/102/t/5773.aspx - I'm thinking not, though, as port 10000 is open in those instances' security groups.
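As an aside, a quick way to tell "nothing is listening" apart from "a firewall is dropping packets" is to classify the TCP connect result from the client side. This is just a sketch (not from the thread); the EC2 hostname in the usage comment is a placeholder:

```python
import socket

def probe(host, port, timeout=3.0):
    """Classify a TCP connection attempt to host:port.

    'open'    -> something is listening (e.g. HiveServer2 is up)
    'refused' -> host reachable, but nothing bound to the port
    'timeout' -> packets dropped; likely a firewall / security group issue
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return "error: %s" % exc
    finally:
        sock.close()

# e.g. probe("ec2-xx-xx-xx-xx.compute-1.amazonaws.com", 10000)  # placeholder host
```

A "refused" here would point at HiveServer2 not running, while a "timeout" would point at the security group, even with the ODBC driver out of the picture.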

 

Another thing - the other thread I linked to had the same problem, and he was using the Quickstart VM. I've also had the same problem (with Hive, not Impala) on Quickstart too, so is it a generic issue with machines set up using this installer/CDH4?

 

Mark

Explorer

I'm getting the impression, from the lack of responses and the lack of content on this in the forums, that Hive is effectively deprecated in favour of Impala for SQL-like access to CDH. However, it's the only way I can access Hadoop from ETL tools such as Oracle Data Integrator, which would rule out this platform as a source for "Hadoop ETL".

 

If I come up with a solution I'll post it here, but it's disappointing that I can't get it to work.

Guru

Mark,

 

  I've raised this thread with some of our subject matter experts internally in hopes somebody has some insight for you.

 

Regards.

Cloudera Employee

Mark,

 

What is the error you are seeing when trying to connect to HS2?  

 

Do you have security configured on your cluster, or is it unsecured? If it is unsecured, I have seen a problem when using the "No Authentication" mechanism in the DSN setup. There is a workaround for this: first, choose the "User name" mechanism, then type something in the User name option. It doesn't matter what it is.
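The same workaround can be expressed as a DSN-less connection string instead of the DSN dialog. A sketch follows; the host is a placeholder, and the key names assume the Simba-based Cloudera Hive ODBC driver, where AuthMech=2 selects "User Name" (0 is "No Authentication") - treat the exact values as an assumption to check against your driver's documentation:

```python
# Hypothetical DSN-less connection string for the Cloudera Hive ODBC driver.
# Host is a placeholder; AuthMech=2 is assumed to select the "User Name"
# mechanism, and UID can be any non-empty value on an unsecured cluster.
conn_str = (
    "Driver=Cloudera ODBC Driver for Apache Hive;"
    "Host=ec2-xx-xx-xx-xx.compute-1.amazonaws.com;"  # placeholder
    "Port=10000;"
    "AuthMech=2;"
    "UID=anything;"
)

# To use it (requires the driver and the pyodbc package installed):
# import pyodbc
# conn = pyodbc.connect(conn_str, autocommit=True)
```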

 

 

Can you connect through beeline locally?

 

 

Explorer

Hi Dave,

 

The error message that the Cloudera ODBC Driver for Apache Hive Data Source dialog shows, when testing out a new system DSN on Windows Server 2008 R2 64-bit, is:

 

"

Driver Version: V2.5.5.1006

Running connectivity tests...

Attempting connection
Failed to establish connection
SQLSTATE: HY000[Cloudera][HiveODBC] (34) Error from Hive: connect() failed: errno = 10061.

TESTS COMPLETED WITH ERROR."

 

On the CDH4 side, even testing against the VMware Quickstart VM using beeline on that VM, I get the error message:

 

"Error: Could not establish connection to jdbc:hive2://localhost:10000: java.net.ConnectException: Connection refused (state=08S01,code=0)"

 

I don't think the Hiveserver2 service is actually installed or running on the VM (or on the EC2 installs). If I try:

 

sudo service hive-server2 start

 

I get

 

hive-server2: unrecognized service

 

and if I type in:

 

chkconfig --list

 

it's not listed there - which makes me think hiveserver2 (or its Thrift service) isn't running, and that's why nothing can connect.

 

Mark

Cloudera Employee

I believe in your first post you mentioned that you are using CM. If you're using CM to manage the cluster then you won't see the hive-server2 service from the command line - you'll have to add the instance and start it from CM. The default settings for HiveServer2 are listed in the configuration, but by default the instance is not added or started. Here is the documentation for adding a role instance. Once you have added the HiveServer2 instance you can start it and should be able to access it straight away.

 

Hopefully this will get you going.  Please let me know your results.

 

You can also use the following command on the Quickstart VM or your EC2 setup to verify that port 10000 is in use once you start HiveServer2:

 

sudo netstat -tulpn | grep 10000
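Where netstat isn't available, a rough substitute (run on the server itself) is to try binding the port yourself: a bind failure with EADDRINUSE means some process already holds it. A small sketch, not from the original post:

```python
import errno
import socket

def port_in_use(port, host="0.0.0.0"):
    """Rough netstat substitute: try to bind the port ourselves.

    Returns True if the bind fails with EADDRINUSE, i.e. some process
    (hopefully HiveServer2 on port 10000) already holds the port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return False
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return True
        raise
    finally:
        sock.close()

# e.g. port_in_use(10000) on the HiveServer2 host
```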

 

Dave

Explorer

Hi Dave

 

Perfect, that was it. The HiveServer2 service, as you say, isn't installed by default, but adding it via those steps in CM enabled it, and now I can connect OK. Thanks for your help.

 

Is there any reason this service isn't installed and enabled by default? As you say, looking at the instance details in CM, it looks like it should be there, and nobody would be able to connect via the Hive ODBC/JDBC drivers from an external machine until this is set up.

 

Mark


Hi Mark,

 

HiveServer2 is only available in CDH4.2+, and is considered generally optional in CDH4.x. Many customers use the old Hive CLI on CDH4, since HiveServer2 is somewhat new. CM generally tries not to install optional roles, just required ones.

 

In CDH5, CM treats HiveServer2 as a required role for Hive, since the old Hive CLI will be deprecated in CDH5 and all clients should instead go through the beeline CLI and/or drivers that talk to HS2.

 

Hope this clears it up for you!

 

-Darren

Explorer

Thanks Darren. So does that mean that hiveserver1 (with thrift) is installed by default? My take is that there's no thrift server and no hiveserver2 by default, so by default there's no way you can connect via ODBC or JDBC to Hive? Is this correct?


CM doesn't manage hiveserver1 because it should basically never be used. It notably has no real support for concurrent operations, and you may get corruption in such scenarios.

 

Since HS2 is not set up by default, there's no way to use O/JDBC to connect to hive in CDH4 by default. You can, however, tell CM to initially install the HS2 role if you click "Inspect Roles" (or something similar) in the setup wizard, on the page where you select which service types you want to install. And as you know, you can also just add the role later.