
Suggestions for sharing Hive table data in the cloud?

Super Collaborator

We want to share some of the data we have in Hive with external users - those who do not have access to our cluster.

Ideally, I would like to set up a daily process to push the output of certain queries to AWS in such a way that users outside our office/VPN could run queries against that data.

I can think of a handful of complicated ways to accomplish this, but I'm guessing this has to be a pretty common use case.

What are some good options for getting large amounts of data from a bare-metal cluster to a cloud resource without tying up an upload pipe all the time?

Thanks in advance


@Zack Riesland,

hive -e "create table my_export stored as textfile location 's3n://my_bucket/my_export' as select * from my_table;"

Or some variation of the above works pretty well. This will be a parallel write into S3, as if it were just another directory on HDFS. You will need to use the AWS APIs to configure the needed security policies on the bucket and/or its "subfolders".
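On the security-policy side, a hedged sketch: assuming a hypothetical bucket named my_bucket and an external AWS account ID of 111122223333 (both placeholders, not from this thread), a bucket policy like the following grants read-only access to the exported prefix:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowExternalReadOnly",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:root" },
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my_bucket",
        "arn:aws:s3:::my_bucket/my_export/*"
      ]
    }
  ]
}
```

It can be applied with `aws s3api put-bucket-policy --bucket my_bucket --policy file://policy.json`. Note that s3:ListBucket matches the bucket ARN while s3:GetObject matches the object ARNs; IAM simply ignores non-matching action/resource pairs within the statement.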

S3a would be better, but depending on your cluster version, s3n will just work, whereas S3a still has kinks to be worked out.
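If your cluster's s3a support is in good shape, the same export can be pointed at an s3a URI. A sketch, assuming credentials are passed via -hiveconf (the fs.s3a.* property names are the standard Hadoop ones; the bucket and table names are placeholders):

```sh
hive -hiveconf fs.s3a.access.key=YOUR_ACCESS_KEY \
     -hiveconf fs.s3a.secret.key=YOUR_SECRET_KEY \
     -e "create table my_export stored as textfile location 's3a://my_bucket/my_export' as select * from my_table;"
```

In practice you would keep the keys in core-site.xml or a credential provider rather than on the command line.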

Super Collaborator

Thanks Randy,

That's helpful - and very slick - for a particular use case.

But suppose the end user (who runs the query) doesn't have a Hadoop cluster or an EMR instance.

They just want to open SquirrelSQL (or whatever) and run a query.

That's more the situation I'm trying to figure out.

Correct me if you know otherwise, but Squirrel isn't a SQL engine itself. It needs a backend to connect to before it can do much of anything.

When doing data exploration outside of a Hadoop environment, I've had success using a standalone Zeppelin/Spark instance as a way to run SQL against static files (including files stored in S3).
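To make that concrete, a hedged sketch: the spark-sql CLI that ships with a standalone Spark download can query files in S3 directly using Spark's path-as-table syntax, assuming the hadoop-aws jars and credentials are available (the bucket path is a placeholder):

```sh
spark-sql \
  --conf spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY \
  -e "SELECT count(*) FROM csv.\`s3a://my_bucket/my_export/\`"
```

The same `format.\`path\`` syntax works for parquet, orc, and json, so the exported files can be queried without defining any tables first.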

Super Collaborator

Thanks Randy.

I wonder if anyone has experience connecting Hive to Aurora or RDS?
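One common route for that, sketched under assumptions: Sqoop can push a Hive table's HDFS files into a MySQL-compatible RDS/Aurora endpoint, after which external users can query it with any JDBC client (SquirrelSQL included). The endpoint, database, user, table, and warehouse path below are all placeholders:

```sh
sqoop export \
  --connect jdbc:mysql://my-aurora-cluster.cluster-abc123.us-east-1.rds.amazonaws.com/mydb \
  --username dbuser -P \
  --table my_export \
  --export-dir /apps/hive/warehouse/my_export \
  --input-fields-terminated-by '\001'
```

The `\001` delimiter is Hive's default field separator for text tables; adjust it if the export table was written with a different delimiter.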