We want to share some of the data we have in Hive with external users - those who do not have access to our cluster.
Ideally, I would like to set up a daily process to push the output of certain queries to AWS in such a way that users outside our office/VPN could run queries against that data.
I can think of a handful of complicated ways to accomplish this, but I'm guessing this has to be a pretty common use case.
What are some good options for getting large amounts of data from a bare-metal cluster to a cloud resource without tying up an upload pipe all the time?
Thanks in advance
hive -e "create table my_export stored as textfile location 's3n://my_bucket/my_export' as select * from my_table;"
Or some variation of the above works pretty well. This will be a parallel write into S3, as if it were just another directory on HDFS. You will need to use the AWS APIs to configure the needed security policies on the bucket and/or "subfolders" (a rough sketch is at the end of this post).
S3a would be better, but depending on your cluster version, s3n will just work, whereas S3a may still have some kinks to be worked out.
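For the bucket-permissions piece, here's a minimal boto3 sketch. The bucket name, prefix, and external account ARN are placeholders you'd swap for your own; depending on your security requirements, IAM roles, cross-account access, or pre-signed URLs may be a better fit than a bucket policy.

import json
import boto3

# Hypothetical names -- replace with your bucket, export prefix, and the
# external user's AWS account or IAM ARN.
BUCKET = "my_bucket"
PREFIX = "my_export"
EXTERNAL_PRINCIPAL = "arn:aws:iam::123456789012:root"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Let the external principal list only the export prefix...
            "Effect": "Allow",
            "Principal": {"AWS": EXTERNAL_PRINCIPAL},
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::%s" % BUCKET,
            "Condition": {"StringLike": {"s3:prefix": "%s/*" % PREFIX}},
        },
        {
            # ...and read the objects under it.
            "Effect": "Allow",
            "Principal": {"AWS": EXTERNAL_PRINCIPAL},
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::%s/%s/*" % (BUCKET, PREFIX),
        },
    ],
}

boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))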
That's helpful - and very slick - for a particular use case.
But suppose the end user (who runs the query) doesn't have a Hadoop cluster or EMR instance.
They just want to open SquirrelSQL (or whatever) and run a query.
That's more the situation I'm trying to figure out.
Correct me if you know otherwise, but Squirrel isn't a SQL engine itself. It needs a backend to connect to before it can do much of anything.
When doing data exploration outside of a Hadoop environment, I've had success using standalone Zeppelin/Spark as a way to run SQL against static files (including files stored in S3).
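As a sketch, something like this in a Zeppelin pyspark paragraph (or a plain pyspark shell) would let you query the files the Hive job wrote. The path and the Ctrl-A field delimiter are assumptions based on the export above, and you'd still need the hadoop-aws/S3 jars plus AWS credentials configured for the s3a filesystem.

from pyspark.sql import SparkSession

# In Zeppelin the session usually already exists as `spark`; building one
# explicitly here so the snippet also works from a plain pyspark shell.
spark = SparkSession.builder.appName("s3-data-exploration").getOrCreate()

# Assumes the Hive export above: Ctrl-A delimited text files under this prefix.
df = (spark.read
      .option("sep", "\x01")
      .option("inferSchema", "true")
      .csv("s3a://my_bucket/my_export/"))

df.createOrReplaceTempView("my_export")
spark.sql("SELECT COUNT(*) FROM my_export").show()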