Is it possible to perform processing on s3 buckets without bringing data into HDFS?

Hi experts

I wanted to check and validate my understanding. Is it possible to run Hive queries directly against an S3 bucket without bringing the data into HDFS? Suppose we have a really large S3 store (for example, 300 million documents per day) and we want to process these documents, extract their text, and store only the extracted text. What is the best possible solution?

Re: Is it possible to perform processing on s3 buckets without bringing data into HDFS?

@Naveen Keshava It is possible to use S3 as the storage for Hive. For example uses, refer to the documentation at https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.14.1/bk_hdcloud-aws/content/s3-hiv....
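
As a rough illustration of what "S3 as the storage for Hive" means in practice, the sketch below builds the HiveQL for an external table whose `LOCATION` is an `s3a://` path, so the data stays in the bucket. The table name, column names, delimiter, and bucket path are all hypothetical placeholders, not taken from the linked documentation.

```python
def external_table_ddl(table, columns, s3_path):
    """Build HiveQL for an external table whose data stays in S3."""
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{s3_path}'"
    )


# Hypothetical table and bucket, for illustration only.
ddl = external_table_ddl(
    "doc_text",
    [("doc_id", "STRING"), ("body", "STRING")],
    "s3a://example-bucket/extracted-text/",
)
print(ddl)
```

The printed statement could then be run from the Hive CLI or Beeline once the cluster has S3 credentials configured.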

Re: Is it possible to perform processing on s3 buckets without bringing data into HDFS?

Thanks for your response, @Mike Riggs. However, in my case the S3 repository does not contain only structured content. There are some PDFs, PPTs, and Word documents as well, and I want to extract text from them. So maybe my only option is to bring the data into HDFS?

Re: Is it possible to perform processing on s3 buckets without bringing data into HDFS?

@Naveen Keshava - Please read the documentation related to Storage Connectors:

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_cloud-data-access/content/intro.html

Using these connectors, you should be able to read and write data from S3.

However, please note that this is a new feature released with HDP 2.6.1, so make sure to use the latest HDP version.

Re: Is it possible to perform processing on s3 buckets without bringing data into HDFS?

Thanks for this, @Namit Maheshwari. So does it mean that the data should first land in HDFS in some way, whether as Hive tables or as plain files? What if I wanted to build a Solr index out of my files in S3?
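
To make the question concrete, here is a minimal sketch of an indexing flow that never lands files in HDFS: each object is fetched, its text is extracted, and the results are batched into JSON bodies of the kind Solr's `/update` handler accepts. The `fetch` and `extract_text` callables are hypothetical stand-ins (in a real job they might wrap an S3 client and a text extractor such as Apache Tika), and the `id` and `text` field names are assumptions about the Solr schema.

```python
import json
from itertools import islice


def to_solr_docs(keys, fetch, extract_text):
    """Yield one Solr document per object key, streaming one object at a time."""
    for key in keys:
        raw = fetch(key)  # hypothetical: returns the object's bytes
        yield {"id": key, "text": extract_text(raw)}


def batched(items, size):
    """Group an iterable into lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def solr_update_bodies(docs, batch_size=500):
    """Yield UTF-8 JSON arrays, one per batch of documents."""
    for batch in batched(docs, batch_size):
        yield json.dumps(batch).encode("utf-8")


# In-memory stand-ins so the sketch runs without S3 or a text extractor.
fake_store = {"reports/a.pdf": b"hello world", "slides/b.ppt": b"quarterly numbers"}
bodies = list(
    solr_update_bodies(
        to_solr_docs(sorted(fake_store), fake_store.__getitem__, lambda raw: raw.decode("utf-8")),
        batch_size=1,
    )
)
print(len(bodies), "batches")
```

Each body would be sent as an HTTP POST with `Content-Type: application/json` to the collection's `/update` endpoint; that step is left out here so the sketch stays self-contained.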

Re: Is it possible to perform processing on s3 buckets without bringing data into HDFS?

@Naveen Keshava

Yes, you can create a Hive external table pointing to your S3 data location. Before doing this, you will have to set these properties in Custom core-site.xml:

'fs.s3a.access.key': AWS_ACCESS_KEY,
'fs.s3a.secret.key': AWS_SECRET

Then set the properties below in Custom hive-site.xml and restart the affected services via Ambari:

'fs.s3a.awsAccessKeyId': AWS_ACCESS_KEY,
'fs.s3a.awsSecretAccessKey': AWS_SECRET,
'hive.exim.uri.scheme.whitelist': 's3a,hdfs,pfile'
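
For anyone editing the XML files by hand rather than through Ambari's key/value form, the sketch below renders the core-site properties listed above in the standard Hadoop `<configuration>`/`<property>` layout. `AWS_ACCESS_KEY` and `AWS_SECRET` remain placeholders, and the helper name `to_hadoop_xml` is made up for this example.

```python
import xml.etree.ElementTree as ET


def to_hadoop_xml(props):
    """Render a dict of properties as a Hadoop-style configuration XML string."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-print where available (Python 3.9+)
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Placeholder values; substitute real credentials outside version control.
core_site = {
    "fs.s3a.access.key": "AWS_ACCESS_KEY",
    "fs.s3a.secret.key": "AWS_SECRET",
}
xml_text = to_hadoop_xml(core_site)
print(xml_text)
```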

Re: Is it possible to perform processing on s3 buckets without bringing data into HDFS?

Thanks @ksuresh. However, what if my data in S3 is primarily unstructured? Let's say I have loads of PDFs and image files, and all I care about is extracting text from them and making it searchable.

My understanding is that the Hive metastore can only store structured content, so my idea was to use a Solr index instead, because my S3 data size is really massive: 300 million documents per day (let's say 5 MB per file on average).
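
A quick back-of-the-envelope calculation puts those figures in perspective. It uses only the numbers quoted above (300 million documents per day at an average of 5 MB each) and assumes decimal units and a perfectly even arrival rate, which real traffic will not have.

```python
DOCS_PER_DAY = 300_000_000
AVG_DOC_MB = 5
SECONDS_PER_DAY = 24 * 60 * 60

daily_mb = DOCS_PER_DAY * AVG_DOC_MB                # total MB ingested per day
daily_pb = daily_mb / 1_000_000_000                 # decimal units: 1 PB = 10^9 MB
docs_per_second = DOCS_PER_DAY / SECONDS_PER_DAY    # sustained document rate
gb_per_second = daily_mb / 1000 / SECONDS_PER_DAY   # sustained read throughput

print(f"{daily_pb:.1f} PB/day, {docs_per_second:.0f} docs/s, {gb_per_second:.2f} GB/s")
```

That is roughly 1.5 PB per day, or about 3,500 documents and 17 GB read every second on average, so whichever approach is chosen has to sustain that rate for both extraction and indexing.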