I wanted to check and validate my understanding. Is it possible to perform Hive queries directly on an S3 bucket without bringing the data into HDFS? Suppose we have a really large S3 store (for example, 300 million documents per day), and we want to process these documents, extract the text, and store only the extracted text. What is the best possible solution?
@Naveen Keshava It is possible to use S3 as the storage for Hive; for example usage, refer to the documentation at https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.14.1/bk_hdcloud-aws/content/s3-hiv....
Thanks for your response, @Mike Riggs. However, in my case the S3 repository does not necessarily contain only structured content. There are some PDFs, PPTs, and Word documents as well, and I want to extract text from them. So maybe my only option is to bring the data into HDFS?
@Naveen Keshava - Please read the documentation related to Storage Connectors. Using these, you should be able to read and write data from S3.
However, please note that this is a new feature released with HDP 2.6.1, so make sure to use the latest HDP version.
Thanks for this, @Namit Maheshwari. So does it mean that the data should first land in HDFS in some way, be it in Hive or as files on HDFS? What if I wanted to build a Solr index out of my files in S3?
Yes, you can create a Hive external table pointing to your S3 data location. But before this you will have to set these properties in Custom core-site.xml:
'fs.s3a.access.key' : AWS_ACCESS_KEY
'fs.s3a.secret.key' : AWS_SECRET
And the below properties in Custom hive-site.xml, then restart the affected services via Ambari:
'fs.s3a.awsAccessKeyId' : AWS_ACCESS_KEY
'fs.s3a.awsSecretAccessKey' : AWS_SECRET
'hive.exim.uri.scheme.whitelist' : 's3a,hdfs,pfile'
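To illustrate, once those properties are in place, an external table over an S3 path can be declared along these lines. This is only a sketch: the bucket name, prefix, and columns below are placeholders I am assuming, not details from this thread.

```sql
-- Sketch only: bucket, prefix, and schema are assumed placeholders.
CREATE EXTERNAL TABLE docs_s3 (
  doc_id   STRING,
  doc_text STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3a://my-bucket/extracted-text/';
```

Queries against such a table read directly from S3 via the s3a connector; nothing is copied into HDFS first, and dropping the external table does not delete the S3 data.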
Thanks, @ksuresh. However, what if my data in S3 is primarily unstructured? Let's say I have loads of PDFs and image files, and all I care about is extracting text from them and making search available on top of it...
My understanding is that the Hive metastore can only store structured content, and my idea was to use a Solr index instead, because my S3 data size is really massive: 300 million documents per day (let's say 5 MB average per file).
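For scale context, the numbers quoted above can be sanity-checked with a quick back-of-envelope sketch (decimal units assumed):

```python
# Back-of-envelope check of the data volume quoted above:
# 300 million documents/day at ~5 MB average each.
docs_per_day = 300_000_000
avg_doc_mb = 5

total_mb = docs_per_day * avg_doc_mb      # MB ingested per day
total_pb = total_mb / 1_000_000_000       # 1 PB = 10^9 MB (decimal)
print(total_pb)                           # roughly 1.5 PB/day of raw input
```

At roughly 1.5 PB of raw input per day, extracting text once at ingest and indexing only the extracted text into Solr, as proposed, avoids duplicating the raw files into HDFS.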