We are planning to improve and standardize our connectivity to Hive on both the ingest and the export side. The attributes we need to address are performance, parallelism, scalability, long-term maintainability, and support boundaries.
Currently, we see the following approaches:
a) Get the file path information from the Metastore via the Thrift client, then access the files directly on HDFS
b) Use a JDBC driver
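To make approach (b) concrete, here is a minimal sketch of what we mean, using only the standard java.sql API. It assumes the Hive JDBC driver is on the classpath and that HiveServer2 listens on its default port 10000. The host name, the class name HiveJdbcSketch, and the helper names jdbcUrl and countRows are placeholders for illustration, not part of any Hive API. Approach (a) would additionally need the Hive Metastore and Hadoop client libraries and is not shown.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class HiveJdbcSketch {

    // Builds a HiveServer2 JDBC URL of the form jdbc:hive2://host:port/database.
    // If kerberosPrincipal is non-null it is appended as ";principal=...",
    // which is how the Hive JDBC driver is told to use Kerberos authentication.
    static String jdbcUrl(String host, int port, String database, String kerberosPrincipal) {
        String url = "jdbc:hive2://" + host + ":" + port + "/" + database;
        return kerberosPrincipal == null ? url : url + ";principal=" + kerberosPrincipal;
    }

    // Runs a query through HiveServer2 and counts the returned rows.
    // Needs the Hive JDBC driver on the classpath and a reachable server.
    static int countRows(String url, String sql) throws SQLException {
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            int rows = 0;
            while (rs.next()) {
                rows++;
            }
            return rows;
        }
    }

    public static void main(String[] args) throws SQLException {
        // Placeholder host; replace with a real HiveServer2 endpoint.
        String url = jdbcUrl("hs2.example.com", 10000, "default", null);
        System.out.println(url);
        // Only contact the server when a query is passed on the command line.
        if (args.length > 0) {
            System.out.println(countRows(url, args[0]));
        }
    }
}
```

With this approach all reads and writes go through HiveServer2, so Hive's own authentication and authorization apply, whereas approach (a) bypasses HiveServer2 for the data path and relies on HDFS permissions.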
We see different pros and cons for each approach, and we would like to know which one you prefer and recommend for Hive connectivity.
Is there an official recommendation, or are there any best practices, for this?
We are looking for a sustainable approach that works well with Hive 2.1 now and with Hive 3 later, and that uses Hive's built-in security mechanisms.