I'm trying to understand how Atlas works. I know that there is no hook for HDFS in Atlas (yet!?). I understand that all metadata is stored by the Atlas service in HBase and Solr. So if the HDFS hook is implemented, does that mean the metadata for every file stored in HDFS will be stored in HBase too, and not alongside the file in HDFS? If so, I fail to understand how this can scale: the HDFS Ranger plugin will need to retrieve metadata from (the remote service) Atlas for every file access!
I feel I'm missing something here... Could you please explain this use case to me?
Atlas uses TitanDB (JanusGraph in 1.0) as its underlying database. This supports many different back ends.
Table names: atlas_titan prior to v1, atlas_janus from v1 onward.
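For reference, the storage and index backends are selected in atlas-application.properties. A minimal sketch (the hostnames and ports below are placeholders, not defaults):

```properties
# Graph storage backend: HBase (used by JanusGraph in Atlas 1.0, Titan before that)
atlas.graph.storage.backend=hbase
atlas.graph.storage.hostname=zk-host1,zk-host2,zk-host3
# Table name: atlas_titan before v1, atlas_janus from v1 onward
atlas.graph.storage.hbase.table=atlas_janus

# Index backend: Solr, running in SolrCloud mode via ZooKeeper
atlas.graph.index.search.backend=solr
atlas.graph.index.search.solr.mode=cloud
atlas.graph.index.search.solr.zookeeper-url=zk-host1:2181,zk-host2:2181,zk-host3:2181
```

Because TitanDB/JanusGraph abstracts the storage layer, other back ends (e.g. Cassandra, BerkeleyDB) can be swapped in here, though HBase + Solr is the usual pairing in a Hadoop cluster.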
You are right in noting that, in a Hadoop ecosystem, HBase is used for data storage and Solr for index storage.
An HDFS hook is tricky for many reasons, most importantly the volume of metadata that an HDFS hook could potentially generate.
Note that Atlas does not store the data itself. In the case of an HDFS hook, only the meta information would be stored: directory/file name, size, creation date, and so on. Please take a look at the models defined here.
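To make "metadata only" concrete, here is a sketch of the kind of record Atlas would hold for a single HDFS file, as a plain Python dict shaped loosely after Atlas's hdfs_path entity type. The attribute names follow the Atlas Hadoop model, but treat the exact attribute set as an assumption, not the definitive schema:

```python
# Sketch of the metadata Atlas would hold for one HDFS file: attributes
# *about* the file (path, size, timestamps), never the file's contents.
# Attribute names loosely follow Atlas's hdfs_path type (an assumption).

def make_hdfs_path_entity(path, cluster, size_bytes, create_time_ms):
    """Build a metadata-only entity for an HDFS file or directory."""
    return {
        "typeName": "hdfs_path",
        "attributes": {
            # qualifiedName uniquely identifies the entity within a cluster
            "qualifiedName": f"{path}@{cluster}",
            # name is just the last path component
            "name": path.rstrip("/").rsplit("/", 1)[-1],
            "path": path,
            "clusterName": cluster,
            "fileSize": size_bytes,
            "createTime": create_time_ms,
        },
    }

entity = make_hdfs_path_entity(
    "/data/logs/2018/app.log", "prodCluster", 10_485_760, 1514764800000
)
print(entity["attributes"]["qualifiedName"])  # /data/logs/2018/app.log@prodCluster
print(entity["attributes"]["name"])           # app.log
```

A record like this is tiny compared to the file it describes, but with millions of files even these small entities add up, which is exactly the volume concern mentioned above.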
Scalability is indeed something that needs to be addressed before this can be usable. I have a few ideas on this, but no concrete implementation yet.