Apache Atlas - Unable to collect HDFS metadata

Explorer

I want to collect metadata from HDFS. When I searched, it looked like there is no Atlas hook for HDFS, unlike Kafka, Hive, and other data sources. Can I get the installation steps for an HDFS hook, or the steps to create a custom hook?

4 REPLIES

Contributor

Hello @Nigal,

Yes, that's right. Atlas does not ship with a predefined 'HDFS hook'.

Atlas mainly collects metadata from Hive, Spark, HBase, and Impala.

An hdfs_path entity is synced only if it belongs to a Hive table's lineage (as explained in https://issues.apache.org/jira/browse/ATLAS-599). By default, Atlas won't fetch HDFS paths.

Unlike Hive entities, standalone HDFS entities are created manually in Atlas, using the Create Entity link in the Atlas Web UI.
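
If you would rather script this than click through the UI, the same kind of entity can probably be registered through the Atlas v2 REST endpoint (POST /api/atlas/v2/entity). The snippet below is only a rough sketch. The helper names, the host, port, and credentials are placeholders I made up, and the qualifiedName format (full HDFS URI followed by @cluster) is an assumption, so check it against the hdfs_path entities your Hive hook has already created before relying on it.

```python
import base64
import json
import urllib.request


def build_hdfs_path_entity(path, cluster_name, namenode_uri):
    """Build the JSON body for one hdfs_path entity (hypothetical helper).

    The qualifiedName layout used here is an assumption; compare it with
    an existing hdfs_path entity in your own Atlas instance.
    """
    full_uri = namenode_uri.rstrip("/") + path
    return {
        "entity": {
            "typeName": "hdfs_path",
            "attributes": {
                "qualifiedName": f"{full_uri}@{cluster_name}",
                "name": path,
                "path": full_uri,
                "clusterName": cluster_name,
            },
        }
    }


def post_entity(atlas_url, user, password, payload):
    """POST the payload to the Atlas v2 entity endpoint and return the parsed reply."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        atlas_url.rstrip("/") + "/api/atlas/v2/entity",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, build_hdfs_path_entity("/user/cloudera/text", "cl1", "hdfs://namenode:8020") produces the body, and post_entity("http://atlas-host:21000", "admin", "admin", body) would send it. All of those values are placeholders for your own environment.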
 
Please check out the list of available 'hooks' in Atlas:
 
Here's a document on creating hdfs_path manually in Atlas:

Explorer

Thanks for the solution. But I didn't clearly understand this point: "hdfs_path is synced only if this belongs to a Hive table's lineage". What I understood is that, since Hive runs on top of HDFS, creating Hive lineage will make the lineage show the HDFS path of the Hive warehouse directory. Is that correct?

Contributor (accepted solution)

@Nigal,

Currently, when you create a Hive, Sqoop, Falcon, or Storm entity that is associated with an HDFS path, that path shows up in Atlas.
Otherwise, files and folders created in HDFS do not show up in Atlas.

For example, when you create a directory directly in HDFS, Atlas doesn't ingest it.
But when you create a Hive table like:
"CREATE EXTERNAL TABLE test_table (id int, value string) LOCATION '/user/cloudera/text'"
Atlas creates a lineage graph that shows the relationship between the Hive table and the HDFS path.

You can see the HDFS directories by searching for "hdfs_path" and the Hive tables by searching for "hive_table".
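
The same two searches, and the lineage lookup, should also be reachable over the Atlas v2 REST API (search/basic and lineage/{guid}). Here is a small sketch that only builds the request URLs. The function names and the host are placeholders of mine, and the limit, direction, and depth defaults are just illustrative values.

```python
from urllib.parse import quote, urlencode


def basic_search_url(atlas_url, type_name, limit=25):
    """URL for a basic search restricted to one entity type, e.g. hdfs_path."""
    query = urlencode({"typeName": type_name, "limit": limit})
    return f"{atlas_url.rstrip('/')}/api/atlas/v2/search/basic?{query}"


def lineage_url(atlas_url, guid, direction="BOTH", depth=3):
    """URL for the lineage graph of one entity, identified by its GUID."""
    query = urlencode({"direction": direction, "depth": depth})
    return f"{atlas_url.rstrip('/')}/api/atlas/v2/lineage/{quote(guid, safe='')}?{query}"


if __name__ == "__main__":
    # Placeholder host; replace with your own Atlas server.
    base = "http://atlas-host:21000"
    print(basic_search_url(base, "hdfs_path"))
    print(basic_search_url(base, "hive_table"))
```

Send a GET to either URL with your Atlas credentials. The GUID for lineage_url comes from the entities returned by the search.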

Explorer

Hi @pkr, thanks for the solution. Much appreciated.