Created 03-06-2017 10:05 PM
I want to use Atlas traits and attributes to hold data quality metadata (counts and dates).
I have multiple Hive tables, and each day I run basic DQ scripts against each of them to count the anomalies found by different DQ checks (at either table or column level). I only need Atlas to hold the most recent date and count.
Example of the sort of DQ metadata I generate:
hive_table | hive_column | Load date | DQ check | DQ count |
table_1 | - | 2017-03-06 | Count number of records | 999 |
table_1 | column_1 | 2017-03-06 | Number of not nulls | 2 |
table_1 | column_2 | 2017-03-06 | Number of inconsistent dates | 0 |
table_2 | - | 2017-03-06 | Count number of records | 9999 |
table_2 | column_1 | 2017-03-06 | Number of not nulls | 232 |
table_2 | column_2 | 2017-03-06 | Number of inconsistent dates | 2 |
I have 2 questions.
1. What is the best way to structure the traits and attributes?
Traits:
Attributes:
If I update attribute values for a trait that is linked to 2 entities (hive_tables), can each entity's value be updated separately, or is the attribute value shared by every entity that carries the trait? If it is shared, I think I will need unique trait names.
2. How should I update the attribute values (the values are generated from HQL scripts)?
Here's an example of my traits and attributes (but not attribute values) for a DQ check for not nulls.
{
  "enumTypes":[],
  "structTypes":[],
  "traitTypes":[
    {
      "superTypes":[],
      "hierarchicalMetaTypeName":"org.apache.atlas.typesystem.types.TraitType",
      "typeName":"dq_monitor_not_null",
      "typeDescription":null,
      "attributeDefinitions":[
        {
          "name":"dq_monitor_load_date",
          "dataTypeName":"date",
          "multiplicity":"optional",
          "isComposite":false,
          "isUnique":false,
          "isIndexable":true,
          "reverseAttributeName":null
        },
        {
          "name":"dq_monitor_count",
          "dataTypeName":"int",
          "multiplicity":"optional",
          "isComposite":false,
          "isUnique":false,
          "isIndexable":true,
          "reverseAttributeName":null
        }
      ]
    }
  ],
  "classTypes":[]
}
Created 03-08-2017 08:50 PM
Each Atlas tag (trait) can have multiple attribute name/value pairs, and the values are stored per tagged entity rather than on the tag itself. If you had a tag with an attribute called owner, you could apply that tag to 2 Hive tables and then give each table a different owner value.
For example:
Tag1 --> Hive Table 1
Owner = user1
Tag1 --> Hive Table 2
Owner = user2
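Applied to the dq_monitor_not_null trait from the question, the same idea could look roughly like the Python sketch below. It only builds the request URL and JSON body for each entity and does not send anything. Treat the details as assumptions to check against your Atlas version: the host name is made up, the GUIDs are placeholders, the endpoint path follows the v2 REST layout (`/api/atlas/v2/entity/guid/{guid}/classifications`), and the date is passed as a plain string, which may not be the format your Atlas expects for a date attribute.

```python
import json

# Hypothetical Atlas host; the v2 path layout is an assumption to verify
# against the REST documentation for your Atlas version.
ATLAS_BASE = "http://atlas-host:21000/api/atlas/v2"


def classification_request(guid, trait_name, load_date, count):
    """Build (url, json_body) that sets one tag's attribute values on one entity."""
    url = f"{ATLAS_BASE}/entity/guid/{guid}/classifications"
    body = [
        {
            "typeName": trait_name,
            "attributes": {
                # Attribute names come from the trait definition in the question.
                "dq_monitor_load_date": load_date,
                "dq_monitor_count": count,
            },
        }
    ]
    return url, json.dumps(body)


# Same tag, two entities, two independent sets of values.
# The GUIDs are placeholders; the counts are the "not nulls" rows from the question.
url_1, body_1 = classification_request(
    "guid-of-entity-1", "dq_monitor_not_null", "2017-03-06", 2
)
url_2, body_2 = classification_request(
    "guid-of-entity-2", "dq_monitor_not_null", "2017-03-06", 232
)

print(url_1)
print(body_1)
print(url_2)
print(body_2)
```

A daily job could read the latest counts produced by the HQL scripts, build one request like this per entity and DQ check, and send it with any HTTP client. Because the values live on each entity's tag instance, one trait type per DQ check should be enough, with no need for a separate trait name per table.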
Is this what you are asking?
Hope this is helpful.