Member since: 03-01-2017
Posts: 62
Kudos Received: 7
Solutions: 1

My Accepted Solutions
Title | Views | Posted |
---|---|---|
| 3511 | 02-07-2019 02:28 PM |
05-16-2024
05:48 AM
1 Kudo
Because I ran into this thread while looking for a way to solve this error, and because we found a solution, I thought it might still help some people if I share what we found. We needed HWC to profile Hive managed + transactional tables from Ataccama (a data quality solution), and we found someone who had successfully got spark-submit working. We checked their settings and changed our spark-submit as follows:

COMMAND="$SPARK_HOME/bin/$SPARK_SUBMIT \
  --files $MYDIR/$LOG4J_FILE_NAME $SPARK_DRIVER_JAVA_OPTS $SPARK_DRIVER_OPTS \
  --jars {{ hwc_jar_path }} \
  --conf spark.security.credentials.hiveserver2.enabled=false \
  --conf "spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@{{ ad_realm }}" \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.hadoop.metastore.catalog.default=hive \
  --conf spark.yarn.maxAppAttempts=1 \
  --conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED \
  --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED \
  --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED \
  --conf spark.sql.legacy.timeParserPolicy=LEGACY \
  --conf spark.sql.legacy.typeCoercion.datetimeToString.enabled=true \
  --conf spark.sql.parquet.int96TimestampConversion=true \
  --conf spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions \
  --conf spark.sql.extensions=com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension \
  --conf spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator \
  --conf spark.sql.sources.commitProtocolClass=org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol \
  --conf spark.datasource.hive.warehouse.read.mode=DIRECT_READER_V2 \
  --class $CLASS $JARS $MYLIB $PROPF $LAUNCH $*";
exec $COMMAND

Probably the difference was the spark.hadoop.metastore.catalog.default=hive setting. The example above contains some Ansible variables: hwc_jar_path is "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/jars/hive-warehouse-connector-assembly-1.0.0.7.1.7.1000-141.jar" and ad_realm is our LDAP realm. Hope it helps someone.
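As an aside, here is a minimal, untested PySpark sketch of the kind of read this configuration enables. The table name is a placeholder, and it assumes that in DIRECT_READER_V2 mode the managed/ACID tables become readable through plain Spark SQL once the HWC jar, extensions and Kryo registrator above are on the classpath:

```python
# Minimal sketch (assumptions: the HWC jar and the --conf settings above are
# passed to spark-submit; mytestdb.some_managed_table is a placeholder name).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hwc-direct-reader-check")
    .enableHiveSupport()
    .getOrCreate()
)

# With spark.datasource.hive.warehouse.read.mode=DIRECT_READER_V2 a managed,
# transactional Hive table should be readable with plain Spark SQL.
df = spark.sql("SELECT * FROM mytestdb.some_managed_table LIMIT 10")
df.show()
```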
02-07-2019
02:28 PM
1 Kudo
This is now the winning REST API query:

curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_table+where+qualifiedName%3D%27testdb.mytable.id@CLUSTERNAME%27"

It gives a list of all columns for a table, including deleted ones. In my Python code I pick the column with ACTIVE state.
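A simplified sketch of that Python step, assuming the requests library; the host, credentials and qualifiedName are placeholders, and the field names follow the DSL output shown elsewhere in this thread, so they may need adjusting for your Atlas version:

```python
# Sketch: run the winning DSL query and keep only column references whose
# state is ACTIVE. Host, account, password and qualifiedName are placeholders.
import requests

ATLAS_URL = "https://atlasnode.domain.com:21000"
QUERY = "hive_table where qualifiedName='testdb.mytable.id@CLUSTERNAME'"

resp = requests.get(
    f"{ATLAS_URL}/api/atlas/discovery/search/dsl",
    params={"query": QUERY},           # requests URL-encodes the quotes and '@'
    auth=("myaccount", "mypassword"),  # placeholder credentials
    verify=False,                      # or the path to your CA bundle
)
resp.raise_for_status()

for table in resp.json().get("results", []):
    for col in table.get("columns", []):
        if col.get("state") == "ACTIVE":
            print(col.get("id"), col.get("state"))
```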
02-06-2019
03:55 PM
I've spent another half day at it. I've come to the conclusion that the best I can get is a list of all columns in a table, including ones with a DELETED state.
01-31-2019
05:56 PM
import-hive.sh was indeed the solution. We only got it working once the admin account had read/execute access in Ranger to the protected data in HDFS. Now all Hive tables have been discovered.
01-29-2019
04:08 PM
1 Kudo
I want to be able to tag a list of Hive columns in Atlas (HDP 2.6.5) from a Python script. For this I need to find the guid of the Hive column I'm targeting, but I can't seem to tame the Atlas REST API to do that. Getting all the Hive columns in the data lake is easy enough:

curl -u myaccount -i -H "Content-Type: application/json" -X GET 'https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_column+where+__state=%27ACTIVE%27+select+qualifiedName,name,__guid'

That works (the number of Hive columns is limited to 100 by default):

{
"requestId": "pool-2-thread-6 - 7e8ec2ee-a94b-4e37-b851-57b1b3df6e29",
"query": "hive_column where __state='ACTIVE' select qualifiedName,name,__guid",
"queryType": "dsl",
"count": 100,
"results": [
{
"$typeName$": "__tempQueryResultStruct5",
"qualifiedName": "mytestdb.employee.name@CLUSTERNAME",
"__guid": "809c84d5-e065-45f4-81c8-fcdb7cf81560",
"name": "name"
},
{
"$typeName$": "__tempQueryResultStruct5",
"qualifiedName": "mytestdb.othertest.id@CLUSTERNAME",
"__guid": "04cd03fb-31d3-43e0-a75a-f35f2a60ad22",
"name": "id"
},
etcetera..
}
}

Now let's pick just one hive_column:

curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_column+where+name='id'"
Which results in a 500 server error. Really? Okay, let's try selecting on the qualifiedName of the column:

curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_column+where+qualifiedName='mytestdb.othertest.id@CLUSTERNAME'"

Don't think so: HTTP ERROR 500. Okay, wise guy. How about I query the table, but then really ask for the qualifiedName of the column?

curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_table+where+qualifiedName='mytestdb.othertest.id@CLUSTERNAME'"

That works, but I don't get any data on the column. Okay, let's have a look at the fields in the output. Let's ask for all the Hive columns again, but don't put in the select:

curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_column&__state=%27ACTIVE%27"

I get results, but mixed in there are DELETED states. For example:

{
"$typeName$": "hive_column",
"$id$": {
"id": "b3cc1db6-9c95-499c-9f5f-c457a4e4add6",
"$typeName$": "hive_column",
"version": 0,
"state": "DELETED"
},
"$systemAttributes$": {
"createdBy": "myaccount",
"modifiedBy": "myaccount",
"createdTime": "Wed Jun 27 15:07:24 CEST 2018",
"modifiedTime": "Mon Aug 06 16:30:30 CEST 2018"
},
"comment": null,
"qualifiedName": "mytestdb.othertest.id@CLUSTERNAME",
"type": "string",
"position": 30,
"owner": "myaccount",
"description": null,
"name": "id",
"table": {
"id": "161b4ef8-d5de-4e38-a76d-c030a71b730c",
"$typeName$": "hive_table",
"version": 0,
"state": "DELETED"
}
},

Okay, let's remove DELETED states in all the fields:

curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_column&results.table.state=%27ACTIVE%27&__state=%27ACTIVE%27&__id.state=%27ACTIVE%27"

Nice try, but there are the DELETED states again. I feel like I'm playing a chess match against the Atlas REST API and I'm not winning. Is there any way to query the Atlas REST API to get just the qualifiedName and guid of one Hive column?
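For reference, a rough sketch of how these DSL calls can be scripted from Python with the requests library; the host, account and password are placeholders:

```python
# Sketch: small helper for Atlas DSL queries. requests URL-encodes the quotes
# and '@' in the query string. Host and credentials are placeholders.
import requests

ATLAS_URL = "https://atlasnode.domain.com:21000"
AUTH = ("myaccount", "mypassword")  # placeholder credentials

def dsl_search(query):
    """Run an Atlas DSL query and return the parsed JSON response."""
    resp = requests.get(
        f"{ATLAS_URL}/api/atlas/discovery/search/dsl",
        params={"query": query},
        auth=AUTH,
        verify=False,  # or the path to your CA bundle
    )
    resp.raise_for_status()
    return resp.json()

# The query that does work: all ACTIVE columns with qualifiedName, name and guid.
data = dsl_search("hive_column where __state='ACTIVE' select qualifiedName,name,__guid")
for row in data.get("results", []):
    print(row.get("qualifiedName"), row.get("__guid"))
```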
Labels:
- Apache Atlas
01-29-2019
08:21 AM
Thanks for the responses, everyone. I saw that I need the Atlas admin account to run this script. I don't have it, but I've asked the people who do (a company that maintains the Hadoop cluster for us) to run it for me. While we're waiting, it might be interesting to share the one thing that I did get working. I put this in the JSON:

{
"entity": {
"typeName": "hive_table",
"attributes": {
"description": null,
"name": "tablename",
"owner": "degierf",
"qualifiedName": "test.tablename@clustername",
"replicatedFrom": [],
"replicatedTo": [],
"aliases": [],
"columns": [],
"comment": null,
"createTime": 1494949241,
"db": {
"guid": "557c073c-da51-461c-8bba-3594e004db63",
"typeName": "hive_db"
},
"lastAccessTime": 1494949241,
"partitionKeys": [],
"retention": null,
"sd": null,
"tableType": null,
"temporary": null,
"viewExpandedText": null,
"viewOriginalText": null
},
"guid": -1
},
"referredEntities": {}
}

Then I ran it with the entity endpoint:

curl -u myaccount -i -H "Content-Type: application/json" -X POST -d @upload_hive_table_def.json https://atlasnode:21000/api/atlas/v2/entity

That did something, but it wasn't the desired result: I managed to get Atlas to recognise that there's a Hive table, just not the columns, references, etc. I was hoping to add those one by one afterwards, but that didn't work.
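For anyone scripting this instead of using curl, a minimal sketch of the same POST from Python (requests library; host and credentials are placeholders):

```python
# Sketch: load the entity JSON from the file and POST it to the v2 entity
# endpoint. Host and credentials are placeholders.
import json
import requests

ATLAS_URL = "https://atlasnode:21000"
AUTH = ("myaccount", "mypassword")  # placeholder credentials

with open("upload_hive_table_def.json") as f:
    payload = json.load(f)

resp = requests.post(
    f"{ATLAS_URL}/api/atlas/v2/entity",
    json=payload,
    auth=AUTH,
    verify=False,  # or the path to your CA bundle
)
print(resp.status_code)
print(json.dumps(resp.json(), indent=2))
```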
01-28-2019
04:01 PM
Thanks, @Sandeep Nemuri. I'll try that.
01-28-2019
03:29 PM
I've been at this for a day and a half now, approaching it from all kinds of directions (writing the JSON from scratch, modifying JSON exported from another table, trying the old REST API version). I've been searching the web for any solution. I'm about to give up. Does anyone have other ideas?
01-28-2019
08:27 AM
So I've read somewhere (I can't seem to find the link anymore) that if you put a URL in an attribute of type string, Atlas recognises it and turns it into a hyperlink. As far as I was able to understand the code in the patch, it seems to work that way for strings starting with ftp://, http:// and https://. I've tried creating such an attribute, but unfortunately the link is unclickable, so it does not seem to work. Is my interpretation wrong? Should it work some other way?
01-24-2019
04:02 PM
Is there a way to let Atlas discover existing Hive tables (in HDP 2.6.5)? I've got a couple of Hive external tables that existed before Atlas was enabled. Atlas didn't pick them up, though it does find new tables. I'm trying to get one Hive table into Atlas manually, but it's not exactly easy to do that for the table, its columns, storagedesc and references. There are very few (simple) examples out there, and I'm getting one "ObjectId is not valid AtlasObjectId" error after the other.
Labels:
- Apache Atlas
- Apache Hive