Member since: 03-01-2017
Posts: 62
Kudos Received: 7
Solutions: 1

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3570 | 02-07-2019 02:28 PM
05-16-2024
05:48 AM
1 Kudo
I ran into this thread while looking for a way to solve this error, and since we found a solution I thought it might still serve some people if I share it. We needed HWC to profile Hive managed + transactional tables from Ataccama (a data quality solution), and we found someone who had successfully got spark-submit working. We checked their settings and changed our spark-submit as follows:

COMMAND="$SPARK_HOME/bin/$SPARK_SUBMIT \
  --files $MYDIR/$LOG4J_FILE_NAME $SPARK_DRIVER_JAVA_OPTS $SPARK_DRIVER_OPTS \
  --jars {{ hwc_jar_path }} \
  --conf spark.security.credentials.hiveserver2.enabled=false \
  --conf "spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@{{ ad_realm }}" \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.hadoop.metastore.catalog.default=hive \
  --conf spark.yarn.maxAppAttempts=1 \
  --conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED \
  --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED \
  --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED \
  --conf spark.sql.legacy.timeParserPolicy=LEGACY \
  --conf spark.sql.legacy.typeCoercion.datetimeToString.enabled=true \
  --conf spark.sql.parquet.int96TimestampConversion=true \
  --conf spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions \
  --conf spark.sql.extensions=com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension \
  --conf spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator \
  --conf spark.sql.sources.commitProtocolClass=org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol \
  --conf spark.datasource.hive.warehouse.read.mode=DIRECT_READER_V2 \
  --class $CLASS $JARS $MYLIB $PROPF $LAUNCH $*";
exec $COMMAND

The difference was probably the spark.hadoop.metastore.catalog.default=hive setting. The example above contains some Ansible variables: hwc_jar_path is "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/jars/hive-warehouse-connector-assembly-1.0.0.7.1.7.1000-141.jar", and ad_realm is our LDAP realm. Hope it helps someone.
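For reference, once a session is launched with these settings (HiveAcidAutoConvertExtension plus DIRECT_READER_V2), the managed/transactional tables can be read with plain Spark SQL. A minimal PySpark sketch, assuming the confs above are passed on spark-submit; the database and table names are just placeholders:

from pyspark.sql import SparkSession

# The session picks up the --conf values supplied on spark-submit.
spark = SparkSession.builder.appName("hwc-direct-reader-check").getOrCreate()

# With the ACID extension and DIRECT_READER_V2, Hive managed/transactional
# tables are readable through ordinary Spark SQL against the Hive catalog.
df = spark.sql("SELECT * FROM mytestdb.employee LIMIT 10")  # placeholder table
df.show()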
02-07-2019
02:28 PM
1 Kudo
This is now the winning REST API query:

curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_table+where+qualifiedName%3D%27testdb.mytable.id@CLUSTERNAME%27"

It gives a list of all columns for a table, including deleted ones. In my Python code I then pick the column with the ACTIVE state.
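A minimal sketch of that Python side, assuming the requests library and the same account and endpoint as the curl call above (the password and the exact nesting of the per-column state in the response are assumptions; adjust them to what your Atlas version returns):

import requests

ATLAS = "https://atlasnode.domain.com:21000"
DSL = "hive_table where qualifiedName='testdb.mytable.id@CLUSTERNAME'"

resp = requests.get(
    f"{ATLAS}/api/atlas/discovery/search/dsl",
    params={"query": DSL},             # requests URL-encodes the quotes for us
    auth=("myaccount", "mypassword"),  # placeholder password
)
resp.raise_for_status()

# The result lists every column of the table, DELETED ones included;
# keep only the ACTIVE ones.
for entity in resp.json().get("results", []):
    for col in entity.get("columns", []):
        # a column reference carries its state either at the top level or under "$id$"
        state = col.get("state") or col.get("$id$", {}).get("state")
        if state == "ACTIVE":
            print(col)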
02-06-2019
03:55 PM
I've spent another half day on it. I've come to the conclusion that the best I can get is a list of all the columns in a table, including ones with a DELETED state.
01-31-2019
05:56 PM
import-hive.sh was indeed the solution. We only got it working once the admin account had read/execute access in Ranger to the protected data in HDFS. Now all Hive tables have been discovered.
01-29-2019
04:08 PM
1 Kudo
I want to be able to tag a list of Hive columns in Atlas (HDP 2.6.5) from a Python script. For this I need to find the guid of the Hive column I'm targeting, but I can't seem to tame the Atlas REST API into giving it to me. Getting all the Hive columns in the data lake is easy enough:

curl -u myaccount -i -H "Content-Type: application/json" -X GET 'https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_column+where+__state=%27ACTIVE%27+select+qualifiedName,name,__guid'

That works (the number of returned Hive columns is limited to 100 by default):

{
"requestId": "pool-2-thread-6 - 7e8ec2ee-a94b-4e37-b851-57b1b3df6e29",
"query": "hive_column where __state='ACTIVE' select qualifiedName,name,__guid",
"queryType": "dsl",
"count": 100,
"results": [
{
"$typeName$": "__tempQueryResultStruct5",
"qualifiedName": "mytestdb.employee.name@CLUSTERNAME",
"__guid": "809c84d5-e065-45f4-81c8-fcdb7cf81560",
"name": "name"
},
{
"$typeName$": "__tempQueryResultStruct5",
"qualifiedName": "mytestdb.othertest.id@CLUSTERNAME",
"__guid": "04cd03fb-31d3-43e0-a75a-f35f2a60ad22",
"name": "id"
},
etcetera..
}
}

Now let's pick just one hive_column:

curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_column+where+name='id'"
Which results in a 500 server error. Really? Okay, let's try selecting on the qualifiedName of the column:

curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_column+where+qualifiedName='mytestdb.othertest.id@CLUSTERNAME'"

Don't think so: HTTP ERROR 500. Okay, wise guy. How about I query the table, but then really ask for the qualifiedName of the column?

curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_table+where+qualifiedName='mytestdb.othertest.id@CLUSTERNAME'"

That works, but I don't get any data on the column. Okay, let's have a look at the fields in the output and ask for all the Hive columns again, but without the select:

curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_column&__state=%27ACTIVE%27"

I get results, but mixed in there are DELETED states. For example:

{
"$typeName$": "hive_column",
"$id$": {
"id": "b3cc1db6-9c95-499c-9f5f-c457a4e4add6",
"$typeName$": "hive_column",
"version": 0,
"state": "DELETED"
},
"$systemAttributes$": {
"createdBy": "myaccount",
"modifiedBy": "myaccount",
"createdTime": "Wed Jun 27 15:07:24 CEST 2018",
"modifiedTime": "Mon Aug 06 16:30:30 CEST 2018"
},
"comment": null,
"qualifiedName": "mytestdb.othertest.id@CLUSTERNAME",
"type": "string",
"position": 30,
"owner": "myaccount",
"description": null,
"name": "id",
"table": {
"id": "161b4ef8-d5de-4e38-a76d-c030a71b730c",
"$typeName$": "hive_table",
"version": 0,
"state": "DELETED"
}
},

Okay, let's exclude DELETED states on all the fields:

curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_column&results.table.state=%27ACTIVE%27&__state=%27ACTIVE%27&__id.state=%27ACTIVE%27"

Nice try, but there are the DELETED states again. I feel like I'm playing a chess match against the Atlas REST API and I'm not winning. Is there any way to query the Atlas REST API to get just the qualifiedName and guid of one Hive column?
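For completeness: when issuing these calls from a Python script, the query string doesn't have to be escaped by hand. A minimal standard-library sketch, with the DSL text taken from the first call above; the escaping it produces matches the +/%27 sequences in the curl examples:

import urllib.parse

base = "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl"
dsl = "hive_column where __state='ACTIVE' select qualifiedName,name,__guid"

# quote_plus() yields '+' for spaces and '%27' for single quotes, as in the
# curl calls above (it also percent-encodes '=' and ',', which standard URL
# decoding on the server treats the same).
url = base + "?query=" + urllib.parse.quote_plus(dsl)
print(url)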
Labels:
- Apache Atlas
01-29-2019
08:21 AM
Thanks for the responses, everyone. I saw that I need the Atlas admin account to run this script. I don't have it, but I've asked the people who do (the company that maintains the Hadoop cluster for us) to run it for me. While we're waiting, it might be interesting to share the one thing that I did get working. I put this in the JSON:

{
"entity": {
"typeName": "hive_table",
"attributes": {
"description": null,
"name": "tablename",
"owner": "degierf",
"qualifiedName": "test.tablename@clustername",
"replicatedFrom": [],
"replicatedTo": [],
"aliases": [],
"columns": [],
"comment": null,
"createTime": 1494949241,
"db": {
"guid": "557c073c-da51-461c-8bba-3594e004db63",
"typeName": "hive_db"
},
"lastAccessTime": 1494949241,
"partitionKeys": [],
"retention": null,
"sd": null,
"tableType": null,
"temporary": null,
"viewExpandedText": null,
"viewOriginalText": null
},
"guid": -1
},
"referredEntities": {}
}

Then I ran it against the entity endpoint:

curl -u myaccount -i -H "Content-Type: application/json" -X POST -d @upload_hive_table_def.json https://atlasnode:21000/api/atlas/v2/entity

It did do something, but it wasn't the desired result: I managed to get Atlas to recognise that there's a Hive table, just not the columns, references, etc. I was hoping to add them one by one after this, but that didn't work.
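The same POST can also be done from the Python script; a minimal sketch assuming the requests library, with the JSON above saved as upload_hive_table_def.json and a placeholder password:

import json
import requests

with open("upload_hive_table_def.json") as f:
    entity = json.load(f)

resp = requests.post(
    "https://atlasnode:21000/api/atlas/v2/entity",
    json=entity,                       # sent as Content-Type: application/json
    auth=("myaccount", "mypassword"),  # placeholder password
)
print(resp.status_code, resp.text)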
01-28-2019
04:01 PM
Thanks, @Sandeep Nemuri. I'll try that.
01-28-2019
03:29 PM
I've been at this for a day and a half now, approaching it from all kinds of directions (writing the JSON from scratch, modifying JSON exported from another table, trying the old REST API version) and searching the web for any solution. I'm about to give up. Does anyone else have ideas?
01-28-2019
08:27 AM
I've read somewhere (can't seem to find the link anymore) that if you put a URL in an attribute of type string, Atlas recognises it and turns it into a hyperlink. As far as I was able to understand the code in the patch, it seems to work that way for strings starting with ftp://, http:// and https://. I've tried creating such an attribute, but unfortunately the link is unclickable, so it does not seem to work. Is my interpretation wrong? Should it work some other way?
01-24-2019
04:02 PM
Is there a way to let Atlas discover existing Hive tables (in HDP 2.6.5)? I've got a couple of Hive external tables that existed before Atlas was enabled. It didn't pick them up, although Atlas does find new tables. I'm trying to get one Hive table into Atlas, but it's not exactly easy to do that for tables, columns, storagedesc and references. There are very few (simple) examples out there, and I'm getting one "ObjectId is not valid AtlasObjectId" error after another.
Labels:
- Apache Atlas
- Apache Hive