About dbompart

marcel-jan · ‎05-16-2024

Because I ran into this thread when looking how to solve this error and because we found a solution, I thought it might still serve some people if I share what solution we found. We needed HWC to profile Hive managed + transactional tables from Ataccama (data quality solution). And we found someone who successfully got spark-submit working. We checked their settings and changed the spark-submit as follows: COMMAND="$SPARK_HOME/bin/$SPARK_SUBMIT \ --files $MYDIR/$LOG4J_FILE_NAME $SPARK_DRIVER_JAVA_OPTS $SPARK_DRIVER_OPTS \ --jars {{ hwc_jar_path }} \ --conf spark.security.credentials.hiveserver2.enabled=false \ --conf "spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@{{ ad_realm }}" \ --conf spark.dynamicAllocation.enable=false \ --conf spark.hadoop.metastore.catalog.default=hive \ --conf spark.yarn.maxAppAttempts=1 \ --conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED \ --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED \ --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED \ --conf spark.sql.legacy.timeParserPolicy=LEGACY \ --conf spark.sql.legacy.typeCoercion.datetimeToString.enabled=true \ --conf spark.sql.parquet.int96TimestampConversion=true \ --conf spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions \ --conf spark.sql.extensions=com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension \ --conf spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator \ --conf spark.sql.sources.commitProtocolClass=org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol \ --conf spark.datasource.hive.warehouse.read.mode=DIRECT_READER_V2 \ --class $CLASS $JARS $MYLIB $PROPF $LAUNCH $*"; exec $COMMAND Probably the difference was in the spark.hadoop.metastore.catalog.default=hive setting. In the above example are some Ansible variables: hwc_jar_path: "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/jars/hive-warehouse-connector-assembly-1.0.0.7.1.7.1000-141.jar" ad_realm is our LDAP realm. Hope it helps anyone.

DianaTorres · ‎07-14-2022

@data_diver Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. Thanks

Gayatree · ‎07-15-2021

@renzhongpei for cluster mode it can be put on hdfs location as well. And can be referenced from there in files argument of spark-submit script. --files hdfs://namenode:8020/log4j-driver.properties#log4j-driver.properties

Sean464 · ‎07-01-2020

Great article! I faced the following error while trying adding data to ldap (Step 13.) # ldapadd -x -W -D "cn=Manager,dc=example,dc=com" -f /root/ldap/base.ldif Enter LDAP Password: adding new entry "dc=example,dc=com" ldap_add: Invalid syntax (21) additional info: objectClass: value #1 invalid per syntax After some research, found that we need to add the cosine and nis LDAP schemas before running the preceding command. # ldapadd -Y EXTERNAL -H ldapi:/// -f /etc/openldap/schema/cosine.ldif # ldapadd -Y EXTERNAL -H ldapi:/// -f /etc/openldap/schema/nis.ldif # ldapadd -Y EXTERNAL -H ldapi:/// -f /etc/openldap/schema/inetorgperson.ldif

dbompart · ‎03-15-2020

The purpose of this repo is to provide quick examples and utilities to work on a Spark and Hive integration on HDP 3.1.4 Prerequisites: HiveServer2Interactive (LLAP) must be installed, up and running Bash and Python interpreter must be available Ideally, for connections using HTTP transport protocol, in the Ambari -> Hive -> Configs, hive.server2.thrift.http.cookie.auth.enabled should be set to true Set hadoop.proxyuser.hive.hosts=* in the Ambari -> HDFS -> Configs -> Custom core-site section (core-site.xml, or mention at least the LLAP Hosts, HS2 Hosts and Spark client hosts separated by commas). Once the LLAP service is up and running, the next step for this setup requires the following properties to be configured in the Spark client: spark.datasource.hive.warehouse.load.staging.dir= spark.datasource.hive.warehouse.metastoreUri= spark.hadoop.hive.llap.daemon.service.hosts= spark.jars= spark.security.credentials.hiveserver2.enabled= spark.sql.hive.hiveserver2.jdbc.url= spark.sql.hive.hiveserver2.jdbc.url.principal= spark.submit.pyFiles= spark.hadoop.hive.zookeeper.quorum= While the above information can be manually collected as explained in our Cloudera official documentation, the following steps will help in collecting the standard information and avoid making mistakes during the copy and paste process of parameter values between HS2I and Spark. Steps: --Notice the connection is to the LLAP Host-- ssh root@my-llap-host cd /tmp wget https://raw.githubusercontent.com/dbompart/hive_warehouse_connector/master/hwc_info_collect.sh chmod +x hwc_info_collect.sh ./hwc_info_collect.sh The above script will immediately provide enough information with regard to the following: The information to enable the Spark and Hive integration (HWConnector) A working spark-shell command to test initial connectivity A short how-to list all Databases in Hive, in scala. Done !!! LDAP/AD Authentication In an LDAP enabled authentication setup, the username and password will be passed in plaintext. The recommendation is to Kerberize the cluster. Otherwise, expect to see the username and password exposed in clear text amongst the logs. To provide username and password, we will have to specify them as part of the JDBC URL string in the following format: jdbc:hive2://zk1:2181,zk2:2181,zk3:2181/;user=myusername;password=mypassword;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive Note: We will need to URL encode the password if it has a special character. For example, user=hr1;password=BadPass#1 will translate to user=hr1;password=BadPass%231 This method is not fully supported for Spark HWC integration nor it is recommended. Kerberos Authentication For Kerberos authentication, the following pre-conditions have to be met: Initial "kinit" has to always be executed and validated. $ kinit [email protected] Password for [email protected]: $ klist -v Credentials cache: API:501:9 Principal: [email protected] Cache version: 0 Server: krbtgt/[email protected] Client: [email protected] Ticket etype: aes128-cts-hmac-sha1-96 Ticket length: 256 Auth time: Feb 11 16:11:36 2013 End time: Feb 12 02:11:22 2013 Renew till: Feb 18 16:11:36 2013 Ticket flags: pre-authent, initial, renewable, forwardable Addresses: addressless For YARN, HDFS, Hive and HBase long-running jobs, DelegationTokens have to be fetched. Hence, provide "--keytab" and "--principal" extra arguments, i.e.: spark-submit $arg1 $arg2 $arg3 $arg-etc --keytab my_file.keytab --principal [email protected] --class a.b.c.d app.jar For Kafka, a JAAS file has to be provided: With a keytab, recommended for long running jobs: KafkaClient { com.sun.security.auth.module.Krb5LoginModule required useKeyTab=true keyTab="./my_file.keytab" storeKey=true useTicketCache=false serviceName="kafka" principal="[email protected]"; }; Without a keytab, usually used for batch jobs: KafkaClient { com.sun.security.auth.module.Krb5LoginModule required useTicketCache=true renewTicket=true serviceName="kafka"; }; And, it also has to be mentioned at the JVM level: spark-submit $arg1 $arg2 $arg3 $arg-etc --files jaas.conf --conf spark.driver.extraJava.Options="-Djava.security.auth.login.config=./jaas.conf" --conf spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf" Livy2 - Example curl -X POST --data '{"kind": "pyspark", "queue": "default", "conf": { "spark.jars": "/usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.32-1.jar", "spark.submit.pyFiles":"/usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.32-1.zip", "spark.hadoop.hive.llap.daemon.service.hosts": "@llap0","spark.sql.hive.hiveserver2.jdbc.url": "jdbc:hive2://[node2.cloudera.com:2181,node3.cloudera.com:2181,node4.cloudera.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive](http://node2.cloudera.com:2181,node3.cloudera.com:2181,node4.cloudera.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive)", "spark.yarn.security.credentials.hiveserver2.enabled": "false","spark.sql.hive.hiveserver2.jdbc.url.principal": "hive/[[email protected]](mailto:[email protected])", "spark.datasource.hive.warehouse.load.staging.dir": "/tmp", "spark.datasource.hive.warehouse.metastoreUri": "thrift://node3.cloudera.com:9083", "spark.hadoop.hive.zookeeper.quorum": "[node2.cloudera.com:2181,node3.cloudera.com:2181,node4.cloudera.com:2181](http://node2.cloudera.com:2181,node3.cloudera.com:2181,node4.cloudera.com:2181)"}}' -H "X-Requested-By: admin" -H "Content-Type: application/json" --negotiate -u : [http://node3.cloudera.com:8999/sessions/](http://node3.cloudera.com:8999/sessions/) | python -mjson.tool Submitting a brief example to show databases (hive.showDatabases()): curl --negotiate -u : http://node2.cloudera.com:8999/sessions/2/statements -X POST -H 'Content-Type: application/json' -H "X-Requested-By: admin" -d '{"code":"from pyspark_llap import HiveWarehouseSession"}' curl --negotiate -u : http://node2.cloudera.com:8999/sessions/2/statements -X POST -H 'Content-Type: application/json' -H "X-Requested-By: admin" -d '{"code":"hive = HiveWarehouseSession.session(spark).build()"}' curl --negotiate -u : http://node2.cloudera.com:8999/sessions/2/statements -X POST -H 'Content-Type: application/json' -H "X-Requested-By: admin" -d '{"code":"hive.showDatabases().show()"}' Quick reference for basic API commands to check on the application status: # Check sessions. Based on the ID field, update the following curl commands to replace "2" with $ID. curl --negotiate -u : http://node2.cloudera.com:8999/sessions/ | python -mjson.tool # Check session status curl --negotiate -u : http://node2.cloudera.com:8999/sessions/2/status | python -mjson.tool # Check session logs curl --negotiate -u : http://node2.cloudera.com:8999/sessions/2/log | python -mjson.tool # Check session statements. curl --negotiate -u : http://node2.cloudera.com:8999/sessions/2/statements | python -mjson.tool Zeppelin - Example Livy2 Interpreter Assumptions: The cluster is kerberized. LLAP has already been enabled. We got the initial setup information using the script hwc_info_collect.sh In Ambari->Spark2->Configs->Advanced livy2-conf, the property livy.spark.deployMode should be set to either "yarn-cluster" or just plain "cluster". Note: Client mode is not supported. Extra steps: Add the following property=value in Ambari->Spark2->Configs->Custom livy2-conf section: - livy.file.local-dir-whitelist=/usr/hdp/current/hive_warehouse_connector/ We can test our configurations before setting them statically in the Interpreter: Notebook > First paragraph: %livy2.conf livy.spark.datasource.hive.warehouse.load.staging.dir=$value livy.spark.datasource.hive.warehouse.metastoreUri=$value livy.spark.hadoop.hive.llap.daemon.service.hosts=$value livy.spark.jars=file:///$value livy.spark.security.credentials.hiveserver2.enabled=true livy.spark.sql.hive.hiveserver2.jdbc.url=$value livy.spark.sql.hive.hiveserver2.jdbc.url.principal=$value livy.spark.submit.pyFiles=file:///$value livy.spark.hadoop.hive.zookeeper.quorum=$value Please note that compared to a regular spark-shell or spark-submit, this time we'll have to specify the filesystem scheme file:///, otherwise it'll try to reference a path on HDFS by default. Notebook > Second paragraph: %livy2 import com.hortonworks.hwc.HiveWarehouseSession import com.hortonworks.hwc.HiveWarehouseSession._ val hive = HiveWarehouseSession.session(spark).build() hive.showDatabases().show() Creating a Table with Dummy data in Hive For this specific task, we can expedite the table creation and dummy data ingest by referring to the Cloudera's VideoKB. In the above link, the Python script (HiveRandom.zip) should help you create and load a table based on an input table schema. Another short bash script is available show_create_cleaner.sh, and it can be used as the following: wget https://raw.githubusercontent.com/dbompart/hive_warehouse_connector/master/show_create_cleaner.sh chmod +x show_create_cleaner.sh ./show_create_cleaner.sh show_create_table_output_file.txt This bash script is a quick cleaner, it will make the Show create table stmt output re-usable in Hive or Spark by using --clean which will also remove the Table's Location and Table's Properties sections, i.e.: ./show_create_cleaner.sh show_create_table_output_file.txt --clean Common errors No service instances found in registry Check again the configuration settings, especially the llap.daemon.service.hosts value and also the corresponding zNode which should be available and readable in from Zookeeper. error: object hortonworks is not a member of package com This usually means that either the HWC jar or zip files were not successfully uploaded to the Spark classpath. We can confirm this by looking at the logs and searching for: Uploading resource file:/usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.32-1.jar Cannot run get splits outside HS2 Add hive.fetch.task.conversion="more" To Custom hiveserver2-interactive section. And check the LLAP logs if needed. Query returns no more than 1000 Follow the HWS API guide. This usually means that execute() method was incorrectly used instead of executeQuery(). "Blacklisted configuration values in session config: spark.master" In the /etc/livy2/conf/spark-blacklist.conf file on the Livy2 server host, reconfigure this file to allow/disallow for configurations to be modified. Unable to read HiveServer2 configs from ZooKeeper. Tried all existing HiveServer2 URIS from ZooKeeper LLAP may not be up and running or there is a problem on reading its znode. Suggested documentation https://community.cloudera.com/t5/Community-Articles/Integrating-Apache-Hive-with-Apache-Spark-Hive-Warehouse/ta-p/249035 https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html

dbompart · ‎08-13-2019

Zeppelin and Spark-shell are not the same client and properties work diferently, if you moved on to Zeppelin can we assume it did work for Spark-shell? In regard to the Zeppelin issue, the problem should be within the the path to the hive warehouse connector file either on the spark.jars or the spark.submit.pyFiles, I believe the path must be whitelisted in Zeppelin, but its clear that the hivewarehouseconnector files are not being succesfully uploaded to the application classpath, therefore, the pyspark_llap module cannot be imported. Hope it helps. BR, David

shobikas · ‎01-08-2019

@Nikhil Raina In hadoop, Mapreduce breaks the jobs into task and these task runs in a parallel way. So that the overall execution time may reduce. Now among the divided tasks, if one of the tasks take more time than desired, then the overall execution time of job increases. The reason can be anything: node busy, network congestion, etc, which limits the total execution time of the Job, and the system should wait for the slow running tasks to be completed. It may be difficult to detect causes since the tasks still complete successfully, although more time is taken than the expected time. Hadoop doesn’t try to diagnose and fix slow running tasks, instead, it tries to detect them and runs backup tasks for them. The backup tasks will be preferentially scheduled on the faster nodes. This is called "speculative execution" in Hadoop. The "backup task" are "speculative Tasks". When a task successfully completes, then duplicate tasks that are running are killed since they are no longer needed. If the original task finishes first, then the speculative task will be killed. On the other hand, if the speculative task finishes first, then the original one will be killed. Simply, "Speculative execution" is a "MapReduce job optimization technique" in Hadoop that is enabled by default. To disable that set the property value "mapred.map.tasks.speculative.execution" - "false" and "mapred.reduce.tasks.speculative.execution" - "false" in "mapred-site.xml". Please accept this answer if you found it helpful.

Online	Offline
Last Visited	‎11-30-2022 01:29 PM

Member Since	‎07-05-2017 05:13 PM
Last Visited	‎11-30-2022 01:29 PM
Posts	74
Kudos received	3

Cloudera Community

Re: Hive databases are not visible in Spark sessio...

Re: Hive databases are not visible in Spark sessio...

Re: HDP 3.1 & Spark 2.3.2 - hive.table("default.ta...

Re: what is speculative execution

Re: HDP 3.1 & Spark 2.3.2 - hive.table("default.ta...

Re: write python dataframe (pandas, spark) from CM...

Re: Support Video: How to configure log4j for Spar...

Re: Openldap Setup

Spark - HiveWarehouseConnector Quick Setup

Re: Hive databases are not visible in Spark sessio...

Re: what is speculative execution