Member since: 06-02-2020
331 Posts
67 Kudos Received
49 Solutions

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2807 | 07-11-2024 01:55 AM |
| | 7880 | 07-09-2024 11:18 PM |
| | 6580 | 07-09-2024 04:26 AM |
| | 5918 | 07-09-2024 03:38 AM |
| | 5625 | 06-05-2024 02:03 AM |
02-08-2023
11:00 PM
Hi @sat_046 I don't think there is a specific configuration parameter to add a delay between task retry attempts. However, there are parameters to blacklist a node once a task has failed a certain number of attempts on that node.
References:
1. https://community.cloudera.com/t5/Community-Articles/Configuring-spark-task-maxFailures-amp-spark-blacklist-task/ta-p/335235
2. https://www.waitingforcode.com/apache-spark/failed-tasks-resubmit/read
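Below is a minimal sketch of the blacklist-based approach from reference 1; the application name and the values shown are illustrative assumptions, not tuned recommendations.
import org.apache.spark.sql.SparkSession

// Tasks are retried up to spark.task.maxFailures times (with no configurable delay between
// attempts), and the blacklist settings stop scheduling further attempts on a node where
// the task keeps failing.
val spark = SparkSession.builder()
  .appName("task-retry-blacklist-sketch")
  .config("spark.task.maxFailures", "4")                      // task attempts before the job fails
  .config("spark.blacklist.enabled", "true")
  .config("spark.blacklist.task.maxTaskAttemptsPerNode", "2") // attempts allowed on the same node
  .getOrCreate()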
01-18-2023
01:07 AM
Hi @Nikhil44 First of all, Cloudera does not support a standalone Spark installation. To access any Hive table, Spark needs hive-site.xml and the Hadoop-related configuration files (core-site.xml, hdfs-site.xml and yarn-site.xml).
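As an illustration, here is a minimal sketch, assuming hive-site.xml, core-site.xml and hdfs-site.xml have been copied into the standalone installation's conf directory (e.g. $SPARK_HOME/conf) so that they are picked up from the classpath; the application name is arbitrary.
import org.apache.spark.sql.SparkSession

// enableHiveSupport() makes Spark talk to the Hive metastore described in hive-site.xml;
// without that file on the classpath, Spark falls back to a local embedded metastore
// instead of the cluster's Hive tables.
val spark = SparkSession.builder()
  .appName("hive-table-access-sketch")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()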
12-20-2022
10:18 PM
Hi @Samie Is there any update on your testing?
12-15-2022
09:13 PM
Hi @Samie Please attach the Spark application and event logs so we can check the queue name. The easiest way to test this is by running the Spark Pi example:
spark-submit \
--class org.apache.spark.examples.SparkPi \
--queue <queue_name> \
--master yarn \
--deploy-mode cluster \
--num-executors 1 \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 1 \
/usr/hdp/current/spark2-client/examples/jars/spark-examples_*.jar 10

Spark on YARN only:
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
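As a quick cross-check from a spark-shell started with the same --queue value (a sketch; it assumes the queue is surfaced in the SparkConf as spark.yarn.queue, which is how spark-submit passes --queue on YARN):
// Prints the YARN queue the application was submitted to; falls back to "default"
// when spark.yarn.queue was not set explicitly.
println(spark.sparkContext.getConf.get("spark.yarn.queue", "default"))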
12-08-2022
08:30 PM
Hi @quangbilly79 You have used the CDP hbase-spark-1.0.0.7.2.15.0-147.jar instead of the CDH one. There is no guarantee that the latest CDP jar will work with CDH; luckily, in your case it worked.
11-07-2022
02:09 AM
Hi @PNCJeff I would recommend installing and using Livy Server in the CDP cluster. The Livy Kerberos configuration parameters are below:
livy.server.launch.kerberos.keytab=<LIVY_SERVER_PATH>/livy.keytab
livy.server.launch.kerberos.principal=livy/server@DOMAIN.COM
livy.server.auth.type=kerberos
livy.server.auth.kerberos.keytab=<LIVY_SERVER_PATH>/livy.keytab
livy.server.auth.kerberos.principal=HTTP/server@DOMAIN.COM
livy.server.auth.kerberos.name-rules=RULE:[2:$1@$0](rangeradmin@DOMAIN.COM)s/(.*)@DOMAIN.COM/ranger/\u000ARULE:[2:$1@$0](rangertagsync@DOMAIN.COM)s/(.*)@DOMAIN.COM/rangertagsync/\u000ARULE:[2:$1@$0](rangerusersync@DOMAIN.COM)s/(.*)@DOMAIN.COM/rangerusersync/\u000ARULE:[2:$1@$0](rangerkms@DOMAIN.COM)s/(.*)@DOMAIN.COM/keyadmin/\u000ARULE:[2:$1@$0](atlas@DOMAIN.COM)s/(.*)@DOMAIN.COM/atlas/\u000ADEFAULT\u000A
10-27-2022
08:21 PM
Hi @Jean-Luc You can try the following example code: https://github.com/rangareddy/ranga_spark_experiments/tree/master/spark_hbase_cdh_integration
10-14-2022
06:10 AM
Let's assume we have a Hive table named test, and the table is stored under the /tmp directory. In the test table, data is stored as below:
hdfs dfs -ls -R /tmp/test
drwxr-xr-x - hive hive 0 2022-08-24 09:15 /tmp/test/dir1
-rw-r--r-- 3 hive hive 685 2022-08-24 09:15 /tmp/test/dir1/000000_0
drwxr-xr-x - hive hive 0 2022-08-24 09:15 /tmp/test/dir2
-rw-r--r-- 3 hive hive 685 2022-08-24 09:15 /tmp/test/dir2/000000_0

Generally, this kind of layout is generated by UNION ALL operations in Hive. If we try to load the Hive table data using Spark, we will get the following exception:
scala> spark.sql("SELECT * FROM test").show()
java.io.IOException: Not a file: hdfs://localhost:8020/tmp/test/dir1
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:340)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
at scala.Option.getOrElse(Option.scala:121)
.....

By default, Spark will not read the table data if it contains subdirectories. To solve this issue, we need to set the following parameter:
spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
We can also get the same kind of exception while reading the data in Hive. To solve this issue in Hive, we need to set the following two parameters:
hive> set mapred.input.dir.recursive=true;
hive> set hive.mapred.supports.subdirectories=true;
We can also set the above two parameters in hive-site.xml.
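Putting the Spark-side fix together in spark-shell, using the test table from this example:
// Enable recursive input listing for this session, then re-run the query that failed above.
spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
spark.sql("SELECT * FROM test").show()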
10-14-2022
04:14 AM
Let's understand the information_schema database:
Hive supports the ANSI-standard information_schema database, which we can query for information about tables, views, columns, and our Hive privileges. The information_schema data reveals the state of the system, similar to sys database data, but in a user-friendly, read-only way.
Example:
SELECT * FROM information_schema.tables WHERE is_insertable_into='YES' limit 2;
...
+--------------------+-------------------+-----------------
|tables.table_catalog|tables.table_schema|tables.table_name
+--------------------+-------------------+-----------------
|default |default |students2
|default |default |t3
Now we will try to access the schemata table under the information_schema database using Spark:
spark.sql("select * from information_schema.schemata").show()
We will get the following exception:
org.apache.spark.sql.AnalysisException: Undefined function: 'restrict_information_schema'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 632
We get the above exception because Spark does not have the restrict_information_schema function, while Hive does. We can check the available functions using the following command:
spark.sql("show functions").show()
We can solve the above error by passing the hive-exec.jar and creating a temporary function:
spark-shell --jars /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/hive/lib/hive-exec.jar
spark.sql("""
CREATE TEMPORARY FUNCTION restrict_information_schema AS
'org.apache.hadoop.hive.ql.udf.generic.GenericUDFRestrictInformationSchema'
""")
After registering the function, if we try to access the table data, we will get another error like the one below:
scala> spark.sql("select * from information_schema.schemata").show()
org.apache.spark.sql.AnalysisException: Undefined function: 'current_user'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 632
It is difficult to find out how many such functions we need to register.
To avoid registering functions, we can use the Spark JDBC API to read the tables under information_schema.
spark-shell --jars /opt/cloudera/parcels/CDH/lib/hive/lib/hive-jdbc.jar
val options = Map(
"url" -> "jdbc:hive2://localhost:10000/default;",
"driver" -> "org.apache.hive.jdbc.HiveDriver",
"dbtable" -> "information_schema.schemata",
"user" -> "hive_user",
"password" -> "hive_password"
)
val df = spark.read.format("jdbc").options(options).load()
df.show()
+---------------------+--------------------+---------------------+--------------------------------------+-------------------------------------+-----------------------------------+-----------------+
|schemata.catalog_name|schemata.schema_name|schemata.schema_owner|schemata.default_character_set_catalog|schemata.default_character_set_schema|schemata.default_character_set_name|schemata.sql_path|
+---------------------+--------------------+---------------------+--------------------------------------+-------------------------------------+-----------------------------------+-----------------+
| schemata.catalog_...|schemata.schema_name| schemata.schema_o...| schemata.default_...| schemata.default_...| schemata.default_...|schemata.sql_path|
| schemata.catalog_...|schemata.schema_name| schemata.schema_o...| schemata.default_...| schemata.default_...| schemata.default_...|schemata.sql_path|
| schemata.catalog_...|schemata.schema_name| schemata.schema_o...| schemata.default_...| schemata.default_...| schemata.default_...|schemata.sql_path|
+---------------------+--------------------+---------------------+--------------------------------------+-------------------------------------+-----------------------------------+-----------------+
References:
1. https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/using-hiveql/topics/hive_query_information_schema.html
10-11-2022
02:55 AM
Hi @fares_ In the above application log, we can clearly see that the Docker mount path is not found. Could you please fix the mount issue and also verify the spark-submit parameters once?

Shell error output:
Could not determine real path of mount '/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056'
Could not determine real path of mount '/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056'
Invalid docker mount '/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056:/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056:rw', realpath=/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056
Error constructing docker command, docker error code=13, error message='Invalid docker mount'

Reference: https://docs.cloudera.com/cdp-private-cloud-base/7.1.6/yarn-managing-docker-containers/topics/yarn-docker-example-spark.html