Member since: 06-02-2020
331 Posts
67 Kudos Received
49 Solutions

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2807 | 07-11-2024 01:55 AM |
| | 7880 | 07-09-2024 11:18 PM |
| | 6580 | 07-09-2024 04:26 AM |
| | 5918 | 07-09-2024 03:38 AM |
| | 5625 | 06-05-2024 02:03 AM |
02-08-2023
11:00 PM
Hi @sat_046 I don't think there is a specific configuration parameter to add a delay between task retry attempts. However, there are parameters to blacklist a node once a task has failed a certain number of attempts on that node.
References:
1. https://community.cloudera.com/t5/Community-Articles/Configuring-spark-task-maxFailures-amp-spark-blacklist-task/ta-p/335235
2. https://www.waitingforcode.com/apache-spark/failed-tasks-resubmit/read
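Below is a minimal sketch of the blacklist-based approach from reference 1; the application name and the values shown are illustrative assumptions, not tuned recommendations.
import org.apache.spark.sql.SparkSession

// Tasks are retried up to spark.task.maxFailures times (with no configurable delay between
// attempts), and the blacklist settings stop scheduling further attempts on a node where
// the task keeps failing.
val spark = SparkSession.builder()
  .appName("task-retry-blacklist-sketch")
  .config("spark.task.maxFailures", "4")                      // task attempts before the job fails
  .config("spark.blacklist.enabled", "true")
  .config("spark.blacklist.task.maxTaskAttemptsPerNode", "2") // attempts allowed on the same node
  .getOrCreate()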
01-18-2023
01:07 AM
Hi @Nikhil44 First of all, Cloudera does not support a standalone Spark installation. To access any Hive table, Spark needs hive-site.xml and the Hadoop-related configuration files (core-site.xml, hdfs-site.xml and yarn-site.xml).
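As an illustration, here is a minimal sketch, assuming hive-site.xml, core-site.xml and hdfs-site.xml have been copied into the standalone installation's conf directory (e.g. $SPARK_HOME/conf) so that they are picked up from the classpath; the application name is arbitrary.
import org.apache.spark.sql.SparkSession

// enableHiveSupport() makes Spark talk to the Hive metastore described in hive-site.xml;
// without that file on the classpath, Spark falls back to a local embedded metastore
// instead of the cluster's Hive tables.
val spark = SparkSession.builder()
  .appName("hive-table-access-sketch")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()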
12-20-2022
10:18 PM
Hi @Samie Is there any update on your testing?
12-15-2022
09:13 PM
Hi @Samie Please attach the Spark application and event logs so we can check the queue name. The easiest way to test this is by running the Spark Pi example:
spark-submit \
--class org.apache.spark.examples.SparkPi \
--queue <queue_name> \
--master yarn \
--deploy-mode cluster \
--num-executors 1 \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 1 \
/usr/hdp/current/spark2-client/examples/jars/spark-examples_*.jar 10

Spark on YARN only:
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
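As a quick cross-check from a spark-shell started with the same --queue value (a sketch; it assumes the queue is surfaced in the SparkConf as spark.yarn.queue, which is how spark-submit passes --queue on YARN):
// Prints the YARN queue the application was submitted to; falls back to "default"
// when spark.yarn.queue was not set explicitly.
println(spark.sparkContext.getConf.get("spark.yarn.queue", "default"))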
12-08-2022
08:30 PM
Hi @quangbilly79 You have used the CDP hbase-spark-1.0.0.7.2.15.0-147.jar instead of the CDH one. There is no guarantee that the latest CDP jar will work with CDH; luckily, in your case it worked.
11-07-2022
02:09 AM
Hi @PNCJeff I would recommend installing and using Livy Server in the CDP cluster. The Livy Kerberos configuration parameters are below:
livy.server.launch.kerberos.keytab=<LIVY_SERVER_PATH>/livy.keytab
livy.server.launch.kerberos.principal=livy/server@DOMAIN.COM
livy.server.auth.type=kerberos
livy.server.auth.kerberos.keytab=<LIVY_SERVER_PATH>/livy.keytab
livy.server.auth.kerberos.principal=HTTP/server@DOMAIN.COM
livy.server.auth.kerberos.name-rules=RULE:[2:$1@$0](rangeradmin@DOMAIN.COM)s/(.*)@DOMAIN.COM/ranger/\u000ARULE:[2:$1@$0](rangertagsync@DOMAIN.COM)s/(.*)@DOMAIN.COM/rangertagsync/\u000ARULE:[2:$1@$0](rangerusersync@DOMAIN.COM)s/(.*)@DOMAIN.COM/rangerusersync/\u000ARULE:[2:$1@$0](rangerkms@DOMAIN.COM)s/(.*)@DOMAIN.COM/keyadmin/\u000ARULE:[2:$1@$0](atlas@DOMAIN.COM)s/(.*)@DOMAIN.COM/atlas/\u000ADEFAULT\u000A
10-27-2022
08:21 PM
Hi @Jean-Luc You can try the following example code: https://github.com/rangareddy/ranga_spark_experiments/tree/master/spark_hbase_cdh_integration
10-14-2022
06:10 AM
Let's assume we have a Hive table named test, and the table is stored under the /tmp directory. In the test table, data is stored as below:
hdfs dfs -ls -R /tmp/test
drwxr-xr-x - hive hive 0 2022-08-24 09:15 /tmp/test/dir1
-rw-r--r-- 3 hive hive 685 2022-08-24 09:15 /tmp/test/dir1/000000_0
drwxr-xr-x - hive hive 0 2022-08-24 09:15 /tmp/test/dir2
-rw-r--r-- 3 hive hive 685 2022-08-24 09:15 /tmp/test/dir2/000000_0

Generally, this kind of layout is generated by UNION ALL operations in Hive. If we try to load the Hive table data using Spark, we will get the following exception:
scala> spark.sql("SELECT * FROM test").show()
java.io.IOException: Not a file: hdfs://localhost:8020/tmp/test/dir1
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:340)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
at scala.Option.getOrElse(Option.scala:121)
.....

By default, Spark will not read the table data if it contains subdirectories. To solve this issue, we need to set the following parameter:
spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
We can also get the same kind of exception while reading the data in Hive. To solve this issue in Hive, we need to set the following two parameters:
hive> set mapred.input.dir.recursive=true;
hive> set hive.mapred.supports.subdirectories=true;
We can also set the above two parameters in hive-site.xml.
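Putting the Spark-side fix together in spark-shell, using the test table from this example:
// Enable recursive input listing for this session, then re-run the query that failed above.
spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
spark.sql("SELECT * FROM test").show()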
10-14-2022
04:14 AM
Let's understand the information_schema database:
Hive supports the ANSI-standard information_schema database, which we can query for information about tables, views, columns, and our Hive privileges. The information_schema data reveals the state of the system, similar to sys database data, but in a user-friendly, read-only way.
Example:
SELECT * FROM information_schema.tables WHERE is_insertable_into='YES' limit 2;
...
+--------------------+-------------------+-----------------
|tables.table_catalog|tables.table_schema|tables.table_name
+--------------------+-------------------+-----------------
|default |default |students2
|default |default |t3
Now we will try to access the schemata table under the information_schema database using Spark:
spark.sql("select * from information_schema.schemata").show()
We will get the following exception:
org.apache.spark.sql.AnalysisException: Undefined function: 'restrict_information_schema'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 632
We get the above exception because Spark does not have the restrict_information_schema function, while Hive does. We can check the available functions using the following command:
spark.sql("show functions").show()
We can solve the above error by passing the hive-exec.jar and creating a temporary function:
spark-shell --jars /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/hive/lib/hive-exec.jar
spark.sql("""
CREATE TEMPORARY FUNCTION restrict_information_schema AS
'org.apache.hadoop.hive.ql.udf.generic.GenericUDFRestrictInformationSchema'
""")
After registering the function, if we try to access the table data, we will get another error like the one below:
scala> spark.sql("select * from information_schema.schemata").show()
org.apache.spark.sql.AnalysisException: Undefined function: 'current_user'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 632
It is difficult to find out how many such functions we need to register.
To avoid registering functions, we can use the Spark JDBC API to read the tables under information_schema.
spark-shell --jars /opt/cloudera/parcels/CDH/lib/hive/lib/hive-jdbc.jar
val options = Map(
"url" -> "jdbc:hive2://localhost:10000/default;",
"driver" -> "org.apache.hive.jdbc.HiveDriver",
"dbtable" -> "information_schema.schemata",
"user" -> "hive_user",
"password" -> "hive_password"
)
val df = spark.read.format("jdbc").options(options).load()
df.show()
+---------------------+--------------------+---------------------+--------------------------------------+-------------------------------------+-----------------------------------+-----------------+
|schemata.catalog_name|schemata.schema_name|schemata.schema_owner|schemata.default_character_set_catalog|schemata.default_character_set_schema|schemata.default_character_set_name|schemata.sql_path|
+---------------------+--------------------+---------------------+--------------------------------------+-------------------------------------+-----------------------------------+-----------------+
| schemata.catalog_...|schemata.schema_name| schemata.schema_o...| schemata.default_...| schemata.default_...| schemata.default_...|schemata.sql_path|
| schemata.catalog_...|schemata.schema_name| schemata.schema_o...| schemata.default_...| schemata.default_...| schemata.default_...|schemata.sql_path|
| schemata.catalog_...|schemata.schema_name| schemata.schema_o...| schemata.default_...| schemata.default_...| schemata.default_...|schemata.sql_path|
+---------------------+--------------------+---------------------+--------------------------------------+-------------------------------------+-----------------------------------+-----------------+
References:
1. https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/using-hiveql/topics/hive_query_information_schema.html
10-11-2022
02:55 AM
Hi @fares_ In the above application log, we can clearly see that the Docker mount path is not found. Could you please fix the mount issue and also verify the spark-submit parameters once?

Shell error output:
Could not determine real path of mount '/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056'
Could not determine real path of mount '/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056'
Invalid docker mount '/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056:/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056:rw', realpath=/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056
Error constructing docker command, docker error code=13, error message='Invalid docker mount'

Reference: https://docs.cloudera.com/cdp-private-cloud-base/7.1.6/yarn-managing-docker-containers/topics/yarn-docker-example-spark.html