Member since: 06-02-2020
Posts: 131
Kudos Received: 18
Solutions: 16
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 756 | 09-30-2022 02:01 AM |
 | 474 | 09-16-2022 05:19 AM |
 | 239 | 08-30-2022 04:25 AM |
 | 259 | 08-08-2022 03:40 AM |
 | 3401 | 08-03-2022 10:53 PM |
01-18-2023
01:07 AM
Hi @Nikhil44 First of all, Cloudera does not support a standalone Spark installation. To access any Hive table, Spark needs a hive-site.xml along with the Hadoop-related configuration files (core-site.xml, hdfs-site.xml, and yarn-site.xml).
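A minimal PySpark sketch of what that looks like; the configuration directories shown here are illustrative and need to be adjusted to your cluster:

import os
from pyspark.sql import SparkSession

# Illustrative paths: point these at the directories holding the cluster client configs.
os.environ["HADOOP_CONF_DIR"] = "/etc/hadoop/conf"   # core-site.xml, hdfs-site.xml, yarn-site.xml
os.environ["SPARK_CONF_DIR"] = "/etc/spark/conf"     # must contain hive-site.xml

spark = (SparkSession.builder
         .appName("hive-access-check")
         .enableHiveSupport()        # requires hive-site.xml to be visible on the classpath
         .getOrCreate())

spark.sql("SHOW DATABASES").show()   # Hive databases should be listed if the configs are picked up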
... View more
12-20-2022
10:18 PM
Hi @Samie Is there any update on your testing?
... View more
12-15-2022
09:13 PM
Hi @Samie Please attach the Spark application and event logs so we can check the queue name. The easiest way to test a Spark application is by running the SparkPi example:

spark-submit \
--class org.apache.spark.examples.SparkPi \
--queue <queue_name> \
--master yarn \
--deploy-mode cluster \
--num-executors 1 \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 1 \
/usr/hdp/current/spark2-client/examples/jars/spark-examples_*.jar 10

Spark on YARN only:
--queue QUEUE_NAME    The YARN queue to submit to (Default: "default").
... View more
12-08-2022
08:30 PM
Hi @quangbilly79 You have used the CDP hbase-spark-1.0.0.7.2.15.0-147.jar instead of the CDH one. There is no guarantee that the latest CDP jar will work in CDH; luckily, in your case it worked.
... View more
11-07-2022
02:09 AM
Hi @PNCJeff I would recommend installing and using the Livy server in the CDP cluster. The Livy Kerberos configuration parameters are below (a quick connectivity check is sketched after them):

livy.server.launch.kerberos.keytab=<LIVY_SERVER_PATH>/livy.keytab
livy.server.launch.kerberos.principal=livy/server@DOMAIN.COM
livy.server.auth.type=kerberos
livy.server.auth.kerberos.keytab=<LIVY_SERVER_PATH>/livy.keytab
livy.server.auth.kerberos.principal=HTTP/server@DOMAIN.COM
livy.server.auth.kerberos.name-rules=RULE:[2:$1@$0](rangeradmin@DOMAIN.COM)s/(.*)@DOMAIN.COM/ranger/\u000ARULE:[2:$1@$0](rangertagsync@DOMAIN.COM)s/(.*)@DOMAIN.COM/rangertagsync/\u000ARULE:[2:$1@$0](rangerusersync@DOMAIN.COM)s/(.*)@DOMAIN.COM/rangerusersync/\u000ARULE:[2:$1@$0](rangerkms@DOMAIN.COM)s/(.*)@DOMAIN.COM/keyadmin/\u000ARULE:[2:$1@$0](atlas@DOMAIN.COM)s/(.*)@DOMAIN.COM/atlas/\u000ADEFAULT\u000A
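Once those properties are in place and Livy is restarted, a hedged way to sanity-check the Kerberized endpoint from Python (assuming the requests and requests-kerberos packages, a valid kinit ticket, and an illustrative host/port) is:

import requests
from requests_kerberos import HTTPKerberosAuth

# GET /sessions against the Livy REST API; SPNEGO auth is negotiated via the Kerberos ticket cache.
resp = requests.get("http://livy-server.example.com:8998/sessions", auth=HTTPKerberosAuth())
print(resp.status_code, resp.json())   # 200 plus a JSON session list means authentication works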
... View more
10-27-2022
08:21 PM
Hi @Jean-Luc You can try the following example code https://github.com/rangareddy/ranga_spark_experiments/tree/master/spark_hbase_cdh_integration
... View more
10-14-2022
06:10 AM
Let's assume we have a Hive table named test and the table is stored under the /tmp directory. The data in the test table is laid out as below:

hdfs dfs -ls -R /tmp/test
drwxr-xr-x   - hive hive   0 2022-08-24 09:15 /tmp/test/dir1
-rw-r--r--   3 hive hive 685 2022-08-24 09:15 /tmp/test/dir1/000000_0
drwxr-xr-x   - hive hive   0 2022-08-24 09:15 /tmp/test/dir2
-rw-r--r--   3 hive hive 685 2022-08-24 09:15 /tmp/test/dir2/000000_0

Generally, this kind of layout is produced by UNION ALL operations in Hive. If we try to load the Hive table data using Spark, we get the following exception:

scala> spark.sql("SELECT * FROM test").show()
java.io.IOException: Not a file: hdfs://localhost:8020/tmp/test/dir1
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:340)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
at scala.Option.getOrElse(Option.scala:121)
.....

By default, Spark will not read the table data if it contains subdirectories. To solve this issue, we need to set the following parameter:

spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

We can hit the same kind of exception while reading the data in Hive itself. To solve the issue in Hive, we need to set the following two parameters:

hive> set mapred.input.dir.recursive=true;
hive> set hive.mapred.supports.subdirectories=true;

We can also set the above two parameters in hive-site.xml.
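Putting the Spark side together, a compact PySpark sketch of the same workaround (table name and layout as in the example above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-union-all-output").enableHiveSupport().getOrCreate()

# Ask the underlying Hadoop input format to recurse into the table's subdirectories.
spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

spark.sql("SELECT * FROM test").show()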
... View more
10-14-2022
04:14 AM
Let's understand the information_schema database:
Hive supports the ANSI-standard information_schema database, which we can query for information about tables, views, columns, and our Hive privileges. The information_schema data reveals the state of the system, similar to sys database data, but in a user-friendly, read-only way.
Example:
SELECT * FROM information_schema.tables WHERE is_insertable_into='YES' limit 2;
...
+--------------------+-------------------+-----------------
|tables.table_catalog|tables.table_schema|tables.table_name
+--------------------+-------------------+-----------------
|default |default |students2
|default |default |t3
Now we will try to access the following table under the information_schema database.
spark.sql("select * from information_schema.schemata").show()
We will get the following exception:
org.apache.spark.sql.AnalysisException: Undefined function: 'restrict_information_schema'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 632
We get the above exception because Spark does not have the restrict_information_schema function, whereas Hive does. We can check the available functions using the following command:
spark.sql("show functions").show()
We can solve the above error by passing hive-exec.jar and by creating a temporary function.
spark-shell --jars /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/hive/lib/hive-exec.jar
spark.sql("""
CREATE TEMPORARY FUNCTION restrict_information_schema AS
'org.apache.hadoop.hive.ql.udf.generic.GenericUDFRestrictInformationSchema'
""")
After registering the function, if we try to access the table data we will get another error like the one below:
scala> spark.sql("select * from information_schema.schemata").show()
org.apache.spark.sql.AnalysisException: Undefined function: 'current_user'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 632
It is difficult to find out how many such functions we need to register.
To avoid registering functions, we can use the Spark JDBC API to read the tables under information_schema.
spark-shell --jars /opt/cloudera/parcels/CDH/lib/hive/lib/hive-jdbc.jar
val options = Map(
"url" -> "jdbc:hive2://localhost:10000/default;",
"driver" -> "org.apache.hive.jdbc.HiveDriver",
"dbtable" -> "information_schema.schemata",
"user" -> "hive_user",
"password" -> "hive_password"
)
val df = spark.read.format("jdbc").options(options).load()
df.show()
+---------------------+--------------------+---------------------+--------------------------------------+-------------------------------------+-----------------------------------+-----------------+
|schemata.catalog_name|schemata.schema_name|schemata.schema_owner|schemata.default_character_set_catalog|schemata.default_character_set_schema|schemata.default_character_set_name|schemata.sql_path|
+---------------------+--------------------+---------------------+--------------------------------------+-------------------------------------+-----------------------------------+-----------------+
| schemata.catalog_...|schemata.schema_name| schemata.schema_o...| schemata.default_...| schemata.default_...| schemata.default_...|schemata.sql_path|
| schemata.catalog_...|schemata.schema_name| schemata.schema_o...| schemata.default_...| schemata.default_...| schemata.default_...|schemata.sql_path|
| schemata.catalog_...|schemata.schema_name| schemata.schema_o...| schemata.default_...| schemata.default_...| schemata.default_...|schemata.sql_path|
+---------------------+--------------------+---------------------+--------------------------------------+-------------------------------------+-----------------------------------+-----------------+
References:
1. https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/using-hiveql/topics/hive_query_information_schema.html
... View more
10-11-2022
02:55 AM
Hi @fares_ In the above application log we can clearly see that the Docker mount path was not found. Could you please fix the mount issue, and also verify the spark-submit parameters once?

Shell error output:
Could not determine real path of mount '/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056'
Could not determine real path of mount '/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056'
Invalid docker mount '/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056:/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056:rw', realpath=/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056
Error constructing docker command, docker error code=13, error message='Invalid docker mount'

Reference: https://docs.cloudera.com/cdp-private-cloud-base/7.1.6/yarn-managing-docker-containers/topics/yarn-docker-example-spark.html
... View more
10-11-2022
02:13 AM
Hi @Ploeplse If you are still facing the issue, could you share the requested information (i.e., the code and the Impala table creation script)?
... View more
10-07-2022
04:59 AM
Hi @VidyaSargur / @DianaTorres Could you please recheck this support question asked by the user?
... View more
09-30-2022
02:01 AM
Hi @imule Add the following parameters to your spark-submit:

--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=<python3_path>
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=<python3_path>

Note:
1. Ensure <python3_path> exists on all nodes.
2. Ensure the required modules are installed on every node.
... View more
09-21-2022
10:22 PM
Hi @Boron Could you please set the SPARK_HOME environment variable as below before creating the SparkSession?

import os
os.environ['SPARK_HOME'] = '/usr/hdp/current/spark-client'

References:
https://stackoverflow.com/questions/55569985/pyspark-could-not-find-valid-spark-home
https://stackoverflow.com/questions/40087188/cant-find-spark-submit-when-typing-spark-shell
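Alternatively, if you prefer not to manage the environment variable by hand, a sketch using the findspark package (assuming it is installed on the machine) can locate Spark for you before the session is created:

import findspark
findspark.init('/usr/hdp/current/spark-client')   # sets SPARK_HOME and patches sys.path

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("spark-home-check").getOrCreate()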
... View more
09-16-2022
05:19 AM
Hi @poorva Please check the application logs for the failed application from the Resource Manager UI; the exception message is printed there. Fix the exception and resubmit the job.
... View more
09-14-2022
12:54 AM
Hi @Ploeplse Could you please share reproducible sample code and the Impala table creation script?
... View more
08-31-2022
10:53 PM
Hi @Yosieam Please avoid calling the read_file_log.collect() method. It brings the whole dataset to the driver, so the driver needs enough memory to hold all of that data. Please check the modified code:

move_to_rdd = sc.textFile("datalog2.log").map(lambda row : row.split("time=")).filter(lambda x : x != "")
ReSymbol = move_to_rdd.map(lambda x : re.sub(r'\t', ' ', x)).map(lambda x : re.sub(r'\n', ' ', x)).map(lambda x : re.sub(r' +', ' ', x))
... View more
08-31-2022
10:48 PM
Hi @mmk I think you have shared the following information:

7 nodes, each with 250 GB memory and 32 vcores

spark-defaults.conf
spark.executor.memory = 100g
spark.executor.memoryOverhead = 49g
spark.driver.memoryOverhead = 200g
spark.driver.memory = 500g

You have a maximum of 250 GB per node, yet you have specified 500 GB of driver memory plus 200 GB of overhead. How can the driver get 700 GB? Generally, driver/executor memory should not exceed the YARN physical memory.

Coming to the actual problem: please avoid using show() to print 8,000,000 records. If you need to print all the values, implement logic that fetches 1000 records at a time and iterates until the data is exhausted.

https://stackoverflow.com/questions/29227949/how-to-implement-spark-sql-pagination-query
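As an illustration of that batching idea, here is a minimal PySpark sketch (the DataFrame is a stand-in for your real one) that streams rows to the driver in chunks instead of collecting everything at once:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("paged-print").getOrCreate()
df = spark.range(0, 8000000)            # stand-in for the real 8M-row DataFrame

batch, batch_size = [], 1000
for row in df.toLocalIterator():        # pulls partitions to the driver one at a time
    batch.append(row)
    if len(batch) == batch_size:
        for r in batch:
            print(r)
        batch.clear()
for r in batch:                         # flush the final partial batch
    print(r)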
... View more
08-31-2022
09:45 PM
Hi @mmk By default, Hive loads all the SerDe jars under the hive/lib location, which is why create/insert/select operations work from Hive. In order to read a Hive table created with a custom or external SerDe, we need to provide that SerDe jar to Spark as well, so that Spark can load the library internally and read the Hive table data. If the SerDe is not provided, you will see the following exception:

org.apache.hadoop.hive.serde2.SerDeException

Please add the following library to the spark-submit command:

json-serde-<version>.jar
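For reference, a hedged PySpark sketch of making such a jar visible to the session (the path, jar name, and table name are placeholders, not values from this thread):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("read-custom-serde-table")
         .config("spark.jars", "/path/to/json-serde-<version>.jar")  # hypothetical local path to the SerDe jar
         .enableHiveSupport()
         .getOrCreate())

spark.table("db.json_table").show()   # illustrative table backed by the custom SerDe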
... View more
08-31-2022
09:36 PM
Hi @suri789 I don't think you have shared the full code, sample data, and expected output needed to provide a solution. Please share the code in a proper format.
... View more
08-31-2022
09:33 PM
Hi @AZIMKBC Please try to run the SparkPi example and see whether there are any errors in the logs. https://rangareddy.github.io/SparkPiExample/ If the issue is still not resolved and you are a Cloudera customer, please raise a case and we will work on it internally.
... View more
08-31-2022
09:29 PM
Hi @shraddha Could you please check whether you have set the master to local while creating the SparkSession in your code? Use the following sample code to run both locally and on a cluster without updating the master value:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val appName = "MySparkApp"

// Creating the SparkConf object
val sparkConf = new SparkConf().setAppName(appName).setIfMissing("spark.master", "local[2]")

// Creating the SparkSession object
val spark: SparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

Verify the whole log once again to check whether there are any other errors.
... View more
08-31-2022
09:20 PM
Hi @Yosieam Thanks for sharing the code. You forgot to share the spark-submit/pyspark command. Please check what executor/driver memory is being passed to spark-submit. Could you also confirm whether the file is on the local filesystem or in HDFS?
... View more
08-31-2022
09:15 PM
Hi @nvelraj The PySpark job works locally because the pandas library is installed on your local system. When you run it on the cluster, the pandas library/module is not available there, so you get the following error:

ModuleNotFoundError: No module named 'pandas'

To solve the issue, you need to install the pandas library/module on all machines or use a virtual environment.
... View more
08-31-2022
09:08 PM
Hi @dmharshit As you know, Cloudera provides a hybrid data platform, so you can install the CDP product on-premises, in the public cloud, or both. The CDP Private Cloud Base product is supported only for on-premises clusters. The CDP Public Cloud product is supported on public clouds like AWS, Azure, and GCP. @fzsombor has already shared references on how to install CDP Private Cloud and how to install Spark3 as well. Please let me know if you still need any further information.
... View more
08-31-2022
08:59 PM
Hi @Camilo When you share an exception, please also share more details; that helps us provide a solution faster.

1. How are you launching the Spark job?
2. If you built the application using the Maven or sbt build tool, have you specified the spark-hive dependency version? For example:

<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.12</artifactId>
    <version>2.4.8</version>
    <scope>provided</scope>
</dependency>

References:
1. https://stackoverflow.com/questions/39444493/how-to-create-sparksession-with-hive-support-fails-with-hive-classes-are-not-f
2. https://mvnrepository.com/artifact/org.apache.spark/spark-hive
... View more
08-30-2022
11:18 PM
What is the HDP version? If it is HDP 3.x, then you need to use the Hive Warehouse Connector (HWC).
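In case it helps, a hedged PySpark sketch of how HWC is typically used on top of an existing SparkSession named spark; this assumes the HWC jar and its Python zip are supplied on the spark-submit command and the HiveServer2 JDBC URL configs are set as described in the HDP/CDP documentation, and the table name is illustrative:

from pyspark_llap import HiveWarehouseSession

# Build an HWC session on top of the existing SparkSession.
hive = HiveWarehouseSession.session(spark).build()

# Read a Hive managed table through HWC and show a few rows.
hive.executeQuery("SELECT * FROM db.some_table LIMIT 10").show()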
... View more
08-30-2022
06:49 PM
Hi @mala_etl You can find the catalog information at the link below: https://stackoverflow.com/questions/59894454/spark-and-hive-in-hadoop-3-difference-between-metastore-catalog-default-and-spa Could you please confirm whether the table is an internal or external table in Hive, and also verify the data in Hive?
... View more
08-30-2022
04:33 AM
Hi @mala_etl I don't think you mentioned whether you are running the application on CDH, HDP, or CDP. Could you please share your Hive script and check that you are using the Hive catalog instead of the in-memory catalog?
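A quick, hedged way to check that from an existing SparkSession (assumed to be named spark); the default value passed below is only a fallback for when the setting is not present in the session conf:

# Expect 'hive' when Hive support is enabled; 'in-memory' means the Hive catalog is not being used.
print(spark.conf.get("spark.sql.catalogImplementation", "in-memory"))

# With the Hive catalog, the Hive databases should be visible here.
spark.sql("SHOW DATABASES").show()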
... View more
08-30-2022
04:31 AM
Hi @somant Please don't use upstream open-source libraries; use the cluster-supported Spark/Kafka versions instead. Check the following example code: https://community.cloudera.com/t5/Community-Articles/Running-DirectKafkaWordCount-example-in-CDP/ta-p/340402
... View more
08-30-2022
04:25 AM
Hi @MikeCC Spark 3.3 is not yet supported in CDP. We plan to release Spark 3.3 in CDP 7.1.8 or a later version. As per the support matrix below, Java 17 is not yet supported either. https://supportmatrix.cloudera.com/ I hope this answers your question. If yes, please accept it as a solution.
... View more