Member since
06-02-2020
331
Posts
67
Kudos Received
49
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 4098 | 07-11-2024 01:55 AM |
|  | 11365 | 07-09-2024 11:18 PM |
|  | 8564 | 07-09-2024 04:26 AM |
|  | 8585 | 07-09-2024 03:38 AM |
|  | 7507 | 06-05-2024 02:03 AM |
08-08-2021
11:43 PM
1 Kudo
In this article, we will learn to pass the atlas-application.properties configuration file from a different location in the spark-submit command.
When the Atlas service is enabled in CDP and we run a Spark application, the atlas-application.properties file is by default picked up from the /etc/spark/conf.cloudera.spark_on_yarn/ directory.
Let's test with the SparkPi example:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10
We can see the following output in the application log.
21/08/23 06:12:03 INFO atlas.ApplicationProperties: Looking for atlas-application.properties in classpath
21/08/23 06:12:03 INFO atlas.ApplicationProperties: Loading atlas-application.properties from file:/etc/spark/conf.cloudera.spark_on_yarn/atlas-application.properties
If we want to pass the atlas-application.properties configuration file from a different location, for example the /tmp directory, copy atlas-application.properties from /etc/spark/conf.cloudera.spark_on_yarn to /tmp and point to it with the -Datlas.conf=/tmp/ JVM option in spark-submit.
Let's test with the same SparkPi example, adding the --driver-java-options="-Datlas.conf=/tmp/" option to spark-submit.
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client --driver-java-options="-Datlas.conf=/tmp/" /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10
We can see the following output in the application log.
21/08/05 14:36:24 INFO atlas.ApplicationProperties: Looking for atlas-application.properties in classpath
21/08/05 14:36:24 INFO atlas.ApplicationProperties: Loading atlas-application.properties from file:/tmp/atlas-application.properties
To run the same SparkPi example in cluster mode, we need to place the atlas-application.properties file in the /tmp directory on all nodes and run the Spark application as follows:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster \
--files /tmp/atlas-application.properties#atlas-application.properties --driver-java-options="-Datlas.conf=/tmp/" \
/opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10
or,
sudo -u spark spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster \
--files /tmp/atlas-application.properties --conf spark.driver.extraJavaOptions="-Datlas.conf=./" \
/opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10
We can see the following output:
21/08/23 06:12:07 INFO atlas.ApplicationProperties: Loading atlas-application.properties from file:/data1/tmp/usercache/spark/appcache/application_1629693759177_0016/container_e74_1629693759177_0016_01_000001/./atlas-application.properties
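To confirm which atlas-application.properties file the driver actually loaded, you can filter the captured driver output for the ApplicationProperties lines. A minimal sketch: here driver.log is a stand-in file containing the log lines shown above (in a real run, redirect the spark-submit output to a file, or fetch it with yarn logs for the application):

```shell
# Stand-in for the captured spark-submit / YARN driver output.
cat > driver.log <<'EOF'
21/08/23 06:12:03 INFO atlas.ApplicationProperties: Looking for atlas-application.properties in classpath
21/08/23 06:12:03 INFO atlas.ApplicationProperties: Loading atlas-application.properties from file:/tmp/atlas-application.properties
EOF

# Extract the location the properties were actually loaded from.
grep -o 'from file:[^ ]*' driver.log
```

If the override took effect, the extracted path points at /tmp rather than /etc/spark/conf.cloudera.spark_on_yarn.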
07-30-2021
08:57 AM
Hi @RonyA You haven't shared your dataset size. Apart from the data, you need to tune a few Spark parameters:
spark = (SparkSession
    .builder.master("yarn")
    .config("spark.executor.cores", "5")  # you have mentioned 12
    .config("spark.executor.instances", "10")
    .config("spark.executor.memory", "10G")
    .config("spark.executor.memoryOverhead", "2G")  # typically 10-20% of executor memory
    .config("spark.driver.memory", "10G")
    .config("spark.driver.memoryOverhead", "2G")  # typically 10-20% of driver memory
    .config("spark.sql.hive.convertMetastoreOrc", "true")
    .config("spark.executor.heartbeatInterval", "60s")  # default 10s
    .config("spark.network.timeout", "600s")  # default 120s
    .config("spark.driver.maxResultSize", "2g")
    .config("spark.driver.cores", "4")
    .config("spark.executor.extraJavaOptions", "-Dhdp.version=current")
    .config("spark.debug.maxToStringFields", "200")
    .config("spark.sql.catalogImplementation", "hive")
    .config("spark.memory.fraction", "0.8")
    .config("spark.memory.storageFraction", "0.2")
    .config("spark.sql.hive.filesourcePartitionFileCacheSize", "0")
    .config("spark.yarn.maxAppAttempts", "10")
    .appName(app_name)
    .enableHiveSupport()
    .getOrCreate())
Apart from the above, if you are doing any kind of wide operation, a shuffle is involved. To size the shuffle, use the following calculation:
spark.sql.shuffle.partitions = shuffle input size / HDFS block size
For example, if the shuffle input size is 10 GB and the HDFS block size is 128 MB, then 10 GB / 128 MB = 80 partitions.
Also check whether you have enabled dynamic allocation: open the Spark UI --> select the application --> go to the Environment page --> find the spark.dynamicAllocation.enabled property.
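The shuffle-partition rule of thumb above can be sketched in a few lines of Python; shuffle_partitions is a hypothetical helper name, and the sizes match the 10 GB / 128 MB example from the answer:

```python
def shuffle_partitions(shuffle_input_bytes: int, block_size_bytes: int) -> int:
    """Rule of thumb: shuffle input size / HDFS block size,
    rounded up so a trailing partial block still gets its own partition."""
    return -(-shuffle_input_bytes // block_size_bytes)  # ceiling division

GB = 1024 ** 3
MB = 1024 ** 2

# Example from the answer: 10 GB shuffle input, 128 MB HDFS block size.
print(shuffle_partitions(10 * GB, 128 * MB))  # → 80
```

The result would then be set as spark.sql.shuffle.partitions on the session.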
07-30-2021
08:47 AM
Hi @BabaHer From CDP onward, to support Spark with HBase, Cloudera recommends using the hbase-spark jar. https://mvnrepository.com/artifact/org.apache.hbase.connectors.spark/hbase-spark?repo=cloudera-repos The latest hbase-spark jar version is 1.0.0.7.2.10.0-148. To integrate Spark3 with HBase, you can find a sample example here: https://kontext.tech/column/spark/628/spark-connect-to-hbase
07-15-2021
06:50 AM
1 Kudo
Hi @PrernaU
1. By default, CDP uses PAM authentication, so we can remove the following two properties:
pamRealm=org.apache.zeppelin.realm.PamRealm
pamRealm.service=sshd
2. Then configure `admin=admin, admins` under `zeppelin.shiro.user.block`.
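For reference, a sketch of what the resulting Shiro `[users]` entry would look like. In Shiro's ini format the line means: user `admin`, password `admin`, role `admins`; the password shown here is only a placeholder and should be changed:

```ini
[users]
admin = admin, admins
```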
06-28-2021
07:22 PM
Hi @javidshaik Yes, per the Cloudera documentation, running multiple Spark versions under the same Cloudera Manager Server is not supported.
06-28-2021
04:04 AM
Hi @javidshaik I have checked with the internal team. We can migrate from Spark 2.3 to Spark 2.4; the details are in the document below:
Spark 2.4 Release 2 - CDH 5.10 and any higher CDH 5.x versions
Spark 2.4 Release 1 - CDH 5.10 and any higher CDH 5.x versions
https://docs.cloudera.com/documentation/spark2/latest/topics/spark2_requirements.html
But a Spark 2.3 -> 2.4 version change has a higher potential for risk. If you are satisfied with my answer, please Accept as Solution.
06-28-2021
02:39 AM
Hi @javidshaik CDH 5.x and HDP 2.x clusters have reached end of support. It is better to upgrade your cluster to CDH 6.x or CDP 7.x; both will support Spark 2.4. Please refer to the following documentation: https://www.cloudera.com/legal/policies/support-lifecycle-policy.html
06-27-2021
09:04 AM
Hi @roshanbi Please find the difference:
val textFileDF : Dataset[String] = spark.read.textFile("/path") // returns a Dataset[String]
val textFileRDD : RDD[String] = spark.sparkContext.textFile("/path") // returns an RDD[String]
If you are satisfied, please Accept as Solution.
06-25-2021
04:28 PM
1 Kudo
Hi @roshanbi
val ds = Seq(1, 2, 3).toDS()
This creates a sequence of numbers and then converts it into a Dataset. There are multiple ways to create a Dataset; the above is one of them. If you have created a DataFrame with a case class and want to convert it into a Dataset, you can use dataframe.as[ClassName]. Here you can find different ways of creating a Dataset: https://www.educba.com/spark-dataset/ Please let me know if there are any doubts, and Accept as Solution once you are satisfied with the answer.
06-24-2021
08:23 AM
Hi @roshanbi If you are satisfied with my answer, please Accept as Solution.