Member since: 06-02-2020
Posts: 331
Kudos Received: 67
Solutions: 49
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 2792 | 07-11-2024 01:55 AM |
|  | 7839 | 07-09-2024 11:18 PM |
|  | 6554 | 07-09-2024 04:26 AM |
|  | 5892 | 07-09-2024 03:38 AM |
|  | 5589 | 06-05-2024 02:03 AM |
09-14-2021
11:38 PM
1 Kudo
Hi @Seaport As you know, resource managers (RMs) such as YARN, Spark standalone, and Kubernetes create the containers. Internally, the RM uses a shell script to launch each container, and based on the requested resources it may create one or more containers on the same node.
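For illustration, here is a minimal sketch (not from the original thread) of how the per-executor resources you request from Spark determine the YARN container size; the memory, core, and instance values below are assumptions.
```scala
import org.apache.spark.sql.SparkSession

// Each executor runs in its own YARN container; the container size is driven by
// the executor memory (plus memory overhead) and executor cores requested here.
val spark = SparkSession.builder()
  .appName("container-sizing-sketch")
  .config("spark.executor.memory", "4g")    // memory requested per executor container
  .config("spark.executor.cores", "2")      // vcores requested per executor container
  .config("spark.executor.instances", "3")  // number of executor containers
  .getOrCreate()
```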
09-08-2021
03:50 AM
In this tutorial, we will learn how to create Apache Ozone volumes, buckets, and keys. After that, we will see how to create an Apache Hive table on Apache Ozone, and finally how to insert and read the data from Apache Spark.
Ozone
Create the volume with the name vol1.
# ozone sh volume create /vol1
21/08/25 06:23:27 INFO rpc.RpcClient: Creating Volume: vol1, with root as owner.
Create the bucket with the name bucket1 under vol1.
# ozone sh bucket create /vol1/bucket1
21/08/25 06:24:09 INFO rpc.RpcClient: Creating Bucket: vol1/bucket1, with Versioning false and Storage Type set to DISK and Encryption set to false
Hive
Launch the beeline shell.
Create the employee table in Hive.
Note: Replace the om.host.example.com value with your Ozone Manager host.
CREATE DATABASE IF NOT EXISTS ozone_db;
USE ozone_db;
CREATE EXTERNAL TABLE IF NOT EXISTS `employee`(
`id` bigint,
`name` string,
`age` smallint)
STORED AS parquet
LOCATION 'o3fs://bucket1.vol1.om.host.example.com/employee';
Spark
Spark2:
Launch the spark-shell.
spark-shell
Run the following queries to insert and read the data from the Hive employee table.
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (1, "Ranga", 33)""")
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (2, "Nishanth", 3)""")
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (3, "Raja", 59)""")
spark.sql("SELECT * FROM ozone_db.employee").show()
Spark3:
Launch the spark3-shell.
spark3-shell --jars /opt/cloudera/parcels/CDH/lib/hadoop-ozone/hadoop-ozone-filesystem-hadoop3-*.jar
Run the following queries to insert and read the data from the Hive employee table.
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (1, "Ranga", 33)""")
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (2, "Nishanth", 3)""")
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (3, "Raja", 59)""")
spark.sql("SELECT * FROM ozone_db.employee").show()
Kerberized environment
Prerequisites:
Create a user and grant it the required Ranger permissions to create Ozone volumes, buckets, etc.
Run kinit as that user.
Spark2:
Launch the spark-shell.
Note: Before launching spark-shell, replace the om.host.example.com value with your Ozone Manager host.
spark-shell \
--conf spark.yarn.access.hadoopFileSystems=o3fs://bucket1.vol1.om.host.example.com:9862
Run the following queries to insert and read the data from the Hive employee table.
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (1, "Ranga", 33)""")
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (2, "Nishanth", 3)""")
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (3, "Raja", 59)""")
spark.sql("SELECT * FROM ozone_db.employee").show()
Spark3:
Launch the spark3-shell.
Note: Before launching spark3-shell, replace the om.host.example.com value with your Ozone Manager host.
spark3-shell \
--conf spark.kerberos.access.hadoopFileSystems=o3fs://bucket1.vol1.om.host.example.com:9862 \
--jars /opt/cloudera/parcels/CDH/lib/hadoop-ozone/hadoop-ozone-filesystem-hadoop3-*.jar
Run the following queries to insert and read the data from the Hive employee table.
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (1, "Ranga", 33)""")
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (2, "Nishanth", 3)""")
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (3, "Raja", 59)""")
spark.sql("SELECT * FROM ozone_db.employee").show()
Notes:
If you get java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.ozone.OzoneFileSystem not found, then add /opt/cloudera/parcels/CDH/jars/hadoop-ozone-filesystem-hadoop3-*.jar to the Spark classpath using the --jars option.
In a Kerberized environment, we must specify the spark.yarn.access.hadoopFileSystems (for Spark 3, spark.kerberos.access.hadoopFileSystems) configuration; otherwise, the application fails with the following error.
java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
Thanks for reading this article. If you liked it, please give it kudos.
08-31-2021
11:55 PM
Hi @yudh3 Is this application being deployed for the first time, or is it an existing application? If it is the first deployment, you need to tune it according to the kind of operations you are doing. If it is an existing application, has this issue started recently or has it been there for a long time? If it started recently, check whether there has been any data change or any HDFS/Hive issues. Without looking at the logs, it is difficult to tell what the exact issue is. Please go ahead and create a case for this issue and we will work on it.
08-30-2021
07:52 PM
Hi @Sbofa Yes, you are right. Based on the kind value, it decides which kind of Spark shell needs to be started.
08-24-2021
02:22 AM
In this article, we will learn how to register Hive UDFs using the Spark HiveWarehouseSession.
Download and build the Spark Hive UDF example.
git clone https://github.com/rangareddy/spark-hive-udf
cd spark-hive-udf
mvn clean package -DskipTests
Copy the target/spark-hive-udf-1.0.0-SNAPSHOT.jar to the edge node.
Log in to the edge node and upload spark-hive-udf-1.0.0-SNAPSHOT.jar to an HDFS location, for example /tmp.
hdfs dfs -put ./spark-hive-udf-1.0.0-SNAPSHOT.jar /tmp
Launch the spark-shell with the HWC (Hive Warehouse Connector) parameters.
spark-shell \
--jars /opt/cloudera/parcels/CDH/jars/hive-warehouse-connector-assembly-*.jar \
--conf spark.sql.hive.hiveserver2.jdbc.url='jdbc:hive2://hiveserver2_host1:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2' \
--conf spark.sql.hive.hwc.execution.mode=spark \
--conf spark.datasource.hive.warehouse.metastoreUri='thrift://metastore_host:9083' \
--conf spark.datasource.hive.warehouse.load.staging.dir='/tmp' \
--conf spark.datasource.hive.warehouse.user.name=hive \
--conf spark.datasource.hive.warehouse.password=hive \
--conf spark.datasource.hive.warehouse.smartExecution=false \
--conf spark.datasource.hive.warehouse.read.via.llap=false \
--conf spark.datasource.hive.warehouse.read.jdbc.mode=cluster \
--conf spark.datasource.hive.warehouse.read.mode=DIRECT_READER_V2 \
--conf spark.security.credentials.hiveserver2.enabled=false \
--conf spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
Create the HiveWarehouseSession.
import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()
Execute the following statement to register a Hive UDF.
hive.executeUpdate("CREATE FUNCTION uppercase AS 'com.ranga.spark.hive.udf.UpperCaseUDF' USING JAR 'hdfs:///tmp/spark-hive-udf-1.0.0-SNAPSHOT.jar'")
Test the registered function, for example, uppercase.
scala> val data1 = hive.executeQuery("select id, uppercase(name), age, salary from employee")
scala> data1.show()
+---+-----------------------+---+---------+
| id|default.uppercase(name)|age| salary|
+---+-----------------------+---+---------+
| 1| RANGA| 32| 245000.3|
| 2| NISHANTH| 2| 345000.1|
| 3| RAJA| 32|245000.86|
| 4| MANI| 14| 45000.0|
+---+-----------------------+---+---------+
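For reference, a simple Hive UDF like the UpperCaseUDF used above generally looks like the following; this is an illustrative sketch, not the exact class from the spark-hive-udf repository.
```scala
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// Hive calls evaluate() once per row; return null for null input.
class UpperCaseUDF extends UDF {
  def evaluate(input: Text): Text =
    if (input == null) null else new Text(input.toString.toUpperCase)
}
```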
Thanks for reading this article.
08-13-2021
12:51 AM
Maybe you are still asking for more than what is available? It really depends on what kind of cluster you have. It depends on the following parameters:
1) Cloudera Manager > YARN > Configuration > yarn.nodemanager.resource.memory-mb (the amount of physical memory, in MiB, that can be allocated for containers, i.e. all memory that YARN can use on one worker node)
2) yarn.scheduler.minimum-allocation-mb (container memory minimum: every container will request at least this much memory)
3) yarn.nodemanager.resource.cpu-vcores (Container Virtual CPU Cores)
4) How many worker nodes does the cluster have?
I noticed you really are requesting a lot of cores too. Maybe you can try to reduce these a bit? This might also be a bottleneck.
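As a rough back-of-the-envelope sketch (the values below are made-up assumptions, not numbers from this thread), you can estimate how many containers one worker node can host from those parameters:
```scala
// Illustrative values only; substitute your own cluster settings.
val nodeMemoryMb   = 64 * 1024   // yarn.nodemanager.resource.memory-mb
val nodeVcores     = 16          // yarn.nodemanager.resource.cpu-vcores
val containerMemMb = 4 * 1024    // memory requested per container
val containerCores = 2           // vcores requested per container

// A node can host only as many containers as both limits allow.
val byMemory = nodeMemoryMb / containerMemMb
val byCores  = nodeVcores / containerCores
println(s"Containers per node: ${math.min(byMemory, byCores)}")
```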
08-08-2021
11:43 PM
1 Kudo
In this article, we will learn how to pass the atlas-application.properties configuration file from a different location in the spark-submit command.
When the Atlas service is enabled in CDP and we run a Spark application, by default the atlas-application.properties file is picked up from the /etc/spark/conf.cloudera.spark_on_yarn/ directory.
Let's test with the SparkPi example:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10
We can see the following output in the application log.
21/08/23 06:12:03 INFO atlas.ApplicationProperties: Looking for atlas-application.properties in classpath
21/08/23 06:12:03 INFO atlas.ApplicationProperties: Loading atlas-application.properties from file:/etc/spark/conf.cloudera.spark_on_yarn/atlas-application.properties
If we want to pass the atlas-application.properties configuration file from a different location, for example the /tmp directory, copy atlas-application.properties from /etc/spark/conf.cloudera.spark_on_yarn to the /tmp directory and pass it using the -Datlas.conf=/tmp/ system property in spark-submit.
Let's test with the same SparkPi example by adding the --driver-java-options="-Datlas.conf=/tmp/" option to spark-submit.
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client --driver-java-options="-Datlas.conf=/tmp/" /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10
We can see the following output in the application log.
21/08/05 14:36:24 INFO atlas.ApplicationProperties: Looking for atlas-application.properties in classpath
21/08/05 14:36:24 INFO atlas.ApplicationProperties: Loading atlas-application.properties from file:/tmp/atlas-application.properties
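As a quick sanity check (a sketch, not part of the original steps), you can confirm from a spark-shell launched with the same --driver-java-options that the driver actually sees the atlas.conf system property:
```scala
// Prints the directory Atlas will search for atlas-application.properties,
// for example /tmp/ when launched with --driver-java-options="-Datlas.conf=/tmp/".
println(sys.props.getOrElse("atlas.conf", "atlas.conf is not set"))
```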
To run the same SparkPi example in cluster mode, we need to place the atlas-application.properties file in the /tmp directory on all nodes and run the Spark application as follows:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster \
--files /tmp/atlas-application.properties#atlas-application.properties --driver-java-options="-Datlas.conf=/tmp/" \
/opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10
or,
sudo -u spark spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster \
--files /tmp/atlas-application.properties --conf spark.driver.extraJavaOptions="-Datlas.conf=./" \
/opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10
We can see the following output:
21/08/23 06:12:07 INFO atlas.ApplicationProperties: Loading atlas-application.properties from file:/data1/tmp/usercache/spark/appcache/application_1629693759177_0016/container_e74_1629693759177_0016_01_000001/./atlas-application.properties
08-05-2021
09:17 PM
Hi @vnandigam Good news: Spark Atlas integration is now supported on CDP clusters. References: 1. https://docs.cloudera.com/cdp-private-cloud-base/7.1.6/atlas-reference/topics/atlas-spark-metadata-collection.html 2. https://docs.cloudera.com/cdp-private-cloud-upgrade/latest/upgrade-hdp/topics/amb-enable-spark-cm.html
07-30-2021
08:47 AM
Hi @BabaHer From CDP onward, to integrate Spark and HBase, Cloudera recommends using the hbase-spark jar. https://mvnrepository.com/artifact/org.apache.hbase.connectors.spark/hbase-spark?repo=cloudera-repos The latest hbase-spark jar version is 1.0.0.7.2.10.0-148. To integrate Spark3 with HBase, you can find a sample example below: https://kontext.tech/column/spark/628/spark-connect-to-hbase
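For illustration, here is a minimal sketch of reading an HBase table with the hbase-spark connector; the table name employee, the column family cf, and the column mapping below are assumptions for the example.
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hbase-spark-read-sketch").getOrCreate()

// Map the HBase row key and columns to DataFrame columns; adjust to your table.
val df = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.columns.mapping", "id INT :key, name STRING cf:name, age INT cf:age")
  .option("hbase.table", "employee")
  .option("hbase.spark.use.hbasecontext", false)  // read without a pre-created HBaseContext
  .load()

df.show()
```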
07-22-2021
06:01 AM
Hi @SudEl Please try to modify the required parameters (memory and other tuning parameters) in the Spark interpreter settings.