Member since: 06-02-2020
331 Posts
67 Kudos Received
49 Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4097 | 07-11-2024 01:55 AM |
| | 11359 | 07-09-2024 11:18 PM |
| | 8558 | 07-09-2024 04:26 AM |
| | 8573 | 07-09-2024 03:38 AM |
| | 7503 | 06-05-2024 02:03 AM |
10-15-2021
11:55 PM
@LegallyBind For each Python version, you need to create a separate Zeppelin interpreter.
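For example (the interpreter names and paths below are placeholders, not from the original post): create one interpreter per Python version under Interpreter > Create, and point each one at its binary using the zeppelin.python property, e.g. zeppelin.python=/usr/bin/python2.7 for one interpreter and zeppelin.python=/usr/local/bin/python3.6 for the other.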
10-06-2021
09:34 PM
Hi @LegallyBind Please refer to the following tutorial: https://community.cloudera.com/t5/Customer/How-to-use-multiple-versions-of-Python-in-Zeppelin/ta-p/271226
09-14-2021
11:38 PM
1 Kudo
Hi @Seaport As you know, resource managers like YARN, Standalone, and Kubernetes create containers. Internally, the resource manager uses a shell script to launch each container. Based on the requested resources, it will create one or more containers on the same node.
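As an illustration (the numbers here are hypothetical): if you submit with --num-executors 4 --executor-cores 2 --executor-memory 4g on YARN, the resource manager allocates four executor containers plus one container for the ApplicationMaster, and more than one of those containers can be placed on the same node if it has enough free resources.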
09-14-2021
10:46 PM
Hi @Seaport Please check the following example; it may help. https://kontext.tech/column/spark/284/pyspark-convert-json-string-column-to-array-of-object-structtype-in-data-frame
09-08-2021
03:50 AM
In this tutorial, we will learn how to create Apache Ozone volumes, buckets, and keys. After that, we will see how to create an Apache Hive table on Apache Ozone and, finally, how to insert and read the data from Apache Spark.
Ozone
Create the volume with the name vol1. # ozone sh volume create /vol1
21/08/25 06:23:27 INFO rpc.RpcClient: Creating Volume: vol1, with root as owner.
Create the bucket with the name bucket1 under vol1. # ozone sh bucket create /vol1/bucket1
21/08/25 06:24:09 INFO rpc.RpcClient: Creating Bucket: vol1/bucket1, with Versioning false and Storage Type set to DISK and Encryption set to false
Hive
Launch the beeline shell.
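For example (the JDBC URL below is a placeholder; use your own HiveServer2 host): beeline -u 'jdbc:hive2://hiveserver2.host.example.com:10000'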
Create the employee table in Hive.
Note: Update the om.host.example.com value.
CREATE DATABASE IF NOT EXISTS ozone_db;
USE ozone_db;
CREATE EXTERNAL TABLE IF NOT EXISTS `employee`(
`id` bigint,
`name` string,
`age` smallint)
STORED AS parquet
LOCATION 'o3fs://bucket1.vol1.om.host.example.com/employee';
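Optionally, verify that the table location points at Ozone (the Location field should show the o3fs:// path): DESCRIBE FORMATTED employee;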
Spark
Spark2:
Launch spark-shell spark-shell
Run the following query to insert/read the data from the Hive employee table. spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (1, "Ranga", 33)""")
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (2, "Nishanth", 3)""")
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (3, "Raja", 59)""")
spark.sql("SELECT * FROM ozone_db.employee").show()
Spark3:
Launch spark3-shell spark3-shell --jars /opt/cloudera/parcels/CDH/lib/hadoop-ozone/hadoop-ozone-filesystem-hadoop3-*.jar
Run the following query to insert/read the data from the Hive employee table. spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (1, "Ranga", 33)""")
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (2, "Nishanth", 3)""")
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (3, "Raja", 59)""")
spark.sql("SELECT * FROM ozone_db.employee").show()
Kerberized environment
Prerequisites:
Create a user and provide proper Ranger permissions to create Ozone volume and buckets, etc.
kinit with the user.
Spark2:
Launch spark-shell Note: Before launching spark-shell, update the om.host.example.com value. spark-shell \
--conf spark.yarn.access.hadoopFileSystems=o3fs://bucket1.vol1.om.host.example.com:9862
Run the following query to insert/read the data from the Hive employee table. spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (1, "Ranga", 33)""")
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (2, "Nishanth", 3)""")
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (3, "Raja", 59)""")
spark.sql("SELECT * FROM ozone_db.employee").show()
Spark3:
Launch spark3-shell Note: Before launching spark3-shell, update the om.host.example.com value. spark3-shell \
--conf spark.kerberos.access.hadoopFileSystems=o3fs://bucket1.vol1.om.host.example.com:9862 \
--jars /opt/cloudera/parcels/CDH/lib/hadoop-ozone/hadoop-ozone-filesystem-hadoop3-*.jar
Run the following query to insert/read the data from the Hive employee table. spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (1, "Ranga", 33)""")
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (2, "Nishanth", 3)""")
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (3, "Raja", 59)""")
spark.sql("SELECT * FROM ozone_db.employee").show()
Notes:
If you get the java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.ozone.OzoneFileSystem not found, then add the /opt/cloudera/parcels/CDH/jars/hadoop-ozone-filesystem-hadoop3-*.jar to the Spark classpath using the --jars option.
In a Kerberized environment, you must specify the spark.yarn.access.hadoopFileSystems configuration; otherwise, it will display the following error: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
Thanks for reading this article. If you liked this article, you can give kudos.
08-30-2021
11:24 PM
Hi @Seaport Yes, it is required if you want to run Python UDFs or do anything outside Spark SQL operations in your application. If you are just using the Spark SQL API, there is no runtime requirement for Python. If you are going to install Spark3, please check the supported versions below:
- Spark 2.4 supports Python 2.7 and 3.4-3.7.
- Spark 3.0 supports Python 2.7 and 3.4 and higher, although support for Python 2 and 3.4-3.5 is deprecated.
- Spark 3.1 supports Python 3.6 and higher.
CDS Powered by Apache Spark requires one of the following Python versions:
- Python 2.7 or higher, when using Python 2.
- Python 3.4 or higher, when using Python 3 (CDS 2.0 only supports Python 3.4 and 3.5; CDS 2.1 and higher include support for Python 3.6 and higher).
- Python 3.4 or higher, when using Python 3 (CDS 3).
Note: Spark 2.4 is not compatible with Python 3.8; the latest recommended version is Python 3.4+ (https://spark.apache.org/docs/2.4.0/#downloading). The Apache Jira SPARK-29536 related to Python 3.8 support is fixed in Spark3.
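If you need to pin PySpark to a specific interpreter, one common approach (the paths below are placeholders) is to export PYSPARK_PYTHON=/usr/local/bin/python3.6 before launching, or to pass --conf spark.pyspark.python=/usr/local/bin/python3.6 to spark-submit.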
08-30-2021
07:52 PM
Hi @Sbofa Yes, you are right. Based on the kind value, it decides which kind of Spark shell to start.
08-25-2021
08:51 PM
1 Kudo
In this tutorial, we will learn how to create Apache Ozone volumes, buckets, and keys. After that, we will see how we can access Apache Ozone data in Apache Spark.
Ozone
Create the volume with the name vol1 in Apache Ozone. # ozone sh volume create /vol1
21/08/25 06:23:27 INFO rpc.RpcClient: Creating Volume: vol1, with root as owner.
Create the bucket with the name bucket1 under vol1. # ozone sh bucket create /vol1/bucket1
21/08/25 06:24:09 INFO rpc.RpcClient: Creating Bucket: vol1/bucket1, with Versioning false and Storage Type set to DISK and Encryption set to false
Create the employee.csv file to upload to Ozone. # vi /tmp/employee.csv
id,name,age
1,Ranga,33
2,Nishanth,4
3,Raja,60
Upload the employee.csv file to Ozone # ozone sh key put /vol1/bucket1/employee.csv /tmp/employee.csv
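You can verify that the key was created (output will vary by environment): # ozone sh key list /vol1/bucket1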
Add the fs.o3fs.impl property to core-site.xml
Go to Cloudera Manager > HDFS > Configuration > search for core-site.xml > Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml <property>
<name>fs.o3fs.impl</name>
<value>org.apache.hadoop.fs.ozone.OzoneFileSystem</value>
</property>
Display the files created earlier using the hdfs command. Note: Before running the following command, update the om.host.example.com value. hdfs dfs -ls o3fs://bucket1.vol1.om.host.example.com/
Spark
Launch spark-shell. spark-shell
Run the following command to print the employee.csv file content. Note: Update the omHost value. scala> val omHost="om.host.example.com"
scala> val df=spark.read.option("header", "true").option("inferSchema", "true").csv(s"o3fs://bucket1.vol1.${omHost}/employee.csv")
scala> df.show()
+---+--------+---+
| id| name|age|
+---+--------+---+
| 1| Ranga| 33|
| 2|Nishanth| 4|
| 3| Raja| 60|
+---+--------+---+
Kerberized environment
Prerequisites:
Create a user and provide proper Ranger permissions to create Ozone volume and buckets, etc.
kinit with the user
Steps:
Create the Ozone volumes, buckets, and keys as mentioned in the Ozone section.
Launch spark-shell
Replace the KEY_TAB, PRINCIPAL, and om.host.example.com in spark-shell spark-shell \
--keytab ${KEY_TAB} \
--principal ${PRINCIPAL} \
--conf spark.yarn.access.hadoopFileSystems=o3fs://bucket1.vol1.om.host.example.com:9862 Note: In a Kerberized environment, you must specify the spark.yarn.access.hadoopFileSystems configuration; otherwise, it will display the following error: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
Run the following command to print the employee.csv file content. Note: Update the omHost value. scala> val omHost="om.host.example.com"
scala> val df=spark.read.option("header", "true").option("inferSchema", "true").csv(s"o3fs://bucket1.vol1.${omHost}/employee.csv")
scala> df.show()
+---+--------+---+
| id| name|age|
+---+--------+---+
| 1| Ranga| 33|
| 2|Nishanth| 4|
| 3| Raja| 60|
+---+--------+---+
scala> val age30DF = df.filter(df("age") > 30)
scala> val outputPath = s"o3fs://bucket1.vol1.${omHost}/employee_age30.csv"
scala> age30DF.write.option("header", "true").mode("overwrite").csv(outputPath)
scala> val df2=spark.read.option("header", "true").option("inferSchema", "true").csv(outputPath)
scala> df2.show()
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1|Ranga| 33|
| 3| Raja| 60|
+---+-----+---+
Note: If you get the java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.ozone.OzoneFileSystem not found, add the /opt/cloudera/parcels/CDH/jars/hadoop-ozone-filesystem-hadoop3-*.jar to the Spark classpath using the --jars option.
Thanks for reading this article. If you liked this article, you can give kudos.
08-24-2021
02:22 AM
In this article, we will learn how to register a Hive UDF using the Spark HiveWarehouseSession.
Download and build the Spark Hive UDF example. git clone https://github.com/rangareddy/spark-hive-udf
cd spark-hive-udf
mvn clean package -DskipTests
Copy the target/spark-hive-udf-1.0.0-SNAPSHOT.jar to the edge node.
Log in to the edge node and upload the spark-hive-udf-1.0.0-SNAPSHOT.jar to an HDFS location, for example, /tmp. hdfs dfs -put ./spark-hive-udf-1.0.0-SNAPSHOT.jar /tmp
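For context, the uppercase function registered below is a classic Hive UDF: a class that extends Hive's UDF and exposes an evaluate method. A minimal sketch of what such a class might look like (the real implementation lives in the repository above; this sketch is an assumption, not the repo's exact code):
import org.apache.hadoop.hive.ql.exec.UDF

// Hypothetical sketch of com.ranga.spark.hive.udf.UpperCaseUDF
class UpperCaseUDF extends UDF {
  // Hive calls evaluate() once per row; return null for null input
  def evaluate(input: String): String =
    if (input == null) null else input.toUpperCase
}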
Launch the spark-shell with 'hwc' parameters. spark-shell \
--jars /opt/cloudera/parcels/CDH/jars/hive-warehouse-connector-assembly-*.jar \
--conf spark.sql.hive.hiveserver2.jdbc.url='jdbc:hive2://hiveserver2_host1:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2' \
--conf spark.sql.hive.hwc.execution.mode=spark \
--conf spark.datasource.hive.warehouse.metastoreUri='thrift://metastore_host:9083' \
--conf spark.datasource.hive.warehouse.load.staging.dir='/tmp' \
--conf spark.datasource.hive.warehouse.user.name=hive \
--conf spark.datasource.hive.warehouse.password=hive \
--conf spark.datasource.hive.warehouse.smartExecution=false \
--conf spark.datasource.hive.warehouse.read.via.llap=false \
--conf spark.datasource.hive.warehouse.read.jdbc.mode=cluster \
--conf spark.datasource.hive.warehouse.read.mode=DIRECT_READER_V2 \
--conf spark.security.credentials.hiveserver2.enabled=false \
--conf spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
Create the HiveWarehouseSession. import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()
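Optionally, sanity-check the session before registering the UDF (assuming the connection details above are correct for your cluster): scala> hive.showDatabases().show()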
Execute the following statement to register a Hive UDF. hive.executeUpdate("CREATE FUNCTION uppercase AS 'com.ranga.spark.hive.udf.UpperCaseUDF' USING JAR 'hdfs:///tmp/spark-hive-udf-1.0.0-SNAPSHOT.jar'")
Test the registered function, for example, uppercase. scala> val data1 = hive.executeQuery("select id, uppercase(name), age, salary from employee")
scala> data1.show()
+---+-----------------------+---+---------+
| id|default.uppercase(name)|age| salary|
+---+-----------------------+---+---------+
| 1| RANGA| 32| 245000.3|
| 2| NISHANTH| 2| 345000.1|
| 3| RAJA| 32|245000.86|
| 4| MANI| 14| 45000.0|
+---+-----------------------+---+---------+
Thanks for reading this article.
08-11-2021
08:26 PM
Hi @RonyA If you are a Cloudera customer, please create a support case and we will work on this issue.