About RangaReddy

VidyaSargur · ‎01-17-2024

@Nardoleo, did any of the responses help resolve your query? If so, kindly mark the relevant reply as the solution so that others can find the answer more easily in the future.

cardozogp · ‎01-09-2024

As I was already using the Hadoop Credential Provider, I found a solution that does not require decrypting the password, as follows.

PySpark code:

    from pyspark.sql import SparkSession

    # Spark session
    spark = SparkSession.builder \
        .config("spark.yarn.keytab", "/etc/security/keytabs/<APPLICATION_USER>.keytab") \
        .appName('SPARK_TEST') \
        .master("yarn") \
        .getOrCreate()

    credential_provider_path = 'jceks://hdfs/<PATH>/<CREDENTIAL_FILE>.jceks'
    credential_name = 'PASSWORD.ALIAS'

    # Hadoop credential
    conf = spark.sparkContext._jsc.hadoopConfiguration()
    conf.set('hadoop.security.credential.provider.path', credential_provider_path)
    credential_raw = conf.getPassword(credential_name)

    # getPassword returns a Java char array; build the Python string from it
    password = ''
    for i in range(credential_raw.__len__()):
        password = password + str(credential_raw.__getitem__(i))

The important point above is the .config() line in SparkSession. You must supply the keytab to access the password; otherwise you will get the encrypted value.

I can't say that I'm very happy with being able to manipulate the password value directly in the code. I would like to delegate this to some component so that the programmer does not have direct access to the password value. Maybe what I'm looking for is some kind of authentication provider, but for now the solution above works for me.
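The character-by-character loop in the post can be wrapped in a small helper, which also makes the "alias not found" case explicit. This is only a sketch: the helper name `chars_to_str` is hypothetical, and it assumes the object returned by `conf.getPassword(...)` supports `len()` and indexing (as the loop in the post implies) and is `None` when the alias is missing.

```python
def chars_to_str(chars):
    """Join a sequence of single characters into one Python string.

    `chars` is assumed to behave like the Java char[] that py4j hands back:
    it supports len() and integer indexing. None is treated as "alias not found".
    """
    if chars is None:
        raise ValueError("credential alias not found in the provider")
    return "".join(str(chars[i]) for i in range(len(chars)))
```

With the session from the post, the call would look like `password = chars_to_str(conf.getPassword(credential_name))`. The password still ends up in a Python variable, so this tidies the code but does not address the concern about exposing the value to the programmer.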

RangaReddy · ‎01-02-2024

Thanks, @Chandler641. Your issue was resolved after building the Spark code properly. Note: We do not support upstream Spark installations on Cloudera clusters, because we have made many customizations to Cloudera Spark to support multiple integration components. Please let me know if you have further concerns about this issue.

RangaReddy · ‎12-19-2023

We are not sure whether this kind of scenario will be supported. If you are a Cloudera customer, you can ask your account team to engage the Cloudera Professional Services team to support this feature.

RangaReddy · ‎12-10-2023

This article walks through a practical example of integrating Spark and HBase using Livy: reading, processing, and writing data between Spark and HBase. The example uses Livy to submit Spark jobs to a YARN cluster, enabling remote execution of Spark applications on HBase data.

Prerequisites:

- Apache Spark installed and configured
- Apache Livy installed and configured
- Apache HBase installed and configured
- HBase Spark Connector jar file available

Steps:

Step 1: Create an HBase Table

Note: If your cluster is kerberized, grant the user the proper Ranger HBase permissions and run kinit first.

Connect to your HBase cluster using the HBase shell:

    hbase shell

Create an HBase table named employees with two column families, per and prof:

    create 'employees', 'per', 'prof'

Exit the HBase shell:

    quit

Step 2: Create the PySpark Code

Create a Python file (e.g., hbase_spark_connector_app.py) and add the following code:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, LongType, ShortType, FloatType
    import json

    def main():
        spark = SparkSession.builder.appName("HBase Spark Connector App").getOrCreate()

        # Sample employee rows: (id, name, age, salary)
        data = [(1, "Ranga", 34, 15000.5), (2, "Nishanth", 5, 35000.5), (3, "Meena", 30, 25000.5)]

        schema = StructType([
            StructField("id", LongType(), True),
            StructField("name", StringType(), True),
            StructField("age", ShortType(), True),
            StructField("salary", FloatType(), True)
        ])

        employeeDF = spark.createDataFrame(data=data, schema=schema)

        # Catalog mapping DataFrame columns to the HBase row key and column families
        catalog = json.dumps({
            "table": {"namespace": "default", "name": "employees"},
            "rowkey": "key",
            "columns": {
                "id": {"cf": "rowkey", "col": "key", "type": "long"},
                "name": {"cf": "per", "col": "name", "type": "string"},
                "age": {"cf": "per", "col": "age", "type": "short"},
                "salary": {"cf": "prof", "col": "salary", "type": "float"}
            }
        })

        # Write the DataFrame to HBase
        employeeDF.write.format("org.apache.hadoop.hbase.spark").options(catalog=catalog).option("hbase.spark.use.hbasecontext", False).save()

        # Read the data back from HBase
        df = spark.read.format("org.apache.hadoop.hbase.spark").options(catalog=catalog).option("hbase.spark.use.hbasecontext", False).load()
        df.show()

        spark.stop()

    if __name__ == "__main__":
        main()

Step 3: Verify the PySpark Code Using spark-submit

Run the following command to verify that the application works without any issues.

Note: Update the hbase-spark jar version(s) to match your cluster's CDP version. If your cluster is kerberized, run kinit first.

    spark-submit \
      --master yarn \
      --deploy-mode client \
      --jars /opt/cloudera/parcels/CDH/jars/hbase-spark-protocol-shaded-1.0.0.7.1.9.0-387.jar,/opt/cloudera/parcels/CDH/jars/hbase-spark-1.0.0.7.1.9.0-387.jar \
      hbase_spark_connector_app.py

Step 4: Upload Resources to HDFS

Upload the Python file hbase_spark_connector_app.py and the HBase Spark Connector JAR files to an HDFS directory, for example /tmp:

    hdfs dfs -put hbase_spark_connector_app.py /tmp
    hdfs dfs -put /opt/cloudera/parcels/CDH/jars/hbase-spark-protocol-shaded-1.0.0.7.1.9.0-387.jar /tmp
    hdfs dfs -put /opt/cloudera/parcels/CDH/jars/hbase-spark-1.0.0.7.1.9.0-387.jar /tmp

Step 5: Submit the Spark Job to Livy

Submit the Spark job to Livy using the Livy REST API.

Note: Replace LIVY_SERVER_HOST (for example, localhost) and LIVY_SERVER_PORT (for example, 8998) with your own values.

Non-kerberized cluster:

    curl -k \
      -H "Content-Type: application/json" \
      -X POST \
      -d '{
        "file": "/tmp/hbase_spark_connector_app.py",
        "name": "Spark HBase Connector Example",
        "driverMemory": "1g",
        "driverCores": 1,
        "executorMemory": "1g",
        "executorCores": 1,
        "jars": ["/tmp/hbase-spark-protocol-shaded-1.0.0.7.1.9.0-387.jar", "/tmp/hbase-spark-1.0.0.7.1.9.0-387.jar"],
        "conf": {
          "spark.dynamicAllocation.enabled": "false",
          "spark.executor.instances": 1
        }
      }' \
      https://<LIVY_SERVER_HOST>:<LIVY_SERVER_PORT>/batches/

Kerberized cluster: run the kinit command, then run the following curl command:

    curl -k \
      --negotiate -u: \
      -H "Content-Type: application/json" \
      -X POST \
      -d '{
        "file": "/tmp/hbase_spark_connector_app.py",
        "name": "Spark HBase Connector Example",
        "driverMemory": "1g",
        "driverCores": 1,
        "executorMemory": "1g",
        "executorCores": 1,
        "jars": ["/tmp/hbase-spark-protocol-shaded-1.0.0.7.1.9.0-387.jar", "/tmp/hbase-spark-1.0.0.7.1.9.0-387.jar"],
        "conf": {
          "spark.dynamicAllocation.enabled": "false",
          "spark.executor.instances": 1
        }
      }' \
      https://<LIVY_SERVER_HOST>:<LIVY_SERVER_PORT>/batches/

This submits the Spark job to Livy and executes it on your cluster. You can monitor the job status using the Livy REST API or the Livy web UI.

Step 6: Monitor the Livy Job State

To check the Livy job state, run the following command, replacing LIVY_SERVER_HOST, LIVY_SERVER_PORT, and BATCH_ID (generated in Step 5).

Non-kerberized cluster:

    curl -k \
      -H "Content-Type: application/json" \
      -X GET \
      https://<LIVY_SERVER_HOST>:<LIVY_SERVER_PORT>/batches/<BATCH_ID>/state

Kerberized cluster:

    curl -k \
      --negotiate -u: \
      -H "Content-Type: application/json" \
      -X GET \
      https://<LIVY_SERVER_HOST>:<LIVY_SERVER_PORT>/batches/<BATCH_ID>/state

Step 7: Verify the Livy Job Logs

To view the Livy job logs, run the following command, replacing LIVY_SERVER_HOST, LIVY_SERVER_PORT, and BATCH_ID (generated in Step 5).

Non-kerberized cluster:

    curl -k \
      -H "Content-Type: application/json" \
      -X GET \
      https://<LIVY_SERVER_HOST>:<LIVY_SERVER_PORT>/batches/<BATCH_ID>/log

Kerberized cluster:

    curl -k \
      --negotiate -u: \
      -H "Content-Type: application/json" \
      -X GET \
      https://<LIVY_SERVER_HOST>:<LIVY_SERVER_PORT>/batches/<BATCH_ID>/log
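For readers who prefer to script the submit and monitor steps rather than run curl by hand, the sketch below uses only the Python standard library. It is an illustration and not part of the original article: the function names (`build_batch_payload`, `http_json`, `wait_for_batch`) are hypothetical, the set of terminal states is an assumption about which Livy batch states mean "finished", and it handles only the non-kerberized case (no SPNEGO negotiation, and unlike `curl -k` it does not skip TLS certificate verification).

```python
import json
import time
import urllib.request

# Assumed set of batch states after which polling can stop.
TERMINAL_STATES = {"success", "dead", "killed", "error"}


def build_batch_payload(app_path, jars, name="Spark HBase Connector Example"):
    """Build the same JSON body that the curl command in Step 5 sends."""
    return {
        "file": app_path,
        "name": name,
        "driverMemory": "1g",
        "driverCores": 1,
        "executorMemory": "1g",
        "executorCores": 1,
        "jars": list(jars),
        "conf": {
            "spark.dynamicAllocation.enabled": "false",
            "spark.executor.instances": 1,
        },
    }


def http_json(url, payload=None):
    """POST `payload` as JSON if given, otherwise GET; return the decoded JSON reply."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="GET" if payload is None else "POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_for_batch(base_url, batch_id, fetch=http_json, interval=5.0, max_polls=120):
    """Poll /batches/<id>/state until a terminal state is reported, then return it."""
    for _ in range(max_polls):
        state = fetch(f"{base_url}/batches/{batch_id}/state")["state"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(interval)
    raise TimeoutError(f"batch {batch_id} did not finish after {max_polls} polls")
```

A typical flow would be `batch = http_json(base_url + "/batches", build_batch_payload(...))` followed by `wait_for_batch(base_url, batch["id"])`, where `base_url` is `https://<LIVY_SERVER_HOST>:<LIVY_SERVER_PORT>`. The `fetch` parameter exists so the polling logic can be exercised without a running Livy server.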

HadoopCommunity · ‎11-19-2023

Please give me a solution for this.

RangaReddy · ‎10-24-2023

Hi @ali786XI, Jolokia is not part of the Cloudera stack, and we do not support running Spark applications in Standalone mode.

RangaReddy · ‎10-24-2023

Hi @myzard, I think you need to verify that the following are set properly:

1. SPARK_HOME path
2. Python environment paths
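The two checks above can be scripted. The sketch below is one possible way to do it, not an official diagnostic: the function name `check_spark_env` is hypothetical, and it assumes "Python environment paths" refers to the `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables, which may not be the variables relevant to your setup.

```python
import os
import shutil


def check_spark_env(environ=None):
    """Return a list of human-readable problems found in the Spark-related environment."""
    environ = os.environ if environ is None else environ
    problems = []

    # SPARK_HOME must be set and point at an existing directory.
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    elif not os.path.isdir(spark_home):
        problems.append(f"SPARK_HOME does not point to a directory: {spark_home}")

    # If a Python interpreter is configured for PySpark, it must be resolvable.
    for var in ("PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON"):
        exe = environ.get(var)
        if exe and shutil.which(exe) is None:
            problems.append(f"{var} is set but cannot be executed: {exe}")

    return problems


if __name__ == "__main__":
    for problem in check_spark_env() or ["no problems found"]:
        print(problem)
```

Run it as the same user and in the same shell (or interpreter configuration) that launches PySpark, since these variables are often set per user or per service.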

RangaReddy · ‎10-24-2023

Hi @SAMSAL, I think you want to run the Spark application in Standalone mode. Please follow these steps:

1. Install Apache Spark.
2. Start the Standalone master and workers. By default, the master listens on port 7077. Open the Standalone UI and check that all workers are running as expected.
3. Once everything is running as expected, submit the Spark application, specifying the Standalone master host with port 7077.
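As a small illustration of step 3, the sketch below assembles the spark-submit command line for a Standalone master. The helper names and the example host and file names are hypothetical; only the `spark://host:port` master URL format and the `--master` option of spark-submit are taken as given.

```python
import shlex

# Default port of the Spark Standalone master.
DEFAULT_MASTER_PORT = 7077


def standalone_master_url(host, port=DEFAULT_MASTER_PORT):
    """Build the master URL that Standalone mode expects."""
    return f"spark://{host}:{port}"


def spark_submit_command(app_path, host, port=DEFAULT_MASTER_PORT):
    """Return the spark-submit argument list for an application on a Standalone master."""
    return ["spark-submit", "--master", standalone_master_url(host, port), app_path]


if __name__ == "__main__":
    # "master-host" and "my_app.py" are placeholders for your own values.
    print(shlex.join(spark_submit_command("my_app.py", "master-host")))
```

The argument list can be passed to `subprocess.run` directly, or printed and pasted into a shell as shown.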

RangaReddy · ‎10-24-2023

Have you tried restarting the cluster and its services? I think you need to keep all services running, because each service depends, directly or indirectly, on other services.


Member Since	‎06-02-2020 05:25 AM
Last Visited	‎08-29-2024 03:41 AM
Posts	331
Kudos received	65

Cloudera Community

Re: Icebreg on CDP private cloud 7.1.9

Re: How to set default time zone/local time for Sp...

Re: Load Iceberg Table on PowerBI Desktop

Re: NoClassDefFoundError due to Incompatible Spark...

Re: Creating Iceberg table

Re: Delay with Spark application

Re: Password secure way to use Spark JDBC

Re: Spark version is empty in CDH6.3.2

Re: Spark master shuts down when one of zookeeper ...

Streamlining Data Processing with Spark HBase Inte...

Re: I want to enable ACL for hadoop users

Re: Is there a way to configure Apache Spark runni...

Re: (Zeppelin) pyspark is not responding

Re: spark continously running with exit code 1

Re: Spark-shell keeps getting stuck!