Member since
06-02-2020
331
Posts
67
Kudos Received
49
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4488 | 07-11-2024 01:55 AM |
| | 12431 | 07-09-2024 11:18 PM |
| | 8900 | 07-09-2024 04:26 AM |
| | 9192 | 07-09-2024 03:38 AM |
| | 7805 | 06-05-2024 02:03 AM |
01-02-2024
03:38 AM
Thanks, @Chandler641. Your issue was resolved after building the Spark code properly. Note: We do not support upstream Spark installations in a Cloudera cluster, because we have made many customizations to Cloudera Spark to support multiple integration components. Please let me know if you have further concerns about this issue.
12-10-2023
10:05 PM
This article delves into the practical aspects of integrating Spark and HBase using Livy, showcasing a comprehensive example that demonstrates the process of reading, processing, and writing data between Spark and HBase. The example utilizes Livy to submit Spark jobs to a YARN cluster, enabling remote execution of Spark applications on HBase data.
Prerequisites:
Apache Spark installed and configured
Apache Livy installed and configured
Apache HBase installed and configured
HBase Spark Connector jar file available
Steps:
This step-by-step guide provides a comprehensive overview of how to integrate Spark and HBase using Livy.
Step 1: Create an HBase Table
Note: If your cluster is kerberized, grant the user the required Ranger HBase permissions and run kinit before continuing.
Connect to your HBase cluster using the HBase shell: hbase shell
Create an HBase table named employees with two column families: per and prof: create 'employees', 'per', 'prof'
Exit the HBase Shell: quit
Step 2: Create the PySpark code
Create a Python file (e.g., hbase_spark_connector_app.py) and add the following code:
import json

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType, ShortType, FloatType


def main():
    spark = SparkSession.builder.appName("HBase Spark Connector App").getOrCreate()

    # Sample employee rows: (id, name, age, salary)
    data = [(1, "Ranga", 34, 15000.5), (2, "Nishanth", 5, 35000.5), (3, "Meena", 30, 25000.5)]
    schema = StructType([
        StructField("id", LongType(), True),
        StructField("name", StringType(), True),
        StructField("age", ShortType(), True),
        StructField("salary", FloatType(), True),
    ])
    employeeDF = spark.createDataFrame(data=data, schema=schema)

    # Catalog mapping DataFrame columns to the HBase row key and column families
    catalog = json.dumps({
        "table": {"namespace": "default", "name": "employees"},
        "rowkey": "key",
        "columns": {
            "id": {"cf": "rowkey", "col": "key", "type": "long"},
            "name": {"cf": "per", "col": "name", "type": "string"},
            "age": {"cf": "per", "col": "age", "type": "short"},
            "salary": {"cf": "prof", "col": "salary", "type": "float"},
        },
    })

    # Write the DataFrame to HBase, then read it back and display it
    employeeDF.write.format("org.apache.hadoop.hbase.spark").options(catalog=catalog).option("hbase.spark.use.hbasecontext", False).save()
    df = spark.read.format("org.apache.hadoop.hbase.spark").options(catalog=catalog).option("hbase.spark.use.hbasecontext", False).load()
    df.show()

    spark.stop()


if __name__ == "__main__":
    main()
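The catalog in the code above maps each DataFrame column to an HBase column family and qualifier. As a minimal sketch, the same JSON could be generated from a plain column list. `build_catalog` is a hypothetical helper introduced here for illustration and is not part of the HBase Spark Connector.

```python
import json


def build_catalog(table, rowkey_column, columns, namespace="default"):
    """Return a catalog JSON string shaped like the one used in the article.

    `columns` is a list of (name, column_family, qualifier, type) tuples.
    The row key column is mapped to the special "rowkey" family; this
    helper is hypothetical and only mirrors the hand-written catalog.
    """
    mapping = {}
    for name, family, qualifier, col_type in columns:
        if name == rowkey_column:
            mapping[name] = {"cf": "rowkey", "col": "key", "type": col_type}
        else:
            mapping[name] = {"cf": family, "col": qualifier, "type": col_type}
    if rowkey_column not in mapping:
        raise ValueError(f"row key column {rowkey_column!r} is not in columns")
    return json.dumps({
        "table": {"namespace": namespace, "name": table},
        "rowkey": "key",
        "columns": mapping,
    })


# Rebuild the employees catalog from the article.
catalog = build_catalog(
    "employees",
    "id",
    [
        ("id", None, None, "long"),
        ("name", "per", "name", "string"),
        ("age", "per", "age", "short"),
        ("salary", "prof", "salary", "float"),
    ],
)
```

The resulting string can be passed to `.options(catalog=catalog)` in the same way as the hand-written one.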
Step 3: Verify the pyspark code using spark-submit
Run the following command to verify that the application works without any issues.
Note:
Update the hbase-spark JAR version(s) to match your cluster's CDP version.
If your cluster is kerberized, run kinit first.
spark-submit \
--master yarn \
--deploy-mode client \
--jars /opt/cloudera/parcels/CDH/jars/hbase-spark-protocol-shaded-1.0.0.7.1.9.0-387.jar,/opt/cloudera/parcels/CDH/jars/hbase-spark-1.0.0.7.1.9.0-387.jar \
hbase_spark_connector_app.py
Step 4: Upload Resources to HDFS
Upload the Python file hbase_spark_connector_app.py and the HBase Spark Connector JAR files to an HDFS directory, for example /tmp:
hdfs dfs -put hbase_spark_connector_app.py /tmp
hdfs dfs -put /opt/cloudera/parcels/CDH/jars/hbase-spark-protocol-shaded-1.0.0.7.1.9.0-387.jar /tmp
hdfs dfs -put /opt/cloudera/parcels/CDH/jars/hbase-spark-1.0.0.7.1.9.0-387.jar /tmp
Step 5: Submit the Spark Job to Livy
Submit the Spark job to Livy using the Livy REST API.
Note: Replace LIVY_SERVER_HOST (for example, localhost) and LIVY_SERVER_PORT (for example, 8998) with the values for your cluster.
Non-kerberized cluster:
curl -k \
-H "Content-Type: application/json" \
-X POST \
-d '{
"file": "/tmp/hbase_spark_connector_app.py",
"name": "Spark HBase Connector Example",
"driverMemory": "1g",
"driverCores": 1,
"executorMemory": "1g",
"executorCores": 1,
"jars" : ["/tmp/hbase-spark-protocol-shaded-1.0.0.7.1.9.0-387.jar","/tmp/hbase-spark-1.0.0.7.1.9.0-387.jar"],
"conf":{
"spark.dynamicAllocation.enabled":"false",
"spark.executor.instances":1
}
}' \
https://<LIVY_SERVER_HOST>:<LIVY_SERVER_PORT>/batches/
Kerberized cluster:
Run the kinit command, then run the following curl command:
curl -k \
--negotiate -u: \
-H "Content-Type: application/json" \
-X POST \
-d '{
"file": "/tmp/hbase_spark_connector_app.py",
"name": "Spark HBase Connector Example",
"driverMemory": "1g",
"driverCores": 1,
"executorMemory": "1g",
"executorCores": 1,
"jars" : ["/tmp/hbase-spark-protocol-shaded-1.0.0.7.1.9.0-387.jar","/tmp/hbase-spark-1.0.0.7.1.9.0-387.jar"],
"conf":{
"spark.dynamicAllocation.enabled":"false",
"spark.executor.instances":1
}
}' \
https://<LIVY_SERVER_HOST>:<LIVY_SERVER_PORT>/batches/
This will submit the Spark job to Livy and execute it on your cluster. You can monitor the job status using the Livy REST API or the Livy web UI.
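The same batch submission could also be scripted in Python. The following is a minimal sketch for the non-kerberized case only, using the standard library; `build_batch_payload` and `submit_batch` are hypothetical helper names, and SPNEGO authentication is not handled.

```python
import json
import urllib.request


def build_batch_payload(app_path, jars, name="Spark HBase Connector Example"):
    """Build the JSON body for POST /batches, mirroring the curl example."""
    return {
        "file": app_path,
        "name": name,
        "driverMemory": "1g",
        "driverCores": 1,
        "executorMemory": "1g",
        "executorCores": 1,
        "jars": list(jars),
        "conf": {
            "spark.dynamicAllocation.enabled": "false",
            "spark.executor.instances": 1,
        },
    }


def submit_batch(livy_url, payload):
    """POST the payload to Livy and return the decoded JSON response.

    Assumes a non-kerberized cluster; Kerberos (SPNEGO) is not handled.
    """
    request = urllib.request.Request(
        livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_batch_payload(
    "/tmp/hbase_spark_connector_app.py",
    [
        "/tmp/hbase-spark-protocol-shaded-1.0.0.7.1.9.0-387.jar",
        "/tmp/hbase-spark-1.0.0.7.1.9.0-387.jar",
    ],
)
# Example call (LIVY_SERVER_HOST and LIVY_SERVER_PORT are placeholders):
# batch = submit_batch("https://LIVY_SERVER_HOST:LIVY_SERVER_PORT", payload)
# print(batch["id"], batch["state"])
```

Unlike `curl -k`, `urllib` verifies the server's TLS certificate by default, so a cluster with a self-signed certificate would need its CA added to the trust store.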
Step 6: Monitor the Livy Job State
To check the Livy job state, run the following command, replacing LIVY_SERVER_HOST, LIVY_SERVER_PORT, and BATCH_ID (generated in Step 5).
Non-kerberized cluster:
curl -k \
-H "Content-Type: application/json" \
-X GET \
https://<LIVY_SERVER_HOST>:<LIVY_SERVER_PORT>/batches/<BATCH_ID>/state
Kerberized cluster:
curl -k \
--negotiate -u: \
-H "Content-Type: application/json" \
-X GET \
https://<LIVY_SERVER_HOST>:<LIVY_SERVER_PORT>/batches/<BATCH_ID>/state
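Rather than re-running the state request by hand, the check can be wrapped in a polling loop. The sketch below assumes the state endpoint returns a JSON body with a `state` field and that `success`, `dead`, `killed`, and `error` are terminal states; `wait_for_batch` is a hypothetical helper, and the HTTP call is injected as a callable so the loop can be shown with canned responses.

```python
import time

# States after which a Livy batch is assumed not to change any more.
TERMINAL_STATES = {"success", "dead", "killed", "error"}


def wait_for_batch(fetch_state, interval_seconds=5.0, max_polls=120, sleep=time.sleep):
    """Poll until the batch reaches a terminal state or max_polls is hit.

    `fetch_state` is any zero-argument callable that returns the parsed
    JSON body of GET /batches/<BATCH_ID>/state.
    """
    state = None
    for _ in range(max_polls):
        state = fetch_state()["state"]
        if state in TERMINAL_STATES:
            return state
        sleep(interval_seconds)
    raise TimeoutError(f"batch still in state {state!r} after {max_polls} polls")


# Demonstration with canned responses instead of a live Livy server.
responses = iter([
    {"id": 0, "state": "starting"},
    {"id": 0, "state": "running"},
    {"id": 0, "state": "success"},
])
final_state = wait_for_batch(lambda: next(responses), sleep=lambda _: None)
```

In real use, `fetch_state` would issue the GET request shown above and decode the JSON response.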
Step 7: Verify the Livy job logs
To view the Livy job logs, run the following command, replacing LIVY_SERVER_HOST, LIVY_SERVER_PORT, and BATCH_ID (generated in Step 5).
Non-kerberized cluster:
curl -k \
-H "Content-Type: application/json" \
-X GET \
https://<LIVY_SERVER_HOST>:<LIVY_SERVER_PORT>/batches/<BATCH_ID>/log
Kerberized cluster:
curl -k \
--negotiate -u: \
-H "Content-Type: application/json" \
-X GET \
https://<LIVY_SERVER_HOST>:<LIVY_SERVER_PORT>/batches/<BATCH_ID>/log
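Once the log response is available, a small script can pick out the lines worth reading. This is a minimal sketch, assuming the response is JSON with a `log` list of lines and that the endpoint accepts `from` and `size` paging parameters; the helper names, the host `livy.example.com`, and the sample log lines are made up for illustration.

```python
from urllib.parse import urlencode


def batch_log_url(livy_url, batch_id, start=0, size=100):
    """Build the URL for GET /batches/<BATCH_ID>/log with paging parameters."""
    query = urlencode({"from": start, "size": size})
    return f"{livy_url.rstrip('/')}/batches/{batch_id}/log?{query}"


def error_lines(log_response, markers=("ERROR", "Exception", "Traceback")):
    """Return the log lines that contain any of the given markers."""
    return [
        line
        for line in log_response.get("log", [])
        if any(marker in line for marker in markers)
    ]


# Canned response for illustration; the log lines are invented.
sample = {
    "id": 0,
    "from": 0,
    "log": [
        "INFO Client: Submitting application to ResourceManager",
        "ERROR SparkContext: Error initializing SparkContext.",
        "INFO ShutdownHookManager: Shutdown hook called",
    ],
}
url = batch_log_url("https://livy.example.com:8998", 0, start=0, size=50)
problems = error_lines(sample)
```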
10-24-2023
09:58 PM
Hi @ali786XI, Jolokia is not part of the Cloudera stack, and we do not support running Spark applications in Standalone mode.
10-24-2023
09:49 PM
Hi @SAMSAL, I think you want to run the Spark application in Standalone mode. Please follow these steps:
1. Install Apache Spark.
2. Start the Standalone master and workers. By default, the master listens on port 7077. Open the Standalone UI and check that all workers are running as expected.
3. Once everything is running as expected, submit the Spark application, specifying the Standalone master host with port 7077.
10-24-2023
09:43 PM
I think you need to verify that the YARN and Spark resources are configured properly. If they are, check the Spark UI, which shows the driver memory and executor memory. If the values are as expected, you can safely ignore it.
10-24-2023
09:36 PM
I think you don't have sufficient resources to run the job in the root.hdfs queue. From the Resource Manager UI, check whether there are any pending or running jobs/applications in the root.hdfs queue. If any are running and not required, kill them. Also verify on the Spark side that you have requested fewer resources while testing.
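The same check can be done without the UI through the ResourceManager REST API. The following is a minimal sketch, assuming the Cluster Applications endpoint (`/ws/v1/cluster/apps`) is reachable and that the queue name your scheduler expects is `root.hdfs`; the helper names, the host `rm.example.com`, and the sample application are made up for illustration.

```python
from urllib.parse import urlencode


def queue_apps_url(rm_url, queue, states=("ACCEPTED", "RUNNING")):
    """Build the ResourceManager URL listing apps in a queue by state."""
    query = urlencode({"queue": queue, "states": ",".join(states)})
    return f"{rm_url.rstrip('/')}/ws/v1/cluster/apps?{query}"


def summarize_apps(response):
    """Return (id, state, allocatedMB) for each app in the parsed response.

    The "apps" value is null when no application matches, so guard for it.
    """
    apps = (response.get("apps") or {}).get("app", [])
    return [(app["id"], app["state"], app.get("allocatedMB")) for app in apps]


# Canned response for illustration; the application below is invented.
sample = {
    "apps": {
        "app": [
            {"id": "application_0000000000000_0001", "state": "RUNNING", "allocatedMB": 4096},
        ]
    }
}
url = queue_apps_url("http://rm.example.com:8088", "root.hdfs")
summary = summarize_apps(sample)
```

Applications that show up here and are no longer needed are the candidates for killing before re-running the job.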
10-19-2023
04:36 PM
1 Kudo
To take part in the processing, the node must have the NodeManager role, along with the Spark Gateway and YARN Gateway roles.
10-17-2023
10:27 PM
Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs.
The table below links to the documentation for each Spark 3 release:
| Spark Version | Documentation Link | Release Date |
|---|---|---|
| Spark 3.0.0 | https://spark.apache.org/docs/3.0.0/ | 2020-06-16 |
| Spark 3.0.1 | https://spark.apache.org/docs/3.0.1/ | 2020-11-05 |
| Spark 3.0.2 | https://spark.apache.org/docs/3.0.2/ | 2021-02-19 |
| Spark 3.0.3 | https://spark.apache.org/docs/3.0.3/ | 2022-06-17 |
| Spark 3.1.1 | https://spark.apache.org/docs/3.1.1/ | 2021-03-02 |
| Spark 3.1.2 | https://spark.apache.org/docs/3.1.2/ | 2022-06-17 |
| Spark 3.1.3 | https://spark.apache.org/docs/3.1.3/ | 2022-06-17 |
| Spark 3.2.0 | https://spark.apache.org/docs/3.2.0/ | 2021-10-13 |
| Spark 3.2.1 | https://spark.apache.org/docs/3.2.1/ | 2022-06-17 |
| Spark 3.2.2 | https://spark.apache.org/docs/3.2.2/ | 2022-07-15 |
| Spark 3.2.3 | https://spark.apache.org/docs/3.2.3/ | 2022-11-28 |
| Spark 3.2.4 | https://spark.apache.org/docs/3.2.4/ | 2023-04-13 |
| Spark 3.3.0 | https://spark.apache.org/docs/3.3.0/ | 2022-06-17 |
| Spark 3.3.1 | https://spark.apache.org/docs/3.3.1/ | 2022-10-25 |
| Spark 3.3.2 | https://spark.apache.org/docs/3.3.2/ | 2023-02-15 |
| Spark 3.3.3 | https://spark.apache.org/docs/3.3.3/ | 2023-08-21 |
| Spark 3.4.0 | https://spark.apache.org/docs/3.4.0/ | 2023-04-13 |
| Spark 3.4.1 | https://spark.apache.org/docs/3.4.1/ | 2023-06-23 |
| Spark 3.5.0 | https://spark.apache.org/docs/3.5.0/ | 2023-09-13 |
10-05-2023
01:44 AM
Hi Team, Livy3 integration with Zeppelin is not yet supported. To use Spark 3, you need to install Python 3 and add the following parameters: PYSPARK3_PYTHON and spark.yarn.appMasterEnv.PYSPARK3_PYTHON. Reference: https://docs.cloudera.com/cdp-private-cloud-base/7.1.8/running-spark-applications/topics/spark-python-path-variables-livy.html
10-02-2023
11:13 PM
Hi @pranav007, the required setup is a little complex. You can try copying the core-site.xml, hdfs-site.xml, yarn-site.xml, hive-site.xml, mapred-site.xml, and krb_configuration files to the resource folder. In the Spark code, you need to add two parameters, spark.driver.extraJavaOptions and spark.executor.extraJavaOptions, specifying the krb_file location:
--conf spark.driver.extraJavaOptions="-Djava.security.krb5.conf=KRB5_PATH" \
--conf spark.executor.extraJavaOptions="-Djava.security.krb5.conf=KRB5_PATH" \
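When the job is launched from a script, the two `--conf` options can be assembled programmatically so that both always point at the same krb5 file. This is a minimal sketch; `kerberos_submit_command` is a hypothetical helper, and `my_app.py` and `/etc/krb5.conf` are placeholder values, not paths taken from the original question.

```python
import shlex


def kerberos_submit_command(app, krb5_path, master="yarn", deploy_mode="client"):
    """Return a spark-submit argument list that sets the krb5.conf location
    for both the driver and the executors."""
    java_opt = f"-Djava.security.krb5.conf={krb5_path}"
    return [
        "spark-submit",
        "--master", master,
        "--deploy-mode", deploy_mode,
        "--conf", f"spark.driver.extraJavaOptions={java_opt}",
        "--conf", f"spark.executor.extraJavaOptions={java_opt}",
        app,
    ]


# Placeholder application name and krb5.conf path.
command = kerberos_submit_command("my_app.py", "/etc/krb5.conf")
# shlex.join renders the list as a single shell-safe command line.
command_line = shlex.join(command)
```

The list form can be passed directly to `subprocess.run(command)`, which avoids shell quoting issues.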