Member since: 06-02-2020
Posts: 331
Kudos Received: 64
Solutions: 49
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 802 | 07-11-2024 01:55 AM |
| | 2161 | 07-09-2024 11:18 PM |
| | 2030 | 07-09-2024 04:26 AM |
| | 1540 | 07-09-2024 03:38 AM |
| | 1734 | 06-05-2024 02:03 AM |
01-31-2024
06:57 AM
Hi @Taries As I mentioned previously, only the hbase.spark.query.timerange parameters can be used to filter data during the read; the hbase.spark.scan parameter is not used for this purpose. To filter the data after reading, you can apply a Spark where or filter clause with your desired conditions.
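For illustration, a minimal Scala sketch of filtering after the read; the JSON catalog below reuses a sample employees schema, so adapt the column families, columns, and conditions to your own table:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("HBaseReadThenFilter").getOrCreate()

// JSON table catalog for the hbase-spark connector (sample employees schema;
// adapt to your own table).
val catalog =
  """{"table":{"namespace":"default","name":"employees"},"rowkey":"key",
    |"columns":{"id":{"cf":"rowkey","col":"key","type":"long"},
    |"name":{"cf":"per","col":"name","type":"string"},
    |"age":{"cf":"per","col":"age","type":"short"},
    |"salary":{"cf":"prof","col":"salary","type":"float"}}}""".stripMargin

val df = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("catalog", catalog)
  .option("hbase.spark.use.hbasecontext", "false")
  .load()

// Filter after the read with a regular Spark where/filter clause.
val filtered = df.where(col("age") > 30).filter(col("salary") > 20000)
filtered.show()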
01-31-2024
12:34 AM
1 Kudo
Hi @Taries You need to use the following two parameters to apply the filter: hbase.spark.query.timerange.start and hbase.spark.query.timerange.end. Reference: https://github.com/apache/hbase-connectors/blob/307607cf7287084b3ce49cdd96d094e2ede9363a/spark/hbase-spark/src/main/scala/org/apache/hadoop/hbase/spark/datasources/HBaseSparkConf.scala#L65
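As a minimal Scala sketch of a read restricted by time range (the catalog is the same placeholder JSON table catalog as in the sketch above, and the timestamp values are illustrative epoch milliseconds, not values from this thread):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("HBaseTimeRangeRead").getOrCreate()

// "catalog" is the same placeholder JSON table catalog defined in the sketch above.
val df = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("catalog", catalog)
  .option("hbase.spark.use.hbasecontext", "false")
  // Only cells whose HBase timestamps fall within this range are read.
  // The values are epoch milliseconds and purely illustrative.
  .option("hbase.spark.query.timerange.start", "1704067200000")
  .option("hbase.spark.query.timerange.end", "1706745600000")
  .load()

df.show()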
01-22-2024
09:51 PM
Hi @sind65 Could you please clarify a few things before we proceed with any solution:
1. Are you using CDP or CDH?
2. How are you submitting the Spark job?
3. Are you using kinit, or have you specified a principal/keytab?
4. Have you followed the following documentation to set up and run the sample example: https://community.cloudera.com/t5/Community-Articles/Spark-Ozone-Integration-in-CDP/ta-p/323132
01-02-2024
03:38 AM
Thanks @Chandler641 Glad to hear your issue is resolved after building the Spark code properly. Note: We do not support upstream Spark installations in a Cloudera cluster because we have made a lot of customisations in Cloudera Spark to support multiple integration components. Please let me know if you have further concerns on this issue.
12-28-2023
07:31 AM
Hi @Chandler641 You're correct that Spark 2.4.0 is the version compatible with CDH 6.3.2. Have you checked that the Spark setup is done properly? Could you also share how you downloaded and installed the CDH 6.3.2 cluster, because CDH/HDP cluster support has ended. References: https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11/2.4.0-cdh6.3.2
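As a quick sanity check, you could confirm which Spark and Scala builds the cluster is actually running, for example by pasting the following into spark-shell on a gateway node (a CDH 6.3.2 build is expected to report a 2.4.0-cdh6.3.2 version string, matching the artifact linked above):
// Print the Spark version that is actually on the classpath;
// on CDH 6.3.2 this should be a 2.4.0-cdh6.3.2 build.
println(spark.version)
// Print the Scala version Spark was built with (2.11.x for CDH 6 Spark).
println(scala.util.Properties.versionString)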
12-27-2023
04:37 PM
1 Kudo
There is a two-step process to achieve your scenario: encrypt the password externally and decrypt it within the Spark code.
Step 1: Encrypt the password
Choose a cryptography library: use a well-established library such as Jasypt, Apache Shiro, Spring Security, or Java's built-in javax.crypto package.
Encrypt the password: generate a strong encryption key and store it securely (e.g., in an environment variable or key vault).
val password = "YOUR_PASSWORD"
val saltKey = "YOUR_SALT_KEY"
val encryptedPassword = EncryptionUtil.encrypt(password, saltKey)
Step 2: Decrypt the password inside the Spark code and read the JDBC data
Decrypt the password: read the encrypted password and salt value from a properties file, an environment variable, or command-line arguments, and decrypt it before setting the JDBC options.
val encryptedPassword = sys.env("ENCRYPTED_PASSWORD")
val saltKey = sys.env("SALT_KEY")
val decryptedPassword = EncryptionUtil.decrypt(encryptedPassword, saltKey)
Set the JDBC options: provide the JDBC URL, driver class, username, and the decrypted password.
val options = Map(
  "url" -> "jdbc:mysql://your_database_url",
  "driver" -> "com.mysql.jdbc.Driver",
  "user" -> "your_username",
  "password" -> decryptedPassword,
  "dbtable" -> "your_table"
)
val df = spark.read.format("jdbc").options(options).load()
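The EncryptionUtil used above is not a built-in class. If you do not already have such a helper, a minimal sketch based on Java's javax.crypto package (AES with a SHA-256-derived key) could look like the following; treat it as an illustration under those assumptions, not a production-ready utility:
import java.security.MessageDigest
import java.util.Base64
import javax.crypto.Cipher
import javax.crypto.spec.SecretKeySpec

object EncryptionUtil {
  // Derive a 128-bit AES key from the salt/key string. Illustrative only;
  // for production use a proper KDF (e.g. PBKDF2) and an authenticated cipher mode.
  private def aesKey(saltKey: String): SecretKeySpec = {
    val digest = MessageDigest.getInstance("SHA-256").digest(saltKey.getBytes("UTF-8"))
    new SecretKeySpec(digest.take(16), "AES")
  }

  def encrypt(plainText: String, saltKey: String): String = {
    val cipher = Cipher.getInstance("AES/ECB/PKCS5Padding")
    cipher.init(Cipher.ENCRYPT_MODE, aesKey(saltKey))
    Base64.getEncoder.encodeToString(cipher.doFinal(plainText.getBytes("UTF-8")))
  }

  def decrypt(encryptedText: String, saltKey: String): String = {
    val cipher = Cipher.getInstance("AES/ECB/PKCS5Padding")
    cipher.init(Cipher.DECRYPT_MODE, aesKey(saltKey))
    new String(cipher.doFinal(Base64.getDecoder.decode(encryptedText)), "UTF-8")
  }
}
In a real deployment you would keep the key material outside the code (environment variable or key vault, as noted above) and prefer an authenticated mode such as AES/GCM.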
12-19-2023
03:29 AM
We are not sure this kind of scenario will be supported. If you are a Cloudera customer, you can check with your account team about engaging the Cloudera Professional Services team to support this feature.
12-19-2023
03:23 AM
Hi @Nardoleo It is very difficult to provide an answer here because we don't know what each application is doing. One thing you can do is check when each application starts and ends. In between, check how much data you are processing and which Spark job is taking the most time; then you can try to tune resources and apply some code optimizations.
12-10-2023
10:05 PM
This article delves into the practical aspects of integrating Spark and HBase using Livy, showcasing a comprehensive example that demonstrates the process of reading, processing, and writing data between Spark and HBase. The example utilizes Livy to submit Spark jobs to a YARN cluster, enabling remote execution of Spark applications on HBase data.
Prerequisites:
Apache Spark installed and configured
Apache Livy installed and configured
Apache HBase installed and configured
HBase Spark Connector jar file available
Steps:
This step-by-step guide provides a comprehensive overview of how to integrate Spark and HBase using Livy.
Step 1: Create an HBase Table
Note: If your cluster is kerberized, you need to grant the user the proper Ranger HBase permissions and run kinit.
Connect to your HBase cluster using the HBase shell: hbase shell
Create an HBase table named employees with two column families: per and prof: create 'employees', 'per', 'prof'
Exit the HBase Shell: quit
Step 2: Create the PySpark code
Create a Python file (e.g., hbase_spark_connector_app.py) and add the following code:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType, ShortType, FloatType
import json

def main():
    spark = SparkSession.builder.appName("HBase Spark Connector App").getOrCreate()

    data = [(1, "Ranga", 34, 15000.5), (2, "Nishanth", 5, 35000.5), (3, "Meena", 30, 25000.5)]
    schema = StructType([
        StructField("id", LongType(), True),
        StructField("name", StringType(), True),
        StructField("age", ShortType(), True),
        StructField("salary", FloatType(), True)
    ])
    employeeDF = spark.createDataFrame(data=data, schema=schema)

    catalog = json.dumps({
        "table": {"namespace": "default", "name": "employees"},
        "rowkey": "key",
        "columns": {
            "id": {"cf": "rowkey", "col": "key", "type": "long"},
            "name": {"cf": "per", "col": "name", "type": "string"},
            "age": {"cf": "per", "col": "age", "type": "short"},
            "salary": {"cf": "prof", "col": "salary", "type": "float"}
        }
    })

    employeeDF.write.format("org.apache.hadoop.hbase.spark").options(catalog=catalog).option("hbase.spark.use.hbasecontext", False).save()

    df = spark.read.format("org.apache.hadoop.hbase.spark").options(catalog=catalog).option("hbase.spark.use.hbasecontext", False).load()
    df.show()

    spark.stop()

if __name__ == "__main__":
    main()
Step 3: Verify the PySpark code using spark-submit
Run the following command to verify that the application works without any issues.
Note:
Based on your cluster's CDP version, the hbase-spark jar version(s) may need to be updated.
If your cluster is kerberized, run kinit first.
spark-submit \
--master yarn \
--deploy-mode client \
--jars /opt/cloudera/parcels/CDH/jars/hbase-spark-protocol-shaded-1.0.0.7.1.9.0-387.jar,/opt/cloudera/parcels/CDH/jars/hbase-spark-1.0.0.7.1.9.0-387.jar \
hbase_spark_connector_app.py
Step 4: Upload Resources to HDFS
Upload the hbase_spark_connector_app.py file and the HBase Spark Connector JAR files to an HDFS directory, for example /tmp:
hdfs dfs -put hbase_spark_connector_app.py /tmp
hdfs dfs -put /opt/cloudera/parcels/CDH/jars/hbase-spark-protocol-shaded-1.0.0.7.1.9.0-387.jar /tmp
hdfs dfs -put /opt/cloudera/parcels/CDH/jars/hbase-spark-1.0.0.7.1.9.0-387.jar /tmp
Step 5: Submit the Spark Job to Livy
Submit the Spark job to Livy using the Livy REST API.
Note: Replace the LIVY_SERVER_HOST (for example, localhost) and LIVY_SERVER_PORT (for example, 8998) values.
Non-kerberized cluster:
curl -k \
-H "Content-Type: application/json" \
-X POST \
-d '{
"file": "/tmp/hbase_spark_connector_app.py",
"name": "Spark HBase Connector Example",
"driverMemory": "1g",
"driverCores": 1,
"executorMemory": "1g",
"executorCores": 1,
"jars" : ["/tmp/hbase-spark-protocol-shaded-1.0.0.7.1.9.0-387.jar","/tmp/hbase-spark-1.0.0.7.1.9.0-387.jar"],
"conf":{
"spark.dynamicAllocation.enabled":"false",
"spark.executor.instances":1
}
}' \
https://<LIVY_SERVER_HOST>:<LIVY_SERVER_PORT>/batches/
Kerberized cluster:
Run the kinit command, then run the following curl command:
curl -k \
--negotiate -u: \
-H "Content-Type: application/json" \
-X POST \
-d '{
"file": "/tmp/hbase_spark_connector_app.py",
"name": "Spark HBase Connector Example",
"driverMemory": "1g",
"driverCores": 1,
"executorMemory": "1g",
"executorCores": 1,
"jars" : ["/tmp/hbase-spark-protocol-shaded-1.0.0.7.1.9.0-387.jar","/tmp/hbase-spark-1.0.0.7.1.9.0-387.jar"],
"conf":{
"spark.dynamicAllocation.enabled":"false",
"spark.executor.instances":1
}
}' \
https://<LIVY_SERVER_HOST>:<LIVY_SERVER_PORT>/batches/
This will submit the Spark job to Livy and execute it on your cluster. You can monitor the job status using the Livy REST API or the Livy web UI.
Step 6: Monitor the Livy Job State
To verify the Livy job state, run the following command, replacing LIVY_SERVER_HOST, LIVY_SERVER_PORT, and BATCH_ID (generated in Step 5 above).
Non-kerberized cluster:
curl -k \
-H "Content-Type: application/json" \
-X GET \
https://<LIVY_SERVER_HOST>:<LIVY_SERVER_PORT>/batches/<BATCH_ID>/state
Kerberized cluster:
curl -k \
--negotiate -u: \
-H "Content-Type: application/json" \
-X GET \
https://<LIVY_SERVER_HOST>:<LIVY_SERVER_PORT>/batches/<BATCH_ID>/state
Step 7: Verify the Livy job logs
To verify the Livy job logs, run the following command, replacing LIVY_SERVER_HOST, LIVY_SERVER_PORT, and BATCH_ID (generated in Step 5 above).
Non-kerberized cluster:
curl -k \
-H "Content-Type: application/json" \
-X GET \
https://<LIVY_SERVER_HOST>:<LIVY_SERVER_PORT>/batches/<BATCH_ID>/log
Kerberized cluster:
curl -k \
--negotiate -u: \
-H "Content-Type: application/json" \
-X GET \
https://<LIVY_SERVER_HOST>:<LIVY_SERVER_PORT>/batches/<BATCH_ID>/log
12-10-2023
09:39 PM
6 Kudos
Spark Python Supportability Matrix
The Spark Python Supportability Matrix serves as an essential tool for determining which Python versions are compatible with specific Spark versions. This matrix provides a detailed overview of the compatibility levels for various Python versions across different Spark releases.
| Spark Version | Python Min Supported Version | Python Max Supported Version | Python v2.7 | Python v3.4 | Python v3.5 | Python v3.6 | Python v3.7 | Python v3.8 | Python v3.9 | Python v3.10 | Python v3.11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3.5.2 | 3.8 | 3.11 | No | No | No | No | No | Yes | Yes | Yes | Yes |
| 3.5.1 | 3.8 | 3.11 | No | No | No | No | No | Yes | Yes | Yes | Yes |
| 3.5.0 | 3.8 | 3.11 | No | No | No | No | No | Yes | Yes | Yes | Yes |
| 3.4.2 | 3.7 | 3.11 | No | No | No | No | Yes | Yes | Yes | Yes | Yes |
| 3.4.1 | 3.7 | 3.11 | No | No | No | No | Yes | Yes | Yes | Yes | Yes |
| 3.4.0 | 3.7 | 3.11 | No | No | No | No | Yes | Yes | Yes | Yes | Yes |
| 3.3.3 | 3.7 | 3.10 | No | No | No | No | Yes | Yes | Yes | Yes | No |
| 3.3.2 | 3.7 | 3.10 | No | No | No | No | Yes | Yes | Yes | Yes | No |
| 3.3.1 | 3.7 | 3.10 | No | No | No | No | Yes | Yes | Yes | Yes | No |
| 3.3.0 | 3.7 | 3.10 | No | No | No | No | Yes | Yes | Yes | Yes | No |
| 3.2.4 | 3.6 | 3.9 | No | No | No | Yes | Yes | Yes | Yes | No | No |
| 3.2.3 | 3.6 | 3.9 | No | No | No | Yes | Yes | Yes | Yes | No | No |
| 3.2.2 | 3.6 | 3.9 | No | No | No | Yes | Yes | Yes | Yes | No | No |
| 3.2.1 | 3.6 | 3.9 | No | No | No | Yes | Yes | Yes | Yes | No | No |
| 3.2.0 | 3.6 | 3.9 | No | No | No | Yes | Yes | Yes | Yes | No | No |
| 3.1.3 | 3.6 | 3.9 | No | No | No | Yes | Yes | Yes | Yes | No | No |
| 3.1.2 | 3.6 | 3.9 | No | No | No | Yes | Yes | Yes | Yes | No | No |
| 3.1.1 | 3.6 | 3.9 | No | No | No | Yes | Yes | Yes | Yes | No | No |
| 3.0.3 | 2.7/3.4 | 3.9 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No |
| 3.0.2 | 2.7/3.4 | 3.9 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No |
| 3.0.1 | 2.7/3.4 | 3.8 | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No |
| 3.0.0 | 2.7/3.4 | 3.8 | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No |
| 2.4.8 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.4.7 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.4.6 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.4.5 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.4.4 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.4.3 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.4.2 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.4.1 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.4.0 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.3.4 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.3.3 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.3.2 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.3.1 | 2.7/3.4 | 3.6 | Yes | Yes | Yes | Yes | No | No | No | No | No |
| 2.3.0 | 2.7/3.4 | 3.6 | Yes | Yes | Yes | Yes | No | No | No | No | No |
| 2.2.3 | 2.7/3.4 | 3.5 | Yes | Yes | Yes | No | No | No | No | No | No |
| 2.2.2 | 2.7/3.4 | 3.5 | Yes | Yes | Yes | No | No | No | No | No | No |
| 2.2.1 | 2.7/3.4 | 3.5 | Yes | Yes | Yes | No | No | No | No | No | No |
| 2.2.0 | 2.7/3.4 | 3.5 | Yes | Yes | Yes | No | No | No | No | No | No |
| 2.1.3 | 2.7/3.4 | 3.5 | Yes | Yes | Yes | No | No | No | No | No | No |
| 2.1.2 | 2.7/3.4 | 3.5 | Yes | Yes | Yes | No | No | No | No | No | No |
| 2.1.1 | 2.7/3.4 | 3.5 | Yes | Yes | Yes | No | No | No | No | No | No |
Note: The above data is generated from the https://pypi.org/project/pyspark/ website. If you face any problems with a supported Python environment, share them in the comments so that we can add notes.