Member since 12-15-2020
08-01-2022
08:03 AM
Hi @somant, regarding "I'm wondering if there is any way that the spark job doesn't depend on the CDH provided libraries and only use packaged dependencies": it depends on the library you are referring to. If you mean the Kafka libraries, they shouldn't be loaded on the Spark classpath by default, so you can build a fat jar that includes the Kafka dependencies in the version you need. Moreover, did you try specifying fewer options at launch time? E.g. I would start by removing G1GC and the other advanced options, and then monitor the behaviour one step at a time.
07-22-2022
03:42 AM
1 Kudo
Hi @yassan, let's recap some important concepts, then I will comment on your points.
Generally speaking:
1. If you issue a SQL statement on Hive/Impala, the engine is fully aware of what has been requested and will do everything needed. E.g. if you drop a partition, the engine knows that the metadata has to be updated and knows whether the data has to be purged or not (e.g. for an external table, data on the filesystem won't be deleted by default). NOTE: if you want drop statements to delete data as well, you need a managed (non-external) table. You might also try altering your table to set TBLPROPERTIES ("external.table.purge"="true"). Honestly, I'm not sure this is available in your version of Hive; it certainly is in more recent versions (e.g. Cloudera Data Platform).
2. If you delete data directly on the filesystem (e.g. via a Spark job or via the hdfs CLI), the Hive/Impala engine has no way to know that it happened unless you explicitly tell it. You can do so by launching "MSCK REPAIR TABLE [...]" on Hive or "ALTER TABLE tablename DROP PARTITION [...]" on either Hive or Impala. Note that, depending on the Hive version, a plain MSCK REPAIR TABLE only adds missing partitions; removing partitions whose directories are gone needs the DROP PARTITIONS or SYNC PARTITIONS option, available from Hive 3.0. Actually, if you are using Spark you could rely on Spark SQL and issue a drop partition statement (see the summary at the end of this post).
3. Impala relies on the Hive metastore but caches metadata. If you change metadata via Hive, you then have to launch "INVALIDATE METADATA" on Impala in order to refresh the cache.
These are the key points to take into account.
Commenting on your last post: if you have a lot of partitions, there are a couple of ways to reduce the effort of launching many drop partition statements:
1. Of course, you can script it (or develop a Spark job, or come up with some other automation strategy).
2. If the partition field allows for it, you can drop a range of partitions with a single statement, e.g. something like "ALTER TABLE my_table_partitioned_by_year DROP PARTITION (year < 2020)". You can do this from Impala if you prefer, so that you won't have to refresh the Impala cache. This will drop partitions but won't ever drop the table.
Now, summarizing everything we've shared so far, you have three possible ways to go:
1. Do it via Impala by using the drop partition SQL statement.
2. Delete data directly on the filesystem and afterwards tell Impala to drop the partition (drop partition statements in Impala, or MSCK REPAIR on Hive + INVALIDATE METADATA on Impala).
3. Use a Spark job and issue a drop partition statement via Spark SQL + INVALIDATE METADATA on Impala (since the Spark job would act directly on the Hive metastore, out of Impala's line of sight).
Hope this helps
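When the partition field does not lend itself to a single range predicate, the drop statements can be generated by a script. Below is a minimal Python sketch of that idea, not a definitive implementation: the function name `drop_partition_statements` is hypothetical, the table and column names are the illustrative ones used in this thread, and the values are assumed to be trusted (no escaping is done). It only builds the statements; running them through impala-shell, beeline or Spark SQL is left to you.

```python
def drop_partition_statements(table, column, values):
    """Build one ALTER TABLE ... DROP PARTITION statement per partition value."""
    statements = []
    for value in values:
        # Quote string values, leave numbers bare, mirroring (year = 2020).
        literal = f"'{value}'" if isinstance(value, str) else str(value)
        statements.append(
            f"ALTER TABLE {table} DROP PARTITION ({column} = {literal});"
        )
    return statements


if __name__ == "__main__":
    for statement in drop_partition_statements("my_partitioned_table", "year", [2018, 2019]):
        print(statement)
```

If you run the generated statements on Impala, no cache refresh is needed afterwards; if you run them on Hive or Spark SQL, remember the INVALIDATE METADATA step.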
07-20-2022
08:25 AM
As far as I know, this is not something that Ambari or Sqoop allows for. To achieve your goal you could do one of the following:
1. Prepare sh scripts that refer to your JDBC string as a variable.
2. Prepare an Oozie workflow and pass the JDBC string as a variable.
At that point an external tool (e.g. Jenkins) could maintain the list of JDBC strings and take responsibility for specifying the desired one. In solution 1, Jenkins would SSH to the node, set the variable to the JDBC string, and launch the sh script. In solution 2, Jenkins would use the Oozie API to start the workflow while specifying the desired variable value. Solution 2 is much better than solution 1, since it relies on a distributed, highly available service (Oozie).
Regards
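To make the "JDBC string as a variable" idea concrete, here is a small Python sketch of what the external tool could do before launching the import. Everything named here is an assumption for illustration: the `JDBC_STRINGS` registry, the environment names, the example hosts, and the table and target directory. Only the standard `sqoop import` options `--connect`, `--table` and `--target-dir` are used.

```python
import shlex

# Hypothetical registry that an external tool such as Jenkins could maintain.
JDBC_STRINGS = {
    "dev": "jdbc:mysql://dev-db.example.com:3306/sales",
    "prod": "jdbc:mysql://prod-db.example.com:3306/sales",
}


def build_sqoop_import(environment, table, target_dir):
    """Return the sqoop import command (as an argument list) for one environment."""
    jdbc_url = JDBC_STRINGS[environment]  # raises KeyError for an unknown environment
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
    ]


if __name__ == "__main__":
    command = build_sqoop_import("dev", "orders", "/data/raw/orders")
    # Print a shell-safe command line; the same value could instead be passed
    # to an Oozie workflow as a parameter.
    print(shlex.join(command))
```

Keeping the connection strings in one place means the script or workflow itself never changes when a new database is added.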
07-20-2022
08:18 AM
Hi somant, some information is needed in order to drive the investigation:
1. Where are you launching the job from? E.g. from a gateway of your CDH cluster?
2. Can you please share your spark-submit command?
3. You are saying the job is not starting up: do you have any logs (Spark driver logs, YARN logs)?
Thanks
07-20-2022
08:16 AM
1 Kudo
Hi Yassan, my first recommendation: when you need to drop a partition, it is better to do it via a SQL statement, either on Impala/Hive or with Spark SQL. For example, assuming that "year" is my partitioning field:
alter table my_partitioned_table drop partition (year = 2020);
If you drop a partition at the filesystem level, there are two things you should do in order to have everything aligned on Impala:
1. First run "MSCK REPAIR TABLE my_partitioned_table" on Hive, in order to refresh the metastore with the correct partition information.
2. Once point 1 is done, run "INVALIDATE METADATA" on Impala, so as to refresh the Impala cache.
Let me know if this helps.
Regards
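As a quick reference, the follow-up statements that each way of removing a partition calls for can be written down as a small lookup. This is only a sketch under assumptions: the names `FOLLOW_UPS` and `follow_up_statements` are made up for illustration, INVALIDATE METADATA is scoped to the table here (it also works without a table name), and depending on the Hive version the MSCK statement may need the SYNC PARTITIONS option to actually remove partitions whose directories are gone, which you should verify on your cluster.

```python
# Follow-up statements per way of removing a partition,
# stored as (engine, statement template) pairs in execution order.
FOLLOW_UPS = {
    "impala_sql": [],  # Impala did the drop itself, nothing to refresh
    "hive_sql": [("impala", "INVALIDATE METADATA {table};")],
    "spark_sql": [("impala", "INVALIDATE METADATA {table};")],
    "filesystem": [
        ("hive", "MSCK REPAIR TABLE {table};"),
        ("impala", "INVALIDATE METADATA {table};"),
    ],
}


def follow_up_statements(how, table):
    """Return the (engine, statement) pairs to run after a partition was removed."""
    return [(engine, template.format(table=table)) for engine, template in FOLLOW_UPS[how]]


if __name__ == "__main__":
    for engine, statement in follow_up_statements("filesystem", "my_partitioned_table"):
        print(f"{engine}: {statement}")
```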
07-19-2022
01:25 AM
Please note that the last screenshot contains typos, while the article text does not. The correct configuration property names are the following (note the camel case):
spark.driver.extraClassPath
spark.executor.extraClassPath
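Since a wrongly cased key is easy to miss by eye, a tiny check can flag it before the job is submitted. The following Python sketch is purely illustrative: the function name `misspelled_properties` and the sample configuration are assumptions, and only the two property names listed above are checked.

```python
EXPECTED = ("spark.driver.extraClassPath", "spark.executor.extraClassPath")


def misspelled_properties(conf):
    """Map each wrongly cased key to its correct spelling.

    A key is reported when it equals an expected name only if case is ignored.
    """
    by_lower = {name.lower(): name for name in EXPECTED}
    return {
        key: by_lower[key.lower()]
        for key in conf
        if key.lower() in by_lower and key != by_lower[key.lower()]
    }


if __name__ == "__main__":
    sample = {
        "spark.driver.extraclasspath": "/app/mount/conf",    # wrong case
        "spark.executor.extraClassPath": "/app/mount/conf",  # correct
    }
    print(misspelled_properties(sample))
```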
02-02-2022
08:42 AM
3 Kudos
COD - CDE Spark-HBase tutorial
This article, related to CDP Public Cloud, walks through the steps required to read from and write to COD (Cloudera Operational Database) from Spark on CDE (Cloudera Data Engineering) using the spark-hbase connector. If you would rather use Phoenix, please refer to this community article.
Assumptions
COD is already provisioned and a database has been created. Refer to this link for instructions. For this example, we will assume "amallegni-cod" as the database name.
CDE is already provisioned and a virtual cluster has been created. Refer to this link for instructions.
COD
Download client configuration
For Spark in CDE to be able to talk to COD, it needs the hbase-site.xml configuration of the COD cluster. Follow these steps to retrieve it:
Go to the COD control plane UI and click on "amallegni-cod" database.
Under the Connect tab of the COD database, look for the HBase Client Configuration URL field.
The configuration can be downloaded using the following curl command.
curl -f -o "hbase-config.zip" -u "<YOUR WORKLOAD USERNAME>" "https://cod--4wfxojpfxmwg-gateway.XXXXXXXXX.cloudera.site/clouderamanager/api/v41/clusters/cod--4wfxojpfxmwg/services/hbase/clientConfig"
Make sure to provide your "workload" password when curl prompts for it.
Explore the downloaded zip file to obtain the hbase-site.xml file.
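If you want to script the last step, the file can be pulled out of the archive with a few lines of Python. This is a minimal sketch under assumptions: the function name `extract_hbase_site` is hypothetical, and because the internal folder layout of the zip is not described here, the code searches for the file by name instead of relying on a fixed path.

```python
import zipfile
from pathlib import PurePosixPath


def extract_hbase_site(zip_path, dest_path):
    """Copy the first hbase-site.xml found in the archive to dest_path.

    Returns the archive member name that was extracted.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # Match on the file name only, whatever folder it sits in.
            if PurePosixPath(member).name == "hbase-site.xml":
                with open(dest_path, "wb") as out:
                    out.write(archive.read(member))
                return member
    raise FileNotFoundError(f"hbase-site.xml not found in {zip_path}")


if __name__ == "__main__":
    import os

    # Only attempt the extraction if the archive from the curl step is present.
    if os.path.exists("hbase-config.zip"):
        print(extract_hbase_site("hbase-config.zip", "hbase-site.xml"))
```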
Create the HBase table
In this example, we use the Hue instance of the COD database to quickly create a new HBase table. Let's walk through this step by step:
Go to the COD control plane UI and click on "amallegni-cod" database.
Click on the Hue link.
Once in Hue, click on the HBase menu item on the left sidebar, and then click on the New Table button in the top right corner.
Choose your table name and column families, and then click on the Submit button. For the sake of this example, let's call the table 'testtable' and let's create a single column family called 'testcf'.
CDE
For complete Spark code examples, refer to the HbaseRead.scala and HBaseWrite.scala examples.
Configure your Job via CDE CLI
To configure the job via CDE CLI, perform these steps:
Configure CDE CLI to point to the virtual cluster created in the above step. For more details, see Configuring the CLI client.
Create a resource using the following command:
cde resource create --name cod-spark-resource
Upload hbase-site.xml:
cde resource upload --name cod-spark-resource --local-path /your/path/to/hbase-site.xml --resource-path conf/hbase-site.xml
Upload the demo app jar that was built earlier:
cde resource upload --name cod-spark-resource --local-path /path/to/your/spark-hbase-project.jar --resource-path spark-hbase-project.jar
Create the CDE job using a JSON definition which should look like this:
{
  "mounts": [
    {
      "resourceName": "cod-spark-resource"
    }
  ],
  "name": "my-cod-spark-job",
  "spark": {
    "className": "<YOUR MAIN CLASS>",
    "conf": {
      "spark.executor.extraClassPath": "/app/mount/conf",
      "spark.driver.extraClassPath": "/app/mount/conf"
    },
    "args": ["<YOUR ARGS IF ANY>"],
    "driverCores": 1,
    "driverMemory": "1g",
    "executorCores": 1,
    "executorMemory": "1g",
    "file": "spark-hbase-project.jar",
    "pyFiles": [],
    "files": ["conf/hbase-site.xml"],
    "numExecutors": 4
  }
}
Finally, assuming the above JSON was saved into my-job-definition.json, import the job using the following command:
cde job import --file my-job-definition.json
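The CLI steps above can be collected into a single script. The Python sketch below only assembles the commands quoted in this article as argument lists; the function name `cde_setup_commands` is hypothetical, and the local paths are placeholders you would replace with your own.

```python
def cde_setup_commands(resource, site_xml, jar, definition):
    """Return the CDE CLI commands from this article, in execution order."""
    jar_name = jar.rsplit("/", 1)[-1]  # the resource path is just the jar file name
    return [
        ["cde", "resource", "create", "--name", resource],
        ["cde", "resource", "upload", "--name", resource,
         "--local-path", site_xml, "--resource-path", "conf/hbase-site.xml"],
        ["cde", "resource", "upload", "--name", resource,
         "--local-path", jar, "--resource-path", jar_name],
        ["cde", "job", "import", "--file", definition],
    ]


if __name__ == "__main__":
    commands = cde_setup_commands(
        "cod-spark-resource",
        "/your/path/to/hbase-site.xml",
        "/path/to/your/spark-hbase-project.jar",
        "my-job-definition.json",
    )
    for command in commands:
        print(" ".join(command))
```

Each list can be handed to `subprocess.run(command, check=True)` once the CDE CLI is configured, so that the sequence stops at the first failing step.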
Please note that spark.driver.extraClassPath and spark.executor.extraClassPath in the job definition point to the same path we used when uploading hbase-site.xml into our CDE resource. This is important: it means hbase-site.xml is loaded automatically from the classpath, so you don't need to refer to it explicitly in your Spark code. All you need is the following:
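import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext
val conf = HBaseConfiguration.create() // picks up hbase-site.xml from the classpath
val hbaseContext = new HBaseContext(spark.sparkContext, conf) // spark is the active SparkSession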
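Because the classpath value must stay in sync with the resource path used at upload time, it can help to derive one from the other when generating the job definition. The Python sketch below illustrates this under assumptions: the helper names `classpath_for` and `job_definition` are hypothetical, and the sizing values are simply copied from the JSON definition above.

```python
import json
import posixpath

MOUNT_ROOT = "/app/mount"  # mount point used in the job definition above


def classpath_for(resource_path):
    """Directory under the mount root that contains the uploaded file."""
    directory = posixpath.dirname(resource_path)
    return posixpath.join(MOUNT_ROOT, directory) if directory else MOUNT_ROOT


def job_definition(name, resource, main_class, jar, site_resource_path, args=()):
    """Build the job definition with both classpath properties derived from one path."""
    classpath = classpath_for(site_resource_path)
    return {
        "mounts": [{"resourceName": resource}],
        "name": name,
        "spark": {
            "className": main_class,
            "conf": {
                "spark.executor.extraClassPath": classpath,
                "spark.driver.extraClassPath": classpath,
            },
            "args": list(args),
            "driverCores": 1,
            "driverMemory": "1g",
            "executorCores": 1,
            "executorMemory": "1g",
            "file": jar,
            "pyFiles": [],
            "files": [site_resource_path],
            "numExecutors": 4,
        },
    }


if __name__ == "__main__":
    definition = job_definition(
        "my-cod-spark-job", "cod-spark-resource", "<YOUR MAIN CLASS>",
        "spark-hbase-project.jar", "conf/hbase-site.xml",
    )
    print(json.dumps(definition, indent=2))
```

With a resource path of conf/hbase-site.xml the derived classpath is /app/mount/conf; with a bare hbase-site.xml it falls back to /app/mount.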
If you prefer the CDE UI
If you prefer using the UI instead, take the following into account:
In your CDE job configuration page, the hbase-site.xml should be uploaded under Advanced options > Other dependencies.
At the time of this article, the UI does not allow specifying a path for a file inside a CDE resource or for files uploaded under Other dependencies. For this reason, in your job definition you should use /app/mount as the value of both the spark.driver.extraClassPath and spark.executor.extraClassPath properties. In your CDE job configuration page, set these properties inside the Configurations section.