Member since 12-12-2019 | 34 Posts | 1 Kudos Received | 2 Solutions
My Accepted Solutions
Views | Posted |
---|---|
3117 | 01-07-2020 01:29 AM |
15381 | 12-26-2019 11:05 PM |
08-11-2020
11:48 AM
Hi,

I need urgent help. I am using Delta Lake, provided by Databricks, to store staged data from a source application. When Apache Spark processes the data, the source data is staged as .parquet files, and the transaction log directory _delta_log is updated with the locations of those .parquet files in a .json file.

However, I have observed that when an application writes its data to 4 .parquet files but the .json files reference only 3 of them, the record counts differ depending on how the data is read:

spark> spark.read.format("delta").load("/path/applicationname") // shows a lower count for the application

versus

spark> spark.read.format("parquet").load("/path/applicationname") // shows the count across all 4 .parquet files, including the unreferenced one

So, according to Delta Lake, that one .parquet file doesn't exist, although it is actually present on disk.

Issue

This causes incorrect data to be captured in the target DB, and the downstream analysis is impacted.

My Questions

- Why does this issue happen? Why does Delta Lake fail to write the location of the .parquet file to the transaction log?
- How can I fix this issue? I have seen that if I change the target path for the same application and re-process, the issue is fixed 9 times out of 10. But I cannot keep changing the target path, and that is not a clean solution either.

Please let me know if you need any additional information.

Thanks and Regards,
Sudhindra
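One way to see which files the transaction log actually references is to replay the JSON commits in _delta_log and compare the result with the .parquet files on disk. The sketch below is a minimal illustration in Python, not Delta Lake's own tooling. The helper names (`live_files_from_log`, `parquet_files_on_disk`, `unreferenced_files`) are hypothetical, and it assumes the table sits on a local filesystem, that no checkpoint files need to be read, and that each commit line is a JSON action whose `add` or `remove` entry carries a `path`. An unreferenced file is not necessarily a bug: files left behind by a failed or retried write, or files logically removed by a later commit, may stay on disk until they are cleaned up, and a plain parquet read of the directory will still count them.

```python
import json
from pathlib import Path
from urllib.parse import unquote


def live_files_from_log(table_root):
    """Replay the JSON commits in order and return the data files still referenced."""
    live = set()
    log_dir = Path(table_root) / "_delta_log"
    # Commit file names are zero-padded versions, so a lexical sort is version order.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live.add(unquote(action["add"]["path"]))
            elif "remove" in action:
                live.discard(unquote(action["remove"]["path"]))
    return live


def parquet_files_on_disk(table_root):
    """List .parquet data files under the table root, skipping the log directory."""
    root = Path(table_root)
    return {
        p.relative_to(root).as_posix()
        for p in root.rglob("*.parquet")
        if "_delta_log" not in p.relative_to(root).parts
    }


def unreferenced_files(table_root):
    """Files a plain parquet read would pick up but a delta read would ignore."""
    return sorted(parquet_files_on_disk(table_root) - live_files_from_log(table_root))
```

Calling `unreferenced_files("/path/applicationname")` on a local copy of the table would list the files that explain the difference between the two counts. Whether each one is an orphan from an aborted write or a file removed by a later commit still has to be checked against the job history.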
Labels: Apache Spark
01-16-2020
10:26 PM
Hi @lyubomirangelo and @EricL,

Sorry for the delayed response, and thanks for your inputs. I have already changed the number of vcores, but I am still facing the same issue. In the meantime, I was able to execute the jobs with the YARN Capacity Scheduler (with the same memory configuration), so I am not sure what is wrong with the YARN Fair Scheduler settings. Please suggest any specific settings that the YARN Fair Scheduler requires.

Also, I am still using the default queue. I haven't set up a separate queue for the Fair Scheduler.

Thanks and Regards,
Sudhindra
01-09-2020
08:44 PM
Hi @EricL,

I changed the parameter "yarn.app.mapreduce.am.resource.mb" to 2 GB (2048 MB). Although the second Spark job now runs under the "Fair Scheduler" configuration, its tasks are not getting the required resources at all:

[Stage 0:> (0 + 0) / 1]20/01/09 22:58:01 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
20/01/09 22:58:16 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
20/01/09 22:58:31 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Here is the important information about the cluster:

1. Number of nodes in the cluster: 2
2. Total cluster memory: 15.28 GB
   - yarn.nodemanager.resource.memory-mb = 7821 MB
   - yarn.app.mapreduce.am.resource.mb = 2048 MB
   - yarn.scheduler.minimum-allocation-mb = 1024 MB
   - yarn.scheduler.maximum-allocation-mb = 3072 MB
3. Number of executors set through the program: 5 (spark.num.executors)
4. Number of cores set through the program: 3 (spark.executor.cores)
5. Spark driver memory and Spark executor memory: 2g each

Please help me understand what else is going wrong.

Note: With the same set of parameters (along with yarn.app.mapreduce.am.resource.mb of 1024 MB), the Spark job runs fine when the YARN Capacity Scheduler is set. However, it doesn't run when the YARN Fair Scheduler is set. So, I want to understand what is going wrong only with the Fair Scheduler.
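The figures above can be turned into a rough capacity check. The following is a back-of-the-envelope sketch in Python, not a statement of how this particular cluster allocates containers. It assumes Spark's default executor memory overhead on YARN (the larger of 384 MB and 10% of executor memory), assumes container requests are rounded up to a multiple of 1024 MB (the actual increment depends on the scheduler settings), and ignores the ApplicationMaster and driver containers. The function and constant names are hypothetical.

```python
import math


def executor_overhead_mb(executor_mb, factor=0.10, floor_mb=384):
    """Assumed default overhead: the larger of floor_mb and factor * executor memory."""
    return max(floor_mb, int(executor_mb * factor))


def container_mb(requested_mb, overhead_mb, increment_mb):
    """Round the request plus overhead up to the next multiple of the increment."""
    return math.ceil((requested_mb + overhead_mb) / increment_mb) * increment_mb


# Values taken from the post.
NODE_MB = 7821        # yarn.nodemanager.resource.memory-mb
NODES = 2
EXECUTOR_MB = 2048    # 2g executor memory
EXECUTORS = 5         # executors requested by one application

# Assumption: allocation increment of 1024 MB.
INCREMENT_MB = 1024

per_executor = container_mb(EXECUTOR_MB, executor_overhead_mb(EXECUTOR_MB), INCREMENT_MB)
fit_per_node = NODE_MB // per_executor       # executor containers that fit on one node
cluster_fit = fit_per_node * NODES           # executor containers that fit cluster-wide
requested_total = per_executor * EXECUTORS   # memory one application asks for

print(per_executor, fit_per_node, cluster_fit, requested_total)
```

Under these assumptions each executor container comes to 3072 MB, only 2 fit on a node, and 4 of the 5 requested executors fit cluster-wide. If the first application holds those containers, a second application has little room left for executors, which would be consistent with the "Initial job has not accepted any resources" warning.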
01-07-2020
11:05 PM
Hi @EricL,

I am still facing the same issue when I use the YARN Fair Scheduler to run the Spark jobs. With the same memory configuration, the Spark jobs run fine when the YARN Capacity Scheduler is used. Can you please help me fix this issue?

Thanks and Regards,
Sudhindra
01-07-2020
11:03 PM
Hi @Shelton,

I was able to achieve my objective of running multiple Spark sessions under a single Spark context using the YARN Capacity Scheduler and Spark fair scheduling. However, the issue remains with the YARN Fair Scheduler: the second Spark job still does not run (with the same memory configuration) due to lack of resources.

So, what additional parameters need to be set for the YARN Fair Scheduler to achieve this? Please help me fix this issue.

Thanks and Regards,
Sudhindra
01-07-2020
01:29 AM
After increasing the number of cores/executors and the driver/executor memory, I verified that around 6 tasks run in parallel at a time.

Thanks and Regards,
Sudhindra
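The link between cores/executors and parallelism can be sketched with simple arithmetic. This is only an illustration: `task_slots` is a hypothetical helper, the example numbers are not taken from the cluster above, and it assumes that each task needs `cpus_per_task` cores (1 by default in Spark). Scheduler pools decide how these slots are shared between jobs; they do not add slots.

```python
def task_slots(num_executors, cores_per_executor, cpus_per_task=1):
    """Upper bound on the tasks one application can run at the same time."""
    return num_executors * (cores_per_executor // cpus_per_task)


# Hypothetical example: 2 executors with 3 cores each give 6 concurrent task slots.
print(task_slots(2, 3))
```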
01-06-2020
10:15 PM
Additional Information:

As we can see, even though there are 3 stages active, only 1 task each is running in the Production and Default pools.

My basic question is: how can we increase the parallelism within pools? In other words, how can I make sure that Stage ID "8" in the above screenshot also runs in parallel with the other 2?

Thanks and Regards,
Sudhindra
01-06-2020
10:05 PM
Hi,
I am running Spark jobs on YARN, using HDP 3.1.1.0-78 version.
I have set the Spark Scheduler Mode to FAIR by setting the parameter "spark.scheduler.mode" to FAIR. The fairscheduler.xml is as follows:
I have also configured my program to use "production" pool.
Upon running the job, I observed that although 4 stages are running, only 1 stage runs under "production" and the remaining 3 run under the "default" pool.
So, at any point in time, only 2 tasks are running in parallel. If I want 3 or more tasks to run in parallel, then 2 tasks should run under "production" and the remaining 2 under "default".
Is there any programmatic way to achieve that, by setting configuration parameters?
Any inputs will be really helpful.
Thanks and Regards,
Sudhindra
01-05-2020
09:26 AM
Hi @EricL , This is just a gentle reminder. Can you please help me in fixing this issue? Thanks and Regards, Sudhindra
01-05-2020
09:25 AM
Hi @Shelton,

Can you please help me fix this issue? With the same memory configuration as mentioned, I am able to run more than 1 Spark job with the Capacity Scheduler, while it is not possible to run the second Spark job with the Fair Scheduler. I have already sent you the required screenshots. Please let me know your inputs at the earliest.

Thanks and Regards,
Sudhindra