Member since: 03-13-2017
Posts: 9
Kudos Received: 2
Solutions: 0
04-20-2021
11:26 AM
Thank you, I appreciate the comment. This issue occurs after a Hive SQL query that joins around 15 tables (some of them big), so I think a broadcast join does not apply. Salting would mean breaking the query down and running the joins with Spark functions instead of Hive SQL, which could be time-consuming given the number of tables. So my question is: is there any other way to force Spark to distribute the partitions evenly across executors?
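If rewriting the whole query is off the table, salting can still be confined to the one or two skewed join keys rather than all 15 tables. A minimal pure-Python sketch of the idea (the key values and `num_salts` are hypothetical; in Spark you would add the salt as an extra join-key column on both sides):

```python
import random

def salt_key(key, num_salts=8):
    """Big/skewed side: append a random salt so one hot key
    is spread over num_salts distinct join keys."""
    return f"{key}_{random.randrange(num_salts)}"

def explode_key(key, num_salts=8):
    """Small side: emit every salted variant of the key so the
    join still matches whichever salt the big side drew."""
    return [f"{key}_{i}" for i in range(num_salts)]

# The hot key now lands in one of 8 buckets instead of one partition,
# and the small side carries all 8 copies so no matches are lost.
assert salt_key("hot") in explode_key("hot")
```

The rows for the skewed key then hash to `num_salts` different partitions instead of one, at the cost of replicating the smaller side `num_salts` times.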
04-19-2021
04:30 PM
I'm facing a severe performance issue on a job that suddenly (with no code changes) takes 4x as long to complete. After debugging and investigating, I found that most of the data is read by a single executor (13 GB on one executor vs. 200 MB on the rest). Initially I thought it was a classic uneven-partitions issue, so I started testing different partition numbers (and criteria), but that did not fix it. A rows-per-partition analysis showed that all partitions have a similar number of rows, so that is not the problem; it seems the scheduler assigns most partitions to a single executor instead of spreading them evenly. The question is: how does Spark decide which partitions go to which executor, and how can I control that behavior to make the distribution even? I asked this on SO too: https://stackoverflow.com/questions/67133177/how-spark-distributes-partitions-to-executors
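One scheduler knob worth checking in this situation is data locality: Spark prefers to run a task on an executor that is "local" to the partition's data (HDFS blocks, cached blocks) and will wait up to `spark.locality.wait` (3s by default) for such a slot before settling for any free executor, which can pile tasks onto the one node holding the data. A hedged sketch of the setting (whether it helps depends on why locality is concentrated in the first place):

```
# Sketch: tell the scheduler not to wait for a "local" executor,
# so partitions go to any free executor immediately (default is 3s).
spark.locality.wait=0s
```

Tasks may then read more data over the network, but they will at least be spread across the cluster.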
09-21-2020
03:29 PM
We have a use case where we run an ETL written in Spark on top of streaming data. The ETL writes results to the target Hive table every hour, but users commonly run queries against the target table, and we have seen query errors caused by Spark loading the table at the same time: java.io.FileNotFoundException: File does not exist: <HDFS path>. What alternatives do we have to avoid or minimize these errors? Is there a property for the Spark job (or for the Hive table)? Or something like creating a temporary table?
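A common pattern for this class of error is to never write into the files readers are currently scanning: write the batch to a staging location, then publish it with a single atomic rename, so a concurrent query resolves either the old files or the new ones, never a half-written directory. A minimal pure-Python sketch of the write-then-rename idea (the paths are hypothetical; on HDFS a rename is likewise atomic at the NameNode):

```python
import os
import tempfile

def publish_atomically(data: str, final_path: str) -> None:
    """Write to a temp file in the same directory, then atomically
    rename it over the final path, so readers see either the old or
    the new file, never a partial write."""
    d = os.path.dirname(final_path) or "."
    fd, tmp = tempfile.mkstemp(dir=d)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        os.replace(tmp, final_path)  # atomic on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

In Hive terms, the analogue is loading each hourly batch into a staging table or a new partition and then exposing it in one metadata operation, rather than overwriting files in place while queries are running.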
04-06-2017
10:50 AM
We want to use Hue notebooks for Spark, but that requires Livy, and Livy requires Spark 1.6, while our cluster has Spark 1.1 (CDH 5.2). Searching for info, I found that you can add the Spark 2 parcel (and CSD) as described in https://www.cloudera.com/documentation/spark2/latest/topics/spark2_installing.html, but the Spark 2 requirements page, https://www.cloudera.com/documentation/spark2/latest/topics/spark2_requirements.html, says CDH 5.7. So how can you install Spark 1.6 (or higher) on CDH 5.2 without having to do it manually, and with the option to manage it in Cloudera Manager?
04-05-2017
01:40 PM
On CDH 5.2, in order to use Hue Spark notebooks, I'm configuring Livy, but Livy requires Spark 1.6 and CDH 5.2 provides Spark 1.1. Is there an official way to install Spark 1.6 (or later) on CDH 5.2? Maybe through parcels?
03-27-2017
02:10 PM
Hi, is there an easy way to get a list of which user each daemon is running as on every node, without having to go to each node? The purpose is to verify that each daemon is running as the correct user on each node.
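Without logging into each node by hand, one low-tech route is to run `ps -eo user,comm` on every host (for example over an ssh loop) and collect the user/daemon pairs. A sketch of the parsing half, runnable locally (the daemon names below are examples):

```python
def daemon_users(ps_output, daemons):
    """Parse `ps -eo user,comm` output into {daemon: user}
    for the daemons we care about."""
    found = {}
    for line in ps_output.strip().splitlines()[1:]:  # skip header row
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        user, comm = parts
        if comm in daemons:
            found[comm] = user
    return found

sample = """USER     COMMAND
hdfs     datanode
yarn     nodemanager
root     sshd
"""
print(daemon_users(sample, {"datanode", "nodemanager"}))
# {'datanode': 'hdfs', 'nodemanager': 'yarn'}
```

The collection loop itself can be as simple as `for h in $(cat hosts); do ssh "$h" ps -eo user,comm; done`, feeding each host's output through a parser like this.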
03-13-2017
03:47 PM
2 Kudos
Hi, I know SparkR is not supported by Cloudera and sparklyr is an alternative to it. I have configured and tested sparklyr from RStudio, but it would be nice to be able to use sparklyr from Hue notebooks. Does anybody know if it is possible to configure this? I already searched on Google and didn't find anything.