Member since: 05-27-2016 · Posts: 14 · Kudos Received: 0 · Solutions: 0
11-14-2018
05:43 PM
I am getting the error shown in the attached pyspark-error.png. What does this error actually mean, and what steps/changes can I take to fix it?
Labels: Apache Hadoop, Apache Spark
09-14-2016
09:26 AM
I am running a Spark application that loads two tables as dataframes, does a left join, and generates a row number for the records missing from the right table. My code and my spark-submit command are below.

spark-submit --master yarn --deploy-mode client --num-executors 16 --driver-memory 10g --executor-memory 7g --executor-cores 5 --class CLASS_NAME PATH_TO_JAR

// Spark 1.x Java API; requires org.apache.spark.SparkConf, org.apache.spark.api.java.JavaSparkContext,
// org.apache.spark.sql.{DataFrame, Column, functions}, org.apache.spark.sql.expressions.Window,
// and org.apache.spark.sql.hive.HiveContext.
public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("Spark Sequence Test");
    JavaSparkContext jsc = new JavaSparkContext(conf);
    HiveContext hc = new HiveContext(jsc.sc());

    // Load both tables as dataframes.
    DataFrame cvxDataFrame = hc.sql("select * from Table_A");
    DataFrame skDataFrame = hc.sql("select * from Table_B");

    // Left join A to B and keep only the rows with no match in B.
    DataFrame new_df = cvxDataFrame.join(skDataFrame,
            cvxDataFrame.col("cvx_code").equalTo(skDataFrame.col("xref_code_id")), "left_outer");
    DataFrame fordf = new_df.select(new_df.col("cvx_code").as("xref_code_id"),
            new_df.col("xref_code_sk")).filter(new_df.col("xref_code_sk").isNull());

    // Generate a row number over an unpartitioned window ordered by xref_code_id.
    Column rowNum = functions.row_number().over(Window.orderBy(fordf.col("xref_code_id")));
    DataFrame df = fordf.select(fordf.col("xref_code_id"), rowNum.as("xref_code_sk"));

    df.registerTempTable("final_result");
    hc.sql("INSERT INTO TABLE TABLE_C SELECT xref_code_id, xref_code_sk, 'CVX' as xref_code_tp_cd from final_result");
}
This works when both Table A and Table B have 50 million records, but it fails when Table A has 50 million records and Table B has 0 records. The error I am getting is "Executor heartbeat timed out...":

ERROR cluster.YarnScheduler: Lost executor 7 on sas-hdp-d03.devapp.domain: Executor heartbeat timed out after 161445 ms
16/09/14 11:23:58 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 (TID 232, sas-hdp-d03.devapp.domain): ExecutorLostFailure (executor 7 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 161445 ms

I would really appreciate any suggestion on how I can get around this. Thanks
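As a hedged illustration only (not something from the original post): the heartbeat interval and network timeout mentioned in the error can be raised on the same spark-submit command while investigating; the 60s/600s values below are arbitrary assumptions, not recommendations.

spark-submit --master yarn --deploy-mode client \
  --num-executors 16 --driver-memory 10g --executor-memory 7g --executor-cores 5 \
  --conf spark.executor.heartbeatInterval=60s \
  --conf spark.network.timeout=600s \
  --class CLASS_NAME PATH_TO_JAR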
09-06-2016
06:40 PM
When we use a Hive context (Hive tables) or Phoenix tables within our Spark application, it is very difficult (in fact, I think impossible without going through pointless installations on the local machine) to run the application locally through Eclipse. I was looking for something like this: http://www.dbengineering.info/2016/09/debug-spark-application-running-on-cloud.html, which lets us run the application in debug mode. For the moment I am happy with this. Just sharing in case someone else has the same question I had a few months ago. Thanks
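A minimal sketch of the general remote-debug idea (my own illustration, not taken from the linked article): start the driver JVM with standard JDWP options so that Eclipse can attach a "Remote Java Application" debug session; the port 5005 is an arbitrary assumption.

spark-submit --master yarn --deploy-mode client \
  --driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" \
  --class CLASS_NAME PATH_TO_JAR

With suspend=y the driver waits at startup until a debugger attaches on port 5005, so breakpoints set in Eclipse are hit from the very first line of main.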
08-17-2016
09:42 AM
Hello srowen, I am doing a count in step 1 as well (right after caching the dataframe). So my expectation is that the dataframe should have only 2 records even if we insert records into the table in between. If that is true, then when we do a count on the cached dataframe at the end it should be 2, so why is it 4? This is what is confusing me. Thanks in advance.
08-17-2016
09:05 AM
This might not be the most relevant topic, but I think it will reach the right people. I am having an issue with caching a dataframe in Spark. (Step 1) I read a Hive table as a dataframe; let's say the count is 2. (Step 2) I cache this dataframe. (Step 3) I add 2 additional records to the Hive table. (Step 4) I do a count on the cached dataframe again. If caching works as I expect, the counts in step 1 and step 4 should both be 2. This is the case when I add the additional records to the table from outside the Spark application. However, it is not the case when I do step 3 from within the application: the count in step 4 comes back as 4. I am not understanding why. I think I am missing something.
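A minimal sketch of the scenario above in the Spark 1.x Java API; the table names my_table and staging_two_rows are placeholders I made up for illustration, not from the original post.

HiveContext hc = new HiveContext(jsc.sc());

// Step 1: read the Hive table (assume it currently holds 2 rows) and cache it.
DataFrame df = hc.sql("select * from my_table");
df.cache();
System.out.println(df.count());   // materializes the cache; prints 2

// Step 3: add two more rows to the same table from within the application.
hc.sql("INSERT INTO TABLE my_table SELECT * FROM staging_two_rows");

// Step 4: count the cached dataframe again.
System.out.println(df.count());   // expected 2, but observed 4 when the insert runs in the same app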
08-03-2016
07:37 AM
Hello guys, when we build a Spark application, we usually export it as a jar and run it on the cluster. Is there a way to run the application on the cluster directly from Eclipse (with some setting)? This would be very efficient for testing/debugging, so I am just wondering if there is anything out there. Thanks