Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

CDH Spark SQL window functions return wrong result at cluster mode.

CDH Spark SQL window functions return wrong result at cluster mode.

Explorer

Hi,
 
Thank you for releasing CDH5.6.0.

 

I found Spark SQL's window functions return a wrong result at yarn cluster mode.


The cause is that window functions return another colum values in some cases.
This is because ExprId is not created uniqly among executors.
The issue ticket is below.
https://issues.apache.org/jira/browse/SPARK-11080

Example snippet is below.

DataFrame df = ...

// count record per a, b -> count_a
GroupedData grouped = df.groupBy(col("a"), col("b"));
DataFrame aCount = grouped.agg(functions.count(col("a")).as("count_a"));

// sum count_a per b -> count_b
WindowSpec sumWs = Window.partitionBy("b");
DataFrame bCount = aCount.withColumn("count_a",functions.sum(col("count_a")).over(sumWs));

// rank per count_b
WindowSpec rankWs = Window.orderBy(functions.desc("count_b"));
DataFrame ranked = bCount.withColumn("rank", functions.denseRank().over(rankWs));

List<Row> rows = ranked.collectAsList();
for(Row row : rows) {
   Logger.info(row.toString()); // <= count_b values are corrupted!
}

The result is something like below.

a, b, count_a, count_b, rank
----------------------------------
[a1,b1,1,240518168655,10]
[a1,b2,1,240518168655,10]
[a1,b3,1,240518168655,10]
...

 

I confirmed the bug is fixed by applying above ticket patch.


This issue is critical, and I also think so because it might affect other shuffle features.

I hope the ticket patch to be backported.

 

Spark SQL's DataFrame is supported except for blow features, so I consider window functions are supported.
http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_rn_spark_ki.html

4 REPLIES 4

Re: CDH Spark window functions return wrong result at cluster mode.

Master Collaborator
I don't think the JIRA you linked is related and I'm not clear from
what you posted that there is a problem. What result do you expect?

With that kind of info, you'd need to contact support.

Re: CDH Spark window functions return wrong result at cluster mode.

Explorer

Sorry for unclear contents and inappropriate topic.

 

I would like to say the summarized values are unexpectedly large.

SPARK-11080 affects comparison of ExprId in BoundReferernces.bindReference.

https://github.com/cloudera/spark/blob/cdh5-1.5.0_5.6.0/sql/catalyst/src/main/scala/org/apache/spark...

 

I'll ready for executor's inconsistent generated code log, and contact support.

Re: CDH Spark window functions return wrong result at cluster mode.

Explorer

I reported to JIRA, because I think this issue is not application problem, but SPARK's.

I wrote the detail about the issue to the report.

 

https://issues.cloudera.org/browse/DISTRO-791

 

Highlighted

Re: CDH Spark SQL window functions return wrong result at cluster mode.

Explorer

Hi Prem1,

 

It seems not to be related to this topic, but ...

 

1) Could you please let us know whether there is any way to solve the above error?

Window functions depend on hive library.

https://issues.apache.org/jira/browse/SPARK-8641

https://issues.apache.org/jira/browse/SPARK-11001

 

So, you might need to install hive or be ready for related jars on the submit host and add the following jar options.

The libray version number might not be correct.

 

--jars /usr/lib/hive/lib/hive-jdbc.jar,/usr/lib/hive/lib/hive-cli.jar,/usr/lib/hive/lib/hive-common.jar,/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar,/usr/lib/hive/lib/datanucleus-core-3.2.10.jar,/usr/lib/hive/lib/derby-10.11.1.1.jar,/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar

 

The related topics seem to be below.

https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/CDH-5-3-Spark-hive-integration-iss...

http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/I-am-using-a-hive-cotext-in-pyspark...