Reply
Explorer
Posts: 11
Registered: ‎11-25-2015

CDH Spark SQL window functions return wrong result at cluster mode.

[ Edited ]

Hi,
 
Thank you for releasing CDH5.6.0.

 

I found Spark SQL's window functions return a wrong result at yarn cluster mode.


The cause is that window functions return another colum values in some cases.
This is because ExprId is not created uniqly among executors.
The issue ticket is below.
https://issues.apache.org/jira/browse/SPARK-11080

Example snippet is below.

DataFrame df = ...

// count record per a, b -> count_a
GroupedData grouped = df.groupBy(col("a"), col("b"));
DataFrame aCount = grouped.agg(functions.count(col("a")).as("count_a"));

// sum count_a per b -> count_b
WindowSpec sumWs = Window.partitionBy("b");
DataFrame bCount = aCount.withColumn("count_a",functions.sum(col("count_a")).over(sumWs));

// rank per count_b
WindowSpec rankWs = Window.orderBy(functions.desc("count_b"));
DataFrame ranked = bCount.withColumn("rank", functions.denseRank().over(rankWs));

List<Row> rows = ranked.collectAsList();
for(Row row : rows) {
   Logger.info(row.toString()); // <= count_b values are corrupted!
}

The result is something like below.

a, b, count_a, count_b, rank
----------------------------------
[a1,b1,1,240518168655,10]
[a1,b2,1,240518168655,10]
[a1,b3,1,240518168655,10]
...

 

I confirmed the bug is fixed by applying above ticket patch.


This issue is critical, and I also think so because it might affect other shuffle features.

I hope the ticket patch to be backported.

 

Spark SQL's DataFrame is supported except for blow features, so I consider window functions are supported.
http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_rn_spark_ki.html

Cloudera Employee
Posts: 366
Registered: ‎07-29-2013

Re: CDH Spark window functions return wrong result at cluster mode.

I don't think the JIRA you linked is related and I'm not clear from
what you posted that there is a problem. What result do you expect?

With that kind of info, you'd need to contact support.
Explorer
Posts: 11
Registered: ‎11-25-2015

Re: CDH Spark window functions return wrong result at cluster mode.

[ Edited ]

Sorry for unclear contents and inappropriate topic.

 

I would like to say the summarized values are unexpectedly large.

SPARK-11080 affects comparison of ExprId in BoundReferernces.bindReference.

https://github.com/cloudera/spark/blob/cdh5-1.5.0_5.6.0/sql/catalyst/src/main/scala/org/apache/spark...

 

I'll ready for executor's inconsistent generated code log, and contact support.

Explorer
Posts: 11
Registered: ‎11-25-2015

Re: CDH Spark window functions return wrong result at cluster mode.

I reported to JIRA, because I think this issue is not application problem, but SPARK's.

I wrote the detail about the issue to the report.

 

https://issues.cloudera.org/browse/DISTRO-791

 

Explorer
Posts: 11
Registered: ‎11-25-2015

Re: CDH Spark SQL window functions return wrong result at cluster mode.

Hi Prem1,

 

It seems not to be related to this topic, but ...

 

1) Could you please let us know whether there is any way to solve the above error?

Window functions depend on hive library.

https://issues.apache.org/jira/browse/SPARK-8641

https://issues.apache.org/jira/browse/SPARK-11001

 

So, you might need to install hive or be ready for related jars on the submit host and add the following jar options.

The libray version number might not be correct.

 

--jars /usr/lib/hive/lib/hive-jdbc.jar,/usr/lib/hive/lib/hive-cli.jar,/usr/lib/hive/lib/hive-common.jar,/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar,/usr/lib/hive/lib/datanucleus-core-3.2.10.jar,/usr/lib/hive/lib/derby-10.11.1.1.jar,/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar

 

The related topics seem to be below.

https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/CDH-5-3-Spark-hive-integration-iss...

http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/I-am-using-a-hive-cotext-in-pyspark...