
Why does Spark code work fine in local mode but return null/0 values in cluster mode?


Expert Contributor

Hello everyone,


The code below works fine in local mode and produces the expected output, but when we submit the job in cluster mode the same code returns 0 for the unique count and null for the min and max profilers.

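import org.apache.spark.sql.functions.{col, countDistinct, lit, max, min}

// Build one profile row per column, then union the rows into a single DataFrame.
val result = sourceData.schema.map(st => {
  val column = st.name
  val nullCount = sourceData.filter(sourceData(column).isNull).count()
  // distinct().count() treats null as a value; countDistinct below ignores nulls.
  val uniqueCount = sourceData.select(column).distinct().count()

  sourceData
    .agg(
      countDistinct(col(column)).alias("UniqueCount"),
      min(col(column)).alias("Min"),
      max(col(column)).alias("Max"))
    // Replaces the aggregated UniqueCount with the count computed above.
    .withColumn("UniqueCount", lit(uniqueCount))
    .withColumn("NullCount", lit(nullCount))
    .withColumn("Column_Name", lit(column))
}).reduce(_ union _)
result.show()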

I am using the following command to submit the Spark job in cluster mode:

spark-submit --class ey.profiler.sparkProfilerClient --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.0.0-78.jar --deploy-mode client DataProfiling.jar

 

 

 

I also tried to achieve the same thing in a different way, shown below, but there too I see the same behaviour: the final result DataFrame contains wrong values.

val resultantDf = sourceData.columns.foldLeft(initialDF)((df1, column) => df1.union({
  val x = sourceData.filter(col(column).isNull).count()
  sourceData
    .agg(countDistinct(col(column)), min(col(column)), max(col(column)))
    .withColumn("Null_Count", lit(x))
    .withColumn("Column_Name", lit(column))
}))
resultantDf.show()

Why does this happen, and how can I solve it?

Thank you in advance.

 

2 REPLIES

Re: Why does Spark code work fine in local mode but return null/0 values in cluster mode?

Rising Star

Hi Manus,

 

I can see that you submitted the job in client mode. Could you please let us know whether you are seeing any exceptions during execution?

 

spark-submit --class ey.profiler.sparkProfilerClient --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.0.0-78.jar --deploy-mode client DataProfiling.jar

 

For cluster mode, it should be: --deploy-mode cluster

 

The result of the job is printed as part of the driver log. In cluster mode the driver runs inside the YARN ApplicationMaster, so the output should be available in the ApplicationMaster container logs.
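To confirm which mode a run actually used, and to find the right logs afterwards, the driver can print its own application ID, master, and deploy mode. The following is only a minimal sketch: the object name `ShowRunMode` is hypothetical, and it assumes the job is launched through spark-submit so that the master is supplied on the command line.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, not part of the original job.
object ShowRunMode {
  def main(args: Array[String]): Unit = {
    // The master is expected to come from spark-submit; it is not set here.
    val spark = SparkSession.builder().appName("ShowRunMode").getOrCreate()
    val sc = spark.sparkContext

    // In client mode these lines appear on the submitting terminal.
    // In YARN cluster mode they go to the driver (ApplicationMaster) container
    // log, which can be fetched with: yarn logs -applicationId <applicationId>
    println(s"applicationId = ${sc.applicationId}")
    println(s"master        = ${sc.master}")
    println(s"deployMode    = ${sc.deployMode}")

    spark.stop()
  }
}
```

Adding the same three println calls at the start of the profiling job would show in its output whether it ran with --deploy-mode client or cluster.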

 

Could you please share the actual output so that we can validate further?

 

Thanks

Jerry

 


Re: Why does Spark code work fine in local mode but return null/0 values in cluster mode?

Expert Contributor

@Jerry 

Thank you for your reply.

When I submit the job, it does not throw any exception.

 

And yes, I have also tried running the same code by pasting it line by line into the Spark shell. The Spark shell runs with master=yarn by default, so there the code also runs in a distributed fashion.

Unfortunately, I observed the same behaviour in the Spark shell as well.
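One way to narrow this down is to run the same kind of profiling on a small in-memory DataFrame under both local and YARN masters. If that gives identical results in both, the aggregation logic is probably fine and the difference more likely lies in how sourceData is read. The sketch below does not identify the cause. The object name `ProfileSketch`, its helper functions, and the sample data are hypothetical. It computes every metric in a single agg call and casts Min and Max to string so that rows for differently typed columns can be unioned.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, countDistinct, lit, max, min, when}

// Hypothetical diagnostic object, not the original profiler.
object ProfileSketch {
  // One aggregation per column. Note that countDistinct ignores nulls, unlike
  // select(column).distinct().count(), which counts null as a value.
  def profileColumn(df: DataFrame, column: String): DataFrame =
    df.agg(
      countDistinct(col(column)).alias("UniqueCount"),
      min(col(column)).cast("string").alias("Min"),
      max(col(column)).cast("string").alias("Max"),
      // when(...) yields null for non-matching rows, and count skips nulls.
      count(when(col(column).isNull, 1)).alias("NullCount")
    ).withColumn("Column_Name", lit(column))

  // Union of the per-column rows; assumes df has at least one column.
  def profile(df: DataFrame): DataFrame =
    df.columns.map(profileColumn(df, _)).reduce(_ union _)

  def main(args: Array[String]): Unit = {
    // The master is expected to come from spark-submit or spark-shell.
    val spark = SparkSession.builder().appName("ProfileSketch").getOrCreate()
    import spark.implicits._

    // Made-up sample data with one null, used only for the comparison.
    val sample = Seq[(Option[Int], String)](
      (Some(1), "a"), (None, "b"), (Some(3), "b")
    ).toDF("id", "name")

    profile(sample).show()
    spark.stop()
  }
}
```

If the sample profiles correctly on YARN, calling `ProfileSketch.profile(sourceData)` and `sourceData.explain(true)` in the same session would be a reasonable next step to compare the plans.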

 
