
Count mismatch while using the parquet file in Spark SQLContext and HiveContext

Rising Star

Hi,

I have developed a simple Java Spark application that fetches data from MongoDB into HDFS on an hourly basis.

The data is stored in Parquet format. Once the data is in HDFS, the actual testing begins.

I am taking a simple row count, but it differs between two scenarios. Is it possible to get different counts here?

Code:

import org.apache.spark.sql.hive.HiveContext

val hivecontext = new HiveContext(sc)
val parquetFile = hivecontext.parquetFile("/data/daily/2016-08-11_15_31_34.995/*")
parquetFile.count

Result :

4030

When I extend the above code and use the registerTempTable method, the count differs.

Code:

import org.apache.spark.sql.hive.HiveContext
val hivecontext = new HiveContext(sc)
val parquetFile = hivecontext.parquetFile("/data/daily/2016-08-11_15_31_34.995/*")
parquetFile.registerTempTable("ParquetTable")
val ParquetResult = hivecontext.sql("select count(distinct Id) from ParquetTable")
ParquetResult.show

Result:

4026

So there is a difference between the direct count and the count obtained through the registered temp table.

I am confused about why the counts mismatch. What could be the reason for the difference?
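For a like-for-like comparison, a plain row count can also be run through the same temp table; a minimal sketch, assuming the same session and the "ParquetTable" registered above:

// Plain row count through the temp table, for comparison with the direct count
val RowCountResult = hivecontext.sql("select count(*) from ParquetTable")
RowCountResult.show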

Note:

It's a simple Java Spark application that extracts data from MongoDB to HDFS.
There are no intermediate transformations added in the code.

Regards,

Vijay Kumar J

10 REPLIES

Master Guru

First, find out how many records are actually in the files to see which query is wrong. Then look at metastore pruning. The inconsistency is probably related to the metastore or to caching.
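A minimal sketch of that check, using the same path and Id column from the question: count(distinct Id) ignores NULL values and counts each duplicate only once, so comparing it with the plain row count and the non-null Id count on the same data should narrow down whether nulls, duplicates, or the read path itself accounts for the gap.

import org.apache.spark.sql.hive.HiveContext

val hivecontext = new HiveContext(sc)
val parquetFile = hivecontext.parquetFile("/data/daily/2016-08-11_15_31_34.995/*")
parquetFile.registerTempTable("ParquetTable")

// Compare raw rows, non-null Ids, and distinct Ids over the same data
hivecontext.sql("select count(*) as total_rows, count(Id) as non_null_ids, count(distinct Id) as distinct_ids from ParquetTable").show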