Support Questions

vijaykumar243 · ‎08-12-2016

Hi,

I have developed a simple Java Spark application that fetches data from MongoDB into HDFS on an hourly basis.

The data is stored in Parquet format. Once the data was in HDFS, the actual testing began.

I am taking a simple row count, but the result differs between two scenarios. Is it possible to get different counts?

Code:

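import org.apache.spark.sql.hive.HiveContext

// sc is the SparkContext provided by spark-shell
val hivecontext = new HiveContext(sc)
// parquetFile still works on Spark 1.6 but is deprecated in favor of read.parquet
val parquetFile = hivecontext.parquetFile("/data/daily/2016-08-11_15_31_34.995/*")
// Counts every row in the files
parquetFile.count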

Result :

When I extend the above code to use the registerTempTable method, the count differs.

Code:

import org.apache.spark.sql.hive.HiveContext

val hivecontext = new HiveContext(sc)
val parquetFile = hivecontext.parquetFile("/data/daily/2016-08-11_15_31_34.995/*")
parquetFile.registerTempTable("ParquetTable")
val ParquetResult = hivecontext.sql("select count(distinct Id) from ParquetTable")
ParquetResult.show

Result:

This implies a difference between the direct count and the count taken through the registered temp table.

I am confused about why the counts do not match. What is the reason behind the difference?

Note:

It's a simple Java Spark application that extracts data from MongoDB to HDFS.
No intermediate transformation is added in the code.

Regards,

Vijay Kumar J

gopalv · ‎08-12-2016

Your question seems to be that count(distinct Id) != count(id) ?
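To make that distinction concrete, here is a minimal sketch of how the three SQL aggregates can disagree on the same column. It uses plain Scala collections rather than Spark, with `None` standing in for a SQL NULL. The object name, helper names, and sample values are made up for illustration and are not taken from the poster's data.

```scala
// Plain-Scala model of SQL counting semantics; no Spark required.
// None stands in for a SQL NULL in the Id column. Sample values are hypothetical.
object CountSemantics {
  // count(*): every row counts, including rows whose Id is NULL.
  def countAll(ids: Seq[Option[Int]]): Int = ids.size

  // count(Id): rows with a NULL Id are skipped.
  def countNonNull(ids: Seq[Option[Int]]): Int = ids.flatten.size

  // count(distinct Id): NULLs are skipped and duplicates collapse to one value.
  def countDistinct(ids: Seq[Option[Int]]): Int = ids.flatten.distinct.size

  def main(args: Array[String]): Unit = {
    val ids: Seq[Option[Int]] = Seq(Some(1), Some(2), Some(2), None, Some(3))
    println(s"count(*)           = ${countAll(ids)}")      // 5
    println(s"count(Id)          = ${countNonNull(ids)}")  // 4
    println(s"count(distinct Id) = ${countDistinct(ids)}") // 3
  }
}
```

If the row count and the distinct count differ on real data, the gap therefore comes from NULL Ids, duplicate Ids, or both.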

vijaykumar243 · ‎08-12-2016

@gopal

If I do that, the total number of Ids is 4030. Whether I use distinct or not, the result should be the same, since Id doesn't have any duplicate records.

vijaykumar243 · ‎08-12-2016

Spark version: 1.6.0

Hive version: 1.2.1

vijaykumar243 · ‎08-12-2016

Hadoop version: 2.7.1.2.4.0.0-169

amacudzinski · ‎08-15-2016

I am seeing a similar issue too. What is going on?

TimothySpann · ‎08-15-2016

Are you using any caching?

Have you run the count in the Hive CLI or beeline? Or in Spark beeline? Have you looked at the files with parquet-tools? How many records should there be?

https://github.com/Parquet/parquet-mr/tree/master/parquet-tools

Also try with the default SQLContext.

Try:

SET spark.sql.hive.metastorePartitionPruning=true

Could be an issue between SparkSQL and HiveMetastore.
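One way to follow these suggestions is sketched below, assuming a Spark 1.6 spark-shell where `sc` is already defined. It reads the same path through a plain SQLContext and computes the three counts side by side with the DataFrame API. The names `sqlContext`, `df`, and the column aliases are illustrative, and the `Id` column name is taken from the query earlier in the thread.

```scala
// Assumes spark-shell on Spark 1.6, where `sc` is already defined.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{count, countDistinct, lit}

// Plain SQLContext instead of HiveContext, so the Hive metastore is not involved.
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.parquet("/data/daily/2016-08-11_15_31_34.995/*")

// Three counts over the same DataFrame:
//   total_rows   -> every row, comparable to parquetFile.count
//   non_null_ids -> rows where Id is not null
//   distinct_ids -> distinct non-null Id values
df.agg(
  count(lit(1)).as("total_rows"),
  count(df("Id")).as("non_null_ids"),
  countDistinct(df("Id")).as("distinct_ids")
).show()
```

If all three numbers agree here but the HiveContext result still differs, the context or metastore becomes the more likely suspect. If they disagree, the data itself contains null or repeated Ids.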

vijaykumar243 · ‎08-16-2016

@Timothy Spann

As per your suggestion, I downloaded parquet-tools from GitHub and tried to package it, but it throws an error:

 Failed to execute goal on project parquet-tools: Could not resolve dependencies for project com.twitter:parquet-tools:jar:1.6.0rc3-SNAPSHOT: Failure to find com.twitter:parquet-hadoop:jar:1.6.0rc3-SNAPSHOT in https://oss.sonatype.org/content/repositories/snapshots was cached in the local repository, resolution will not be reattempted until the update interval of sonatype-nexus-snapshots has elapsed or updates are forced

Please help me !!!

TimothySpann · ‎08-16-2016

git clone -b apache-parquet-1.8.0 https://github.com/apache/parquet-mr.git

cd parquet-mr

cd parquet-tools

mvn clean package -Plocal

You will need Java and Maven installed.

Copy the /target/parquet-tools-1.8.0.jar to a directory in your path

java -jar ./parquet-tools-1.8.0.jar cat myParquetFilesAreAwewsome.parquet

vijaykumar243 · ‎08-16-2016

@Timothy Spann

I followed those steps and can now read the Parquet file. But how does that solve the problem?

Count mismatch while using the parquet file in Spark SQLContext and HiveContext
