Count mismatch while using the parquet file in Spark SQLContext and HiveContext
Labels:
- Apache Hadoop
- Apache Hive
- Apache Spark
Created 08-12-2016 07:19 PM
Hi,
I have developed a simple Java Spark application that fetches data from MongoDB to HDFS on an hourly basis.
The data is stored in Parquet format. Once the data is in HDFS, the actual testing begins.
I am taking a simple row count, but it differs between two scenarios. Is it possible to get a different count?
Code:
import org.apache.spark.sql.hive.HiveContext
val hivecontext = new HiveContext(sc)
val parquetFile = hivecontext.parquetFile("/data/daily/2016-08-11_15_31_34.995/*")
parquetFile.count
Result:
4030
Extending the above code to use the registerTempTable method, the count differs.
Code:
import org.apache.spark.sql.hive.HiveContext
val hivecontext = new HiveContext(sc)
val parquetFile = hivecontext.parquetFile("/data/daily/2016-08-11_15_31_34.995/*")
parquetFile.registerTempTable("ParquetTable")
val ParquetResult = hivecontext.sql("select count(distinct Id) from ParquetTable")
ParquetResult.show
Result:
4026
This shows a difference between the direct count and the count obtained through the registered temp table.
I am confused about why the counts mismatch. Can you explain the reason behind the difference?
Note:
It's a simple Java Spark application which extracts the data from MongoDB to HDFS. There is no intermediate transformation in the code.
Regards,
Vijay Kumar J
Created 08-12-2016 07:21 PM
Your question seems to be that count(distinct Id) != count(Id)?
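If so, the difference usually comes down to NULLs or duplicates: count(*) counts every row, count(Id) skips rows whose Id is NULL, and count(distinct Id) additionally collapses duplicate values. A quick sketch you could run against the temp table from your snippet (the path and column name are taken from your post) to see which case applies:

import org.apache.spark.sql.hive.HiveContext
val hivecontext = new HiveContext(sc)
val parquetFile = hivecontext.parquetFile("/data/daily/2016-08-11_15_31_34.995/*")
parquetFile.registerTempTable("ParquetTable")
// count(*) counts all rows, count(Id) ignores NULL Ids,
// count(distinct Id) also collapses duplicate Ids
hivecontext.sql("select count(*), count(Id), count(distinct Id) from ParquetTable").show

If count(*) is 4030 but count(Id) is 4026, the four missing rows simply have a NULL Id; if count(Id) is 4030 and count(distinct Id) is 4026, there are duplicate Ids.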
Created 08-12-2016 08:42 PM
If I count Id, the total is 4030. Whether I use distinct or not, the result should be the same, since Id does not have any duplicate records.
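A direct way to verify both possibilities (duplicate Ids and NULL Ids) on the same temp table, sketched with the path and column name assumed from the earlier snippet:

import org.apache.spark.sql.hive.HiveContext
val hivecontext = new HiveContext(sc)
hivecontext.parquetFile("/data/daily/2016-08-11_15_31_34.995/*").registerTempTable("ParquetTable")
// any Id that appears more than once (an empty result confirms there are no duplicates)
hivecontext.sql("select Id, count(*) as cnt from ParquetTable group by Id having count(*) > 1").show
// rows whose Id is NULL; count(distinct Id) does not include these
hivecontext.sql("select count(*) from ParquetTable where Id is null").show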
Created 08-12-2016 07:23 PM
Spark version: 1.6.0
Hive version: 1.2.1
Created 08-12-2016 07:25 PM
Hadoop version: 2.7.1.2.4.0.0-169
Created 08-15-2016 09:27 PM
I am seeing a similar issue too. What is going on?
Created 08-15-2016 10:41 PM
Are you using any caching?
Have you run the count in the Hive CLI or Beeline? Or Spark's Beeline? Have you looked at the files with parquet-tools? How many records should there be?
https://github.com/Parquet/parquet-mr/tree/master/parquet-tools
Also try with the default SQLContext.
And try:
SET spark.sql.hive.metastorePartitionPruning=true
It could be an issue between Spark SQL and the Hive metastore.
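For example, a rough sketch of both suggestions (the path and Id column come from the original post; the temp table name ParquetTablePlain is only for illustration):

// read the same files with the plain SQLContext instead of the HiveContext
val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
val plainDf = sqlCtx.read.parquet("/data/daily/2016-08-11_15_31_34.995/*")
plainDf.registerTempTable("ParquetTablePlain")
sqlCtx.sql("select count(*), count(distinct Id) from ParquetTablePlain").show

// on the HiveContext side, enable metastore partition pruning before re-running the query
import org.apache.spark.sql.hive.HiveContext
val hivecontext = new HiveContext(sc)
hivecontext.setConf("spark.sql.hive.metastorePartitionPruning", "true")

If the plain SQLContext returns 4030 for both counts, the mismatch is more likely on the HiveContext/metastore side.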
Created 08-16-2016 02:15 PM
As per your suggestion, I downloaded parquet-tools from GitHub and tried to package it, but it throws an error:
Failed to execute goal on project parquet-tools: Could not resolve dependencies for project com.twitter:parquet-tools:jar:1.6.0rc3-SNAPSHOT: Failure to find com.twitter:parquet-hadoop:jar:1.6.0rc3-SNAPSHOT in https://oss.sonatype.org/content/repositories/snapshots was cached in the local repository, resolution will not be reattempted until the update interval of sonatype-nexus-snapshots has elapsed or updates are forced
Please help me!
Created 08-16-2016 02:18 PM
git clone -b apache-parquet-1.8.0 https://github.com/apache/parquet-mr.git
cd parquet-mr
cd parquet-tools
mvn clean package -Plocal
You will need Java and Maven installed.
Copy target/parquet-tools-1.8.0.jar to a directory on your path, then run:
java -jar ./parquet-tools-1.8.0.jar cat myParquetFilesAreAwewsome.parquet
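The schema and meta commands are also useful here: meta prints the file footer, including the row count of each row group, which gives you a record count independent of Spark to compare against (same placeholder file name as above):

java -jar ./parquet-tools-1.8.0.jar schema myParquetFilesAreAwewsome.parquet
java -jar ./parquet-tools-1.8.0.jar meta myParquetFilesAreAwewsome.parquet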
Created 08-16-2016 03:55 PM
I have followed the same steps and am now able to read the parquet file. But how does that solve the count mismatch?
