Mismatch in the count between reading data in delta lake format vs parquet format


Hi,

I need urgent help. I am using Delta Lake, provided by Databricks, to store the staged data from a source application.

When Apache Spark processes the data, the source data is staged as .parquet files, and the transaction log directory _delta_log is updated with a .json file that records the location of those .parquet files.

However, I have observed that if an application's data lands in 4 .parquet files but the .json files reference only 3 of them, the record counts differ when the data is read with the commands below:

spark> spark.read.format("delta").load("/path/applicationname") // Shows a lower count for the application

VS

spark> spark.read.format("parquet").load("/path/applicationname") // Shows the count across all 4 parquet files, including the one missing from the transaction log

So, according to Delta Lake, that one .parquet file doesn't exist. However, it actually exists.
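To see exactly which files account for the difference, one option is to compare the files each reader scans. The following is a minimal sketch, not a confirmed diagnosis: it assumes a SparkSession with the Delta Lake library on the classpath, the helper names `fileName`, `unreferenced`, and `filesMissingFromLog` are hypothetical, and `Dataset.inputFiles` is only a best-effort listing of the files a DataFrame is composed of.

```scala
import org.apache.spark.sql.SparkSession

// Strip the directory part so that paths reported with different URI
// schemes (e.g. "dbfs:/..." vs "file:/...") can still be compared by name.
def fileName(path: String): String = path.substring(path.lastIndexOf('/') + 1)

// Files present in `allFiles` but absent from `referenced`, compared by file name.
def unreferenced(allFiles: Seq[String], referenced: Seq[String]): Seq[String] = {
  val known = referenced.map(fileName).toSet
  allFiles.filterNot(f => known.contains(fileName(f))).sorted
}

// Compare what the Delta reader scans (files referenced by _delta_log)
// with what the plain Parquet reader scans (every data file in the directory).
def filesMissingFromLog(spark: SparkSession, tablePath: String): Seq[String] = {
  val viaDelta   = spark.read.format("delta").load(tablePath).inputFiles.toSeq
  val viaParquet = spark.read.format("parquet").load(tablePath).inputFiles.toSeq
  unreferenced(viaParquet, viaDelta)
}
```

In a spark-shell session this could be called as `filesMissingFromLog(spark, "/path/applicationname").foreach(println)`; any file it prints is one that the Parquet reader counts and the Delta reader does not.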

Issue

This prevents the correct data from being captured in the target DB, which in turn affects the downstream analysis.

My Questions

  • Why does this issue happen? Why does Delta Lake fail to write the location of the .parquet file to the transaction log?

  • How can I fix this issue? I have seen that if I change the target path where the data for the same application is captured and re-process it, the issue goes away 9 times out of 10. But I cannot keep changing the target path, and that is not a clean solution either.

Please let me know if you need any additional information.

Thanks and Regards,
Sudhindra

2 REPLIES


Please clarify which Cloudera or Hortonworks platform you are using. It is hard to suggest a next step without this context.

If neither of these platforms is involved, this may not be the best place to ask the question.

----

Sidenote: If you have an urgent functional question, the generally recommended approach is to contact your account team.


- Dennis Jaheruddin



You should run the VACUUM command first (this deletes data files that are no longer referenced by the transaction log and are older than the retention period).

After that, both of the statements above should produce the same results.
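This suggestion could be tried out roughly as follows. It is a sketch under assumptions rather than a confirmed fix: it assumes the `io.delta.tables.DeltaTable` API from the Delta Lake library is available, and `Counts` and `vacuumAndCompare` are hypothetical names. Note that `vacuum()` with no argument uses the default retention threshold (7 days unless configured otherwise), so an unreferenced file newer than that stays on disk and the two counts may still differ.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

// Row counts seen by the two readers; `surplus` is the number of rows
// the plain Parquet reader sees beyond what the Delta reader sees.
case class Counts(delta: Long, parquet: Long) {
  def surplus: Long = parquet - delta
}

def vacuumAndCompare(spark: SparkSession, tablePath: String): Counts = {
  // Delete data files not referenced by the current table state and
  // older than the default retention threshold.
  DeltaTable.forPath(spark, tablePath).vacuum()

  // Re-read the same path both ways and compare the counts.
  val deltaCount   = spark.read.format("delta").load(tablePath).count()
  val parquetCount = spark.read.format("parquet").load(tablePath).count()
  Counts(deltaCount, parquetCount)
}
```

A `surplus` of 0 after the call would mean both readers now agree; a positive value would point to unreferenced files that are still within the retention window.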