Record count and Duplicate check - using Spark

Hi

I have a two-part question.

The source files are in HDFS, as plain CSV or text files.

Part 1: Record counts.

I have a huge file with more than 50 million records. We receive one such file daily, and it averages about 50 GB.

Is there an efficient way to get the record count? I do not want to use wc -l. I would like to do this in native Hadoop or Spark so that I can use the Hadoop cluster's resources.

Part 2: Duplicate Check.

For the same file, is there a way to check for duplicates based on key fields? I would like to find out whether the file contains duplicate records and, if it does, capture them separately (this part is optional).

Right now we have a solution that runs for a very long time (more than 30 minutes), and I would like to know whether there are other options, such as native MapReduce or Spark.

Re: Record count and Duplicate check - using Spark

@karthick baskaran For Part 1 (record counts): a simple rdd.count() or df.count() should give you the record count.

For Part 2 (duplicate check): you could load the data into a DataFrame and run distinct on it, or use dropDuplicates [https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/DataFrame.html#dropDuplicates()]. Note that distinct compares whole rows, while dropDuplicates can be restricted to the key columns.
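To make the DataFrame suggestion concrete, here is a minimal sketch for spark-shell on Spark 1.6.x, where `sc` and `sqlContext` are already defined. The three-column layout (`id`, `date`, `amount`), the choice of `id` and `date` as key fields, and the commented-out HDFS path are all assumptions for illustration, not details from the question. A few in-memory lines stand in for the real file so the snippet can be pasted as is.

```scala
// spark-shell (Spark 1.6.x): `sc` and `sqlContext` are predefined by the shell.
import sqlContext.implicits._

// Sample lines stand in for the real file. In practice this would be something like
// sc.textFile("hdfs:///path/to/daily_file.csv")  (hypothetical path).
val lines = sc.parallelize(Seq(
  "1,2016-01-01,10",
  "2,2016-01-01,20",
  "1,2016-01-01,30",   // same key (id, date) as the first line
  "3,2016-01-02,40"
))

// Assumed layout: id,date,amount. The column names are illustrative only.
val df = lines
  .map(_.split(",", -1))
  .map(f => (f(0), f(1), f(2)))
  .toDF("id", "date", "amount")

// Part 1: record count.
val totalRecords = df.count()

// Part 2: keys that occur more than once.
val dupKeys = df.groupBy("id", "date").count().filter($"count" > 1)

// Optional: capture the full duplicate records separately.
val dupRecords = df.join(dupKeys.select("id", "date"), Seq("id", "date"))

// One row kept per key; which row survives is not guaranteed.
val deduped = df.dropDuplicates(Seq("id", "date"))
```

If the same DataFrame feeds several of these actions, calling `df.cache()` first may avoid re-reading the file, at the cost of executor memory.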

Re: Record count and Duplicate check - using Spark

Hi Sandeep Nemuri,

Thanks for the answer, I will try this. But does the data have to be in an RDD, or can I use text files in HDFS directly? Can I do the record count and duplicate check on the files themselves, or with DataFrames only, instead of using both RDDs and DataFrames? We are going to have a large number of files and a large data volume, so performance is very important. Could you comment on that, please?

Re: Record count and Duplicate check - using Spark

@karthick baskaran

Here is the command to get the number of lines in a file. Spark loads the text file and holds it internally as an RDD/DataFrame/Dataset, so you can point it straight at the file.

spark-shell (spark 1.6.x)
scala> val textFile = sc.textFile("README.md")
scala> textFile.count() // Number of items in this RDD
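The same RDD can also cover the duplicate check without going through a DataFrame. The following is only a sketch under assumptions: it is meant for spark-shell where `sc` is predefined, it assumes comma-separated lines whose first two fields form the key, and the HDFS path in the comment is hypothetical. Sample lines replace the file so it runs as pasted.

```scala
// spark-shell: `sc` is predefined by the shell.
// Sample lines stand in for sc.textFile("hdfs:///path/to/daily_file.csv") (hypothetical path).
val lines = sc.parallelize(Seq(
  "1,2016-01-01,10",
  "2,2016-01-01,20",
  "1,2016-01-01,30",   // same key as the first line
  "3,2016-01-02,40"
))

// Part 1: record count, one record per line.
val recordCount = lines.count()

// Key each line by its first two fields (assumed to be the key fields).
val keyed = lines.map { line =>
  val f = line.split(",", -1)
  ((f(0), f(1)), line)
}

// Part 2: count occurrences per key and keep the keys seen more than once.
val keyCounts = keyed.mapValues(_ => 1L).reduceByKey(_ + _)
val duplicateKeys = keyCounts.filter { case (_, n) => n > 1 }

// Optional: pull out the full lines that share a duplicated key.
val duplicateRecords = keyed.join(duplicateKeys).map { case (_, (line, _)) => line }
```

`reduceByKey` combines counts on each partition before shuffling, so only one count per key per partition crosses the network. The duplicate lines could then be written out with `duplicateRecords.saveAsTextFile(...)` to a directory of your choosing.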

Re: Record count and Duplicate check - using Spark

@Sandeep Nemuri Thanks for the details. Do you know how fast this is compared to a row count on a Hive table (Hive on Tez)? I'm mainly concerned about performance.