I have a two-part question:
The source files are in HDFS; they are plain CSV or text files.
Part 1: Record counts.
I have a huge file with more than 50 million records; the total size averages around 50 GB (these are daily files).
Is there an efficient way to get the record count? (I do not want to run wc -l.) I would like to do this in native Hadoop or Spark so I can utilize the cluster's resources.
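For context, this is roughly what I had in mind for the count: an untested sketch typed into spark-shell, where the HDFS path is just a placeholder for our real daily file location.

```scala
// spark-shell already provides a SparkSession named `spark`.
// Placeholder path; the real daily files live elsewhere.
val lines = spark.read.textFile("hdfs:///data/daily/input.csv")

// count() runs as a distributed job, so the work is spread across the
// cluster instead of streaming 50 GB through a single wc -l process.
println(s"Record count: ${lines.count()}")
```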
Part 2: Duplicate Check.
For the same file mentioned above, is there a way to do a duplicate check based on key fields? I would like to find out whether the file has duplicate records, and if so, capture them separately (this part is optional).
Right now we have a solution that runs for a very long time (more than 30 minutes), and I would like to know whether there are faster options such as native MapReduce or Spark.
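For Part 2, this is the kind of Spark approach I am imagining, again an untested spark-shell sketch: `key1` and `key2` are placeholder key fields, the paths are made up, and I am assuming the files have a header row.

```scala
import org.apache.spark.sql.functions.col

// Assumes a header row; the real files may need an explicit schema instead.
val df = spark.read.option("header", "true").csv("hdfs:///data/daily/input.csv")

// Count occurrences of each key combination; count > 1 means a duplicate key.
val dupKeys = df.groupBy("key1", "key2").count().filter(col("count") > 1)

println(s"Duplicate key combinations: ${dupKeys.count()}")

// Optional part of the question: join back on the keys to capture
// the full duplicate records separately.
val duplicates = df.join(dupKeys.select("key1", "key2"), Seq("key1", "key2"))
duplicates.write.mode("overwrite").csv("hdfs:///data/daily/duplicates")
```

Is something along these lines the right direction, or is there a more efficient pattern for files of this size?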