Member since: 01-30-2017
Posts: 10
Kudos Received: 0
Solutions: 0
10-20-2017
03:53 PM
@Sandeep Nemuri Thanks for the details. Do you know how fast this is compared to doing a row count on a Hive table (Hive on Tez)? I'm mainly concerned about performance.
10-20-2017
01:52 PM
Hi Sandeep Nemuri, thanks for the answer; I will try this. But does the data need to be in an RDD, or can I use text files in HDFS directly? Can I do the record count and duplicate check straight from the files, or only with DataFrames, instead of having both RDDs and DataFrames? We are going to have a huge number of files and a huge data volume, so performance is very important. Can you comment on that, please?
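A minimal sketch of the DataFrame-only route being asked about, assuming Spark 2.x; the HDFS path below is a placeholder, and no RDD is created by hand:

```scala
// Sketch only: read plain text files from HDFS straight into a DataFrame
// (one row per line) and count the records. The path is a placeholder.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("FileRecordCount").getOrCreate()

// spark.read.text works directly on raw text/CSV files in HDFS.
val lines = spark.read.text("hdfs:///data/incoming/daily/*.txt")
println(s"Record count: ${lines.count()}")
```

Spark still distributes the count across the cluster, so the files never need to be converted to an RDD explicitly first.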
10-19-2017
05:56 PM
Hi, I have a two-part question. The source files are in HDFS as normal CSV or text files.

Part 1: Record counts. I have a huge file with more than 50 million records; the daily files average around 50 GB. Is there an efficient way to do a record count (I do not want to use wc -l)? I would like to do this in native Hadoop or Spark so I can use the Hadoop cluster resources.

Part 2: Duplicate check. For the same file, is there a way to do a duplicate check based on key fields? I would like to find out whether the file has duplicate records and, if so, optionally capture them separately. Right now we have a solution that runs for a very long time (more than 30 minutes), and I would like to know whether there are better options such as native MapReduce or Spark.
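Not from the thread itself, but a hedged sketch of how both parts could look in Spark, assuming Spark 2.x; the paths and the key column names (key1, key2) are placeholders:

```scala
// Sketch only: record count plus a key-based duplicate check on a CSV file in HDFS.
// Paths and the key field names (key1, key2) are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CountAndDuplicateCheck").getOrCreate()
import spark.implicits._

val df = spark.read
  .option("header", "true")
  .csv("hdfs:///data/incoming/daily_file.csv")

// Part 1: record count, computed in parallel across the cluster.
println(s"Total records: ${df.count()}")

// Part 2: key combinations that occur more than once.
val dupKeys = df.groupBy("key1", "key2").count().filter($"count" > 1)

// Optionally capture the full duplicate rows separately.
val dupRows = df.join(dupKeys.select("key1", "key2"), Seq("key1", "key2"))
dupRows.write.mode("overwrite").csv("hdfs:///data/quality/duplicates")
```

The groupBy needs only a single shuffle over the key fields, which is the main cost on a file of this size; actual runtime depends on cluster capacity.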
Labels: Apache Spark
01-31-2017
09:40 PM
Hi Bpreachuk, thanks for the answer. No, we do not have the option to buy Syncsort.
01-31-2017
08:06 PM
Hi Mqureshi, I'm very new to this, so I don't know how to do it yet, but I will check some online resources and give it a try. If I struggle, I will come back and ask you for help, and if it works, I will let you know as well. Thanks.
01-31-2017
08:06 PM
But this will work only if the file on the mainframe is a normal text file, right? In my case the files are in EBCDIC format (with multiple occurrences) and contain some junk values... so can we still do this with the Sqoop connector? I did go over the details in the link and couldn't see anything related to EBCDIC files, but if you think this will work, please share more details; I'm interested in learning about it.
01-31-2017
04:46 AM
Hi, we have a huge number of mainframe files in EBCDIC format, created by mainframe systems and now stored in HDFS as EBCDIC files. I need to read these files (copybooks are available), split them into multiple files based on record type, and store them as ASCII files in HDFS.
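Not part of the original question, but one common way to approach this is the open-source Cobrix (spark-cobol) library, which parses EBCDIC data using a COBOL copybook. The sketch below assumes that library is available; the paths, the copybook location, and the RECORD_TYPE field name are placeholders:

```scala
// Sketch only: decode EBCDIC files with a copybook via Cobrix (spark-cobol),
// then split by record type and write ASCII (CSV) output back to HDFS.
// Assumes the library is on the classpath, e.g.
//   --packages za.co.absa.cobrix:spark-cobol_2.11:<version>
// Paths, the copybook location, and the RECORD_TYPE field are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("EbcdicToAscii").getOrCreate()

val df = spark.read
  .format("cobol")
  .option("copybook", "hdfs:///copybooks/customer.cpy")
  .load("hdfs:///data/mainframe/ebcdic/")

// One output directory per record type, with the rows written out as ASCII CSV.
df.write
  .mode("overwrite")
  .partitionBy("RECORD_TYPE")
  .csv("hdfs:///data/mainframe/ascii/")
```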
Labels: Apache Hadoop