
I want to process 20 TB of PDF files such that there is one output per input PDF file.


I want to process 20 TB of PDF files in Spark using Tika, in such a way that there is one output per input PDF file.

I am able to do it sequentially, but it takes a lot of time. When I do it in parallel (by giving the whole directory containing the PDF files as input), it takes much less time, but the output is part files containing overlapping values. Is there any way I can do it in parallel and still get one output per input?

Below is my code:

val binRDD = sc.binaryFiles("/data")

val textRDD = binRDD.map { file =>
  new org.apache.tika.Tika().parseToString(file._2.open())
}

textRDD.saveAsTextFile("/output/")
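For what it's worth, one possible way to keep one output per input is to use the file path that sc.binaryFiles already pairs with each stream, and write each parsed result yourself through the Hadoop FileSystem API instead of saveAsTextFile. A minimal, untested sketch, assuming HDFS paths, unique base file names, and that each parsed document fits in executor memory:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.tika.Tika

val binRDD = sc.binaryFiles("/data")

binRDD.foreachPartition { files =>
  // One Tika instance and one FileSystem handle per partition.
  val tika = new Tika()
  val fs = FileSystem.get(new Configuration())
  files.foreach { case (path, stream) =>
    // Parse the PDF bytes to plain text.
    val text = tika.parseToString(stream.open())
    // Name the output after the input file (assumption: base names are unique).
    val out = fs.create(new Path("/output/" + new Path(path).getName + ".txt"))
    try out.write(text.getBytes("UTF-8")) finally out.close()
  }
}

Writing through FileSystem on the executors avoids the part-file layout of saveAsTextFile, at the cost of producing many small files; whether that is acceptable for 20 TB of input depends on the individual file sizes.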
