
How do I construct data into a Spark RDD from a java/POI?


I have a Spark job whose supporting data is built at the start by reading some large multi-tab Excel spreadsheets (native Excel format, read through Java/POI). I have two questions:

1. How do I construct an RDD from the Java/POI data?

2. How can I parallelize the operation?

I've attached a screenshot of the relevant Spark code.

spark-code.png

5 REPLIES

Re: How do I construct data into a Spark RDD from a java/POI?

New Contributor

Thanks, Scott, for posting this. Just a little more info: the application generates a list of fairly large Java POJOs by running through some deeply nested loop logic. For each of those POJOs we need to call a method that invokes up to thousands of database queries and ultimately writes results to a message queue, so we would like to distribute that work among the cluster nodes using Spark. We do not necessarily need the output of each operation, just that the operations occur. So another way to put the two main questions:

1. What is the best way to "add" objects to an RDD programmatically? (We are currently trying "parallelize".)

2. Once we have an RDD, what type of transformation or filter fits the goal of "send off a bunch of threads to do their work and just tell me when they're done"?

Re: How do I construct data into a Spark RDD from a java/POI?

Expert Contributor

One approach that comes to mind is this:

  1. Create the RDD using a WholeFileInputFormat, so that each record holds an entire file, with ImmutableBytesWritable as the value type. This will parallelize the Excel files across the cluster (see the sketch after this list).
  2. Then, in the RDD's map call, leverage the POI library. Make sure to distribute the dependency jars (e.g. "--jars my-poi.jar").
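
A minimal sketch of that approach, assuming a custom WholeFileInputFormat is on the classpath (Hadoop does not ship one; the version from "Hadoop: The Definitive Guide" works), and using plain BytesWritable rather than the HBase ImmutableBytesWritable to avoid the extra dependency. The HDFS path is made up:

    import java.io.ByteArrayInputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.poi.ss.usermodel.Workbook;
    import org.apache.poi.ss.usermodel.WorkbookFactory;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ExcelIngest {
      public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("ExcelIngest"));

        // One whole workbook per record; the files themselves are the unit of parallelism.
        JavaPairRDD<NullWritable, BytesWritable> files = sc.newAPIHadoopFile(
            "hdfs:///data/excel/*.xlsx",
            WholeFileInputFormat.class, NullWritable.class, BytesWritable.class,
            new Configuration());

        // Parse each spreadsheet with POI on the executors (remember "--jars" for POI).
        JavaRDD<Integer> sheetCounts = files.map(pair -> {
          try (Workbook wb = WorkbookFactory.create(
              new ByteArrayInputStream(pair._2().copyBytes()))) {
            return wb.getNumberOfSheets();
          }
        });

        System.out.println("total sheets: " + sheetCounts.reduce(Integer::sum));
        sc.stop();
      }
    }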

Otherwise you could pre-process the Excel file in the client driver JVM, create a single-element collection, parallelize it, and join it with the destination RDD. However, you'll then be single-threaded when it comes to Excel processing via POI.
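
If it helps, a loose sketch of that alternative, with the driver-parsed rows stood in by hard-coded pairs; all keys, values, and names here are hypothetical:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class DriverSideParse {
      public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("DriverSideParse"));

        // Stand-in for rows produced by a single-threaded POI pass in the driver.
        JavaPairRDD<String, String> excelRows = sc.parallelizePairs(Arrays.asList(
            new Tuple2<>("sku-1", "excel row 1"),
            new Tuple2<>("sku-2", "excel row 2")));

        // Whatever keyed RDD the job already has as its destination.
        JavaPairRDD<String, String> destination = sc.parallelizePairs(Arrays.asList(
            new Tuple2<>("sku-1", "db record 1")));

        // The join ships the driver-parsed rows out alongside the cluster-side data.
        JavaPairRDD<String, Tuple2<String, String>> joined = excelRows.join(destination);
        joined.foreach(t -> System.out.println(t._1() + " -> " + t._2()));
        sc.stop();
      }
    }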

Re: How do I construct data into a Spark RDD from a java/POI?

Super Guru

Ingest with NiFi, have NiFi send to Kafka, and then read with Spark Streaming, which you can configure to use a lot of processors.
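
A rough sketch of the Spark Streaming side, assuming NiFi publishes the rows to a Kafka topic and the spark-streaming-kafka artifact is on the classpath; the broker address and topic name are made up:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    import kafka.serializer.StringDecoder;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.KafkaUtils;

    public class ExcelRowStream {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ExcelRowStream");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, String> kafkaParams = new HashMap<>();
        kafkaParams.put("metadata.broker.list", "broker1:9092"); // hypothetical broker

        // Direct stream over the topic NiFi publishes to ("excel-rows" is made up).
        JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
            jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
            kafkaParams, Collections.singleton("excel-rows"));

        // Per-record work happens on the executors; results are side effects only.
        stream.foreachRDD(rdd -> rdd.foreach(record ->
            System.out.println("got row: " + record._2())));

        jssc.start();
        jssc.awaitTermination();
      }
    }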

https://community.hortonworks.com/articles/34209/spark-16-tips-in-code-and-submission.html

Up your number of cores to as many as you have available: if you have 16 cores per server and 10 nodes, you could assign 160 cores.
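
For example, a submission along these lines (the memory setting and jar names are placeholders):

    spark-submit --master yarn \
      --num-executors 10 \
      --executor-cores 16 \
      --executor-memory 8g \
      --jars my-poi.jar \
      my-spark-job.jar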

A few people had suggestions: http://stackoverflow.com/questions/35743190/reading-large-number-of-excel-files-into-apache-spark

An example Spark program reading Excel: https://github.com/je-nunez/apache_spark/blob/master/src/main/scala/Excel2RDD.scala

Re: How do I construct data into a Spark RDD from a java/POI?

New Contributor

Thanks, guys, for the tips. @Clukasik: your last paragraph seems to be exactly what I tried. Processing the Excel up front single-threaded is fine; that only takes 15-20 seconds. Can you tell me how your suggestion differs from what Scott posted in the screenshot (i.e., what we're doing now)?

Just to clarify for any others trying to help: the number of Excel rows is not the issue (we read those at the beginning). It is the code-generated tasks (many per Excel row) that are the bottleneck.

Thanks!


Re: How do I construct data into a Spark RDD from a java/POI?

Contributor

Hey Ron,

I zipped up three Java class files that I think should get you going in the right direction: the main class, TestBean, and TestFunction.

Since you don't care about the results for each of the inputs sent to the function, I used a .foreach() instead of a .map(), which would attempt to collect results.

archive.zip

I only ran this in my IDE environment, so local, not distributed. You shouldn't need to be explicit about the number of partitions when you parallelize your bean list, and don't forget to strip off the .master("local") when instantiating the SparkConf.
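
For anyone who can't grab the archive, a hypothetical reconstruction of what those three classes might look like (the real archive.zip contents may differ):

    import java.io.Serializable;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.VoidFunction;

    // Serializable so Spark can ship the beans to the executors.
    class TestBean implements Serializable {
      private final int id;
      TestBean(int id) { this.id = id; }
      int getId() { return id; }
    }

    // The side-effecting work per bean: DB queries, message-queue writes, etc.
    class TestFunction implements VoidFunction<TestBean> {
      @Override
      public void call(TestBean bean) {
        System.out.println("processed bean " + bean.getId());
      }
    }

    public class Main {
      public static void main(String[] args) {
        // .setMaster("local[*]") is for IDE runs only; strip it for cluster submits.
        SparkConf conf = new SparkConf().setAppName("BeanProcessor").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<TestBean> beans = new ArrayList<>();
        for (int i = 0; i < 1000; i++) beans.add(new TestBean(i));

        // foreach runs for side effects only; map would ship results back to the driver.
        sc.parallelize(beans).foreach(new TestFunction());
        sc.stop();
      }
    }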
