I have a Spark job whose supporting data is built at the start by reading some large Excel multi-tab spreadsheets (Excel native format, read through Java/POI). I have 2 problems:
1. How do I construct an RDD from the data produced by Java/POI?
2. How can I parallelize the operation?
I've attached a screen shot of the relevant Spark code.
Thanks Scott for posting this. Just a little bit more info: the application generates a list of fairly large Java POJOs by running through some deeply nested loop logic. We need to call a method that invokes up to thousands of database queries for each of those POJOs and ultimately writes results to a message queue, so we would like to distribute that work among the cluster nodes using Spark. We do not necessarily need the output of each operation, just that the operations occur. So another way to put the two main questions:
1. What is the best way to "add" objects to an RDD programmatically? (We are currently trying "parallelize".)
2. Once we have an RDD, what type of transformation or filter fits the goal of "send off a bunch of threads to do their work and just tell me when they're done"?
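The two questions above can be sketched together in one small driver program. This is only a hedged illustration, not the poster's actual code: `MyPojo` and `doDatabaseWork` are placeholder names, and the block assumes Spark's Java API is on the classpath.

```java
import java.io.Serializable;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Placeholder bean; anything shipped to executors must be Serializable.
class MyPojo implements Serializable {
    final long id;
    MyPojo(long id) { this.id = id; }
}

public class DistributeWork {
    public static void run(List<MyPojo> pojos) {
        SparkConf conf = new SparkConf().setAppName("distribute-pojo-work");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Question 1: parallelize() turns a driver-side list into an RDD.
            JavaRDD<MyPojo> rdd = sc.parallelize(pojos);
            // Question 2: foreach() is an action that runs side-effecting work
            // on the executors, returns nothing to the driver, and blocks
            // until all tasks have finished.
            rdd.foreach(pojo -> doDatabaseWork(pojo));
        }
    }

    private static void doDatabaseWork(MyPojo pojo) {
        // Placeholder: run the queries and publish results to the message queue.
    }
}
```

Because `foreach` is an action rather than a transformation, the call itself serves as the "tell me when they're done" signal: when it returns, every element has been processed.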
One approach that comes to mind is this:
Otherwise you could pre-process the Excel file in the client driver JVM, create a single-element collection, parallelize it, and join it with the destination RDD. However, then you'll be single-threaded when it comes to Excel processing via POI.
Ingest with NiFi, have NiFi send to Kafka, and then read with Spark Streaming, which you can configure to use a lot of processors.
Up your number of cores to as many as you have available: if you have 10 nodes with 16 cores per server, you could assign 160 cores.
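For a Spark standalone cluster, that core count is set at submit time. A hedged config sketch, assuming the 10-node x 16-core example above (the master URL, class name, and jar are placeholders):

```shell
# Hypothetical numbers: 10 workers x 16 cores each = 160 total cores.
spark-submit \
  --master spark://master:7077 \
  --total-executor-cores 160 \
  --executor-cores 16 \
  --class com.example.DistributeWork \
  my-job.jar
```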
A few people had suggestions: http://stackoverflow.com/questions/35743190/reading-large-number-of-excel-files-into-apache-spark
An example Spark program reading Excel: https://github.com/je-nunez/apache_spark/blob/master/src/main/scala/Excel2RDD.scala
Thanks, guys, for the tips. @Clukasik: your last paragraph seems to be exactly what I tried. Processing the Excel up front single-threaded is fine; that only takes 15-20 seconds. Can you tell me how your suggestion differs from what Scott posted in the screenshot (i.e., what we're doing now)?
Just to clarify for any others trying to help: the number of Excel rows is not the issue (we read those at the beginning). It is the code-generated tasks (many per Excel row) that are the bottleneck.
I zipped up three Java class files that I think should get you going in the right direction: the main class, TestBean, and TestFunction.
Since you don't care about the results for each of the inputs sent to the function, I used a .foreach() instead of a .map(), which would attempt to collect results.

archive.zip
I only ran this in my IDE environment, so local, not distributed. You shouldn't need to be explicit about the number of partitions when you parallelize your bean list, and don't forget to strip off the .master("local") when instantiating the SparkConf.
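For readers without the attachment, the three classes might look roughly like this. This is a hedged reconstruction, not the contents of archive.zip; it assumes Spark's Java API and uses SparkConf.setMaster (the actual method name for setting the master on a SparkConf):

```java
import java.io.Serializable;
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

// Beans shipped to executors must be Serializable.
class TestBean implements Serializable {
    final int id;
    TestBean(int id) { this.id = id; }
}

// VoidFunction matches foreach(): side effects only, nothing returned to the driver.
class TestFunction implements VoidFunction<TestBean> {
    @Override
    public void call(TestBean bean) {
        System.out.println("processed bean " + bean.id);
    }
}

public class Main {
    public static void main(String[] args) {
        // Strip .setMaster("local") before submitting to a real cluster.
        SparkConf conf = new SparkConf().setAppName("test").setMaster("local");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // No explicit partition count needed; Spark picks a default.
            sc.parallelize(Arrays.asList(new TestBean(1), new TestBean(2)))
              .foreach(new TestFunction());
        }
    }
}
```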