Apache Spark take Action on Executors in fully distributed mode

I am new to Spark and have a basic idea of how transformations and actions work (guide). I am trying some NLP operations on each line (basically paragraphs) of a text file. After processing, the result should be sent to a server (REST API) for storage. The program runs as a Spark job (submitted using spark-submit) on a cluster of 10 nodes in YARN mode. This is what I have done so far.

...
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("data.txt");

// Transformation: lazily defines the per-line processing
JavaRDD<String> processedLines = lines
    .map(line -> {
        // processed here
        return result;
    });

// Action: triggers the whole job
processedLines.foreach(line -> {
    // Send to server
});

This works, but the foreach loop seems sequential; it does not appear to be running in distributed mode on the worker nodes. Am I correct?
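
To check where the loop actually runs, I considered logging the hostname inside the foreach; if it is distributed, the output should show up with different hostnames in the executor logs rather than on the driver's console:

// Quick diagnostic: print the host that handles each element.
processedLines.foreach(line -> {
    String host = java.net.InetAddress.getLocalHost().getHostName();
    System.out.println("Handled on " + host);
});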

I tried the following code, but it fails with the error: java: incompatible types: inferred type does not conform to upper bound(s). Obviously it is wrong, because map is a transformation, not an action, so the second lambda has to return a value.

lines.map(line -> { /* processing */ })
     .map(line -> { /* Send to server */ }); // error: this lambda returns nothing

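For what it's worth, the error goes away if the second lambda returns something. Here process() and sendToServer() are hypothetical helpers of mine, with sendToServer() posting one line and returning the HTTP status code:

JavaRDD<Integer> statuses = lines
    .map(line -> process(line))        // NLP step, returns the processed line
    .map(line -> sendToServer(line));  // compiles now: the lambda returns a status code
statuses.count(); // some action is still required, or nothing runs at all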

I also tried take(), but it requires an int argument while processedLines.count() returns a long.

processedLines.take(processedLines.count()).forEach(pl -> { /* Send to server */ }); // error: count() returns long, take() expects int

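A cast makes it compile, but if I understand correctly take() materializes the results as a List on the driver, so the loop below would still be a plain local loop:

// Compiles, but everything is pulled back to the driver first.
java.util.List<String> collected = processedLines.take((int) processedLines.count());
collected.forEach(pl -> { /* Send to server */ });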

The data is huge (greater than 100 GB). What I want is for both the processing and the sending to the server to happen on the worker nodes. The processing part in the map definitely takes place on the worker nodes. But how do I send the processed data from the worker nodes to the server, given that the foreach seems to be a sequential loop running in the driver (if I am correct)? Simply put: how do I execute the action on the worker nodes rather than in the driver program?
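
From what I have read, foreachPartition might be what I am after, since each executor could then open its own connection and push its share of the data. Below is a rough sketch of what I have in mind; the endpoint URL is a placeholder and the one-request-per-line protocol is just an assumption for illustration:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Each partition is processed on the executor that owns it, so these
// HTTP calls would run on the worker nodes rather than in the driver.
processedLines.foreachPartition(iter -> {
    HttpClient client = HttpClient.newHttpClient(); // one client per partition
    while (iter.hasNext()) {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://myserver/api/store")) // placeholder endpoint
                .POST(HttpRequest.BodyPublishers.ofString(iter.next()))
                .build();
        client.send(request, HttpResponse.BodyHandlers.discarding());
    }
});

Is something like this the right way to make the "action" part run on the workers, or is there a more idiomatic approach?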

Any help will be highly appreciated.