

Need to clear doubt : About SPARK - JavaRDD

Expert Contributor

Here is what I understand about the line below:

JavaRDD<Row> SampleRDD = sc.textFile(rb.getString("DIRECTORY_PATH")).map(new Function<String, Row>()

We are fetching the file from the HDFS location and turning it into a JavaRDD.

Does that mean our file is now in memory? And does it stay there only for the duration of the program, or permanently?

Can anyone explain what this map function is doing?

2 REPLIES

Re: Need to clear doubt : About SPARK - JavaRDD

Expert Contributor

This line won't cause the job to run unless you invoke an RDD action. And the data doesn't reside in memory unless you invoke cache(). What this map does depends on what's in the function body; I think you are missing the function body in your snippet.
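
For illustration, a minimal sketch of what the complete statement might look like (the comma split and the two-column Row are made-up examples, and sc and rb are the JavaSparkContext and the resource-bundle/config object from your post):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

// Transformation only: nothing is read or computed yet (lazy evaluation).
JavaRDD<Row> SampleRDD = sc.textFile(rb.getString("DIRECTORY_PATH"))
        .map(new Function<String, Row>() {
            @Override
            public Row call(String line) {
                // Hypothetical body: split each line on a comma and wrap the fields in a Row.
                String[] fields = line.split(",");
                return RowFactory.create(fields[0], fields[1]);
            }
        });

// Only an action like this triggers reading the file and running the map.
long rowCount = SampleRDD.count();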


Re: Need to clear doubt : About SPARK - JavaRDD

As jzhang said, that line on its own doesn't "do" anything. It just tells Spark how to process the data from the file. It will only be executed when you call an operation that requires execution (saving results, doing a collect to see results, doing a foreach, ...).
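
For example, continuing the SampleRDD sketch from the reply above (the output path is just a placeholder), calling any of these actions would force the job to actually run:

// Each of these is an action; any one of them triggers the file read and the map:
java.util.List<Row> firstRows = SampleRDD.take(10);  // pull a small sample back to the driver
SampleRDD.saveAsTextFile("/tmp/sample_output");      // write every row out (placeholder path)
SampleRDD.foreach(row -> System.out.println(row));   // run a side effect on the executors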

Contrary to popular belief, Spark does not put everything in memory. It compiles an execution graph and executes the tasks on the data. If you stream from a file and write the output somewhere, nothing will be cached; the data simply flows through the different RDDs as the job runs. A map takes data from its parent RDD, processes it record by record, and sends the results on to its child RDDs.

However, as jzhang says, you can cache RDDs if you want to reuse them; if you only use an RDD once, this feature wouldn't make much sense anyhow. (This is also used by Spark SQL etc. to speed up table access.) So, for example, if an RDD computation is very expensive (data mining, say) and you want to take the results and do several different things with them, it makes sense to cache them (in memory or on disk) so Spark doesn't have to re-execute all the preceding RDDs each time.
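
A rough sketch of that reuse pattern, again using the SampleRDD from above (the storage level and output path are just examples):

import org.apache.spark.storage.StorageLevel;

// Mark the RDD for caching (kept in memory, spilling to disk if it doesn't fit).
SampleRDD.persist(StorageLevel.MEMORY_AND_DISK());

long total = SampleRDD.count();                  // first action: computes the RDD and fills the cache
SampleRDD.saveAsTextFile("/tmp/cached_output");  // second action: reuses the cached partitions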

Spark also does some of this caching automatically when it thinks it makes sense and has enough memory (after a shuffle, for example, since recomputing a shuffle would mean recomputing ALL the RDDs before the shuffle).
