
Memory Issues while accessing files in Spark

SOLVED

Re: Memory Issues while accessing files in Spark

Explorer

 

I really appreciate all the answers you have given today! They clarify a lot! Thanks!

 

Just one final question: am I right that collect(), take(), and print() are the only functions that put load on the driver?

 

Is my understanding correct? Or is there documentation on this?

Re: Memory Issues while accessing files in Spark

Master Collaborator

If you mean which functions don't return an RDD, there are more: all of the count* functions and take* functions, first(), print() I suppose, and reduce(). Anything which conceptually should return a small value returns it to the driver. I wouldn't describe it as putting load on the driver necessarily, but it does of course return a value into memory in the driver.
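For illustration, here is a rough sketch of the distinction (the values are made up, not from the thread): transformations return a new RDD and stay on the executors, while actions return a value into driver memory.

val rdd = sc.parallelize(1 to 1000000)

val doubled = rdd.map(_ * 2)        // transformation: returns a new RDD, nothing sent to the driver
val n = doubled.count()             // action: a single Long returned to the driver
val head = doubled.first()          // action: one element returned to the driver
val total = doubled.reduce(_ + _)   // action: one reduced value returned to the driver
val all = doubled.collect()         // action: the entire dataset returned into driver memory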

Re: Memory Issues while accessing files in Spark

Master Collaborator

You can see all of the methods of RDD in http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.package and the PairRDDFunctions class. Look at what does and doesn't return an RDD.

Re: Memory Issues while accessing files in Spark

Explorer

This might not be the relevant topic, but I think this is the right audience.

I am having an issue with caching a DataFrame in Spark.

 

(step 1) I read a Hive table as a DataFrame. Let's say the count is 2.

(step 2) I cache this DataFrame.

(step 3) I add 2 additional records to the Hive table.

(step 4) I do a count on the cached DataFrame again.
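In code, the steps look roughly like this (a sketch assuming a SparkSession-style API; "my_table" and the inserted values are placeholders):

val df = spark.table("my_table")                    // (step 1) read the Hive table; count is 2
df.cache()                                          // (step 2) mark the DataFrame for caching
spark.sql("INSERT INTO my_table VALUES (3), (4)")   // (step 3) add 2 records (placeholder values)
val n = df.count()                                  // (step 4) count the cached DataFrame again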

 

If caching is working as I expect, the counts in step 1 and step 4 should both be 2. This works when I add the additional records to the table from outside the Spark application. However, it does not work when I do step 3 from within the application, and I do not understand why.

 

If I do step 3 from the same application, I get a count of 4 in step 4. But why?

 

I think I am missing something.

 

 

Re: Memory Issues while accessing files in Spark

Master Collaborator

Yes, I don't think this is related, but the quick answer is that "cache" just means "cache this thing whenever you get around to computing it", and you are adding 2 records before it is computed. Hence the count is 4, not 2.
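In other words, roughly (a sketch of the behavior; "my_table" is a placeholder):

val df = spark.table("my_table")   // the table has 2 rows at this point
df.cache()                         // only marks df for caching; nothing is computed or cached yet
// 2 more rows are inserted into the table here, before any action has run
df.count()                         // first action: the scan runs now, sees 4 rows, and caches 4 rows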

Re: Memory Issues while accessing files in Spark

Explorer

Hello srowen,

 

I am doing a count in step 1 as well, right after caching the DataFrame. So my expectation is that the DataFrame should have only 2 records even if we insert records into the table in between. If that is true, then a count on the cached DataFrame at the end should be 2, but it is 4. This is what is confusing me.

 

Thanks in advance

Re: Memory Issues while accessing files in Spark

Master Collaborator

(Please start a separate thread.) My last response explained why it's 4.


Re: Memory Issues while accessing files in Spark

New Contributor

Thanks for this clarification. I also had the same query regarding memory issues while loading data. You have cleared up my doubt about loading files from HDFS.

I have a similar question, but the source is a local server or cloud storage, and the data size is larger than the driver memory (let's say the data is 1 GB and the driver memory is 250 MB). If I run the command

val file_rdd = sc.textFile("/path or local or S3")

 

should Spark load the data, or, as you mentioned above, will it throw an exception?
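For reference, sc.textFile on its own is lazy, so that line by itself reads no data; a sketch of what does and doesn't reach the driver (paths are placeholders):

val file_rdd = sc.textFile("file:///data/big.txt")  // lazy: builds an RDD, loads nothing yet
val lines = file_rdd.count()                        // executors read the file in partitions; only a Long reaches the driver
val all = file_rdd.collect()                        // this would pull the full 1 GB into the 250 MB driver heap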

 

Also, is there a way to print the driver's available memory in the terminal?
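One way to check from the spark-shell, using standard JVM calls (a sketch, not something covered earlier in the thread):

val maxMb = Runtime.getRuntime.maxMemory / (1024 * 1024)   // the driver JVM's maximum heap, in MB
println(s"driver max heap: $maxMb MB")
println(sc.getConf.get("spark.driver.memory", "not set"))  // the configured value, if any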

 

Many Thanks,

Siddharth Saraf