I really appreciate all the answers you gave today! They clarify a lot! Thanks!
Just one final question - I believe collect(), take() and print() are the only functions that put load on the driver?
Is my understanding correct? Or is there any other documentation on this?
If you mean which functions don't return an RDD, there are more. All of the count* functions and take* functions, first(), print() I suppose, and reduce(). Anything that conceptually should return a small value returns it to the driver. I wouldn't necessarily describe it as putting load on the driver, but it does of course return a value into memory in the driver.
You can see all of the methods of RDD in http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.package and the PairRDDFunctions class. Look at what does and doesn't return an RDD.
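To make the distinction concrete, here is a minimal sketch, assuming a local SparkSession; the object name, app name and sample data are made up for illustration. Methods whose result type is an RDD are transformations and stay lazy, while the rest hand a plain value back to the driver.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object ActionsVsTransformations {
  // Transformations only: map and filter each return another RDD,
  // so nothing is computed or sent to the driver here.
  def multiplesOfFour(sc: SparkContext): RDD[Int] =
    sc.parallelize(1 to 1000).map(_ * 2).filter(_ % 4 == 0)

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch runs without a cluster.
    val spark = SparkSession.builder()
      .appName("actions-vs-transformations")
      .master("local[*]")
      .getOrCreate()

    val rdd = multiplesOfFour(spark.sparkContext)

    // Actions: each one runs a job and returns a value into driver memory.
    val total: Long = rdd.count()
    val head: Int = rdd.first()
    val firstFive: Array[Int] = rdd.take(5)
    val sum: Int = rdd.reduce(_ + _)
    // collect() brings back every element, so it is the one to watch on large data.
    val everything: Array[Int] = rdd.collect()

    println(s"count=$total first=$head take(5)=${firstFive.mkString(",")} sum=$sum collected=${everything.length}")
    spark.stop()
  }
}
```

Only collect() scales with the size of the data here; count(), first(), take(n) and reduce() return small values regardless of how large the RDD is.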
This might not be the relevant topic, but I think these are the right people to ask.
I am having an issue with caching a DataFrame in Spark.
(step 1). I am reading a Hive table as a DataFrame. Let's say the count is 2.
(step 2). I am caching this dataframe.
(step 3). I am adding 2 additional records to the hive table.
(step 4). I am doing count on the cached dataframe again.
If caching works the way I expect, the count in step 1 and step 4 should both be 2. This works when I add the additional records to the table from outside the Spark application. However, it does not work if I do step 3 from within the application. I AM NOT UNDERSTANDING WHY.
If I do step 3 from the same application, I get a count of 4 in step 4. But why??
I think I am missing something.
Yes, I don't think this is related, but the quick answer is that "cache" just means "cache this thing whenever you get around to computing it", and you are adding 2 records before it is computed. Hence count is 4, not 2.
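The laziness of cache() can be seen in a small sketch. This is not the Hive setup from the question: it assumes a local SparkSession and uses a plain RDD with an accumulator (all names here are made up) to count how often the rows are actually computed. DataFrame.cache() is lazy in the same way.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object LazyCacheSketch {
  // Returns how many rows had been computed (a) right after cache(),
  // (b) after the first count(), and (c) after a second count().
  def evaluationCounts(sc: SparkContext): (Long, Long, Long) = {
    val evaluations = sc.longAccumulator("evaluations")
    // Two stand-in "records"; the map bumps the accumulator each time a row is computed.
    val rdd = sc.parallelize(Seq("a", "b")).map { row => evaluations.add(1); row }

    rdd.cache()                       // only marks the RDD for caching
    val afterCache: Long = evaluations.value

    rdd.count()                       // first action: computes the rows and fills the cache
    val afterFirstCount: Long = evaluations.value

    rdd.count()                       // should be served from the cache, not recomputed
    val afterSecondCount: Long = evaluations.value

    (afterCache, afterFirstCount, afterSecondCount)
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch runs without a cluster.
    val spark = SparkSession.builder()
      .appName("lazy-cache-sketch")
      .master("local[*]")
      .getOrCreate()
    println(evaluationCounts(spark.sparkContext))
    spark.stop()
  }
}
```

In a local run the first number should be 0, because nothing is computed until the first action. Anything that changes the source between cache() and that first action is therefore picked up.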
I am doing a count in step 1 as well (right after caching the DataFrame). So my expectation is that the DataFrame should have only 2 records even if we insert records into the table in between. If that is true, then the count on the cached DataFrame at the end should be 2, so why is it 4? This is what is confusing me.
Thanks in advance
Thanks for this clarification. I had the same query regarding memory issues while loading data. Here you cleared up my doubt about loading files from HDFS.
I have a similar question, but the source is a local server or cloud storage where the data size is larger than the driver memory (let's say 1 GB, with a driver memory of 250 MB). If I run the command
val file_rdd = sc.textFile("/path or local or S3")
should Spark load the data or, as you mentioned above, will it throw an exception?
Also, is there a way to print the driver's available memory in the terminal?
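On the last point, one possible approach is to ask the JVM directly, since the driver is an ordinary JVM process. A minimal sketch using only the standard Runtime API (the object and method names are made up); the same expressions can be typed into spark-shell, where they run in the driver:

```scala
object DriverMemory {
  private val Mb: Long = 1024L * 1024L

  // Upper bound on the heap this JVM will try to use.
  def maxHeapMb: Long = Runtime.getRuntime.maxMemory / Mb

  // Heap currently in use: what the JVM has reserved minus what is still free in it.
  def usedHeapMb: Long = {
    val rt = Runtime.getRuntime
    (rt.totalMemory - rt.freeMemory) / Mb
  }

  def main(args: Array[String]): Unit =
    println(s"driver heap: used ${usedHeapMb} MB of max ${maxHeapMb} MB")
}
```

These figures describe only the JVM where the code runs, so they say nothing about executor memory.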