I am trying to load a CSV file into an RDD using the textFile function in Zeppelin and then do a take(10). But the take does not produce any result in Zeppelin, while the same commands output rows in the SSH shell.
I have attached the file, my Zeppelin notebook and some screenshots. Can you please suggest how to resolve this error? (Hortonworks Sandbox HDP 2.5 on Microsoft Azure)
data : dodgers.zip
This could be due to lack of sufficient memory.
How did you launch the spark-shell? Is it in YARN mode or in standalone?
Also how is Zeppelin's Spark interpreter configured? YARN or Standalone?
This is the sandbox from Hortonworks, so I suppose it is standalone mode, but how do I verify that?
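One way to check is to look at `sc.master` in both environments and compare. The sketch below is only an illustration: `describe_master` is a hypothetical helper (not part of Spark) that maps the master string to a mode name, based on the usual master URL prefixes (`local`, `yarn`, `spark://`, `mesos://`).

```python
def describe_master(master):
    """Classify a Spark master URL string into a deployment mode name.

    Hypothetical helper for illustration; it only inspects the string prefix.
    """
    m = master.strip().lower()
    if m.startswith("local"):
        return "local"        # e.g. local, local[4], local[*]
    if m.startswith("yarn"):
        return "yarn"         # e.g. yarn, yarn-client, yarn-cluster
    if m.startswith("spark://"):
        return "standalone"   # Spark standalone cluster manager
    if m.startswith("mesos://"):
        return "mesos"
    return "unknown"


# In the pyspark shell or in a Zeppelin %pyspark paragraph, `sc` already exists:
#     print(sc.master, "->", describe_master(sc.master))
```

Running the commented `print` line in the SSH pyspark session and again in a Zeppelin `%pyspark` paragraph should show whether the two are actually using the same mode. If they differ, that difference is a reasonable first thing to investigate.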
Spark shell: I launched PuTTY, SSH'd in as root, ran pyspark to get to Spark, and issued the command (rdd = sc.textFile(csv)). Works like a charm.
Zeppelin: exactly the same commands, using the %pyspark interpreter. Doesn't work.
My Azure VM (D12 v2) config is 4 cores, 28 GB RAM, 200 GB HDD. My local VMware sandbox has 16 GB RAM, 8 cores and 1 TB of HDD space. Will this not be enough for Zeppelin?
It is likely due to insufficient memory. You can try bumping up the memory allocated to the Sandbox and also shutting down unneeded services in the Sandbox.
Another option is to try Spark 2.1 in HDC: https://hortonworks.com/blog/try-apache-spark-2-1-zeppelin-hortonworks-data-cloud/
@vshukla here are the configs for my machines:
Azure VM (D12 v2 config): 4 cores, 28 GB RAM, 200 GB HDD.
VMware sandbox: 16 GB RAM, 8 cores and 1 TB of HDD space.
Will this not be enough for the Sandbox?
The file I am trying to load is only 1 MB. I am able to load a 5 MB txt file just fine. It only fails for CSVs.
I will try out HDC. Thanks for the link!