Spark - SocketException: Connection reset by peer
Labels: Apache Spark
Created on 06-04-2017 09:44 PM - edited 08-17-2019 11:22 PM
I was trying a Spark scenario: I loaded a CSV file of roughly 1M records into an RDD, did a split by delimiter, and checked count(), which worked.
On the same RDD I wanted to look at some sample data and tried the action take(10), which did not work: it threw a SocketException: Connection reset by peer.
Your assistance would be of great help.
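For reference, a minimal sketch of the scenario described above (the file path and delimiter are assumptions):

```python
from pyspark import SparkContext

sc = SparkContext(appName="csv-take-sample")

# Hypothetical path and delimiter -- substitute your own.
rdd = sc.textFile("hdfs:///data/records.csv")
parts = rdd.map(lambda line: line.split(","))

print(parts.count())   # this action worked
print(parts.take(10))  # this raised SocketException: Connection reset by peer
```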
Created 06-05-2017 03:32 PM
If you are using PySpark, there appears to be a bug where PySpark crashes for large datasets:
https://issues.apache.org/jira/browse/SPARK-12261
Since you are just trying to see sample data, you could use collect() and then print.
However, collect() should not be used for large datasets, as it brings all the data to the driver node and can make the driver run out of memory.
This link gives details on how to print RDD elements using Scala.
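A sketch of that workaround in PySpark, continuing the hypothetical example above:

```python
# collect() brings every element of the RDD back to the driver;
# acceptable for a small sample job, dangerous for truly large datasets.
rows = parts.collect()
for row in rows[:10]:
    print(row)
```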
Created 06-05-2017 06:06 PM
I tried creating the RDD, calling collect(), and printing the elements with a for loop. It worked fine.
I was trying this in PySpark, though.
Thank you.
