Created 06-20-2016 02:39 PM
I just want to use hdfs.open in my PySpark shell, but I get the following error:
Does anyone have an idea? In plain Python it works: I can use the hdfs.open function. In PySpark, however, I cannot reach the NameNode. I do not understand why it works in Python but not in PySpark.
Python 2.7 (Anaconda 4), Spark 1.6.0, Hadoop 2.4 (installed with Ambari)
I also asked this on Stack Overflow: Stackoverflow-Python-Pydoop-Hdfs
16/06/20 16:11:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hdfsBuilderConnect(forceNewInstance=0, nn=xipcc01, port=8020, kerbTicketCachePath=(NULL), userName=(NULL)) error:
java.io.IOException: No FileSystem for scheme: hdfs
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2644)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
        at org.apache.hadoop.fs.FileSystem$1.run(FileSystem.java:160)
        at org.apache.hadoop.fs.FileSystem$1.run(FileSystem.java:157)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:157)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/cloud/anaconda2/lib/python2.7/site-packages/pydoop/hdfs/__init__.py", line 121, in open
    fs = hdfs(host, port, user)
  File "/home/cloud/anaconda2/lib/python2.7/site-packages/pydoop/hdfs/fs.py", line 150, in __init__
    h, p, u, fs = _get_connection_info(host, port, user)
  File "/home/cloud/anaconda2/lib/python2.7/site-packages/pydoop/hdfs/fs.py", line 64, in _get_connection_info
    fs = core_hdfs_fs(host, port, user)
  File "/home/cloud/anaconda2/lib/python2.7/site-packages/pydoop/hdfs/core/__init__.py", line 57, in core_hdfs_fs
    return _CORE_MODULE.CoreHdfsFs(host, port, user)
RuntimeError: (255, 'Unknown error 255')
Created 07-05-2016 07:54 AM
Hi Lukas,
PySpark's SparkContext (sc) has its own methods to read data from HDFS: sc.textFile(...), sc.wholeTextFiles(...), and sc.binaryFiles(...). Why not use those? They read from HDFS and give you an RDD of the data directly. If you use these SparkContext methods, make sure your core-site.xml and hdfs-site.xml config files are in the Spark conf directory; the Spark conf directory can be pointed to any desired location via the SPARK_CONF_DIR environment variable.
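For example, a minimal sketch of that approach in the pyspark shell (the HDFS path below is a hypothetical placeholder, not from the original thread):

# sc is already defined in the pyspark shell.
# Read a CSV from HDFS into an RDD of lines; the path is a made-up example.
lines = sc.textFile("hdfs:///tmp/example.csv")
print(lines.count())   # number of lines in the file
print(lines.first())   # first line, e.g. the CSV header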
Created 06-20-2016 04:04 PM
Try adding the Spark assembly jar when launching pyspark:
pyspark --jars /usr/hdp/current/spark-client/lib/spark-assembly-<version>-hadoop2<version>.jar
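The <version> placeholders depend on your HDP build. A small convenience sketch to find the exact jar name, assuming the standard HDP layout used in the command above:

# List candidate assembly jars under the HDP Spark client lib directory.
# Adjust the directory if your layout differs.
import glob
for jar in glob.glob("/usr/hdp/current/spark-client/lib/spark-assembly-*.jar"):
    print(jar)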
Created 06-21-2016 08:22 AM
This did not work (I also added a comment to my question with further information).
Created 06-21-2016 08:21 AM
This is what I am executing with Pydoop in Jupyter:
import pydoop.hdfs as hdfs
file_X_train = hdfs.open("/path../.csv")
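For reference, a minimal self-contained sketch of the same Pydoop call; the path here is a hypothetical placeholder, since the original post truncates it:

import pydoop.hdfs as hdfs

# Open a file on HDFS via Pydoop and read its contents.
# "/tmp/example.csv" is a made-up path for illustration.
f = hdfs.open("/tmp/example.csv")
try:
    data = f.read()
    print(data[:100])  # first 100 bytes
finally:
    f.close()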
Created 06-21-2016 08:31 AM
This is the error I get: https://github.com/crs4/pydoop/issues/158. I use HDP 2.4 and Python 2.7, which is why I am asking here...