question Re: Can not access Namenode in Pyspark but in Python it works? in Archives of Support Questions (Read Only)

Can not access Namenode in Pyspark but in Python it works?

lott3 — Mon, 20 Jun 2016 21:39:28 GMT

I just want to use HDFS.open in my Pyspark shell but get the following Error:

Someone got an idea ? In Python it works I can use HDFS.Open function - In Pyspark I can not access the Namenode? I do not get why it works in Python but not in Pyspark?

Python 2.7 (Anaconda 4) Spark 1.6.0 Hadoop 2.4 (Installed with Ambari)

I also asked on Stackoverflow: Stackoverflow-Python-Pydoop-Hdfs

16/06/20 16:11:40 WARN util.NativeCodeLoader: Unable to load native-hadoop 

libra                                                                                                             ry for your platform... using builtin-java classes where applicable
hdfsBuilderConnect(forceNewInstance=0, nn=xipcc01, port=8020, kerbTicketCachePat                                                                                                             h=(NULL), userName=(NULL)) error:
java.io.IOException: No FileSystem for scheme: hdfs
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:26                                                                                                             44)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651                                                                                                             )
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:268                                                                                                             7)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
        at org.apache.hadoop.fs.FileSystem$1.run(FileSystem.java:160)
        at org.apache.hadoop.fs.FileSystem$1.run(FileSystem.java:157)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInforma                                                                                                             tion.java:1709)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:157)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/cloud/anaconda2/lib/python2.7/site-packages/pydoop/hdfs/__init_                                                                                                             _.py", line 121, in open
    fs = hdfs(host, port, user)
  File "/home/cloud/anaconda2/lib/python2.7/site-packages/pydoop/hdfs/fs.py",                                                                                                              line 150, in __init__
    h, p, u, fs = _get_connection_info(host, port, user)
  File "/home/cloud/anaconda2/lib/python2.7/site-packages/pydoop/hdfs/fs.py",                                                                                                              line 64, in _get_connection_info
    fs = core_hdfs_fs(host, port, user)
  File "/home/cloud/anaconda2/lib/python2.7/site-packages/pydoop/hdfs/core/__                                                                                                             init__.py", line 57, in core_hdfs_fs
    return _CORE_MODULE.CoreHdfsFs(host, port, user)
RuntimeError: (255, 'Unknown error 255')

Re: Can not access Namenode in Pyspark but in Python it works?

jyadav — Mon, 20 Jun 2016 23:04:24 GMT

Try adding spark assembly jar while running pyspark.

pyspark --jars /usr/hdp/current/spark-client/lib/spark-assembly-<version>-hadoop2<version>.jar

Re: Can not access Namenode in Pyspark but in Python it works?

lott3 — Tue, 21 Jun 2016 15:21:40 GMT

This is what I am executing with Pydoop in Jupyter:

file_X_train
= hdfs.open("/path../.csv")
import
pydoop.hdfs as hdfs

Re: Can not access Namenode in Pyspark but in Python it works?

lott3 — Tue, 21 Jun 2016 15:22:44 GMT

This did not work (I also commented my question for further information)

Re: Can not access Namenode in Pyspark but in Python it works?

lott3 — Tue, 21 Jun 2016 15:31:59 GMT

https://github.com/crs4/pydoop/issues/158 this is the error I get - I use HDP 2.4 and Python 2.7 - This is why I am asking here...

Re: Can not access Namenode in Pyspark but in Python it works?

lott3 — Mon, 04 Jul 2016 16:20:47 GMT

https://github.com/crs4/pydoop/issues/218

Re: Can not access Namenode in Pyspark but in Python it works?

m_a_vervuurt — Tue, 05 Jul 2016 14:54:24 GMT

Hi Lukas,

PySpark's Spark Context (sc) also has it's own methods to read data from HDFS, sc.textFile(...), sc.wholeTextFile(...), sc.binaryFile(...). Why don't you try using those to read data from HDFS and you also directly get an RDD for the data you read in? However if you use these methods of the SparkContext make sure to add your core-site.xml and hdfs-site.xml config files to the Spark conf dir; and by the way the Spark Conf Dir can be set using the environment variable to any desired location SPARK_CONF_DIR.