Support Questions
Find answers, ask questions, and share your expertise

Can not access Namenode in Pyspark but in Python it works?


New Contributor

I just want to use hdfs.open in my PySpark shell but get the following error:

Does anyone have an idea? In plain Python it works and I can use the hdfs.open function, but in PySpark I cannot access the NameNode. I do not understand why it works in Python but not in PySpark.

Python 2.7 (Anaconda 4), Spark 1.6.0, Hadoop 2.4 (installed with Ambari)

I also asked on Stackoverflow: Stackoverflow-Python-Pydoop-Hdfs

16/06/20 16:11:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hdfsBuilderConnect(forceNewInstance=0, nn=xipcc01, port=8020, kerbTicketCachePath=(NULL), userName=(NULL)) error:
java.io.IOException: No FileSystem for scheme: hdfs
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2644)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
        at org.apache.hadoop.fs.FileSystem$1.run(FileSystem.java:160)
        at org.apache.hadoop.fs.FileSystem$1.run(FileSystem.java:157)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:157)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/cloud/anaconda2/lib/python2.7/site-packages/pydoop/hdfs/__init__.py", line 121, in open
    fs = hdfs(host, port, user)
  File "/home/cloud/anaconda2/lib/python2.7/site-packages/pydoop/hdfs/fs.py", line 150, in __init__
    h, p, u, fs = _get_connection_info(host, port, user)
  File "/home/cloud/anaconda2/lib/python2.7/site-packages/pydoop/hdfs/fs.py", line 64, in _get_connection_info
    fs = core_hdfs_fs(host, port, user)
  File "/home/cloud/anaconda2/lib/python2.7/site-packages/pydoop/hdfs/core/__init__.py", line 57, in core_hdfs_fs
    return _CORE_MODULE.CoreHdfsFs(host, port, user)
RuntimeError: (255, 'Unknown error 255')
1 ACCEPTED SOLUTION


Re: Can not access Namenode in Pyspark but in Python it works?

New Contributor

Hi Lukas,

PySpark's SparkContext (sc) has its own methods to read data from HDFS: sc.textFile(...), sc.wholeTextFiles(...), and sc.binaryFiles(...). Why not try those? They read from HDFS and directly give you back an RDD of the data. If you do use these SparkContext methods, make sure to put your core-site.xml and hdfs-site.xml config files into the Spark conf dir; note that the Spark conf dir can be set to any desired location via the SPARK_CONF_DIR environment variable.
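As a rough sketch of that setup (the paths below are assumptions for an Ambari/HDP-style layout, not confirmed for this cluster; adjust to your installation):

```shell
# Point Spark at a conf dir that contains the Hadoop client configs.
# /etc/spark/conf and /etc/hadoop/conf are assumed locations.
export SPARK_CONF_DIR=/etc/spark/conf
cp /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml "$SPARK_CONF_DIR/"

pyspark
# Then, inside the PySpark shell:
#   lines = sc.textFile("hdfs:///some/path/data.csv")      # RDD of lines
#   pairs = sc.wholeTextFiles("hdfs:///some/dir")          # RDD of (filename, content)
```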

6 REPLIES

Re: Can not access Namenode in Pyspark but in Python it works?

Try adding the Spark assembly JAR when running pyspark:

pyspark --jars /usr/hdp/current/spark-client/lib/spark-assembly-<version>-hadoop2<version>.jar
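A related thing worth checking, since libhdfs is the layer that raises "No FileSystem for scheme: hdfs": the JVM started by libhdfs must see the Hadoop jars on its CLASSPATH. A minimal sketch of the usual workaround (hedged: the `--glob` option only exists on newer Hadoop releases, and this is a general libhdfs tip, not a confirmed fix for this thread):

```shell
# libhdfs does not expand the wildcard entries that `hadoop classpath` emits,
# so export the fully expanded jar list before starting pyspark.
# On newer Hadoop releases:
export CLASSPATH=$(hadoop classpath --glob)
# On older releases without --glob, the wildcard entries must be expanded manually.

pyspark
```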

Re: Can not access Namenode in Pyspark but in Python it works?

New Contributor

This did not work (I also added a comment to my question with further information).

Re: Can not access Namenode in Pyspark but in Python it works?

New Contributor

This is what I am executing with Pydoop in Jupyter:

import pydoop.hdfs as hdfs
file_X_train = hdfs.open("/path../.csv")

Re: Can not access Namenode in Pyspark but in Python it works?

New Contributor

https://github.com/crs4/pydoop/issues/158 describes the error I get. I use HDP 2.4 and Python 2.7, which is why I am asking here...
