Cannot access the NameNode in PySpark, but it works in Python?

Contributor

I just want to use hdfs.open in my PySpark shell, but I get the following error:

Does anyone have an idea? In plain Python it works and I can use the hdfs.open function, but in PySpark I cannot access the NameNode. I don't understand why it works in Python but not in PySpark.

Python 2.7 (Anaconda 4), Spark 1.6.0, Hadoop 2.4 (installed with Ambari)

I also asked on Stack Overflow: Stackoverflow-Python-Pydoop-Hdfs

16/06/20 16:11:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hdfsBuilderConnect(forceNewInstance=0, nn=xipcc01, port=8020, kerbTicketCachePath=(NULL), userName=(NULL)) error:
java.io.IOException: No FileSystem for scheme: hdfs
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2644)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
        at org.apache.hadoop.fs.FileSystem$1.run(FileSystem.java:160)
        at org.apache.hadoop.fs.FileSystem$1.run(FileSystem.java:157)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:157)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/cloud/anaconda2/lib/python2.7/site-packages/pydoop/hdfs/__init__.py", line 121, in open
    fs = hdfs(host, port, user)
  File "/home/cloud/anaconda2/lib/python2.7/site-packages/pydoop/hdfs/fs.py", line 150, in __init__
    h, p, u, fs = _get_connection_info(host, port, user)
  File "/home/cloud/anaconda2/lib/python2.7/site-packages/pydoop/hdfs/fs.py", line 64, in _get_connection_info
    fs = core_hdfs_fs(host, port, user)
  File "/home/cloud/anaconda2/lib/python2.7/site-packages/pydoop/hdfs/core/__init__.py", line 57, in core_hdfs_fs
    return _CORE_MODULE.CoreHdfsFs(host, port, user)
RuntimeError: (255, 'Unknown error 255')
1 ACCEPTED SOLUTION


Hi Lukas,

PySpark's SparkContext (sc) also has its own methods to read data from HDFS: sc.textFile(...), sc.wholeTextFiles(...), sc.binaryFiles(...). Why don't you try using those to read your data from HDFS? They also give you an RDD for the data you read in. However, if you use these SparkContext methods, make sure to add your core-site.xml and hdfs-site.xml config files to the Spark conf dir; the Spark conf dir can be set to any desired location via the SPARK_CONF_DIR environment variable.
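
For illustration, a minimal sketch of this approach; the config directory and the HDFS file path below are placeholder assumptions, not taken from the original post:

# Before launching the shell, point Spark at a conf dir that contains
# core-site.xml and hdfs-site.xml (location is an example):
#   export SPARK_CONF_DIR=/etc/spark/conf
#   pyspark
# Inside the shell, sc is the SparkContext that pyspark creates for you.
rdd = sc.textFile("hdfs:///tmp/example.csv")  # hypothetical path
print(rdd.take(5))  # take(5) is an action returning the first five lines as a list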


6 REPLIES

Super Guru

Try adding the Spark assembly jar when running pyspark:

pyspark --jars /usr/hdp/current/spark-client/lib/spark-assembly-<version>-hadoop2<version>.jar
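
If the assembly jar is picked up, a quick sanity check inside the launched PySpark shell could look like this (the HDFS path is a placeholder, not from the original posts):

import pydoop.hdfs as hdfs

f = hdfs.open("/tmp/example.csv")  # hypothetical path; any readable HDFS file works
print(f.read(100))                 # read the first 100 bytes to confirm the connection
f.close()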

Contributor

This did not work (I also added a comment to my question with further information).

Contributor

This is what I am executing with Pydoop in Jupyter:

import pydoop.hdfs as hdfs
file_X_train = hdfs.open("/path../.csv")

Contributor

This is the error I get: https://github.com/crs4/pydoop/issues/158. I use HDP 2.4 and Python 2.7, which is why I am asking here...
