Created 02-23-2018 06:15 PM
Hi all,
I installed an HDP cluster via Cloudbreak and am trying to run a simple Spark job. I open the pyspark shell and run the following:
ip = "adl://alenzadls1.azuredatalakestore.net/path/to/my/input/directory"
input_data = sc.textFile(ip)
for x in input_data.collect():
    print x
The collect() call fails with an error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/current/spark-client/python/pyspark/rdd.py", line 771, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/hdp/current/spark-client/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.adl.AdlFileSystem not found
Can someone point out where this is going wrong? I could not find anything about this error online.
Created 02-28-2018 03:04 PM
You might try this on a newer Hadoop version.
HDP 2.6.1 ships Hadoop 2.7.3, which has a known issue very similar to yours.
Hope this helps!
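For reference, the ClassNotFoundException means the JVM cannot find org.apache.hadoop.fs.adl.AdlFileSystem, which lives in the hadoop-azure-datalake connector; Hadoop 2.7.x does not ship it in the default classpath. A minimal sketch of launching pyspark with the connector attached, assuming you have located the connector jars on your cluster (the paths below are hypothetical placeholders, not actual HDP locations):

```shell
# Hypothetical jar paths -- adjust to wherever your distribution keeps them.
# AdlFileSystem is provided by the hadoop-azure-datalake connector, which
# in turn needs the Azure Data Lake Store SDK on the classpath.
pyspark \
  --jars /path/to/hadoop-azure-datalake.jar,/path/to/azure-data-lake-store-sdk.jar \
  --conf spark.hadoop.fs.adl.impl=org.apache.hadoop.fs.adl.AdlFileSystem \
  --conf spark.hadoop.fs.AbstractFileSystem.adl.impl=org.apache.hadoop.fs.adl.Adl
```

This only addresses the missing class; ADLS credentials (fs.adl.oauth2.* properties) still have to be configured separately.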