04-22-2018
06:19 PM
1 Kudo
I was referencing a local filesystem path. It needs to be referenced as 'file:///home...' instead; the file:// scheme tells Spark to read from the local filesystem rather than HDFS. It worked.
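For context, here is a sketch of the resolution rule at play (this is an illustration, not Hadoop's actual code): a path with an explicit scheme is used as-is, while a bare path is qualified against the cluster's default filesystem (fs.defaultFS, which is HDFS on the QuickStart VM). The `resolve` helper below is hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical helper illustrating how a Hadoop-style client treats paths.
def resolve(path, default_fs='hdfs://quickstart.cloudera:8020'):
    # Explicit scheme (file://, hdfs://) wins; a bare path is
    # qualified against the default filesystem.
    if urlparse(path).scheme:
        return path
    return default_fs + path

# Bare path -> sent to HDFS, which is why the original job failed:
resolve('/home/cloudera/Downloads/ml-100k/u.data')
# -> 'hdfs://quickstart.cloudera:8020/home/cloudera/Downloads/ml-100k/u.data'

# Explicit file:// URI -> read from the local filesystem:
resolve('file:///home/cloudera/Downloads/ml-100k/u.data')
# -> 'file:///home/cloudera/Downloads/ml-100k/u.data'
```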
04-22-2018
05:50 PM
1 Kudo
Hi, I'm new to both Cloudera and Spark. I'm trying to run ALS on the MovieLens data using Spark, and I'm getting an error while loading the data:

Py4JJavaError: An error occurred while calling o20.partitions.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://quickstart.cloudera:8020/home/cloudera/Downloads/ml-100k/u.data

Below is my code:

```python
import sys
import os

os.environ['SPARK_HOME'] = '/usr/lib/spark'
os.environ['PYSPARK_PYTHON'] = '/usr/local/bin/python2.7'
os.environ['PYSPARK_SUBMIT_ARGS'] = ('--packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell')

sys.path.append('/usr/lib/spark/python')
sys.path.append('/usr/lib/spark/python/lib/py4j-0.9-src.zip')

from pyspark import SparkContext
from pyspark import HiveContext

# SparkContext is available as sc and HiveContext is available as sqlContext.
sc = SparkContext()
sqlContext = HiveContext(sc)

import numpy
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

# Load and parse the data
data = sc.textFile('/home/cloudera/Downloads/ml-100k/u.data')
ratings = data.map(lambda l: l.split('\t')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 10
model = ALS.train(ratings, rank, numIterations, seed=10, nonnegative=True)  # failing at this point

# this snippet, by contrast, runs fine:
# r1 = Rating(1,2,3.0)
# r2 = Rating(1,1,4.0)
# r3 = Rating(2,1,1.0)
# ratings1 = sc.parallelize([r1,r2,r3])
# model = ALS.trainImplicit(ratings1, 1, seed=10)
# model.predict(2,2)

# Evaluate the model on training data
testdata = data.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPred = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPred.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("Mean Squared Error = " + str(MSE))
```

Error:

```
model = ALS.train(ratings, rank, numIterations, seed=10, nonnegative=True)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/pyspark/mllib/recommendation.py", line 243, in train
    model = callMLlibFunc("trainALSModel", cls._prepare(ratings), rank, iterations,
  File "/usr/local/lib/python2.7/site-packages/pyspark/mllib/recommendation.py", line 223, in _prepare
    first = ratings.first()
  File "/usr/local/lib/python2.7/site-packages/pyspark/rdd.py", line 1315, in first
    rs = self.take(1)
  File "/usr/local/lib/python2.7/site-packages/pyspark/rdd.py", line 1267, in take
    totalParts = self.getNumPartitions()
  File "/usr/local/lib/python2.7/site-packages/pyspark/rdd.py", line 2363, in getNumPartitions
    return self._prev_jrdd.partitions().size()
  File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_...", line 813, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/lib/python2.7/site-packages/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/proto...", line 308, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o20.partitions.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://quickstart.cloudera:8020/home/cloudera/Downloads/ml-100k/u.data
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:64)
    at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:46)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImp...:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)
```

Please help me with this. I'm new to Spark and there's no blog post on this error. Thanks!
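As an aside on the evaluation step in the code above: the MSE computation joins the actual ratings with the predictions keyed on (user, product) pairs and averages the squared differences. The same logic in plain Python, with made-up sample numbers (no Spark needed):

```python
# Sketch of the MSE evaluation step, using hypothetical sample data.
# The Spark version does the same thing via a join on (user, product) keys.
actual = {(1, 10): 4.0, (1, 20): 3.0, (2, 10): 5.0}
predicted = {(1, 10): 3.5, (1, 20): 3.0, (2, 10): 4.0}

# Join on the shared (user, product) keys, pairing actual with predicted.
joined = [(actual[k], predicted[k]) for k in actual if k in predicted]

# Mean of squared differences.
mse = sum((r - p) ** 2 for r, p in joined) / len(joined)
print("Mean Squared Error = " + str(mse))  # (0.25 + 0.0 + 1.0) / 3
```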