Created on 03-27-2015 10:14 AM - edited 09-16-2022 02:25 AM
Not sure what I'm doing wrong here, but I keep getting the same error when I run terasort. Teragen works perfectly, but terasort fails.
Input path does not exist: hdfs://node0:8020/user/hdfs/10000000000
Command line used:
sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hadoop-0.20-mapreduce/hadoop-examples-2.5.0-mr1-cdh5.3.1.jar terasort 10000000000 /home/ssd/hdfs-input /home/ssd/hdfs-output
Created on 03-27-2015 10:31 AM - edited 03-27-2015 10:35 AM
Found the cause. My mistake was a typo: I hadn't removed the 1TB size argument that teragen uses when generating the data 🙂
Command worked:
sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hadoop-0.20-mapreduce/hadoop-examples-2.5.0-mr1-cdh5.3.1.jar terasort /home/ssd/hdfs-input /home/ssd/hdfs-output
Works perfectly now.
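For anyone comparing the two commands: the size/row-count argument belongs to teragen, which generates the input directory that terasort then sorts; terasort itself only takes the input and output paths. The pairing that matches the paths above would be roughly:
sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hadoop-0.20-mapreduce/hadoop-examples-2.5.0-mr1-cdh5.3.1.jar teragen 10000000000 /home/ssd/hdfs-input
sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hadoop-0.20-mapreduce/hadoop-examples-2.5.0-mr1-cdh5.3.1.jar terasort /home/ssd/hdfs-input /home/ssd/hdfs-output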
Created 03-27-2015 10:27 AM
Was this cluster configured using Cloudera Director on AWS? Can you provide more details on how you do DNS and hostname configuration?
Created 04-22-2018 05:50 PM
Hi,
I'm new to both Cloudera and Spark.
I'm trying to run ALS on the MovieLens data using Spark, and I'm getting an error while training the model:
Py4JJavaError: An error occurred while calling o20.partitions.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://quickstart.cloudera:8020/home/cloudera/Downloads/ml-100k/u.data
Below is my code:
import sys
import os
os.environ['SPARK_HOME'] = '/usr/lib/spark'
os.environ['PYSPARK_PYTHON'] = '/usr/local/bin/python2.7'
os.environ['PYSPARK_SUBMIT_ARGS'] = ('--packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell')
# SparkContext is available as sc and HiveContext is available as sqlContext.
sys.path.append('/usr/lib/spark/python')
sys.path.append('/usr/lib/spark/python/lib/py4j-0.9-src.zip')
from pyspark import SparkContext
from pyspark.sql import HiveContext
sc = SparkContext()
sqlContext = HiveContext(sc)
import numpy
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
# Load and parse the data
data = sc.textFile('/home/cloudera/Downloads/ml-100k/u.data')
ratings = data.map(lambda l: l.split('\t')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 10
model = ALS.train(ratings, rank, numIterations, seed=10, nonnegative=True)  # failing at this point
# The commented-out block below runs fine:
# r1 = Rating(1,2,3.0)
# r2 = Rating(1,1,4.0)
# r3 = Rating(2,1,1.0)
# ratings1 = sc.parallelize([r1,r2,r3])
# model = ALS.trainImplicit(ratings1, 1, seed=10)
# model.predict(2,2)
# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPred = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPred.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("Mean Squared Error = " + str(MSE))
Error:
model = ALS.train(ratings, rank, numIterations, seed=10, nonnegative=True)
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/pyspark/mllib/recommendation.py", line 243, in train
model = callMLlibFunc("trainALSModel", cls._prepare(ratings), rank, iterations,
File "/usr/local/lib/python2.7/site-packages/pyspark/mllib/recommendation.py", line 223, in _prepare
first = ratings.first()
File "/usr/local/lib/python2.7/site-packages/pyspark/rdd.py", line 1315, in first
rs = self.take(1)
File "/usr/local/lib/python2.7/site-packages/pyspark/rdd.py", line 1267, in take
totalParts = self.getNumPartitions()
File "/usr/local/lib/python2.7/site-packages/pyspark/rdd.py", line 2363, in getNumPartitions
return self._prev_jrdd.partitions().size()
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_...", line 813, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/lib/python2.7/site-packages/pyspark/sql/utils.py", line 45, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/proto...", line 308, in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o20.partitions.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://quickstart.cloudera:8020/home/cloudera/Downloads/ml-100k/u.data
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:64)
at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:46)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImp...:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Please help me with this; I'm new to Spark and I couldn't find anything written about this error. Thanks!
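As a general note on the message itself: on the QuickStart VM, Spark's default filesystem is HDFS, so a bare path such as /home/cloudera/Downloads/ml-100k/u.data is resolved against hdfs://quickstart.cloudera:8020 and only works if the file actually exists there. A minimal sketch of the two usual workarounds (the HDFS target directory below is an assumption, not something from the question):
# Option 1: read the local file directly by giving sc.textFile an explicit file:// URI
data = sc.textFile('file:///home/cloudera/Downloads/ml-100k/u.data')
# Option 2: copy the file into HDFS first, then read the HDFS path
#   (run in a shell; /user/cloudera/ml-100k is an assumed target directory)
#   hdfs dfs -mkdir -p /user/cloudera/ml-100k
#   hdfs dfs -put /home/cloudera/Downloads/ml-100k/u.data /user/cloudera/ml-100k/
data = sc.textFile('hdfs://quickstart.cloudera:8020/user/cloudera/ml-100k/u.data')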