Support Questions


Input path does not exist: hdfs://node0:8020/user/hdfs/10000000000

Expert Contributor

Not sure what I'm doing wrong here, but I keep getting the same error when I run terasort. Teragen works perfectly, but terasort fails:

Input path does not exist: hdfs://node0:8020/user/hdfs/10000000000

Command line used:

sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hadoop-0.20-mapreduce/hadoop-examples-2.5.0-mr1-cdh5.3.1.jar terasort 10000000000 /home/ssd/hdfs-input /home/ssd/hdfs-output
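
A note on what the error means: terasort takes just an input directory and an output directory, so the 10000000000 here is parsed as the input path. Running as the hdfs user, that relative path resolves to hdfs://node0:8020/user/hdfs/10000000000, which is exactly what the message reports. A quick listing, e.g. sudo -u hdfs hdfs dfs -ls /user/hdfs, would confirm that no such directory exists.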

1 ACCEPTED SOLUTION

Expert Contributor

Found why. My mistake was a typo: I hadn't removed the 10000000000 row count (the 1 TB of data generated by teragen) from the terasort command. 🙂

Command that worked:

sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hadoop-0.20-mapreduce/hadoop-examples-2.5.0-mr1-cdh5.3.1.jar terasort /home/ssd/hdfs-input /home/ssd/hdfs-output

Works perfectly now.
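
For reference, the row count argument belongs to the teragen step, which generates the input that terasort then sorts; terasort itself takes only the input and output directories. Since teragen writes 100-byte rows, 10000000000 rows is roughly 1 TB. With the parcel and paths from this thread, the usual two-step sequence would look like:

sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hadoop-0.20-mapreduce/hadoop-examples-2.5.0-mr1-cdh5.3.1.jar teragen 10000000000 /home/ssd/hdfs-input
sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hadoop-0.20-mapreduce/hadoop-examples-2.5.0-mr1-cdh5.3.1.jar terasort /home/ssd/hdfs-input /home/ssd/hdfs-output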



Master Collaborator

Was this cluster configured using Cloudera Director on AWS? Can you provide more details on how you set up DNS and hostname configuration?

New Contributor

Hi,

I'm new to both Cloudera and Spark.

I'm trying to run ALS on the MovieLens data using Spark, and I'm getting an error while training the model; it fails as soon as Spark tries to read the input file:

Py4JJavaError: An error occurred while calling o20.partitions.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://quickstart.cloudera:8020/home/cloudera/Downloads/ml-100k/u.data

Below is my code:

import sys
import os

os.environ['SPARK_HOME'] = '/usr/lib/spark'
os.environ['PYSPARK_PYTHON'] = '/usr/local/bin/python2.7'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell'

# Make the Spark and Py4J Python libraries importable
sys.path.append('/usr/lib/spark/python')
sys.path.append('/usr/lib/spark/python/lib/py4j-0.9-src.zip')

from pyspark import SparkContext
from pyspark.sql import HiveContext
sc = SparkContext()
sqlContext = HiveContext(sc)

import numpy
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

# Load and parse the data
data = sc.textFile('/home/cloudera/Downloads/ml-100k/u.data')
ratings = data.map(lambda l: l.split('\t')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 10
model = ALS.train(ratings, rank, numIterations, seed=10, nonnegative=True)  # failing at this point

This smaller example, however, runs fine:

# r1 = Rating(1,2,3.0)
# r2 = Rating(1,1,4.0)
# r3 = Rating(2,1,1.0)
# ratings1 = sc.parallelize([r1,r2,r3])
# model = ALS.trainImplicit(ratings1, 1, seed=10)
# model.predict(2,2)

# Evaluate the model on training data: (user, product) pairs from the parsed ratings
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPred = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPred.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("Mean Squared Error = " + str(MSE))

Error:
model = ALS.train(ratings, rank, numIterations, seed=10, nonnegative=True)
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/pyspark/mllib/recommendation.py", line 243, in train
model = callMLlibFunc("trainALSModel", cls._prepare(ratings), rank, iterations,
File "/usr/local/lib/python2.7/site-packages/pyspark/mllib/recommendation.py", line 223, in _prepare
first = ratings.first()
File "/usr/local/lib/python2.7/site-packages/pyspark/rdd.py", line 1315, in first
rs = self.take(1)
File "/usr/local/lib/python2.7/site-packages/pyspark/rdd.py", line 1267, in take
totalParts = self.getNumPartitions()
File "/usr/local/lib/python2.7/site-packages/pyspark/rdd.py", line 2363, in getNumPartitions
return self._prev_jrdd.partitions().size()
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_...", line 813, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/lib/python2.7/site-packages/pyspark/sql/utils.py", line 45, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/proto...", line 308, in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o20.partitions.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://quickstart.cloudera:8020/home/cloudera/Downloads/ml-100k/u.data
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:64)
at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:46)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImp...:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)

Please help me with this. I'm new to Spark and I couldn't find anything written about this error. Thanks!

New Contributor
Found it: I was referencing a local file system path. It needs to be referenced as 'file:///home...'; that prefix tells Spark to read from the local file system instead of HDFS. It worked.
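
For anyone hitting the same InvalidInputException: without a scheme prefix, Spark resolves a bare path against the cluster's default file system (HDFS here), not the local disk. A minimal sketch of the two usual fixes, using the local path from this thread (the HDFS target directory below is just an illustrative choice, not from the thread):

# Option 1: read the file from the local file system explicitly
data = sc.textFile('file:///home/cloudera/Downloads/ml-100k/u.data')

# Option 2: copy the file into HDFS first, then read it with a plain path
#   hdfs dfs -mkdir -p /user/cloudera/ml-100k
#   hdfs dfs -put /home/cloudera/Downloads/ml-100k/u.data /user/cloudera/ml-100k/
data = sc.textFile('/user/cloudera/ml-100k/u.data')

Note that file:// paths only work if the file exists at that path on every node, which is fine on the single-node quickstart VM.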
