
Running Mahout examples problem

I'm trying to use Mahout for a clustering job, and I've been struggling with it and Maven for a week now. My code works fine in Eclipse on my local machine, but when I build it into a jar and send it to the cluster I get errors that seem to come from reading HDFS.

First I created the directory /user/root/testdata:

hadoop fs -mkdir /user/root/testdata

Then I put the downloaded file into it:

hadoop fs -put /user/root/testdata/

Finally, I ran the example in two ways:

- with the Mahout examples jar from Mahout 0.9, downloaded from the website:

hadoop jar mahout-examples-1.0-SNAPSHOT-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

- and with the mahout-examples- file found in the Mahout directory on the node:

hadoop jar /usr/hdp/ org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

In both cases I get this error:

WARNING: Use "yarn jar" to launch YARN applications.
16/03/04 11:57:03 INFO kmeans.Job: Running with default arguments
16/03/04 11:57:05 INFO common.HadoopUtil: Deleting output
16/03/04 11:57:05 INFO kmeans.Job: Preparing Input
16/03/04 11:57:05 INFO impl.TimelineClientImpl: Timeline service address: http://vm2.local:8188/ws/v1/timeline/
16/03/04 11:57:05 INFO client.RMProxy: Connecting to ResourceManager at vm1.local/
16/03/04 11:57:06 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/03/04 11:57:07 INFO input.FileInputFormat: Total input paths to process : 1
16/03/04 11:57:07 INFO mapreduce.JobSubmitter: number of splits:1
16/03/04 11:57:07 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1456915204500_0029
16/03/04 11:57:07 INFO impl.YarnClientImpl: Submitted application application_1456915204500_0029
16/03/04 11:57:07 INFO mapreduce.Job: The url to track the job: http://vm1.local:8088/proxy/application_1456915204500_0029/
16/03/04 11:57:07 INFO mapreduce.Job: Running job: job_1456915204500_0029
16/03/04 11:57:14 INFO mapreduce.Job: Job job_1456915204500_0029 running in uber mode : false
16/03/04 11:57:14 INFO mapreduce.Job:  map 0% reduce 0%
16/03/04 11:57:20 INFO mapreduce.Job:  map 100% reduce 0%
16/03/04 11:57:20 INFO mapreduce.Job: Job job_1456915204500_0029 completed successfully
16/03/04 11:57:20 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=129757
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=288502
        HDFS: Number of bytes written=335470
        HDFS: Number of read operations=5
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters 
        Launched map tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=3457
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=3457
        Total vcore-seconds taken by all map tasks=3457
        Total megabyte-seconds taken by all map tasks=3539968
    Map-Reduce Framework
        Map input records=600
        Map output records=600
        Input split bytes=128
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=76
        CPU time spent (ms)=590
        Physical memory (bytes) snapshot=113729536
        Virtual memory (bytes) snapshot=2723696640
        Total committed heap usage (bytes)=62324736
    File Input Format Counters 
        Bytes Read=288374
    File Output Format Counters 
        Bytes Written=335470
16/03/04 11:57:20 INFO kmeans.Job: Running random seed to get initial clusters
16/03/04 11:57:20 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
16/03/04 11:57:20 INFO compress.CodecPool: Got brand-new compressor [.deflate]
16/03/04 11:57:21 INFO kmeans.RandomSeedGenerator: Wrote 6 Klusters to output/random-seeds/part-randomSeed
16/03/04 11:57:21 INFO kmeans.Job: Running KMeans with k = 6
16/03/04 11:57:21 INFO kmeans.KMeansDriver: Input: output/data Clusters In: output/random-seeds/part-randomSeed Out: output
16/03/04 11:57:21 INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10
16/03/04 11:57:21 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
16/03/04 11:57:21 INFO impl.TimelineClientImpl: Timeline service address: http://vm2.local:8188/ws/v1/timeline/
16/03/04 11:57:21 INFO client.RMProxy: Connecting to ResourceManager at vm1.local/
16/03/04 11:57:21 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/03/04 11:57:22 INFO input.FileInputFormat: Total input paths to process : 1
16/03/04 11:57:22 INFO mapreduce.JobSubmitter: number of splits:1
16/03/04 11:57:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1456915204500_0030
16/03/04 11:57:22 INFO impl.YarnClientImpl: Submitted application application_1456915204500_0030
16/03/04 11:57:22 INFO mapreduce.Job: The url to track the job: http://vm1.local:8088/proxy/application_1456915204500_0030/
16/03/04 11:57:22 INFO mapreduce.Job: Running job: job_1456915204500_0030
16/03/04 11:57:33 INFO mapreduce.Job: Job job_1456915204500_0030 running in uber mode : false
16/03/04 11:57:33 INFO mapreduce.Job:  map 0% reduce 0%
16/03/04 11:57:37 INFO mapreduce.Job: Task Id : attempt_1456915204500_0030_m_000000_0, Status : FAILED
Error: java.lang.IllegalStateException: output/clusters-0
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable.iterator(
    at org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles(
    at org.apache.mahout.clustering.iterator.CIMapper.setup(
    at org.apache.hadoop.mapred.MapTask.runNewMapper(
    at org.apache.hadoop.mapred.YarnChild$
    at Method)
    at org.apache.hadoop.mapred.YarnChild.main(
Caused by: File output/clusters-0 does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(
    at org.apache.hadoop.fs.FileSystem.listStatus(
    at org.apache.hadoop.fs.FileSystem.listStatus(
    at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(
    at org.apache.hadoop.fs.FileSystem.listStatus(
    at org.apache.hadoop.fs.FileSystem.listStatus(
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator.<init>(
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable.iterator(
    ... 10 more

16/03/04 11:57:42 INFO mapreduce.Job: Task Id : attempt_1456915204500_0030_m_000000_1, Status : FAILED
Error: java.lang.IllegalStateException: output/clusters-0
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable.iterator(
    at org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles(
    at org.apache.mahout.clustering.iterator.CIMapper.setup(
    at org.apache.hadoop.mapred.MapTask.runNewMapper(
    at org.apache.hadoop.mapred.YarnChild$
    at Method)
    at org.apache.hadoop.mapred.YarnChild.main(
Caused by: File output/clusters-0 does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(
    at org.apache.hadoop.fs.FileSystem.listStatus(
    at org.apache.hadoop.fs.FileSystem.listStatus(
    at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(
    at org.apache.hadoop.fs.FileSystem.listStatus(
    at org.apache.hadoop.fs.FileSystem.listStatus(
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator.<init>(
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable.iterator(
    ... 10 more

16/03/04 11:57:46 INFO mapreduce.Job: Task Id : attempt_1456915204500_0030_m_000000_2, Status : FAILED
Error: java.lang.IllegalStateException: output/clusters-0
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable.iterator(
    at org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles(
    at org.apache.mahout.clustering.iterator.CIMapper.setup(
    at org.apache.hadoop.mapred.MapTask.runNewMapper(
    at org.apache.hadoop.mapred.YarnChild$
    at Method)
    at org.apache.hadoop.mapred.YarnChild.main(
Caused by: File output/clusters-0 does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(
    at org.apache.hadoop.fs.FileSystem.listStatus(
    at org.apache.hadoop.fs.FileSystem.listStatus(
    at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(
    at org.apache.hadoop.fs.FileSystem.listStatus(
    at org.apache.hadoop.fs.FileSystem.listStatus(
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator.<init>(
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable.iterator(
    ... 10 more

16/03/04 11:57:52 INFO mapreduce.Job:  map 100% reduce 100%
16/03/04 11:57:53 INFO mapreduce.Job: Job job_1456915204500_0030 failed with state FAILED due to: Task failed task_1456915204500_0030_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0

16/03/04 11:57:53 INFO mapreduce.Job: Counters: 13
    Job Counters 
        Failed map tasks=4
        Killed reduce tasks=1
        Launched map tasks=4
        Other local map tasks=3
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=11687
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=11687
        Total time spent by all reduce tasks (ms)=0
        Total vcore-seconds taken by all map tasks=11687
        Total vcore-seconds taken by all reduce tasks=0
        Total megabyte-seconds taken by all map tasks=11967488
        Total megabyte-seconds taken by all reduce tasks=0
Exception in thread "main" java.lang.InterruptedException: Cluster Iteration 1 failed processing output/clusters-1
    at org.apache.mahout.clustering.iterator.ClusterIterator.iterateMR(
    at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(
    at java.lang.reflect.Method.invoke(
    at org.apache.hadoop.util.RunJar.main(

I guess it's a version problem ...



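Resolved this ...

The stack trace goes through RawLocalFileSystem, so the map tasks were resolving the path against the local file system, not HDFS. Instead of using a path without a scheme and host, like this: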

new Path("/testdata/points")

you have to use the fully qualified HDFS URI of the directory on your cluster:

new Path("hdfs://vm1.local:8020/user/root/testdata/points")
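
To illustrate why the fully qualified form behaves differently, here is a minimal sketch that uses only java.net.URI, not Hadoop's Path API. It assumes that a path without a scheme is resolved against a default filesystem URI, which is the role Hadoop's fs.defaultFS setting plays. The class and method names (PathQualifier, qualify) are made up for this example.

```java
import java.net.URI;

// Hypothetical helper: models how a path string is qualified against a
// default filesystem URI. It uses only java.net.URI, not Hadoop's Path API.
class PathQualifier {

    // Resolve `path` against `defaultFs`. A path that already carries a
    // scheme and host is returned unchanged; a bare path inherits both
    // from the default filesystem.
    static URI qualify(String defaultFs, String path) {
        return URI.create(defaultFs).resolve(path);
    }

    public static void main(String[] args) {
        String bare = "/user/root/testdata/points";
        String full = "hdfs://vm1.local:8020/user/root/testdata/points";

        // With a local default filesystem the bare path stays local.
        System.out.println(qualify("file:///", bare).getScheme());
        // With the cluster's NameNode as default it becomes an HDFS path.
        System.out.println(qualify("hdfs://vm1.local:8020/", bare));
        // A fully qualified URI ignores the default entirely.
        System.out.println(qualify("file:///", full));
    }
}
```

If the tasks do not pick up the cluster's default filesystem, a bare path ends up on the local disk of whichever node runs the task, which would match the "File output/clusters-0 does not exist" error in the log. Spelling out the scheme and NameNode address removes that dependency.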


@Neeraj Sabharwal The problem still exists even in 0.11; see my answer in this thread.
