Created on 03-04-2016 10:58 AM - edited 09-16-2022 08:40 AM
I'm trying to use Mahout for a clustering job, and I've been struggling with it and Maven for a week now. My code works fine in Eclipse on my local machine, but when I build it into a jar and submit it to the cluster I get errors that seem related to reading from HDFS.
First, I created the directory /user/root/testdata:
hadoop fs -mkdir /user/root/testdata
then put the downloaded synthetic_control.data file into it:
hadoop fs -put synthetic_control.data /user/root/testdata/
Finally, I ran the example using:
- the Mahout examples jar from the Mahout 0.9 download on the website:
hadoop jar mahout-examples-1.0-SNAPSHOT-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
- and the mahout-examples-0.9.0.2.3.4.0-3485-job.jar file found in the Mahout directory on the node:
hadoop jar /usr/hdp/2.3.4.0-3485/mahout/mahout-examples-0.9.0.2.3.4.0-3485-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
In both cases I get this error:
WARNING: Use "yarn jar" to launch YARN applications.
16/03/04 11:57:03 INFO kmeans.Job: Running with default arguments
16/03/04 11:57:05 INFO common.HadoopUtil: Deleting output
16/03/04 11:57:05 INFO kmeans.Job: Preparing Input
16/03/04 11:57:05 INFO impl.TimelineClientImpl: Timeline service address: http://vm2.local:8188/ws/v1/timeline/
16/03/04 11:57:05 INFO client.RMProxy: Connecting to ResourceManager at vm1.local/10.10.10.1:8050
16/03/04 11:57:06 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/03/04 11:57:07 INFO input.FileInputFormat: Total input paths to process : 1
16/03/04 11:57:07 INFO mapreduce.JobSubmitter: number of splits:1
16/03/04 11:57:07 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1456915204500_0029
16/03/04 11:57:07 INFO impl.YarnClientImpl: Submitted application application_1456915204500_0029
16/03/04 11:57:07 INFO mapreduce.Job: The url to track the job: http://vm1.local:8088/proxy/application_1456915204500_0029/
16/03/04 11:57:07 INFO mapreduce.Job: Running job: job_1456915204500_0029
16/03/04 11:57:14 INFO mapreduce.Job: Job job_1456915204500_0029 running in uber mode : false
16/03/04 11:57:14 INFO mapreduce.Job: map 0% reduce 0%
16/03/04 11:57:20 INFO mapreduce.Job: map 100% reduce 0%
16/03/04 11:57:20 INFO mapreduce.Job: Job job_1456915204500_0029 completed successfully
16/03/04 11:57:20 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=129757
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=288502
        HDFS: Number of bytes written=335470
        HDFS: Number of read operations=5
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=3457
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=3457
        Total vcore-seconds taken by all map tasks=3457
        Total megabyte-seconds taken by all map tasks=3539968
    Map-Reduce Framework
        Map input records=600
        Map output records=600
        Input split bytes=128
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=76
        CPU time spent (ms)=590
        Physical memory (bytes) snapshot=113729536
        Virtual memory (bytes) snapshot=2723696640
        Total committed heap usage (bytes)=62324736
    File Input Format Counters
        Bytes Read=288374
    File Output Format Counters
        Bytes Written=335470
16/03/04 11:57:20 INFO kmeans.Job: Running random seed to get initial clusters
16/03/04 11:57:20 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
16/03/04 11:57:20 INFO compress.CodecPool: Got brand-new compressor [.deflate]
16/03/04 11:57:21 INFO kmeans.RandomSeedGenerator: Wrote 6 Klusters to output/random-seeds/part-randomSeed
16/03/04 11:57:21 INFO kmeans.Job: Running KMeans with k = 6
16/03/04 11:57:21 INFO kmeans.KMeansDriver: Input: output/data Clusters In: output/random-seeds/part-randomSeed Out: output
16/03/04 11:57:21 INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10
16/03/04 11:57:21 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
16/03/04 11:57:21 INFO impl.TimelineClientImpl: Timeline service address: http://vm2.local:8188/ws/v1/timeline/
16/03/04 11:57:21 INFO client.RMProxy: Connecting to ResourceManager at vm1.local/10.10.10.1:8050
16/03/04 11:57:21 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/03/04 11:57:22 INFO input.FileInputFormat: Total input paths to process : 1
16/03/04 11:57:22 INFO mapreduce.JobSubmitter: number of splits:1
16/03/04 11:57:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1456915204500_0030
16/03/04 11:57:22 INFO impl.YarnClientImpl: Submitted application application_1456915204500_0030
16/03/04 11:57:22 INFO mapreduce.Job: The url to track the job: http://vm1.local:8088/proxy/application_1456915204500_0030/
16/03/04 11:57:22 INFO mapreduce.Job: Running job: job_1456915204500_0030
16/03/04 11:57:33 INFO mapreduce.Job: Job job_1456915204500_0030 running in uber mode : false
16/03/04 11:57:33 INFO mapreduce.Job: map 0% reduce 0%
16/03/04 11:57:37 INFO mapreduce.Job: Task Id : attempt_1456915204500_0030_m_000000_0, Status : FAILED
Error: java.lang.IllegalStateException: output/clusters-0
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable.iterator(SequenceFileDirValueIterable.java:78)
    at org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles(ClusterClassifier.java:208)
    at org.apache.mahout.clustering.iterator.CIMapper.setup(CIMapper.java:44)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.io.FileNotFoundException: File output/clusters-0 does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:429)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1515)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1555)
    at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:574)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1515)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1555)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator.<init>(SequenceFileDirValueIterator.java:70)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable.iterator(SequenceFileDirValueIterable.java:76)
    ... 10 more
16/03/04 11:57:42 INFO mapreduce.Job: Task Id : attempt_1456915204500_0030_m_000000_1, Status : FAILED
Error: java.lang.IllegalStateException: output/clusters-0
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable.iterator(SequenceFileDirValueIterable.java:78)
    at org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles(ClusterClassifier.java:208)
    at org.apache.mahout.clustering.iterator.CIMapper.setup(CIMapper.java:44)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.io.FileNotFoundException: File output/clusters-0 does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:429)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1515)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1555)
    at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:574)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1515)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1555)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator.<init>(SequenceFileDirValueIterator.java:70)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable.iterator(SequenceFileDirValueIterable.java:76)
    ... 10 more
16/03/04 11:57:46 INFO mapreduce.Job: Task Id : attempt_1456915204500_0030_m_000000_2, Status : FAILED
Error: java.lang.IllegalStateException: output/clusters-0
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable.iterator(SequenceFileDirValueIterable.java:78)
    at org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles(ClusterClassifier.java:208)
    at org.apache.mahout.clustering.iterator.CIMapper.setup(CIMapper.java:44)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.io.FileNotFoundException: File output/clusters-0 does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:429)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1515)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1555)
    at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:574)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1515)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1555)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator.<init>(SequenceFileDirValueIterator.java:70)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable.iterator(SequenceFileDirValueIterable.java:76)
    ... 10 more
16/03/04 11:57:52 INFO mapreduce.Job: map 100% reduce 100%
16/03/04 11:57:53 INFO mapreduce.Job: Job job_1456915204500_0030 failed with state FAILED due to: Task failed task_1456915204500_0030_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
16/03/04 11:57:53 INFO mapreduce.Job: Counters: 13
    Job Counters
        Failed map tasks=4
        Killed reduce tasks=1
        Launched map tasks=4
        Other local map tasks=3
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=11687
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=11687
        Total time spent by all reduce tasks (ms)=0
        Total vcore-seconds taken by all map tasks=11687
        Total vcore-seconds taken by all reduce tasks=0
        Total megabyte-seconds taken by all map tasks=11967488
        Total megabyte-seconds taken by all reduce tasks=0
Exception in thread "main" java.lang.InterruptedException: Cluster Iteration 1 failed processing output/clusters-1
    at org.apache.mahout.clustering.iterator.ClusterIterator.iterateMR(ClusterIterator.java:183)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:224)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:147)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.run(Job.java:135)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:60)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
I guess it's a version problem ...
Thanks.
Created 03-07-2016 12:47 PM
Resolved this ...
Instead of using a relative path like this:
new Path("/testdata/points")
you have to use the absolute path of the directory on your cluster:
new Path("hdfs://vm1.local:8020/user/root/testdata/points")
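For context: the stack trace shows RawLocalFileSystem, meaning the scheme-less "output/clusters-0" path was resolved against the local filesystem instead of HDFS. The sketch below illustrates why a fully qualified URI removes that ambiguity; it uses only plain java.net.URI (no Hadoop on the classpath), and the base URIs are made up for illustration, not taken from any real cluster config.

```java
import java.net.URI;

public class PathResolution {
    public static void main(String[] args) {
        // A relative path carries no scheme: where it points depends entirely
        // on the base it is resolved against -- analogous to Hadoop resolving
        // "output/clusters-0" against whichever filesystem is the default.
        URI relative = URI.create("output/clusters-0");
        System.out.println(relative.getScheme());          // null -> ambiguous

        URI localBase = URI.create("file:///tmp/job/");
        URI hdfsBase  = URI.create("hdfs://vm1.local:8020/user/root/");

        // The same relative path lands on two different filesystems:
        System.out.println(localBase.resolve(relative));
        // file:///tmp/job/output/clusters-0
        System.out.println(hdfsBase.resolve(relative));
        // hdfs://vm1.local:8020/user/root/output/clusters-0

        // A fully qualified URI resolves to itself regardless of the base:
        URI absolute = URI.create("hdfs://vm1.local:8020/user/root/testdata/points");
        System.out.println(localBase.resolve(absolute));
        // hdfs://vm1.local:8020/user/root/testdata/points
    }
}
```

This is only an analogy at the URI level; in a real job the default filesystem comes from fs.defaultFS in core-site.xml, which is why the same code can behave differently in Eclipse and on the cluster.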
Created 03-07-2016 12:49 PM
@Neeraj Sabharwal Even in 0.11 it still exists ... see my answer above.