Support Questions


I couldn't find the number of words in a text file using SparkR. How do I do it? Also, how can I read a CSV file that resides in the Hadoop cluster?

Welcome to Spark version 1.6.0 (SparkR shell banner)

Spark context is available as sc, SQL context is available as sqlContext
> Sys.setenv(SPARK_HOME="/usr/hdp/2.3.4.0.-3485/spark/bin/sparkR/")
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","Lib"),.libPaths()))
> Sys.setenv(JAVA_HOME="/usr/lib/jvm/java-8-oracle/")
> library(SparkR)
> lines<-SparkR:::textFile(sc,"hdfs: /user/midhun/f.txt")
17/03/23 14:37:58 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 306.2 KB, free 306.2 KB)
17/03/23 14:37:58 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 26.1 KB, free 332.3 KB)
17/03/23 14:37:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:39935 (size: 26.1 KB, free: 511.1 MB)
17/03/23 14:37:58 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2
> words<-SparkR:::flatMap(lines,function(line){strsplit(line," ")[[1]]})
> wordcount<-SparkR:::lapply(words,function(word){list(word,1)})
> counts<-SparkR:::reduceByKey(wordcount,"+",numPartition=2)
> output<-collect(counts)

17/03/23 14:40:03 INFO SparkContext: Starting job: collect at NativeMethodAccessorImpl.java:-2
17/03/23 14:40:03 WARN DAGScheduler: Creating new stage failed due to exception - job: 0
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: hdfs:%20/user/midhun/f.txt
    at org.apache.hadoop.fs.Path.initialize(Path.java:205)
    at org.apache.hadoop.fs.Path.<init>(Path.java:171)
    at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:244)
    at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:411)
    at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
    at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
    at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
    at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
    at scala.Option.map(Option.scala:145)
    at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:195)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.api.r.BaseRRDD.getPartitions(RRDD.scala:47)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.api.r.BaseRRDD.getPartitions(RRDD.scala:47)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:91)
    at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
    at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:226)
    at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:224)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.dependencies(RDD.scala:224)
    at org.apache.spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:386)
    at org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:398)
    at org.apache.spark.scheduler.DAGScheduler.getParentStagesAndId(DAGScheduler.scala:299)
    at org.apache.spark.scheduler.DAGScheduler.newResultStage(DAGScheduler.scala:334)
    at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:837)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1607)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: hdfs:%20/user/midhun/f.txt
    at java.net.URI.checkPath(URI.java:1823)
    at java.net.URI.<init>(URI.java:745)
    at org.apache.hadoop.fs.Path.initialize(Path.java:202)
    ... 44 more
17/03/23 14:40:03 INFO DAGScheduler: Job 0 failed: collect at NativeMethodAccessorImpl.java:-2, took 0.011744 s
17/03/23 14:40:03 ERROR RBackendHandler: collect on 17 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: hdfs:%20/user/midhun/f.txt

2 Replies

Contributor

You should post the code you are trying to run and explain how you are running it (how you submit the job to Spark).

Without that it is harder to give you an answer.

From the error I can see you are trying to run a word count on this file: hdfs:%20/user/midhun/f.txt

The %20 is a URL-encoded space: the path string you passed to textFile contains a space after "hdfs:", which makes the URI invalid.
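
In SparkR the fix is simply to remove that space from the URI. Below is a minimal sketch mirroring your own session; it assumes the default namenode from your cluster configuration (otherwise spell it out, e.g. hdfs://<namenode_host>:8020/user/midhun/f.txt):

> lines <- SparkR:::textFile(sc, "hdfs:///user/midhun/f.txt")  # no space after "hdfs:"
> words <- SparkR:::flatMap(lines, function(line) { strsplit(line, " ")[[1]] })
> wordCount <- SparkR:::lapply(words, function(word) { list(word, 1L) })
> counts <- SparkR:::reduceByKey(wordCount, "+", numPartitions = 2L)
> output <- collect(counts)
> head(output)  # list of (word, count) pairs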

Alternatively, have you tried submitting a standalone word count job, something like this?

# first copy the local file (assumed here to be ./f.txt) into HDFS
hdfs dfs -put f.txt /user/midhun/f.txt
spark-submit --class com.cloudera.sparkwordcount.SparkWordCount \
--master local --deploy-mode client --executor-memory 1g \
--name wordcount --conf "spark.app.id=wordcount" \
sparkwordcount-1.0-SNAPSHOT-jar-with-dependencies.jar hdfs://namenode_host:8020/user/midhun/f.txt 2
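
For the second part of the original question (reading a CSV file that lives in HDFS): Spark 1.6 has no built-in CSV reader, so in SparkR you would normally use read.df together with the external spark-csv data source (start the shell with --packages com.databricks:spark-csv_2.10:1.5.0). A minimal sketch, assuming a file at /user/midhun/data.csv:

# inside a SparkR shell launched with the spark-csv package
df <- read.df(sqlContext, "hdfs:///user/midhun/data.csv",
              source = "com.databricks.spark.csv",
              header = "true", inferSchema = "true")
printSchema(df)
head(df)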

Yeah, I got it. Thank you for the response.
