Support Questions

Couldn't find the number of words in a text file using SparkR — how do I do it? Also, how do I read a CSV file that resides in a Hadoop cluster?


New Contributor

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Spark context is available as sc, SQL context is available as sqlContext

> Sys.setenv(SPARK_HOME="/usr/hdp/2.3.4.0.-3485/spark/bin/sparkR/")
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","Lib"),.libPaths()))
> Sys.setenv(JAVA_HOME="/usr/lib/jvm/java-8-oracle/")
> library(SparkR)
> lines<-SparkR:::textFile(sc,"hdfs: /user/midhun/f.txt")
17/03/23 14:37:58 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 306.2 KB, free 306.2 KB)
17/03/23 14:37:58 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 26.1 KB, free 332.3 KB)
17/03/23 14:37:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:39935 (size: 26.1 KB, free: 511.1 MB)
17/03/23 14:37:58 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2
> words<-SparkR:::flatMap(lines,function(line){strsplit(line," ")[[1]]})
> wordcount<-SparkR:::lapply(words,function(word){list(word,1)})
> counts<-SparkR:::reduceByKey(wordcount,"+",numPartition=2)
> output<-collect(counts)

17/03/23 14:40:03 INFO SparkContext: Starting job: collect at NativeMethodAccessorImpl.java:-2
17/03/23 14:40:03 WARN DAGScheduler: Creating new stage failed due to exception - job: 0
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: hdfs:%20/user/midhun/f.txt
	at org.apache.hadoop.fs.Path.initialize(Path.java:205)
	at org.apache.hadoop.fs.Path.<init>(Path.java:171)
	at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:244)
	at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:411)
	at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
	at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:195)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
	at org.apache.spark.api.r.BaseRRDD.getPartitions(RRDD.scala:47)
	at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:91)
	at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
	at org.apache.spark.rdd.RDD.dependencies(RDD.scala:224)
	at org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:398)
	at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:837)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: hdfs:%20/user/midhun/f.txt
	at java.net.URI.checkPath(URI.java:1823)
	at java.net.URI.<init>(URI.java:745)
	at org.apache.hadoop.fs.Path.initialize(Path.java:202)
	... 44 more
17/03/23 14:40:03 INFO DAGScheduler: Job 0 failed: collect at NativeMethodAccessorImpl.java:-2, took 0.011744 s
17/03/23 14:40:03 ERROR RBackendHandler: collect on 17 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: hdfs:%20/user/midhun/f.txt

2 Replies

Re: Couldn't find the number of words in a text file using SparkR — how do I do it? Also, how do I read a CSV file that resides in a Hadoop cluster?

Contributor

You should post the code you are trying to run and how you are running it (how you submit the job to Spark).

Without that, it is harder to give an answer.

From the error I can see you are trying to run wordcount on this file: hdfs:%20/user/midhun/f.txt. The %20 is a percent-encoded space — your session passed "hdfs: /user/midhun/f.txt" with a space after the colon, so Hadoop rejects it as a relative path in an absolute URI. It should be hdfs:///user/midhun/f.txt (or hdfs://namenode_host:8020/user/midhun/f.txt).
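To see how that %20 arises, here is a quick illustration in plain Python (not SparkR — just standard URI handling, showing how the space gets encoded and why the corrected path parses cleanly):

```python
from urllib.parse import quote, urlparse

bad = "hdfs: /user/midhun/f.txt"    # stray space after the scheme, as typed in the session
good = "hdfs:///user/midhun/f.txt"  # well-formed HDFS URI

# Percent-encoding the part after "hdfs:" turns the space into %20, which is
# exactly the "hdfs:%20/user/midhun/f.txt" string seen in the URISyntaxException.
scheme, rest = bad.split(":", 1)
print(scheme + ":" + quote(rest))  # hdfs:%20/user/midhun/f.txt

# With the space removed, the URI splits cleanly into a scheme and an absolute path.
parts = urlparse(good)
print(parts.scheme, parts.path)    # hdfs /user/midhun/f.txt
```

In other words, the fix is simply to remove the space after `hdfs:` in the `textFile` call.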

Have you tried something like this?

hdfs dfs -put f.txt /user/midhun/f.txt
spark-submit --class com.cloudera.sparkwordcount.SparkWordCount \
--master local --deploy-mode client --executor-memory 1g \
--name wordcount --conf "spark.app.id=wordcount" \
sparkwordcount-1.0-SNAPSHOT-jar-with-dependencies.jar hdfs://namenode_host:8020/user/midhun/f.txt 2

Re: Couldn't find the number of words in a text file using SparkR — how do I do it? Also, how do I read a CSV file that resides in a Hadoop cluster?

New Contributor

Yeah, I got it. Thank you for the response.