Member since: 03-22-2017
Posts: 14
Kudos Received: 1
Solutions: 0
09-21-2017
05:34 AM
@Javier Teixeira Quevedo
usage: hdfs.write(object, con, hsync=FALSE)

arguments:
object: The R object to be written to disk.
con: An open HDFS connection returned by ‘hdfs.file’.
hsync: If TRUE, the file will be synced after writing.

details: The functions can be used to read and write files on both the local filesystem and HDFS. If the object is a raw vector, it is written directly to the ‘con’ object; otherwise it is serialized and the bytes are written to ‘con’. No prefix (for example, the length in bytes) is written, and it is up to the user to handle this. ‘hdfs.seek’ seeks to position ‘n’, which must be positive. ‘hdfs.tell’ returns the current location of the file pointer.

code:
data <- "hello world"
modelfile <- hdfs.file("test.txt", "w")
data1 <- toJSON(data)
data2 <- charToRaw(data1)
hdfs.write(data2, modelfile)
hdfs.close(modelfile)

description: You have to write the data as a raw vector to the modelfile object.
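To read it back, something along these lines should work (a minimal sketch, assuming the same rhdfs session, that "test.txt" fits in a single read, and that toJSON/fromJSON come from the same JSON package used above):

readfile <- hdfs.file("test.txt", "r")
raw_bytes <- hdfs.read(readfile, n = 1024L)   # assumes the payload is smaller than 1024 bytes
hdfs.close(readfile)
data_back <- fromJSON(rawToChar(raw_bytes))   # reverse the serialization: raw -> character -> R object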
04-18-2017
06:56 AM
04-01-2017
04:35 AM
1 Kudo
We have an HDP 2.4 cluster; the Spark version is 1.6.0. I have to convert a large CSV file (about 1 GB) into a DataFrame, but I wasn't able to when the master is set to local. Could you tell me how to launch Spark with the master set to yarn-client, and also explain how to convert a large CSV file into a DataFrame in SparkR?
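For context, this is roughly what I have in mind (a sketch only; the launcher path, the spark-csv package coordinates, and the HDFS path are my assumptions for Spark 1.6 on HDP 2.4):

# launched from the shell (not R), e.g.:
#   /usr/hdp/current/spark-client/bin/sparkR --master yarn-client \
#     --packages com.databricks:spark-csv_2.10:1.5.0

# then, inside the SparkR shell, read the CSV as a DataFrame:
df <- read.df(sqlContext, "hdfs:///path/to/large.csv",
              source = "com.databricks.spark.csv",
              header = "true", inferSchema = "true")
head(df)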
03-24-2017
04:33 AM
data <- fread("/usr/bin/hadoop fs -text /path/to/the/file.csv", fill=TRUE)
I used this command and it works perfectly. What is the difference between the two approaches?
03-24-2017
04:22 AM
Yeah, I got it. Thank you for the response.
03-24-2017
03:49 AM
source = "com.databricks.spark.csv" is not available.
(Spark version is 1.6, HDP 2.4.0.0-169, jvm/java-8-oracle.)
Is it possible to read the file without the Databricks package?
How do I add this package to the cluster?
Where will I get the jar files, or do they have to be included separately?
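To be clear, this is the sort of thing I mean by adding it (a guess on my part; the /usr/hdp/current/spark-client path and the Scala 2.10 / 1.5.0 coordinates are my assumptions, not something I have verified on this cluster):

# Option 1: have Spark fetch the package from the Maven repo at launch (needs internet access):
/usr/hdp/current/spark-client/bin/sparkR --packages com.databricks:spark-csv_2.10:1.5.0

# Option 2: download the spark-csv jar (and any jars it depends on) and pass them explicitly:
/usr/hdp/current/spark-client/bin/sparkR --jars /tmp/spark-csv_2.10-1.5.0.jar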
03-23-2017
10:01 AM
Welcome to Spark version 1.6.0
Spark context is available as sc, SQL context is available as sqlContext
> Sys.setenv(SPARK_HOME="/usr/hdp/2.3.4.0.-3485/spark/bin/sparkR/")
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","Lib"),.libPaths()))
> Sys.setenv(JAVA_HOME="/usr/lib/jvm/java-8-oracle/")
> library(SparkR)
> lines<-SparkR:::textFile(sc,"hdfs: /user/midhun/f.txt")
17/03/23 14:37:58 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 306.2 KB, free 306.2 KB)
17/03/23 14:37:58 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 26.1 KB, free 332.3 KB)
17/03/23 14:37:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:39935 (size: 26.1 KB, free: 511.1 MB)
17/03/23 14:37:58 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2
> words<-SparkR:::flatMap(lines,function(line){strsplit(line," ")[[1]]})
> wordcount<-SparkR:::lapply(words,function(word){list(word,1)})
> counts<-SparkR:::reduceByKey(wordcount,"+",numPartition=2)
> output<-collect(counts)
17/03/23 14:40:03 INFO SparkContext: Starting job: collect at NativeMethodAccessorImpl.java:-2
17/03/23 14:40:03 WARN DAGScheduler: Creating new stage failed due to exception - job: 0
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: hdfs:%20/user/midhun/f.txt
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:244)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:411)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at scala.Option.map(Option.scala:145)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:195)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.api.r.BaseRRDD.getPartitions(RRDD.scala:47)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.api.r.BaseRRDD.getPartitions(RRDD.scala:47)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:91)
at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:226)
at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:224)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.dependencies(RDD.scala:224)
at org.apache.spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:386)
at org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:398)
at org.apache.spark.scheduler.DAGScheduler.getParentStagesAndId(DAGScheduler.scala:299)
at org.apache.spark.scheduler.DAGScheduler.newResultStage(DAGScheduler.scala:334)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:837)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1607)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: hdfs:%20/user/midhun/f.txt
at java.net.URI.checkPath(URI.java:1823)
at java.net.URI.<init>(URI.java:745)
at org.apache.hadoop.fs.Path.initialize(Path.java:202)
... 44 more
17/03/23 14:40:03 INFO DAGScheduler: Job 0 failed: collect at NativeMethodAccessorImpl.java:-2, took 0.011744 s
17/03/23 14:40:03 ERROR RBackendHandler: collect on 17 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: hdfs:%20/user/midhun/f.txt
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:244)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:411)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at scala.Option.map(Option.scala:145)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:195)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(R
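Looking at the trace, the %20 suggests the space after "hdfs:" in the path is what breaks the URI; presumably the call should have been written like this instead (my assumption, not yet retested):

lines <- SparkR:::textFile(sc, "hdfs:///user/midhun/f.txt")   # no space after hdfs:, triple slash uses the default namenode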