
Spark dataframe write to file using Scala


I am trying to read a file and add two extra columns: 1. a sequence number and 2. the filename. When I run the Spark job in the Scala IDE the output is generated correctly, but when I run it from PuTTY in local or cluster mode the job gets stuck at stage 2 (save at File_Process). There is no progress even after waiting for an hour. I am testing on 1 GB of data.

Below is the code I am using:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object File_Process
{
 Logger.getLogger("org").setLevel(Level.ERROR)

 val spark = SparkSession
             .builder()
             .master("yarn")
             .appName("File_Process")
             .getOrCreate()

 // Placeholder values so the snippet is self-contained
 val SEED: Long = 0L
 val filename = "sourcefile"

 def main(arg: Array[String])
 {
  val FileDF = spark.read
               .csv("/data/sourcefile/")

  // Prepend a sequence number (SEED + row index + 1) to every row
  val rdd = FileDF.rdd.zipWithIndex()
            .map(indexedRow => Row.fromSeq((indexedRow._2.toLong + SEED + 1) +: indexedRow._1.toSeq))

  // Schema = UniqueRowIdentifier column followed by the original columns
  val FileDFWithSeqNo = StructType(Array(StructField("UniqueRowIdentifier", LongType)) ++ FileDF.schema.fields)

  val datasetnew = spark.createDataFrame(rdd, FileDFWithSeqNo)

  // Tag every row with the source filename
  val dataframefinal = datasetnew.withColumn("Filetag", lit(filename))

  dataframefinal.write
                .mode("overwrite")
                .format("com.databricks.spark.csv")
                .option("delimiter", "|")
                .save("/data/text_file/")

  spark.stop()
 }
}
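
For comparison, the same two columns could also be added with the built-in functions monotonically_increasing_id() and input_file_name(), without going through the RDD. This is only a minimal sketch (the object name is just for the example, and the identifier it produces is unique but not consecutive), using the same paths as above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{input_file_name, monotonically_increasing_id}

object File_Process_Builtins
{
 def main(arg: Array[String])
 {
  val spark = SparkSession.builder().appName("File_Process_Builtins").getOrCreate()

  spark.read
       .csv("/data/sourcefile/")
       .withColumn("UniqueRowIdentifier", monotonically_increasing_id()) // unique, but not consecutive
       .withColumn("Filetag", input_file_name())                         // full path of the source file
       .write
       .mode("overwrite")
       .option("delimiter", "|")
       .csv("/data/text_file/")

  spark.stop()
 }
}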

 

The output path is the cluster location /data/text_file/. This folder is created by the Spark job when stage 2 starts, and I can see temporary files being created, e.g. /data/text_file/_temporary/0/_temporary/attempt_20170426054102_0002_m_000000_0, and the attempt_20170426054102_0002_m_000000_0 file is 0 KB.

 

I am using Spark 2.1.0 on CDH 5.10.1.

 

I am using the following command to run the Spark job:

 spark-submit --deploy-mode cluster --class "File_Process" ~/File_Process.jar
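
For reference, the same submission with the YARN master made explicit on the command line (in the code it is only set via the SparkSession builder) would be:

 spark-submit --master yarn --deploy-mode cluster --class "File_Process" ~/File_Process.jar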

 

Thanks in advance.

Thanks,
L Raghunath.