Spark CSV join is saving only column names + spark 1.4.1

Hi,

Sample code

spark-shell  --packages com.databricks:spark-csv_2.10:1.1.0  --master yarn-client --driver-memory 512m --executor-memory 512m

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.orc._
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, FloatType, LongType, TimestampType, NullType}

val firstSchema = StructType(Seq(
StructField("COLUMN1", StringType, true),
StructField("COLUMN2", StringType, true),
StructField("COLUMN2", StringType, true),
StructField("COLUMN3", StringType, true),
StructField("COLUMN4", StringType, true),
StructField("COLUMN5", StringType, true)))
val file1df = hiveContext.read.format("com.databricks.spark.csv").option("header", "true").schema(firstSchema).load("/tmp/File1.csv")
val secondSchema = StructType(Seq(
StructField("COLUMN1", StringType, true),
StructField("COLUMN2", NullType, true),
StructField("COLUMN3", TimestampType, true),
StructField("COLUMN4", TimestampType, true),
StructField("COLUMN5", NullType, true),
StructField("COLUMN6", StringType, true),
StructField("COLUMN7", IntegerType, true),
StructField("COLUMN8", IntegerType, true),
StructField("COLUMN9", StringType, true),
StructField("COLUMN10", IntegerType, true),
StructField("COLUMN11", IntegerType, true),
StructField("COLUMN12", IntegerType, true)))
val file2df = hiveContext.read.format("com.databricks.spark.csv").option("header", "false").schema(secondSchema).load("/tmp/file2.csv")
val joineddf = file1df.join(file2df, file1df("COLUMN1") === file2df("COLUMN6"))
val selecteddata = joineddf.select(file1df("COLUMN2"),file2df("COLUMN10"))
// Displays the joined rows in the console
joineddf.collect.foreach(println)
// Saves only the column names; the output contains no joined rows
selecteddata.write.format("com.databricks.spark.csv").option("header", "true").save("/tmp/JoinedData.csv")

Re: Spark CSV join is saving only column names + spark 1.4.1

@Divya Gehlot I assume the join produced records and you were able to print out joineddf to see the resulting rows?

Can you provide File1.csv and file2.csv, or at least an anonymized version of them?
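
In the meantime, a quick check along these lines might show whether the join and the selection actually return rows. This is only a sketch: it assumes file1df, file2df, joineddf and selecteddata from the question are still defined in the same spark-shell session, and the count variable names are made up for illustration.

```scala
// Assumes file1df, file2df, joineddf and selecteddata from the question
// are already defined in the current spark-shell session.

// Row counts at each stage; a zero here would explain a header-only output.
val file1Count    = file1df.count()
val file2Count    = file2df.count()
val joinedCount   = joineddf.count()
val selectedCount = selecteddata.count()
println(s"file1=$file1Count file2=$file2Count joined=$joinedCount selected=$selectedCount")

// Inspect the schema and a few rows of what is about to be written.
selecteddata.printSchema()
selecteddata.show(5)
```

If joinedCount is already zero, the problem is likely in the join keys or in how the files are parsed against the schemas rather than in the write step.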

Re: Spark CSV join is saving only column names + spark 1.4.1

Hi @Divya Gehlot, can you provide these files, or is this no longer an issue?