Created 04-10-2018 03:21 PM
I'm using Scala to read data from S3, and then perform some analysis on it.
Suppose that in /path/to/my/data, there are 4 "chunks": a.parquet, b.parquet, c.parquet, and d.parquet
In my results, I want one of the columns to show which chunk the data came from.
Is that possible, and if so, how?
val df = spark.read.parquet("s3://path/to/my/data")
val frame = spark.sql(s""" SELECT some things... """)
Created 04-10-2018 04:35 PM
Zack,
I would use the input_file_name function to add the source file name as a column on df:

import org.apache.spark.sql.functions.input_file_name

val df2 = df.withColumn("fileName", input_file_name())
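Putting it together with the read from the question (a sketch; the S3 path is the placeholder from the original post):

import org.apache.spark.sql.functions.input_file_name

val df = spark.read.parquet("s3://path/to/my/data")
// input_file_name() returns the full path of the file each row was read from,
// e.g. s3://path/to/my/data/a.parquet
val df2 = df.withColumn("fileName", input_file_name())
df2.select("fileName").distinct().show(truncate = false)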
Created 04-10-2018 05:19 PM
Thanks @Amol Thacker
One quick followup: do you know what the syntax would be to strip the path from the file name?
So, convert /my/path/to/my/file.txt to file.txt
I'm new to Scala and struggling with the syntax...
Created 04-10-2018 07:51 PM
In the most trivial manner:

import org.apache.spark.sql.functions.{input_file_name, udf}

val fileNameFromPath: String => String = _.split("/").takeRight(1).head
val fileNameUDF = udf(fileNameFromPath)
val df2 = df.withColumn("fileName", fileNameUDF(input_file_name()))
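As an aside, the same thing can be done without a UDF via the built-in substring_index function (a sketch, not from the answer above):

import org.apache.spark.sql.functions.{input_file_name, substring_index}

// substring_index(col, "/", -1) keeps everything after the last "/",
// i.e. the bare file name
val df2 = df.withColumn("fileName", substring_index(input_file_name(), "/", -1))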
Created 04-10-2018 08:31 PM
Here's what I ended up with:
import org.apache.spark.sql.functions.{callUDF, input_file_name}

spark.udf.register("getOnlyFileName", (fullPath: String) => fullPath.split("/").last)
val df2 = df1.withColumn("source_file_name2", callUDF("getOnlyFileName", input_file_name()))
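A quick sanity check (hypothetical output, assuming the four chunks named in the question):

df2.select("source_file_name2").distinct().show()
// should list just the four chunk names: a.parquet, b.parquet, c.parquet, d.parquet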