
Access chunk name in Spark / Scala

Super Collaborator

I'm using Scala to read data from S3, and then perform some analysis on it.

Suppose that in /path/to/my/data, there are 4 "chunks": a.parquet, b.parquet, c.parquet, and d.parquet

In my results, I want one of the columns to show which chunk the data came from.

Is that possible, and if so, how?

val df = spark.read.parquet("s3://path/to/my/data")

val frame = spark.sql(s"""SELECT some things...""")
ACCEPTED SOLUTION

Contributor

Zack,

I would use the input_file_name function to add the source file name to df as a column:

val df2 = df.withColumn("fileName", input_file_name())
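
For reference, a minimal end-to-end sketch of this approach (assuming Spark 2.x or later; the path and the "fileName" column name just follow the question):

import org.apache.spark.sql.functions.input_file_name

val df = spark.read.parquet("s3://path/to/my/data")
// input_file_name() holds the full path of the source file for each row, e.g. s3://path/to/my/data/a.parquet
val df2 = df.withColumn("fileName", input_file_name())
df2.select("fileName").distinct().show(false)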


REPLIES

Super Collaborator

Thanks @Amol Thacker!

One quick followup: do you know what the syntax would be to strip the path from the file name?

So, convert /my/path/to/my/file.txt to file.txt

I'm new to Scala and struggling with the syntax...

Contributor

In the most trivial manner,

// take everything after the last "/" of the path
val fileNameFromPath: String => String = _.split("/").takeRight(1).head

import org.apache.spark.sql.functions.{input_file_name, udf}
val fileNameUDF = udf(fileNameFromPath)

val df2 = df.withColumn("fileName", fileNameUDF(input_file_name()))
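
If you'd rather avoid a UDF altogether, Spark's built-in substring_index should handle this too; a sketch of that alternative, assuming the same df as above and "/" as the only path separator:

import org.apache.spark.sql.functions.{input_file_name, substring_index}

// keep only the part of the input path after the last "/"
val df2 = df.withColumn("fileName", substring_index(input_file_name(), "/", -1))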

Super Collaborator

Here's what I ended up with:
import org.apache.spark.sql.functions.{callUDF, input_file_name}
spark.udf.register("getOnlyFileName", (fullPath: String) => fullPath.split("/").last)
val df2 = df1.withColumn("source_file_name2", callUDF("getOnlyFileName", input_file_name()))
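
As a quick sanity check (assuming the df2 above), listing the distinct values of the new column should show only the chunk file names, e.g. a.parquet through d.parquet from the example:

df2.select("source_file_name2").distinct().show(false)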