Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Access chunk name in Spark / Scala

Super Collaborator

I'm using Scala to read data from S3, and then perform some analysis on it.

Suppose that in /path/to/my/data, there are 4 "chunks": a.parquet, b.parquet, c.parquet, and d.parquet

In my results, I want one of the columns to show which chunk the data came from.

Is that possible, and if so, how?

val df = spark.read.parquet("s3://path/to/my/data")

val frame = spark.sql(s"""SELECT some things...""")
1 ACCEPTED SOLUTION

Contributor

Zack,

I would use the input_file_name function to add the file name column to df:

import org.apache.spark.sql.functions.input_file_name

val df2 = df.withColumn("fileName", input_file_name())


4 REPLIES 4

Contributor

Zack,

I would use the input_file_name function to add the file name column to df:

import org.apache.spark.sql.functions.input_file_name

val df2 = df.withColumn("fileName", input_file_name())

Super Collaborator

Thanks @Amol Thacker

One quick followup: do you know what the syntax would be to strip the path from the file name?

So, convert /my/path/to/my/file.txt to file.txt

I'm new to scala and struggling w/ syntax...

Contributor
In the most trivial manner:

import org.apache.spark.sql.functions.{input_file_name, udf}

val fileNameFromPath: String => String = _.split("/").takeRight(1).head
val fileNameUDF = udf(fileNameFromPath)

val df2 = df.withColumn("fileName", fileNameUDF(input_file_name()))

Super Collaborator

Here's what I ended up with:

import org.apache.spark.sql.functions.{callUDF, input_file_name}

spark.udf.register("getOnlyFileName", (fullPath: String) => fullPath.split("/").last)
val df2 = df1.withColumn("source_file_name2", callUDF("getOnlyFileName", input_file_name()))
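For anyone landing here later: the lambda passed to spark.udf.register is plain Scala, so the path-stripping logic can be sanity-checked without a Spark session. A minimal sketch (the file paths are made up for illustration):

```scala
// The same path-stripping logic as the registered UDF, checked as plain Scala.
val getOnlyFileName: String => String = _.split("/").last

println(getOnlyFileName("/my/path/to/my/file.txt"))        // file.txt
println(getOnlyFileName("s3://path/to/my/data/a.parquet")) // a.parquet
```

Note that split("/").last is equivalent to the split("/").takeRight(1).head form used above; both return the final path segment.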