Access chunk name in Spark / Scala

zack_riesland — Tue, 10 Apr 2018 22:21:36 GMT

I'm using Scala to read data from S3, and then perform some analysis on it.

Suppose that in /path/to/my/data, there are 4 "chunks": a.parquet, b.parquet, c.parquet, and d.parquet

In my results, I want one of the columns to show which chunk the data came from.

Is that possible, and if so, how?

val df = spark.read.parquet("s3://path/to/my/data")

val frame = spark.sql(s"""  SELECT some things... """)

Re: Access chunk name in Spark / Scala

athacker — Tue, 10 Apr 2018 23:35:11 GMT

Zack,

I would use the input_file_name function to add a file name column to df:

import org.apache.spark.sql.functions.input_file_name
val df2 = df.withColumn("fileName", input_file_name())

Re: Access chunk name in Spark / Scala

zack_riesland — Wed, 11 Apr 2018 00:19:53 GMT

Thanks @Amol Thacker

One quick followup: do you know what the syntax would be to strip the path from the file name?

So, convert /my/path/to/my/file.txt to file.txt

I'm new to Scala and struggling with the syntax...

Re: Access chunk name in Spark / Scala

athacker — Wed, 11 Apr 2018 02:51:12 GMT

In the most trivial manner,

val fileNameFromPath: String => String = _.split("/").takeRight(1).head

import org.apache.spark.sql.functions.{input_file_name, udf}
val fileNameUDF = udf(fileNameFromPath)

val df2 = df.withColumn("fileName", fileNameUDF(input_file_name()))
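
A side note on the path-stripping step itself: the split-based one-liner can be tried in plain Scala without Spark. The sketch below is only an illustration. The object name FileNameSketch and the helper baseName are hypothetical names, not anything from the thread, and it swaps split for lastIndexOf, which is a matter of taste rather than a required change.

```scala
// Hypothetical helper (not from the thread): return the last path segment
// using lastIndexOf instead of split.
object FileNameSketch {
  def baseName(fullPath: String): String =
    // lastIndexOf returns -1 when there is no "/", so the +1 yields index 0
    // and the whole string comes back unchanged.
    fullPath.substring(fullPath.lastIndexOf('/') + 1)

  def main(args: Array[String]): Unit = {
    println(baseName("/my/path/to/my/file.txt")) // file.txt
    println(baseName("file.txt"))                // file.txt
  }
}
```

One difference to be aware of: for a path that ends in a slash, such as "a/b/", this version returns an empty string, while split("/").last returns "b". For file paths like the ones in the question, both approaches should give the same result.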

Re: Access chunk name in Spark / Scala

zack_riesland — Wed, 11 Apr 2018 03:31:48 GMT

Here's what I ended up with:
import org.apache.spark.sql.functions.{callUDF, input_file_name}
spark.udf.register("getOnlyFileName", (fullPath: String) => fullPath.split("/").last)
val df2 = df1.withColumn("source_file_name2", callUDF("getOnlyFileName", input_file_name()))
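
A possible alternative that avoids a UDF altogether is Spark's built-in substring_index function, which with a negative count keeps everything to the right of the last delimiter. This is only a sketch, assuming the Spark SQL library is on the classpath. The object name SourceFileSketch, the helpers baseName and withSourceFile, and the column name source_file_name are illustrative and not from the thread.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{input_file_name, substring_index}

// Illustrative names only; nothing here comes from the thread above.
object SourceFileSketch {
  // count = -1 counts delimiters from the right, so this keeps the text
  // after the last "/" in the path.
  def baseName(path: Column): Column = substring_index(path, "/", -1)

  // Adds a column holding the name of the file each row was read from.
  def withSourceFile(df: DataFrame): DataFrame =
    df.withColumn("source_file_name", baseName(input_file_name()))
}
```

Usage would look like val df2 = SourceFileSketch.withSourceFile(df). Because substring_index is a native column expression, Spark's optimizer can see through it, which is generally not the case for a registered UDF.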