Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Access chunk name in Spark / Scala

Super Collaborator

I'm using Scala to read data from S3, and then perform some analysis on it.

Suppose that in /path/to/my/data, there are 4 "chunks": a.parquet, b.parquet, c.parquet, and d.parquet

In my results, I want one of the columns to show which chunk the data came from.

Is that possible, and if so, how?

val df = spark.read.parquet("s3://path/to/my/data")

val frame = spark.sql(s"""SELECT some things...""")
1 ACCEPTED SOLUTION

Contributor

Zack,

I would use the input_file_name function to add the file name column to df:

import org.apache.spark.sql.functions.input_file_name

val df2 = df.withColumn("fileName", input_file_name())


4 REPLIES 4

Contributor

Zack,

I would use the input_file_name function to add the file name column to df:

import org.apache.spark.sql.functions.input_file_name

val df2 = df.withColumn("fileName", input_file_name())

Super Collaborator

Thanks @Amol Thacker

One quick followup: do you know what the syntax would be to strip the path from the file name?

So, convert /my/path/to/my/file.txt to file.txt

I'm new to scala and struggling w/ syntax...

Contributor
In the most trivial manner:

import org.apache.spark.sql.functions.{input_file_name, udf}

val fileNameFromPath: String => String = _.split("/").takeRight(1).head
val fileNameUDF = udf(fileNameFromPath)

val df2 = df.withColumn("fileName", fileNameUDF(input_file_name()))

Super Collaborator

Here's what I ended up with:

import org.apache.spark.sql.functions.{callUDF, input_file_name}

spark.udf.register("getOnlyFileName", (fullPath: String) => fullPath.split("/").last)
val df2 = df1.withColumn("source_file_name2", callUDF("getOnlyFileName", input_file_name()))
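For anyone landing here later: the lambda passed to spark.udf.register is plain Scala, so the path-stripping logic can be sanity-checked without a Spark session. A minimal sketch (the file paths are made up for illustration):

```scala
// The same path-stripping logic as the registered UDF, checked as plain Scala.
val getOnlyFileName: String => String = _.split("/").last

println(getOnlyFileName("/my/path/to/my/file.txt"))        // file.txt
println(getOnlyFileName("s3://path/to/my/data/a.parquet")) // a.parquet
```

Note that split("/").last is equivalent to the split("/").takeRight(1).head form used above; both return the final path segment.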