<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Access chunk name in Spark / Scala in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Access-chunk-name-in-Spark-Scala/m-p/215567#M77069</link>
    <description>&lt;P&gt;Zack,&lt;/P&gt;&lt;P&gt;I would use the &lt;A href="https://spark.apache.org/docs/latest/api/sql/index.html#input_file_name"&gt;input_file_name&lt;/A&gt; function to add a file name column to df:&lt;/P&gt;&lt;PRE&gt;import org.apache.spark.sql.functions.input_file_name
val df2 = df.withColumn("fileName", input_file_name())
&lt;/PRE&gt;</description>
    <pubDate>Tue, 10 Apr 2018 23:35:11 GMT</pubDate>
    <dc:creator>athacker</dc:creator>
    <dc:date>2018-04-10T23:35:11Z</dc:date>
    <item>
      <title>Access chunk name in Spark / Scala</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Access-chunk-name-in-Spark-Scala/m-p/215566#M77068</link>
      <description>&lt;P&gt;I'm using Scala to read data from S3, and then perform some analysis on it.&lt;/P&gt;&lt;P&gt;Suppose that in /path/to/my/data, there are 4 "chunks": a.parquet, b.parquet, c.parquet, and d.parquet&lt;/P&gt;&lt;P&gt;In my results, I want one of the columns to show which chunk the data came from.&lt;/P&gt;&lt;P&gt;Is that possible, and if so, how?&lt;/P&gt;&lt;PRE&gt;val df = spark.read.parquet("s3://path/to/my/data")

val frame = spark.sql(s""" SELECT some things... """)&lt;/PRE&gt;</description>
      <pubDate>Tue, 10 Apr 2018 22:21:36 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Access-chunk-name-in-Spark-Scala/m-p/215566#M77068</guid>
      <dc:creator>zack_riesland</dc:creator>
      <dc:date>2018-04-10T22:21:36Z</dc:date>
    </item>
    <item>
      <title>Re: Access chunk name in Spark / Scala</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Access-chunk-name-in-Spark-Scala/m-p/215567#M77069</link>
      <description>&lt;P&gt;Zack,&lt;/P&gt;&lt;P&gt;I would use the &lt;A href="https://spark.apache.org/docs/latest/api/sql/index.html#input_file_name"&gt;input_file_name&lt;/A&gt; function to add a file name column to df:&lt;/P&gt;&lt;PRE&gt;import org.apache.spark.sql.functions.input_file_name
val df2 = df.withColumn("fileName", input_file_name())
&lt;/PRE&gt;</description>
      <pubDate>Tue, 10 Apr 2018 23:35:11 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Access-chunk-name-in-Spark-Scala/m-p/215567#M77069</guid>
      <dc:creator>athacker</dc:creator>
      <dc:date>2018-04-10T23:35:11Z</dc:date>
    </item>
    <item>
      <title>Re: Access chunk name in Spark / Scala</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Access-chunk-name-in-Spark-Scala/m-p/215568#M77070</link>
      <description>&lt;P&gt;Thanks &lt;A rel="user" href="https://community.cloudera.com/users/70526/athacker.html" nodeid="70526"&gt;@Amol Thacker&lt;/A&gt;&lt;/P&gt;&lt;P&gt;One quick follow-up: do you know what the syntax would be to strip the path from the file name?&lt;/P&gt;&lt;P&gt;So, convert /my/path/to/my/file.txt to file.txt&lt;/P&gt;&lt;P&gt;I'm new to Scala and struggling with the syntax...&lt;/P&gt;</description>
      <pubDate>Wed, 11 Apr 2018 00:19:53 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Access-chunk-name-in-Spark-Scala/m-p/215568#M77070</guid>
      <dc:creator>zack_riesland</dc:creator>
      <dc:date>2018-04-11T00:19:53Z</dc:date>
    </item>
    <item>
      <title>Re: Access chunk name in Spark / Scala</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Access-chunk-name-in-Spark-Scala/m-p/215569#M77071</link>
      <description>&lt;P&gt;In the most trivial manner:&lt;/P&gt;&lt;PRE&gt;

val fileNameFromPath: String =&amp;gt; String = _.split("/").last
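
// Sketch (not from the original post): a quick check of the same split logic on a
// plain string, using the example path from the question. No Spark session is needed;
// exampleName is a hypothetical name used only for this illustration.
val exampleName = "/my/path/to/my/file.txt".split("/").last
assert(exampleName == "file.txt")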

import org.apache.spark.sql.functions.{input_file_name, udf}
val fileNameUDF = udf(fileNameFromPath)

val df2 = df.withColumn("fileName", fileNameUDF(input_file_name()))&lt;/PRE&gt;</description>
      <pubDate>Wed, 11 Apr 2018 02:51:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Access-chunk-name-in-Spark-Scala/m-p/215569#M77071</guid>
      <dc:creator>athacker</dc:creator>
      <dc:date>2018-04-11T02:51:12Z</dc:date>
    </item>
    <item>
      <title>Re: Access chunk name in Spark / Scala</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Access-chunk-name-in-Spark-Scala/m-p/215570#M77072</link>
      <description>&lt;P&gt;Here's what I ended up with:&lt;BR /&gt;spark.udf.register("getOnlyFileName", (fullPath: String) =&amp;gt; fullPath.split("/").last)&lt;BR /&gt;val df2 = df1.withColumn("source_file_name2", callUDF("getOnlyFileName", input_file_name()))&lt;/P&gt;</description>
      <pubDate>Wed, 11 Apr 2018 03:31:48 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Access-chunk-name-in-Spark-Scala/m-p/215570#M77072</guid>
      <dc:creator>zack_riesland</dc:creator>
      <dc:date>2018-04-11T03:31:48Z</dc:date>
    </item>
  </channel>
</rss>

