<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>How to get the non-group-by columns in Spark Structured Streaming (Support Questions)</title>
    <link>https://community.cloudera.com/t5/Support-Questions/How-to-get-the-non-group-by-columns-in-spark-structured/m-p/200511#M162532</link>
    <description>&lt;P&gt;Hi, below are the input and output schemas.&lt;/P&gt;&lt;P&gt;Input: row_id, ODS_WII_VERB, stg_load_ts, other_columns&lt;/P&gt;&lt;P&gt;Output: the maximum timestamp, grouped by row_id and ODS_WII_VERB&lt;/P&gt;&lt;P&gt;Issue: because only row_id and ODS_WII_VERB appear in the group-by clause, we cannot select the other columns. How can we retrieve the other columns as well? We tried writing a Spark SQL subquery, but subqueries do not seem to work in Spark Structured Streaming. How can this be resolved?&lt;/P&gt;&lt;P&gt;Code snippet:&lt;/P&gt;&lt;P&gt;import org.apache.spark.sql.functions.{col, regexp_replace}

    val csvDF = sparkSession
      .readStream
      .option("sep", ",")
      .schema(userSchema)
      .csv("C:\\Users\\M1037319\\Desktop\\data")

    val updatedDf = csvDF.withColumn("ODS_WII_VERB", regexp_replace(col("ODS_WII_VERB"), "I", "U"))
    updatedDf.printSchema()

    val grpbyDF = updatedDf.groupBy("ROW_ID", "ODS_WII_VERB").max("STG_LOAD_TS")&lt;/P&gt;</description>
    <pubDate>Sat, 03 Feb 2018 16:45:01 GMT</pubDate>
    <dc:creator>elango_rk</dc:creator>
    <dc:date>2018-02-03T16:45:01Z</dc:date>
  </channel>
</rss>