08-28-2018
11:43 AM
Hi, I am joining two tables in Spark SQL, and one of the tables is skewed. How can I handle this skew? I am using Spark 2.2.1 on AWS EMR. Please assist.
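One common mitigation (not from this thread; a hedged sketch) is key salting: append a random suffix to the join key on the skewed side, replicate the smaller side once per suffix, and join on the salted key so the hot key spreads across partitions. A minimal Scala sketch, where `large`, `small`, and the join column `id` are assumed names:

```scala
import org.apache.spark.sql.functions._

val saltBuckets = 10

// Skewed side: append a random salt 0..9 to each row's key
val saltedLarge = large.withColumn(
  "salted_id",
  concat(col("id").cast("string"), lit("#"),
         (rand() * saltBuckets).cast("int").cast("string"))
)

// Small side: replicate each row once per salt bucket so every salted key matches
val salts = array((0 until saltBuckets).map(lit): _*)
val saltedSmall = small
  .withColumn("salt", explode(salts))
  .withColumn("salted_id",
    concat(col("id").cast("string"), lit("#"), col("salt").cast("string")))

val joined = saltedLarge.join(saltedSmall, "salted_id")
```

If the smaller table fits in executor memory, a simpler fix is to skip salting and broadcast it, e.g. `large.join(broadcast(small), "id")`, which avoids the skewed shuffle entirely.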
Labels:
- Apache Spark
02-04-2018
07:22 AM
@Shu Thanks a lot for the answer. In my case the non-group-by columns are string data types. Can I use string-typed non-group-by columns inside the aggregation function? Can I create a temp view on the DataFrame and then use a subquery to retrieve the results? Is this possible in Structured Streaming?
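Registering a streaming DataFrame as a temp view and running plain SQL aggregations over it is supported; a sketch, with the view name assumed and column names taken from this thread:

```scala
// Register the streaming DataFrame as a SQL view (view name is an assumption)
updatedDf.createOrReplaceTempView("staged_rows")

// A straight streaming aggregation over the view works
val latest = sparkSession.sql("""
  SELECT ROW_ID, ODS_WII_VERB, MAX(STG_LOAD_TS) AS MAX_TS
  FROM staged_rows
  GROUP BY ROW_ID, ODS_WII_VERB
""")
```

Note, however, that correlated subqueries and joining a stream back to itself are among the operations Structured Streaming does not support on Spark 2.2, which is likely why the subquery approach failed.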
02-04-2018
05:28 AM
Hi @Shu, I have a few other columns in the input apart from ROW_ID and ODS_WII_VERB, but they are not part of the group-by clause. How can I retrieve those columns as well?
02-03-2018
08:45 AM
Hi, below are the input and output schemas.

i/p: row_id, ODS_WII_VERB, stg_load_ts, other_columns
o/p: the max timestamp, grouped by row_id and ODS_WII_VERB

Issue: because only row_id and ODS_WII_VERB appear in the group-by clause, we are unable to retrieve the other columns. How can we get the other columns as well? We tried writing a Spark SQL subquery, but subqueries do not seem to work in Spark Structured Streaming.

Code snippet:

val csvDF = sparkSession
  .readStream
  .option("sep", ",")
  .schema(userSchema)
  .csv("C:\\Users\\M1037319\\Desktop\\data")

val updatedDf = csvDF.withColumn("ODS_WII_VERB",
  regexp_replace(col("ODS_WII_VERB"), "I", "U"))
updatedDf.printSchema()

val grpbyDF = updatedDf.groupBy("ROW_ID", "ODS_WII_VERB").max("STG_LOAD_TS")
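One way to carry the non-grouped columns through the aggregation without a subquery or self-join (a sketch, not confirmed in this thread) is to take the `max` of a struct whose first field is the timestamp: Spark orders structs field by field, so the maximum struct is the one with the latest STG_LOAD_TS, and the remaining fields ride along. `OTHER_COL` below is a placeholder for the real column names:

```scala
import org.apache.spark.sql.functions._

// Timestamp first inside the struct, so max() orders by STG_LOAD_TS
// and carries the other (e.g. string) columns along with the winner.
val latestPerKey = updatedDf
  .groupBy("ROW_ID", "ODS_WII_VERB")
  .agg(max(struct(col("STG_LOAD_TS"), col("OTHER_COL"))).as("m"))
  .select(
    col("ROW_ID"),
    col("ODS_WII_VERB"),
    col("m.STG_LOAD_TS").as("STG_LOAD_TS"),
    col("m.OTHER_COL").as("OTHER_COL")
  )
```

This stays a plain streaming aggregation, so it avoids the stream self-join that Spark 2.2 does not support.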
Labels:
- Apache Spark