Spark Dataframe Query not working


I have been trying to make the following DataFrame query work, but it's not giving me the expected results. Can someone please help?

Is there a reference guide that covers the full syntax for DataFrame queries? Here is what I want: my DataFrame df has many columns, four of which are -

  1. id
  2. deviceFlag
  3. device
  4. domain

For each row, if deviceFlag is set to 1, build the string "device#domain" and store that value in a new column "id". Otherwise, store the value of the "device" column in the new "id" column unchanged. So I am creating the new column with .withColumn and setting the value for each row based on that row's "device" and "deviceFlag" values.

DataFrame result= df.withColumn("id",
    ((Column) when(df.col("deviceFlag").$eq$eq$eq(1),
        concat(df.col("device"), lit("#"),
            df.col("domain")))).otherwise(df.col("device")).alias("id"));

Super Collaborator

Try

df.withColumn("id",
    when($"deviceFlag" === 1,
        concat($"device", lit("#"),
            $"domain")).otherwise($"device"))


@Karthik Narayanan - thanks, I tried that, but it gives me "incompatible operand types String and int" at the statement when($"deviceFlag"===1. I am writing in Java - is there something else I am missing?

Super Collaborator

Try comparing with "1" - deviceFlag may be a string, so it needs to be compared with a string.


@Mark Wallace

This worked well for me in Spark (Scala), which looks to be the same as @Karthik Narayanan's answer:

import org.apache.spark.sql.SparkSession

// in spark-shell, `spark` is already defined; otherwise create it first
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val data = Seq((1111, 1, "device1", "domain1"), (2222, 0, "device2", "domain2"), (3333, 1, "device3", "domain3"), (4444, 0, "device4", "domain4"))

val df = data.toDF("id","deviceFlag","device","domain")

df.show()

val result = df.withColumn("new_id", when($"deviceFlag"===1, concat($"device", lit("#"), $"domain")).otherwise($"device"))

result.show()

This will output:

+----+----------+-------+-------+---------------+
|  id|deviceFlag| device| domain|         new_id|
+----+----------+-------+-------+---------------+
|1111|         1|device1|domain1|device1#domain1|
|2222|         0|device2|domain2|        device2|
|3333|         1|device3|domain3|device3#domain3|
|4444|         0|device4|domain4|        device4|
+----+----------+-------+-------+---------------+

As an alternative (and reference for others), the PySpark syntax will look like this:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import lit, concat

# in the pyspark shell, `spark` is already defined; otherwise create it first
spark = SparkSession.builder.getOrCreate()

data = [(1111, 1, 'device1', 'domain1'), (2222, 0, 'device2', 'domain2'), (3333, 1, 'device3', 'domain3'), (4444, 0, 'device4', 'domain4')]

df = spark.createDataFrame(data, ['id','deviceFlag','device','domain'])

df.show()

result = df.withColumn("new_id", F.when(df["deviceFlag"]==1, concat(df["device"], lit("#"), df["domain"])).otherwise(df["device"]))

result.show()

Hope this helps!

Super Collaborator

Good pick, Dan. I just noticed that "id" was part of his original DataFrame.


Thanks for the detailed answer. My code is in Java, so I am running into issues trying to convert this Scala solution. Any idea what the Java version of this could be?

@Karthik Narayanan - deviceFlag is an Int, so your comparison is correct. I just need the Java equivalent.

Super Collaborator
import static org.apache.spark.sql.functions.*;

df.withColumn("id",
    when(df.col("deviceFlag").equalTo(1),
        concat(df.col("device"), lit("#"),
            df.col("domain"))).otherwise(df.col("device")));


Thanks for the details, this is helpful.
