<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Spark Dataframe Query not working in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161159#M123538</link>
    <description>&lt;P&gt;Thanks for the details, this is helpful&lt;/P&gt;</description>
    <pubDate>Fri, 10 Feb 2017 03:09:14 GMT</pubDate>
    <dc:creator>excitingtimes03</dc:creator>
    <dc:date>2017-02-10T03:09:14Z</dc:date>
    <item>
      <title>Spark Dataframe Query not working</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161147#M123526</link>
      <description>&lt;P&gt;I have been trying to make the following DataFrame query work, but it's not giving me the expected results. Can someone please help?&lt;/P&gt;&lt;P&gt;Is there a reference guide that covers all the syntax for creating DataFrame queries? Here is what I want: my dataframe df has many columns, four of which are:&lt;/P&gt;&lt;OL&gt;
&lt;LI&gt;id&lt;/LI&gt;&lt;LI&gt;deviceFlag&lt;/LI&gt;&lt;LI&gt;device&lt;/LI&gt;&lt;LI&gt;domain&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;For each row, if deviceFlag is set to 1, form the string "device#domain" and store it in a new column "id"; otherwise, store the value of the "device" column in the new "id" column unchanged. So I am creating a new column with .withColumn, setting the value for each row based on the corresponding values of the "device" and "deviceFlag" columns in that row.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;DataFrame result= df.withColumn("id",
    ((Column) when(df.col("deviceFlag").$eq$eq$eq(1),
        concat(df.col("device"), lit("#"),
            df.col("domain")))).otherwise(df.col("device")).alias("id"));&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Mon, 06 Feb 2017 10:17:35 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161147#M123526</guid>
      <dc:creator>excitingtimes03</dc:creator>
      <dc:date>2017-02-06T10:17:35Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Dataframe Query not working</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161148#M123527</link>
      <description>&lt;P&gt;Try:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df.withColumn("id",
  when($"deviceFlag"===1,
    concat($"device", lit("#"), $"domain"))
  .otherwise($"device"));&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Mon, 06 Feb 2017 11:36:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161148#M123527</guid>
      <dc:creator>knarayanan</dc:creator>
      <dc:date>2017-02-06T11:36:10Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Dataframe Query not working</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161149#M123528</link>
      <description>&lt;P&gt;@Karthik Narayanan - thanks, I tried that but it's giving me "Incompatible operand types String and int" at the statement &lt;EM&gt;when($"deviceFlag"===1&lt;/EM&gt;. I am writing in Java - is there something else I am missing?&lt;/P&gt;</description>
      <pubDate>Tue, 07 Feb 2017 00:18:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161149#M123528</guid>
      <dc:creator>excitingtimes03</dc:creator>
      <dc:date>2017-02-07T00:18:03Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Dataframe Query not working</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161150#M123529</link>
      <description>&lt;P&gt;Try with a "1" , deviceflag may be a string so it needs to be compared with string .&lt;/P&gt;</description>
      <pubDate>Tue, 07 Feb 2017 00:27:44 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161150#M123529</guid>
      <dc:creator>knarayanan</dc:creator>
      <dc:date>2017-02-07T00:27:44Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Dataframe Query not working</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161151#M123530</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/15845/excitingtimes03.html" nodeid="15845" target="_blank"&gt;@Mark Wallace&lt;/A&gt;&lt;/P&gt;&lt;P&gt;This worked well for me in Spark (Scala), which looks to be the same as &lt;A rel="user" href="https://community.cloudera.com/users/10180/knarayanan.html" nodeid="10180" target="_blank"&gt;@Karthik Narayanan&lt;/A&gt;'s answer:&lt;/P&gt;&lt;PRE&gt;import org.apache.spark.sql.SparkSession
import spark.implicits._

val data = Seq((1111, 1, "device1", "domain1"), (2222, 0, "device2", "domain2"), (3333, 1, "device3", "domain3"), (4444, 0, "device4", "domain4"))

val df = data.toDF("id","deviceFlag","device","domain")

df.show()

val result = df.withColumn("new_id", when($"deviceFlag"===1, concat($"device", lit("#"), $"domain")).otherwise($"device"))

result.show()
&lt;/PRE&gt;&lt;P&gt;This will output:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="12157-screen-shot-2017-02-06-at-25603-pm.png" style="width: 710px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/20690iCC5091860E940F4A/image-size/medium?v=v2&amp;amp;px=400" role="button" title="12157-screen-shot-2017-02-06-at-25603-pm.png" alt="12157-screen-shot-2017-02-06-at-25603-pm.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;As an alternative (and reference for others), the PySpark syntax will look like this:&lt;/P&gt;&lt;PRE&gt;from pyspark import SparkContext, SparkConf
from pyspark.sql import functions as F
from pyspark.sql.functions import lit, concat

data = [(1111, 1, 'device1', 'domain1'), (2222, 0, 'device2', 'domain2'), (3333, 1, 'device3', 'domain3'), (4444, 0, 'device4', 'domain4')]

df = spark.createDataFrame(data, ['id','deviceFlag','device','domain'])

df.show()

result = df.withColumn("new_id", F.when(df["deviceFlag"]==1, concat(df["device"], lit("#"), df["domain"])).otherwise(df["device"]))

result.show()
&lt;/PRE&gt;&lt;P&gt;Hope this helps!&lt;/P&gt;</description>
      <pubDate>Sun, 18 Aug 2019 11:55:35 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161151#M123530</guid>
      <dc:creator>dzaratsian</dc:creator>
      <dc:date>2019-08-18T11:55:35Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Dataframe Query not working</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161152#M123531</link>
      <description>&lt;P&gt;Good pick, Dan. I just noticed that id was part of his original DF.&lt;/P&gt;</description>
      <pubDate>Tue, 07 Feb 2017 05:47:32 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161152#M123531</guid>
      <dc:creator>knarayanan</dc:creator>
      <dc:date>2017-02-07T05:47:32Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Dataframe Query not working</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161153#M123532</link>
      <description>&lt;P&gt;Thanks for the detailed answer. My code is in Java, so I am running into issues trying to convert this Scala solution into Java. Any idea what the Java version of this could be?&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.hortonworks.com/users/10180/knarayanan.html"&gt;@Karthik Narayanan&lt;/A&gt; - deviceFlag is an Int, so your comparison is correct. I just need the Java equivalent of this.&lt;/P&gt;</description>
      <pubDate>Tue, 07 Feb 2017 06:17:14 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161153#M123532</guid>
      <dc:creator>excitingtimes03</dc:creator>
      <dc:date>2017-02-07T06:17:14Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Dataframe Query not working</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161154#M123533</link>
      <description>&lt;DIV&gt;&lt;A rel="user" href="https://community.cloudera.com/users/15845/excitingtimes03.html" nodeid="15845"&gt;@Mark Wallace&lt;/A&gt; &lt;/DIV&gt;&lt;OL&gt;&lt;LI&gt;&lt;CODE&gt;df.withColumn("id",&lt;/CODE&gt;&lt;/LI&gt;&lt;LI&gt;&lt;CODE&gt;when(df.col("deviceFlag").equalTo(1),&lt;/CODE&gt;&lt;/LI&gt;&lt;LI&gt;&lt;CODE&gt;concat(df.col("device"), lit("#"),&lt;/CODE&gt;&lt;/LI&gt;&lt;LI&gt;&lt;CODE&gt;df.col("domain"))).otherwise(df.col("device")));&lt;/CODE&gt;&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Tue, 07 Feb 2017 14:02:05 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161154#M123533</guid>
      <dc:creator>knarayanan</dc:creator>
      <dc:date>2017-02-07T14:02:05Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Dataframe Query not working</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161155#M123534</link>
      <description>&lt;P&gt;I think your solution is OK. You probably have a problem with the input data (try calling df.show() and df.printSchema() on the original df before the transformation).

I tried out your code with this class:&lt;/P&gt;&lt;PRE&gt;package com.hortonworks.spark.example;


import static org.apache.spark.sql.functions.when;
import static org.apache.spark.sql.functions.concat;
import static org.apache.spark.sql.functions.lit;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkConf;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

import java.io.Serializable;
import java.util.Arrays;
import java.util.List;


public class SparkStarter {
  public static void main(String[] args) {
    new SparkStarter().run();
  }

  private void run() {

    SparkConf conf = new SparkConf().setAppName("test").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    List&amp;lt;Record&amp;gt; data = Arrays.asList(new Record(1, null, "b"),new Record(1, "a", "b"), new Record(0, "b", "c"));
    JavaRDD&amp;lt;Record&amp;gt; distData = sc.parallelize(data);
    SQLContext sql = new SQLContext(sc);
    Dataset&amp;lt;Row&amp;gt; df = sql.createDataFrame(distData, Record.class);

    df.withColumn("id",
        ((Column) when(df.col("deviceFlag").$eq$eq$eq(1),
            concat(df.col("device"), lit("#"),
                df.col("domain")))).otherwise(df.col("device")).alias("id")).show();
  }

  public static class Record implements Serializable {
    int deviceFlag;
    String device;
    String domain;

    public int getDeviceFlag() {
      return deviceFlag;
    }

    public void setDeviceFlag(int deviceFlag) {
      this.deviceFlag = deviceFlag;
    }

    public String getDevice() {
      return device;
    }

    public void setDevice(String device) {
      this.device = device;
    }

    public String getDomain() {
      return domain;
    }

    public void setDomain(String domain) {
      this.domain = domain;
    }

    public Record(int deviceFlag, String device, String domain) {
      this.deviceFlag = deviceFlag;
      this.device = device;
      this.domain = domain;

    }
  }
}
&lt;/PRE&gt;&lt;P&gt;And it worked well:&lt;/P&gt;&lt;PRE&gt;+------+----------+------+----+
|device|deviceFlag|domain|  id|
+------+----------+------+----+
|  null|         1|     b|null|
|     a|         1|     b| a#b|
|     b|         0|     c|   b|
+------+----------+------+----+&lt;/PRE&gt;&lt;P&gt;What was your problem exactly?&lt;/P&gt;</description>
      <pubDate>Tue, 07 Feb 2017 22:38:11 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161155#M123534</guid>
      <dc:creator>melek</dc:creator>
      <dc:date>2017-02-07T22:38:11Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Dataframe Query not working</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161156#M123535</link>
      <description>&lt;P&gt;I am getting a NPE as below&lt;/P&gt;&lt;PRE&gt;df is a dataframe that I have from previous joins with other dataframes

DataFrame df = df1.join(....)


// On doing this I get the following NPE
df.withColumn("id",
        ((Column) when(df.col("deviceFlag").$eq$eq$eq(1),
            concat(df.col("device"), lit("#"),
            df.col("domain")))).otherwise(df.col("device")).alias("id")).show();

The line 235 in ApplDriver is 
df.col("domain")))).otherwise(df.col("device")).alias("id")).show();

I have checked some input values of domain and device, and they are coming through fine. The NPE tells me it's expecting some object? But the columns are there, "id" is a new column anyway so it's not defined earlier, and df is of course the resulting DF from the previous steps. Then what can cause the NPE? It's not going forward because of the NPE.

17/02/07 15:39:18 ERROR ApplicationMaster: User class threw exception: java.lang.NullPointerException
java.lang.NullPointerException
	at com.Driver.ApplDriver.lambda$main$1acd672$1(ApplDriver.java:235)
	at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$3.apply(JavaDStreamLike.scala:335)
	at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$3.apply(JavaDStreamLike.scala:335)
	at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
	at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
	at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
	at scala.util.Try$.apply(Try.scala:161)
	at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:227)
	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:227)
	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:227)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:226)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

&lt;/PRE&gt;</description>
      <pubDate>Wed, 08 Feb 2017 05:51:42 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161156#M123535</guid>
      <dc:creator>excitingtimes03</dc:creator>
      <dc:date>2017-02-08T05:51:42Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Dataframe Query not working</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161157#M123536</link>
      <description>&lt;P&gt;Some more testing showed that if I just add the new column with concat, it concatenates the values and shows up fine. But when using when() and otherwise(), it throws an NPE.&lt;/P&gt;</description>
      <pubDate>Wed, 08 Feb 2017 06:55:49 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161157#M123536</guid>
      <dc:creator>excitingtimes03</dc:creator>
      <dc:date>2017-02-08T06:55:49Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Dataframe Query not working</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161158#M123537</link>
      <description>&lt;P&gt;Thanks, this worked. Somehow a when() function was declared in my class that returned null, hence the NPE.&lt;/P&gt;</description>
      <pubDate>Fri, 10 Feb 2017 03:08:18 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161158#M123537</guid>
      <dc:creator>excitingtimes03</dc:creator>
      <dc:date>2017-02-10T03:08:18Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Dataframe Query not working</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161159#M123538</link>
      <description>&lt;P&gt;Thanks for the details, this is helpful&lt;/P&gt;</description>
      <pubDate>Fri, 10 Feb 2017 03:09:14 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-Dataframe-Query-not-working/m-p/161159#M123538</guid>
      <dc:creator>excitingtimes03</dc:creator>
      <dc:date>2017-02-10T03:09:14Z</dc:date>
    </item>
  </channel>
</rss>