<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Comparing sql data with text file using Spark. in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Comparing-sql-data-with-text-file-using-Spark/m-p/166876#M53831</link>
    <description>&lt;P&gt;If you want to use PySpark, the following works with Spark 1.6. This is from work I did parsing a text file to extract orders data.&lt;/P&gt;&lt;P&gt;1. Read the text file and convert it to a DataFrame so that your data is organized into named columns:&lt;/P&gt;&lt;PRE&gt;## read the text file and parse out the fields needed
from pyspark.sql import Row

path = "hdfs://my_server:8020/my_path/*"
lines = sc.textFile(path)
parts = lines.map(lambda l: l.split("|"))
orders = parts.map(lambda o: Row(platform=o[101], date=int(o[1]), hour=int(o[2]), order_id=o[29], parent_order_uuid=o[90]))
schemaOrders = sqlContext.createDataFrame(orders)

## register as a table
schemaOrders.registerTempTable("schemaOrders")&lt;/PRE&gt;&lt;P&gt;2. Now read your data from the SQL database and register it as a table in Spark. Spark can connect to SQL databases. Here is an article showing how to connect Spark to SQL Server: &lt;A href="https://community.hortonworks.com/content/kbentry/59205/spark-pyspark-to-extract-from-sql-server.html" target="_blank"&gt;https://community.hortonworks.com/content/kbentry/59205/spark-pyspark-to-extract-from-sql-server.html&lt;/A&gt;
&lt;/P&gt;&lt;P&gt;3. Join the two datasets: the data from the file with the data from the SQL database.&lt;/P&gt;</description>
    <pubDate>Thu, 09 Feb 2017 02:44:57 GMT</pubDate>
    <dc:creator>bmathew</dc:creator>
    <dc:date>2017-02-09T02:44:57Z</dc:date>
    <item>
      <title>Comparing sql data with text file using Spark.</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Comparing-sql-data-with-text-file-using-Spark/m-p/166874#M53829</link>
      <description>&lt;P&gt;Using Spark, can I compare two different datasets (one from a SQL DB and another from a text file)?&lt;/P&gt;&lt;P&gt;I have two sets of data. One is a text file and the other is a SQL table.&lt;/P&gt;&lt;P&gt;I would like to do a lookup into the data presented in the SQL table and the text file, and if they match, I want to delete some fields from the text file.&lt;/P&gt;&lt;BLOCKQUOTE&gt;
&lt;PRE&gt;Text File :
ckt_id|location|usage|port|machine
AXZCSD21DF|USA|2GB|101|MAC1
ABZCSD21DF|OTH|4GB|101|MAC2
AXZCSD21DF|USA|6GB|101|MAC4
BXZCSD21DF|USA|7GB|101|MAC6

SQL table:
+-----------+-------+
|    CCKT_NO|SEV_LVL|
+-----------+-------+
| AXZCSD21DF|      1|
| BXZCSD21DF|      1|
| ABZCSD21DF|      3|
| CXZCSD21DF|      2|
| AXZCSD22DF|      2|
| XZDCSD21DF|      3|
|ADZZCSD21DF|      1|
+-----------+-------+

&lt;/PRE&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Can someone please guide me on this?&lt;/P&gt;</description>
      <pubDate>Thu, 09 Feb 2017 00:22:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Comparing-sql-data-with-text-file-using-Spark/m-p/166874#M53829</guid>
      <dc:creator>das_dineshk</dc:creator>
      <dc:date>2017-02-09T00:22:41Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing sql data with text file using Spark.</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Comparing-sql-data-with-text-file-using-Spark/m-p/166875#M53830</link>
      <description>&lt;P&gt;You can use a DataFrame. Convert the text file to a DataFrame with code like the below, then do a join to start comparing.&lt;/P&gt;&lt;PRE&gt;sc.setLogLevel("WARN")
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.functions.broadcast
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._


val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

case class d1(
  ckt_id:String,
  location:String,
  usage:String,
  port:String,
  machine:String
)

val f2 = sc.textFile("textfile location")

// drop the header row before parsing
val f1_df = f2.filter(!_.startsWith("ckt_id"))
              .map(_.split("\\|"))
              .map(x =&amp;gt; d1(
                x(0),
                x(1),
                x(2),
                x(3),
                x(4)
              )).toDF

// this will give you this table

+----------+--------+-----+----+-------+
|    ckt_id|location|usage|port|machine|
+----------+--------+-----+----+-------+
|AXZCSD21DF|     USA|  2GB| 101|   MAC1|
|ABZCSD21DF|     OTH|  4GB| 101|   MAC2|
|AXZCSD21DF|     USA|  6GB| 101|   MAC4|
|BXZCSD21DF|     USA|  7GB| 101|   MAC6|
+----------+--------+-----+----+-------+

// then join with the SQL-side data to compare, e.g. (assuming the SQL table
// has already been loaded as a DataFrame named gsam):
// f1_df.join(gsam, f1_df("ckt_id") === gsam("CCKT_NO")).show()&lt;/PRE&gt;</description>
      <pubDate>Thu, 09 Feb 2017 01:15:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Comparing-sql-data-with-text-file-using-Spark/m-p/166875#M53830</guid>
      <dc:creator>adnanalvee</dc:creator>
      <dc:date>2017-02-09T01:15:51Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing sql data with text file using Spark.</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Comparing-sql-data-with-text-file-using-Spark/m-p/166876#M53831</link>
      <description>&lt;P&gt;If you want to use PySpark, the following works with Spark 1.6. This is from work I did parsing a text file to extract orders data.&lt;/P&gt;&lt;P&gt;1. Read the text file and convert it to a DataFrame so that your data is organized into named columns:&lt;/P&gt;&lt;PRE&gt;## read the text file and parse out the fields needed
from pyspark.sql import Row

path = "hdfs://my_server:8020/my_path/*"
lines = sc.textFile(path)
parts = lines.map(lambda l: l.split("|"))
orders = parts.map(lambda o: Row(platform=o[101], date=int(o[1]), hour=int(o[2]), order_id=o[29], parent_order_uuid=o[90]))
schemaOrders = sqlContext.createDataFrame(orders)

## register as a table
schemaOrders.registerTempTable("schemaOrders")&lt;/PRE&gt;&lt;P&gt;2. Now read your data from the SQL database and register it as a table in Spark. Spark can connect to SQL databases. Here is an article showing how to connect Spark to SQL Server: &lt;A href="https://community.hortonworks.com/content/kbentry/59205/spark-pyspark-to-extract-from-sql-server.html" target="_blank"&gt;https://community.hortonworks.com/content/kbentry/59205/spark-pyspark-to-extract-from-sql-server.html&lt;/A&gt;
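&lt;/P&gt;&lt;P&gt;A rough sketch of that step (the server host, port, database name, table name, and credentials below are placeholders, not values from the article):&lt;/P&gt;&lt;PRE&gt;## load the SQL table over JDBC into a DataFrame (Spark 1.6)
gsam = sqlContext.read.format("jdbc").options(
    url="jdbc:sqlserver://my_sql_server:1433;databaseName=my_db",
    driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
    dbtable="gsam",
    user="my_user",
    password="my_password").load()

## register as a table so it can be joined with the text file data
gsam.registerTempTable("gsam")&lt;/PRE&gt;&lt;P&gt;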
&lt;/P&gt;&lt;P&gt;3. Join the two datasets: the data from the file with the data from the SQL database.&lt;/P&gt;</description>
      <pubDate>Thu, 09 Feb 2017 02:44:57 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Comparing-sql-data-with-text-file-using-Spark/m-p/166876#M53831</guid>
      <dc:creator>bmathew</dc:creator>
      <dc:date>2017-02-09T02:44:57Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing sql data with text file using Spark.</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Comparing-sql-data-with-text-file-using-Spark/m-p/166877#M53832</link>
      <description>&lt;P&gt;Reference this article on how to join a text file to a SQL database table. The full working code is provided: &lt;/P&gt;&lt;P&gt;&lt;A href="https://community.hortonworks.com/articles/82346/spark-pyspark-for-etl-to-join-text-files-with-data.html" target="_blank"&gt;https://community.hortonworks.com/articles/82346/spark-pyspark-for-etl-to-join-text-files-with-data.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 09 Feb 2017 11:49:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Comparing-sql-data-with-text-file-using-Spark/m-p/166877#M53832</guid>
      <dc:creator>bmathew</dc:creator>
      <dc:date>2017-02-09T11:49:43Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing sql data with text file using Spark.</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Comparing-sql-data-with-text-file-using-Spark/m-p/166878#M53833</link>
      <description>&lt;P&gt;Thank you so much. You're a genius &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 09 Feb 2017 16:30:42 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Comparing-sql-data-with-text-file-using-Spark/m-p/166878#M53833</guid>
      <dc:creator>das_dineshk</dc:creator>
      <dc:date>2017-02-09T16:30:42Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing sql data with text file using Spark.</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Comparing-sql-data-with-text-file-using-Spark/m-p/166879#M53834</link>
      <description>&lt;P&gt;Thank you so much, sir. You are awesome.&lt;/P&gt;</description>
      <pubDate>Thu, 09 Feb 2017 16:53:52 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Comparing-sql-data-with-text-file-using-Spark/m-p/166879#M53834</guid>
      <dc:creator>das_dineshk</dc:creator>
      <dc:date>2017-02-09T16:53:52Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing sql data with text file using Spark.</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Comparing-sql-data-with-text-file-using-Spark/m-p/166880#M53835</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/3076/bmathew.html" nodeid="3076"&gt;@Binu Mathew&lt;/A&gt; &lt;/P&gt;&lt;P&gt;While doing sql operation am getting one of the table is not found.&lt;/P&gt;&lt;PRE&gt;scala&amp;gt; gsam.show()
+-----------+-------+
|    CCKT_NO|SEV_LVL|
+-----------+-------+
| AXZCSD21DF|      1|
| BXZCSD21DF|      1|
| ABZCSD21DF|      3|
| CXZCSD21DF|      2|
| AXZCSD22DF|      2|
| XZDCSD21DF|      3|
|ADZZCSD21DF|      1|
+-----------+-------+

scala&amp;gt; input_file.show()
+-----------+--------+-----+----+-------+
|     ckt_id|location|usage|port|machine|
+-----------+--------+-----+----+-------+
|     ckt_id|location|usage|port|machine|
| AXZCSD21DF|     USA|  2GB| 101|   MAC1|
| ABZCSD21DF|     OTH|  4GB| 101|   MAC2|
| AXZCSD21DF|     USA|  6GB| 101|   MAC4|
| BXZCSD21DF|     USA|  7GB| 101|   MAC6|
| CXZCSD21DF|     IND|  2GB| 101|   MAC9|
| AXZCSD21DF|     USA|  1GB| 101|   MAC0|
| AXZCSD22DF|     IND|  9GB| 101|   MAC3|
|ADZZCSD21DF|     USA|  1GB| 101|   MAC4|
| AXZCSD21DF|     USA|  2GB| 101|   MAC5|
| XZDCSD21DF|     OTH|  2GB| 101|   MAC1|
+-----------+--------+-----+----+-------+

scala&amp;gt; input_file.registerTempTable("input_file_temp")
scala&amp;gt; gsam.registerTempTable("gsam_temp")
scala&amp;gt; val tmp = sqlContext.sql("select a.ckt_id,a.location,a.usage,a.port,a.machine,b.CCKT_NO,b.SEV_LVL FROM input_file_temp  a, gsam_temp b where a.ckt_id=b.CCKT_NO AND b.sev_lvl='3'")
org.apache.spark.sql.AnalysisException: Table not found: input_file_temp;
        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:305)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$9.applyOrElse(Analyzer.scala:314)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$9.applyOrElse(Analyzer.scala:309)

&lt;/PRE&gt;</description>
      <pubDate>Thu, 09 Feb 2017 19:46:45 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Comparing-sql-data-with-text-file-using-Spark/m-p/166880#M53835</guid>
      <dc:creator>das_dineshk</dc:creator>
      <dc:date>2017-02-09T19:46:45Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing sql data with text file using Spark.</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Comparing-sql-data-with-text-file-using-Spark/m-p/166881#M53836</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/14978/dasdineshk.html" nodeid="14978"&gt;@Dinesh Das&lt;/A&gt; - The code in that article is done using PySpark and using Spark 2.1. It's working code. &lt;/P&gt;&lt;P&gt;What version of Spark are you using? I see that your using Scala. If you are using Spark version 2 or above, did you create a SparkSession? If an earlier version of Spark, did you create a SQLContext? &lt;/P&gt;</description>
      <pubDate>Fri, 10 Feb 2017 10:41:52 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Comparing-sql-data-with-text-file-using-Spark/m-p/166881#M53836</guid>
      <dc:creator>bmathew</dc:creator>
      <dc:date>2017-02-10T10:41:52Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing sql data with text file using Spark.</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Comparing-sql-data-with-text-file-using-Spark/m-p/166882#M53837</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/3076/bmathew.html" nodeid="3076"&gt;@Binu Mathew&lt;/A&gt; &lt;/P&gt;&lt;P&gt;Thanks for the python code. Am tryin to do it in both scala n python for knowledge purpose.&lt;/P&gt;&lt;P&gt;Am using Spark 1.6.2 . Yes I have created SQLContext.&lt;/P&gt;</description>
      <pubDate>Fri, 10 Feb 2017 20:09:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Comparing-sql-data-with-text-file-using-Spark/m-p/166882#M53837</guid>
      <dc:creator>das_dineshk</dc:creator>
      <dc:date>2017-02-10T20:09:55Z</dc:date>
    </item>
  </channel>
</rss>