<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question How to split the dataframe of multiple files into multiple smaller dataframes in Spark? in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/How-to-split-the-dataframe-of-multiple-files-into-multiple/m-p/63155#M22478</link>
    <description>&lt;P&gt;I have a directory in which a new subdirectory is created every day. I need to work on the files that were created yesterday, so I wrote some logic that picks out the latest directories (in my case, yesterday's). I was able to do that with the code below.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class="kwd"&gt;val&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; simpDate &lt;/SPAN&gt;&lt;SPAN class="pun"&gt;=&lt;/SPAN&gt; &lt;SPAN class="kwd"&gt;new&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; java&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;.&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;text&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;.&lt;/SPAN&gt;&lt;SPAN class="typ"&gt;SimpleDateFormat&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;(&lt;/SPAN&gt;&lt;SPAN class="str"&gt;"yyyy-MM-dd"&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;)&lt;/SPAN&gt;
&lt;SPAN class="kwd"&gt;val&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; currDate &lt;/SPAN&gt;&lt;SPAN class="pun"&gt;=&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; simpDate&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;.&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;format&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;(&lt;/SPAN&gt;&lt;SPAN class="kwd"&gt;new&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; java&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;.&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;util&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;.&lt;/SPAN&gt;&lt;SPAN class="typ"&gt;Date&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;())&lt;/SPAN&gt;
&lt;SPAN class="kwd"&gt;val&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; now &lt;/SPAN&gt;&lt;SPAN class="pun"&gt;=&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Instant&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;.&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;now  &lt;/SPAN&gt;&lt;SPAN class="com"&gt;// Gets the current instant, e.g. 2017-12-13T09:40:29.920Z&lt;/SPAN&gt;
&lt;SPAN class="kwd"&gt;val&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; today &lt;/SPAN&gt;&lt;SPAN class="pun"&gt;=&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; now&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;.&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;toEpochMilli&lt;/SPAN&gt;
&lt;SPAN class="kwd"&gt;val&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; yesterday &lt;/SPAN&gt;&lt;SPAN class="pun"&gt;=&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; now&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;.&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;minus&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;(&lt;/SPAN&gt;&lt;SPAN class="typ"&gt;Duration&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;.&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;ofDays&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;(&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;1&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;))&lt;/SPAN&gt;
&lt;SPAN class="kwd"&gt;val&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; yesterdayMilliSec &lt;/SPAN&gt;&lt;SPAN class="pun"&gt;=&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; yesterday&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;.&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;toEpochMilli&lt;/SPAN&gt;
&lt;SPAN class="kwd"&gt;val&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; todaySimpDate &lt;/SPAN&gt;&lt;SPAN class="pun"&gt;=&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; t&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;(&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;today&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;)&lt;/SPAN&gt;
&lt;SPAN class="kwd"&gt;val&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; yesterdaySimpDate &lt;/SPAN&gt;&lt;SPAN class="pun"&gt;=&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; t&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;(&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;yesterdayMilliSec&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;)&lt;/SPAN&gt;
&lt;SPAN class="kwd"&gt;val&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; local&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="typ"&gt;String&lt;/SPAN&gt; &lt;SPAN class="pun"&gt;=&lt;/SPAN&gt; &lt;SPAN class="str"&gt;"file://"&lt;/SPAN&gt;
&lt;SPAN class="kwd"&gt;val&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; folders &lt;/SPAN&gt;&lt;SPAN class="pun"&gt;=&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; getFileTree&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;(&lt;/SPAN&gt;&lt;SPAN class="kwd"&gt;new&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;File&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;(&lt;/SPAN&gt;&lt;SPAN class="str"&gt;"dailylogs"&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;)).&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;filterNot&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;(&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;_&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;.&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;getName&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;.&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;endsWith&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;(&lt;/SPAN&gt;&lt;SPAN class="str"&gt;".log"&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;))&lt;/SPAN&gt;  &lt;SPAN class="com"&gt;// Gets the date of dir&lt;/SPAN&gt;
&lt;SPAN class="kwd"&gt;val&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; folderCrtDateDesc &lt;/SPAN&gt;&lt;SPAN class="pun"&gt;=&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; folders&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;.&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;toList&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;.&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;map&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;(&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;y &lt;/SPAN&gt;&lt;SPAN class="pun"&gt;=&amp;gt;&lt;/SPAN&gt; &lt;SPAN class="pun"&gt;(&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;y&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;,&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;y&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;.&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;lastModified&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;)).&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;sortBy&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;(-&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;_&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;.&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;_2&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;)&lt;/SPAN&gt;
&lt;SPAN class="kwd"&gt;val&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; latestFolder &lt;/SPAN&gt;&lt;SPAN class="pun"&gt;=&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; folderCrtDateDesc&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;.&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;map&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;(&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;y&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;=&amp;gt;(&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;y&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;.&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;_1&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;,&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;t&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;(&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;y&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;.&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;_2&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;)))&lt;/SPAN&gt;
&lt;SPAN class="kwd"&gt;val&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; folderToday &lt;/SPAN&gt;&lt;SPAN class="pun"&gt;=&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; latestFolder&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;.&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;filter&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;(&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;y &lt;/SPAN&gt;&lt;SPAN class="pun"&gt;=&amp;gt;&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; y&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;.&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;_2&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;==&lt;/SPAN&gt;&lt;SPAN class="pln"&gt;todaySimpDate&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;)&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Now folderToday holds the latest directory, which looks like "dailylogs/auditlogsdec27". With that path I can load the whole directory into Spark, which reads all of its files into a single DataFrame. Each file starts with a "JobID" record and ends with the record:&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;"[Wed Dec 27 05:38:49 UTC 2017] INFO: Updating the job keeper..."&lt;/PRE&gt;&lt;P&gt;Each file in that directory has one of three statuses: error, success, or failure.&lt;/P&gt;&lt;P&gt;For 'error', the status can be identified from the third line of the file. For 'success' and 'failure', it is on the sixth line.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class="pln"&gt;file1&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; status&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; error
&lt;/SPAN&gt;&lt;SPAN class="typ"&gt;JobID&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;454&lt;/SPAN&gt;
&lt;SPAN class="pun"&gt;[&lt;/SPAN&gt;&lt;SPAN class="typ"&gt;Wed&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Dec&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;27&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;05&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;38&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;47&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; UTC &lt;/SPAN&gt;&lt;SPAN class="lit"&gt;2017&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;]&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; INFO&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Starting&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Auditing&lt;/SPAN&gt; &lt;SPAN class="kwd"&gt;for&lt;/SPAN&gt; &lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; baseTable1&lt;/SPAN&gt;
&lt;SPAN class="pun"&gt;[&lt;/SPAN&gt;&lt;SPAN class="typ"&gt;Wed&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Dec&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;27&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;05&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;38&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;49&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; UTC &lt;/SPAN&gt;&lt;SPAN class="lit"&gt;2017&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;]&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; SEVERE&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Error&lt;/SPAN&gt; &lt;SPAN class="kwd"&gt;while&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; compiling statement&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; FAILED&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;SemanticException&lt;/SPAN&gt; &lt;SPAN class="pun"&gt;[&lt;/SPAN&gt;&lt;SPAN class="typ"&gt;Error&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;10004&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;]:&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Line&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;1&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;261&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Invalid&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; table alias or column &lt;/SPAN&gt;
&lt;SPAN class="pun"&gt;[&lt;/SPAN&gt;&lt;SPAN class="typ"&gt;Wed&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Dec&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;27&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;05&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;38&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;49&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; UTC &lt;/SPAN&gt;&lt;SPAN class="lit"&gt;2017&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;]&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; INFO&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;
&lt;SPAN class="typ"&gt;Completed&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Auditing&lt;/SPAN&gt; &lt;SPAN class="kwd"&gt;for&lt;/SPAN&gt; &lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; baseTable1&lt;/SPAN&gt;
&lt;SPAN class="pun"&gt;[&lt;/SPAN&gt;&lt;SPAN class="typ"&gt;Wed&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Dec&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;27&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;05&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;38&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;49&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; UTC &lt;/SPAN&gt;&lt;SPAN class="lit"&gt;2017&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;]&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; INFO&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Updating&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; the job keeper&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;...&lt;/SPAN&gt;
&lt;SPAN class="pln"&gt;file2&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; status&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; success
 &lt;/SPAN&gt;&lt;SPAN class="typ"&gt;JobID&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;455&lt;/SPAN&gt;
&lt;SPAN class="pun"&gt;[&lt;/SPAN&gt;&lt;SPAN class="typ"&gt;Wed&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Dec&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;27&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;05&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;38&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;18&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; UTC &lt;/SPAN&gt;&lt;SPAN class="lit"&gt;2017&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;]&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; INFO&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Starting&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Auditing&lt;/SPAN&gt; &lt;SPAN class="kwd"&gt;for&lt;/SPAN&gt; &lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; baseTable2&lt;/SPAN&gt;
&lt;SPAN class="pun"&gt;[&lt;/SPAN&gt;&lt;SPAN class="typ"&gt;Wed&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Dec&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;27&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;05&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;38&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;19&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; UTC &lt;/SPAN&gt;&lt;SPAN class="lit"&gt;2017&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;]&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; INFO&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Connections&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; established to gp and finance &lt;/SPAN&gt;&lt;SPAN class="pun"&gt;...&lt;/SPAN&gt;
&lt;SPAN class="pun"&gt;[&lt;/SPAN&gt;&lt;SPAN class="typ"&gt;Wed&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Dec&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;27&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;05&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;38&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;20&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; UTC &lt;/SPAN&gt;&lt;SPAN class="lit"&gt;2017&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;]&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; INFO&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Starting&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; the auditing &lt;/SPAN&gt;&lt;SPAN class="kwd"&gt;for&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; the intial fetched set of records&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;...&lt;/SPAN&gt;
&lt;SPAN class="pun"&gt;[&lt;/SPAN&gt;&lt;SPAN class="typ"&gt;Wed&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Dec&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;27&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;05&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;38&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;20&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; UTC &lt;/SPAN&gt;&lt;SPAN class="lit"&gt;2017&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;]&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; INFO&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Number&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; of pk columns in the src table&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;16&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;.&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Number&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; of PK &lt;/SPAN&gt;&lt;SPAN class="typ"&gt;Columns&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; in the dest table&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;16&lt;/SPAN&gt;
&lt;SPAN class="pun"&gt;[&lt;/SPAN&gt;&lt;SPAN class="typ"&gt;Wed&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Dec&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;27&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;05&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;38&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;20&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; UTC &lt;/SPAN&gt;&lt;SPAN class="lit"&gt;2017&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;]&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; INFO&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Success&lt;/SPAN&gt;
&lt;SPAN class="typ"&gt;Completed&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Auditing&lt;/SPAN&gt; &lt;SPAN class="kwd"&gt;for&lt;/SPAN&gt; &lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; baseTable2&lt;/SPAN&gt;
&lt;SPAN class="pun"&gt;[&lt;/SPAN&gt;&lt;SPAN class="typ"&gt;Wed&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Dec&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;27&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;05&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;38&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;49&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; UTC &lt;/SPAN&gt;&lt;SPAN class="lit"&gt;2017&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;]&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; INFO&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Updating&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; the job keeper&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;...&lt;/SPAN&gt;
&lt;SPAN class="pln"&gt;file3&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; status&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; failure
 &lt;/SPAN&gt;&lt;SPAN class="typ"&gt;JobID&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;547&lt;/SPAN&gt;
&lt;SPAN class="pun"&gt;[&lt;/SPAN&gt;&lt;SPAN class="typ"&gt;Wed&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Dec&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;27&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;05&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;38&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;18&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; UTC &lt;/SPAN&gt;&lt;SPAN class="lit"&gt;2017&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;]&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; INFO&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Starting&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Auditing&lt;/SPAN&gt; &lt;SPAN class="kwd"&gt;for&lt;/SPAN&gt; &lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; baseTable3&lt;/SPAN&gt;
&lt;SPAN class="pun"&gt;[&lt;/SPAN&gt;&lt;SPAN class="typ"&gt;Wed&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Dec&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;27&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;05&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;38&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;19&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; UTC &lt;/SPAN&gt;&lt;SPAN class="lit"&gt;2017&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;]&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; INFO&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Connections&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; established to gp and finance &lt;/SPAN&gt;&lt;SPAN class="pun"&gt;...&lt;/SPAN&gt;
&lt;SPAN class="pun"&gt;[&lt;/SPAN&gt;&lt;SPAN class="typ"&gt;Wed&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Dec&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;27&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;05&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;38&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;20&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; UTC &lt;/SPAN&gt;&lt;SPAN class="lit"&gt;2017&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;]&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; INFO&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Starting&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; the auditing &lt;/SPAN&gt;&lt;SPAN class="kwd"&gt;for&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; the intial fetched set of records&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;...&lt;/SPAN&gt;
&lt;SPAN class="pun"&gt;[&lt;/SPAN&gt;&lt;SPAN class="typ"&gt;Wed&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Dec&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;27&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;05&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;38&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;20&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; UTC &lt;/SPAN&gt;&lt;SPAN class="lit"&gt;2017&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;]&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; INFO&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Number&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; of pk columns in the src table&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;16&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;.&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Number&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; of PK &lt;/SPAN&gt;&lt;SPAN class="typ"&gt;Columns&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; in the dest table&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;5&lt;/SPAN&gt;
&lt;SPAN class="pun"&gt;[&lt;/SPAN&gt;&lt;SPAN class="typ"&gt;Wed&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Dec&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;27&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;05&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;38&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;20&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; UTC &lt;/SPAN&gt;&lt;SPAN class="lit"&gt;2017&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;]&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; INFO&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Failed&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;.&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Invalid&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; data found&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;.&lt;/SPAN&gt;
&lt;SPAN class="typ"&gt;Completed&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Auditing&lt;/SPAN&gt; &lt;SPAN class="kwd"&gt;for&lt;/SPAN&gt; &lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; baseTable3&lt;/SPAN&gt;
&lt;SPAN class="pun"&gt;[&lt;/SPAN&gt;&lt;SPAN class="typ"&gt;Wed&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Dec&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;27&lt;/SPAN&gt; &lt;SPAN class="lit"&gt;05&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;38&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt;&lt;SPAN class="lit"&gt;49&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; UTC &lt;/SPAN&gt;&lt;SPAN class="lit"&gt;2017&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;]&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; INFO&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;:&lt;/SPAN&gt; &lt;SPAN class="typ"&gt;Updating&lt;/SPAN&gt;&lt;SPAN class="pln"&gt; the job keeper&lt;/SPAN&gt;&lt;SPAN class="pun"&gt;...&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I know how to load a single file into Spark and work on that DataFrame. Because the directory receives a huge number of files every day, I want to load the whole directory into a single DataFrame and work on the data inside it, rather than opening and reading every small file. I want to split that DataFrame using the last record of each file as the delimiter (in this case, the "Updating the job keeper..." record that ends each file) and create three separate DataFrames: one each for error, success, and failure. Can anyone tell me how I can implement that?&lt;/P&gt;</description>
    <pubDate>Fri, 16 Sep 2022 12:41:17 GMT</pubDate>
    <dc:creator>Sidhartha</dc:creator>
    <dc:date>2022-09-16T12:41:17Z</dc:date>
  </channel>
</rss>

