<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Data Processing Using Pig from local to HDFS in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Processing-Using-Pig-from-local-to-HDFS/m-p/166884#M29644</link>
    <description>&lt;P&gt;IMHO, you should avoid putting complex logic into a home-developed shell script. Such scripts are fine for quick tests, but once you move to a PoC you need something less error prone and more efficient (shell scripts launch many Java processes, leading to quite some overhead and latency).&lt;/P&gt;&lt;P&gt;I recommend having a look at ingestion tools such as Flume (&lt;A href="http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source"&gt;http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source&lt;/A&gt;) or NiFi (&lt;A href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.FetchFile/index.html"&gt;https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.FetchFile/index.html&lt;/A&gt;). These tools already have many features for ingesting files into your cluster and archiving them afterwards.&lt;/P&gt;</description>
    <pubDate>Wed, 25 May 2016 17:24:09 GMT</pubDate>
    <dc:creator>sluangsay</dc:creator>
    <dc:date>2016-05-25T17:24:09Z</dc:date>
    <item>
      <title>Data Processing Using Pig from local to HDFS</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Processing-Using-Pig-from-local-to-HDFS/m-p/166883#M29643</link>
      <description>&lt;P&gt;I have designed a workflow to transfer data from the local file system to HDFS, and I need to send an alert to a particular mail ID. Please suggest how to write scripts that achieve this goal with validations, or any alternative approach.&lt;/P&gt;&lt;UL&gt;
&lt;LI&gt;I have created two folders in the local file system, viz. &lt;STRONG&gt;Bdata1&lt;/STRONG&gt; and &lt;STRONG&gt;Bdata2&lt;/STRONG&gt;.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Bdata1&lt;/STRONG&gt; is the FTP folder.&lt;/LI&gt;&lt;LI&gt;Compare the two folders to check whether all the files match. If not, the names of the files that do not match are stored in a text file called &lt;STRONG&gt;compare.txt&lt;/STRONG&gt;, using this script:&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;diff -r Bdata1 Bdata2 | grep Bdata1 | awk '{print $4}' &amp;gt; compare.txt&lt;/STRONG&gt;&lt;/LI&gt;&lt;LI&gt;Create a folder in HDFS called hbdata.&lt;/LI&gt;&lt;LI&gt;Count the number of files in hbdata and store it in a variable, say n1.&lt;/LI&gt;&lt;LI&gt;Count the number of file names in compare.txt and store it in a variable, say n2.&lt;/LI&gt;&lt;LI&gt;Copy the files listed in compare.txt from the local file system to HDFS using the script:&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;for i in $(cat compare.txt); do hadoop fs -copyFromLocal Bdata1/$i hdfs://192.168.1.xxx:8020/hbdata; done&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;
&lt;LI&gt;Count the number of files in hbdata again and store it in a variable, say n3.&lt;/LI&gt;&lt;LI&gt;If the difference between n3 and n2 is equal to n1, send an alert saying the files have been copied.&lt;/LI&gt;&lt;LI&gt;After the files are copied, move them to Bdata2:&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;for i in $(cat compare.txt); do mv Bdata1/$i Bdata2; done&lt;/STRONG&gt;&lt;/LI&gt;&lt;LI&gt;If the counts do not match as per the above condition, send an alert saying the files were not copied and list the names of the files that failed.&lt;/LI&gt;&lt;LI&gt;After all of this is complete, I use a Pig load command and need to create a Hive ORC table to load the data into.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Note: I tried to find a direct comparison between a local directory and HDFS but couldn't, so I added these extra steps.&lt;/P&gt;</description>
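The steps in the question can be consolidated into one script. This is a minimal sketch, not a tested implementation: the Bdata1/Bdata2 paths, the /hbdata HDFS directory, and the mail recipient are placeholders, and it assumes `hadoop fs` resolves the NameNode from core-site.xml (the question's explicit `hdfs://192.168.1.xxx:8020` address is left out for that reason).

```shell
#!/bin/sh
# Sketch of the workflow described above; paths and mail address are assumptions.
SRC=Bdata1          # FTP landing folder
DEST=Bdata2         # archive folder
HDFS_DIR=/hbdata    # target HDFS directory
ALERT_TO=user@example.com   # hypothetical recipient

# List files present in Bdata1 but not in Bdata2 ($4 of "Only in Bdata1: file")
diff -r "$SRC" "$DEST" | grep "$SRC" | awk '{print $4}' > compare.txt

# n1 = files already in HDFS, n2 = files to copy
n1=$(hadoop fs -ls "$HDFS_DIR" | grep -c '^-')
n2=$(wc -l < compare.txt)

# Copy each missing file into HDFS
while read -r f; do
  hadoop fs -copyFromLocal "$SRC/$f" "$HDFS_DIR/"
done < compare.txt

# Recount, alert, and archive on success (same n3 - n2 == n1 check as above)
n3=$(hadoop fs -ls "$HDFS_DIR" | grep -c '^-')
if [ $((n3 - n2)) -eq "$n1" ]; then
  echo "Files have been copied" | mail -s "HDFS copy OK" "$ALERT_TO"
  while read -r f; do mv "$SRC/$f" "$DEST/"; done < compare.txt
else
  { echo "Files not copied:"; cat compare.txt; } | mail -s "HDFS copy FAILED" "$ALERT_TO"
fi
```

Reading compare.txt with `while read -r` rather than `for i in $(cat ...)` keeps file names with spaces intact, and the `grep -c '^-'` counts only regular files in the `hadoop fs -ls` listing, skipping the "Found N items" header and any subdirectories.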
      <pubDate>Wed, 25 May 2016 17:07:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Processing-Using-Pig-from-local-to-HDFS/m-p/166883#M29643</guid>
      <dc:creator>iyappan</dc:creator>
      <dc:date>2016-05-25T17:07:43Z</dc:date>
    </item>
    <item>
      <title>Re: Data Processing Using Pig from local to HDFS</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Processing-Using-Pig-from-local-to-HDFS/m-p/166884#M29644</link>
      <description>&lt;P&gt;IMHO, you should avoid putting complex logic into a home-developed shell script. Such scripts are fine for quick tests, but once you move to a PoC you need something less error prone and more efficient (shell scripts launch many Java processes, leading to quite some overhead and latency).&lt;/P&gt;&lt;P&gt;I recommend having a look at ingestion tools such as Flume (&lt;A href="http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source"&gt;http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source&lt;/A&gt;) or NiFi (&lt;A href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.FetchFile/index.html"&gt;https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.FetchFile/index.html&lt;/A&gt;). These tools already have many features for ingesting files into your cluster and archiving them afterwards.&lt;/P&gt;</description>
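As an illustration of the Flume spooling-directory approach the reply links to, a minimal agent configuration might look like the fragment below. The agent/source/channel/sink names and the local and HDFS paths are made-up placeholders, not anything from the thread; the property keys themselves come from the Flume User Guide.

```properties
# Watch a local directory and ship completed files into HDFS
a1.sources = src1
a1.channels = ch1
a1.sinks = sink1

# Spooling-directory source: files dropped into spoolDir are ingested once
a1.sources.src1.type = spooldir
a1.sources.src1.spoolDir = /data/Bdata1
a1.sources.src1.channels = ch1

# In-memory channel buffering events between source and sink
a1.channels.ch1.type = memory
a1.channels.ch1.capacity = 10000

# HDFS sink writing the events out as plain data
a1.sinks.sink1.type = hdfs
a1.sinks.sink1.hdfs.path = /hbdata
a1.sinks.sink1.hdfs.fileType = DataStream
a1.sinks.sink1.channel = ch1
```

Compared with the hand-rolled script, the spooling-directory source handles the "which files are new" bookkeeping itself, renaming (or deleting) each file once it has been fully ingested, which replaces the diff/compare.txt logic in the question.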
      <pubDate>Wed, 25 May 2016 17:24:09 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Processing-Using-Pig-from-local-to-HDFS/m-p/166884#M29644</guid>
      <dc:creator>sluangsay</dc:creator>
      <dc:date>2016-05-25T17:24:09Z</dc:date>
    </item>
    <item>
      <title>Re: Data Processing Using Pig from local to HDFS</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Processing-Using-Pig-from-local-to-HDFS/m-p/166885#M29645</link>
      <description>&lt;P&gt;@Sourygna Luangsay&lt;/P&gt;&lt;P&gt;Thanks for your valuable post. I will try to understand NiFi with HDF and let you know. Since I'm new to big data technologies, please help me if I get stuck. Thanks again.&lt;/P&gt;</description>
      <pubDate>Thu, 26 May 2016 11:18:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Processing-Using-Pig-from-local-to-HDFS/m-p/166885#M29645</guid>
      <dc:creator>iyappan</dc:creator>
      <dc:date>2016-05-26T11:18:59Z</dc:date>
    </item>
  </channel>
</rss>

