<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Data Processing Using Pig from local to HDFS in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Processing-Using-Pig-from-local-to-HDFS/m-p/166884#M29644</link>
    <description>&lt;P&gt;IMHO, you should avoid putting complex logic into a home-developed shell script. Such scripts are fine for quick tests, but once you move to a PoC you need something less error prone and more efficient (shell scripts launch many Java processes, leading to quite some overhead and latency).&lt;/P&gt;&lt;P&gt;I recommend having a look at ingestion tools such as Flume (&lt;A href="http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source"&gt;http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source&lt;/A&gt;) or NiFi (&lt;A href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.FetchFile/index.html"&gt;https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.FetchFile/index.html&lt;/A&gt;). These tools already have many features for ingesting files into your cluster and archiving them afterwards.&lt;/P&gt;</description>
    <pubDate>Wed, 25 May 2016 17:24:09 GMT</pubDate>
    <dc:creator>sluangsay</dc:creator>
    <dc:date>2016-05-25T17:24:09Z</dc:date>
    <item>
      <title>Data Processing Using Pig from local to HDFS</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Processing-Using-Pig-from-local-to-HDFS/m-p/166883#M29643</link>
      <description>&lt;P&gt;I have designed a workflow to transfer data from the local file system to HDFS, and I need to send an alert to a particular mail ID. Please suggest how to write scripts that achieve this goal with validations, or any alternative approach.&lt;/P&gt;&lt;UL&gt;
&lt;LI&gt;I have created two folders in the local file system, viz. &lt;STRONG&gt;Bdata1&lt;/STRONG&gt; and &lt;STRONG&gt;Bdata2&lt;/STRONG&gt;.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Bdata1&lt;/STRONG&gt; is the FTP folder.&lt;/LI&gt;&lt;LI&gt;Compare the two folders to check whether all the files match. If not, the names of the files that do not match are stored in a text file called &lt;STRONG&gt;compare.txt&lt;/STRONG&gt;, using this script:&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;diff -r Bdata1 Bdata2 | grep Bdata1 | awk '{print $4}' &amp;gt; compare.txt&lt;/STRONG&gt;&lt;/LI&gt;&lt;LI&gt;Create a folder in HDFS called hbdata.&lt;/LI&gt;&lt;LI&gt;Count the number of files in hbdata and store it in a variable, say n1.&lt;/LI&gt;&lt;LI&gt;Count the number of file names in compare.txt and store it in a variable, say n2.&lt;/LI&gt;&lt;LI&gt;Copy the files listed in compare.txt from the local file system to HDFS using the script:&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;for i in $(cat compare.txt); do hadoop fs -copyFromLocal Bdata1/$i hdfs://192.168.1.xxx:8020/hbdata; done&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;
&lt;LI&gt;Count the number of files in hbdata again and store it in a variable, say n3.&lt;/LI&gt;&lt;LI&gt;If the difference between n3 and n2 is equal to n1, send an alert saying the files have been copied.&lt;/LI&gt;&lt;LI&gt;After the files are copied, move them to Bdata2:&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;for i in $(cat compare.txt); do mv Bdata1/$i Bdata2; done&lt;/STRONG&gt;&lt;/LI&gt;&lt;LI&gt;If the counts do not match as per the above condition, send an alert saying the files were not copied and list the names of the files that failed.&lt;/LI&gt;&lt;LI&gt;After all of this is complete, I use a Pig load command and need to create a Hive ORC table to load the data into.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Note: I tried to find a direct comparison between a local directory and HDFS but couldn't, so I added these extra steps.&lt;/P&gt;</description>
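The steps in the question can be consolidated into one script. This is a minimal sketch, not a tested implementation: the Bdata1/Bdata2 paths, the /hbdata HDFS directory, and the mail recipient are placeholders, and it assumes `hadoop fs` resolves the NameNode from core-site.xml (the question's explicit `hdfs://192.168.1.xxx:8020` address is left out for that reason).

```shell
#!/bin/sh
# Sketch of the workflow described above; paths and mail address are assumptions.
SRC=Bdata1          # FTP landing folder
DEST=Bdata2         # archive folder
HDFS_DIR=/hbdata    # target HDFS directory
ALERT_TO=user@example.com   # hypothetical recipient

# List files present in Bdata1 but not in Bdata2 ($4 of "Only in Bdata1: file")
diff -r "$SRC" "$DEST" | grep "$SRC" | awk '{print $4}' > compare.txt

# n1 = files already in HDFS, n2 = files to copy
n1=$(hadoop fs -ls "$HDFS_DIR" | grep -c '^-')
n2=$(wc -l < compare.txt)

# Copy each missing file into HDFS
while read -r f; do
  hadoop fs -copyFromLocal "$SRC/$f" "$HDFS_DIR/"
done < compare.txt

# Recount, alert, and archive on success (same n3 - n2 == n1 check as above)
n3=$(hadoop fs -ls "$HDFS_DIR" | grep -c '^-')
if [ $((n3 - n2)) -eq "$n1" ]; then
  echo "Files have been copied" | mail -s "HDFS copy OK" "$ALERT_TO"
  while read -r f; do mv "$SRC/$f" "$DEST/"; done < compare.txt
else
  { echo "Files not copied:"; cat compare.txt; } | mail -s "HDFS copy FAILED" "$ALERT_TO"
fi
```

Reading compare.txt with `while read -r` rather than `for i in $(cat ...)` keeps file names with spaces intact, and the `grep -c '^-'` counts only regular files in the `hadoop fs -ls` listing, skipping the "Found N items" header and any subdirectories.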
      <pubDate>Wed, 25 May 2016 17:07:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Processing-Using-Pig-from-local-to-HDFS/m-p/166883#M29643</guid>
      <dc:creator>iyappan</dc:creator>
      <dc:date>2016-05-25T17:07:43Z</dc:date>
    </item>
    <item>
      <title>Re: Data Processing Using Pig from local to HDFS</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Processing-Using-Pig-from-local-to-HDFS/m-p/166884#M29644</link>
      <description>&lt;P&gt;IMHO, you should avoid putting complex logic into a home-developed shell script. Such scripts are fine for quick tests, but once you move to a PoC you need something less error prone and more efficient (shell scripts launch many Java processes, leading to quite some overhead and latency).&lt;/P&gt;&lt;P&gt;I recommend having a look at ingestion tools such as Flume (&lt;A href="http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source"&gt;http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source&lt;/A&gt;) or NiFi (&lt;A href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.FetchFile/index.html"&gt;https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.FetchFile/index.html&lt;/A&gt;). These tools already have many features for ingesting files into your cluster and archiving them afterwards.&lt;/P&gt;</description>
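As an illustration of the Flume spooling-directory approach the reply links to, a minimal agent configuration might look like the fragment below. The agent/source/channel/sink names and the local and HDFS paths are made-up placeholders, not anything from the thread; the property keys themselves come from the Flume User Guide.

```properties
# Watch a local directory and ship completed files into HDFS
a1.sources = src1
a1.channels = ch1
a1.sinks = sink1

# Spooling-directory source: files dropped into spoolDir are ingested once
a1.sources.src1.type = spooldir
a1.sources.src1.spoolDir = /data/Bdata1
a1.sources.src1.channels = ch1

# In-memory channel buffering events between source and sink
a1.channels.ch1.type = memory
a1.channels.ch1.capacity = 10000

# HDFS sink writing the events out as plain data
a1.sinks.sink1.type = hdfs
a1.sinks.sink1.hdfs.path = /hbdata
a1.sinks.sink1.hdfs.fileType = DataStream
a1.sinks.sink1.channel = ch1
```

Compared with the hand-rolled script, the spooling-directory source handles the "which files are new" bookkeeping itself, renaming (or deleting) each file once it has been fully ingested, which replaces the diff/compare.txt logic in the question.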
      <pubDate>Wed, 25 May 2016 17:24:09 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Processing-Using-Pig-from-local-to-HDFS/m-p/166884#M29644</guid>
      <dc:creator>sluangsay</dc:creator>
      <dc:date>2016-05-25T17:24:09Z</dc:date>
    </item>
    <item>
      <title>Re: Data Processing Using Pig from local to HDFS</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Processing-Using-Pig-from-local-to-HDFS/m-p/166885#M29645</link>
      <description>&lt;P&gt;@Sourygna Luangsay&lt;/P&gt;&lt;P&gt;Thanks for your valuable post. I will try to understand NiFi with HDF and let you know. Since I'm new to big data technologies, please help me if I get stuck. Thanks again.&lt;/P&gt;</description>
      <pubDate>Thu, 26 May 2016 11:18:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Processing-Using-Pig-from-local-to-HDFS/m-p/166885#M29645</guid>
      <dc:creator>iyappan</dc:creator>
      <dc:date>2016-05-26T11:18:59Z</dc:date>
    </item>
  </channel>
</rss>

