
Data Processing Using Pig from local to HDFS

Rising Star

I have created logic to transfer data from the local file system to HDFS and need to send an alert to a particular mail ID. Please suggest how to write scripts to achieve this with validations, or any alternative way to achieve it.

  • I have created two folders in the local file system: Bdata1 and Bdata2.
  • Bdata1 is the FTP folder.
  • Compare the two folders to check whether all the files match in both folders. If not, store the names of the files that do not match in a text file called compare.txt, using this script:
  • diff -r Bdata1 Bdata2 | grep Bdata1 | awk '{print $4}' > compare.txt
  • Create a folder in HDFS called hbdata.
  • Count the number of files in hbdata and store it in a variable, say n1.
  • Count the number of file names in compare.txt and store it in a variable, say n2.
  • Copy the files listed in compare.txt from the local file system to HDFS using the script

for i in $(cat compare.txt); do hadoop fs -copyFromLocal Bdata1/$i hdfs://192.168.1.xxx:8020/hbdata; done

  • Count the number of files in hbdata again and store it in a variable, say n3.
  • If the difference between n3 and n2 is equal to n1, send an alert saying the files have been copied.
  • After the files are copied, move them to Bdata2:
  • for i in $(cat compare.txt); do mv Bdata1/$i Bdata2; done
  • If the condition above does not hold, send an alert saying the files were not copied and list the names of the files that were not copied (a consolidated sketch of these steps follows this list).
  • After all of this is completed, I use Pig to load the data and need to create a Hive ORC table to load it into (see the Pig/Hive sketch after the note below).
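
Putting the steps together, the script I have in mind looks roughly like the sketch below. It is only a sketch: the mail recipient (ALERT_MAIL), the use of mailx for the alert, and the local paths are placeholders I would still need to adapt.

#!/bin/bash
# Rough sketch of the whole flow with basic validations and a mail alert.
# Assumptions: Bdata1 and Bdata2 are in the current directory, mailx is
# available for alerts, and ALERT_MAIL is a placeholder recipient.

ALERT_MAIL="admin@example.com"                    # placeholder recipient
HDFS_DIR="hdfs://192.168.1.xxx:8020/hbdata"

# 1. Files present in Bdata1 but not in Bdata2
diff -r Bdata1 Bdata2 | grep 'Only in Bdata1' | awk '{print $4}' > compare.txt

# 2. Count files already in HDFS (n1) and files to copy (n2)
hadoop fs -mkdir -p "$HDFS_DIR"
n1=$(hadoop fs -ls "$HDFS_DIR" | grep -c '^-')
n2=$(wc -l < compare.txt)

# 3. Copy the new files to HDFS
while read -r f; do
  hadoop fs -copyFromLocal "Bdata1/$f" "$HDFS_DIR"
done < compare.txt

# 4. Recount and validate: n3 - n2 should equal n1
n3=$(hadoop fs -ls "$HDFS_DIR" | grep -c '^-')
if [ $((n3 - n2)) -eq "$n1" ]; then
  echo "Files have been copied to $HDFS_DIR" | mailx -s "HDFS copy OK" "$ALERT_MAIL"
  # 5. Archive the copied files locally
  while read -r f; do
    mv "Bdata1/$f" Bdata2/
  done < compare.txt
else
  { echo "Files not copied:"; cat compare.txt; } | mailx -s "HDFS copy FAILED" "$ALERT_MAIL"
fi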

Note: I tried to find a way to compare the local directory directly with HDFS but couldn't, so I added these extra steps.
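
For the last step, my rough idea is to create the ORC table in Hive and then load it with Pig through HCatalog, something like the sketch below. The database and table names and the two-column schema are only placeholders, not my real layout.

# Hypothetical sketch: create an ORC-backed Hive table and load it from the
# HDFS directory with Pig via HCatalog. Adjust names and schema to the data.

hive -e "
CREATE TABLE IF NOT EXISTS default.hbdata_orc (
  id    INT,
  value STRING
)
STORED AS ORC;
"

pig -useHCatalog -e "
raw = LOAD 'hdfs://192.168.1.xxx:8020/hbdata' USING PigStorage(',')
      AS (id:int, value:chararray);
STORE raw INTO 'default.hbdata_orc'
      USING org.apache.hive.hcatalog.pig.HCatStorer();
"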

1 ACCEPTED SOLUTION

Super Collaborator

IMHO, you should avoid building complex logic with home-grown shell scripts. Those kinds of scripts are fine for quick tests, but once you move towards a PoC you need something less error prone and more efficient (shell scripts launch many Java processes, leading to quite some overhead and latency).

I recommend you have a look at ingestion tools such as Flume (http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source) or NiFi (https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.FetchFile/inde...). These tools already have a lot of features to ingest files into your cluster and to archive them afterwards.
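
For example, a minimal Flume agent with a spooling directory source and an HDFS sink could look roughly like the sketch below; the agent name, the local spool path and the channel settings are placeholders to adapt to your environment.

# Rough sketch: write a minimal Flume agent config and start the agent.
cat > spool-to-hdfs.conf <<'EOF'
a1.sources  = src1
a1.channels = ch1
a1.sinks    = sink1

# Spooling directory source: picks up files dropped into the local folder
a1.sources.src1.type     = spooldir
a1.sources.src1.spoolDir = /home/user/Bdata1
a1.sources.src1.channels = ch1

# Simple in-memory channel (use a file channel for durability)
a1.channels.ch1.type     = memory
a1.channels.ch1.capacity = 10000

# HDFS sink writing into the hbdata directory
a1.sinks.sink1.type          = hdfs
a1.sinks.sink1.hdfs.path     = hdfs://192.168.1.xxx:8020/hbdata
a1.sinks.sink1.hdfs.fileType = DataStream
a1.sinks.sink1.channel       = ch1
EOF

flume-ng agent --conf ./conf --conf-file spool-to-hdfs.conf --name a1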


2 REPLIES


Rising Star

@Sourygna Luangsay

Thanks for your valuable post. I will try to understand NiFi with HDF and let you know. Since I'm new to big data technologies, please help me if I get stuck. Thanks again.