Data Processing Using Pig from local to HDFS
Labels: Apache Hadoop, Apache Pig
Created 05-25-2016 10:07 AM
I have created the logic to transfer data from the local file system to HDFS and need to send an alert to a particular mail id. Please suggest how I can write scripts to achieve this with validations, or any alternative way to achieve it (a consolidated sketch of the whole flow is shown at the end of this post).
- I have created two folders in the local file system, viz. Bdata1 and Bdata2.
- Bdata1 is the FTP folder.
- Compare the two folders to check whether all the files match in both folders. If not, the names of the files that do not match are stored separately in a text file called compare.txt, using this script:
- diff -r Bdata1 Bdata2 | grep Bdata1 | awk '{print $4}' > compare.txt
- Create a folder in HDFS called hbdata.
- Count the number of files in hbdata and store the count in a variable, say n1.
- Count the number of files listed in compare.txt and store the count in a variable, say n2.
- Copy the files listed in compare.txt from the local file system to HDFS using the script:
for i in `cat compare.txt`; do hadoop dfs -copyFromLocal Bdata1/$i hdfs://192.168.1.xxx:8020/hbdata; done
- Count the number of files in hbdata again and store the count in a variable, say n3.
- If the difference between n3 and n2 is equal to n1, then send an alert saying the files have been copied.
- After the files are copied, they are moved to Bdata2:
- for i in `cat compare.txt`; do mv Bdata1/$i Bdata2; done
- If the above condition is not satisfied, then send an alert saying Files Not Copied and display the names of the files that were not copied.
- After all of this is completed, I use Pig to load the data and need to create a Hive ORC table to store it (see the sketch right after this list).
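As a rough illustration of this last step, the sketch below creates an ORC table through Hive and loads the files from /hbdata with Pig via HCatalog; the table name hbdata_orc and the single-column schema are assumptions to adapt to the real data layout:

```bash
# Create the ORC target table in Hive (assumed name and schema).
hive -e "CREATE TABLE IF NOT EXISTS hbdata_orc (line STRING) STORED AS ORC;"

# Load the raw files from /hbdata with Pig and store them into the ORC table
# through HCatalog.
cat > load_hbdata.pig <<'EOF'
raw = LOAD '/hbdata' USING TextLoader() AS (line:chararray);
STORE raw INTO 'hbdata_orc' USING org.apache.hive.hcatalog.pig.HCatStorer();
EOF
pig -useHCatalog -f load_hbdata.pig
```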
Note: I tried to find a way to compare the local directory directly with HDFS but couldn't, so I added more steps.
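Putting the steps above together, here is a minimal shell sketch, assuming Bdata1 and Bdata2 sit in the current directory, the HDFS target is the hbdata folder created above, and a working mail/mailx command; the recipient address is a placeholder:

```bash
#!/bin/bash
# Sketch of the workflow described in the steps above.
MAILTO="admin@example.com"                       # placeholder recipient
HDFS_DIR="hdfs://192.168.1.xxx:8020/hbdata"      # HDFS target from the post

# File names present in Bdata1 but not in Bdata2
diff -r Bdata1 Bdata2 | grep 'Only in Bdata1' | awk '{print $4}' > compare.txt

# Baseline counts
n1=$(hdfs dfs -ls "$HDFS_DIR" | grep -c '^-')    # files already in hbdata
n2=$(wc -l < compare.txt)                        # files listed in compare.txt

# Copy the listed files to HDFS
while read -r f; do
    hdfs dfs -copyFromLocal "Bdata1/$f" "$HDFS_DIR/"
done < compare.txt

# Recount, validate, alert, and archive
n3=$(hdfs dfs -ls "$HDFS_DIR" | grep -c '^-')
if [ $((n3 - n2)) -eq "$n1" ]; then
    echo "All files copied to $HDFS_DIR" | mail -s "HDFS copy OK" "$MAILTO"
    while read -r f; do
        mv "Bdata1/$f" Bdata2/                   # archive copied files locally
    done < compare.txt
else
    { echo "Files not copied:"; cat compare.txt; } | mail -s "HDFS copy FAILED" "$MAILTO"
fi
```

Note that `hdfs dfs` replaces the deprecated `hadoop dfs`, and grepping for 'Only in Bdata1' keeps diff's content lines out of compare.txt.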
Created 05-25-2016 10:24 AM
IMHO, you should avoid building complex logic with home-developed shell scripts. Those kinds of scripts are fine for quick tests, but when you want to go into a PoC, you need something less error prone and more efficient (shell scripts launch many Java processes, which adds quite a bit of overhead and latency).
I recommend you have a look at ingestion tools such as Flume (http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source) or NiFi (https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.FetchFile/inde...). These tools already have many features to ingest files into your cluster and to archive the files afterwards.
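For illustration, a minimal Flume agent built around the spooling directory source linked above could look roughly like this; the agent/component names, the local spool path, and the memory channel are assumptions to adapt to your environment:

```bash
# Minimal Flume agent: watch the local drop folder and write incoming files to HDFS.
# Names (a1, r1, c1, k1) and the spoolDir path are placeholders.
cat > spool-to-hdfs.conf <<'EOF'
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Spooling directory source: picks up files dropped into the FTP folder
a1.sources.r1.type     = spooldir
a1.sources.r1.spoolDir = /home/user/Bdata1
a1.sources.r1.channels = c1

# Simple in-memory channel (use a file channel for durability)
a1.channels.c1.type = memory

# HDFS sink writing into the hbdata directory
a1.sinks.k1.type          = hdfs
a1.sinks.k1.channel       = c1
a1.sinks.k1.hdfs.path     = hdfs://192.168.1.xxx:8020/hbdata
a1.sinks.k1.hdfs.fileType = DataStream
EOF

# Start the agent (assumes flume-ng is on the PATH)
flume-ng agent --conf conf --conf-file spool-to-hdfs.conf --name a1
```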
Created 05-26-2016 04:18 AM
@Sourygna Luangsay
Thanks for your valuable post. I will try to understand NiFi with HDF and let you know. Since I'm new to big data technologies, please help me if I get stuck. Thanks again.