Hi, I have a large file (about 1 million records) on Linux. The file has not yet been copied to HDFS; it is still on the ext3 file system. I would like to perform some validation on it (e.g. a data-type check: the first 9 bytes of each record should be digits). I have written a shell script that performs the validation.
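For context, the check described above can be done with a one-liner; this is just a sketch of that kind of validation, not the asker's actual script, and "records.txt" is a placeholder file name (created here with sample data for illustration):

```shell
#!/bin/sh
# Create a small sample file standing in for the real 1-million-record file.
printf '123456789 valid record\n12345678X invalid record\n' > records.txt

# Flag every record whose first 9 bytes are not all digits,
# and report the total count of invalid records at the end.
awk 'substr($0, 1, 9) !~ /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/ {
         printf "line %d: bad key: %s\n", NR, $0
         bad++
     }
     END { printf "%d invalid record(s)\n", bad + 0 }' records.txt
```

The digit class is repeated nine times rather than written as `[0-9]{9}` so the pattern works in awk implementations (e.g. mawk) that do not support interval expressions.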
My question is: is it possible to run the shell script on Hadoop's infrastructure? That is, can I execute the script without putting the file onto HDFS, but still use Hadoop's multi-node infrastructure?
Yes, you can. SCP the file to any of the data nodes and run your shell script there to validate it. Note, though, that this runs the script on that single node only, not in parallel across the cluster. The file becomes available to Hadoop's distributed processing only after you place it in HDFS with the "hadoop fs -put" command.
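A minimal sketch of the steps above; the host name, user name, paths, and script name are all placeholders for illustration:

```shell
# Copy the file from the Linux box to one of the data nodes
# ("datanode1", "user", and the paths are hypothetical).
scp /data/records.txt user@datanode1:/tmp/records.txt

# Run the validation script on that node -- this uses only
# that one machine, not the whole cluster.
ssh user@datanode1 'sh /tmp/validate.sh /tmp/records.txt'

# Only this step actually places the file into HDFS,
# making it available for distributed processing.
ssh user@datanode1 'hadoop fs -put /tmp/records.txt /user/hadoop/records.txt'
```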