Hi Vinuraj M
I know how to load the data for the first time, but for incremental loads, what process should we follow to incrementally load flat files into Hadoop?
For example, if the first file contains 200 records, the next load should automatically start from record 201 of the flat file (a comma-separated text file).
Using NiFi allows you to pick up new files (GetFile) or to tail new lines of an existing file (TailFile). HDFS does not handle many small files well (each file takes up Name Node memory), so if you are tailing a file you should use MergeContent with a size threshold and then PutHDFS.
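As a sketch, that flow might look like the following in NiFi (the processor names are real; the property values and paths shown are illustrative assumptions you would tune to your cluster):

```
TailFile        -- tails the growing local flat file, emitting only the new lines
  File to Tail:       /data/incoming/records.csv    (illustrative path)
MergeContent    -- bundles many small flowfiles into one, to avoid small files on HDFS
  Merge Strategy:     Bin-Packing Algorithm
  Minimum Group Size: 128 MB                        (size threshold; match your HDFS block size)
PutHDFS         -- writes the merged bundle to HDFS
  Directory:          /landing/records              (illustrative path)
```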
On HDFS, Pig and Hive external tables can point to a parent directory holding multiple files with the same data structure, ignoring the names of those files. Thus, any time you add a new file to a parent directory that a Pig or Hive external table points to, you have unioned the new file with the data set, i.e., appended to it. So the key is to give these files unique names. The easy way to do that is to simply append _<timestamp> to the filename each time you put it to the HDFS parent directory.
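The renaming step can be sketched in a few lines of Python (the helper name is hypothetical; in practice you would put the renamed file into the external table's parent directory on HDFS):

```python
import time

def timestamped_name(filename, tstamp=""):
    """Append _<timestamp> before the extension so each put to HDFS is unique."""
    tstamp = tstamp or time.strftime("%Y%m%d%H%M%S")
    stem, dot, ext = filename.rpartition(".")
    if not dot:  # filename has no extension
        return f"{filename}_{tstamp}"
    return f"{stem}_{tstamp}.{ext}"

# Each new load lands under the same parent directory with a unique name,
# so a Hive/Pig external table pointing at that directory sees the union.
print(timestamped_name("sales.csv", "20170101120000"))  # sales_20170101120000.csv
```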
If you want to stream into existing Hive tables, use NiFi to tail the local file (TailFile) and stream directly into the Hive table (PutHiveStreaming). See this excellent article:
Hi @Greg Keys,
Can we get files from SFTP to HDFS incrementally? Every day a new directory is generated with the date, and new files are dropped into that directory on a timely basis. Can we achieve this automatically? That means we have to provide a dynamic folder name from which to get the files. Please help with this.
Use GetSFTP to get the files.
When putting to HDFS, append a timestamp to the file or folder name. Consider putting all files in the same folder, keeping the folder name static and naming new files filename_tstamp, where tstamp is dynamic (a cleaner design). You can see how to do this in this article ... you use NiFi Expression Language to assign the tstamp to an attribute and then use that attribute in the filename. See the UpdateAttribute processor. (Depending on how fast your files are ingested, you may want to add seconds and milliseconds to the tstamp.)
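As a concrete sketch, the UpdateAttribute step might rewrite the filename attribute like this (the format string is an illustrative assumption; adjust the granularity to your ingest rate):

```
UpdateAttribute property:
  filename = ${filename}_${now():format('yyyyMMdd_HHmmss_SSS')}
```

PutHDFS then writes each flowfile under the static parent directory using this unique name, so nothing is overwritten and the external table sees every load.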
Hi @Greg Keys, thanks for the reply.
I do not know anything about NiFi. I am using an HDP 2.5.2 cluster, and NiFi is included in it. I need to get files from a remote SFTP server, which are dropped into one directory on a timely basis, into HDFS incrementally. Can we use Flume for this? Also, the remote server belongs to a third party, so I cannot install any software on it. Can you help me out with this?