
Incremental Flat File data loading into hadoop


Hi @Simon Elliston Ball,

I need to incrementally load flat files from the local file system into Hadoop (HDP). The data is continuously streaming, and the files are comma-separated text files. Which ecosystem components should I use to load these flat files into Hadoop?

8 REPLIES

Re: Incremental Flat File data loading into hadoop

Explorer

One option is to develop a script around the "hadoop fs -put" command. See the link below.

https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/FileSystemShell.html#put
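
A minimal sketch of such a script, assuming new CSV files arrive in a local /data/incoming directory and are pushed to an HDFS /landing/csv directory (both paths are hypothetical), with loaded files moved aside so each run picks up only new ones:

#!/usr/bin/env bash
# Sketch: push any new local CSV files into HDFS, then move them
# to a "loaded" directory so the next run skips them.
SRC=/data/incoming      # assumed local landing directory
DONE=/data/loaded       # assumed archive for already-loaded files
DEST=/landing/csv       # assumed HDFS target directory

mkdir -p "$DONE"
for f in "$SRC"/*.csv; do
  [ -e "$f" ] || continue                        # no new files this run
  hadoop fs -put "$f" "$DEST"/ && mv "$f" "$DONE"/
done

Scheduled from cron every few minutes, this gives a simple incremental load: each file is put to HDFS exactly once and then moved out of the way.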


Re: Incremental Flat File data loading into hadoop

Hi @Vinuraj M,

I know how to load the data the first time. For incremental loads, what process do we need to follow to load flat files into Hadoop incrementally?

For example, if the first file contains 200 records, the next load should automatically start from record 201 of the flat file (comma-separated text file).


Re: Incremental Flat File data loading into hadoop

Hi @Vinuraj M,

Will the last record of the first file be read, so that the next load starts from the following record of the next file and appends it to the data set?


Re: Incremental Flat File data loading into hadoop

Guru

Using NiFi allows you to pick up new files (GetFile) or to tail new lines of an existing file (TailFile). HDFS does not handle many small files well (each file consumes NameNode memory), so if you are tailing a file you should use MergeContent with a size threshold and then PutHDFS.

On HDFS, Pig and Hive external tables can point to a parent directory holding multiple files with the same data structure, and they ignore the names of those files. Thus, any time you add a new file to a parent directory that a Pig or Hive external table points to, you have unioned the new file with the data set, i.e. appended it. So the key is to give these files unique names; the easy way to do that is to simply append _<timestamp> to the filename each time you put it into the HDFS parent directory, as sketched below.
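
A sketch of that append pattern, with a hypothetical /landing/csv directory and an illustrative sales_raw schema (neither is from the thread):

# Put each new file under the external table's directory with a
# unique _<timestamp> suffix so it appends to the data set.
ts=$(date +%Y%m%d%H%M%S)
hadoop fs -put sales.csv /landing/csv/sales_${ts}.csv

# One-time setup: a Hive external table over the parent directory;
# any file added there immediately becomes part of the table.
hive -e "CREATE EXTERNAL TABLE IF NOT EXISTS sales_raw (
           id INT, item STRING, amount DOUBLE)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
         LOCATION '/landing/csv';"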

If you want to stream into existing Hive tables, then use NiFi to tail the local file (TailFile) and stream directly into the Hive table (PutHiveStreaming). See this excellent article:

https://community.hortonworks.com/articles/52856/stream-data-into-hive-like-a-king-using-nifi.html
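
For context, PutHiveStreaming writes through Hive's streaming API, which requires the target to be a bucketed, ORC-backed, transactional table. A sketch of such a table, with a hypothetical schema and assuming ACID is enabled on the cluster:

# Hive streaming targets must be bucketed, stored as ORC, and transactional.
hive -e "CREATE TABLE sales_stream (id INT, item STRING, amount DOUBLE)
         CLUSTERED BY (id) INTO 4 BUCKETS
         STORED AS ORC
         TBLPROPERTIES ('transactional'='true');"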


Re: Incremental Flat File data loading into hadoop

Hi @Greg Keys,

Can we get files from SFTP into HDFS incrementally? Every day a new directory is generated, named with the date, and new files are dropped into that directory throughout the day. Can we achieve this automatically, i.e. supply a dynamic folder name from which to get the files? Please help with this.


Re: Incremental Flat File data loading into hadoop

Guru

Use GetSFTP to get the files.

When putting to HDFS, append a timestamp to the file or folder name. Consider putting all files in the same folder, keeping the folder name static and naming new files filename_tstamp, where tstamp is dynamic (a cleaner design). You can see how to do this in this article ... you use NiFi Expression Language to assign the tstamp to an attribute and then use that attribute in the filename; see the UpdateAttribute processor. (Depending on how fast your files are ingested, you may want to add seconds and milliseconds to the tstamp.)
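
A sketch of that naming convention done by hand, with a hypothetical report.csv and /landing/sftp target; in NiFi itself you would set the filename attribute in UpdateAttribute using Expression Language along the lines of ${now():format('yyyyMMddHHmmssSSS')}:

# Static folder, dynamic file name down to milliseconds so that
# rapidly ingested files never collide.
ts=$(date +%Y%m%d%H%M%S%3N)   # %3N = milliseconds (GNU date)
hadoop fs -put report.csv /landing/sftp/report_${ts}.csv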


Re: Incremental Flat File data loading into hadoop

Hi @Greg Keys, thanks for the reply.

I don't know anything about NiFi. I am using an HDP 2.5.2 cluster, which includes NiFi. I need to incrementally get files into HDFS from a remote SFTP server, where they are dropped into one directory on a timely basis. Can we use Flume for this? The remote server belongs to a third party, so I cannot install any software on it. Can you help me out with this?
