09-21-2017 06:30 AM
I have 100 different files that arrive in 100 different folders at the end of each day. Each file is loaded into its own Hive table, partitioned by date and timestamp. File sizes are mostly in the MB range, such as 50 MB or 320 MB, and a few files are only KBs.
Currently we have a J2EE application with a file watcher deployed, which keeps polling the folders and loads each file as soon as it arrives in its respective folder.
My question is: is it a good idea to replace this with Flume? As I understand it, for 100 different files I would need to configure 100 different spooling agents, and those 100 agents would keep running continuously and consume resources such as RAM and CPU unnecessarily.
09-23-2017 01:05 AM - edited 09-23-2017 01:08 AM
There are a couple of options for this. If you are looking for more robust ingestion, you can handle it with Flume, Apache NiFi, StreamSets, or Apache Kafka. Each has its own merits.
Coming to the spooling directory source, there are a couple of things to take care of: each file must have a unique name, and it must be immutable after it lands in the spooling directory. You also have to consider which Flume channel is best for your use case. The file channel gives durability: after a crash the agent can come back and pick up from where it left off (via checkpointing, which is the reason the files need to be immutable). The memory channel, on the other hand, does not give you that, but it provides higher throughput. Flume can also be configured for HA.
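On the worry about 100 agents: a single Flume agent can host many sources, channels, and sinks, so one process can watch all the folders. Below is a rough sketch, not a tested production setup, of a small Python helper that generates such a config with one spooldir source, one file channel, and one HDFS sink per feed. The agent name `a1`, the feed names, and every path are hypothetical placeholders, and writing to an HDFS directory that backs the Hive table is an assumption about how the load would be done.

```python
def flume_config(feeds, agent="a1", state_dir="/var/flume"):
    """Build one Flume agent config with a spooldir source per feed.

    feeds maps a feed name to (spool_dir, hdfs_dir). All names and
    paths are hypothetical; adjust them to your environment.
    """
    names = sorted(feeds)
    # Declare every component of the single agent up front.
    lines = [
        f"{agent}.sources = " + " ".join(f"src_{n}" for n in names),
        f"{agent}.channels = " + " ".join(f"ch_{n}" for n in names),
        f"{agent}.sinks = " + " ".join(f"sink_{n}" for n in names),
    ]
    for n in names:
        spool_dir, hdfs_dir = feeds[n]
        lines += [
            "",
            # Spooling directory source watching this feed's folder.
            f"{agent}.sources.src_{n}.type = spooldir",
            f"{agent}.sources.src_{n}.spoolDir = {spool_dir}",
            f"{agent}.sources.src_{n}.channels = ch_{n}",
            # Durable file channel so events survive an agent restart.
            f"{agent}.channels.ch_{n}.type = file",
            f"{agent}.channels.ch_{n}.checkpointDir = {state_dir}/{n}/checkpoint",
            f"{agent}.channels.ch_{n}.dataDirs = {state_dir}/{n}/data",
            # HDFS sink; hdfs_dir is assumed to back the Hive table.
            f"{agent}.sinks.sink_{n}.type = hdfs",
            f"{agent}.sinks.sink_{n}.channel = ch_{n}",
            f"{agent}.sinks.sink_{n}.hdfs.path = {hdfs_dir}",
            f"{agent}.sinks.sink_{n}.hdfs.fileType = DataStream",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two example feeds; a real list would have all 100 entries.
    print(flume_config({
        "orders": ("/data/in/orders", "/warehouse/orders"),
        "users": ("/data/in/users", "/warehouse/users"),
    }))
```

Giving each feed its own channel keeps a slow or failing feed from blocking the others, at the cost of more checkpoint and data directories on disk. Sharing one channel across several sources is also possible if that overhead matters more.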
Please also take a look at Apache NiFi processors such as PutHiveQL, SelectHiveQL, GetFile, and GetFTP, a few of the built-in processors. The default queue limit in NiFi is 10,000 flowfiles per connection, if I remember correctly. One big advantage is that you can apply back pressure and so on. On the downside, configuring HA is a bit tricky, in my view.
Nifi docs - https://nifi.apache.org/docs.html
Hope this suffices.