Created 06-01-2016 05:40 AM
Currently reading different papers and articles, I'm wondering if there is a well-known set of good tools/patterns to transfer, process, and land large ingest files & logs on HDFS.
I saw this article, but apart from recommending NiFi, are there other solutions?
Currently we use SFTP, but this is not a parallel transfer and may face performance issues depending on file size and network latency. I had a look at Flume, but unfortunately using Flume to transfer gzipped files does not sound production-ready: you have to use a blob deserializer that loads the whole file into memory.
I'm a little surprised that nothing exists out of the box to chunk a file and send the data in parallel over several TCP connections. Similar code likely exists for video transfer, and I'm wondering if someone somewhere in Apache has incorporated such code to transfer large log files and land them on HDFS.
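To make the pattern I have in mind concrete, here is a minimal sketch: split the file into byte ranges and push each range over its own connection with a thread pool. The upload endpoint and its query parameters are hypothetical placeholders, and a real implementation would need a server-side reassembly step before landing on HDFS.

```python
# Minimal sketch: chunk a local file and upload the byte ranges in parallel.
# UPLOAD_URL and its query parameters are hypothetical placeholders.
import os
from concurrent.futures import ThreadPoolExecutor

import requests

UPLOAD_URL = "http://ingest-gateway:8080/upload"   # hypothetical endpoint
CHUNK_SIZE = 64 * 1024 * 1024                      # 64 MB per chunk

def send_chunk(path, offset, length, chunk_id):
    """Read one byte range and push it over its own HTTP connection."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    resp = requests.put(
        UPLOAD_URL,
        params={"file": os.path.basename(path), "chunk": chunk_id},
        data=data,
    )
    resp.raise_for_status()
    return chunk_id

def parallel_upload(path, workers=8):
    size = os.path.getsize(path)
    offsets = range(0, size, CHUNK_SIZE)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(send_chunk, path, off, min(CHUNK_SIZE, size - off), i)
            for i, off in enumerate(offsets)
        ]
        for fut in futures:
            fut.result()   # surface any transfer error

if __name__ == "__main__":
    parallel_upload("/var/log/app/big.log.gz")
```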
Any hints or opinions welcome!
Created 06-01-2016 08:54 AM
Hi @x name.
I don't believe there is anything pre-built within Flume to do exactly what you need. Flume itself is certainly production-ready and has been in constant use by a very wide range of people for a long time now; it just hasn't evolved much past that point. It also starts to struggle with significant load under the kind of scenarios you're discussing unless it's very carefully managed.
You've already identified the toolset I'd recommend for your requirement, which is NiFi. You've also found an article about it, so I won't go into that any further.
As for other tools or patterns, I've seen people build their own ingest frameworks using a combination of scripts and things like WebHDFS, or indeed a lot of custom code on top of Kafka; a sketch of the WebHDFS approach is below.
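For illustration, a minimal sketch of script-based ingest via the WebHDFS REST API, using the documented two-step create (the NameNode answers with a 307 redirect to a DataNode, which then accepts the file body). The host, port, and user below are placeholders for your cluster.

```python
# Minimal sketch: land a local file on HDFS via the WebHDFS REST API.
# NAMENODE and USER are placeholders; 50070 was the default WebHDFS port.
import requests

NAMENODE = "http://namenode.example.com:50070"   # placeholder
USER = "hdfsuser"                                # placeholder

def put_file_webhdfs(local_path, hdfs_path):
    # Step 1: ask the NameNode where to write; it replies with a 307 redirect.
    url = f"{NAMENODE}/webhdfs/v1{hdfs_path}"
    params = {"op": "CREATE", "user.name": USER, "overwrite": "true"}
    r = requests.put(url, params=params, allow_redirects=False)
    r.raise_for_status()
    datanode_url = r.headers["Location"]

    # Step 2: stream the file body to the DataNode returned in the redirect.
    with open(local_path, "rb") as f:
        r = requests.put(datanode_url, data=f)
    r.raise_for_status()

if __name__ == "__main__":
    put_file_webhdfs("/var/log/app/big.log.gz", "/data/landing/big.log.gz")
```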
However, with the way the technology is stacking up now, NiFi solves all the issues you bring up and is easy to use as well; unless you have a strong reason not to, I'd strongly recommend it.
If you do find something else, please add a comment here; likewise, if you try NiFi and get stuck at all, don't hesitate to fire over another question!
Hope that helps.