Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

How to process data in ftp/directory using flink?

Solved Go to solution

How to process data in ftp/directory using flink?

Contributor

We have file in certain directory/ftp from 10MB to 100MB at interval of 1 min to hours. What would be the proper architecture to harvest data for real time consumption and batch analysis. I have though of following architecture:

files --> flink --> Hbase (Real time query)

files --> flink --> HDFS

or

files --> HDFS --> flink --> Hbase

What is the most appropriate architecture for this purpose? What tools should I use? I want to use flink because I would like to get metric during transformation.

1 ACCEPTED SOLUTION

Accepted Solutions
Highlighted

Re: How to process data in ftp/directory using flink?

Explorer

You need a bunch of APIs, tooling in Flink to achieve this:

- Apache Flink's DataStream API comes with a data source that is able to continiously monitor a directory for new files, and process these files in a stream as they appear in the directory: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/datastream_api.html#data-sources This will allow you to get a stream of events from the files.

- Whether you are letting Flink read the files from FTP, or from HDFS is up to you. Both is possible (for FTP, use the `FTPFileSystem` provided by Hadoop and supported by Flink.), for HDFS, its pretty obvious how to read from there.

- Flink has a HBase Sink for writing data. It also comes with a bucketing file sink, which integrates with the exactly-once checkpointing mechanism. So if you are going for the (FTP/HDFS) -> Flink -> RollingSink(HDFS) approach, you'll get end-to-end exactly once.

Maybe as a general note: Flink is much more than just a system you can use for transporting data from A to B. It can do really advanced analytics on data streams, using its powerful Windowing, CEP, SQL APIs, and its great support for stateful computations over streams. Maybe you'll be able to express most of your problems as streaming jobs, and not using HBase for the queries?


I can not give more concrete advise, because the question doesn't mention details such as use-case, SLA requirements and pros/cons for the different approaches.

View solution in original post

2 REPLIES 2
Highlighted

Re: How to process data in ftp/directory using flink?

Explorer

You need a bunch of APIs, tooling in Flink to achieve this:

- Apache Flink's DataStream API comes with a data source that is able to continiously monitor a directory for new files, and process these files in a stream as they appear in the directory: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/datastream_api.html#data-sources This will allow you to get a stream of events from the files.

- Whether you are letting Flink read the files from FTP, or from HDFS is up to you. Both is possible (for FTP, use the `FTPFileSystem` provided by Hadoop and supported by Flink.), for HDFS, its pretty obvious how to read from there.

- Flink has a HBase Sink for writing data. It also comes with a bucketing file sink, which integrates with the exactly-once checkpointing mechanism. So if you are going for the (FTP/HDFS) -> Flink -> RollingSink(HDFS) approach, you'll get end-to-end exactly once.

Maybe as a general note: Flink is much more than just a system you can use for transporting data from A to B. It can do really advanced analytics on data streams, using its powerful Windowing, CEP, SQL APIs, and its great support for stateful computations over streams. Maybe you'll be able to express most of your problems as streaming jobs, and not using HBase for the queries?


I can not give more concrete advise, because the question doesn't mention details such as use-case, SLA requirements and pros/cons for the different approaches.

View solution in original post

Highlighted

Re: How to process data in ftp/directory using flink?

Contributor

Thanks got the idea but my implementation didn't work. I will ask new question for that.

Don't have an account?
Coming from Hortonworks? Activate your account here