Created 10-19-2017 04:46 AM
We have files arriving in a directory / on an FTP server, ranging from 10 MB to 100 MB, at intervals of one minute to several hours. What would be a proper architecture to ingest this data for real-time consumption and batch analysis? I have thought of the following architectures:
files --> Flink --> HBase (real-time queries)
files --> Flink --> HDFS
or
files --> HDFS --> Flink --> HBase
What is the most appropriate architecture for this purpose, and what tools should I use? I want to use Flink because I would like to get metrics during the transformation.
Created 10-19-2017 06:50 PM
You need a handful of Flink APIs and tools to achieve this:
- Apache Flink's DataStream API comes with a data source that can continuously monitor a directory for new files and process them as a stream as they appear in the directory: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/datastream_api.html#data-sources This will give you a stream of events from the files (see the sketch after this list).
- Whether Flink reads the files from FTP or from HDFS is up to you. Both are possible: for FTP, use the `FTPFileSystem` provided by Hadoop and supported by Flink; for HDFS, reading is straightforward.
- Flink has an HBase sink for writing data. It also comes with a bucketing file sink, which integrates with the exactly-once checkpointing mechanism. So if you go for the (FTP/HDFS) -> Flink -> BucketingSink(HDFS) approach, you'll get end-to-end exactly-once semantics.
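As a rough illustration, here is a minimal sketch of such a job against the Flink 1.3-era APIs. The paths, the scan interval, and the `flink-connector-filesystem` dependency (which provides `BucketingSink`) are my assumptions, not something from your setup:

```java
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;

public class FileIngestJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // checkpointing is required for the bucketing sink's exactly-once guarantee
        env.enableCheckpointing(60_000);

        // assumption: input lands here; an "ftp://user:pass@host/dir" URI should also
        // work through Hadoop's FTPFileSystem if it is configured on the classpath
        String inputDir = "hdfs:///data/incoming";
        TextInputFormat format = new TextInputFormat(new Path(inputDir));

        // watch the directory indefinitely, rescanning every 10 seconds for new files
        DataStream<String> lines = env.readFile(
                format, inputDir, FileProcessingMode.PROCESS_CONTINUOUSLY, 10_000L);

        // bucketing file sink: writes rolled files back to HDFS
        lines.addSink(new BucketingSink<String>("hdfs:///data/processed"));

        env.execute("file-ingest");
    }
}
```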
As a general note: Flink is much more than a system for transporting data from A to B. It can do very advanced analytics on data streams, using its powerful windowing, CEP, and SQL APIs, and its strong support for stateful computations over streams. You may be able to express most of your problems directly as streaming jobs rather than using HBase for the queries; a small example follows.
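For instance, a per-key windowed aggregate can often be computed in the stream itself instead of being written out and queried later. A minimal sketch, continuing the `lines` stream from the previous example and assuming a hypothetical `key,value` CSV record layout:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;

// assumption: each line is "key,value", e.g. "sensor-1,42"
DataStream<Tuple2<String, Long>> totals = lines
        .map(new MapFunction<String, Tuple2<String, Long>>() {
            @Override
            public Tuple2<String, Long> map(String line) {
                String[] parts = line.split(",");
                return Tuple2.of(parts[0], Long.parseLong(parts[1]));
            }
        })
        .keyBy(0)                     // partition the stream by the key field
        .timeWindow(Time.minutes(5))  // tumbling 5-minute processing-time windows
        .sum(1);                      // per-key total of the value field per window
```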
I cannot give more concrete advice, because the question doesn't mention details such as the use case, SLA requirements, or the pros and cons you see in the different approaches.
Created 10-20-2017 03:08 PM
Thanks, I got the idea, but my implementation didn't work. I will ask a new question for that.