Member since 12-08-2015
9 Posts
4 Kudos Received
2 Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 4298 | 10-19-2017 06:50 PM |
 | 2672 | 07-06-2016 08:14 AM |
10-19-2017
06:50 PM
1 Kudo
You need a combination of Flink APIs and tooling to achieve this:
- Apache Flink's DataStream API comes with a data source that can continuously monitor a directory for new files and process them as a stream as they appear in the directory: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/datastream_api.html#data-sources This gives you a stream of events from the files.
- Whether you let Flink read the files from FTP or from HDFS is up to you. Both are possible: for FTP, use the `FTPFileSystem` provided by Hadoop and supported by Flink; for HDFS, reading is straightforward.
- Flink has an HBase sink for writing data. It also comes with a bucketing file sink that integrates with the exactly-once checkpointing mechanism. So if you go for the (FTP/HDFS) -> Flink -> RollingSink(HDFS) approach, you'll get end-to-end exactly-once.
As a general note: Flink is much more than a system for transporting data from A to B. It can run advanced analytics on data streams using its windowing, CEP, and SQL APIs, and it has strong support for stateful computations over streams. You may be able to express most of your problems as streaming jobs instead of using HBase for the queries. I cannot give more concrete advice, because the question doesn't mention details such as the use case, SLA requirements, and the pros and cons of the different approaches.
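To make the first bullet more concrete: in Flink, continuous monitoring is requested through `readFile(...)` with `FileProcessingMode.PROCESS_CONTINUOUSLY` and a polling interval. The following is only a plain-Java sketch of the underlying idea (poll a directory, report files not seen before), so it runs without Flink on the classpath. `DirectoryPoller` and `pollNewFiles` are hypothetical names, and tracking files by path alone is a simplification of what Flink does.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not a Flink class: remembers which files it has
// already reported and returns only the new ones on each poll.
class DirectoryPoller {
    private final Path dir;
    private final Set<Path> seen = new HashSet<>();

    DirectoryPoller(Path dir) {
        this.dir = dir;
    }

    // Lists the directory once and returns the regular files not seen before, sorted by path.
    List<Path> pollNewFiles() throws IOException {
        List<Path> fresh = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path p : entries) {
                // Set.add returns false for paths reported by an earlier poll.
                if (Files.isRegularFile(p) && seen.add(p)) {
                    fresh.add(p);
                }
            }
        }
        Collections.sort(fresh);
        return fresh;
    }
}
```

A caller would invoke `pollNewFiles()` on a timer and hand each returned file to the downstream processing; in a real job, Flink's own source does this for you and also takes part in checkpointing.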
09-05-2016
06:48 PM
I think the Ambari plugin was developed and tested against Ambari 2.3. It may no longer be compatible with the latest HDP release. (See also: https://github.com/abajwa-hw/ambari-flink-service)
08-25-2016
06:56 AM
Hi Klaus, did you check whether the TaskManagers properly started? Their log files should tell you if there was an issue.
07-06-2016
08:14 AM
1 Kudo
Hi,
Unfinished buckets have the .pending extension. Once a bucket is closed (for example, with time-based bucketing, once the time period is over), the file is renamed.
Since you are using the NonRollingBucketer, the files will never be closed. I recommend using the DateTimeBucketer instead.
As a side note: I would also increase the checkpointing interval a bit. 123 milliseconds is very frequent, and the application doesn't look extremely latency-critical. A value like 2000 milliseconds is probably more appropriate.
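To illustrate why time-based bucketing lets files close while a single non-rolling bucket never does, here is a small plain-Java sketch of the idea. It is not Flink's implementation: `TimeBuckets`, `bucketPath`, and `isClosed` are hypothetical names, and the hourly pattern and UTC zone are assumptions made for the example.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper illustrating time-based bucketing; not Flink's implementation.
class TimeBuckets {
    // One bucket per hour; pattern and time zone are assumptions for this sketch.
    private static final DateTimeFormatter HOURLY =
            DateTimeFormatter.ofPattern("yyyy-MM-dd--HH").withZone(ZoneOffset.UTC);

    // Returns the bucket directory that a record arriving at `now` belongs to.
    static String bucketPath(String basePath, Instant now) {
        return basePath + "/" + HOURLY.format(now);
    }

    // A bucket can be finalized once the clock maps to a different bucket path.
    static boolean isClosed(String basePath, Instant bucketTime, Instant now) {
        return !bucketPath(basePath, bucketTime).equals(bucketPath(basePath, now));
    }
}
```

With a single fixed bucket, the path never changes, so the "closed" condition is never met and the files stay pending. In the job itself, the corresponding settings would be the bucketer passed to the sink and the interval passed to `enableCheckpointing`.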
01-04-2016
08:32 AM
Thank you for posting the code here @Hemant Kumar. Sorry for the delay, I've now looked at it and it looks good.
If you want to optimize the code, I would try to move the code from the FlinkJSONObject constructor into the HTTPJSONStream class. It seems that the do/while loop is actually controlled from within the FlinkJSONObject.
12-21-2015
03:33 PM
1 Kudo
Sorry for the late reply. As I said in the other thread, I would recommend using a SourceFunction here. Feel free to post your code here as well; I can review it and give some hints for improving it.
12-21-2015
03:30 PM
Right now, there are no built-in methods for reading HTTP streams. Implementing a custom "SourceFunction" (as you suggested) is the recommended approach. Please let me know if you have further questions.
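For orientation, a Flink SourceFunction has two methods: `run`, which emits records until the source is done or stopped, and `cancel`, which asks it to stop. The sketch below mirrors that shape in plain Java so it runs without Flink on the classpath. `LineStreamSource` is a hypothetical name, the `Consumer<String>` stands in for the source context's collect call, and the `Callable<BufferedReader>` is an assumed hook where a real job would open the HTTP response body.

```java
import java.io.BufferedReader;
import java.util.concurrent.Callable;
import java.util.function.Consumer;

// Plain-Java stand-in that mirrors the run()/cancel() shape of a Flink SourceFunction.
class LineStreamSource {
    // Assumed hook: in a real job this would open the HTTP response body.
    private final Callable<BufferedReader> opener;
    // volatile so that a cancel() from another thread is visible to the read loop.
    private volatile boolean running = true;

    LineStreamSource(Callable<BufferedReader> opener) {
        this.opener = opener;
    }

    // Reads lines until the stream ends or cancel() is called, handing each line to `out`.
    void run(Consumer<String> out) throws Exception {
        try (BufferedReader reader = opener.call()) {
            String line;
            while (running && (line = reader.readLine()) != null) {
                out.accept(line);
            }
        }
    }

    // Stops the loop before the next read; it does not interrupt a read that is already blocked.
    void cancel() {
        running = false;
    }
}
```

Keeping the read loop inside the source class, with the flag checked on every iteration, is what lets the framework stop the source cleanly.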