About rmetzger1

rmetzger1 · ‎10-19-2017

You need a bunch of APIs, tooling in Flink to achieve this: - Apache Flink's DataStream API comes with a data source that is able to continiously monitor a directory for new files, and process these files in a stream as they appear in the directory: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/datastream_api.html#data-sources This will allow you to get a stream of events from the files. - Whether you are letting Flink read the files from FTP, or from HDFS is up to you. Both is possible (for FTP, use the `FTPFileSystem` provided by Hadoop and supported by Flink.), for HDFS, its pretty obvious how to read from there. - Flink has a HBase Sink for writing data. It also comes with a bucketing file sink, which integrates with the exactly-once checkpointing mechanism. So if you are going for the (FTP/HDFS) -> Flink -> RollingSink(HDFS) approach, you'll get end-to-end exactly once. Maybe as a general note: Flink is much more than just a system you can use for transporting data from A to B. It can do really advanced analytics on data streams, using its powerful Windowing, CEP, SQL APIs, and its great support for stateful computations over streams. Maybe you'll be able to express most of your problems as streaming jobs, and not using HBase for the queries? I can not give more concrete advise, because the question doesn't mention details such as use-case, SLA requirements and pros/cons for the different approaches.

rmetzger1 · ‎09-05-2016

I think the Ambari Plugin has been developed and tested against Ambari 2.3. Maybe its not compatible with the latest HDP release anymore. (See also: https://github.com/abajwa-hw/ambari-flink-service)

rmetzger1 · ‎08-25-2016

Hi Klaus, did you check whether the TaskManagers properly started? Their log files should tell you if there was an issue.

rmetzger1 · ‎07-06-2016

Hi, unfinished buckets have the .pending extension. Once a bucket is closed (for example for time-bucketing, once the time is over), the file will be renamed. Since you are using the NonRollingBucketer, the files will never be closed. I would recommend you to use the DateTimeBucketer. As a side note: I would recommend you to increase the checkpointing intervall a bit. 123 milliseconds is very frequent and the application doesn't look like being extremely latency critical. A value like 2000 milliseconds is probably more appropriate.

rmetzger1 · ‎01-04-2016

Thank you for posting the code here @Hemant Kumar. Sorry for the delay, I've now looked at and it looks good. If you want to optimize the code, I would try to move the code from the FlinkJSONObject constructor into the HTTPJSONStream class. It seems that the do/while loop is actually controlled from within the FlinkJSONObject.

rmetzger1 · ‎12-21-2015

Sorry for the late reply. As I said in the other thread, I would recommend using a SourceFunction here. Feel free to post your code here as well, I can review it and give some hints for improving it.

rmetzger1 · ‎12-21-2015

Right now, there are not methods for reading HTTP streams. But implementing a custom "SourceFunction" (as you suggested) is the recommended appropach. Please let me know if you have further questions.

Online	Offline
Last Visited	‎10-19-2017 06:50 PM

Member Since	‎12-08-2015 05:07 PM
Last Visited	‎10-19-2017 06:50 PM
Posts	9
Kudos received	4

Cloudera Community

Re: How to process data in ftp/directory using fli...

Re: Flink : Files written to HDFS are stuck in .pe...

Re: How to process data in ftp/directory using fli...

Re: Problem with flink( Hortonworks)

Re: Flink cluster configuration issue - no slots a...

Re: Flink : Files written to HDFS are stuck in .pe...

Re: How to : HTTP Stream in flink

Re: How to : HTTP Stream in flink

Re: Error while creating a DataStream using fromEl...