Created on 07-05-2016 08:03 AM - edited 09-16-2022 03:28 AM
Hi ,
I am doing a poc in which I am trying to write some data on the HDFS using flink . Though I can see the files are getting written but they are stuck with a postfix ".pending" . Any help will be appreciated , also is there a way that only one file is written
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(123, CheckpointingMode.AT_LEAST_ONCE, true);
DataStream<String> text = env.readTextFile("D:/names.txt");
DataStream<String> parsed = text.map(new MapFunction<String, String>() { @Override public String map(String value) { return (value); } });
parsed.flatMap(new FlatMapFunction<String, String>() { public void flatMap(String value, Collector<String> out) throws Exception { for (String s : value.split(" ")) { out.collect(s); } } });
System.setProperty("HADOOP_USER_NAME", "hdfs");
RollingSink<String> sink = new RollingSink<String>("hdfs://MYMACHINE:8020/flink/test8");
sink.setBucketer(new NonRollingBucketer());
parsed.addSink(sink);
env.execute();
Created 07-06-2016 08:14 AM
Hi, unfinished buckets have the .pending extension. Once a bucket is closed (for example for time-bucketing, once the time is over), the file will be renamed. Since you are using the NonRollingBucketer, the files will never be closed. I would recommend you to use the DateTimeBucketer.
As a side note: I would recommend you to increase the checkpointing intervall a bit. 123 milliseconds is very frequent and the application doesn't look like being extremely latency critical. A value like 2000 milliseconds is probably more appropriate.
Created 07-06-2016 08:14 AM
Hi, unfinished buckets have the .pending extension. Once a bucket is closed (for example for time-bucketing, once the time is over), the file will be renamed. Since you are using the NonRollingBucketer, the files will never be closed. I would recommend you to use the DateTimeBucketer.
As a side note: I would recommend you to increase the checkpointing intervall a bit. 123 milliseconds is very frequent and the application doesn't look like being extremely latency critical. A value like 2000 milliseconds is probably more appropriate.
Created 07-06-2016 03:16 PM
thanks Robert, it worked