Created on 04-08-2014 06:39 AM - edited 09-16-2022 01:56 AM
Hi.
Is it possible to configure Apache Flume to save my logs in HDFS as Parquet?
Thanks very much!!!!
Miguel Angel.
Created 04-08-2014 11:50 PM
There's been some debate about this. Personally, I believe that Flume is an inherently stream-based, row-oriented system, and Parquet is an inherently batch-optimized, column-oriented format. So I'm not sure whether it's a great fit in terms of direct output. On the other hand, some folks argue that it can make sense in some cases, and that is true.
While I don't know of a way to get Parquet directly out of Flume today, I explored one way to get data from Flume into Impala, ultimately stored as Parquet for fast columnar querying, in this presentation I gave at Hadoop Summit 2013:
The basic idea is that you store the data in Avro format from Flume, then use Impala to convert the data to Parquet on a schedule. This has some pretty nice properties, like low-latency access to the data. Now that Views are available in recent versions of Impala, that approach should be even easier to use.
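As a rough illustration of the scheduled conversion step (a minimal sketch, not the code from the presentation), the script below builds the Impala statement that rewrites one day of Avro data into a Parquet table. The table names logs_avro and logs_parquet, the partition column day, and the column names are all hypothetical; it assumes both tables already exist, are partitioned by a string column day, and that the destination table is stored as Parquet.

```python
from datetime import date, timedelta


def build_convert_sql(day, columns, src="logs_avro", dst="logs_parquet"):
    """Build an Impala statement that rewrites one day's partition as Parquet.

    All table and column names are hypothetical placeholders. The columns are
    listed explicitly so the partition column is not selected twice.
    """
    col_list = ", ".join(columns)
    day_str = day.isoformat()
    return (
        f"INSERT OVERWRITE TABLE {dst} PARTITION (day='{day_str}') "
        f"SELECT {col_list} FROM {src} WHERE day='{day_str}'"
    )


if __name__ == "__main__":
    # Convert yesterday's data, on the assumption that Flume has finished
    # writing that partition by the time the job runs.
    yesterday = date.today() - timedelta(days=1)
    print(build_convert_sql(yesterday, ["ts", "host", "message"]))
```

One way to run this on a schedule would be a daily cron entry that passes the printed statement to impala-shell with its -q option. Using INSERT OVERWRITE on a single partition keeps the job safe to re-run for the same day.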
Hope this helps!
Mike
Created 04-09-2014 08:15 AM
Yes. Using Impala or Hive to convert the stream from Flume to Parquet is a good option, although it would be nice to have it natively.
Thanks!!!!
Miguel Angel.
Created 04-10-2014 11:22 AM
You're welcome!
Created 05-29-2014 02:53 AM
Hi Mike,
How do you convert the Avro data to Parquet, and what do you use to schedule this process?
Is the code hosted somewhere? Thanks.
Created 05-29-2014 02:58 AM