
Apache Flume and Parquet

Contributor

Hi,

Is it possible to configure Apache Flume to save my logs in HDFS in Parquet format?

Thanks very much!

Miguel Angel.

1 ACCEPTED SOLUTION

Super Collaborator

There's been some debate about this. Personally, I see Flume as an inherently stream-based, row-oriented system, while Parquet is an inherently batch-optimized, column-oriented format, so I'm not sure direct Parquet output is a great fit. That said, some folks argue it can make sense in certain cases, and for some workloads it does.

 

While I don't know of a way to get Parquet directly out of Flume today, I explored one way to get data from Flume into Impala, ultimately stored as Parquet for fast columnar querying, in a presentation I gave at Hadoop Summit 2013:

 

https://github.com/mpercy/flume-rtq-hadoop-summit-2013/blob/master/flume-low-latency-analytics-hadoo...

 

The basic idea is to store the data in Avro format from Flume, then use Impala to convert it to Parquet on a schedule. This has some nice properties, like low-latency access to the data. Now that views are available in recent versions of Impala, the approach should be even easier to use.
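
As a rough illustration of the first half (agent, sink, and path names are all hypothetical), a Flume HDFS sink can be configured to write Avro container files using the built-in avro_event serializer:

    # Hypothetical agent "a1" with an HDFS sink writing Avro container files
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    # Date escapes assume a timestamp header, e.g. from a timestamp interceptor
    a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.hdfs.fileSuffix = .avro
    a1.sinks.k1.serializer = avro_event

Impala can then map a table over those Avro files and rewrite them as Parquet, as sketched later in this thread.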

 

Hope this helps!

 

Mike

 


5 REPLIES


Contributor

Yes, using Impala or Hive to convert the stream coming from Flume to Parquet is a good option, although it would be nice to have native support.

Thanks!

Miguel Angel.

Super Collaborator

You're welcome!


Hi Mike,

How do you convert the Avro data to Parquet, and what do you use to schedule this process?

Is the code hosted somewhere? Thanks.

Super Collaborator
Impala can do the conversion via SQL statements. I'd recommend asking the Impala guys for advice there, as my information is a bit dated on this front now that views and improved metastore features have been added.
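
For reference, a minimal sketch of the kind of statements involved, assuming the Flume data is already exposed as an Avro-backed table (all table and column names are hypothetical):

    -- One-time conversion: materialize the Avro staging table as Parquet
    CREATE TABLE flume_events_parquet STORED AS PARQUET
    AS SELECT * FROM flume_events_avro;

    -- Ongoing loads: append just the newly landed batch on a schedule,
    -- e.g. a cron job that runs impala-shell with this statement
    INSERT INTO flume_events_parquet
    SELECT * FROM flume_events_avro WHERE event_date = '2013-07-01';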

Mike