
Apache Flume and Parquet

Contributor

Hi,

Is it possible to configure Apache Flume to save my logs in HDFS in Parquet format?

Thanks very much!

Miguel Angel.

1 ACCEPTED SOLUTION

Super Collaborator

There's been some debate about this. Personally, I see Flume as an inherently stream-based, row-oriented system, while Parquet is an inherently batch-optimized, column-oriented format, so I'm not sure direct Parquet output is a great fit. That said, some folks argue it can make sense in certain cases, and for some workloads it does.

 

While I don't know of a way to get Parquet directly out of Flume today, I explored one way to get data from Flume into Impala, ultimately stored as Parquet for fast columnar querying, in a presentation I gave at Hadoop Summit 2013:

 

https://github.com/mpercy/flume-rtq-hadoop-summit-2013/blob/master/flume-low-latency-analytics-hadoo...

 

The basic idea is to store the data in Avro format from Flume, then use Impala to convert it to Parquet on a schedule. This has some nice properties, like low-latency access to the data. Now that views are available in recent versions of Impala, the approach should be even easier to use.
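
As a rough illustration of the first half (agent, sink, and path names are all hypothetical), a Flume HDFS sink can be configured to write Avro container files using the built-in avro_event serializer:

    # Hypothetical agent "a1" with an HDFS sink writing Avro container files
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    # Date escapes assume a timestamp header, e.g. from a timestamp interceptor
    a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.hdfs.fileSuffix = .avro
    a1.sinks.k1.serializer = avro_event

Impala can then map a table over those Avro files and rewrite them as Parquet, as sketched later in this thread.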

 

Hope this helps!

 

Mike

 


5 REPLIES


Contributor

Yes, using Impala or Hive to convert the stream coming from Flume to Parquet is a good option, although it would be nice to have native support.

Thanks!

Miguel Angel.

Super Collaborator

You're welcome!


Hi Mike,

How do you convert the Avro data to Parquet, and what do you use to schedule this process?

Is the code hosted somewhere? Thanks.

Super Collaborator
Impala can do the conversion via SQL statements. I'd recommend asking the Impala guys for advice there, as my information is a bit dated on this front now that views and improved metastore features have been added.
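
For reference, a minimal sketch of the kind of statements involved, assuming the Flume data is already exposed as an Avro-backed table (all table and column names are hypothetical):

    -- One-time conversion: materialize the Avro staging table as Parquet
    CREATE TABLE flume_events_parquet STORED AS PARQUET
    AS SELECT * FROM flume_events_avro;

    -- Ongoing loads: append just the newly landed batch on a schedule,
    -- e.g. a cron job that runs impala-shell with this statement
    INSERT INTO flume_events_parquet
    SELECT * FROM flume_events_avro WHERE event_date = '2013-07-01';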

Mike