
Apache Flume and parquet

Solved


Explorer

Hi.

 

Is it possible to configure Apache Flume to save my logs in HDFS in Parquet format?

 

Thanks very much!!!!

 

Miguel Angel.

1 ACCEPTED SOLUTION

Accepted Solutions

Re: Apache Flume and parquet

Expert Contributor

There's been some debate about this. Personally, I believe that Flume is an inherently stream-based, row-oriented system, while Parquet is an inherently batch-optimized, column-oriented format, so I'm not sure it's a great fit for direct output. That said, some folks argue it can make sense in certain cases, and that's fair.

 

While I don't know of a way to get Parquet directly out of Flume today, I explored one way to get data from Flume into Impala, and ultimately stored as Parquet for fast, columnar querying in this presentation I gave at Hadoop Summit 2013:

 

https://github.com/mpercy/flume-rtq-hadoop-summit-2013/blob/master/flume-low-latency-analytics-hadoo...

 

The basic idea is that you store the data in Avro format from Flume, then use Impala to convert the data to Parquet on a schedule. This has some pretty nice properties, like low-latency access to the data. Now that Views are available in recent versions of Impala, that approach should be even easier to use.
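As a rough sketch of the "store as Avro from Flume" half of that idea, an HDFS sink can be pointed at an Avro event serializer. The property names below come from the standard Flume HDFS sink; the agent, sink, channel, and path names are assumptions for illustration:

```properties
# Hypothetical agent "a1" writing Avro container files to HDFS.
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
# DataStream (not SequenceFile), so the serializer controls the file format.
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.fileSuffix = .avro
# Serialize each event as an Avro record instead of plain text.
a1.sinks.k1.serializer = avro_event
# Roll files on time/size so downstream readers only see closed files.
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
```

The time-based path escape (`%Y-%m-%d`) keeps each day's data in its own directory, which makes the scheduled Avro-to-Parquet conversion easier to partition.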

 

Hope this helps!

 

Mike

 

5 REPLIES


Re: Apache Flume and parquet

Explorer

Yes, using Impala or Hive to convert the stream from Flume to Parquet is a good option, although it would be nice to have it natively.

 

Thanks!!!!

 

Miguel Angel.

Re: Apache Flume and parquet

Expert Contributor

You're welcome!

Re: Apache Flume and parquet

New Contributor

Hi Mike,

How do you convert the Avro data to Parquet, and what do you use to schedule this process?

Is the code hosted somewhere? Thanks.

Re: Apache Flume and parquet

Expert Contributor
Impala can do the conversion via SQL statements. I'd recommend asking the Impala folks for advice there, as my information is a bit dated on this front now that views and improved metastore features have been added.

Mike
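A minimal sketch of the SQL conversion described above, assuming Impala (or Hive) and hypothetical table, column, and path names; the real column list would have to match the Avro schema Flume wrote:

```sql
-- Hypothetical external table mapped over the Avro files Flume produced.
CREATE EXTERNAL TABLE logs_avro (ts STRING, msg STRING)
STORED AS AVRO
LOCATION '/flume/events';

-- Rewrite the rows into a Parquet table for fast columnar scans.
CREATE TABLE logs_parquet STORED AS PARQUET AS
SELECT * FROM logs_avro;

-- On subsequent scheduled runs, append only the newly arrived data, e.g.:
-- INSERT INTO logs_parquet SELECT * FROM logs_avro WHERE ts > ...;
```

For scheduling, a cron entry or an Oozie coordinator could run the statements periodically, for example with `impala-shell -f convert.sql`.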
