
Impala for real time data ingest

Explorer

I am looking at using Impala for “real-time” reporting via BI tools.  It is my understanding that Impala is the tool to use for real-time reporting vs. Hive or custom MapReduce.  I also read that using the Parquet format can increase query speed when only a few columns of data are selected.

 

What I would like to do is have Impala-managed tables in the Parquet format, partitioned on time, where each partition is at least 1 GB.  Another requirement is that I need to load data into the table in real time at a rate of tens of thousands of records per second.

 

What I am noticing is that the Impala insert times are slow and won’t keep up with this rate. 

 

Is an Impala managed table not designed for real time ingest?

 

Is there a better way to load data that can be done in real time?

 

Should I be utilizing an Impala external table?  If so, do the files loaded into HDFS need to be in Parquet format?  How do I manage automatic partition creation?

 

Would performance be better if I were to create a table over data in HBase?  If so, does each record need to be structured the same way (same fields) for Impala to know how to access it?


4 REPLIES

Contributor

Hey Matt,

 

You may find the dataset-staging example from the Kite SDK to be instructive on how to write messages into HDFS using the row-based Avro format for high write throughput and then periodically rewrite the files using the columnar Parquet format. I haven't tried this example out myself with Impala, so let me know how it works for you if you do make time to try it out!

 

Regards,
Jeff

Contributor
Inline.
@MattH wrote:

I am looking at using Impala for “real-time” reporting via BI tools.  It is my understanding that Impala is the tool to use for real-time reporting vs. Hive or custom MapReduce.  I also read that using the Parquet format can increase query speed when only a few columns of data are selected.

 

What I would like to do is have Impala-managed tables in the Parquet format, partitioned on time, where each partition is at least 1 GB.  Another requirement is that I need to load data into the table in real time at a rate of tens of thousands of records per second.

Do you mean 10,000 rows per second? What's the size of those rows, i.e. what's the rough MB/second ingest rate?

 

What I am noticing is that the Impala insert times are slow and won’t keep up with this rate. 

A single insert statement will always generate a new file, so this is probably not what you want to do.

 

Is an Impala managed table not designed for real time ingest?

Is there a better way to load data that can be done in real time?

A good approach is to stage the ingest data in a row-based format (e.g. Avro). Periodically, when enough data has accumulated, the data would be transformed into Parquet (this could be done via Impala, for example by doing an "insert into <parquet_table> select * from staging_table"). Impala can query tables that are mixed format, so the data in the staging format would still be immediately accessible.
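
To make this concrete, here is a rough sketch of the staging pattern in Impala SQL. The table and column names are made up for illustration, and depending on your Impala version the Avro staging table may need to be created through Hive instead:

  -- Row-oriented staging table that the ingest pipeline (e.g. Flume) writes to
  CREATE TABLE events_staging (
    event_time STRING,   -- e.g. 'yyyy-MM-dd HH:mm:ss'
    user_id    BIGINT,
    payload    STRING
  )
  STORED AS AVRO;

  -- Columnar reporting table, partitioned by day
  CREATE TABLE events_parquet (
    event_time STRING,
    user_id    BIGINT,
    payload    STRING
  )
  PARTITIONED BY (event_day STRING)
  STORED AS PARQUET;

  -- Run periodically: roll the accumulated staging data into Parquet.
  -- The dynamic partition column goes last in the select list, and Impala
  -- creates partitions for new event_day values automatically.
  INSERT INTO events_parquet PARTITION (event_day)
  SELECT event_time, user_id, payload, SUBSTR(event_time, 1, 10) AS event_day
  FROM events_staging;

After each rollup you would also need to clear or swap out the staging data so the same rows are not converted twice.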

 

Take a look at the Flume project, which will help with this: http://flume.apache.org/

 

Should I be utilizing an Impala external table?  If so, do the files loaded into HDFS need to be in Parquet format?  How do I manage automatic partition creation?

There is very little distinction between external and managed. The big difference is that if the table is managed, the files in HDFS are deleted when the table is dropped. The automatic partition creation is done in the ingest. For example, if you did the ingest via an insert into a date-partitioned table, Impala will automatically create partitions for new date values.
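
For the external-table route, a minimal sketch (again with made-up names and paths) might look like this; note that if partition directories are written by some outside process rather than by an Impala insert, the partitions have to be registered explicitly:

  -- External table over a directory that another process writes to.
  -- DROP TABLE here removes only the metadata; the files stay in HDFS.
  CREATE EXTERNAL TABLE events_raw (
    event_time STRING,
    user_id    BIGINT,
    payload    STRING
  )
  PARTITIONED BY (event_day STRING)
  STORED AS PARQUET
  LOCATION '/data/events';

  -- Register a partition whose files were dropped directly into HDFS
  ALTER TABLE events_raw ADD PARTITION (event_day = '2014-09-01');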

 

Would performance be better if I were to create a table over data in HBase?  If so, does each record need to be structured the same way (same fields) for Impala to know how to access it?

Scan performance on top of HBase is currently much worse than on top of HDFS. You can take a look at this doc on how to match the schemas: http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Im...
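
If you do want to experiment with the HBase route, the column mapping is declared in the table definition. A hedged sketch with invented table, column-family, and qualifier names (the DDL is run in Hive, since that is where HBase-backed tables are defined; Impala can query the table after an INVALIDATE METADATA):

  CREATE EXTERNAL TABLE events_hbase (
    row_key    STRING,
    event_time STRING,
    payload    STRING
  )
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES (
    "hbase.columns.mapping" = ":key,d:event_time,d:payload"
  )
  TBLPROPERTIES ("hbase.table.name" = "events");

Only the mapped columns are visible to Impala; as far as I know, rows that are missing one of the mapped column qualifiers simply return NULL for that column, so every record does not have to carry exactly the same fields.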


 

Explorer

Thanks for the quick reply,

 

Do you mean 10,000 rows per second? What's the size of those rows, i.e. what's the rough MB/second ingest rate?

Yes about 10,000 to 20,000 records per second, somewhere around 75 MB/sec.

 

A good approach is to stage the ingest data in a row-based format (e.g. Avro). Periodically, when enough data has accumulated, the data would be transformed into Parquet (this could be done via Impala, for example by doing an "insert into <parquet_table> select * from staging_table"). Impala can query tables that are mixed format, so the data in the staging format would still be immediately accessible.

This makes sense.  I'm assuming I would need to create the process to do the "insert into..." myself, and there is not a built-in timer to run this task on an interval or watch the staging table?

There is very little distinction between external and managed. The big difference is that if the table is managed, the files in HDFS are deleted when the table is dropped. The automatic partition creation is done in the ingest. For example, if you did the ingest via an insert into a date-partitioned table, Impala will automatically create partitions for new date values.

If it was external, I would not issue an "insert into..." command, since some other process would be putting the data into HDFS and the table would just be an interface over that data, right?

 

 

Contributor (Accepted Solution)

@MattH wrote:

Thanks for the quick reply,

 

Do you mean 10,000 rows per second? What's the size of those rows, i.e. what's the rough MB/second ingest rate?

Yes about 10,000 to 20,000 records per second, somewhere around 75 MB/sec.

That's a pretty good ingest rate. I'd look at Kite and Flume to see if that meets your needs.

 

A good approach is to stage the ingest data in a row-based format (e.g. Avro). Periodically, when enough data has accumulated, the data would be transformed into Parquet (this could be done via Impala, for example by doing an "insert into <parquet_table> select * from staging_table"). Impala can query tables that are mixed format, so the data in the staging format would still be immediately accessible.

This makes sense.  I'm assuming I would need to create the process to do the "insert into..." myself, and there is not a built-in timer to run this task on an interval or watch the staging table?

No, there is no built-in way to schedule queries periodically. Flume has mechanisms to do this based on either time or data volume.

 

There is very little distinction between external and managed. The big difference is that if the table is managed, the files in HDFS are deleted when the table is dropped. The automatic partition creation is done in the ingest. For example, if you did the ingest via an insert into a date-partitioned table, Impala will automatically create partitions for new date values.

If it was external, I would not issue an "insert into..." command, since some other process would be putting the data into HDFS and the table would just be an interface over that data, right?

It doesn't matter whether the table is external or managed; you can still drop files into the path in HDFS and have them picked up. The distinction is what happens when the table is dropped.
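
One practical note on dropping files into the HDFS path: Impala caches file and block metadata, so after new files land you generally have to tell it to re-scan the table (the table name below is a placeholder):

  -- After another process adds data files under the table's directory
  REFRESH events_raw;

  -- If the table itself was created or changed outside Impala (e.g. in Hive),
  -- reload its metadata completely instead
  INVALIDATE METADATA events_raw;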