05-05-2014
11:32 AM
@MattH wrote: Thanks for the quick reply. Do you mean 10,000 rows per second? What's the size of those rows, i.e. what's the rough MB/second ingest rate? Yes, about 10,000 to 20,000 records per second, somewhere around 75 MB/sec.

That's a pretty good ingest rate. I'd look at Kite and Flume to see if they meet your needs. A good approach is to stage the ingested data in a row-based format (e.g. Avro). Periodically, when enough data has accumulated, transform it into Parquet. This could be done via Impala, for example with an "insert into <parquet_table> select * from staging_table". Impala can query tables with mixed formats, so the data still in the staging format remains immediately accessible.

@MattH wrote: This makes sense. I'm assuming I would need to create the process to do the "insert into...", and there is no built-in timer to run this task on an interval or watch the staging table?

No, there is no built-in way to schedule queries periodically. Flume has mechanisms to trigger this based on either time or data volume.

There is very little distinction between external and managed tables. The big difference is that if the table is managed, the files in HDFS are deleted when the table is dropped. Automatic partition creation happens at ingest: for example, if you ingest via an insert into a date-partitioned table, Impala will automatically create partitions for new date values.

@MattH wrote: If it was external, I would not issue an "insert into..." command, since some other process would be putting the data into HDFS and the table would just be an interface over that data, right?

It doesn't matter whether the table is external or managed; you can still drop files into the table's path in HDFS and have them picked up. The distinction is what happens when the table is dropped.
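The staging-to-Parquet flow described above can be sketched in Impala SQL. The table, column, and path names here (events_staging, events_parquet, event_date, /data/events/staging) are hypothetical placeholders, not from the thread:

```sql
-- Hypothetical row-oriented staging table; new Avro files land here
-- continuously (e.g. written by Flume).
CREATE EXTERNAL TABLE events_staging (
  id BIGINT,
  payload STRING,
  event_date STRING
)
STORED AS AVRO
LOCATION '/data/events/staging';

-- Columnar table, partitioned by date.
CREATE TABLE events_parquet (
  id BIGINT,
  payload STRING
)
PARTITIONED BY (event_date STRING)
STORED AS PARQUET;

-- Periodic conversion step; with dynamic partitioning, Impala creates
-- any missing event_date partitions automatically.
INSERT INTO events_parquet PARTITION (event_date)
SELECT id, payload, event_date FROM events_staging;
```

Since Impala has no built-in scheduler, the INSERT statement would have to be driven by an external mechanism, e.g. a cron job invoking impala-shell, or a Flume-triggered process as suggested above.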