05-01-2014
01:12 PM
Thanks for the quick reply.

You asked: Do you mean 10,000 rows per second? What's the size of those rows, i.e. what's the rough MB/second ingest rate?

Yes, about 10,000 to 20,000 records per second, somewhere around 75 MB/sec.

You suggested: A good approach is to stage the ingest data into a row-based format (e.g. Avro). Periodically, when enough data has been accumulated, the data would be transformed into Parquet. (This could be done via Impala, for example by doing an "insert into <parquet_table> select * from staging_table".) Impala can query tables that are mixed format, so the data in the staging format would still be immediately accessible.

This makes sense. I'm assuming I would need to create the process that runs the "insert into..." myself, and there is no built-in timer to run this task on an interval or to watch the staging table?

You also noted: There is very little distinction between external and managed. The big difference is that if the table is managed, the files in HDFS are deleted when the table is dropped. The automatic partition creation is done during the ingest. For example, if you did the ingest via an insert into a date-partitioned table, Impala will automatically create partitions for new date values.

If the table were external, I would not issue an "insert into..." command, since some other process would be putting the data into HDFS and the table would just be an interface over that data, right?
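For concreteness, here is a minimal sketch of the staging pattern described above. Table and column names (events_staging, events_parquet, etc.) are hypothetical, and the periodic conversion step would have to be driven by an external scheduler such as cron, since Impala has no built-in timer:

-- Staging table: row-oriented Avro, cheap to append to from the ingest path.
-- (Avro tables in Impala do not support TIMESTAMP, so the event time is a string here.)
CREATE TABLE events_staging (
  event_time STRING,
  user_id    BIGINT,
  payload    STRING
)
STORED AS AVRO;

-- Reporting table: Parquet, partitioned by date so BI queries scan only the
-- partitions (and columns) they need.
CREATE TABLE events_parquet (
  event_time TIMESTAMP,
  user_id    BIGINT,
  payload    STRING
)
PARTITIONED BY (event_date STRING)
STORED AS PARQUET;

-- Periodic conversion step, run on an interval by an external scheduler.
-- Impala creates any missing event_date partitions automatically.
INSERT INTO events_parquet PARTITION (event_date)
SELECT CAST(event_time AS TIMESTAMP),
       user_id,
       payload,
       to_date(CAST(event_time AS TIMESTAMP)) AS event_date
FROM events_staging;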
05-01-2014
08:05 AM
I am looking at using Impala for "real time" reporting via BI tools. My understanding is that Impala is the tool to use for real-time reporting, as opposed to Hive or custom MapReduce. I have also read that the Parquet format can speed up queries that select only a few columns.

What I would like to do is have Impala-managed tables in Parquet format, partitioned on time, with partitions of at least 1 GB. Another requirement is that I need to load data into the table in real time at a rate of tens of thousands of records per second. What I am noticing is that Impala insert times are slow and won't keep up with this rate.

Is an Impala-managed table not designed for real-time ingest? Is there a better way to load data in real time? Should I be using an Impala external table? If so, do the files loaded into HDFS need to be in Parquet format? How do I manage automatic partition creation? Would performance be better if I created a table over data in HBase? If so, does each record need to have the same structure (same fields) for Impala to know how to access it?
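For reference on the last question: an HBase-backed table is typically defined through Hive's HBase storage handler and then queried from Impala. A rough sketch, with hypothetical table and column names:

-- Hive DDL (Impala can query the table once it exists in the metastore).
-- Every row is read through the same column mapping; columns missing in a
-- given HBase row simply come back as NULL.
CREATE EXTERNAL TABLE events_hbase (
  rowkey     STRING,
  event_time STRING,
  payload    STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,d:event_time,d:payload"
)
TBLPROPERTIES ("hbase.table.name" = "events");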