Member since: 09-27-2013
Posts: 27
Kudos Received: 1
Solutions: 2
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4346 | 07-07-2014 11:48 AM |
| | 5299 | 05-05-2014 11:32 AM |
07-07-2014 11:48 AM
How many resulting partitions are you expecting? The query is generating all the output partitions on each data node (DN), which can result in instability. Take a look at these docs, in particular the SHUFFLE and NOSHUFFLE sections: http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_insert.html
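As a hedged illustration of the SHUFFLE hint mentioned above (the table and column names here are hypothetical), the hint goes between the PARTITION clause and the SELECT:

```sql
-- With [SHUFFLE], rows are redistributed by partition key before writing,
-- so each output partition is written by one node rather than every node
-- opening a writer for every partition.
INSERT OVERWRITE sales_parquet PARTITION (year, month) [SHUFFLE]
SELECT id, amount, year, month
FROM sales_staging;
```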
05-05-2014 11:32 AM
@MattH wrote:
> Thanks for the quick reply.
>
> > Do you mean 10,000 rows per second? What's the size of those rows, i.e. what's the rough MB/second ingest rate?
>
> Yes, about 10,000 to 20,000 records per second, somewhere around 75 MB/sec.

That's a pretty good ingest rate. I'd look at Kite and Flume to see if that meets your needs.

> > A good approach is to stage the ingest data into a row-based format (e.g. Avro). Periodically, when enough data has been accumulated, the data would be transformed into Parquet (this could be done via Impala, for example by doing an "insert into <parquet_table> select * from staging_table"). Impala can query tables that are mixed format, so the data in the staging format would still be immediately accessible.
>
> This makes sense. I'm assuming I would need to create the process to do the "insert into..." and there is not a built-in timer to run this task on an interval or watch the staging table?

No, there is no built-in way to schedule queries periodically. Flume has mechanisms to do this based on either time or data volume.

> > There is very little distinction between external and managed. The big difference is that if it is managed, the files in HDFS are deleted if the table is dropped. The automatic partition creation is done in the ingest. For example, if you did the ingest via an insert into a date-partitioned table, Impala will automatically create partitions for new date values.
>
> If it was external, I would not issue an "insert into..." command since some other process would be putting the data into HDFS and the table would just be an interface over that data, right?

It doesn't matter whether the table is external or managed; you can still drop files into the path in HDFS and have them picked up. The distinction is what happens on the drop-table path.
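A minimal sketch of the staging-then-convert flow described above (table and column names are hypothetical; depending on the Impala version, an Avro table may also need an explicit Avro schema in TBLPROPERTIES):

```sql
-- Row-based staging table that Flume (or another ingest process) writes into.
CREATE TABLE events_staging (
  event_time STRING,
  user_id    BIGINT,
  payload    STRING
)
STORED AS AVRO;

-- Columnar table used for reporting queries.
CREATE TABLE events_parquet (
  event_time STRING,
  user_id    BIGINT,
  payload    STRING
)
STORED AS PARQUET;

-- Run periodically from an external scheduler (Impala has no built-in timer).
INSERT INTO events_parquet SELECT * FROM events_staging;
```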
05-01-2014 12:52 PM
Inline.

@MattH wrote:
> I am looking at using Impala for "real time" reporting via BI tools. It is my understanding that Impala is the tool to use for real-time reporting vs. Hive or custom MapReduce. I also read that using the Parquet format can increase query speed if only selecting a few columns of data. What I would like to do is have Impala managed tables in the Parquet format, using partitions on time, where partitions are at least 1 GB. Another requirement is that I need to load data into the table in real time at a rate of 10's of 1,000's of records per second.

Do you mean 10,000 rows per second? What's the size of those rows, i.e. what's the rough MB/second ingest rate?

> What I am noticing is that the Impala insert times are slow and won't keep up with this rate.

A single insert statement will always generate a new file, so this is probably not what you want to do.

> Is an Impala managed table not designed for real-time ingest? Is there a better way to load data that can be done in real time?

A good approach is to stage the ingest data into a row-based format (e.g. Avro). Periodically, when enough data has been accumulated, the data would be transformed into Parquet (this could be done via Impala, for example by doing an "insert into <parquet_table> select * from staging_table"). Impala can query tables that are mixed format, so the data in the staging format would still be immediately accessible. Take a look at the Flume project, which will help with this: http://flume.apache.org/

> Should I be utilizing an Impala external table? If so, do the files loaded to HDFS need to be in Parquet format? How do I manage automatic partition creation?

There is very little distinction between external and managed. The big difference is that if it is managed, the files in HDFS are deleted if the table is dropped. The automatic partition creation is done in the ingest. For example, if you did the ingest via an insert into a date-partitioned table, Impala will automatically create partitions for new date values.

> Would performance be better if I were to create a table off data in HBase? If so, does each record need to be structured the same (same fields) for Impala to know how to access it?

Scan performance on top of HBase is currently much worse than on top of HDFS. You can take a look at this doc on how to match the schemas: http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_impala_hbase.html
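A rough sketch of the automatic partition creation mentioned above, assuming a hypothetical staging table with a string event_time column (e.g. "2014-05-01 12:00:00"):

```sql
-- Parquet table partitioned by date.
CREATE TABLE events_by_day (
  user_id BIGINT,
  payload STRING
)
PARTITIONED BY (event_date STRING)
STORED AS PARQUET;

-- Dynamic-partition insert: the partition column comes last in the
-- SELECT list, and Impala creates any partitions that do not exist yet.
INSERT INTO events_by_day PARTITION (event_date)
SELECT user_id, payload, substr(event_time, 1, 10)
FROM events_staging;
```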