We are looking at Impala for our online real-time reporting platform, which involves both heavy read and write operations.
We have some dimension tables that are updated in real time, and we join them with our fact tables to display aggregated data to our clients. I was reading about Impala for real-time analytics and found it interesting. But everybody talks about its fast read capability, and nobody has given me any insight into its write capability, so I am a bit curious whether it will work for our use case (where the data in our dimension tables keeps changing during the day).
Have you already got the pipeline working where the real-time data is loaded into HDFS? Impala can recognize when new data files are added to an existing table: you issue a `REFRESH <table_name>` statement and Impala re-checks the data files within the table's directory. You can put the files into HDFS by any means; it doesn't need to be an Impala INSERT statement.
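For example, after dropping a new file into the table's HDFS directory, picking it up looks roughly like this (the table name and paths here are made up for illustration):

```
-- Load a new data file into the table's HDFS directory by any means,
-- e.g. from the command line:
--   hdfs dfs -put new_batch.csv /user/hive/warehouse/dim_customer/

-- Then, in impala-shell, tell Impala to re-scan that table's files:
REFRESH dim_customer;

-- Subsequent queries will see the newly added rows:
SELECT COUNT(*) FROM dim_customer;
```

Note that `REFRESH` only re-scans one table's file listing; it's much cheaper than invalidating the whole metadata cache.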
But the append-only nature of HDFS makes me wonder about write performance in your case. Impala works best (i.e. fastest) with a small number of large (multi-megabyte) files to read. If the data arrives in small pieces and you end up with many small files, that's not likely to perform well on the query side.
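One common way people work around the small-files problem is to land the real-time data in a staging table and periodically compact it into the queried table with an `INSERT OVERWRITE ... SELECT`, which rewrites the data as fewer, larger files. A sketch (both table names are hypothetical):

```
-- 'dim_customer_staging' accumulates the small real-time files;
-- 'dim_customer' is the larger, compacted table that queries hit.
INSERT OVERWRITE dim_customer
SELECT * FROM dim_customer_staging;

-- After the compaction succeeds, clear the staging table so the
-- next batch of small files starts fresh.
```

Whether that compaction cadence is fast enough for dimensions that change all day is the key question for your use case.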