Can data marts be implemented in Cloudera Impala? Can I get any documentation related to it?
Also, I need to access a few Teradata tables through Cloudera. Can I access those Teradata tables directly, or do I have to bring the data into CDH, since Cloudera Impala can only read data from CDH?
To your first question - Impala itself is a query engine. Other Hadoop components will be necessary to implement a complete data mart: HDFS for storage, Sqoop or Flume for data ingest, Oozie for managing workflows, etc.
We have a lot of documentation for all those components. Anything in particular you are looking for? Use-cases? White papers? Books?
Regarding Teradata and Impala - Impala cannot query data stored in Teradata directly. We typically export the data from Teradata and load it to Hadoop using Sqoop and our Sqoop-for-Teradata connectors. You can find the connectors and the documentation here:
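As a rough sketch (the host, credentials, database, table, and target directory below are placeholders, and the Cloudera Connector for Teradata has its own options that differ from the generic JDBC path shown here), a Sqoop import from Teradata into HDFS looks something like:

```shell
# Import a Teradata table into HDFS via Sqoop over JDBC
# (placeholder host/credentials/table; -P prompts for password)
sqoop import \
  --connect jdbc:teradata://teradata-host/DATABASE=sales_db \
  --driver com.teradata.jdbc.TeraDriver \
  --username etl_user -P \
  --table CUSTOMER_DIM \
  --target-dir /user/etl/customer_dim \
  --num-mappers 4
```

Once the data lands in HDFS, an Impala table can be defined over it and queried directly.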
Thanks for your answer. I have some questions in response to your answer.
Can we build tables in Impala on which reporting can be done? Is that a supported/suggested approach, or can Impala only be used for ad hoc queries to get data?
Following are some other questions I have about Impala -
1) How much data can be handled or supported by Impala effectively? Can we quantify the limit (whether it's 10 GB or 50 GB of data)?
2) How does Impala address concurrency? If a reporting tool sits on top of it, how many reports or users can be supported at a given time?
3) How secure is Impala? How does it provide security? Can it control user access?
4) Can Impala sit on top of HDFS and directly interact with it?
Can you suggest any particular e-book or documentation that would help?
> Can we build tables in Impala on which reporting can be done? Is that a supported/suggested approach, or can Impala only be used for ad hoc queries to get data?
Impala can be used to create tables, and building reporting tables on it is a supported approach.
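For example (the table and column names here are made up), a reporting table can be created and populated with standard Impala SQL, using Parquet storage for fast analytic scans:

```sql
-- Hypothetical reporting table built from a raw fact table;
-- CREATE TABLE ... AS SELECT creates and populates it in one step
CREATE TABLE sales_report
STORED AS PARQUET
AS
SELECT region, product, SUM(amount) AS total_sales
FROM raw_sales
GROUP BY region, product;
```

A BI tool can then query `sales_report` over ODBC/JDBC like any other table.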
1) As far as I know, there's no real limit. We and our customers have used Impala on clusters with over 100 nodes, and on tables with over 10 TB of data. 10-50 GB is not a problem and not even close to the limits.
2) It supports multiple users running reports at the same time. How many depends on the reports, the size of the cluster (number of nodes, CPUs, disks, amount of memory), and also on whether the cluster is doing other things at the same time (MapReduce, HBase...).
3) Impala is integrated with Hadoop security, which is generally done using Kerberos. Kerberos allows authenticating users against Active Directory or LDAP.
On top of that we added Sentry, which gives the administrator the ability to control access to tables using groups and roles. For example: an admin may be able to read and write all tables, a BI reporting user can only read data, and another user can only read and write a few specific tables. It's very similar to the ability to GRANT privileges in other databases.
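As a sketch (the role, group, and table names are hypothetical, and this requires Sentry to be enabled and the statements to be run by an admin), the role-based grants look much like SQL GRANTs in other databases:

```sql
-- Hypothetical read-only role for a BI group
CREATE ROLE bi_reader;
GRANT ROLE bi_reader TO GROUP analysts;
GRANT SELECT ON TABLE sales_db.sales_report TO ROLE bi_reader;
```

Members of the `analysts` group can then SELECT from that table but cannot INSERT into it or modify it.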
4) Impala sits on top of HDFS and interacts with it directly. That's exactly what Impala does and how it works.
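To illustrate (the path and schema below are made up), an external table lets Impala query files already sitting in an HDFS directory, with no separate load step:

```sql
-- Hypothetical external table over raw tab-delimited files in HDFS;
-- Impala scans the files under LOCATION directly
CREATE EXTERNAL TABLE web_logs (
  ts     STRING,
  url    STRING,
  status INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/etl/web_logs';
```

Dropping an external table removes only the metadata; the files in HDFS stay where they are.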
Good reading on Impala (and Sentry):