Hello, I am completely new to HDP. We have a few use cases and are working out the right topology.
One of the use cases is to: read a large file (10 GB, fixed-length flat files) > load it into HDFS or a NoSQL database (HBase/Cassandra) > build a representation layer on the stored data by querying it.
Any ideas or solutions are highly appreciated.
Best Regards, Andy
This is a common Big Data ETL use case.
You can load the data into HDFS and access it using HiveQL (Hive), and you can also connect from MS Excel, Tableau, etc. via JDBC/ODBC.
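A minimal sketch of that flow, assuming a file has already been copied into HDFS; the table name, HDFS path, and column are illustrative, not from any real deployment:

```sql
-- Expose a flat file sitting in HDFS as a Hive external table.
-- Each record lands in a single STRING column for now.
CREATE EXTERNAL TABLE raw_records (line STRING)
LOCATION '/data/incoming/flatfiles';

-- Query it with HiveQL; the same table is reachable from
-- Excel/Tableau through the Hive JDBC/ODBC driver.
SELECT COUNT(*) FROM raw_records;
```

An external table only registers metadata over the existing files, so nothing is copied or moved during the CREATE.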
Based on your requirements, if you are looking for low-latency access, and especially if it is time-series data, you can definitely go with HBase.
Avoma, thank you for the quick reply.
I forgot to mention that we need to transform only a part of the data from the file before loading it into HDFS. Should we go for MapReduce?
And why not HBase for low latency?
@Andy, there are a lot of things to consider.
Hive offers various storage and querying options that let you store a table in different formats, and it supports querying the data in SQL over ODBC/JDBC. Thus, it can integrate with almost any third-party or open-source visualization tool.
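As an example of one of those storage options, here is a hedged sketch of a columnar ORC table; the table name, columns, and compression setting are purely illustrative:

```sql
-- Store the transformed data in a columnar, compressed format.
-- ORC is one of several formats Hive supports (others include
-- Parquet, Avro, and plain text).
CREATE TABLE records_orc (
  customer_id STRING,
  amount      DECIMAL(12,2),
  event_date  STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');
```

Columnar formats like ORC generally speed up analytic queries that touch only a few columns of a wide table, which matters at the 100-million-row scale mentioned later in this thread.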
That is a very loaded question: a function with many variables. More details would help narrow down the solution.
In addition to everything already said in this thread, especially by @srai, it is important to know whether you need low latency or near real time, whether you want to use SQL, the level of concurrency, how you plan to represent the data visually, whether you plan to do aggregations, etc. All of these influence your representation layer.
If you are big on SQL, then you just PUT your files to HDFS, use Hive to transform and store the data in Hive internal or external tables, and use Hive to query. With the Tez engine or LLAP (new), your response time can range from sub-second to tens of seconds for reasonable user queries; a batch job would take longer than that to complete. If your data model fits a columnar layout and you plan to build a low-latency dashboard over a recent timeframe, then HBase can be your database of choice. You could still use SQL via Phoenix, which acts as a JDBC driver between your front end and HBase.
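Since a partial transform was mentioned earlier, the "PUT then transform with Hive" step can also cover parsing the fixed-length records, without writing MapReduce by hand. A hedged sketch, assuming a raw external table raw_records(line STRING) over the uploaded file and an ORC target table records_orc; the field offsets and column names below are invented for illustration:

```sql
-- Parse fixed-width fields with substr() while loading into the
-- transformed table. Hive compiles this into distributed jobs
-- (MR/Tez) under the hood, so no custom MapReduce code is needed.
-- Offsets (1-10, 11-12, 23-10) are hypothetical; use your layout.
INSERT INTO TABLE records_orc
SELECT substr(line, 1, 10)                          AS customer_id,
       CAST(substr(line, 11, 12) AS DECIMAL(12,2))  AS amount,
       substr(line, 23, 10)                         AS event_date
FROM raw_records;
```

If only part of each record needs transformation, you can apply functions to just those fields and pass the rest through unchanged in the same SELECT.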
Thank you all. @Andrew, how HDP should receive the files is still under discussion. Ingest the files with Kafka/NiFi and transform and load them into the database with Storm/Spark? Or load the files into HDFS directly?
- Max capacity of data to be queried at one time: 100 million rows
- Data loading intervals: twice daily (avg load time 20 mins)
- Presentation layer: web interface built with web services and Java/JDBC
Low latency has higher priority than near real time. In the current setup, views (on aggregated/hierarchical data) built in an RDBMS are used to present the data. Queries on those DB views currently take up to 15 minutes to retrieve data, which is what triggered us to look for alternatives.
Can Cassandra replace HBase for our requirements?