Load files to Hadoop

New Contributor

Hello, I am completely new to HDP. We have a few use cases and are working out the right topology.

One of the use cases is to:

  • Read a large file (10 GB, fixed-length flat files)
  • Load it into HDFS or a NoSQL database (HBase/Cassandra)
  • Build a representation layer on the stored data by querying it

Any ideas or solutions are highly appreciated.

Best Regards, Andy

9 REPLIES
Re: Load files to Hadoop

Contributor

This is a common big data ETL use case.

You can load the data into HDFS and access it using HQL (Hive). You can also connect from MS Excel, Tableau, etc. via JDBC/ODBC.

Ref:

https://azure.microsoft.com/en-us/documentation/articles/hdinsight-connect-excel-hive-odbc-driver/

Depending on your requirements, if you are looking for low-latency access and it is time-series data, you can definitely go with HBase.

Re: Load files to Hadoop

New Contributor

Avoma, thank you for the quick reply.

I forgot to mention that we need to transform only a part of the data from the file before loading it into HDFS. Should we go for MapReduce?

And why not HBase for low latency?

Regards,

Andy
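
The partial transformation of fixed-length records asked about above can be sketched in a few lines, independent of whether it later runs in MapReduce, Spark, or NiFi. This is only a minimal illustration: the record layout (`customer_id`, `event_date`, `amount`), the offsets, and the date reformatting are hypothetical, since the thread does not describe the actual file format.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical layout: (field name, start offset, end offset) within each record.
LAYOUT: List[Tuple[str, int, int]] = [
    ("customer_id", 0, 10),
    ("event_date", 10, 18),
    ("amount", 18, 30),
]


def parse_record(line: str) -> Dict[str, str]:
    """Slice one fixed-length record into named, whitespace-trimmed fields."""
    return {name: line[start:end].strip() for name, start, end in LAYOUT}


def transform(record: Dict[str, str]) -> Dict[str, str]:
    """Transform only part of the record: reformat the date, keep the rest."""
    out = dict(record)
    d = record["event_date"]  # assumed to be YYYYMMDD
    out["event_date"] = f"{d[0:4]}-{d[4:6]}-{d[6:8]}"
    return out


def to_delimited(lines: Iterable[str], sep: str = "\t") -> Iterator[str]:
    """Yield one delimited row per record, streaming line by line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        rec = transform(parse_record(line))
        yield sep.join(rec[name] for name, _, _ in LAYOUT)
```

Because `to_delimited` is a generator, passing it an open file handle processes a 10 GB file one record at a time rather than loading it into memory; the delimited output can then be copied to HDFS.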

Re: Load files to Hadoop

How do the files arrive at HDP? Take a look at Apache NiFi (and HDF) for managing your data movement too.

Re: Load files to Hadoop

Guru

@Andy, there are a lot of things to consider.

  • What is the maximum volume of data to be queried at one time?
  • Data loading intervals (how much fresh data will be generated and loaded, and how often)
  • What are the possible presentation layers?

Hive offers various storage and querying options: it can store tables in different formats and lets you query the data in SQL over ODBC/JDBC. It can therefore integrate with any third-party or open-source visualization tool.

Re: Load files to Hadoop

@Andy

This is a very loaded question, a function with many variables. More details would help narrow down the solution.

In addition to everything that has already been said in this thread, especially by @srai, it is important to know whether you need low latency or near real-time processing, whether you want to use SQL, what level of concurrency you expect, how you plan to represent the data visually, whether you plan to do aggregations, and so on. All of these influence your representation layer.

If you are big on SQL, you can simply PUT your files to HDFS, use Hive to transform the data and store it in Hive internal or external tables, and then use Hive to query it. With the Tez engine or LLAP (new), your response time can range from sub-second to tens of seconds for reasonable user queries. A batch job would take longer than that to complete.

If your data model fits a columnar layout and you plan to build a low-latency dashboard for a recent timeframe, then HBase can be your database of choice. You could still use SQL via Phoenix, which acts like a JDBC driver between your front end and HBase.
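
For the Hive route, one possible way to expose fixed-length files that already sit in HDFS is an external table backed by Hive's `RegexSerDe`, with one capture group per field. The sketch below only generates the DDL text; the table name, column names, field widths, and HDFS path used in the test data are hypothetical, and `RegexSerDe` is just one option (it reads every column as a string, so type casting would happen in a later Hive query).

```python
from typing import List, Tuple


def fixed_width_regex(widths: List[int]) -> str:
    """Build a regex with one capture group per fixed-width field."""
    return "".join(f"(.{{{w}}})" for w in widths)


def hive_external_table_ddl(
    table: str, columns: List[Tuple[str, int]], location: str
) -> str:
    """Return a CREATE EXTERNAL TABLE statement for fixed-width text files.

    columns is a list of (column name, field width) pairs; location is the
    HDFS directory that holds the raw files.
    """
    cols = ",\n".join(f"  `{name}` STRING" for name, _ in columns)
    regex = fixed_width_regex([width for _, width in columns])
    return (
        f"CREATE EXTERNAL TABLE {table} (\n{cols}\n)\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'\n"
        f'WITH SERDEPROPERTIES ("input.regex" = "{regex}")\n'
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )
```

The generated statement can be run from any Hive client. Because the table is external, dropping it leaves the files in HDFS untouched, and a follow-up `INSERT ... SELECT` into a columnar table such as ORC is a common way to make the data faster to query.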

Re: Load files to Hadoop

New Contributor

Thank you all. @Andrew, how HDP should receive the files is still under discussion: process the files with Kafka/NiFi and transform and load them into the database with Storm/Spark, or load the files to HDFS directly?

@srai

  • Maximum volume of data to be queried at one time: 100 million rows
  • Data loading intervals: twice daily (average load time 20 minutes)
  • Presentation layer: web interface built with web services and Java/JDBC

@Constantin,

Low latency has priority over near real-time. Currently, views (on aggregated/hierarchical data) built in an RDBMS are used to represent the data. Queries on these views take up to 15 minutes to retrieve data from the RDBMS, which prompted us to look for alternatives.

Can Cassandra replace HBase for our requirement?

Rgds,

Andy

Re: Load files to Hadoop

Super Guru

Yes, HBase replaces Cassandra.
