
How to deal with small data ?



One of our customers wants to create an ODS (Operational Data Store) for near-real-time data and data at rest. All of this data currently comes from databases, and system logs will be added in the future.

The volumes are small, around 100 GB in total, but they will grow in later phases of the project.

Our questions:

- What is the right solution for this? The data is small for a cluster; we are thinking of using HBase/Drill.

- Any suggestions for this small volume?

I don't want to just say "keep it in Oracle", because we want to prepare an architecture for the future.

Thanks, regards


Re: How to deal with small data ?

I think it depends on what you plan to do with it. As you said, you want an architecture for the future.

HBase/Phoenix seems to be a natural solution if you want to run small queries (single-key lookups, updates, and small aggregations over thousands up to millions of rows).

For full aggregations, Hive will still be the best way forward: use ORC tables and enough files/buckets to make sure the data is well distributed and the full cluster is utilized during queries. If you do that, it will still scale later on. In 2017 we will get LLAP, which will significantly improve Hive's response times for short queries, so plan for an upgrade :-).
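To make the "enough files/buckets" point concrete, here is a rough sketch of how one might pick a bucket count for roughly 100 GB and generate a matching Hive DDL statement. The 512 MB target per bucket, the power-of-two rounding, and the table and column names (ods_events, event_id, payload) are all assumptions for illustration, not recommendations from this thread. ORC compression will also shrink the on-disk size, so treat the result as a starting point only.

```python
import math


def bucket_count(total_gb: float, target_bucket_mb: int = 512) -> int:
    """Return the smallest power of two that keeps each bucket at or
    below target_bucket_mb (an assumed target, tune for your cluster)."""
    needed = max(1, math.ceil(total_gb * 1024 / target_bucket_mb))
    return 1 << (needed - 1).bit_length()


def orc_table_ddl(table: str, key: str, buckets: int) -> str:
    """Build a Hive CREATE TABLE statement for a bucketed ORC table.
    The column list is hypothetical; replace it with the real schema."""
    return (
        f"CREATE TABLE {table} (\n"
        f"  {key} BIGINT,\n"
        f"  payload STRING\n"
        f")\n"
        f"CLUSTERED BY ({key}) INTO {buckets} BUCKETS\n"
        f"STORED AS ORC"
    )


if __name__ == "__main__":
    n = bucket_count(100)  # about 100 GB, as in the question
    print(orc_table_ddl("ods_events", "event_id", n))
```

Bucketing on the key you join or look up on most often is the usual choice; if the data grows a lot, the table would need to be rebuilt with a larger bucket count.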

When you say operational data, will you mostly use it for post-processing (cleaning, filtering, processing)? Then Hive will be at least a good intermediate store. I know it's not much data, but Tez is pretty good at that. However, if you need the fastest possible runtimes for small queries, I would keep a copy in Phoenix as well. You might also want to look at Spark for more advanced data mining functions if needed.

Drill might also be a solution, but it's not supported by HDP, so I do not know too much about it.

Re: How to deal with small data ?

+1: how you query the data will determine which system is best for storing and accessing it. Couldn't have put it better myself. While a traditional RDBMS might be capable of hosting your data at your current scale, the advantage is that you can use HBase/Phoenix or Hive now and have a much easier time scaling your data volumes up by 10x, 100x, or 1000x.
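For a sense of what 10x, 100x, and 1000x mean in raw disk terms, here is a small back-of-the-envelope sketch. It starts from the 100 GB in the question and assumes the HDFS default replication factor of 3; it ignores compression and temporary space, so the numbers are only rough upper-level estimates.

```python
def projected_raw_tb(base_gb: float, factor: int, replication: int = 3) -> float:
    """Raw cluster storage in TB for base_gb of data grown by `factor`,
    stored with the given replication (3 is the HDFS default)."""
    return base_gb * factor * replication / 1024


if __name__ == "__main__":
    for factor in (1, 10, 100, 1000):
        tb = projected_raw_tb(100, factor)
        print(f"{factor:>5}x -> {tb:8.2f} TB raw")
```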

Re: How to deal with small data ?

@Marco Garcia For small data sets that require big data compute flexibility, you may want to consider a cloud approach. Storing data in the cloud is cheap, and you only pay for what you use. You can start in the cloud today and move on-prem when needed (assuming you don't need to transfer a lot of data from the cloud to on-prem), or you can implement an ongoing hybrid solution. You have lots of choices and flexibility.