Created 08-18-2016 03:48 PM
I am facing issue of huge data in mysql table which is increasing very fast , so to scale what is the other alternative?
Some of the columns are type of blob as they are raw data.
So to query faster what is the right approach.
Some more background of problem :
Finally I have to query data and expose it via rest api
So somebody can help me in complete design with some explainations?
Created 08-19-2016 06:20 PM
@vpemawat I am not sure why anyone down voted your questions. make no sense to me.
On to your question. You are seeing typical challenge with RDBMS.
For ingestion I would say start with Apache NiFi. This will ingest data directory into the hortonworks platform.
Now you need to decide where to egress the data to.
hive -> great tool for unknown query patterns. Ie BI
HAWQ -> great tool for unknown query patterns. Ie BI. Highly optimized for BI
Phoenix -> great for known query patterns. SQL on hbase
HDFS -> here you can store your raw data and federate to phoenix, HAWQ, hive, etc.
I hope this help start the conversation on landing data in the platform using the right tool for the use case.
Created 08-18-2016 04:21 PM
I think this is a perfect use case to consider offloading your older data into Hadoop (HDP). Have you considered that?
Created 08-18-2016 06:15 PM
You already tagged the question as HDFS, Hive and SolR. I guess that you know the answer 🙂
Created 08-19-2016 01:10 AM
No I am not sure which one to use?
hive or solr so thats the reason I tagged.
Created 08-20-2016 12:28 AM
@Constantin Stanca did you down voted the questions? if yes why.
Created 08-19-2016 06:20 PM
@vpemawat I am not sure why anyone down voted your questions. make no sense to me.
On to your question. You are seeing typical challenge with RDBMS.
For ingestion I would say start with Apache NiFi. This will ingest data directory into the hortonworks platform.
Now you need to decide where to egress the data to.
hive -> great tool for unknown query patterns. Ie BI
HAWQ -> great tool for unknown query patterns. Ie BI. Highly optimized for BI
Phoenix -> great for known query patterns. SQL on hbase
HDFS -> here you can store your raw data and federate to phoenix, HAWQ, hive, etc.
I hope this help start the conversation on landing data in the platform using the right tool for the use case.
Created 08-20-2016 12:22 AM
@Sunile Manjee thanks a lot for detailed answer
Created 08-20-2016 01:19 AM
@vpemawat your most def welcome. Have I answered your question? if so request you accept the answer of your choosing.
Created 08-20-2016 12:50 PM
Yes. Hipchat me. I'll explain. The question is loaded and I'd like to be able to give you good help for your design exercise. I have a few starter questions which if it is too much to answer, especially since you were satisfied with an answer, we can discuss in the same HipChat. I'd like to learn how it met your requirements and if I can help you with anything.
1. What is "huge data" for MySQL? What is the current size, what is the daily growth?
2. How long it took since the MySQL solution was put in place to realize that it will not scale? What was the rate of growth since then? It must have been something that drove the choice of MySQL from the first place and probably something changed in conditions. What is the change in conditions? Why was chosen MySQL to store blob from the first place? What kind of blob?
3. About "to scale": Is it that you have to query more data preserving the concurrency and the response or you want all to be better, more data, higher concurrency, lower response time? How the SLA changed for your customer to want all these? What is the new use case which was not accounted by the original design that used MySQL? Usually, I would think that the challenge is the data growth challenge, but it seems that the expectation is that by replacing MySQL with something else, the response time needs also to be better.
4. How much time it takes now to query? To measure success of a better solution, a reference baseline is good.
5. The three-week data is often queried, how is it stored and what was done to address challenges today? About the rest of the queries (10%) going beyond three weeks, is the expected response time similar? What is the concurrency needed for those 90% and, respectively, 10%?
6. Could you share a bit about the infrastructure used currently? Need to understand how is setup to still be able to satisfy the requirements until replaced. I guess that the business is still running. How does it do it? What was the mitigation in MySQL to keep it running?
7. Could you share a about data access security requirements, in transport and at rest?
8. Could you explain how blob columns are currently used by the query? Are they just retrieved as a whole or you do more with them in the query?
9. What is an example of WHERE clause on those 90% queries?
...
I asked these sample questions with a goal: to understand the thinking process for the initial choice, changing in conditions and driver for new requirements, matching to one technology or other from the list of technologies that are very popular these days in big data. Some of the responses would help to recommend, for example, HBase, Hive, SolR, HDFS etc. I went in so many details because you mentioned "design" and not "please help to find at 10,000 ft view big data technology". That't how I read your question, but based on the accepted answer you were actually looking for that 10,000 ft.