Member since: 09-12-2015
Posts: 15
Kudos Received: 2
Solutions: 0
03-13-2017
01:40 AM
If you are looking for simple storage and analytics on logs: HDFS
If you are looking for low-latency reads/writes on log events: Phoenix/HBase
For cyber security: Metron + NiFi + HDFS
For searching on logs: Solr
For low-latency reads/writes and searching: HBase + Solr (using the Lily indexer)
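A minimal command-line sketch of the first and fourth options; the paths, the Solr host, and the collection name "logs" are illustrative:

# land raw logs in HDFS for cheap storage and batch analytics
hdfs dfs -mkdir -p /logs/app1/2017/03/13
hdfs dfs -put /var/log/app1/app.log /logs/app1/2017/03/13/
# if the same events are indexed in a Solr collection, search them over HTTP
curl 'http://solr-host:8983/solr/logs/select?q=level:ERROR&rows=10'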
12-22-2016
02:27 AM
3 Kudos
@vpemawat If you don't change anything about your process and logging approach (e.g. separating workloads so they don't compete for the same disks' IOPS, staggering their timing, etc.), the only option left is SSD, which will increase IOPS significantly. Even then, it is good to separate the workloads to avoid contention. One of your challenges is driven by the very high number of files written. If you used a tool like NiFi (or at least Flume) to ingest the logs, write a smaller number of output files, and spread those across log folders on dedicated drives, you could see some improvement. There is no magic bullet.
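As a rough illustration of the "fewer, larger files spread across drives" idea, here is a minimal shell sketch; the mount points /data1 and /data2, the paths, and the two-drive layout are all assumptions:

# compact many small app logs into one large daily file per drive,
# alternating across dedicated mounts to spread the write IOPS
i=0
for f in /var/log/app1/*.log; do
  dest="/data$(( i % 2 + 1 ))/logs/app1"
  mkdir -p "$dest"
  cat "$f" >> "$dest/app1-$(date +%Y%m%d).log"
  i=$(( i + 1 ))
done

In practice, NiFi's MergeContent processor (or Flume's rolling file sinks) does this batching continuously as events arrive, which is why those tools help here.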
10-18-2016
03:36 PM
3 Kudos
@vpemawat If you are not using log4j: if you want to delete the files for good, there are not many options beyond rm -rf; however, there are a few tweaks you can make to speed it up. You can run multiple rm commands in parallel (multiple processes). To do this, you need to be able to logically separate the log files, either by folder or by name format. Once you have done that, you can run multiple rm commands in the background, something like the following:

# write each command's output to its own file so the two
# concurrent runs don't clobber the same nohup.out
nohup rm -fr app1-2016* > /tmp/rm-2016.out 2>&1 &
nohup rm -fr app1-2015* > /tmp/rm-2015.out 2>&1 &

If you are using log4j: you should probably be using 'DailyRollingFileAppender' with 'maxBackupIndex'; this caps the number of backup log files that are kept and purges the older ones. More details here: http://www.codeproject.com/Articles/81462/DailyRollingFileAppender-with-maxBackupIndex

Outside of this, you should consider the following two things for future use cases:
- Organize the logs by folder (normally broken down like /logs/appname/yyyy/mm/dd/hh/<log files>)
- Have a mechanism that either deletes old log files or archives them to a different log archive server
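A minimal sketch of such a cleanup mechanism; the paths and retention windows are illustrative, and gzip keeps the original file's timestamp, so the age checks still apply to the archives:

# archive logs untouched for more than 7 days, drop archives older than 30
find /logs/app1 -name '*.log' -mtime +7 -exec gzip {} \;
find /logs/app1 -name '*.log.gz' -mtime +30 -delete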
Hopefully this helps. If it does, please 'accept' and 'upvote' the answer. Thank you!!
08-20-2016
12:50 PM
2 Kudos
@vpemawat Yes. HipChat me and I'll explain. The question is loaded, and I'd like to be able to give you good help with your design exercise. I have a few starter questions; if they are too much to answer here, especially since you were already satisfied with an answer, we can discuss them in the same HipChat. I'd like to learn how it met your requirements and whether I can help you with anything.

1. What is "huge data" for MySQL? What is the current size, and what is the daily growth?
2. How long did it take, after the MySQL solution was put in place, to realize that it would not scale? What has the rate of growth been since then? Something must have driven the choice of MySQL in the first place, and probably the conditions changed. What is the change in conditions? Why was MySQL chosen to store blobs in the first place? What kind of blobs?
3. About "to scale": is it that you have to query more data while preserving the current concurrency and response time, or do you want everything to be better: more data, higher concurrency, and lower response time? How did the SLA change for your customer to want all of this? What is the new use case that was not accounted for by the original design that used MySQL? Usually I would think the challenge is data growth, but it seems the expectation is that replacing MySQL with something else must also improve response time.
4. How long does a query take now? To measure the success of a better solution, a reference baseline is good to have.
5. The three-week data is often queried; how is it stored, and what was done to address those challenges today? For the rest of the queries (10%) that go beyond three weeks, is the expected response time similar? What concurrency is needed for those 90% and 10%, respectively?
6. Could you share a bit about the infrastructure currently used? I need to understand how it is set up to still satisfy the requirements until it is replaced. I assume the business is still running; how does it manage? What mitigation keeps MySQL running?
7. Could you share a bit about the data access security requirements, in transit and at rest?
8. Could you explain how the blob columns are currently used by the queries? Are they just retrieved whole, or do you do more with them in the query?
9. What is an example of the WHERE clause on those 90% of queries?

I asked these sample questions with a goal: to understand the thinking process behind the initial choice, the change in conditions driving the new requirements, and the match to one technology or another from the list of technologies that are very popular these days in big data. Some of the responses would help recommend, for example, HBase, Hive, Solr, or HDFS. I went into so much detail because you mentioned "design" and not "please help me find a 10,000 ft view of big data technology." That's how I read your question, but based on the accepted answer, you were actually looking for that 10,000 ft view.