I'm working with millions of separate CSV files. Every 5 minutes I receive 4 CSV files, and I developed a Spark job that transforms these 4 files into MongoDB documents (the job runs every 5 minutes). I'm using Zeppelin for data discovery and exploration tasks, based on the Spark interpreter and the MongoDB Spark connector, and it works well.

The problem is that with 10 days of data in the MongoDB collection (I purge the oldest day and add the current one), and with 48 GB of RAM and 12 CPUs, it's slow. I actually want 30 days of history, which would be impossible with this setup, so I'm considering replacing MongoDB and storing the result of transforming the CSV files in HDFS in JSON format.

I don't know whether this solution will give me better performance (speed and memory). Any suggestions, please?