Creating a Data Pipeline

Here is the problem scenario:

We receive CSV files from a legacy system: about 1 TB of data, consumed every hour.
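
To put that volume in perspective, here is a rough back-of-the-envelope calculation of the sustained ingest rate. It is only a sketch: it assumes decimal units (1 TB = 10^12 bytes) and an evenly spread load, neither of which is confirmed above, and the object and member names are made up for illustration.

```scala
object IngestRate {
  // Assumption: 1 TB means 10^12 bytes (decimal), not 2^40 bytes.
  val bytesPerBatch: Double = 1e12

  // One batch is consumed every hour.
  val secondsPerBatch: Double = 3600.0

  // Average throughput the pipeline must sustain, in megabytes per second.
  def mbPerSecond: Double = bytesPerBatch / secondsPerBatch / 1e6

  def main(args: Array[String]): Unit =
    println(s"Sustained ingest rate: $mbPerSecond MB/s")
}
```

So any storage system we pick has to absorb a few hundred megabytes per second on average, in addition to serving the reads.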

We are in the process of creating a data model to store this data.

This data model needs to support millions of reads per second.

Question 1: Which persistent storage system can we use to store this data?

We plan to use a Spark job written in Scala that takes the CSV files and stores them in the storage system we choose.
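
For context, the kind of job we have in mind looks roughly like the sketch below. This is a minimal outline, not a working design: the object name, the two command-line arguments, and the assumption that the CSV files have a header row are all hypothetical, and Parquet is used only as a stand-in sink because the storage system has not been chosen yet.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object CsvIngestJob {
  def main(args: Array[String]): Unit = {
    // Hypothetical arguments: where the hourly CSV drop lands and where to write it.
    require(args.length == 2, "usage: CsvIngestJob <inputPath> <outputPath>")
    val inputPath = args(0)
    val outputPath = args(1)

    val spark = SparkSession.builder().appName("CsvIngestJob").getOrCreate()
    try {
      // Assumption: the legacy CSV files carry a header row.
      val df = spark.read.option("header", "true").csv(inputPath)

      // Placeholder sink: Parquet stands in for the undecided storage system.
      df.write.mode(SaveMode.Append).parquet(outputPath)
    } finally {
      spark.stop()
    }
  }
}
```

The write step is the part we expect to change once a storage system is selected.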

Question 2: Which real-time or batch processing technologies can we use?

Many thanks!