
Huge CSV import to Cassandra

Solved

I need to read a huge CSV file from a user containing historian sensor data. I can't upload the CSV over HTTP into my JEE web application, because the file can be up to 200 GB.

Format:

sensor_name,timestamp,value

sensor1,timestamp1,value1

sensor1,timestamp2,value2

sensor2,timestamp1,value1

Once the user uploads the CSV, I need to display the unique values from the first column so the user can map an existing sensor (keyspace.table.pk1) to a sensor from the CSV (e.g. sensor1). Then I need to import the timestamp and value rows for sensor1 into keyspace.table.pk1.
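One way to get the unique first-column values without loading the whole 200 GB file into memory is a single streaming pass. This is a minimal plain-Python sketch of the idea (Spark's `distinct()` does the same thing at scale); the function name and file path are illustrative assumptions:

```python
import csv

def unique_sensors(csv_path):
    """Stream the CSV once and collect distinct values of the first column."""
    seen = set()
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the sensor_name,timestamp,value header
        for row in reader:
            if row:  # tolerate blank lines
                seen.add(row[0])
    return sorted(seen)
```

For the sample format above, this would return `["sensor1", "sensor2"]`, which is exactly the list the mapping screen needs.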

I tried using NiFi but got stuck. How can I notify the user that the reading is done, so that they can start mapping?

How can I implement this feature?

- Shall I use Spark to calculate the unique values?
- Where should I write the output?
- How do I notify the user?
- How do I trigger the Spark job: every time the user uploads a file, or as a cron job?
- How do I transfer the file from the client app?
- What happens when there are failures (do we retry, etc.)?
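For the notification question, one common pattern (a hedged sketch, not the only option; the marker-file name is a convention borrowed from Hadoop/Spark output commits, and the function names are assumptions) is to have the batch job write a marker file next to its output when the distinct pass finishes, and have the web app poll for it:

```python
from pathlib import Path

MARKER = "_SUCCESS"  # convention borrowed from Hadoop/Spark job output

def mark_done(output_dir):
    """Called by the batch job after the distinct sensor list is written."""
    Path(output_dir, MARKER).touch()

def is_done(output_dir):
    """Polled by the web app; when True, show the user the mapping screen."""
    return Path(output_dir, MARKER).is_file()
```

The web app can poll `is_done()` on a timer (or push a notification over a WebSocket when it flips to true) instead of coupling the UI directly to the job.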

1 ACCEPTED SOLUTION

Accepted Solutions

Re: Huge CSV import to Cassandra

Expert Contributor

I am assuming you are using Hortonworks Data Platform. Once the data has been transferred to your server, move the files into an HDFS location using a cron-like tool such as incron, which triggers as soon as a file is copied into the watched directory; move each file only after it has completely transferred. Create an external Hive table pointing to that HDFS location, then read the files with SELECT statements from your Java program via JDBC, REST, or whatever you prefer.
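The "move once it has completely transferred" step above can be sketched as follows. This is a plain-Python stand-in under stated assumptions: a real deployment would push to HDFS with `hdfs dfs -put` rather than a local move, incron (or cron) would supply the trigger, and the size-settling heuristic is one simple way to detect that a transfer has finished:

```python
import os
import shutil
import time

def move_if_complete(src, dest_dir, settle_seconds=2):
    """Move a file only after its size has stopped growing,
    i.e. the upload/transfer appears to have finished."""
    size_before = os.path.getsize(src)
    time.sleep(settle_seconds)
    if os.path.getsize(src) != size_before:
        return False  # still being written; try again on the next trigger
    # In production this line would be: hdfs dfs -put <src> <hdfs dir>
    shutil.move(src, os.path.join(dest_dir, os.path.basename(src)))
    return True
```

Returning `False` instead of raising lets the cron/incron run simply retry on its next invocation, which also answers the failure-handling question: incomplete files are left in place and picked up later.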


