
Huge CSV import to Cassandra


I need to read a huge CSV file from the user, containing historian sensor data. I can't upload the CSV into my JEE web application over HTTP, because the file can be up to 200 GB.

Format:

sensor_name,timestamp,value
sensor1,timestamp1,value1
sensor1,timestamp2,value2
sensor2,timestamp1,value1

Once the user uploads the CSV, I need to display the unique values from the first column so the user can map an existing sensor (keyspace.table.pk1) to a sensor from the CSV (sensor1). Then I need to import the timestamp and value rows for sensor1 into keyspace.table.pk1.

I tried using NiFi but got stuck. How can I notify the user that the reading is done, so that they can start mapping?

How can I implement this feature? Specifically:

Shall I use Spark to calculate the unique values? Where should I write the output?
How do I notify the user when it's done?
How do I transfer the file from the client app?
How do I trigger the Spark job: every time the user uploads a file, or as a cron job?
What happens on failure (do we retry, etc.)?
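Whatever engine ends up running it, the unique-sensor extraction itself is a single streaming pass over the first column. A minimal sketch in plain Java (class and method names are mine, and it assumes the three-column format shown above); for a 200 GB file you would express the same logic as a Spark `distinct()` on the first column, but the per-row work is identical:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;

public class UniqueSensors {

    // Stream the CSV once and collect the distinct values of the first column.
    // Assumes the format from the question: sensor_name,timestamp,value
    static Set<String> uniqueSensors(Reader csv) throws IOException {
        Set<String> names = new LinkedHashSet<>();
        try (BufferedReader reader = new BufferedReader(csv)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isBlank() || line.startsWith("sensor_name")) {
                    continue; // skip blank lines and the header row
                }
                int comma = line.indexOf(',');
                if (comma > 0) {
                    names.add(line.substring(0, comma));
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        String sample = "sensor_name,timestamp,value\n"
                + "sensor1,timestamp1,value1\n"
                + "sensor1,timestamp2,value2\n"
                + "sensor2,timestamp1,value1\n";
        System.out.println(uniqueSensors(new StringReader(sample)));
        // prints [sensor1, sensor2]
    }
}
```

Because the set of sensor names is tiny even when the file is huge, the result fits in memory and can be written anywhere the web app can read it (a small table, a JSON file) to drive the mapping UI.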

1 ACCEPTED SOLUTION

Super Collaborator

I am assuming you are using the Hortonworks Data Platform. Create a Hive external table pointing to an HDFS location for these CSV files. Once a file has completely transferred to your server, move it to that HDFS location using an inotify-based tool such as incron (incron triggers as soon as a file is written into the watched directory). You can then read the data with Hive SELECT statements from your Java program via JDBC or REST.
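To make the external-table half of this concrete, here is a hypothetical sketch; the table name, column names, and HDFS path are placeholders, not taken from the thread:

```sql
-- Hypothetical Hive external table over the HDFS landing directory.
CREATE EXTERNAL TABLE IF NOT EXISTS sensor_staging (
  sensor_name STRING,
  ts          STRING,
  value       STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/sensor_csv/'
TBLPROPERTIES ('skip.header.line.count' = '1');

-- The unique first-column values the user maps against:
SELECT DISTINCT sensor_name FROM sensor_staging;
```

And the incron side could be a single incrontab line (paths again illustrative; `$@` expands to the watched directory and `$#` to the file name):

```
/data/landing IN_CLOSE_WRITE hdfs dfs -put $@/$# /data/sensor_csv/
```

Using IN_CLOSE_WRITE rather than IN_CREATE ensures the file is only moved after the transfer has finished writing it.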

