I have files of several GB each, and I convert them into smaller chunks. Even after splitting, the smallest chunk is 5 to 10 MB. I don't think it is a good idea to post objects of 5 to 10 MB to a Kafka server. How can I split a 5 MB file further into chunks of a few kilobytes, push them to the Kafka server, and later recreate the file on the consumer side and process it? Is there a better way to process big files?
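For reference, Kafka's default maximum record size is roughly 1 MB (broker `message.max.bytes` / producer `max.request.size`), which is one reason 5-10 MB objects are awkward. If you did want to chunk a file and reassemble it on the consumer side, the split/join logic itself can be sketched in plain Java; the Kafka producer/consumer calls are omitted, and the chunk size is illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class FileChunker {
    // Split a byte array into fixed-size chunks (last chunk may be shorter)
    static List<byte[]> split(byte[] data, int chunkSize) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            byte[] chunk = new byte[len];
            System.arraycopy(data, off, chunk, 0, len);
            chunks.add(chunk);
        }
        return chunks;
    }

    // Reassemble the chunks in order on the consumer side
    static byte[] join(List<byte[]> chunks) {
        int total = chunks.stream().mapToInt(c -> c.length).sum();
        byte[] out = new byte[total];
        int off = 0;
        for (byte[] c : chunks) {
            System.arraycopy(c, 0, out, off, c.length);
            off += c.length;
        }
        return out;
    }
}
```

In practice each chunk would be sent as one Kafka record keyed by something like (fileId, sequenceNumber) so the consumer can reorder and detect gaps, plus an end-of-file marker or total chunk count. That bookkeeping is exactly the complexity the answers suggest avoiding.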
Some details are missing in your post, but as a general answer: if you want to do batch processing of huge files, Kafka is the wrong tool to use. Kafka's strength is managing STREAMING data.
Based on your description, I assume your use case is bringing huge files into HDFS and processing them afterwards. For that I wouldn't split the files at all, just upload them as a whole (e.g. via WebHDFS). Then you can use tools like Hive/Tez, Spark, ... to process your data (whatever you mean by "process": clean/filter/aggregate/merge/... or, at the end, "analyze" in an SQL-like manner).
Thanks much for your valuable reply. My requirement is to process huge files which are transported from an upstream system. As of today we use Spring Batch to split the files into smaller ones, batch-process them, and store them in an Oracle DB, each smaller file in a row as a compressed BLOB (for persistence). We poll the DB, pull the files, and process each file with our application; the processed/transformed files are fed into ActiveMQ for further processing. We face a lot of queue-related issues, like persisting the objects in the queue, restarting the queue on system failures, and then reprocessing everything regardless of whether it was already processed. That is why I thought of bringing in Kafka, which you say does not suit my need. As of today we have a pipeline of queues at three levels (Process 1 -> Queue -> Process 2 -> Queue -> Process 3 -> Final Product). If, as you said, the viable solution is HDFS/Spark/Hive, what about the different transformation stages, how do we handle those?
Thanks again for your valuable suggestion.
You can keep your pipeline if you need to and write the intermediate output (after each processing step) either with Spark into HDFS again, or with Hive into another table.
From what you are describing, it sounds like a huge (and needless) overhead to split your huge files, put them into an RDBMS, grab them from there into AMQ and process them from there... that is way too expensive/complicated/error-prone.
Just upload your huge files to HDFS and e.g. create a directory structure which reflects your processing pipeline, like
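One possible layout (the directory names are purely illustrative):

```
/pipeline/raw/          <- huge input files, uploaded as-is
/pipeline/stage1/out/   <- output of Process 1
/pipeline/stage2/out/   <- output of Process 2
/pipeline/stage3/out/   <- final product
```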
...and put the output of each processing step into the matching directory.
Thanks much for your valuable inputs.
I can store the files directly in HDFS as you said. Say I use Apache Spark for processing the files (we already have our application in Java for processing them): can we integrate our existing Java application with Spark? I believe that is very hard and needs a huge code change. Any high-level suggestions on how to integrate this?
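For what it's worth, the usual pattern is not to rewrite the application but to make the per-record transformation logic serializable and call it from inside a Spark `map`. A minimal sketch, where `LegacyProcessor` is a hypothetical stand-in for your existing code and the Spark driver calls are shown as comments (they need spark-core on the classpath and a cluster or local master):

```java
import java.io.Serializable;

// Hypothetical stand-in for the existing per-record Java logic.
// Implementing Serializable lets Spark ship it to executor JVMs unchanged.
class LegacyProcessor implements Serializable {
    String transform(String record) {
        // ...your existing transformation logic goes here (placeholder)...
        return record.trim().toUpperCase();
    }
}

public class SparkIntegrationSketch {
    public static void main(String[] args) {
        LegacyProcessor processor = new LegacyProcessor();

        // Spark driver code (sketch):
        // JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("stage1"));
        // JavaRDD<String> input = sc.textFile("hdfs:///pipeline/stage1/in");
        // JavaRDD<String> output = input.map(processor::transform);
        // output.saveAsTextFile("hdfs:///pipeline/stage1/out");

        // Run locally on one record just to show the call shape:
        System.out.println(processor.transform("  sample record  "));
    }
}
```

The main migration cost is usually state: logic that assumes local files or shared mutable objects has to be reworked, but pure record-in/record-out transformations typically port with little change.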
Sathiyanarayana kumar. N