Member since: 05-10-2017
Posts: 10
Kudos Received: 1
Solutions: 0
04-10-2019
02:27 AM
When I hear this use case, the first thing I think of is NiFi. Many solutions are possible; one that may be relevant for you:
1. Ingest the zip file (e.g. the GetFile processor)
2. Unzip the file (UnpackContent or CompressContent processor)
3. Route based on the unpacked file size (RouteOnAttribute processor)
4. Write the files to HDFS: put large files there directly and append the smaller ones. You can also write directly to Hive or Kudu.
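Purely as an illustration of the routing idea (not the NiFi flow itself), here is a minimal Python sketch that does the same thing by hand. The 128 MB threshold, the local paths, and the HDFS targets are assumptions you would replace with your own; it only assumes the hdfs CLI is on the path.

# Rough sketch of the unzip / route-by-size / write-to-HDFS logic, outside NiFi.
import os
import subprocess
import zipfile

SIZE_THRESHOLD = 128 * 1024 * 1024          # assumed cutoff between "large" and "small" files
ZIP_PATH = "/data/incoming/batch.zip"       # hypothetical input archive
WORK_DIR = "/data/unpacked"                 # hypothetical local staging directory
HDFS_LARGE_DIR = "/warehouse/raw/large"     # hypothetical HDFS target for large files
HDFS_SMALL_FILE = "/warehouse/raw/small/merged.dat"  # hypothetical append target for small files

# 1. Unzip the archive (NiFi: UnpackContent / CompressContent)
with zipfile.ZipFile(ZIP_PATH) as zf:
    zf.extractall(WORK_DIR)

# 2. Route each unpacked file by size (NiFi: RouteOnAttribute)
for name in os.listdir(WORK_DIR):
    path = os.path.join(WORK_DIR, name)
    if os.path.getsize(path) >= SIZE_THRESHOLD:
        # 3a. Large files: put them into HDFS as-is
        subprocess.run(["hdfs", "dfs", "-put", path, HDFS_LARGE_DIR], check=True)
    else:
        # 3b. Small files: append to a single HDFS file to avoid the small-files problem
        subprocess.run(["hdfs", "dfs", "-appendToFile", path, HDFS_SMALL_FILE], check=True)

In NiFi the same size-based decision would live in the RouteOnAttribute processor, with PutHDFS (or a Hive/Kudu processor) handling the write.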
06-07-2018
06:02 PM
Hey @Gaurav Gupta! In my humble opinion, it is better to ask for zipped files because the transfer will be faster; usually the network cost is bigger than the CPU cost of unzipping. For the small files, you can try the following parameters in your Hive queries: hive.merge.mapfiles, hive.merge.mapredfiles, hive.merge.size.per.task, hive.merge.smallfiles.avgsize. BTW, if it's possible for the tables with small files, try to avoid partitions; they break your data into even smaller chunks. And instead of hdfs dfs -put, you could try Flume with a spooldir source: the Flume agent will grab any new file that lands in a directory and put it into HDFS or whatever sink you configure. Hope this helps!
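As a hedged sketch of how those merge settings are used, the snippet below sets them for one session through PyHive before an INSERT. PyHive, the host name, the table names, and the exact values are assumptions rather than anything from this thread; the same SET statements can equally be run from beeline or set cluster-wide in hive-site.xml.

# Session-level sketch: ask Hive to compact small output files (values are assumptions).
from pyhive import hive

conn = hive.connect(host="hive-server2.example.com", port=10000, username="etl")
cursor = conn.cursor()

cursor.execute("SET hive.merge.mapfiles=true")                # merge small files from map-only jobs
cursor.execute("SET hive.merge.mapredfiles=true")             # merge small files from map-reduce jobs
cursor.execute("SET hive.merge.size.per.task=256000000")      # target size of merged files (~256 MB, assumed)
cursor.execute("SET hive.merge.smallfiles.avgsize=16000000")  # trigger merge when avg output file is small (~16 MB, assumed)

# Any subsequent INSERT ... SELECT in this session will compact its output files.
cursor.execute("INSERT OVERWRITE TABLE target_table SELECT * FROM staging_table")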
05-11-2017
09:09 PM
Thanks, Aver, for replying. Initially the data would be dumped into HDFS, and after processing it would go into HBase (which I am assuming to be less than 2 TB).