Appreciate your response to the following problem.
1. I have got 2 zip files (each of size 1 GB) stored at a directory on windows server.
2. Each zip file has 42 .dat files(5 files of 3-9 GB size range, 3 files of more than 128 MB (more than block size), rest of the files are in KB). These files have pipe delimited records that are actually table dump of sql server.
1 file has 1 table dump.
3. Incremental load is on daily basis. The files can be pushed to a SFTP location too.
4. I need to create an automated data ingestion pipeline for the same. Can anyone suggest the best way to do this?
Solution Approach (I could think of):
1. Get files(zipped) pushed to a SFTP server. Run a python script to pull and store on Edge node.
2. Unzip both zip files into local file system on edge node and run a shell/python script to pick file one by one and load using hdfs -put into HDFS.
3. Create Hive tables above it to be consumed by business users using Impala/PowerBI.
4. NiFi is out of scope due to some business constraints.
1. What is the best way to handle zip files. Ask source system to load it in unzipped at SFTP OR get it unzipped at Edge node local file system? Unzipping both zip files would create 84 files of total 40 GB in size.Remember I would be creating 84 Hive tables one for each file where the data would be appended for each incremental feed.
2. How should I handle small files in KBs. I don't think merge is an option since they are individual SQL table dump and there would be 1 hive table created for each and the business would like to do processing on them daily especially when they have Alteryx to do self service ETL.
3. I am not sure how fast will it be to first get data onto Edge node and then use hdfs -put considering the file size. Please suggest a better alternative.
Note: Cluster is hosted on AWS.
Hey @Gaurav Gupta!
In my humble opinion, I think it'll be better if you ask for zipped files cause the transfer will be faster. I mean, usually network cost is bigger than cpu cost (to unzip).
And for the small files you can try to use the following parameters in your hive queries:
*BTW if its possible for the tables with small files, try to avoid partitions they'll break your data in smaller chunks.
And for your hdfs dfs -put, maybe, you can try to use flume with spooldir. Basically flume agent will grab any new files inside a directory and put into HDFS or whatever sink you may configure.
Hope this helps!