New Contributor
Posts: 5
Registered: ‎05-10-2017

Need Help: Ingest Huge + Small files combo in HDFS

Hi Everyone,


Appreciate your response to the following problem.


Problem Statement:

1. I have got 2 zip files (each of size 1 GB) stored at a directory on windows server.

2. Each zip file has 42 .dat files(5 files of 3-9 GB size range, 3 files of more than 128 MB (more than block size), rest of the files are in KB). These files have pipe delimited records that are actually table dump of sql server.

1 file has 1 table dump.

3. Incremental load is on daily basis. The files can be pushed to a SFTP location too.

4. I need to create an automated data ingestion pipeline for the same. Can anyone suggest the best way to do this?


Solution Approach (I could think of):

1. Get files(zipped) pushed to a SFTP server. Run a python script to pull and store on Edge node.

2. Unzip both zip files into local file system on edge node and run a shell/python script to pick file one by one and load using hdfs -put into HDFS.

3. Create Hive tables above it to be consumed by business users using Impala/PowerBI.



1. What is the best way to handle zip files. Ask source system to load it in unzipped at SFTP OR get it unzipped at Edge node local file system? Unzipping both zip files would create 84 files of total 40 GB in size.Remember I would be creating 84 Hive tables one for each file where the data would be appended for each incremental feed.


2. How should I handle small files in KBs. I don't think merge is an option since they are individual SQL table dump and there would be 1 hive table created for each and the business would like to do processing on them daily especially when they have Alteryx to do self service ETL.


3. I am not sure how fast will it be to first get data onto Edge node and then use hdfs -put considering the file size. Please suggest a better alternative. 


Note: Cluster is hosted on AWS.