Created 05-13-2018 06:43 AM
We have a lot of small files (in KBs, around 3000 of them) in a local filesystem and need to ingest them into a Hive table, but it is taking a lot of time. Is there a way to merge the files before loading, or to ingest the small files in less time?
Created 05-13-2018 11:59 PM
Move files into an HDFS directory
You can move all the files from the local filesystem into an HDFS directory:
[bash $] hadoop fs -put <local-path> <hdfs-directory>
Write a shell script to move all the required files into the HDFS directory.
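For example, a minimal sketch, assuming the small files live under a hypothetical /data/small-files directory and a hypothetical /user/hive/staging target:

#!/bin/bash
# Hypothetical paths; adjust to your environment.
LOCAL_DIR=/data/small-files
HDFS_DIR=/user/hive/staging
# Create the target directory in HDFS (no-op if it already exists).
hadoop fs -mkdir -p "$HDFS_DIR"
# Copy every file from the local directory into HDFS.
hadoop fs -put "$LOCAL_DIR"/* "$HDFS_DIR"/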
Then create a Hive table (treat it as a staging table) on top of that HDFS directory (i.e., where we moved the files).
In this method we are not loading the data file by file (with load data local inpath ...); instead we copy the files into an HDFS directory and create the table on top of the copied files.
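A minimal sketch of the staging table, assuming comma-delimited text files and the hypothetical /user/hive/staging directory from above (the column list is a placeholder; match it to your file layout):

hive -e "
CREATE EXTERNAL TABLE staging_table (
  id INT,       -- placeholder column
  name STRING   -- placeholder column
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/staging';
"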
Even in a table, these small files will cause performance issues, so create another table (the final table) and run:
insert overwrite table finaltable select * from stagingtable order by <field>; -- the order by clause forces a single reducer, so only one file is created in the final table. If you have millions of records, use something other than order by so that more than one reducer runs.
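For example, with hypothetical table names and a hypothetical id column, the single-file compaction and a multi-reducer variant might look like this:

# Single reducer => exactly one file in the final table.
hive -e "INSERT OVERWRITE TABLE final_table SELECT * FROM staging_table ORDER BY id;"

# For millions of records, spread the work across several reducers;
# each reducer writes its own (larger) file.
hive -e "SET mapred.reduce.tasks=4;
INSERT OVERWRITE TABLE final_table
SELECT * FROM staging_table DISTRIBUTE BY id SORT BY id;"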
(or)
Merge the small files locally
Merge the files into one big file:
[bash $] cat file-name1 file-name2 file-name3 > merge.txt  # wildcards work in the filenames too
This creates a single merge.txt file containing the contents of all the small files.
Once the files are merged, move the merged file into an HDFS directory and create a table on top of that directory, or load it directly with load data local inpath '<merged-file-path>' into table <table-name>;
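Putting those steps together, a minimal sketch (paths and the table name are hypothetical):

#!/bin/bash
# Merge all the small files into one local file (wildcards work here).
cat /data/small-files/* > /tmp/merge.txt
# Option A: copy the merged file into HDFS and create a table on top of that directory.
hadoop fs -mkdir -p /user/hive/merged
hadoop fs -put /tmp/merge.txt /user/hive/merged/
# Option B: load the merged file straight into an existing Hive table.
hive -e "LOAD DATA LOCAL INPATH '/tmp/merge.txt' INTO TABLE my_table;"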
(or)
Using NiFi
Use the ListFile and FetchFile processors (or the GetFile processor) to pull the local files into NiFi, then use the MergeContent processor to merge the small files into one big file based on your required maximum size, and store the result in HDFS with the PutHDFS processor.
In addition, you can use record-oriented processors (e.g., ConvertRecord, or ConvertAvroToORC for ORC output) to read the incoming data and change the output flowfile format, producing ORC files inside NiFi before storing them in HDFS.
References regarding the MergeContent processor in NiFi:
https://community.hortonworks.com/questions/64337/apache-nifi-merge-content.html
https://community.hortonworks.com/questions/161827/mergeprocessor-nifi-using-the-correlation-attribu...
Created 05-14-2018 02:14 PM
Great suggestions up there.
We handle this case by dumping the small files as ORC into daily partitions and then running Hive ALTER TABLE/PARTITION CONCATENATE every week or so.
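For reference, the concatenation step might look like this (table name and partition spec are hypothetical); CONCATENATE rewrites a partition's small ORC files into fewer, larger ones:

# Merge the small ORC files of one daily partition.
hive -e "ALTER TABLE events PARTITION (dt='2018-05-14') CONCATENATE;"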