Support Questions

What is the exact difference between a sequence file and a text file? Some forums say that sequence files resolve the small file problem. Why can't the small file problem be resolved with text files? Please help me understand this.

Explorer
 
1 ACCEPTED SOLUTION

Guru

Sequence files are binary files containing key-value pairs. They can be compressed at the record (key-value pair) or block level. A Java API is typically used to write and read sequence files, though tools such as Sqoop can also produce them. Because they are binary, they are faster to read and write than text-formatted files.
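
As a minimal sketch of that Java API (assuming Hadoop 2.x or later; the path, class name and key/value types below are illustrative, not prescriptive), writing and then reading back a block-compressed sequence file looks roughly like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;

    public class SequenceFileDemo {                      // hypothetical demo class
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("/tmp/demo.seq");       // illustrative path

            // Write a few key-value records with block-level compression
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(Text.class),
                    SequenceFile.Writer.compression(
                            SequenceFile.CompressionType.BLOCK, new DefaultCodec()))) {
                for (int i = 0; i < 3; i++) {
                    writer.append(new Text("key-" + i), new Text("value-" + i));
                }
            }

            // Read the records back in insertion order
            try (SequenceFile.Reader reader =
                    new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
                Text key = new Text();
                Text value = new Text();
                while (reader.next(key, value)) {
                    System.out.println(key + " -> " + value);
                }
            }
        }
    }

Switching CompressionType.BLOCK to CompressionType.RECORD gives the per-record compression mentioned above.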

The small file problem arises because the NameNode keeps metadata for every file and block in memory, so referencing a large number of small files creates memory overhead. "Large" is relative, but if, for example, you ingest many small files every day, over time you start paying that memory price. Also, MapReduce operates on blocks of data: when each file holds less than a block of data, the job spins up more mappers (each with its own startup overhead) than it would for files larger than a block.
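
To put rough numbers on it (a widely quoted rule of thumb, not an exact figure): each file, directory, and block object costs on the order of 150 bytes of NameNode heap, so ten million single-block small files come to roughly 10,000,000 files x 2 objects x 150 bytes, about 3 GB of heap, regardless of how little data those files actually contain.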

Sequence files can solve the small file problem when used in the following way: the sequence file is written to hold many key-value pairs, where the key is unique file metadata (such as the ingest filename, or filename plus timestamp) and the value is the content of the ingested file. You then have a single file holding many ingested files as splittable key-value pairs. If you loaded it into Pig, for example, and grouped by key, each file's content would be its own record. Sequence files are often used this way in custom-written MapReduce programs.
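
For example, here is a hedged sketch of that packing step, assuming the small files already sit in an HDFS landing directory (the SmallFilePacker class name and both paths are made up for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import java.io.ByteArrayOutputStream;

    public class SmallFilePacker {                        // hypothetical helper class
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path landingDir = new Path("/data/landing");            // illustrative
            Path packed = new Path("/data/packed/ingest.seq");      // illustrative

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(packed),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                for (FileStatus status : fs.listStatus(landingDir)) {
                    if (status.isFile()) {
                        // Key = original filename, value = the entire file contents
                        ByteArrayOutputStream buf = new ByteArrayOutputStream();
                        IOUtils.copyBytes(fs.open(status.getPath()), buf, conf, true);
                        writer.append(new Text(status.getPath().getName()),
                                      new BytesWritable(buf.toByteArray()));
                    }
                }
            }
        }
    }

The result is one splittable file on HDFS whose records are the original small files, keyed by filename.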

As with any file format decision, you need to understand what problem you are solving by choosing a particular format for a particular use case. If you are writing your own MapReduce programs, and especially if you are repeatedly ingesting many small files (and perhaps also want to process the ingested files' metadata as well as their contents), then sequence files are a good fit. If, on the other hand, you want to load the data into Hive tables (especially where most queries touch only a subset of columns), you are better off landing the small files in HDFS, merging them and converting them to ORC, and then deleting the landed small files.
