Created 06-29-2016 01:33 PM
Hi experts, I'm a beginner using Hadoop and was reading a book that talks about Storage Format. My source data are some text files and my question is: I need/can transform my files into Sequence File, Avro, Parquet or Optimized Row Columnar? There I take some advantage using it instead of text files? Many thanks!!
Created 06-29-2016 01:41 PM
With "text" you mean delimited files right?
You can convert them in hive using a CTAS statement for example. ( or in pig reading with PigStorage and writing with any of the other Storage classes ) etc.
like CREATE TABLE X ... ROW FORMAT DELIMITED FIELDS TERMINATED BY ...;
CREATE TABLE ORCX STORED AS ORC AS SELECT * FROM X;
Regarding which file formats are best:
Delimited files:
Good for import/export, you can often leave the input data unchanged which is desirable in a system of records, often no conversion needed.
Sequence File:
Binary format, not readable, but faster to read write. Native format of Hadoop. With the arrival of ORC files a bit out of vogure
Optimized column storage: (use ORC in HDP, Parquet in Cloudera but they are very similar):
Optimized column storage format. 10-100x faster for queries. Definitely the way to go to store your Hive data for optimal performance. Including compression ( 10x for zip ), predicate pushdown ( skipping blocks based on where conditions ), column storage ( only the needed columns are read ) ...
Avro :
The way to go if you have XML/Json files and changing schemas. You can add columns to the formats above but its hard. Avro supports schema evolution and the integration into hive allows you to change the hive table schema based on new changed underlying data. If your input is XML/Json data this can be a very good data format. Because unlike Json/XML its binary and fast while still keeping the schema.
Created 06-29-2016 01:41 PM
With "text" you mean delimited files right?
You can convert them in hive using a CTAS statement for example. ( or in pig reading with PigStorage and writing with any of the other Storage classes ) etc.
like CREATE TABLE X ... ROW FORMAT DELIMITED FIELDS TERMINATED BY ...;
CREATE TABLE ORCX STORED AS ORC AS SELECT * FROM X;
Regarding which file formats are best:
Delimited files:
Good for import/export, you can often leave the input data unchanged which is desirable in a system of records, often no conversion needed.
Sequence File:
Binary format, not readable, but faster to read write. Native format of Hadoop. With the arrival of ORC files a bit out of vogure
Optimized column storage: (use ORC in HDP, Parquet in Cloudera but they are very similar):
Optimized column storage format. 10-100x faster for queries. Definitely the way to go to store your Hive data for optimal performance. Including compression ( 10x for zip ), predicate pushdown ( skipping blocks based on where conditions ), column storage ( only the needed columns are read ) ...
Avro :
The way to go if you have XML/Json files and changing schemas. You can add columns to the formats above but its hard. Avro supports schema evolution and the integration into hive allows you to change the hive table schema based on new changed underlying data. If your input is XML/Json data this can be a very good data format. Because unlike Json/XML its binary and fast while still keeping the schema.
Created 10-16-2018 03:41 PM
I would like to elaborate more on the answer already given. This is a attempt to simplify explaination on what it takes to make a choice to follow a specific format.
There is now choice available within HDFS that can manage file format and compression techniques. Alternative to explicit encoding and splitting using LZO or BZIP. There is many format that today support block compression and columnar row compression with features.
A storage format is a way you define how information is to be stored. This is sometimes usually indicated by the extension of the file. For example we know images can be several storage formats, PNG, JPG, and GIF etc. All these formats can store the same image, but each has specific storage characteristics.
In Hadoop filesystem you have all of traditional storage formats available to you (like you can store PNG and JPG images on HDFS if you like), but you also have some Hadoop-focused file formats to use for structured and unstructured data.
Why is it important to know these formats
In any performance tradeoffs, a huge bottleneck for HDFS-enabled applications like MapReduce, Hive, HBase, and Spark is the time it takes to find relevant data in a particular location and the time it takes to write the data back to another location. These issues are accentuated when you manage large datasets. The Hadoop file formats have evolved to ease these issues across a number of use cases.
Choosing an appropriate file format can have some significant benefits:
Some file formats are designed for general use, others are designed for more specific use cases (like powering a database), and some are designed with specific data characteristics in mind. So there really is quite a lot of choice when storing data in Hadoop and one should know to optimally store data in HDFS. Currently my go to storage is ORC format.
Check if your Big data components (Spark, Hive, HBase etc) support these format and make the decision accordingly. For example, I am currently injecting data into Hive and converting it into ORC format which works for me in terms of compression and performance.
Some common storage formats for Hadoop include:
Plain text storage (eg, CSV, TSV files, Delimited file etc)
Data is laid out in lines, with each line being a record. Lines are terminated by a newline character \n in the typical UNIX world. Text-files are inherently splittable. but if you want to compress them you’ll have to use a file-level compression codec that support splitting, such as BZIP2. This is not efficient and will require a bit of work when performing MapReduce tasks.
Sequence Files
Originally designed for MapReduce therefore very easy to integrate with Hadoop MapReduce processes. They encode a key and a value for each record and nothing more. Stored in a binary format that is smaller than a text-based format. Even here it doesn't encode the key and value in anyway. One benefit of sequence files is that they support block-level compression, so you can compress the contents of the file while also maintaining the ability to split the file into segments for multiple map tasks. Though still not efficient as per statistics like Parquet and ORC.
Avro
The format encodes the schema of its contents directly in the file which allows you to store complex objects natively. Its file format with additional framework for, serialization and deserialization framework. With regular old sequence files you can store complex objects but you have to manage the process. It also supports block-level compression.
Parquet
My favorite and hot format these days. Its a columnar file storage structure while it encodes and writes to the disk. So datasets are partitioned both horizontally and vertically. One huge benefit of columnar oriented file formats is that data in the same column tends to be compressed together which can yield some massive storage optimizations (as data in the same column tends to be similar). Try using this if your processing can optimally use column storage. You can refer to advantages of columnar storages.
If you’re chopping and cutting up datasets regularly then these formats can be very beneficial to the speed of your application, but frankly if you have an application that usually needs entire rows of data then the columnar formats may actually be a detriment to performance due to the increased network activity required.
ORC
ORC stands for Optimized Row Columnar which means it can store data in an optimized way than the other file formats. ORC reduces the size of the original data up to 75%(eg: 100GB file will become 25GB). As a result the speed of data processing also increases. ORC shows better performance than Text, Sequence and RC file formats. An ORC file contains rows data in groups called as Stripes along with a file footer. ORC format improves the performance when Hive is processing the data.
It is similar to the Parquet but with different encoding technique. Its not for this thread but you can lookup on Google for differences.