Created 02-19-2018 10:44 AM
I am trying to understand the ORC file format. The documentation says that a file is divided into fixed-size stripes. But according to my basic HDFS understanding, shouldn't the stripes be derived from the blocks stored on the DataNodes? Am I correct?
Created 02-19-2018 03:03 PM
A block will not be divided into stripes. The HDFS block is the lowest level: a file is split into blocks purely by byte count, independent of the file format. Binary files, text files, and ORC files are all split into blocks the same way.
With the ORC file format you will have several stripes within the file. HDFS splits the file without considering the stripe layout, so one HDFS block may contain part of a stripe, a complete stripe, or even multiple stripes. Of course, by setting proper values you can optimize this, as described here:
https://community.hortonworks.com/articles/75501/orc-creation-best-practices.html
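As an illustration of those settings, here is a minimal sketch using the ORC Java writer API. The file path and schema are invented for the example; the stripe size passed to the writer is a target, not a hard guarantee, and 64 MB stripes against a 128 MB block size is just one common pairing so that two stripes fit per block.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.CompressionKind;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcStripeSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical schema for the example.
        TypeDescription schema = TypeDescription.fromString("struct<id:int,name:string>");

        // Request ~64 MB stripes and a 128 MB HDFS block size, so that
        // stripes tend to fit cleanly inside blocks instead of straddling them.
        Writer writer = OrcFile.createWriter(
                new Path("/tmp/example.orc"),   // hypothetical path
                OrcFile.writerOptions(conf)
                        .setSchema(schema)
                        .stripeSize(64L * 1024 * 1024)
                        .blockSize(128L * 1024 * 1024)
                        .compress(CompressionKind.ZLIB));
        // Rows would be appended here via writer.addRowBatch(...).
        writer.close();
    }
}
```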
Created 02-20-2018 10:08 AM
One more doubt. When we create an ORC table and load data into it from an existing table, what is the flow of data? I mean, the data to be compressed is sitting on the DataNodes, and now it has to be processed (striped and indexed) for ORC. Could you briefly explain how this works?
Created 02-20-2018 11:37 AM
On the functional level, the logic for the ORC file (stripes and indexes) is independent of how the blocks are distributed in HDFS. So at the application level you simply read the source ORC file and write the data to the new ORC file. The application handling the ORC format knows at which byte position in the file the data is stored, while HDFS knows which bytes of the file are stored on which node.
If you now read the complete file, all nodes deliver their blocks. If this data is then stored again in an HDFS file, HDFS decides how many blocks to use and distributes them across the nodes, so the data gets transferred from the source nodes to the target nodes. This is transparent to the user. The application only has to take care, if the stripes are defined differently, to write the data correctly into the new ORC file.
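To make that concrete, here is a rough sketch of the read-and-rewrite flow with the ORC Java API (the paths are hypothetical). The reader resolves stripes and indexes at the application level while HDFS fetches the underlying blocks; the writer then lays out its own stripes for the target file, independent of how the source was striped.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.apache.orc.Writer;

public class OrcCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Open the source file; the ORC reader handles stripes/indexes,
        // while HDFS transparently serves the blocks behind them.
        Reader reader = OrcFile.createReader(
                new Path("/tmp/source.orc"), OrcFile.readerOptions(conf));

        // The target writer creates its own stripe layout.
        Writer writer = OrcFile.createWriter(
                new Path("/tmp/target.orc"),
                OrcFile.writerOptions(conf).setSchema(reader.getSchema()));

        RecordReader rows = reader.rows();
        VectorizedRowBatch batch = reader.getSchema().createRowBatch();
        while (rows.nextBatch(batch)) {
            writer.addRowBatch(batch);
        }
        rows.close();
        writer.close();
    }
}
```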
Created 02-20-2018 11:59 AM
Thanks for the reply @haraldberghoff.
Correct me where I'm wrong -
a) The client reads the file (the one we wish to compress, i.e. write into an ORC file).
b) This read happens the normal way: the client contacts the NameNode to get the block addresses (see the sketch after this list).
c) The client then applies all the ORC logic to this data (creates the stripes, indexes, wrapper, etc.).
d) Then a typical write operation is carried out, again contacting the NameNode for new block addresses.
e) The only part where ORC comes in is (c), which is independent of the NameNode and DataNodes.
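For illustration, a small sketch of what step (b) looks like through the standard Hadoop FileSystem API (the file path is made up): the client asks the NameNode for the block locations of the file before any data is read.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/example.orc"); // hypothetical path

        // Step (b): the NameNode returns which DataNodes hold each block.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
        }
    }
}
```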
Created 02-20-2018 12:23 PM
Yes, that's correct.
The only interesting point is how your client works. If it is a proper Hadoop client, it will run directly and in parallel on the nodes storing the file blocks. A non-Hadoop client will really retrieve the full file from HDFS, process it, and write it back to HDFS. In a Spark application, each of the steps a-d is executed in parallel on different nodes, while the Hadoop framework takes care of bringing the execution to the data.
And if the stripes are unluckily distributed across the blocks (and therefore across HDFS nodes), the data transfer between the nodes is much higher than if the stripes are well aligned. But that is exactly because the stripes are created independently of the blocks, and it is also the key to optimizing the stripe size (together with your usage pattern).
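If you want to check how your stripes actually line up with the blocks, a rough sketch like the following (file path hypothetical) reads the stripe offsets from the ORC metadata and computes which blocks each stripe touches; a stripe that spans more than one block can force remote reads when those blocks sit on different DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.StripeInformation;

public class StripeVsBlock {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path file = new Path("/tmp/example.orc"); // hypothetical path
        FileSystem fs = FileSystem.get(conf);
        long blockSize = fs.getFileStatus(file).getBlockSize();

        Reader reader = OrcFile.createReader(file, OrcFile.readerOptions(conf));
        for (StripeInformation stripe : reader.getStripes()) {
            long firstBlock = stripe.getOffset() / blockSize;
            long lastBlock = (stripe.getOffset() + stripe.getLength() - 1) / blockSize;
            System.out.printf("stripe at %d (%d bytes) spans blocks %d..%d%n",
                    stripe.getOffset(), stripe.getLength(), firstBlock, lastBlock);
        }
    }
}
```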