I have one question that I am struggling to find an answer to.
I have been working with Avro and Parquet for quite some time. However, I lack some granular, in-depth knowledge about storage formats and object models.
I know the basics: the storage format is the on-disk representation and the object model is the in-memory representation. The missing link for me is:
1) If I use a storage format such as Parquet, how does the data fit in memory, and how exactly does the object model come into play during processing?
2) Does the object model actually come into the picture? All data has to be brought into memory to be processed. Does Parquet not care how the data is represented in memory, or how does it handle the in-memory representation?
OK, I am not completely sure what you mean by object model. In Hive, the storage format is described by two things:
- the Storage Handler
This describes the way data is stored on disk at the file level. Essentially it defines the OutputFormat, if you put it in MapReduce terminology. Examples are text, sequence, ORC and Parquet.
It returns a row with values, which brings us to the next part:
- The SerDe (serializer/deserializer). This Hive class parses the row and returns the actual data objects. There is a Hive interface for this, where the SerDe returns a map of typed values for every row. (There is also something called vectorization, which goes one level deeper, but explaining that here would go too far.)
For data formats like text and sequence files, the SerDe and the storage format are separate from each other. This is different for ORC and Parquet files, though. There they form one block, because in a column format the rows need to be pieced together from the columns.
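To make the split for text-like formats concrete, here is a minimal sketch in Python. It is not Hive's actual Java interface; the names read_records, deserialize and schema are made up for illustration. One function only knows how records are laid out in the file, the other only knows how to turn one raw record into typed column values.

```python
import io
from typing import Dict, Iterator, List, Tuple

# Hypothetical names throughout: this mimics the split described above,
# it is not Hive's real InputFormat/SerDe API.

def read_records(stream: io.TextIOBase) -> Iterator[str]:
    """'Storage' layer: knows the file layout (here: one record per line)."""
    for line in stream:
        yield line.rstrip("\n")

def deserialize(record: str, schema: List[Tuple[str, type]], sep: str = ",") -> Dict[str, object]:
    """'SerDe' layer: parses one raw record into a map of typed column values."""
    fields = record.split(sep)
    return {name: cast(raw) for (name, cast), raw in zip(schema, fields)}

# Assumed example schema and data, only for this sketch.
schema = [("id", int), ("name", str), ("price", float)]
data = io.StringIO("1,apple,0.5\n2,pear,0.75\n")

rows = [deserialize(record, schema) for record in read_records(data)]
print(rows)
```

Swapping the separator or the schema only touches deserialize, while a different file layout would only touch read_records. For a column format that clean split is not possible in the same way, because a row does not exist as one contiguous record in the file.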
1) I am not sure exactly what you mean by "fit in memory", but basically it is not too different. Parquet (and ORC) are column formats, so the Parquet storage handler reads blocks of data at a time, only for the columns the query needs, and passes them up into the Hive SQL engine as maps of typed values. Not everything needs to fit in memory, just a buffer block (often 10000 rows).
2) As said, there is an interface each storage handler implements, which is essentially "next row, next row...", and the return format is fixed: a map of column:value pairs. This is not unlike what a JDBC driver does. The storage handler doesn't really care what the engine does with the rows, nor does the engine care much how the rows were read from disk.
(This is not completely true anymore: for ORC, Hive now supports vectorization, which means functions are executed on a thousand rows at a time deeper down the stack, and the engine can also push hints such as WHERE conditions down into the storage handler so it can skip parts of the file. But by and large it still holds.)
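The two points above can be sketched with a toy columnar reader in Python. Everything here is an assumption for illustration (the BLOCKS layout, the scan function, the may_match hint); real Parquet and ORC files add encodings, compression and richer statistics. The sketch shows three things: only the requested columns are touched, rows are pieced together one block at a time, and a pushed-down hint can skip a whole block based on its min/max statistics.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Stats = Dict[str, Tuple[int, int]]

# Toy "columnar file": each block stores every column as its own list,
# plus (min, max) statistics for the id column. Made-up data.
BLOCKS = [
    {"stats": {"id": (1, 3)},
     "columns": {"id": [1, 2, 3], "name": ["a", "b", "c"], "price": [5.0, 7.5, 1.0]}},
    {"stats": {"id": (4, 6)},
     "columns": {"id": [4, 5, 6], "name": ["d", "e", "f"], "price": [9.0, 2.5, 8.0]}},
]

def scan(blocks: List[dict], columns: List[str],
         may_match: Optional[Callable[[Stats], bool]] = None) -> Iterator[List[Dict[str, object]]]:
    """Yield one batch of rows per block; only one batch is in memory at a time."""
    for block in blocks:
        if may_match is not None and not may_match(block["stats"]):
            continue  # pushed-down hint: the block cannot contain matching rows
        selected = [block["columns"][name] for name in columns]  # column pruning
        # Piece the rows together from the separate column lists.
        yield [dict(zip(columns, values)) for values in zip(*selected)]

# Roughly "SELECT id, price ... WHERE id >= 4": the name column is never read,
# and the first block is skipped because its max id is 3.
batches = list(scan(BLOCKS, ["id", "price"], may_match=lambda s: s["id"][1] >= 4))
print(batches)
```

The engine on top only sees batches of column:value maps, so it does not need to know that the data was stored column by column.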
Interesting question. If I had to answer it in one word, it would be: Serialization (https://en.wikipedia.org/wiki/Serialization)
Parquet is often used together with Avro, which is a serialization framework like Thrift or Protocol Buffers (Parquet ships object-model bindings for all three, for example parquet-avro). There is a high chance you will encounter all three of them in Hadoop. These frameworks make serialization available across different languages like Python, Java, C, and so on. So an object written to a file by a Python application can be deserialized by a Java application, if both use the same serialization framework. Therefore each framework needs to provide an implementation for each language, containing the type mapping and more.
Avro (and the others) do not define how deserialized objects are represented in memory. That is up to the programming language, so Python and Java have different in-memory representations. Avro just helps to make the data shareable between them.
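A small Python sketch of that idea, using only the standard library rather than Avro itself. The byte layout, the Product class and the function names are assumptions made up for this example. The same serialized bytes are decoded into two different in-memory representations, which is the sense in which the serialized format does not dictate the object model.

```python
import struct
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical fixed layout: little-endian int32 id, float64 price,
# uint16 name length, followed by the UTF-8 bytes of the name.
HEADER = struct.Struct("<idH")

def serialize(record_id: int, price: float, name: str) -> bytes:
    encoded = name.encode("utf-8")
    return HEADER.pack(record_id, price, len(encoded)) + encoded

def _decode(payload: bytes) -> Tuple[int, float, str]:
    record_id, price, name_len = HEADER.unpack_from(payload, 0)
    name = payload[HEADER.size:HEADER.size + name_len].decode("utf-8")
    return record_id, price, name

def to_dict(payload: bytes) -> Dict[str, object]:
    """Object model A: a generic map of field name to value."""
    record_id, price, name = _decode(payload)
    return {"id": record_id, "price": price, "name": name}

@dataclass
class Product:
    id: int
    price: float
    name: str

def to_object(payload: bytes) -> Product:
    """Object model B: a specific class with typed fields."""
    return Product(*_decode(payload))

payload = serialize(7, 2.5, "pear")
print(to_dict(payload))
print(to_object(payload))
```

A reader written in another language could decode the same bytes into its own native types, as long as it follows the same layout. A schema-based framework such as Avro plays the role of the agreed layout here.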
If you want to change the way a programming language's in-memory objects are serialized, you have to replace the language's own serialization mechanism. Kryo (https://github.com/EsotericSoftware/kryo), which is used by Hive, Storm, and Spark, is such a framework: it replaces Java's built-in serialization.
In the picture below I try to show how serialization is used in Storm. Objects (tuples) are (de-)serialized before being sent over the network instead of being written to a file. I hope this helps to illustrate it.