Member since: 02-01-2018
Posts: 37
Kudos Received: 2
Solutions: 0
03-05-2018
06:09 AM
Would using DISTRIBUTE BY or SORT BY be helpful?
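For reference, this is the kind of usage I have in mind when writing out the data (table and column names here are hypothetical):

    INSERT OVERWRITE TABLE sales_orc
    SELECT *
    FROM sales_staging
    DISTRIBUTE BY customer_id  -- send rows with the same key to the same reducer
    SORT BY customer_id;       -- sort rows within each reducer's output file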
03-02-2018
05:42 PM
@Akshat Mathur That is handy, thank you for sharing! If you think my response was helpful, please accept the answer so it is easier for others to find.
02-26-2018
07:48 PM
Serialization is the process by which data is encoded for writing to disk or for transmission elsewhere. Different applications serialize data in different ways to optimize for a specific outcome, such as read or write performance. As the Hive language manual explains, integers and strings are encoded to disk and compressed in different ways, and it lists the rules it uses to do so. For example, variable-width encoding optimizes space usage because smaller values need fewer bytes: a small integer can fit in a single byte, while a larger one takes several. See the following Wikipedia article for more detail: https://en.wikipedia.org/wiki/Serialization
02-27-2018
08:03 AM
1 Kudo
Adding columns to the end of the table works from Hive 1.2 (via HDP 2.5.4). In Hive 2.1, you get additional abilities to change column types. In the eventual Hive 2.2, you'll get the ability to delete and reorder columns. Hive 0.13 is a little early for those features.
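A minimal sketch of what those operations look like, assuming a hypothetical ORC table named sales_orc:

    -- Hive 1.2+: append new columns at the end of the schema
    ALTER TABLE sales_orc ADD COLUMNS (discount_pct DOUBLE);

    -- Hive 2.1+: change a column's type (within the supported conversions)
    ALTER TABLE sales_orc CHANGE COLUMN quantity quantity BIGINT;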
02-23-2018
07:52 AM
Hey, I am pretty confused about which storage format is suited for which type of data. You said, "Parquet is well suited for data warehouse kind of solutions where aggregations are required on certain columns over a huge set of data," but I think that is true for ORC too. And since, as @owen said, ORC contains indexes at 3 levels (versus 2 levels in Parquet), shouldn't ORC be faster than Parquet for aggregations?
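A minimal sketch of the side-by-side test I have in mind to compare the two formats on the same data (table and column names are hypothetical):

    CREATE TABLE events_orc     STORED AS ORC     AS SELECT * FROM events_raw;
    CREATE TABLE events_parquet STORED AS PARQUET AS SELECT * FROM events_raw;

    SELECT category, SUM(amount) FROM events_orc     GROUP BY category;
    SELECT category, SUM(amount) FROM events_parquet GROUP BY category;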
02-20-2018
12:23 PM
Yes, that's correct. The only interesting point is how your client works. If it is a proper Hadoop client, it will run directly and in parallel on the nodes storing the file blocks. If you have a non-Hadoop client, it will really retrieve the full file from HDFS, process it, and write it back to HDFS. In a Spark application each of the steps a-d will be executed in parallel on different nodes, while the Hadoop framework takes care of bringing the execution to the data. And if the stripes are unluckily distributed across the blocks (and therefore across HDFS nodes), the data transfer between the nodes is much higher than if the stripes are well distributed. But this is exactly because the stripes are created independently of the blocks. It is also the key to optimizing the stripe size (together with your usage pattern).
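As an illustration of tuning the stripe size, here is a minimal sketch that sets the ORC stripe size to match a 128 MB HDFS block size (the table name and the 128 MB figure are assumptions, not values from this thread):

    CREATE TABLE logs_orc (id BIGINT, payload STRING)
    STORED AS ORC
    TBLPROPERTIES ("orc.stripe.size"="134217728");  -- 128 MB, in bytes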
03-02-2018
06:39 PM
@owen My numbers of mappers and reducers are almost down to half with ORC for a query, and the bytes read from HDFS are also reduced significantly. But the time taken by the ORC query is still almost the same as the sequence file query.
02-19-2018
11:42 AM
1 Kudo
Yes, it's possible, but only if there is no repetition in a column. In that case one ends up with the full data plus the metadata of the columnar file format.