About Hadoopy

Hadoopy · ‎03-08-2018

@Binu Mathew Hey, In my case lot of mappers are launched when I run a select query on ORC file. Also, are there some particular settings of hive to be turned on so that read operations in ORC use ppd. I have tried a lot but almost all my queries read the same as size(of my ORC table), which means reader is reading the whole ORC file. I run Hive 0.13.

Hadoopy · ‎03-05-2018

Will usage of DISTRIBUTE BY or SORT BY be helpful?

Hadoopy · ‎03-05-2018

My Hive table is in ORC format and queries in it run fastest when columns in where clause are sorted. But in my case there are not currently. What is the syntax to sort a column just before query?

Hadoopy · ‎03-02-2018

@owen My number of mapper and reducers are almost down to half in ORC for a query. bytes read from HDFS is also reduced significantly. But still the time taken by ORC query is almost same as sequence file query.

Hadoopy · ‎03-02-2018

Yup @Michael Young. Another way I found was through hadoop fs -text <file-location> On the top of results, INFO compress.CodecPool: Got brand-new decompressor [.snappy] is written which I think is a confirmation that snappy compression is applied.

Hadoopy · ‎02-27-2018

I am using Hive 0.13 I didn't try turning on vectorization yet. It was sum of entire column (of a partition in my table).

Hadoopy · ‎02-27-2018

So I compressed my table in hive using snappy compression and it did get compress. The size was reduced. But when i run hadoop fs -lsr /hive/user.db/table_name, I see no file extensions with .snappy. I want to know if they really were snappy compressed or not?

Hadoopy · ‎02-26-2018

Link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC Above link has a part named serialization. Can somebody tell what serialization is and for what it is used for?

Hadoopy · ‎02-25-2018

Are the changes in schema like adding, deleting, renaming, modifying data type in columns permitted without breaking anything in ORC files in Hive 0.13.

Hadoopy · ‎02-23-2018

Hey, I am pretty much confused which storage format is suited for which type of data. You said "Parquet is well suited for data warehouse kind of solutions where aggregations are required on certain column over a huge set of data.", But I think its true for ORC too. And As @owen said, ORC contains indexes at 3 levels (2 levels in parquet), shouldn't ORC be faster than Parquet for aggregations.

Online	Offline
Last Visited	‎03-12-2018 05:48 AM

Member Since	‎02-01-2018 12:06 AM
Last Visited	‎03-12-2018 05:48 AM
Posts	37
Kudos received	2

Cloudera Community

Re: Will there be any performance issues if we sel...

Re: Can we sort a column of a Hive table just befo...

Can we sort a column of a Hive table just before q...

Re: Can someone explain me the output of orcfiledu...

Re: How to confirm my files are snappy compresses ...

Re: Can someone explain me the output of orcfiledu...

How to confirm my files are snappy compresses in h...

What is serialization in ORC?

To what extent is schema evolution available in OR...

Re: Between Avro, Parquet, and RC/ORC which is use...