Member since: 10-21-2015
9 Posts
8 Kudos Received
1 Solution
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 10074 | 02-19-2018 10:46 PM
02-26-2018 04:52 PM
Akshat, I need more information. Which version of the software are you using? Are you using the vectorized reader or the row-by-row reader? The vectorized reader is much faster. Does your query have any predicate pushdown, or is it a sum of the entire column?
02-21-2018 06:59 PM
2 Kudos
Only ORC and Parquet have the necessary features:
* Predicate pushdown, where a condition is checked against the metadata to see if the rows need to be read.
* Column projection, to only read the bytes for the necessary columns.

ORC can use predicate pushdown based on either:
* min and max for each column
* an optional bloom filter for looking for particular values

Parquet only has min/max. ORC can filter at the file level, stripe level, or 10k-row level. Parquet can only filter at the file level or stripe level.

The previous answer mentions some of Avro's properties that are shared by ORC and Parquet:
* They are both language neutral, with C++, Java, and other language implementations.
* They are both self-describing.
* They are both splittable when compressed.
* They both support schema evolution.
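To see the read side concretely, here is a minimal sketch (assuming PySpark; the path, table, and column names are hypothetical) of a query that benefits from both features: the select gives column projection, and the where clause is a candidate for predicate pushdown into the ORC reader.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-pushdown-sketch").getOrCreate()

df = (spark.read.orc("/data/purchases_orc")        # hypothetical ORC directory
           .select("customer_id", "amount")        # column projection: only these bytes are read
           .where("customer_id = 12345"))          # candidate for predicate pushdown

df.explain(True)   # the physical plan typically lists the filters pushed to the ORC scan
print(df.count())
```

With ORC, stripes and 10k-row groups whose min/max values (or bloom filters) rule out customer_id = 12345 can be skipped entirely.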
02-21-2018 06:45 PM
1 Kudo
ORC is for more than Hive. It is a separate project now and has support from Spark and NiFi.
02-20-2018 06:38 PM
1 Kudo
Reading is still much faster than most formats. You're right that predicate pushdown based on the min/max values is much more effective when the data is sorted. Another thing you can use, if you often need to search with equality predicates, is bloom filters. They occupy additional space in the file, but can be a huge win when looking for particular values. For example, one customer has their purchase table sorted by time, but sometimes needs to find a particular customer's records quickly. A bloom filter on the customer column lets them find just the sets of 10k rows that contain that customer.
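As a rough sketch of how that can be set up when writing with Spark (assuming PySpark; the source table, column names, and output path are hypothetical), the ORC writer options request bloom filters on the customer column while the data stays sorted by time:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-bloom-sketch").getOrCreate()

(spark.table("purchases")                                  # hypothetical source table
      .sortWithinPartitions("purchase_time")               # keep the existing time ordering
      .write
      .option("orc.bloom.filter.columns", "customer_id")   # build bloom filters for this column
      .option("orc.bloom.filter.fpp", "0.05")              # acceptable false-positive rate
      .orc("/data/purchases_orc_bloom"))                   # hypothetical output path
```

Queries with an equality predicate on customer_id can then skip the 10k-row groups whose bloom filter says the value is absent.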
02-19-2018 10:46 PM
1 Kudo
The sum of the string columns is actually the sum of the lengths of the strings in the column. Stripes are the units of an ORC file that can be read independently. This stripe starts at byte offset 3, contains 6 rows of data, and the storage breaks down as:
* data: 58 bytes
* index: 67 bytes
* metadata: 49 bytes

The streams give you details about how each column is stored. The encodings tell you whether a dictionary or direct encoding was used. Both of your columns had all unique values, so they ended up with a direct encoding.
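If you want to poke at the same structure programmatically, here is a minimal sketch (assuming pyarrow is available; the file path is hypothetical) that reports the stripes, row count, and schema the dump is describing:

```python
import pyarrow.orc as po

f = po.ORCFile("/tmp/example.orc")   # hypothetical path
print(f.nstripes)                    # number of independently readable stripes
print(f.nrows)                       # total rows across all stripes
print(f.schema)                      # column names and types from the file footer
stripe0 = f.read_stripe(0)           # read just the first stripe as an Arrow table
print(stripe0.num_rows)
```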
03-21-2017 11:31 PM
1 Kudo
Actually, we have implemented an example of how to do it in the ORC project. In particular, ORC-150 added both a JSON schema discovery tool and a JSON-to-ORC converter. Given the single row of data above, the schema discovery tool produces the schema below. Obviously, given more data, it would produce a better schema. The JSON conversion tool uses a provided schema (or runs the schema discovery tool) to convert the data; a rough pyarrow sketch of the same conversion follows the schema.

struct<
eventHeader:struct<
eventOutcome:string,
eventType:string>,
eventPayload:struct<
Flag:boolean,
amts:array<struct<amt:tinyint,impPrcsnAmt:tinyint,type:string>>,
from:binary,
nr:binary,
to:binary>>
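Purely as an illustration of the same idea, here is a minimal sketch using pyarrow rather than the ORC project's converter (the file names are hypothetical): the JSON reader infers a schema from newline-delimited records, and the result is written back out as ORC.

```python
import pyarrow.json as pj
import pyarrow.orc as po

table = pj.read_json("events.jsonl")   # schema is inferred from the JSON records
print(table.schema)                    # the discovered schema, nested structs included
po.write_table(table, "events.orc")    # write the same rows out as an ORC file
```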
12-18-2015 06:19 PM
1 Kudo
Actually, it is Hive that doesn't support it. ORC files are self-describing, so if you read the file programmatically, the reader provides the schema. Unfortunately, Hive's SerDes are asked for the schema without being given the file path.
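For example, a minimal sketch (assuming pyarrow; the path is hypothetical) shows the reader handing back the schema straight from the file, with no Hive metastore involved:

```python
import pyarrow.orc as po

reader = po.ORCFile("/data/sample.orc")   # hypothetical path
print(reader.schema)                      # the schema stored in the ORC file footer
```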