Member since: 10-21-2015
9 Posts
8 Kudos Received
1 Solution
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 10074 | 02-19-2018 10:46 PM
02-26-2018 04:52 PM
Akshat, I need more information. Which version of the software are you using? Are you using the vectorized reader or the row-by-row reader? The vectorized reader is much faster. Does your query have any predicate pushdown, or is it a sum of the entire column?
02-21-2018 06:59 PM
2 Kudos
Only ORC and Parquet have the necessary features:
* Predicate pushdown, where a condition is checked against the metadata to see if the rows need to be read.
* Column projection, to only read the bytes for the necessary columns.

ORC can use predicate pushdown based on either:
* min and max for each column
* an optional bloom filter for looking for particular values

Parquet only has min/max. ORC can filter at the file level, stripe level, or 10k-row level. Parquet can only filter at the file level or stripe level.

The previous answer mentions some of Avro's properties that are shared by ORC and Parquet:
* They are both language neutral, with C++, Java, and other language implementations.
* They are both self-describing.
* They are both splittable when compressed.
* They both support schema evolution.
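To see the read side concretely, here is a minimal sketch (assuming PySpark; the path, table, and column names are hypothetical) of a query that benefits from both features: the select gives column projection, and the where clause is a candidate for predicate pushdown into the ORC reader.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-pushdown-sketch").getOrCreate()

df = (spark.read.orc("/data/purchases_orc")        # hypothetical ORC directory
           .select("customer_id", "amount")        # column projection: only these bytes are read
           .where("customer_id = 12345"))          # candidate for predicate pushdown

df.explain(True)   # the physical plan typically lists the filters pushed to the ORC scan
print(df.count())
```

With ORC, stripes and 10k-row groups whose min/max values (or bloom filters) rule out customer_id = 12345 can be skipped entirely.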
02-21-2018 06:45 PM
1 Kudo
ORC is for more than Hive. It is a separate project now and has support from Spark and NiFi.
02-20-2018 06:38 PM
1 Kudo
Reading is still much faster than most formats. You're right that predicate pushdown based on the min/max values is much more effective when the data is sorted. Another thing you can use, if you often need to search with equality predicates, is bloom filters. They occupy additional space in the file, but can be a huge win when looking for particular values. For example, one customer has their purchase table sorted by time, but sometimes needs to find a particular customer's records quickly. A bloom filter on the customer column lets them find just the sets of 10k rows that contain that customer.
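As a rough sketch of how that can be set up when writing with Spark (assuming PySpark; the source table, column names, and output path are hypothetical), the ORC writer options request bloom filters on the customer column while the data stays sorted by time:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-bloom-sketch").getOrCreate()

(spark.table("purchases")                                  # hypothetical source table
      .sortWithinPartitions("purchase_time")               # keep the existing time ordering
      .write
      .option("orc.bloom.filter.columns", "customer_id")   # build bloom filters for this column
      .option("orc.bloom.filter.fpp", "0.05")              # acceptable false-positive rate
      .orc("/data/purchases_orc_bloom"))                   # hypothetical output path
```

Queries with an equality predicate on customer_id can then skip the 10k-row groups whose bloom filter says the value is absent.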
02-19-2018 10:46 PM
1 Kudo
The sum of the string columns is actually the sum of the lengths of the strings in the column. Stripes are the units of an ORC file that can be read independently. This stripe starts at byte offset 3, contains 6 rows of data, and the storage breaks down as:
* data: 58 bytes
* index: 67 bytes
* metadata: 49 bytes

The streams give you details about how each column is stored. The encodings tell you whether a dictionary or direct encoding was used. Both of your columns had all unique values, so they ended up with a direct encoding.
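If you want to poke at the same structure programmatically, here is a minimal sketch (assuming pyarrow is available; the file path is hypothetical) that reports the stripes, row count, and schema the dump is describing:

```python
import pyarrow.orc as po

f = po.ORCFile("/tmp/example.orc")   # hypothetical path
print(f.nstripes)                    # number of independently readable stripes
print(f.nrows)                       # total rows across all stripes
print(f.schema)                      # column names and types from the file footer
stripe0 = f.read_stripe(0)           # read just the first stripe as an Arrow table
print(stripe0.num_rows)
```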
03-21-2017 11:31 PM
1 Kudo
Actually, we have implemented an example of how to do it in the ORC project. In particular, ORC-150 added both a JSON schema discovery tool and a JSON-to-ORC converter. Given the single row of data above, the schema discovery tool produces the schema below. Obviously, given more data, it would produce a better schema. The JSON conversion tool uses a provided schema (or runs the schema discovery tool) to convert the data; a rough pyarrow sketch of the same conversion follows the schema.

struct<
eventHeader:struct<
eventOutcome:string,
eventType:string>,
eventPayload:struct<
Flag:boolean,
amts:array<struct<amt:tinyint,impPrcsnAmt:tinyint,type:string>>,
from:binary,
nr:binary,
to:binary>>
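Purely as an illustration of the same idea, here is a minimal sketch using pyarrow rather than the ORC project's converter (the file names are hypothetical): the JSON reader infers a schema from newline-delimited records, and the result is written back out as ORC.

```python
import pyarrow.json as pj
import pyarrow.orc as po

table = pj.read_json("events.jsonl")   # schema is inferred from the JSON records
print(table.schema)                    # the discovered schema, nested structs included
po.write_table(table, "events.orc")    # write the same rows out as an ORC file
```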
12-18-2015 06:19 PM
1 Kudo
Actually, it is Hive that doesn't support it. ORC files are self-describing, so if you read the file programmatically, the reader provides the schema. Unfortunately, Hive's SerDes are asked for the schema without being given the file path.
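For example, a minimal sketch (assuming pyarrow; the path is hypothetical) shows the reader handing back the schema straight from the file, with no Hive metastore involved:

```python
import pyarrow.orc as po

reader = po.ORCFile("/data/sample.orc")   # hypothetical path
print(reader.schema)                      # the schema stored in the ORC file footer
```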