
Can someone explain the output of orcfiledump to me?

Contributor

My table test_orc contains (for one partition):

col1  col2  part1
abc   def   1
ghi   jkl   1
mno   pqr   1
koi   hai   1
jo    pgl   1
hai   tre   1

hive --orcfiledump /hive/user.db/test_orc/part1=1/000000_0 gives this output:

Structure for /hive/a0m01lf.db/test_orc/part1=1/000000_0
2018-02-18 22:10:24 INFO: org.apache.hadoop.hive.ql.io.orc.ReaderImpl - Reading ORC rows from /hive/a0m01lf.db/test_orc/part1=1/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
Rows: 6
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:string,_col1:string>

Stripe Statistics:
Stripe 1:
Column 0: count: 6
Column 1: count: 6 min: abc max: mno sum: 17
Column 2: count: 6 min: def max: tre sum: 18

File Statistics:
Column 0: count: 6
Column 1: count: 6 min: abc max: mno sum: 17
Column 2: count: 6 min: def max: tre sum: 18

Stripes:
Stripe: offset: 3 data: 58 rows: 6 tail: 49 index: 67
Stream: column 0 section ROW_INDEX start: 3 length 9
Stream: column 1 section ROW_INDEX start: 12 length 29
Stream: column 2 section ROW_INDEX start: 41 length 29
Stream: column 1 section DATA start: 70 length 20
Stream: column 1 section LENGTH start: 90 length 12
Stream: column 2 section DATA start: 102 length 21
Stream: column 2 section LENGTH start: 123 length 5
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
Encoding column 2: DIRECT_V2

I did not understand the Stripes part!

And how do they calculate the sum of a column of string values?

1 ACCEPTED SOLUTION

Cloudera Employee

The sum of the string columns is actually the sum of the lengths of the strings in the column.

Stripes are the units of an ORC file that can be read independently. This stripe starts at byte offset 3, contains 6 rows of data, and its storage breaks down as:

* data: 58 bytes

* index: 67 bytes

* metadata: 49 bytes

The streams give you details about how each column is stored. The encodings tell you whether a dictionary or direct encoding was used. Both of your columns had all unique values, so they ended up with a direct encoding.
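
The stream lengths account for those totals: the three ROW_INDEX streams (9 + 29 + 29 bytes) make up the 67-byte index, and the DATA and LENGTH streams (20 + 12 + 21 + 5 bytes) make up the 58 bytes of data. For a DIRECT_V2 string column, the DATA stream holds the concatenated string bytes and the LENGTH stream holds the length of each value.

You can also recompute the column statistics yourself; a minimal HiveQL sketch against the table from the question (the sum for col1 is the total of the string lengths: 3+3+3+3+2+3 = 17):

-- Recompute the min/max/sum that orcfiledump reports for the string columns.
-- For strings, "sum" is the total length, so SUM(LENGTH(...)) reproduces it.
SELECT MIN(col1), MAX(col1), SUM(LENGTH(col1)),
       MIN(col2), MAX(col2), SUM(LENGTH(col2))
FROM test_orc
WHERE part1 = 1;
-- expected: abc  mno  17  def  tre  18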


7 REPLIES

Contributor

Thanks for the reply, Owen. I have a doubt: these minimum and maximum values are used for skipping files and stripes, right? But since the data is not sorted, not many stripes or files will be skipped. So how do reads become significantly faster in ORC?

Cloudera Employee

Reading is still much faster than most formats.

You're right that predicate pushdown based on the min/max values is much more effective when the data is sorted.

Another thing you can use, if you often need to search with equality predicates, is bloom filters. They occupy additional space in the file but can be a huge win when looking for particular values. For example, one customer has their purchase table sorted by time but sometimes needs to find a particular customer's records quickly. A bloom filter on the customer column lets them read just the sets of 10k rows that contain that customer.
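
Bloom filters are declared per column when the table is created. A minimal sketch, assuming a Hive version with ORC bloom filter support (1.2 or later); the purchases table and its columns here are hypothetical stand-ins for the example above:

-- 'orc.bloom.filter.columns' and 'orc.bloom.filter.fpp' are the standard
-- ORC table properties; fpp is the desired false positive probability.
CREATE TABLE purchases (
  customer_id BIGINT,
  amount      DECIMAL(10,2),
  purchase_ts TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES (
  'orc.bloom.filter.columns' = 'customer_id',
  'orc.bloom.filter.fpp'     = '0.05'
);

The "sets of 10k rows" correspond to ORC's default row index stride of 10,000 rows: the bloom filter is checked per row group, so only the groups that may contain the requested customer are read.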

Contributor

Hey Owen, my file A was 130 GB in sequence file format and 78 GB in ORC+ZLIB format. Running a sum(columnA) query on the ORC+ZLIB version takes 11132 sec of cumulative CPU time versus 10858 sec on the sequence file version. In theory, ORC+ZLIB should compute the sum much faster than a sequence file. Is there a specific reason for this result?

Cloudera Employee

Akshat, I need more information.

Which version of the software are you using?

Are you using the vectorized reader or the row-by-row reader? The vectorized reader is much faster.

Does your query have any predicate pushdown or is it a sum of the entire column?
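
For reference, vectorized execution is a session setting; a minimal sketch, assuming Hive 0.13 or later (where hive.vectorized.execution.enabled was introduced) and a hypothetical ORC table holding columnA:

-- Enable the vectorized reader; ORC rows are then processed
-- in batches of about 1024 instead of one at a time.
SET hive.vectorized.execution.enabled = true;

SELECT SUM(columnA)
FROM my_orc_table   -- hypothetical table name
WHERE part1 = 1;    -- partition filter, as in the question above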

Contributor

I am using Hive 0.13.
I haven't tried turning on vectorization yet.
It was a sum of the entire column (of one partition in my table).

Contributor

@owen

My numbers of mappers and reducers are almost down to half in ORC for a query, and the bytes read from HDFS are also significantly reduced. But the time taken by the ORC query is still almost the same as the sequence file query.