<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Can someone explain me the output of orcfiledump? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-someone-explain-me-the-output-of-orcfiledump/m-p/220937#M74720</link>
    <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/514/owen.html" nodeid="514"&gt;@owen&lt;/A&gt; &lt;/P&gt;&lt;P&gt;For one query, the number of mappers and reducers is down to almost half with ORC, and the bytes read from HDFS are also significantly lower. Yet the ORC query still takes almost the same time as the sequence file query.&lt;/P&gt;</description>
    <pubDate>Sat, 03 Mar 2018 02:39:39 GMT</pubDate>
    <dc:creator>Hadoopy</dc:creator>
    <dc:date>2018-03-03T02:39:39Z</dc:date>
    <item>
      <title>Can someone explain me the output of orcfiledump?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-someone-explain-me-the-output-of-orcfiledump/m-p/220930#M74713</link>
      <description>&lt;P&gt;My table test_orc contains (for one partition):&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;col1 col2 part1&lt;/EM&gt;&lt;/STRONG&gt; .&lt;BR /&gt;abc def 1 .&lt;BR /&gt;ghi jkl 1 .&lt;BR /&gt;mno pqr 1 .&lt;BR /&gt;koi hai 1 .&lt;BR /&gt;jo pgl 1 .&lt;BR /&gt;hai tre 1 .&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;hive --orcfiledump /hive/user.db/test_orc/part1=1/000000_0 gives output&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Structure for /hive/a0m01lf.db/test_orc/part1=1/000000_0 .&lt;BR /&gt;2018-02-18 22:10:24 INFO: org.apache.hadoop.hive.ql.io.orc.ReaderImpl - Reading ORC rows from /hive/a0m01lf.db/test_orc/part1=1/000000_0 with {include: null, offset: 0, length: 9223372036854775807} .&lt;BR /&gt;Rows: 6 .&lt;BR /&gt;Compression: ZLIB .&lt;BR /&gt;Compression size: 262144 .&lt;BR /&gt;Type: struct&amp;lt;_col0:string,_col1:string&amp;gt; .&lt;/P&gt;&lt;P&gt;Stripe Statistics:&lt;BR /&gt;Stripe 1:&lt;BR /&gt;Column 0: count: 6 .&lt;BR /&gt;Column 1: count: 6 min: abc max: mno sum: 17 .&lt;BR /&gt;Column 2: count: 6 min: def max: tre sum: 18 .&lt;/P&gt;&lt;P&gt;File Statistics:&lt;BR /&gt;Column 0: count: 6 .&lt;BR /&gt;Column 1: count: 6 min: abc max: mno sum: 17 .&lt;BR /&gt;Column 2: count: 6 min: def max: tre sum: 18 .&lt;/P&gt;&lt;P&gt;Stripes:&lt;BR /&gt;Stripe: offset: 3 data: 58 rows: 6 tail: 49 index: 67 .&lt;BR /&gt;Stream: column 0 section ROW_INDEX start: 3 length 9 .&lt;BR /&gt;Stream: column 1 section ROW_INDEX start: 12 length 29 .&lt;BR /&gt;Stream: column 2 section ROW_INDEX start: 41 length 29 .&lt;BR /&gt;Stream: column 1 section DATA start: 70 length 20 .&lt;BR /&gt;Stream: column 1 section LENGTH start: 90 length 12 .&lt;BR /&gt;Stream: column 2 section DATA start: 102 length 21 .&lt;BR /&gt;Stream: column 2 section LENGTH start: 123 length 5 .&lt;BR /&gt;Encoding column 0: DIRECT .&lt;BR /&gt;Encoding column 1: DIRECT_V2 .&lt;BR /&gt;Encoding column 2: DIRECT_V2 .&lt;/P&gt;&lt;P&gt;I did not understand the Stripes 
part!&lt;/P&gt;&lt;P&gt;Also, how is the sum calculated for a column of string values?&lt;/P&gt;</description>
      <pubDate>Mon, 19 Feb 2018 14:34:50 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-someone-explain-me-the-output-of-orcfiledump/m-p/220930#M74713</guid>
      <dc:creator>Hadoopy</dc:creator>
      <dc:date>2018-02-19T14:34:50Z</dc:date>
    </item>
    <item>
      <title>Re: Can someone explain me the output of orcfiledump?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-someone-explain-me-the-output-of-orcfiledump/m-p/220931#M74714</link>
      <description>&lt;P&gt;For string columns, the sum is the total length of the strings in the column, not a sum of the values.&lt;/P&gt;&lt;P&gt;Stripes are the units of an ORC file that can be read independently. This stripe starts at byte offset 3, contains 6 rows of data, and its storage breaks down as:&lt;/P&gt;&lt;P&gt;* data: 58 bytes&lt;/P&gt;&lt;P&gt;* index: 67 bytes&lt;/P&gt;&lt;P&gt;* metadata (the stripe footer, shown as "tail" in the dump): 49 bytes&lt;/P&gt;&lt;P&gt;The streams give you details about how each column is stored. The encodings tell you whether a dictionary or direct encoding was used. Both of your columns had all unique values, so they ended up with a direct encoding.&lt;/P&gt;</description>
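The numbers in the answer above can be checked by hand. Here is a minimal sketch in Python that recomputes them from the values shown in the question's dump. The helper name string_stats and the streams dictionary are made up for illustration; this is not how ORC computes statistics internally, only the same arithmetic.

```python
# Values copied from the table and the orcfiledump output in the question.
col1 = ["abc", "ghi", "mno", "koi", "jo", "hai"]
col2 = ["def", "jkl", "pqr", "hai", "pgl", "tre"]


def string_stats(values):
    # Hypothetical helper: count, min, max, and total string length,
    # mirroring the fields printed for a string column in the dump.
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "sum": sum(len(v) for v in values),
    }


# Stream lengths in bytes for the single stripe, keyed by (column, section).
streams = {
    (0, "ROW_INDEX"): 9,
    (1, "ROW_INDEX"): 29,
    (2, "ROW_INDEX"): 29,
    (1, "DATA"): 20,
    (1, "LENGTH"): 12,
    (2, "DATA"): 21,
    (2, "LENGTH"): 5,
}

# "index" is the total of the ROW_INDEX streams; "data" is everything else.
index_bytes = sum(n for (_, kind), n in streams.items() if kind == "ROW_INDEX")
data_bytes = sum(n for (_, kind), n in streams.items() if kind != "ROW_INDEX")

print(string_stats(col1)["sum"])  # 17, matching "sum: 17" for Column 1
print(string_stats(col2)["sum"])  # 18, matching "sum: 18" for Column 2
print(index_bytes, data_bytes)    # 67 58, matching "index: 67" and "data: 58"
```

The stream start offsets in the dump follow the same arithmetic: each stream starts where the previous one ends, beginning at the stripe offset of 3.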
      <pubDate>Tue, 20 Feb 2018 06:46:17 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-someone-explain-me-the-output-of-orcfiledump/m-p/220931#M74714</guid>
      <dc:creator>owen1</dc:creator>
      <dc:date>2018-02-20T06:46:17Z</dc:date>
    </item>
    <item>
      <title>Re: Can someone explain me the output of orcfiledump?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-someone-explain-me-the-output-of-orcfiledump/m-p/220932#M74715</link>
      <description>&lt;P&gt;Thanks for the reply, Owen. I have a question: these minimum and maximum values are used for skipping files and stripes, right? But since the data is not sorted, not many stripes or files will be skipped. So how do reads become significantly faster in ORC?&lt;/P&gt;</description>
      <pubDate>Tue, 20 Feb 2018 19:27:15 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-someone-explain-me-the-output-of-orcfiledump/m-p/220932#M74715</guid>
      <dc:creator>Hadoopy</dc:creator>
      <dc:date>2018-02-20T19:27:15Z</dc:date>
    </item>
    <item>
      <title>Re: Can someone explain me the output of orcfiledump?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-someone-explain-me-the-output-of-orcfiledump/m-p/220933#M74716</link>
      <description>&lt;P&gt;Reading ORC is still much faster than reading most other formats.&lt;/P&gt;&lt;P&gt;You're right that predicate pushdown based on the min/max values is much more effective when the data is sorted.&lt;/P&gt;&lt;P&gt;Another thing you can use, if you often need to search with equality predicates, is bloom filters. They occupy additional space in the file, but can be a huge win when looking for particular values. For example, one customer has their purchase table sorted by time, but sometimes needs to find a particular customer's records quickly. A bloom filter on the customer column lets them find just the sets of 10k rows that have that customer in them.&lt;/P&gt;</description>
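To make the difference between min/max skipping and bloom filters concrete, here is a rough sketch in Python. It is a toy model, not ORC's actual implementation: the BloomFilter class, the filter sizes, the SHA-256 hashing scheme, and the three tiny row groups are all assumptions chosen for illustration (real row groups default to 10k rows, as mentioned above).

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, value):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = True

    def might_contain(self, value):
        return all(self.bits[pos] for pos in self._positions(value))


# Hypothetical row groups from a table sorted by time, so the customer
# ids inside each group are scattered across the whole id range.
row_groups = [
    ["cust_a", "cust_z", "cust_m"],
    ["cust_b", "cust_y", "cust_k"],
    ["cust_a", "cust_x", "cust_q"],
]


def min_max_candidates(groups, wanted):
    # A group can be skipped only if wanted lies outside its [min, max].
    return [i for i, g in enumerate(groups)
            if not (min(g) > wanted or wanted > max(g))]


def bloom_candidates(groups, wanted):
    # Build one filter per group, then keep groups that might hold wanted.
    filters = []
    for g in groups:
        bf = BloomFilter()
        for value in g:
            bf.add(value)
        filters.append(bf)
    return [i for i, bf in enumerate(filters) if bf.might_contain(wanted)]


print(min_max_candidates(row_groups, "cust_k"))  # [0, 1, 2]: nothing skipped
print(bloom_candidates(row_groups, "cust_k"))    # always includes group 1
```

With unsorted customer ids every group's min/max range covers the value, so min/max statistics cannot skip anything. The bloom filter is guaranteed to keep the group that really contains the value and will usually rule out the others, at the cost of the extra bits stored per group.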
      <pubDate>Wed, 21 Feb 2018 02:38:40 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-someone-explain-me-the-output-of-orcfiledump/m-p/220933#M74716</guid>
      <dc:creator>owen1</dc:creator>
      <dc:date>2018-02-21T02:38:40Z</dc:date>
    </item>
    <item>
      <title>Re: Can someone explain me the output of orcfiledump?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-someone-explain-me-the-output-of-orcfiledump/m-p/220934#M74717</link>
      <description>&lt;P&gt;Hey Owen, my file A was 130GB in sequence file format and 78GB in ORC+ZLIB format. Running a sum(columnA) query on the ORC+ZLIB data takes 11132 sec of cumulative CPU time, versus 10858 sec on the sequence file data. In theory, ORC+ZLIB should have calculated the sum much faster than the sequence file. Is there a specific reason for this result?&lt;/P&gt;</description>
      <pubDate>Thu, 22 Feb 2018 14:44:36 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-someone-explain-me-the-output-of-orcfiledump/m-p/220934#M74717</guid>
      <dc:creator>Hadoopy</dc:creator>
      <dc:date>2018-02-22T14:44:36Z</dc:date>
    </item>
    <item>
      <title>Re: Can someone explain me the output of orcfiledump?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-someone-explain-me-the-output-of-orcfiledump/m-p/220935#M74718</link>
      <description>&lt;P&gt;Akshat, I need more information. &lt;/P&gt;&lt;P&gt;Which version of the software are you using? &lt;/P&gt;&lt;P&gt;Are you using the vectorized reader or the row by row reader? The vectorized reader is much faster.&lt;/P&gt;&lt;P&gt;Does your query have any predicate pushdown or is it a sum of the entire column?&lt;/P&gt;</description>
      <pubDate>Tue, 27 Feb 2018 00:52:11 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-someone-explain-me-the-output-of-orcfiledump/m-p/220935#M74718</guid>
      <dc:creator>owen1</dc:creator>
      <dc:date>2018-02-27T00:52:11Z</dc:date>
    </item>
    <item>
      <title>Re: Can someone explain me the output of orcfiledump?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-someone-explain-me-the-output-of-orcfiledump/m-p/220936#M74719</link>
      <description>&lt;P&gt;I am using Hive 0.13.&lt;BR /&gt;I haven't tried turning on vectorization yet.&lt;BR /&gt;It was a sum of the entire column (for one partition of my table).&lt;/P&gt;</description>
      <pubDate>Wed, 28 Feb 2018 02:45:17 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-someone-explain-me-the-output-of-orcfiledump/m-p/220936#M74719</guid>
      <dc:creator>Hadoopy</dc:creator>
      <dc:date>2018-02-28T02:45:17Z</dc:date>
    </item>
    <item>
      <title>Re: Can someone explain me the output of orcfiledump?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-someone-explain-me-the-output-of-orcfiledump/m-p/220937#M74720</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/514/owen.html" nodeid="514"&gt;@owen&lt;/A&gt; &lt;/P&gt;&lt;P&gt;For one query, the number of mappers and reducers is down to almost half with ORC, and the bytes read from HDFS are also significantly lower. Yet the ORC query still takes almost the same time as the sequence file query.&lt;/P&gt;</description>
      <pubDate>Sat, 03 Mar 2018 02:39:39 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-someone-explain-me-the-output-of-orcfiledump/m-p/220937#M74720</guid>
      <dc:creator>Hadoopy</dc:creator>
      <dc:date>2018-03-03T02:39:39Z</dc:date>
    </item>
  </channel>
</rss>

