<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: puthiveQL is very slow in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/puthiveQL-is-very-slow/m-p/185722#M83285</link>
    <description>&lt;P&gt;The problem here could be that the external table isn't structured to make the filter/split of this file optimal, for example " WHERE department = 'xxx' AND time='yyyy';" executed against a non-partitioned external table causes a complete file scan of the 10gb for each statement (so you're reading the 10GB entirely every time)&lt;/P&gt;&lt;P&gt;You may want to read the file into nifi flow file with a configured buffer as actual data in stead of taking the external table approach, alternatively, you can use an intermediate orc table that inserts the entire external file in some sort of sorted manner, before splitting it into multiple tables based on some filter (which you would optimize for in your intermediate table structure), I'd personally recommend the first approach though&lt;/P&gt;</description>
    <pubDate>Sat, 08 Sep 2018 02:52:55 GMT</pubDate>
    <dc:creator>rtheron</dc:creator>
    <dc:date>2018-09-08T02:52:55Z</dc:date>
    <item>
      <title>puthiveQL is very slow</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/puthiveQL-is-very-slow/m-p/185721#M83284</link>
      <description>&lt;P&gt;Hi All,&lt;/P&gt;&lt;P&gt;I have a 10GB file every minute coming to a location (/dir), and there is an external table for that location.&lt;/P&gt;&lt;P&gt;The file is as below&lt;/P&gt;&lt;PRE&gt;karlon,n_d_1,26,6234,2019-09-08,1536278400
d'lov,research,20,1001,2019-09-08,1536278400
kris'a,b_x_3,20,4532,2019-09-08,1536278400&lt;/PRE&gt;&lt;P&gt;external table name: ex_t&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;&lt;STRONG&gt;name&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;department&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;age&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;id&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;date&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;time&lt;/STRONG&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;karlon&lt;/TD&gt;&lt;TD&gt;n_d_1&lt;/TD&gt;&lt;TD&gt;26&lt;/TD&gt;&lt;TD&gt;6234&lt;/TD&gt;&lt;TD&gt;2019-09-08&lt;/TD&gt;&lt;TD&gt;1536278400&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;d'lov&lt;/TD&gt;&lt;TD&gt;research&lt;/TD&gt;&lt;TD&gt;20&lt;/TD&gt;&lt;TD&gt;1001&lt;/TD&gt;&lt;TD&gt;2018-09-08&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;1536278400&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;I have puthiveql processor in my flow which gets data from external table and inserts in to multiple ORC table.&lt;/P&gt;&lt;P&gt;ORC : table_1, table_2, table_3,table_4,table_5, table_6&lt;/P&gt;&lt;P&gt;Every table(orc table) has same columns.&lt;/P&gt;&lt;P&gt;name(string),department (string),age (int),id (int),date (string),partition_value (int)&lt;/P&gt;&lt;P&gt;The puthiveql processor has multiple insert queries in it.&lt;/P&gt;&lt;PRE&gt;INSERT INTO table_1 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'research' AND time='1536278400';

INSERT INTO table_2 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'n_d_1' AND time='1536278400';

INSERT INTO table_3 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'b_x_3' AND time='1536278400';

INSERT INTO table_4 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'research' AND time='1536278400';

INSERT INTO table_5 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'x_in_1' AND time='1536278400';

INSERT INTO table_6 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'z_e_3' AND time='1536278400';&lt;/PRE&gt;&lt;P&gt;The above is sent as a flowfile to puthiveql, which is scheduled every minute,  as the file arrives every minute.&lt;/P&gt;&lt;P&gt;Puthiveql is very slow process the above and the inserts are not happening frequently.&lt;/P&gt;&lt;P&gt;Can you please suggest how to improve the performance of the puthiveql, I have increased the concurrent processor but it did not help, some times the flowfiles(which have insert statements) get queued and never execute.&lt;/P&gt;&lt;P&gt;Suggestions are highly appreciated.&lt;/P&gt;</description>
      <pubDate>Fri, 07 Sep 2018 20:35:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/puthiveQL-is-very-slow/m-p/185721#M83284</guid>
      <dc:creator>mark_hadoop</dc:creator>
      <dc:date>2018-09-07T20:35:10Z</dc:date>
    </item>
    <item>
      <title>Re: puthiveQL is very slow</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/puthiveQL-is-very-slow/m-p/185722#M83285</link>
      <description>&lt;P&gt;The problem here could be that the external table isn't structured to make the filter/split of this file optimal, for example " WHERE department = 'xxx' AND time='yyyy';" executed against a non-partitioned external table causes a complete file scan of the 10gb for each statement (so you're reading the 10GB entirely every time)&lt;/P&gt;&lt;P&gt;You may want to read the file into nifi flow file with a configured buffer as actual data in stead of taking the external table approach, alternatively, you can use an intermediate orc table that inserts the entire external file in some sort of sorted manner, before splitting it into multiple tables based on some filter (which you would optimize for in your intermediate table structure), I'd personally recommend the first approach though&lt;/P&gt;</description>
      <pubDate>Sat, 08 Sep 2018 02:52:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/puthiveQL-is-very-slow/m-p/185722#M83285</guid>
      <dc:creator>rtheron</dc:creator>
      <dc:date>2018-09-08T02:52:55Z</dc:date>
    </item>
    <item>
      <title>Re: puthiveQL is very slow</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/puthiveQL-is-very-slow/m-p/185723#M83286</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/87360/rtheron.html" nodeid="87360"&gt;@rtheron&lt;/A&gt; &lt;/P&gt;&lt;P&gt;for some reason I cannot follow the first approach.&lt;/P&gt;&lt;P&gt;I tried creating an intermediate orc with partitions and loaded the data in to it from external table.&lt;/P&gt;&lt;P&gt;now when I load in to the destination from the intermediate table, puthiveql is taking a lot of time.&lt;/P&gt;&lt;P&gt;any suggestions are appreciated.&lt;/P&gt;</description>
      <pubDate>Wed, 12 Sep 2018 20:57:32 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/puthiveQL-is-very-slow/m-p/185723#M83286</guid>
      <dc:creator>mark_hadoop</dc:creator>
      <dc:date>2018-09-12T20:57:32Z</dc:date>
    </item>
  </channel>
</rss>

