SelectHiveQL properties (Fetch Size, Maximum Number of Fragments)

Rising Star

Can anybody explain the Fetch Size and Maximum Number of Fragments properties, with examples?

1 ACCEPTED SOLUTION

Master Guru

Fetch Size is a hint to the driver that says "when I ask for a row, please send X rows over; if I need more rows, I will iterate through them before asking you to send more". If Fetch Size were 1, the driver would send a network request to the DB, the DB would send a single row back, the client would process it, then another network request would be sent for the next row, and so on. This is very inefficient, as the network overhead would overtake the amount of data being passed (especially for tables with a small number of rows). On the other hand, if the entire table were sent on the first request, the client could run out of memory, especially for tables with a very large number of rows. As a tuning parameter, many drivers let the client specify the Fetch Size in order to select an appropriate number of rows to send per network request/response. For example, I think the Hive driver's default is 50 rows, or at least it used to be.
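To make the hint concrete, here is a minimal plain-JDBC sketch (the same API the Hive driver implements under the hood); the connection URL and table name are placeholders, not anything specific to your environment:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FetchSizeExample {
    public static void main(String[] args) throws Exception {
        // Placeholder HiveServer2 URL and table name, for illustration only
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {

            // Hint to the driver: transfer 1000 rows per network round trip.
            // The driver is free to ignore or cap this value.
            stmt.setFetchSize(1000);

            try (ResultSet rs = stmt.executeQuery("SELECT * FROM my_table")) {
                while (rs.next()) {
                    // A new network request happens only when the current
                    // batch of fetched rows has been exhausted.
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}
```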

Maximum Number of Fragments is an advanced user property, usually only used for debugging. It basically stops fetching rows after a certain number of flow files have been produced, even if the result set has more rows available. For example, if Max Rows Per Flow File were set to 100, Maximum Number of Fragments were set to 1, and a query returned 101 rows, then only 1 flow file (aka "fragment" in this case) would be sent, containing the first 100 rows. The 101st row would not be processed or sent in any flow file. Re-running the processor would have the exact same effect, and thus the 101st row is "unreachable". Maximum Number of Fragments was added to let the user avoid memory problems for large tables, as it puts an upper bound on the number of rows that will be processed. As mentioned, certain rows would not be available, but the processor wouldn't run out of memory either.
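The counting in that example can be sketched as a toy loop. To be clear, this is not the actual processor code, just a model of the behavior described above:

```java
import java.util.ArrayList;
import java.util.List;

public class FragmentCapExample {
    public static void main(String[] args) {
        int maxRowsPerFlowFile = 100;   // Max Rows Per Flow File
        int maxNumberOfFragments = 1;   // Maximum Number of Fragments
        int totalRows = 101;            // rows the query returns

        List<List<Integer>> flowFiles = new ArrayList<>();
        List<Integer> current = new ArrayList<>();

        for (int row = 1; row <= totalRows; row++) {
            if (flowFiles.size() >= maxNumberOfFragments) {
                // Fragment cap reached: remaining rows are never fetched
                break;
            }
            current.add(row);
            if (current.size() == maxRowsPerFlowFile) {
                flowFiles.add(current);
                current = new ArrayList<>();
            }
        }
        // A partial last fragment would also count against the cap
        if (!current.isEmpty() && flowFiles.size() < maxNumberOfFragments) {
            flowFiles.add(current);
        }

        // Prints: 1 flow file containing 100 rows; row 101 is unreachable
        System.out.println("Flow files: " + flowFiles.size());
        System.out.println("Rows in flow file 1: " + flowFiles.get(0).size());
    }
}
```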

Hope this helps; please let me know if you'd like more clarification.

Rising Star

@Matt Burgess Thank you for the clarification! Cheers