Created 10-09-2017 09:37 PM
From a table that has more than 3.9 Billion rows, I performed select count(*) and a select statement with a where condition as mentioned below, but this does not have an entry in yarn logs.
It should kick off a MR job and I should see some entries in logs, but I do not see so. Want to confirm if the mentioned 2 operations do kick of a MR job or NOT?
Select count(*) from xyz; 3989090043;
Select * from xyz where id=2253998123456607;
Thank you for your help in advance!
Created 10-09-2017 10:10 PM
Try as below
Select count(*) from xyz limit 2; and see if that solves your problem.
Technically it gathers information from table status, it doesn't trigger any MR job unless there is real aggrigation.
Created 10-11-2017 03:17 PM
Sridhar,
Can you please check my comment.
Created 10-09-2017 10:13 PM
@Lakhman Pervatoju The simple answer is Yes.
But, if you are using LLAP(im memory processing) then you may not see a new job in RM because it uses already running LLAP jobs, which keeps running in memory. If you are using Ambari View 2 then it runs a job first time and 2nd time it gets data from cache.
Created 10-10-2017 03:42 PM
@Sridhar Reddy
Select count(*) from xyz limit 2;
This did kick off a job and made an entry in yarn logs.
From table status may be the first query can be processed with out triggering any jobs "select count(*)"
But why even while filtering a table by a where clause (in the 2nd query) does not trigger any MR jobs?
Created 10-10-2017 03:43 PM
@Manish
We do not have LLAP configured
Created 10-12-2017 05:07 AM
The reason for a MR job not being triggered while filtering a table is because for simple queries Hive does a direct fetch operation from HDFS which is much more faster/efficient than triggering a MR job. This is controlled by the property hive.fetch.task.conversion
Below is a detailed description of this property:
hive.fetch.task.conversion Default Value: minimal in Hive 0.10.0 through 0.13.1, more in Hive 0.14.0 and laterAdded In: Hive 0.10.0 with HIVE-2925; default changed in Hive 0.14.0 with HIVE-7397Some select queries can be converted to a single FETCH task, minimizing latency. Currently the query should be single sourced not having any subquery and should not have any aggregations or distincts (which incur RS – ReduceSinkOperator, requiring a MapReduce task), lateral views and joins. Supported values are none, minimal and more. 0. none: Disable hive.fetch.task.conversion (value added in Hive 0.14.0 with HIVE-8389) 1. minimal: SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only 2. more: SELECT, FILTER, LIMIT only (including TABLESAMPLE, virtual columns) "more" can take any kind of expressions in the SELECT clause, including UDFs. (UDTFs and lateral views are not yet supported – see HIVE-5718.)