Created 05-30-2017 06:53 PM
Trying to run query:
select profile_id,count(*) cnt from mydb group by profile_id order by cnt desc limit 100;
Getting Vertex_Failure due to null pointer.
/etc/hosts looks fine. I can connect from one node to the other on various ports. They have a lot of RAM, and Disk available at this point. I have increased a bunch of buffers in Hive and some other settings from searching the Internet. Still no resolution.
I tried query with TEZ disabled and ran into errors.
2017-05-30 14:27:32,270 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|: [HISTORY][DAG:dag_1496156002153_0004_2][Event:TASK_ATTEMPT_FINISHED]: vertexName=Reducer 2, taskAttemptId=attempt_1496156002153_0004_2_01_000000_1, creationTime=1496168846938, allocationTime=1496168846940, startTime=1496168847079, finishTime=1496168852270, timeTaken=5191, status=FAILED, errorEnum=INPUT_READ_ERROR, diagnostics=Error: exceptionThrown=org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in fetcher {Map_1} #14 at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:382) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:334) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.NullPointerException at sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:693) at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1567) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474) at org.apache.tez.runtime.library.common.shuffle.HttpConnection.getInputStream(HttpConnection.java:253) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:356) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:264) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:176) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run(FetcherOrderedGrouped.java:191) , errorMessage=Shuffle Runner Failed:org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in fetcher {Map_1} #14 at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:382) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:334) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.NullPointerException at sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:693) at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1567) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474) at org.apache.tez.runtime.library.common.shuffle.HttpConnection.getInputStream(HttpConnection.java:253) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:356) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:264) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:176) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run(FetcherOrderedGrouped.java:191) , nodeHttpAddress=datalake.node.com:8042, counters=Counters: 60, File System Counters, FILE_BYTES_READ=1978141, org.apache.tez.common.counters.TaskCounter, REDUCE_INPUT_GROUPS=0, REDUCE_INPUT_RECORDS=0, COMBINE_INPUT_RECORDS=0, SPILLED_RECORDS=0, NUM_SHUFFLED_INPUTS=18, NUM_SKIPPED_INPUTS=0, NUM_FAILED_SHUFFLE_INPUTS=0, MERGED_MAP_OUTPUTS=18, GC_TIME_MILLIS=0, CPU_MILLISECONDS=750, PHYSICAL_MEMORY_BYTES=8445231104, VIRTUAL_MEMORY_BYTES=11002957824, COMMITTED_HEAP_BYTES=8445231104, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, ADDITIONAL_SPILLS_BYTES_READ=72, SHUFFLE_BYTES=1594950, SHUFFLE_BYTES_DECOMPRESSED=3398576, SHUFFLE_BYTES_TO_MEM=0, SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_DISK_DIRECT=1594950, NUM_MEM_TO_DISK_MERGES=0, NUM_DISK_TO_DISK_MERGES=0, SHUFFLE_PHASE_TIME=23, MERGE_PHASE_TIME=5060, FIRST_EVENT_RECEIVED=12, LAST_EVENT_RECEIVED=17, Shuffle Errors, BAD_ID=0, CONNECTION=0, IO_ERROR=0, WRONG_LENGTH=0, WRONG_MAP=0, WRONG_REDUCE=0, Shuffle Errors_Reducer_2_INPUT_Map_1, BAD_ID=0, CONNECTION=0, IO_ERROR=0, WRONG_LENGTH=0, WRONG_MAP=0, WRONG_REDUCE=0, TaskCounter_Reducer_2_INPUT_Map_1, ADDITIONAL_SPILLS_BYTES_READ=72, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, COMBINE_INPUT_RECORDS=0, FIRST_EVENT_RECEIVED=12, LAST_EVENT_RECEIVED=17, MERGED_MAP_OUTPUTS=18, MERGE_PHASE_TIME=5060, NUM_DISK_TO_DISK_MERGES=0, NUM_FAILED_SHUFFLE_INPUTS=0, NUM_MEM_TO_DISK_MERGES=0, NUM_SHUFFLED_INPUTS=18, NUM_SKIPPED_INPUTS=0, REDUCE_INPUT_GROUPS=0, REDUCE_INPUT_RECORDS=0, SHUFFLE_BYTES=1594950, SHUFFLE_BYTES_DECOMPRESSED=3398576, SHUFFLE_BYTES_DISK_DIRECT=1594950, SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=0, SHUFFLE_PHASE_TIME=23, SPILLED_RECORDS=0 2017-05-30 14:27:32,271 [INFO] [Dispatcher thread {Central}] |impl.TaskImpl|: Scheduling new attempt for task: task_1496156002153_0004_2_01_000000, currentFailedAttempts: 2, maxFailedAttempts: 4
Created 05-31-2017 07:35 AM
Could you try running the query after setting property tez.runtime.shuffle.ssl.enable=false and let us know the result?
Created 05-31-2017 02:29 PM
That did not change the issue. It is inconsistent. I ran this 3 times and it failed 2 of the three times. The current failure came out on the console as:
hive> select profile_id,count(*) cnt from webordersbase group by profile_id order by cnt desc limit 100 ; Query ID = my_user_20170531102337_ddee6338-5e67-496c-b2af-3e380288396e Total jobs = 1 Launching Job 1 out of 1 Status: Running (Executing on YARN cluster with App id application_1496156002153_0008) -------------------------------------------------------------------------------- VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED -------------------------------------------------------------------------------- Map 1 .......... SUCCEEDED 9 9 0 0 5 0 Reducer 2 RUNNING 1 0 0 1 4 0 Reducer 3 INITED 1 0 0 1 0 0 -------------------------------------------------------------------------------- VERTICES: 01/03 [=====================>>-----] 81% ELAPSED TIME: 31.84 s -------------------------------------------------------------------------------- Status: Failed Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_3_00 Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_3_00 Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_3_00 Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_3_00 Vertex re-running, vertexName=Reducer 2, vertexId=vertex_1496156002153_0008_3_01 Vertex re-running, vertexName=Reducer 2, vertexId=vertex_1496156002153_0008_3_01 Vertex re-running, vertexName=Reducer 2, vertexId=vertex_1496156002153_0008_3_01 Vertex failed, vertexName=Reducer 2, vertexId=vertex_1496156002153_0008_3_01, diagnostics=[Task failed, taskId=task_1496156002153_0008_3_01_000000, diagnostics=[TaskAttempt 0 failed, info=[attempt_1496156002153_0008_3_01_000000_0 being failed for too many output errors. failureFraction=1.0, MAX_ALLOWED_OUTPUT_FAILURES_FRACTION=0.1, uniquefailedOutputReports=1, MAX_ALLOWED_OUTPUT_FAILURES=10, MAX_ALLOWED_TIME_FOR_TASK_READ_ERROR_SEC=300, readErrorTimespan=0], TaskAttempt 1 failed, info=[attempt_1496156002153_0008_3_01_000000_1 being failed for too many output errors. failureFraction=1.0, MAX_ALLOWED_OUTPUT_FAILURES_FRACTION=0.1, uniquefailedOutputReports=1, MAX_ALLOWED_OUTPUT_FAILURES=10, MAX_ALLOWED_TIME_FOR_TASK_READ_ERROR_SEC=300, readErrorTimespan=0], TaskAttempt 2 failed, info=[attempt_1496156002153_0008_3_01_000000_2 being failed for too many output errors. failureFraction=1.0, MAX_ALLOWED_OUTPUT_FAILURES_FRACTION=0.1, uniquefailedOutputReports=1, MAX_ALLOWED_OUTPUT_FAILURES=10, MAX_ALLOWED_TIME_FOR_TASK_READ_ERROR_SEC=300, readErrorTimespan=0], TaskAttempt 3 failed, info=[Error: exceptionThrown=org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in fetcher {Map_1} #5 at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:382) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:334) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.NullPointerException at sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:693) at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1567) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474) at org.apache.tez.runtime.library.common.shuffle.HttpConnection.getInputStream(HttpConnection.java:253) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:356) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:264) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:176) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run(FetcherOrderedGrouped.java:191) , errorMessage=Shuffle Runner Failed:org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in fetcher {Map_1} #5 at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:382) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:334) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.NullPointerException ...
Created 05-31-2017 02:34 PM
Both reports are with SSL enabled=false. Set in the Custom tez-site section.
Sometimes the error reports differently:
-------------------------------------------------------------------------------- VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED -------------------------------------------------------------------------------- Map 1 .......... SUCCEEDED 9 9 0 0 7 0 Reducer 2 ...... SUCCEEDED 1 1 0 0 4 0 Reducer 3 RUNNING 1 0 1 0 0 0 -------------------------------------------------------------------------------- VERTICES: 02/03 [=======================>>---] 90% ELAPSED TIME: 35.85 s -------------------------------------------------------------------------------- Status: Failed Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00 Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00 Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00 Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00 Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00 Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00 Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00 Vertex re-running, vertexName=Reducer 2, vertexId=vertex_1496156002153_0008_4_01 Vertex re-running, vertexName=Reducer 2, vertexId=vertex_1496156002153_0008_4_01 Vertex re-running, vertexName=Reducer 2, vertexId=vertex_1496156002153_0008_4_01 Vertex failed, vertexName=Reducer 2, vertexId=vertex_1496156002153_0008_4_01, diagnostics=[Vertex vertex_1496156002153_0008_4_01 [Reducer 2] killed/failed due to:OWN_TASK_FAILURE, Vertex vertex_1496156002153_0008_4_01 [Reducer 2] failed as task task_1496156002153_0008_4_01_000000 failed after vertex succeeded.] DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0 FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00Vertex re-running, vertexName=Reducer 2, vertexId=vertex_1496156002153_0008_4_01Vertex re-running, vertexName=Reducer 2, vertexId=vertex_1496156002153_0008_4_01Vertex re-running, vertexName=Reducer 2, vertexId=vertex_1496156002153_0008_4_01Vertex failed, vertexName=Reducer 2, vertexId=vertex_1496156002153_0008_4_01, diagnostics=[Vertex vertex_1496156002153_0008_4_01 [Reducer 2] killed/failed due to:OWN_TASK_FAILURE, Vertex vertex_1496156002153_0008_4_01 [Reducer 2] failed as task task_1496156002153_0008_4_01_000000 failed after vertex succeeded.]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
Created 06-01-2017 09:05 PM
And the solution appears to be we had some files
.../hadoop/yarn/local/usercache/hive/filecache
with odd permissions. Flushing that directory on the data nodes and the query runs.