Support Questions

Find answers, ask questions, and share your expertise

Receiving: DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0

avatar

Trying to run query:

select profile_id,count(*) cnt from mydb group by profile_id order by cnt desc limit 100;

Getting Vertex_Failure due to null pointer.

/etc/hosts looks fine. I can connect from one node to the other on various ports. They have a lot of RAM, and Disk available at this point. I have increased a bunch of buffers in Hive and some other settings from searching the Internet. Still no resolution.

I tried query with TEZ disabled and ran into errors.

2017-05-30 14:27:32,270 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|: [HISTORY][DAG:dag_1496156002153_0004_2][Event:TASK_ATTEMPT_FINISHED]: vertexName=Reducer 2, taskAttemptId=attempt_1496156002153_0004_2_01_000000_1, creationTime=1496168846938, allocationTime=1496168846940, startTime=1496168847079, finishTime=1496168852270, timeTaken=5191, status=FAILED, errorEnum=INPUT_READ_ERROR, diagnostics=Error: exceptionThrown=org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in fetcher {Map_1} #14
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:382)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:334)
	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
	at sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:693)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1567)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
	at org.apache.tez.runtime.library.common.shuffle.HttpConnection.getInputStream(HttpConnection.java:253)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:356)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:264)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:176)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run(FetcherOrderedGrouped.java:191)
, errorMessage=Shuffle Runner Failed:org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in fetcher {Map_1} #14
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:382)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:334)
	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
	at sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:693)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1567)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
	at org.apache.tez.runtime.library.common.shuffle.HttpConnection.getInputStream(HttpConnection.java:253)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:356)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:264)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:176)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run(FetcherOrderedGrouped.java:191)
, nodeHttpAddress=datalake.node.com:8042, counters=Counters: 60, File System Counters, FILE_BYTES_READ=1978141, org.apache.tez.common.counters.TaskCounter, REDUCE_INPUT_GROUPS=0, REDUCE_INPUT_RECORDS=0, COMBINE_INPUT_RECORDS=0, SPILLED_RECORDS=0, NUM_SHUFFLED_INPUTS=18, NUM_SKIPPED_INPUTS=0, NUM_FAILED_SHUFFLE_INPUTS=0, MERGED_MAP_OUTPUTS=18, GC_TIME_MILLIS=0, CPU_MILLISECONDS=750, PHYSICAL_MEMORY_BYTES=8445231104, VIRTUAL_MEMORY_BYTES=11002957824, COMMITTED_HEAP_BYTES=8445231104, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, ADDITIONAL_SPILLS_BYTES_READ=72, SHUFFLE_BYTES=1594950, SHUFFLE_BYTES_DECOMPRESSED=3398576, SHUFFLE_BYTES_TO_MEM=0, SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_DISK_DIRECT=1594950, NUM_MEM_TO_DISK_MERGES=0, NUM_DISK_TO_DISK_MERGES=0, SHUFFLE_PHASE_TIME=23, MERGE_PHASE_TIME=5060, FIRST_EVENT_RECEIVED=12, LAST_EVENT_RECEIVED=17, Shuffle Errors, BAD_ID=0, CONNECTION=0, IO_ERROR=0, WRONG_LENGTH=0, WRONG_MAP=0, WRONG_REDUCE=0, Shuffle Errors_Reducer_2_INPUT_Map_1, BAD_ID=0, CONNECTION=0, IO_ERROR=0, WRONG_LENGTH=0, WRONG_MAP=0, WRONG_REDUCE=0, TaskCounter_Reducer_2_INPUT_Map_1, ADDITIONAL_SPILLS_BYTES_READ=72, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, COMBINE_INPUT_RECORDS=0, FIRST_EVENT_RECEIVED=12, LAST_EVENT_RECEIVED=17, MERGED_MAP_OUTPUTS=18, MERGE_PHASE_TIME=5060, NUM_DISK_TO_DISK_MERGES=0, NUM_FAILED_SHUFFLE_INPUTS=0, NUM_MEM_TO_DISK_MERGES=0, NUM_SHUFFLED_INPUTS=18, NUM_SKIPPED_INPUTS=0, REDUCE_INPUT_GROUPS=0, REDUCE_INPUT_RECORDS=0, SHUFFLE_BYTES=1594950, SHUFFLE_BYTES_DECOMPRESSED=3398576, SHUFFLE_BYTES_DISK_DIRECT=1594950, SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=0, SHUFFLE_PHASE_TIME=23, SPILLED_RECORDS=0
2017-05-30 14:27:32,271 [INFO] [Dispatcher thread {Central}] |impl.TaskImpl|: Scheduling new attempt for task: task_1496156002153_0004_2_01_000000, currentFailedAttempts: 2, maxFailedAttempts: 4

4 REPLIES 4

avatar
@Michael Grabenstein

Could you try running the query after setting property tez.runtime.shuffle.ssl.enable=false and let us know the result?

avatar

That did not change the issue. It is inconsistent. I ran this 3 times and it failed 2 of the three times. The current failure came out on the console as:

hive> select profile_id,count(*) cnt from webordersbase group by profile_id order by cnt desc limit 100 ;
Query ID = my_user_20170531102337_ddee6338-5e67-496c-b2af-3e380288396e
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1496156002153_0008)


--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      9          9        0        0       5       0
Reducer 2            RUNNING      1          0        0        1       4       0
Reducer 3             INITED      1          0        0        1       0       0
--------------------------------------------------------------------------------
VERTICES: 01/03  [=====================>>-----] 81%   ELAPSED TIME: 31.84 s
--------------------------------------------------------------------------------
Status: Failed
Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_3_00
Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_3_00
Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_3_00
Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_3_00
Vertex re-running, vertexName=Reducer 2, vertexId=vertex_1496156002153_0008_3_01
Vertex re-running, vertexName=Reducer 2, vertexId=vertex_1496156002153_0008_3_01
Vertex re-running, vertexName=Reducer 2, vertexId=vertex_1496156002153_0008_3_01
Vertex failed, vertexName=Reducer 2, vertexId=vertex_1496156002153_0008_3_01, diagnostics=[Task failed, taskId=task_1496156002153_0008_3_01_000000, diagnostics=[TaskAttempt 0 failed, info=[attempt_1496156002153_0008_3_01_000000_0 being failed for too many output errors. failureFraction=1.0, MAX_ALLOWED_OUTPUT_FAILURES_FRACTION=0.1, uniquefailedOutputReports=1, MAX_ALLOWED_OUTPUT_FAILURES=10, MAX_ALLOWED_TIME_FOR_TASK_READ_ERROR_SEC=300, readErrorTimespan=0], TaskAttempt 1 failed, info=[attempt_1496156002153_0008_3_01_000000_1 being failed for too many output errors. failureFraction=1.0, MAX_ALLOWED_OUTPUT_FAILURES_FRACTION=0.1, uniquefailedOutputReports=1, MAX_ALLOWED_OUTPUT_FAILURES=10, MAX_ALLOWED_TIME_FOR_TASK_READ_ERROR_SEC=300, readErrorTimespan=0], TaskAttempt 2 failed, info=[attempt_1496156002153_0008_3_01_000000_2 being failed for too many output errors. failureFraction=1.0, MAX_ALLOWED_OUTPUT_FAILURES_FRACTION=0.1, uniquefailedOutputReports=1, MAX_ALLOWED_OUTPUT_FAILURES=10, MAX_ALLOWED_TIME_FOR_TASK_READ_ERROR_SEC=300, readErrorTimespan=0], TaskAttempt 3 failed, info=[Error: exceptionThrown=org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in fetcher {Map_1} #5
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:382)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:334)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
        at sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:693)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1567)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
        at org.apache.tez.runtime.library.common.shuffle.HttpConnection.getInputStream(HttpConnection.java:253)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:356)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:264)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:176)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run(FetcherOrderedGrouped.java:191)
, errorMessage=Shuffle Runner Failed:org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in fetcher {Map_1} #5
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:382)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:334)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
...

avatar

Both reports are with SSL enabled=false. Set in the Custom tez-site section.

Sometimes the error reports differently:

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      9          9        0        0       7       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       4       0
Reducer 3            RUNNING      1          0        1        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/03  [=======================>>---] 90%   ELAPSED TIME: 35.85 s
--------------------------------------------------------------------------------
Status: Failed
Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00
Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00
Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00
Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00
Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00
Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00
Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00
Vertex re-running, vertexName=Reducer 2, vertexId=vertex_1496156002153_0008_4_01
Vertex re-running, vertexName=Reducer 2, vertexId=vertex_1496156002153_0008_4_01
Vertex re-running, vertexName=Reducer 2, vertexId=vertex_1496156002153_0008_4_01
Vertex failed, vertexName=Reducer 2, vertexId=vertex_1496156002153_0008_4_01, diagnostics=[Vertex vertex_1496156002153_0008_4_01 [Reducer 2] killed/failed due to:OWN_TASK_FAILURE, Vertex vertex_1496156002153_0008_4_01 [Reducer 2] failed as task task_1496156002153_0008_4_01_000000 failed after vertex succeeded.]
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00Vertex re-running, vertexName=Map 1, vertexId=vertex_1496156002153_0008_4_00Vertex re-running, vertexName=Reducer 2, vertexId=vertex_1496156002153_0008_4_01Vertex re-running, vertexName=Reducer 2, vertexId=vertex_1496156002153_0008_4_01Vertex re-running, vertexName=Reducer 2, vertexId=vertex_1496156002153_0008_4_01Vertex failed, vertexName=Reducer 2, vertexId=vertex_1496156002153_0008_4_01, diagnostics=[Vertex vertex_1496156002153_0008_4_01 [Reducer 2] killed/failed due to:OWN_TASK_FAILURE, Vertex vertex_1496156002153_0008_4_01 [Reducer 2] failed as task task_1496156002153_0008_4_01_000000 failed after vertex succeeded.]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0

avatar

And the solution appears to be we had some files

.../hadoop/yarn/local/usercache/hive/filecache

with odd permissions. Flushing that directory on the data nodes and the query runs.