Created 06-13-2017 02:38 PM
We are using Hive to load data to S3 (using s3a). We've started seeing the following error:
2017-06-13 08:51:49,042 ERROR [main]: exec.Task (SessionState.java:printError(962)) - Failed with exception Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$CopyObjectResultHandler). Response Code: 200, Response Text: OK
com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$CopyObjectResultHandler). Response Code: 200, Response Text: OK
	at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:738)
	at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:399)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
	at com.amazonaws.services.s3.AmazonS3Client.copyObject(AmazonS3Client.java:1507)
	at com.amazonaws.services.s3.transfer.internal.CopyCallable.copyInOneChunk(CopyCallable.java:143)
	at com.amazonaws.services.s3.transfer.internal.CopyCallable.call(CopyCallable.java:131)
	at com.amazonaws.services.s3.transfer.internal.CopyMonitor.copy(CopyMonitor.java:189)
	at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:134)
	at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:46)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: com.amazonaws.AmazonClientException: Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$CopyObjectResultHandler
	at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:150)
	at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseCopyObjectResponse(XmlResponsesSaxParser.java:417)
	at com.amazonaws.services.s3.model.transform.Unmarshallers$CopyObjectUnmarshaller.unmarshall(Unmarshallers.java:192)
	at com.amazonaws.services.s3.model.transform.Unmarshallers$CopyObjectUnmarshaller.unmarshall(Unmarshallers.java:189)
	at com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62)
	at com.amazonaws.services.s3.internal.ResponseHeaderHandlerChain.handle(ResponseHeaderHandlerChain.java:44)
	at com.amazonaws.services.s3.internal.ResponseHeaderHandlerChain.handle(ResponseHeaderHandlerChain.java:30)
	at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:712)
	... 13 more
Caused by: java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:170)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
	at sun.security.ssl.InputRecord.read(InputRecord.java:503)
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
	at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:930)
	at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
	at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166)
	at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90)
	at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:281)
	at org.apache.http.impl.io.ChunkedInputStream.getChunkSize(ChunkedInputStream.java:251)
	at org.apache.http.impl.io.ChunkedInputStream.nextChunk(ChunkedInputStream.java:209)
	at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:171)
	at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:138)
	at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
	at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
	at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
	at java.io.InputStreamReader.read(InputStreamReader.java:184)
	at java.io.BufferedReader.fill(BufferedReader.java:161)
	at java.io.BufferedReader.read1(BufferedReader.java:212)
	at java.io.BufferedReader.read(BufferedReader.java:286)
	at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
	at org.apache.xerces.impl.XMLEntityScanner.skipSpaces(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
	at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:141)
	... 20 more
Anyone else seen this before? Is it a data size/length issue? Loading too much data at once? Timeout?
Created 06-13-2017 03:35 PM
This seems to be random. Sometimes we see this error; if we run it again, it succeeds. Not sure why we're seeing it, though.
Here are the hive properties we're using:
set hive.execution.engine=mr;
set hive.default.fileformat=Orc;
set hive.exec.orc.default.compress=SNAPPY;
set fs.s3a.attempts.maximum=50;
set fs.s3a.connection.establish.timeout=30000;
set fs.s3a.connection.timeout=30000;
set fs.s3a.fast.upload=true;
set fs.s3a.fast.upload.buffer=disk;
set fs.s3n.multipart.uploads.enabled=true;
set fs.s3a.threads.keepalivetime=60;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
We're running HDP 2.4.2 (HDP-2.4.2.0-258).
Created 06-20-2017 02:10 PM
Here are the final Hive configs that seem to have fixed this issue. It appears to be related to timeouts.
set hive.execution.engine=mr;
set hive.default.fileformat=Orc;
set hive.exec.orc.default.compress=SNAPPY;
set hive.exec.copyfile.maxsize=1099511627776;
set hive.warehouse.subdir.inherit.perms=false;
set hive.metastore.pre.event.listeners=;
set hive.stats.fetch.partition.stats=false;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set fs.trash.interval=0;
set fs.s3.buffer.dir=/tmp/s3a;
set fs.s3a.attempts.maximum=50;
set fs.s3a.connection.establish.timeout=120000;
set fs.s3a.connection.timeout=120000;
set fs.s3a.fast.upload=true;
set fs.s3a.fast.upload.buffer=disk;
set fs.s3a.multiobjectdelete.enable=true;
set fs.s3a.max.total.tasks=2000;
set fs.s3a.threads.core=30;
set fs.s3a.threads.max=512;
set fs.s3a.connection.maximum=30;
set fs.s3a.fast.upload.active.blocks=12;
set fs.s3a.threads.keepalivetime=120;
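If it helps, these don't have to be re-typed every session. A minimal sketch of keeping them in an init file and applying it before the load script (the file names below are hypothetical; the values are just the ones from the list above):

-- s3a-tuning.hql (hypothetical init file holding the set statements above)
set fs.s3a.connection.establish.timeout=120000;
set fs.s3a.connection.timeout=120000;
set fs.s3a.fast.upload=true;
set fs.s3a.fast.upload.buffer=disk;
set fs.s3a.threads.max=512;
set fs.s3a.connection.maximum=30;
-- then run the load with the init file applied first, e.g.:
-- hive -i s3a-tuning.hql -f load_to_s3.hql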
Created 06-29-2017 07:45 PM
That error from AWS is suspected to be the S3 connection being broken, with the XML parser in the Amazon SDK hitting the end of the document and failing. I'm surprised you are seeing it frequently, though; it's generally pretty rare (i.e. rare enough that we don't have much detail on what is going on).
It might be that fs.s3a.connection.timeout is the parameter to tune, but the other possibility is that you have too many threads/tasks talking to S3 and either your network bandwidth is used up or AWS S3 is actually throttling you. Try smaller values of fs.s3a.threads.max (say 64 or fewer) and of fs.s3a.max.total.tasks (try 128). That cuts down the number of threads that may write at a time, and leaves a smaller queue of blocks waiting to be written before whatever thread is actually generating the data gets blocked.
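For example, as a starting point using the values above (tune further to match your bandwidth):

set fs.s3a.threads.max=64;
set fs.s3a.max.total.tasks=128;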